# AI for Cyber-Physical Systems

## Cheatsheet Content
### Reinforcement Learning

#### The RL Framework

**Reinforcement Learning (RL)** is learning what to do (mapping states to actions) so as to maximize a cumulative numerical reward by trial and error.

#### Markov Decision Process (MDP)

- **Markov Property:** The future depends only on the current state, not on the history: $P(S_{t+1} \mid S_t, S_{t-1}, \dots) = P(S_{t+1} \mid S_t)$.
- An MDP is defined by: (1) decision epochs, (2) state set $S$, (3) action set $A$, (4) transition probabilities $P[s' \mid s, a]$, (5) rewards $r(s,a)$.
- **Observation vs. State:** State = complete description of the environment; Observation = partial description.
- **Action space:** Discrete or continuous.

#### Reward and Return

- **Cumulative (undiscounted) return:** $G_t = R_{t+1} + R_{t+2} + \dots$ (may diverge!)
- **Discounted return:** $G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$, where $\gamma \in (0,1)$ is the **discount factor**.
- $\gamma \approx 0 \implies$ short-sighted; $\gamma \approx 1 \implies$ far-sighted.
- **Episodic tasks:** Have a terminal state.
- **Continuing tasks:** Run forever.

#### Policy

A **policy** $\pi: S \to A$ is a rule telling the agent which action to take in each state.

- **Deterministic:** $a = \pi(s)$.
- **Stochastic:** $\pi(a \mid s) = P[A_t = a \mid S_t = s]$.
- **Exploration vs. Exploitation:**
  - **Exploitation:** Choose the best known action.
  - **Exploration:** Try new actions.
  - **$\epsilon$-greedy:** Explore with probability $\epsilon$, exploit with probability $1-\epsilon$.

#### Value Functions: V(s) vs Q(s,a)

- **State Value Function $V(s)$:** "How good is it to be in this state?"
  - $V^\pi(s) = \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t R_{t+1} \mid S_0 = s \right]$: expected cumulative reward starting from state $s$ and following policy $\pi$.
  - Tells you the long-term worth of a state.
- **State-Action Value (Q-Function) $Q(s,a)$:** "How good is it to take action $a$ in state $s$?"
  - $Q^\pi(s, a) = \mathbb{E}_\pi \left[ \sum_{k=0}^\infty \gamma^k R_{t+k+1} \mid S_t = s, A_t = a \right]$: expected cumulative reward starting from state $s$, taking action $a$, then following policy $\pi$.
  - Tells you the long-term worth of a specific action in a state.

##### What Gets Chosen?

- **Action Selection:** Actions are chosen based on $Q(s,a)$, not $V(s)$.
  - Optimal policy: $\pi^*(s) = \arg\max_a Q^*(s,a)$.
- **Why not $V(s)$?** $V(s)$ only tells you how good a state is; it does not directly tell you which action leads there. To select actions from $V(s)$, you need the transition model $P(s'|s,a)$:
  $$\pi^*(s) = \arg\max_a \sum_{s'} P(s'|s,a)[R + \gamma V(s')]$$
- **Relationship:** For optimal values, $V^*(s) = \max_a Q^*(s,a)$; for a general policy, $V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s,a)$.

| Feature | $V(s)$ (State Value) | $Q(s,a)$ (Action Value) |
| :--- | :--- | :--- |
| **Input** | State $s$ | State $s$ + action $a$ |
| **Primary use** | Evaluate states and policies (model-based) | Select actions (model-free), policy evaluation |
| **Needs model for action selection?** | Yes (to choose an action) | No ($\arg\max_a Q$ directly) |
| **Example algorithms** | Value Iteration, TD(0), Monte Carlo (for V) | Q-learning, SARSA, DQN, Monte Carlo (for Q) |

**Bottom line:** $Q(s,a)$ is used directly for action selection; $V(s)$ is used for policy evaluation or when a model is available.

#### Bellman Equation

For any policy $\pi$ and MDP with discount factor $\gamma \in (0,1)$, the value function satisfies:
$$V^\pi(s) = \mathbb{E}\left[R(s, \pi(s)) + \gamma V^\pi(S')\right], \quad S' \sim P(\cdot \mid s, \pi(s))$$
In words: the value of a state equals the immediate reward plus the discounted value of whatever state we land in next.

#### Dynamic Programming (DP)

DP solves complex optimization problems by breaking them into simpler sub-problems and solving them recursively from the end backwards.
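The "solve from the end backwards" idea can be sketched on a tiny toy problem. All dynamics, costs, and numbers below are invented purely for illustration; the point is the shape of the backward recursion $J_k(x) = \min_u [\,g(x,u) + J_{k+1}(f(x,u))\,]$:

```python
# Minimal finite-horizon DP sketch on a made-up 3-state problem.
N = 3                      # horizon (stages k = 0, 1, 2)
states = [0, 1, 2]         # toy state space
actions = [-1, 0, 1]       # toy action space

def f(x, u):
    """Deterministic dynamics: move by u, clamped to the state space."""
    return min(max(x + u, 0), 2)

def g(x, u):
    """Stage cost: penalize distance from 0 and control effort."""
    return x**2 + u**2

def g_N(x):
    """Terminal cost."""
    return 10 * x**2

# Terminal condition: J_N(x) = g_N(x)
J = {x: g_N(x) for x in states}
policy = []

# Backward recursion over stages k = N-1, ..., 0
for k in reversed(range(N)):
    J_new, mu = {}, {}
    for x in states:
        best_u = min(actions, key=lambda u: g(x, u) + J[f(x, u)])
        mu[x] = best_u
        J_new[x] = g(x, best_u) + J[f(x, best_u)]
    J, policy = J_new, [mu] + policy

print(J[2])  # optimal cost-to-go from x = 2 at stage 0: 7
```

Note that `policy` is a *feedback* policy: one function $\mu_k(x)$ per stage, not a fixed input sequence, which is exactly why DP-derived controllers adapt to disturbances.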
- **Principle of Optimality (Bellman):** The tail of an optimal policy is itself optimal for the corresponding tail sub-problem.
- **DP Algorithm (Finite Horizon):**
  - **Terminal:** $J_N(x_N) = g_N(x_N)$
  - **Recursion:** $J_k(x_k) = \min_{u_k \in U_k(x_k)} \mathbb{E}_{w_k} \left[g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k))\right]$
- **Open-loop vs. Feedback:** Open-loop $u_k$ is a fixed sequence; feedback $\mu_k(x_k)$ is a function of the current state. Feedback policies adapt to disturbances.

#### Linear Quadratic Regulator (LQR)

- **Dynamics:** $x_{t+1} = Ax_t + Bu_t$ (linear).
- **Cost:** $\sum_{i=0}^{N-1} (x_i^T Q x_i + u_i^T R u_i) + x_N^T Q_N x_N$ (quadratic).
- **Goal:** Regulate the state to the origin ($x_t \to 0$).
- **Solution via DP:** The optimal control is linear state feedback, $u_t = -K_t x_t$, with gain $K_t = (B^T P_{t+1} B + R)^{-1} B^T P_{t+1} A$, where $P_t$ is the cost-to-go matrix (written here as $P_t$ to avoid clashing with the state penalty $Q$).
- **Riccati recursion:** $P_t = A^T P_{t+1} A + Q - A^T P_{t+1} B (B^T P_{t+1} B + R)^{-1} B^T P_{t+1} A$, with $P_N = Q_N$.
  - This recursion balances the state penalty ($Q$), the control-effort penalty ($R$), and the propagation of future cost ($A^T P_{t+1} A$).
- **Infinite horizon (limiting case):** If the Riccati recursion converges to some $P \ge 0$, it satisfies the Discrete Algebraic Riccati Equation (DARE): $P = A^T P A + Q - A^T P B (B^T P B + R)^{-1} B^T P A$.

#### Value-Based RL Methods

##### Stochastic Approximation

The theoretical foundation for TD learning and Q-learning updates: an iterative average $\mu_{m+1} = (1-\alpha_m)\mu_m + \alpha_m x_m$ converges to $\mathbb{E}[X]$ (under standard step-size conditions).

##### Learning V(s)

1. **Dynamic Programming (Value Iteration):** Needs the full model.
   $$V(s) \leftarrow \max_a \sum_{s'} P(s'|s,a)[R + \gamma V(s')]$$
   - *When to use:* Known full model, offline planning.
2. **TD Learning (TD(0)):** Model-free; learns from experience, updating at every step using an estimate of the next state's value (bootstrapping).
   $$V(s) \leftarrow V(s) + \alpha[r + \gamma V(s') - V(s)]$$
   - *When to use:* No model, online, single-step updates.
3. **Monte Carlo:** Model-free; learns from complete episodes.
   $$V(s) \leftarrow V(s) + \alpha[G_t - V(s)]$$
   - *When to use:* No model, episodic tasks only; must wait for the episode to finish.
4. **TD($\lambda$) / Eligibility Traces:** Blends TD and MC.
5. **Function Approximation:** Uses a neural network to approximate $V(s)$ directly (e.g., in actor-critic methods).

##### Learning Q(s,a)

1. **Q-Learning (Off-policy TD):** Learns the optimal $Q^*$ directly.
   - **Update:** $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left(R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\right)$.
   - **Off-policy:** The agent acts $\epsilon$-greedily, but the update uses the greedy target $\max_{a'} Q$.
   - *When to use:* The most common model-free method.
2. **SARSA (On-policy TD):**
   - **Update:** $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right)$.
   - **On-policy:** Uses the action $A_{t+1}$ actually chosen next by the current policy.
   - *When to use:* Safer in environments with penalties (e.g., cliffs), since it accounts for its own exploration.
3. **Monte Carlo:**
   - **Update:** $Q(s,a) \leftarrow Q(s,a) + \alpha[G_t - Q(s,a)]$.
   - *When to use:* No model, episodic tasks only.
4. **Deep Q-Network (DQN):** Replaces the Q-table with a neural network.
   - **Loss:** $L(\theta) = \mathbb{E} \left[ \left( R_{t+1} + \gamma \max_{a'} Q_{\theta^-}(S_{t+1}, a') - Q_\theta(S_t, A_t) \right)^2 \right]$.
   - **Tricks:** Experience replay buffer (breaks correlations, data-efficient) and a target network $\theta^-$ (stabilizes training).
   - *When to use:* Large or continuous state spaces (e.g., Atari games).
5. **Double DQN, Dueling DQN:** Improvements over DQN.

##### Key Differences: TD vs. Monte Carlo

| Feature | Temporal Difference (TD) | Monte Carlo (MC) |
| :--- | :--- | :--- |
| **Updates** | Every step (online) | End of episode (offline) |
| **Needs full episode?** | No | Yes |
| **Bootstraps?** | Yes (uses estimated $V(s')$ or $Q(s',a')$) | No (uses actual return $G_t$) |
| **Bias/Variance** | Higher bias, lower variance | Lower bias, higher variance |
| **Continuing tasks?** | Yes | No (requires a terminal state) |
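The Q-learning update and $\epsilon$-greedy exploration described above can be combined into a minimal tabular sketch. The environment here (a 5-state corridor where the rightmost state is terminal with reward +1) is invented for illustration:

```python
# Minimal tabular Q-learning with epsilon-greedy exploration on a toy corridor.
import random

N_STATES = 5
ACTIONS = [+1, -1]                # step right / step left
alpha, gamma, eps = 0.5, 0.9, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Environment: deterministic move; reward +1 on reaching the goal state."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

def eps_greedy(s):
    """Explore with probability eps, otherwise exploit argmax_a Q(s, a)."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

random.seed(0)
for _ in range(500):              # episodes
    s, done = 0, False
    while not done:
        a = eps_greedy(s)
        s2, r, done = step(s, a)
        # Q-learning (off-policy): bootstrap with the greedy target max_a' Q(s', a')
        target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in ACTIONS))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy)  # the learned greedy policy should move right in every state
```

Swapping the greedy target $\max_{a'} Q(S_{t+1}, a')$ for $Q(S_{t+1}, A_{t+1})$, with $A_{t+1}$ drawn from the same $\epsilon$-greedy policy, turns this into SARSA, the on-policy variant.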