Tiny math font fix.
Hvass-Labs committed Jun 12, 2019
1 parent 7065009 commit 0495ab1
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions 16_Reinforcement_Learning.ipynb
@@ -123,11 +123,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The Q-values for the possible actions have been estimated by a Neural Network. For the action NOOP in state *t* the Q-value is estimated to be 2.900, which is the highest Q-value for that state so the agent takes that action, i.e. the agent does not do anything between state *t* and *t+1* because NOOP means \"No Operation\".\n",
"The Q-values for the possible actions have been estimated by a Neural Network. For the action NOOP in state $t$ the Q-value is estimated to be 2.900, which is the highest Q-value for that state so the agent takes that action, i.e. the agent does not do anything between state $t$ and $t+1$ because NOOP means \"No Operation\".\n",
"\n",
"In state *t+1* the agent scores 4 points, but this is limited to 1 point in this implementation so as to stabilize the training. The maximum Q-value for state *t+1* is 1.830 for the action RIGHTFIRE. So if we select that action and continue to select the actions proposed by the Q-values estimated by the Neural Network, then the discounted sum of all the future rewards is expected to be 1.830.\n",
"In state $t+1$ the agent scores 4 points, but this is limited to 1 point in this implementation so as to stabilize the training. The maximum Q-value for state $t+1$ is 1.830 for the action RIGHTFIRE. So if we select that action and continue to select the actions proposed by the Q-values estimated by the Neural Network, then the discounted sum of all the future rewards is expected to be 1.830.\n",
"\n",
"Now that we know the reward of taking the NOOP action from state *t* to *t+1*, we can update the Q-value to incorporate this new information. This uses the formula above:\n",
"Now that we know the reward of taking the NOOP action from state $t$ to $t+1$, we can update the Q-value to incorporate this new information. This uses the formula above:\n",
"\n",
"$$\n",
" Q(state_{t},NOOP) \\leftarrow \\underbrace{r_{t}}_{\\rm reward} + \\underbrace{\\gamma}_{\\rm discount} \\cdot \\underbrace{\\max_{a}Q(state_{t+1}, a)}_{\\rm estimate~of~future~rewards} = 1.0 + 0.97 \\cdot 1.830 \\simeq 2.775\n",
@@ -4456,7 +4456,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
"version": "3.6.8"
}
},
"nbformat": 4,
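For reference, the update described in the changed markdown cell can be reproduced with a few lines of Python. This is only an illustrative sketch using the numbers quoted in the cell (reward clipped to 1.0, discount 0.97, maximum future Q-value 1.830); the variable names are not taken from the notebook itself.

```python
# Illustrative sketch of the Q-value update described in the markdown cell above.
# Values come from the cell's worked example; names are hypothetical, not the
# notebook's actual implementation.

discount = 0.97          # discount factor gamma
reward = 1.0             # reward for the NOOP action from state t to t+1, clipped to 1.0
max_future_q = 1.830     # max_a Q(state_{t+1}, a) estimated by the Neural Network

# Q(state_t, NOOP) <- r_t + gamma * max_a Q(state_{t+1}, a)
updated_q = reward + discount * max_future_q
print(round(updated_q, 3))  # 2.775
```

Running the sketch prints 2.775, matching the rounded value in the formula.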
