Tiny math font fix.
Hvass-Labs committed Jun 12, 2019
1 parent 7065009 commit 0495ab1
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions 16_Reinforcement_Learning.ipynb
@@ -123,11 +123,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The Q-values for the possible actions have been estimated by a Neural Network. For the action NOOP in state *t* the Q-value is estimated to be 2.900, which is the highest Q-value for that state so the agent takes that action, i.e. the agent does not do anything between state *t* and *t+1* because NOOP means \"No Operation\".\n",
"The Q-values for the possible actions have been estimated by a Neural Network. For the action NOOP in state $t$ the Q-value is estimated to be 2.900, which is the highest Q-value for that state so the agent takes that action, i.e. the agent does not do anything between state $t$ and $t+1$ because NOOP means \"No Operation\".\n",
"\n",
"In state *t+1* the agent scores 4 points, but this is limited to 1 point in this implementation so as to stabilize the training. The maximum Q-value for state *t+1* is 1.830 for the action RIGHTFIRE. So if we select that action and continue to select the actions proposed by the Q-values estimated by the Neural Network, then the discounted sum of all the future rewards is expected to be 1.830.\n",
"In state $t+1$ the agent scores 4 points, but this is limited to 1 point in this implementation so as to stabilize the training. The maximum Q-value for state $t+1$ is 1.830 for the action RIGHTFIRE. So if we select that action and continue to select the actions proposed by the Q-values estimated by the Neural Network, then the discounted sum of all the future rewards is expected to be 1.830.\n",
"\n",
"Now that we know the reward of taking the NOOP action from state *t* to *t+1*, we can update the Q-value to incorporate this new information. This uses the formula above:\n",
"Now that we know the reward of taking the NOOP action from state $t$ to $t+1$, we can update the Q-value to incorporate this new information. This uses the formula above:\n",
"\n",
"$$\n",
" Q(state_{t},NOOP) \\leftarrow \\underbrace{r_{t}}_{\\rm reward} + \\underbrace{\\gamma}_{\\rm discount} \\cdot \\underbrace{\\max_{a}Q(state_{t+1}, a)}_{\\rm estimate~of~future~rewards} = 1.0 + 0.97 \\cdot 1.830 \\simeq 2.775\n",
@@ -4456,7 +4456,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
"version": "3.6.8"
}
},
"nbformat": 4,
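For reference, the update described in the changed markdown cell can be reproduced with a few lines of Python. This is only an illustrative sketch using the numbers quoted in the cell (reward clipped to 1.0, discount 0.97, maximum future Q-value 1.830); the variable names are not taken from the notebook itself.

```python
# Illustrative sketch of the Q-value update described in the markdown cell above.
# Values come from the cell's worked example; names are hypothetical, not the
# notebook's actual implementation.

discount = 0.97          # discount factor gamma
reward = 1.0             # reward for the NOOP action from state t to t+1, clipped to 1.0
max_future_q = 1.830     # max_a Q(state_{t+1}, a) estimated by the Neural Network

# Q(state_t, NOOP) <- r_t + gamma * max_a Q(state_{t+1}, a)
updated_q = reward + discount * max_future_q
print(round(updated_q, 3))  # 2.775
```

Running the sketch prints 2.775, matching the rounded value in the formula.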
