Solving the environment require an average total reward of over 300 over 100 consecutive episodes. Training of BipedalWalker is considered as difficult task, in particular, it is very difficult to train BipedalWalker by DDPG and PPO (with one agent). We solve the environment by usage of the SAC algorithm, see the basic paper SAC: Off-Policy Maximum Entropy Deep RL with a Stochastic Actor. For another solution (based on the single agent) see BipedalWalker-TD3.
Agent uses the following hyperparameters:
gamma=0.99 # discount
mini_batch=256 # optimizer and backward mechisms work after sampling BATCH elements
lr = 0.0001 # learning rate
eps=0.2 # the clipping parameter using for calculation of the action loss
A central feature of SAC is entropy regularization.
The major difference with common RL algorithms is training to maximize a trade-off between
expected return and entropy, a measure of randomness in the policy. This has a close connection
to the exploration-exploitation trade-off: increasing entropy results in more exploration,
which can accelerate learning later on. It can also prevent the policy from prematurely
converging to a bad local optimum.
Soft Actor Critic isn’t a direct successor to TD3 (having been published roughly concurrently),
but it incorporates the clipped double-Q trick:
qf1, qf2 = self.critic(state_batch, action_batch)
qf1_loss = F.mse_loss(qf1, next_q_value)
qf2_loss = F.mse_loss(qf2, next_q_value)
pi, log_pi, _ = self.policy.sample(state_batch)
qf1_pi, qf2_pi = self.critic(state_batch, pi)
min_qf_pi = torch.min(qf1_pi, qf2_pi)
policy_loss = ((self.alpha * log_pi) - min_qf_pi).mean()
Two Q-functions are used to mitigate the positive bias in the policy improvement step.
SAC is an off-policy algorithm. In other words, the SAC algorithm allows reusing the already collected data. In the agent.update_parameters we get the batch of (state, action, reward, next_state, mask) of the length = batch_size:
memory = ReplayMemory(replay_size)
....
# Sample a batch from memory, _batch_size_ = 256
state_batch, action_batch, reward_batch, next_state_batch, mask_batch = memory.sample(batch_size=batch_size)
state_batch = torch.FloatTensor(state_batch).to(self.device)
next_state_batch = torch.FloatTensor(next_state_batch).to(self.device)
action_batch = torch.FloatTensor(action_batch).to(self.device)
reward_batch = torch.FloatTensor(reward_batch).to(self.device).unsqueeze(1)
mask_batch = torch.FloatTensor(mask_batch).to(self.device).unsqueeze(1)
We compute the average of policy_loss and alpha_loss (entropy factor) over all elements of the
batch and perform the backward propogation of these averages:
self.policy_optim.zero_grad()
policy_loss.backward()
self.policy_optim.step()
...
self.alpha_optim.zero_grad()
alpha_loss.backward()
self.alpha_optim.step()
See video Four BipedalWalker Gaits demonsrating 4 different BipedalWalker-walks related with 4 different sets of SAC-hyperparameters.
We train the agent to understand that it can use information from its surroundings to inform the next best action. The score 300.5 was achieved
- in the episode 408 after training 7 hours 29 minutes.
lr = 0.00008.
- in the episode 540 after training 4 hours 41 minutes.
lr = 0.0005
- in the episode 756 after training 13 hours 5 minutes.
lr = 0.0001
...
Ep.: 397, Total Steps: 433858, Ep.Steps: 921, Score: 305.46, Avg.Score: 299.45, Time: 07:19:50
Ep.: 398, Total Steps: 434802, Ep.Steps: 944, Score: 304.52, Avg.Score: 299.57, Time: 07:20:46
Ep.: 399, Total Steps: 435723, Ep.Steps: 921, Score: 305.65, Avg.Score: 299.69, Time: 07:21:40
Ep.: 400, Total Steps: 436684, Ep.Steps: 961, Score: 303.97, Avg.Score: 299.78, Time: 07:22:38
Ep.: 401, Total Steps: 437608, Ep.Steps: 924, Score: 306.29, Avg.Score: 299.90, Time: 07:23:33
Ep.: 402, Total Steps: 438541, Ep.Steps: 933, Score: 304.28, Avg.Score: 299.99, Time: 07:24:28
Ep.: 403, Total Steps: 439491, Ep.Steps: 950, Score: 304.15, Avg.Score: 300.09, Time: 07:25:24
Ep.: 404, Total Steps: 440417, Ep.Steps: 926, Score: 305.15, Avg.Score: 300.21, Time: 07:26:19
Ep.: 405, Total Steps: 441320, Ep.Steps: 903, Score: 305.71, Avg.Score: 300.27, Time: 07:27:13
Ep.: 406, Total Steps: 442235, Ep.Steps: 915, Score: 306.88, Avg.Score: 300.44, Time: 07:28:07
Ep.: 407, Total Steps: 443146, Ep.Steps: 911, Score: 305.48, Avg.Score: 300.56, Time: 07:29:01
Solved environment with Avg Score: 300.56121912236614
Full log is available in the jupyter notebook file.
Based on Pranjal Tandon's code (https://github.com/pranz24).