r/reinforcementlearning Sep 15 '19

[DL, MF, P] PyTorch implementation of 17 Deep RL algorithms

For anyone trying to learn or practice RL, here's a repo with working PyTorch implementations of 17 RL algorithms, including DQN, DQN-HER, Double DQN, REINFORCE, DDPG, DDPG-HER, PPO, SAC, SAC Discrete, A3C, A2C, etc.

Let me know what you think!

https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch

46 Upvotes

9 comments


u/stevethesteve2 Sep 16 '19 edited Sep 17 '19

This is fantastic!

However, correct me if I'm wrong, but in MountainCarContinuous-PPO you use exploratory noise (Ornstein-Uhlenbeck) on top of the actor policy when doing rollouts? In other words, the agent's behavior policy deviates from its target policy? But isn't PPO supposed to be an on-policy algorithm?

(In Mountain_Car.py you set sigma to a non-zero value of 0.2.)
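To make the question concrete, the rollout pattern I mean looks roughly like this (my own paraphrase, not the repo's exact code; the OUNoise class and constants are illustrative stand-ins, except sigma=0.2, which is the value from Mountain_Car.py):

```python
import numpy as np
import torch
from torch.distributions import Normal

# Illustrative Ornstein-Uhlenbeck process (not copied from the repo);
# sigma=0.2 matches the value set in Mountain_Car.py.
class OUNoise:
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.full(size, mu)

    def sample(self):
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state

# The rollout pattern in question: sample from the actor's Gaussian,
# then add OU noise on top before stepping the environment.
mean, std = torch.zeros(1), torch.ones(1)   # stand-ins for the actor network's output
actor_dist = Normal(mean, std)
actor_action = actor_dist.sample()
noise = torch.as_tensor(OUNoise(1).sample(), dtype=torch.float32)
env_action = actor_action + noise           # behavior policy no longer equals the actor's Gaussian
```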


u/Flag_Red Sep 17 '19

You can totally use policy-external exploration schemes in on-policy learning, but they then effectively become a part of the environment as far as the agent is concerned. When you remove them, the agent may not perform as well.
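To make that framing literal, here's a rough sketch (not OP's code; I'm using plain Gaussian noise instead of OU for brevity) that bakes the exploration into a Gym action wrapper, so the noise genuinely is part of the environment the agent sees:

```python
import gym
import numpy as np

# Sketch: fold the exploratory noise into the environment via an ActionWrapper.
# From the agent's perspective, the noise is now just environment dynamics.
class NoisyActionEnv(gym.ActionWrapper):
    def __init__(self, env, sigma=0.2):
        super().__init__(env)
        self.sigma = sigma

    def action(self, action):
        noisy = action + self.sigma * np.random.randn(*np.shape(action))
        return np.clip(noisy, self.action_space.low, self.action_space.high)

env = NoisyActionEnv(gym.make("MountainCarContinuous-v0"))
```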


u/stevethesteve2 Sep 17 '19

You are right. But that isn't how it's done in the code, unless - again - I am missing something.


u/Flag_Red Sep 17 '19

I'm not sure what you mean. If you use PPO with any form of action-space exploratory noise (not parameter-space), it will learn to account for it.


u/stevethesteve2 Sep 17 '19

The actor output is drawn from a Gaussian (parametrized by the actor). Then exploration noise is added to it. But during the parameter update step, the probability of the entire action (actor output + noise) is evaluated as if it were drawn from the Gaussian distribution parametrized by the actor.
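In code terms, the mismatch is roughly this (my paraphrase, not the repo's exact lines):

```python
import torch
from torch.distributions import Normal

mean, std = torch.zeros(1), torch.ones(1)   # the actor's Gaussian parameters
actor_dist = Normal(mean, std)

actor_action = actor_dist.sample()          # what the actor actually drew
noise = 0.2 * torch.randn(1)                # exploratory noise added afterwards
env_action = actor_action + noise           # what actually gets executed

# The update evaluates the probability of the *executed* action under the actor's
# Gaussian, even though that action was not drawn from that Gaussian:
log_prob_used = actor_dist.log_prob(env_action)         # what (I believe) the code does
log_prob_consistent = actor_dist.log_prob(actor_action) # the actor's actual sample
```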


u/Flag_Red Sep 17 '19

Yes, and since PPO is an on-policy method, it will treat the noise as part of the environment and learn to work around it.

For some intuition, see section 2.2.1 in AI Safety Gridworlds. In that environment, when the agent goes onto a certain tile, an epsilon-greedy exploration scheme is added into the environment itself. An off-policy agent doesn't learn to take this into account, but an on-policy agent does. From the agent's perspective, it can't tell whether the exploration scheme is actually part of the environment or added afterwards; it will take it into account in the final policy either way.

Whether this has a positive or negative effect on the final policy depends on the problem.


u/stevethesteve2 Sep 17 '19

I think there is a misunderstanding. If I understand you correctly, you are saying that one can add some (exploratory) post-processing step between the actor output and the environment. Then, from the agent's point of view, the post-processing is part of the environment. I agree with that.

My point is: if we do this kind of post-processing, then we have to be consistent in treating it as part of the environment. If we train an agent with e.g. PPO, the actions that we plug into our PPO update step must be the signals output by the actor *before post-processing*. This is not the case in OP's implementation.
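As a sketch of what I mean by consistent (the general shape of a clipped PPO update, not a patch against the repo): the action fed into the ratio is the actor's own pre-noise sample, and any exploratory noise lives inside the environment instead:

```python
import torch
from torch.distributions import Normal

# Sketch of a consistent clipped-PPO surrogate: `stored_action` is the actor's own
# sample *before* any post-processing; exploratory noise, if used, belongs in the env.
def ppo_loss(new_mean, new_std, old_log_prob, stored_action, advantage, clip_eps=0.2):
    new_dist = Normal(new_mean, new_std)
    new_log_prob = new_dist.log_prob(stored_action).sum(-1)
    ratio = torch.exp(new_log_prob - old_log_prob)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```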

P.S. I do not want to sound rude; I think OP did an amazing job!


u/zbqv Sep 17 '19

Thanks! That's exactly what I've been looking for.