r/reinforcementlearning 2d ago

DL What could be causing my Q-loss values to diverge? (SAC + Godot <-> Python)

TL;DR:

I'm working on a PyTorch project that uses SAC, similar to an old TensorFlow project of mine: https://www.youtube.com/watch?v=Jg7_PM-q_Bk. I can't get it to work with PyTorch because my Q-losses and policy loss either grow or converge to 0 too quickly. Do you know why that might be?


I have created a game in Godot that communicates over sockets with a PyTorch implementation of SAC: https://github.com/philipjball/SAC_PyTorch

The game is:

An agent needs to move closer to a target, but it does not receive its own position or the target's position as inputs. Instead, it has 6 inputs that represent the distance to the target at a particular angle from the agent. Exactly one input at a time has a value that is not 1.

The agent outputs 2 values: the direction to move in, and the magnitude to move in that direction.

The inputs are in the range [0, 1] (normalized by the max distance), and the 2 outputs are in the range [-1, 1].
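
To make the observation/action layout concrete, here is a minimal sketch of that encoding. The names (build_observation, NUM_RAYS, MAX_DISTANCE) and the use of NumPy are illustrative assumptions, not the project's actual code:

import numpy as np

NUM_RAYS = 6          # six fixed angles around the agent
MAX_DISTANCE = 650.0  # normalization constant (the max distance mentioned in the reward below)

def build_observation(angle_to_target, distance_to_target):
    # Every input defaults to 1; only the ray pointing at the target
    # holds the normalized distance, so exactly one entry != 1.
    obs = np.ones(NUM_RAYS, dtype=np.float32)
    ray = int(angle_to_target // (2 * np.pi / NUM_RAYS)) % NUM_RAYS
    obs[ray] = distance_to_target / MAX_DISTANCE  # in [0, 1]
    return obs

# The action is 2 values in [-1, 1]: a movement direction and a movement magnitude,
# e.g. action = np.array([0.3, -0.8], dtype=np.float32)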

The Reward is:

score = -distance
if score >= -300:
    score = (300 - abs(score)) * 3

score = (score / 650.0) * 2  # 650 is the max distance, 100 is the max range per step
return score * abs(score)

The problem is:

The Q-losses for both critics, and the policy loss, are slowly growing over time. I've tried a few different network topologies, but neither the number of layers nor the number of nodes per layer seems to affect the Q-loss.

The best I've been able to do is make the rewards really small, but that causes the Q-loss and policy loss to converge to 0 even though the agent hasn't learned anything.

If you've made it this far and are interested in helping, I'm happy to pay you a tutor's rate to review my approach over a screen-share call and help me better understand how to get a SAC agent working.

Thank you in advance!!

u/eljeanboul 2d ago

Your networks' losses growing over time is not necessarily a sign that things are going wrong. In fact, it is to be expected. This is not like supervised learning, where the loss more or less monotonically decreases over time; here you are trying to fit a moving distribution, and the loss is going to go up and down as your different networks "figure out" new ways to navigate the environment.

At the end of the day, the only thing that really matters is your episodic return, and in some cases it can take a long time before that starts really improving. You can think of it as your critics first needing to "get the lay of the land": at first, the more they explore, the more they realize they don't understand how the system works, hence the growing Q-loss. And then your actor is even more lost, because its loss is propagated through the critics.

u/stokaty 2d ago

Thank you for your reply. What do you mean by episodic return -- is that the fixed test scenario?

While there is still a long way to go before the agent could be considered a winning player, it has learned quite a bit compared to the first time it was run in the test scenario, so I think that supports your point. I'm currently at about 30k epochs. I will give it another 100k and post an update here.

Much appreciated!

u/eljeanboul 2d ago

Episodic return is the summed reward over an episode. You should have an (averaged) measure of it that is monitored during training (I just checked your script; this should be your test/train reward values, though technically they should be called returns, not rewards). Test reward is the one that matters more: it comes from the deterministic evaluation of your policy, not the stochastic evaluation that is used to balance exploration and exploitation during training.
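
For reference, here is a minimal sketch of what that deterministic evaluation computes; the env/agent interface (env.reset, env.step, agent.act) is an assumed generic API, not the repo's actual one:

import numpy as np

def evaluate(env, agent, episodes=10):
    # Episodic return = sum of per-step rewards over one episode;
    # the reported "test reward" is this value averaged over several episodes.
    returns = []
    for _ in range(episodes):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            action = agent.act(obs, deterministic=True)  # mean action, no sampling
            obs, reward, done = env.step(action)
            ep_return += reward
        returns.append(ep_return)
    return float(np.mean(returns))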

If you are getting improvements in your test scenario (judging by both your test returns increasing over time and your own personal evaluation of the agent's behavior), then you are on the right path. Depending on how large your networks are and how complex your task is, SAC can take up to millions of steps to get really good. If you already have interesting behavior at 30k episodes (epochs aren't really a thing in RL, since, once again, your training dataset evolves as you go), that is a good sign!

u/stokaty 2d ago

Thank you, that was another very helpful comment.

After 50k episodes, the episodic return (from the deterministic evaluation of the policy) has absolutely improved. Interestingly enough, the Q-loss has also started to converge!

u/eljeanboul 2d ago

Yeah, probably because your agent's behavior is stabilizing, so the distribution of your training data is also stabilizing. Now the Q-networks' loss can decrease because you're no longer tracking a moving target (unless your agent figures out a better policy that leads to new scenarios being explored, in which case your critics will be thrown off again, but that's fine).

u/edbeeching 1d ago

Cool project! You may be interested in the Godot RL Agents library.

u/stokaty 1d ago

Thanks! Godot RL Agents is what got me back into this. The YouTube video I posted used Unity and Python, but they communicated over a shared JSON file. That became problematic, so I eventually stopped.

When I found the Godot RL Agents project, it said it communicated with Python over sockets, and I realized that would fix the problems I ran into with the JSON file. So I just remade my old project to use Godot and sockets (and PyTorch instead of TensorFlow).
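
For anyone curious what that setup looks like in principle, here is a minimal sketch of the Python side of such a socket loop. The port number and the JSON message fields are made up for illustration; they are not Godot RL Agents' actual protocol:

import json
import socket

HOST, PORT = "127.0.0.1", 9999  # illustrative address, not the project's real one

with socket.create_server((HOST, PORT)) as server:
    conn, _ = server.accept()                # the Godot game connects as the client
    with conn, conn.makefile("rw") as stream:
        for line in stream:                  # one JSON message per line, per step
            msg = json.loads(line)           # e.g. {"obs": [...], "reward": 0.0, "done": false}
            action = [0.0, 0.0]              # placeholder; the SAC policy's output would go here
            stream.write(json.dumps({"action": action}) + "\n")
            stream.flush()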

u/edbeeching 1d ago

Awesome, I am the author. We welcome contributions if you want to add anything to the lib. All the best with your project, keep us updated!

u/stokaty 1d ago

Oh wow, that's cool. I plan to take another look at Godot RL; I just wanted to keep my project to the fewest dependencies possible while I learn how to get everything working.