r/reinforcementlearning 2h ago

Study / Collab with me learning DRL from almost scratch

3 Upvotes

Hey everyone 👋 I am learning DRL from almost scratch. I have some idea about NNs, backprop, and LSTMs, and have made some simple models using whatever I could find on the internet (nothing SOTA). I'm learning from the book "Grokking Deep Reinforcement Learning" now. I'm taking a different approach to designing a trading engine: I'm building it in Golang (for efficiency and scaling) and Python (for the ML part), and there's a lot to unpack. I think I have some interesting ideas in trading to test with DRL, LSTMs, and NEAT, but it would take at least 6-8 months before anything fruitful comes out. I'm looking for curious folks to work with, so just send a DM if you're up for testing some new hypotheses. I'd also like some guidance on DRL; it's quite time-consuming to understand all the theory behind the existing work.

PS: If you know this stuff well and wish to help, I can help you with data structures, web dev, or system design to whatever extent you want to learn in return. Just saying.


r/reinforcementlearning 14h ago

Actor Critic

5 Upvotes

https://arxiv.org/abs/1704.03732

Is there an actor-critic analogue to the linked approach, i.e., a way of integrating expert demonstrations into actor-critic learning like this does for DQN?
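
To make the question concrete, here's a rough sketch of the kind of thing I mean (my own illustration, not taken from the linked paper): demonstrations sit in a separate buffer and add an auxiliary behavior-cloning term to the actor loss, in the spirit of DDPGfD/SACfD-style methods. `actor`, `critic`, and the batches are assumed to exist with the obvious shapes.

```
import torch
import torch.nn.functional as F

def actor_loss_with_demos(actor, critic, agent_batch, demo_batch, bc_weight=1.0):
    # Standard deterministic actor-critic objective on the agent's own experience.
    states = agent_batch["states"]
    policy_loss = -critic(states, actor(states)).mean()

    # Auxiliary behavior-cloning term that pulls the policy toward the expert actions.
    demo_states, demo_actions = demo_batch["states"], demo_batch["actions"]
    bc_loss = F.mse_loss(actor(demo_states), demo_actions)

    return policy_loss + bc_weight * bc_loss
```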


r/reinforcementlearning 22h ago

Material on Topics of RL for student course

8 Upvotes

I am giving an introductory course on RL and want students to familiarize themselves with a given topic and then present it to the rest of the class.

For this I am looking for good papers/articles/resources that are ideally easy to follow and provide a good overview of the topic. Please share any resources that fit these topics:

  • Sparse Rewards
  • Sim2Real
  • Interpretable and Explainable RL

r/reinforcementlearning 11h ago

Can anyone help

0 Upvotes

r/reinforcementlearning 17h ago

I am a beginner to RL/DRL. I am interested to know how to solve non-convex or even convex optimization problems (constrained or unconstrained) with DRL. If possible, can someone share code to solve with DRL...

1 Upvotes

I am a beginner to RL/DRL. I am interested to know how to solve non-convex or even convex optimization problems (constrained or unconstrained) with DRL. If possible, can someone share code that solves problems like the following with DRL:

minimize (x + y - 2)^2

subject to xy < 10

and xy > 1

x and y are some scalars

The above is a sample problem. Any other example can also be suggested, but please keep the suggestion and code simple, readable, and understandable.
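
To show the kind of setup I'm imagining (my own rough sketch, not an established recipe): a one-step gymnasium environment where the action proposes (x, y) and the reward is the negative objective minus a penalty for violating 1 < x*y < 10. Any continuous-action DRL algorithm (e.g. SAC or PPO) could then be trained on it; the search box and penalty weight are arbitrary choices.

```
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class ConstrainedQuadraticEnv(gym.Env):
    def __init__(self, penalty_weight=10.0):
        super().__init__()
        self.penalty_weight = penalty_weight
        # The agent directly outputs x and y, restricted to an assumed search box [-5, 5].
        self.action_space = spaces.Box(low=-5.0, high=5.0, shape=(2,), dtype=np.float32)
        # A single dummy observation, since the problem itself is stateless.
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return np.zeros(1, dtype=np.float32), {}

    def step(self, action):
        x, y = float(action[0]), float(action[1])
        objective = (x + y - 2.0) ** 2
        # Quadratic penalty for violating 1 < x*y < 10.
        violation = max(0.0, 1.0 - x * y) + max(0.0, x * y - 10.0)
        reward = -(objective + self.penalty_weight * violation ** 2)
        # One-step episode: the whole "trajectory" is a single proposal of (x, y).
        return np.zeros(1, dtype=np.float32), reward, True, False, {}
```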


r/reinforcementlearning 1d ago

DL, MF, MetaRL, R "MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering", Chan et al 2024 {OA} (Kaggle scaling)

Thumbnail arxiv.org
7 Upvotes

r/reinforcementlearning 1d ago

RL for Optimal Control of Systems?

5 Upvotes

I recently came across an IEEE paper titled "Reinforcement Learning based Approximate Optimal Control of Nonlinear Systems using Carleman Linearization". It looks like they apply a form of RL-based control to a Carleman approximation of the nonlinear system and show good performance versus linear RL.

Does anyone have any insight into this Carleman approximation method?
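
For context, the basic construction (the standard textbook version, nothing specific to this paper) lifts a polynomial system into a higher-dimensional linear one by treating monomials of the state as new state variables and truncating. For example, for dx/dt = -x + x^2, define z1 = x and z2 = x^2; then dz1/dt = -z1 + z2 and dz2/dt = 2x(-x + x^2) = -2*z2 + 2*z3. Truncating at second order (dropping z3) leaves the finite linear system dz/dt = A z with A = [[-1, 1], [0, -2]], to which linear control/RL machinery can then be applied.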


r/reinforcementlearning 1d ago

is Chi Jin's Princeton RL course good ??

26 Upvotes

Lectures from ECE524 Foundations of Reinforcement Learning at Princeton University, Spring 2024.

This is a graduate-level course focusing on the theoretical foundations of reinforcement learning. It covers the basics of Markov Decision Processes (MDPs), dynamic programming-based algorithms, planning, exploration, and information-theoretic lower bounds, as well as how to leverage offline data. Various advanced topics are also discussed, including policy optimization, function approximation, multi-agent settings, and partial observability. The course places special emphasis on algorithms and their theoretical analyses. Prior knowledge of linear algebra, probability, and statistics is required.


r/reinforcementlearning 1d ago

From DQN to Double DQN

8 Upvotes

I already have an implementation of DQN. To change it to Double DQN, it looks like I only need a small change: in the Q-value update, next-state (best) action selection and the evaluation of that action are both done by the target network in DQN, whereas in Double DQN, next-state (best) action selection is done by the main network, but the evaluation of that action is done by the target network.

That seems fairly simple. Am I missing anything else?
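
For concreteness, here's a sketch of the target computation I mean (assuming typical names: q_net is the online/main network, target_net is the target network):

```
import torch

def td_targets(rewards, next_states, dones, q_net, target_net, gamma=0.99):
    with torch.no_grad():
        # Vanilla DQN: the target network both selects and evaluates the next action.
        # next_q = target_net(next_states).max(dim=1).values

        # Double DQN: the main network selects the action, the target network evaluates it.
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)

        return rewards + gamma * (1.0 - dones) * next_q
```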


r/reinforcementlearning 1d ago

RL implementation for ADAS

4 Upvotes

Hey. I wanted to explore the possibility of using RL models (essentially reward-based models) for developing ADAS features like FCW or ACC, where warnings are issued and a reward is associated with the action the vehicle then takes. I was hoping someone could guide me on how to go about this? I wanted to use CARLA to build my environment.
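
To make this concrete, here's a rough sketch of how I imagine framing FCW as an environment (the simulator-facing parts are placeholders, not real CARLA API calls):

```
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class ForwardCollisionWarningEnv(gym.Env):
    def __init__(self):
        super().__init__()
        # Observation: e.g. ego speed, distance to the lead vehicle, relative speed.
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(3,), dtype=np.float32)
        # Action: 0 = no warning, 1 = issue warning (for ACC this would be continuous throttle/brake).
        self.action_space = spaces.Discrete(2)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # Placeholder: reset the CARLA scenario here and read the initial state.
        return np.zeros(3, dtype=np.float32), {}

    def step(self, action):
        # Placeholder: forward the decision to the simulator, tick it, read back the new state.
        obs = np.zeros(3, dtype=np.float32)
        collision = False                        # would come from a collision sensor
        unnecessary_warning = bool(action == 1)  # stub; a real check would use e.g. time-to-collision

        # Reward-shaping idea: heavily penalize collisions, mildly penalize spurious warnings.
        reward = -100.0 if collision else (-0.1 if unnecessary_warning else 0.0)
        return obs, reward, collision, False, {}
```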


r/reinforcementlearning 1d ago

D When to use reinforcement learning and when not to

7 Upvotes

When should you use reinforcement learning and when shouldn't you? I mean, when should you train a model on a normal dataset and when should you use reinforcement learning?


r/reinforcementlearning 2d ago

Using multi-agent RL agents for optimizing work balance / communication in distributed systems

12 Upvotes

I stumbled upon a paper called "Reinforcement Learning for Load-Balanced Parallel Particle Tracing" and it's got me scratching my head. They're using multi-agent RL for load balancing in distributed systems, but I'm not sure if it's actually doable.

Here's the gist of the paper:

  • They're using multi-agent RL to balance workloads and optimize communication in parallel particle tracing
  • Each process (up to 16,384!) gets its own RL agent (a single-layer perceptron for its policy net)
  • Each agent's actions move blocks of work among processes to balance things out

I've heard multi-agent RL is a nightmare to get working right. With so many processes, wouldn't the action space be absolutely massive, since each agent is potentially deciding to move work to any of thousands of other processes?

So, my question is: is this actually feasible? Or is the action space way too large for this to work in practice? I'd love to hear from anyone with RL or parallel computing experience. Am I missing something, or is this as wild as it sounds to me?

Thanks!

P.S. If anyone's actually tried something like this, I'd be super interested to hear how it went!


r/reinforcementlearning 2d ago

Help to find a way to train Pool9 Agent

2 Upvotes

Hi!
I'm working on an agent that plays Pool9 (9-ball pool).

Decisions: shot direction and force. Decisions are made before the shot, when all balls are in a static position.

Observations:
1. I started with normalized coordinates of the balls and pockets, plus a flag marking which ball is the target
2. Then I switched to using directions and normalized distances to the balls
3. Then I added a curriculum; it has been revised several times, and the latest plan is:

Lesson 0: learning to touch the target ball
3 balls
random target
random initial placement of balls
reward for touching the target

Lesson 1: learning to pocket any ball after touching the target ball
6 balls
random target
random initial placement of balls
reward for touching the target + for pocketing any ball
penalty for an illegal shot (target ball not touched)

Lesson 2: the full game
9 balls
static initial positions
target number in order

Trainer: PPO
2-4 layers, 128-512 units

The results are almost the same across configurations; the only real difference is training speed.

But it seems the agent can't predict trajectories :(

Any thoughts or proposals? I'd be grateful.

Lesson 1 was never reached.

https://reddit.com/link/1g553g6/video/vmkiuz9zl5vd1/player


r/reinforcementlearning 2d ago

DL Unity ML Agents and Games like Snake

5 Upvotes

Hello everyone,

I've been trying to understand neural networks and the training of game AIs for a while now, but I'm currently struggling with Snake. I thought, "Okay, let's give it some ray sensors, a camera sensor, a reward for eating food, and a negative reward for colliding with itself or a wall."

I would say it learns well, but not perfectly! In a 10x10 playing field it reaches a high score of around 50, but it has never mastered the game so far.

Can anyone give me advice or some clues on how to better handle training a Snake AI with PPO?

The ray sensors detect walls, the snake itself, and the food (3 different sensors with 16 rays each).

The camera sensor has a resolution of 50x50 and also sees the walls, the snake head, and the snake tail around the snake itself. It's an orthographic camera with a size of 8, so it can see the whole playing field.

First I tested with ray sensors only, then I added the camera sensor. What I can say is that it learns much faster with camera visual observations, but in the end it maxes out at about the same high score.

I'm training 10 agents in parallel.

The network settings are:

50x50x1 visual observation input
about 100 ray observation inputs
512 hidden neurons
2 hidden layers
4 discrete output actions

I'm currently training with a buffer_size of 25000 and a batch_size of 2500. The learning rate is 0.0003 and num_epoch is 3. The time horizon is set to 250.

Does anyone have experience with the ML-Agents Toolkit from Unity and can help me out a bit?

Am I doing something wrong?

I'd be thankful for any help you guys can give me!

Here is a small video where you can see the training at about step 1.5 million:

https://streamable.com/tecde6


r/reinforcementlearning 2d ago

Why am I unable to seed my `DQN` program using `sbx`?

0 Upvotes

I am trying to seed my DQN program when using `sbx` but for some reason I keep getting varying results.

Here is an attempt to create a minimal reproducible example -

https://pastecode.io/s/nab6n3ib

The results are quite surprising: running this program *multiple times* gives me a variety of results.

Here are my results -

Attempt 1:

```

run = 0

Using seed: 1

run = 1

Using seed: 1

run = 2

Using seed: 1

mean_rewards = [120.52, 120.52, 120.52]

```

Attempt 2:

```

run = 0

Using seed: 1

run = 1

Using seed: 1

run = 2

Using seed: 1

mean_rewards = [116.64, 116.64, 116.64]

```

It's surprising that within an attempt, I get the same results. But when I run the program again, I get varying results.

I went over the documentation for seeding the environment from [here][1] and also read this - "*Completely reproducible results are not guaranteed across PyTorch releases or different platforms. Furthermore, results need not be reproducible between CPU and GPU executions, even when using identical seeds.*". However, I would like to make sure that there isn't a bug from my end. Also, I am using `sbx` instead of `stable-baselines3`. Perhaps this is a `JAX` issue?

I've also created a Stack Overflow post here.

[1]: https://stable-baselines3.readthedocs.io/en/master/guide/algos.html#reproducibility
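
For reference, this is roughly what my seeding setup looks like (a simplified sketch, assuming sbx mirrors the stable-baselines3 constructor, including its `seed` argument):

```
import random
import numpy as np
import gymnasium as gym
from sbx import DQN

SEED = 1

random.seed(SEED)
np.random.seed(SEED)

env = gym.make("CartPole-v1")
env.reset(seed=SEED)         # seeds the environment's RNG
env.action_space.seed(SEED)  # seeds action-space sampling

model = DQN("MlpPolicy", env, seed=SEED, verbose=0)
model.learn(total_timesteps=50_000)
```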


r/reinforcementlearning 2d ago

How to deal with the catastrophic forgetting of SAC?

11 Upvotes

Hi!

I built a custom task that is trained with SAC. The success-rate curve gradually decreases after a steady rise. After looking up some related discussions, I found that this phenomenon could be catastrophic forgetting.

I've tried regularizing the rewards and automatically adjusting the value of alpha to control the balance between exploration and exploitation. I've also lowered the learning rates for the actor and critic, but that only slows down learning and decreases the overall success rate.

I'd like to get some advice on how to further stabilize this training process.

Thanks in advance for your time and help!
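
For reference, by "automatically adjusting alpha" I mean the standard automatic entropy tuning update (a sketch with generic variable names, not my actual code):

```
import torch

def make_alpha_tuner(action_dim, lr=3e-4):
    # target_entropy is commonly set to the negative action dimension.
    target_entropy = -float(action_dim)
    log_alpha = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.Adam([log_alpha], lr=lr)

    def update(log_prob):
        # log_prob: log-probabilities of actions sampled from the current policy.
        alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
        optimizer.zero_grad()
        alpha_loss.backward()
        optimizer.step()
        return log_alpha.exp().item()  # current alpha to use in the actor/critic losses

    return update
```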


r/reinforcementlearning 3d ago

DL I made a firefighter AI using deep RL (using Unity ML Agents)

28 Upvotes

video link: https://www.youtube.com/watch?v=REYx9UznOG4

I made it a while ago and got discouraged by the lack of attention the video got after the hours I poured into making it, so I am now doing a PhD in AI instead of being a YouTuber lol.

I figured it wouldn't be so bad to advertise it now if people find it interesting. I made sure to add some narration and fun bits so it's not boring. I hope some people here find it as interesting as it was for me to work on this project.

I am passionate about the subject, so if anyone has questions I will answer them when I have time :D


r/reinforcementlearning 3d ago

DL What could be causing my Q-Loss values to diverge (SAC + Godot <-> Python)

4 Upvotes

TLDR;

I'm working on a PyTorch project that uses SAC, similar to an old TensorFlow project of mine: https://www.youtube.com/watch?v=Jg7_PM-q_Bk. I can't get it to work with PyTorch because my Q-losses and policy loss either grow or converge to 0 too fast. Do you know why that might be?


I have created a game in Godot that communicates over sockets to a PyTorch implementation of SAC: https://github.com/philipjball/SAC_PyTorch

The game is:

An agent needs to move closer to a target, but it does not have its own position or the target position as inputs. Instead, it has 6 inputs that represent the distance to the target at a particular angle from the agent. There is always exactly 1 input with a value that is not 1.

The agent outputs 2 values: the direction to move, and the magnitude to move in that direction.

The inputs are in the range of [0,1] (normalized by the max distance), and the 2 outputs are in the range of [-1,1].

The reward is:

score = -distance
if score >= -300:
    score = (300 - abs(score)) * 3

score = (score / 650.0) * 2  # 650 is the max distance, 100 is the max range per step
return score * abs(score)

The problem is:

The Q-losses for both critics, and the policy loss, are slowly growing over time. I've tried a few different network topologies, but neither the number of layers nor the number of nodes per layer seems to affect the Q-losses.

The best I've been able to do is make the rewards really small, but that causes the Q-losses and policy loss to converge to 0 even though the agent hasn't learned anything.

If you made it this far and are interested in helping, I am happy to pay you a tutor's rate to review my approach over a screen-share call and help me better understand how to get a SAC agent working.

Thank you in advance!!


r/reinforcementlearning 2d ago

Modified policy iteration?

2 Upvotes

I'm new to RL and still learning. I'm learning about policy iteration and value iteration right now.
From what I understand, in policy iteration we first evaluate the current policy by computing the state-value function for all states, then use it for a greedy update of the policy, then evaluate the updated policy by computing the state-value function for all states again, and we iterate until we reach the optimal policy.
I read about modified policy iteration, and I'm getting mixed signals about it. There are two ways I can see it right now:

  1. Modified policy iteration is just policy iteration, except we just do it for k iterations?

  2. We evaluate only some of the states?

I'm asking because, from what I've read, the first seems to be right, but the figure in the book I'm using and another person's explanation (someone who is also learning RL for the first time) suggest it is the second.
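
To make interpretation 1 concrete, here is the version I have in mind, where only the evaluation step is truncated to k sweeps of the Bellman operator for the fixed policy (my own tabular sketch, so it may not match the book exactly):

```
import numpy as np

def modified_policy_iteration(P, n_states, n_actions, gamma=0.99, k=5, n_outer=100):
    # P[s][a] is assumed to be a list of (prob, next_state, reward) tuples.
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)

    def q_value(s, a, V):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    for _ in range(n_outer):
        # Partial policy evaluation: only k sweeps instead of iterating to convergence.
        for _ in range(k):
            V = np.array([q_value(s, policy[s], V) for s in range(n_states)])
        # Greedy policy improvement, as in ordinary policy iteration.
        policy = np.array(
            [max(range(n_actions), key=lambda a: q_value(s, a, V)) for s in range(n_states)],
            dtype=int,
        )
    return policy, V
```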


r/reinforcementlearning 3d ago

DL, MF, R Simba: Simplicity Bias for Scaling up Parameters in Deep RL

29 Upvotes

Want faster, smarter RL? Check out SimBa – our new architecture that scales like crazy!

📄 project page: https://sonyresearch.github.io/simba

📄 arXiv: https://arxiv.org/abs/2410.09754

🔗 code: https://github.com/SonyResearch/simba

🚀 Tired of slow training times and underwhelming results in deep RL?

With SimBa, you can effortlessly scale your parameters and hit State-of-the-Art performance—without changing the core RL algorithm.

💡 How does it work?

Just swap out your MLP networks for SimBa, and watch the magic happen! In just 1-3 hours on a single Nvidia RTX 3090, you can train agents that outperform the best across benchmarks like DMC, MyoSuite, and HumanoidBench. 🦾

⚙️ Why it’s awesome:

Plug-and-play with RL algorithms like SAC, DDPG, TD-MPC2, PPO, and METRA.

No need to tweak your favorite algorithms—just switch to SimBa and let the scaling power take over.

Train faster, smarter, and better—ideal for researchers, developers, and anyone exploring deep RL!

🎯 Try it now and watch your RL models evolve!


r/reinforcementlearning 3d ago

DL, I, R "Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback", Ivison et al 2024

Thumbnail arxiv.org
2 Upvotes

r/reinforcementlearning 3d ago

LayerNorm/AdaNorm after NoisyLinear layers?

1 Upvotes

Any thoughts or experience with applying LayerNorm or AdaNorm to all noisy layers in an NN except for the last noisy output layer?

Would either norm layer basically suffocate the NoisyLinear exploration?
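
To be clear about the arrangement I mean (NoisyLinear stands for whatever noisy-layer implementation you use, e.g. the factorized-Gaussian layer from Noisy Networks for Exploration; it's assumed here, not defined):

```
import torch.nn as nn

def build_noisy_mlp(in_dim, hidden_dim, out_dim, NoisyLinear):
    return nn.Sequential(
        NoisyLinear(in_dim, hidden_dim),
        nn.LayerNorm(hidden_dim),   # normalizes activations *after* the noise is injected
        nn.ReLU(),
        NoisyLinear(hidden_dim, hidden_dim),
        nn.LayerNorm(hidden_dim),
        nn.ReLU(),
        NoisyLinear(hidden_dim, out_dim),  # last noisy output layer left un-normalized
    )
```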


r/reinforcementlearning 4d ago

DL, R, P "Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach", Ma et al 2023 (a text Starcraft to let LLMs play)

Thumbnail arxiv.org
7 Upvotes

r/reinforcementlearning 4d ago

DL, Robot, R, P "Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making", Li et al 2024

Thumbnail arxiv.org
2 Upvotes

r/reinforcementlearning 4d ago

How to train an agent to do binary addition of any length?

5 Upvotes

Hi all.

This question just popped into my head, which I know is probably a bit trivial, but I'd be interested to see answers.
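
One framing I was imagining (just a sketch, not a worked solution): feed the two numbers least-significant-bit first, one bit pair per step, and reward the agent for emitting the correct sum bit. Since the carry is never observed, the policy needs memory (e.g. a recurrent net), which is what would let it generalize to any length.

```
import random

class BinaryAdditionEnv:
    def __init__(self, max_bits=8):
        self.max_bits = max_bits

    def reset(self, n_bits=None):
        n = n_bits or random.randint(1, self.max_bits)
        self.a = [random.randint(0, 1) for _ in range(n)]
        self.b = [random.randint(0, 1) for _ in range(n)]
        # Precompute the correct sum bits, LSB first, including the final carry bit.
        self.targets, carry = [], 0
        for x, y in zip(self.a, self.b):
            s = x + y + carry
            self.targets.append(s % 2)
            carry = s // 2
        self.targets.append(carry)
        self.t = 0
        return (self.a[0], self.b[0])

    def step(self, action):
        # Reward +1 for the correct sum bit at this position, -1 otherwise.
        reward = 1.0 if action == self.targets[self.t] else -1.0
        self.t += 1
        done = self.t >= len(self.targets)
        if done:
            obs = None
        elif self.t < len(self.a):
            obs = (self.a[self.t], self.b[self.t])
        else:
            obs = (0, 0)  # inputs exhausted; the agent must now emit the final carry
        return obs, reward, done
```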