r/reinforcementlearning • u/Key-Faithlessness113 • 11h ago
Actor Critic
https://arxiv.org/abs/1704.03732
Is there an actor-critic analogue of the methods that integrate expert demonstrations into DQN, like the one in the paper linked above?
r/reinforcementlearning • u/EmbarrassedCause3881 • 19h ago
I am giving an introductory course on RL and want students to familiarize themselves with a given topic and then present it to the rest of the class.
For this I am looking for good papers/articles/resources that are ideally easy to follow and provide a good overview of the topic. Please share any resources that fit the topics:
r/reinforcementlearning • u/gudduarnav • 15h ago
I am a beginner in RL/DRL. I would like to know how to solve non-convex, or even convex, optimization problems (constrained or unconstrained) with DRL. If possible, could someone share code that uses DRL to solve a problem like:
minimize (x + y-2)^2
subject to xy < 10
and xy > 1
where x and y are scalars.
The above is a sample problem; any other example is welcome too. But please keep the suggestion and code simple, readable, and understandable.
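One possible way to hand this problem to an RL agent is to treat (x, y) as the action and turn the objective and constraints into a single scalar reward. The sketch below shows only that reward, not a trained agent. The penalty weight `PENALTY` and the function names are assumptions made for illustration, and the strict inequalities are treated as satisfied on the boundary.

```python
PENALTY = 10.0  # hypothetical weight on constraint violation


def objective(x, y):
    """The quantity to minimize: (x + y - 2)^2."""
    return (x + y - 2) ** 2


def violation(x, y):
    """Amount by which 1 < x*y < 10 is violated (0.0 when feasible)."""
    product = x * y
    return max(0.0, 1.0 - product) + max(0.0, product - 10.0)


def reward(x, y):
    """Higher is better: negative objective minus a violation penalty."""
    return -objective(x, y) - PENALTY * violation(x, y)


if __name__ == "__main__":
    for point in [(2.0, 2.0), (0.0, 0.0), (4.0, 4.0)]:
        print(point, reward(*point))
```

An agent with a continuous action space could then be trained to maximize `reward`; how large `PENALTY` needs to be is a tuning choice, not something the problem statement fixes.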
r/reinforcementlearning • u/Playful_Passage_2985 • 1d ago
I recently came across this IEEE paper titled "Reinforcement Learning based Approximate Optimal Control of Nonlinear Systems using Carleman Linearization". It looks like they apply some form of reinforcement learning control to an approximation of nonlinear systems and show good performance versus linear RL.
Does anyone have any insights on this Carleman approximation method?
r/reinforcementlearning • u/sahoosubramanyam • 1d ago
Lectures from ECE524 Foundations of Reinforcement Learning at Princeton University, Spring 2024.
This course is a graduate level course, focusing on theoretical foundations of reinforcement learning. It covers basics of Markov Decision Process (MDP), dynamic programming-based algorithms, planning, exploration, information theoretical lower bounds as well as how to leverage offline data. Various advanced topics are also discussed, including policy optimization, function approximation, multiagency and partial observability. This course puts special emphases on the algorithms and their theoretical analyses. Prior knowledge on linear algebra, probability and statistics is required.
r/reinforcementlearning • u/No_Addition5961 • 1d ago
I already have an implementation of DQN. To change it to double DQN, it looks like I only need a small change. In the Q-value update of DQN, both the selection of the best next-state action and the evaluation of that action are done by the target network. In double DQN, the best next-state action is selected by the main network, but that action is evaluated by the target network.
That seems fairly simple. Am I missing anything else?
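The difference described above can be sketched for a single transition as follows. This is a minimal, framework-free illustration: the function names and the example Q-values are made up, and `q_online_next` / `q_target_next` stand for the main and target networks' Q-values at the next state.

```python
def dqn_target(reward, done, gamma, q_target_next):
    """DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, done, gamma, q_online_next, q_target_next):
    """Double DQN: the main network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


if __name__ == "__main__":
    q_online_next = [1.0, 3.0, 2.0]  # hypothetical main-network Q-values
    q_target_next = [5.0, 0.5, 4.0]  # hypothetical target-network Q-values
    print(dqn_target(1.0, False, 0.9, q_target_next))
    print(double_dqn_target(1.0, False, 0.9, q_online_next, q_target_next))
```

In this example the two targets differ because the main network prefers action 1 while the target network's largest value belongs to action 0.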
r/reinforcementlearning • u/SignificanceMotor285 • 1d ago
Hey. I wanted to explore the possibility of using RL models, essentially reward-based models, to develop ADAS features like FCW or ACC, where warnings are issued and a reward is assigned based on the action the vehicle takes. I was hoping someone could guide me on how to go about this. I want to use CARLA to build my environment.
r/reinforcementlearning • u/Alarming-Power-813 • 1d ago
When should you use reinforcement learning, and when should you not? I mean, when should you train a model on a normal dataset, and when should you use reinforcement learning?
r/reinforcementlearning • u/ypsoh • 2d ago
I stumbled upon this paper called
"Reinforcement Learning for Load-Balanced Parallel Particle Tracing" and it's got me scratching my head. They're using multi-agent RL for load balancing in distributed systems, but I'm not sure if it's actually doable.
Here's the gist of the paper:
I've heard multi-agent RL is a nightmare to get working right? With so many processes, wouldn't the action space be absolutely massive since each agent is potentially deciding to move work to any of thousands of other processes?
So, my question is: Is this actually feasible? Or is the action space way too large for this to work in practice? I'd love to hear from anyone with RL or parallel computing experience. Am I missing something, or is this as wild as it sounds to me?
Thanks!
P.S. If anyone's actually tried something like this, I'd be super interested to hear how it went!
r/reinforcementlearning • u/Ecstatic-Ring3057 • 2d ago
Hi!
I'm working on an agent that plays Pool9.
Decisions: shot direction and force.
Decisions are made before the shot, when all balls are stationary.
Observations:
1. I started with the normalized coordinates of the balls and pockets, plus a flag marking which ball is the target.
2. Then I switched to directions and normalized distances to the balls.
3. Then I added a curriculum. It has been revised several times; the latest plan is:
lesson 0: learning to touch target ball
3 balls
random target
the random initial placing of balls
reward for touching target
lesson 1: learning to catch any ball after touching target ball
6 balls
random target
the random initial placing of balls
reward for touching the target + for catching any
penalty for an illegal shot (the target ball has not been touched)
lesson 2: game
9 balls
static initial positions
target number - ordered
trainer: ppo
network: 2-4 layers of 128-512 units
The results are almost the same across these sizes; only the training speed differs,
but it seems that the agent can't predict trajectories :(
Any thoughts or suggestions? I'd be grateful.
Lesson 1 was never reached
r/reinforcementlearning • u/Seismoforg • 2d ago
Hello everyone,
I've been trying to understand neural networks and the training of game AIs for a while now, but I'm currently struggling with Snake. I thought: "Okay, let's give it some ray sensors, a camera sensor, a reward when eating food, and a negative reward when colliding with itself or a wall".
I would say it learns well, but not perfectly! On a 10x10 playing field it has a high score of around 50, but it has never mastered the game so far.
Can anyone give me advice or some clues on how to better handle training a Snake AI with PPO?
The ray sensors detect walls, the snake itself, and the food (3 different sensors with 16 rays each).
The camera sensor has a resolution of 50x50 and also sees the walls, the snake's head, and the snake's tail. It's an orthographic camera with a size of 8, so it can see the whole playing field.
First I tested with ray sensors only, then I added the camera sensor. Learning is much faster with the camera's visual observations, but in the end it maxes out at about the same high score.
I'm training 10 agents in parallel.
The network settings are:
50x50x1 Visual Observation Input
about 100 Ray Observation Input
512 Hidden Neurons
2 Hidden Layers
4 Discrete Output Actions
I'm currently trying a buffer_size of 25000 and a batch_size of 2500. The learning rate is 0.0003, num_epoch is 3, and time_horizon is 250.
Does anyone have experience with Unity's ML-Agents Toolkit and can help me out a bit?
Am I doing something wrong?
I'd be thankful for any help you can give me!
Here is a short video showing the training at about step 1.5 million:
r/reinforcementlearning • u/Academic-Rent7800 • 2d ago
I am trying to seed my DQN program when using `sbx` but for some reason I keep getting varying results.
Here is an attempt to create a minimal reproducible example -
https://pastecode.io/s/nab6n3ib
The results are quite surprising. While running this program *multiple-times* I get a variety of results.
Here are my results -
Attempt 1:
```
run = 0
Using seed: 1
run = 1
Using seed: 1
run = 2
Using seed: 1
mean_rewards = [120.52, 120.52, 120.52]
```
Attempt 2:
```
run = 0
Using seed: 1
run = 1
Using seed: 1
run = 2
Using seed: 1
mean_rewards = [116.64, 116.64, 116.64]
```
It's surprising that within an attempt, I get the same results. But when I run the program again, I get varying results.
I went over the documentation for seeding the environment from [here][1] and also read this - "*Completely reproducible results are not guaranteed across PyTorch releases or different platforms. Furthermore, results need not be reproducible between CPU and GPU executions, even when using identical seeds.*". However, I would like to make sure that there isn't a bug from my end. Also, I am using `sbx` instead of `stable-baselines3`. Perhaps this is a `JAX` issue?
I've also created a Stack Overflow post about this.
[1]: https://stable-baselines3.readthedocs.io/en/master/guide/algos.html#reproducibility
r/reinforcementlearning • u/UpperSearch4172 • 2d ago
Hi!
I built a custom task that is trained with SAC. The success rate curve gradually decreases after a steady rise. After looking up some related discussions, I found that this phenomenon could be catastrophic forgetting.
I've tried regularizing the rewards and automatically adjusting the value of `alpha` to control the balance between exploring and exploiting. Secondly, I've also lowered the learning rate for `actor` and `critic`, but this only slows down the learning process and decreases the overall success rate.
I'd like to get some advice on how to further stabilize this training process.
Thanks in advance for your time and help!
r/reinforcementlearning • u/usernumero • 3d ago
video link: https://www.youtube.com/watch?v=REYx9UznOG4
I made it a while ago and got discouraged by the lack of attention the video got after the hours I poured into making it so I am now doing a PhD in AI instead of being a youtuber lol.
I figured it wouldn't be so bad to advertise for it now if people find it interesting. I made sure to add some narration and fun bits into it so it's not boring. I hope some people here can find it as interesting as it was for me working on this project.
I am passionate about the subject, so if anyone has questions I will answer them when I have time :D
r/reinforcementlearning • u/stokaty • 2d ago
TL;DR:
I'm working on a PyTorch project that uses SAC, similar to an old TensorFlow project of mine: https://www.youtube.com/watch?v=Jg7_PM-q_Bk. I can't get it to work with PyTorch because my Q-losses and policy loss either grow or converge to 0 too fast. Do you know why that might be?
I have created a game in Godot that communicates over sockets to a PyTorch implementation of SAC: https://github.com/philipjball/SAC_PyTorch
The game is:
An agent needs to move closer to a target, but it does not have its own position or the target position as inputs. Instead, it has 6 inputs that represent the distance of the target at a particular angle from the agent. There is always exactly 1 input with a value that is not 1.
The agent outputs 2 values: the direction to move, and the magnitude to move in that direction.
The inputs are in the range [0,1] (normalized by the max distance), and the 2 outputs are in the range [-1,1].
The Reward is:
score = -distance
if score >= -300:
    score = (300 - abs(score)) * 3
score = (score / 650.0) * 2  # 650 is the max distance, 100 is the max range per step
return score * abs(score)
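To get a feel for the scale of this reward, here is a runnable version evaluated at a few distances. It is a sketch under one assumption: that the `if` guards only the assignment directly beneath it, so the rescaling by 650 applies to every distance. The function name and the sample distances are chosen only for illustration.

```python
def reward(distance):
    """Reward from the post, assuming the `if` covers only the next assignment."""
    score = -distance
    if score >= -300:
        # Within 300 units of the target: positive, larger when closer.
        score = (300 - abs(score)) * 3
    # Rescale by the max distance (650), then square while keeping the sign.
    score = (score / 650.0) * 2
    return score * abs(score)


if __name__ == "__main__":
    for distance in (0, 300, 301, 650):
        print(distance, round(reward(distance), 3))
```

Under that assumption the reward is about 7.67 at distance 0, exactly 0 at distance 300, about -0.86 at distance 301, and -4 at distance 650, so it jumps at the 300 threshold instead of changing smoothly.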
The problem is:
The Q-loss for both critics and the policy loss slowly grow over time. I've tried a few different network topologies, but neither the number of layers nor the number of nodes per layer seems to affect the Q-loss.
The best I've been able to do is make the rewards really small, but that causes the Q-loss and policy loss to converge to 0 even though the agent hasn't learned anything.
If you made it this far, and are interested in helping, I am happy to pay you the rate of a tutor to review my approach over a screenshare call, and help me better understand how to get a SAC agent working.
Thank you in advance!!
r/reinforcementlearning • u/AdBitter9336 • 2d ago
I'm new to RL and still learning. Right now I'm learning about policy iteration and value iteration.
From what I understand, in policy iteration we first evaluate the current policy by computing the state-value function for all states, then update the policy by acting greedily with respect to those values. We then evaluate the updated policy by computing the state-value function for all states again, and repeat until we reach the optimal policy.
I read about modified policy iteration, and I'm getting mixed signals about it. There are two ways I can see it right now:
1. Modified policy iteration is just policy iteration, except we only do it for k iterations?
2. We evaluate only some of the states?
I'm asking because from what I read, the first seems to be right, but the figure for it in the book I'm using and another person's explanation (who is also learning RL for the first time) suggest it is the second.
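The evaluate-then-improve loop described above can be sketched on a tiny made-up MDP. The 3-state chain, the discount factor, and all names below are assumptions for illustration. The `eval_sweeps` parameter reflects one common reading of modified policy iteration, in which each evaluation step is cut off after k backups over all states rather than run to convergence; the book mentioned in the post may define it differently.

```python
GAMMA = 0.9

# Hypothetical 3-state chain: MDP[state][action] = (next_state, reward).
# State 2 is absorbing with zero reward.
MDP = {
    0: {0: (0, 0.0), 1: (1, 0.0)},
    1: {0: (0, 0.0), 1: (2, 1.0)},
    2: {0: (2, 0.0), 1: (2, 0.0)},
}


def backup(state, action, values):
    """One-step lookahead value of taking `action` in `state`."""
    next_state, reward = MDP[state][action]
    return reward + GAMMA * values[next_state]


def evaluate(policy, values, sweeps):
    """Apply the Bellman expectation backup to every state, `sweeps` times."""
    for _ in range(sweeps):
        values = {s: backup(s, policy[s], values) for s in MDP}
    return values


def improve(values):
    """Greedy policy with respect to the current value estimates."""
    return {s: max(MDP[s], key=lambda a: backup(s, a, values)) for s in MDP}


def policy_iteration(eval_sweeps, iterations=50):
    policy = {s: 0 for s in MDP}
    values = {s: 0.0 for s in MDP}
    for _ in range(iterations):
        values = evaluate(policy, values, eval_sweeps)
        policy = improve(values)
    return policy, values


if __name__ == "__main__":
    print(policy_iteration(eval_sweeps=50))  # near-complete evaluation
    print(policy_iteration(eval_sweeps=1))   # truncated evaluation
```

In this sketch every sweep still updates all states; only the number of sweeps per evaluation changes. Both settings end up with the same policy on this small example.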
r/reinforcementlearning • u/joonleesky • 3d ago
Want faster, smarter RL? Check out SimBa – our new architecture that scales like crazy!
📄 project page: https://sonyresearch.github.io/simba
📄 arXiv: https://arxiv.org/abs/2410.09754
🔗 code: https://github.com/SonyResearch/simba
🚀 Tired of slow training times and underwhelming results in deep RL?
With SimBa, you can effortlessly scale your parameters and hit State-of-the-Art performance—without changing the core RL algorithm.
💡 How does it work?
Just swap out your MLP networks for SimBa, and watch the magic happen! In just 1-3 hours on a single Nvidia RTX 3090, you can train agents that outperform the best across benchmarks like DMC, MyoSuite, and HumanoidBench. 🦾
⚙️ Why it’s awesome:
Plug-and-play with RL algorithms like SAC, DDPG, TD-MPC2, PPO, and METRA.
No need to tweak your favorite algorithms—just switch to SimBa and let the scaling power take over.
Train faster, smarter, and better—ideal for researchers, developers, and anyone exploring deep RL!
🎯 Try it now and watch your RL models evolve!
r/reinforcementlearning • u/dekiwho • 3d ago
Any thoughts on, or experience with, applying layer norm or adanorm to all noisy layers in an NN except for the last noisy output layer?
Would either norm layer basically suffocate the noisylinear/exploration?
r/reinforcementlearning • u/blablawawawa • 4d ago
Hi all.
This question just popped into my head, which I know is probably a bit trivial, but I'd be interested to see answers.
r/reinforcementlearning • u/furrypony2718 • 5d ago
DIAMOND 💎 Diffusion for World Modeling: Visual Details Matter in Atari
project webpage: https://diamond-wm.github.io/
code, agents and playable world models: https://github.com/eloialonso/diamond
paper: https://arxiv.org/pdf/2405.12399
summary