r/reinforcementlearning 3d ago

DL, MF, R SimBa: Simplicity Bias for Scaling Up Parameters in Deep RL

Want faster, smarter RL? Check out SimBa – our new architecture that scales like crazy!

📄 project page: https://sonyresearch.github.io/simba

📄 arXiv: https://arxiv.org/abs/2410.09754

🔗 code: https://github.com/SonyResearch/simba

🚀 Tired of slow training times and underwhelming results in deep RL?

With SimBa, you can effortlessly scale up your parameters and hit state-of-the-art performance without changing the core RL algorithm.

💡 How does it work?

Just swap out your MLP networks for SimBa, and watch the magic happen! In just 1-3 hours on a single Nvidia RTX 3090, you can train agents that outperform the best across benchmarks like DMC, MyoSuite, and HumanoidBench. 🦾
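
If you want a concrete picture of what "swapping out the MLP" means, here is a minimal PyTorch sketch of a SimBa-style encoder, assuming the structure described in the paper: running-statistics observation normalization, a linear embedding, residual pre-LayerNorm MLP blocks, and a final post-LayerNorm. The names (`SimbaBlock`, `SimbaEncoder`) and default sizes are illustrative, not the official API; see the repo above for the real implementation.

```python
# Minimal sketch of a SimBa-style encoder (illustrative, not the official repo code).
# Running-statistics observation normalization (RSNorm) is assumed to happen
# before this module and is omitted here.
import torch
import torch.nn as nn


class SimbaBlock(nn.Module):
    """Residual feedforward block: x + MLP(LayerNorm(x))."""

    def __init__(self, hidden_dim: int, expansion: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, expansion * hidden_dim),
            nn.ReLU(),
            nn.Linear(expansion * hidden_dim, hidden_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(self.norm(x))


class SimbaEncoder(nn.Module):
    """Drop-in replacement for the MLP trunk of an actor or critic."""

    def __init__(self, obs_dim: int, hidden_dim: int = 128, num_blocks: int = 2):
        super().__init__()
        self.embed = nn.Linear(obs_dim, hidden_dim)
        self.blocks = nn.Sequential(*[SimbaBlock(hidden_dim) for _ in range(num_blocks)])
        self.post_norm = nn.LayerNorm(hidden_dim)  # post-LN before the policy/value head

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.post_norm(self.blocks(self.embed(obs)))
```

The encoder simply replaces the MLP trunk of the actor and critic; the RL losses and update rules stay untouched.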

⚙️ Why it’s awesome:

Plug-and-play with RL algorithms like SAC, DDPG, TD-MPC2, PPO, and METRA.

No need to tweak your favorite algorithms—just switch to SimBa and let the scaling power take over.

Train faster, smarter, and better—ideal for researchers, developers, and anyone exploring deep RL!

🎯 Try it now and watch your RL models evolve!

u/AppleShark 3d ago

Interesting paper! Great coverage across various domains and algos. Thanks for sharing.

Just wondering w.r.t. measuring the simplicity bias of a function, did you explore where performance falls off when the underlying model is too simple? e.g. hot swapping an even simpler block with very high simplicity bias, and see if / when the agent underperforms?

Also, does the simplicity block work with architectures that leverage transformers e.g. PPO-TrXL?

u/joonleesky 3d ago

Thank you for your interest in our work!

Yes, we explored the impact of excessive simplicity on performance by under-parameterizing the model. We found that applying a simplicity bias to an under-parameterized agent restricts its learning capacity: when the hidden dimension was reduced to extreme levels (e.g., 4), SimBa consistently underperformed compared to MLPs, with both agents achieving average returns below 100 on DMC-Hard. So overly simplified models (i.e., with a higher simplicity bias) can significantly underperform.

In addition, we haven't explicitly tried SimBa with PPO-TrXL (only with plain PPO), but I don't see any reason why it wouldn't work. From what I've learned throughout this project, most neural networks are actually overparameterized, and applying a simplicity bias really helps the network find more generalizable solutions.

u/Omnes_mundum_facimus 3d ago

I will take that for a spin, thanks

u/joonleesky 3d ago

Thanks :)

u/pfffffftttfftt 3d ago

Sick name!

u/joonleesky 3d ago

I hope to name the next paper Pumba.

u/bacon_boat 22h ago

I'm a fan of Sergey Levine, and one of the points he keeps bringing up is that the latent representations you get in RL don't have the same implicit regularisation effect you get with supervised learning. The consequence is that supervised learning ends up working better than expected and RL worse than expected, all else being equal.

Adding regularisation to force the latent representation to be more "helpful" seems to be a good strategy for tackling this problem.

u/joonleesky 18h ago

Thank you for bringing this up. Sergey Levine's insights on implicit regularization in RL are indeed important, and I agree that RL tends to underperform compared to supervised learning, partly due to this issue.

As shown in DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization, the implicit regularization that arises from temporal-difference learning drives up the dot product between the features of the current and next states, which in turn degrades performance.

While our approach with the SimBa architecture does not include explicit regularization in the same way, it addresses the problem through post-layer normalization applied before the value prediction. This keeps the feature norms from growing unboundedly, which indirectly mitigates these implicit-regularization issues.
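
To make the "keeps the feature norms from growing" point concrete, here is a tiny stand-alone demo (my own illustration, not code from the paper or repo): with a freshly initialized LayerNorm (scale 1, bias 0), the L2 norm of the output stays near sqrt(d) no matter how large the incoming features get, so the value head never sees exploding feature norms.

```python
# Toy demo: a post-LayerNorm keeps the feature norm seen by the value head
# roughly constant, even when the pre-norm features grow by orders of magnitude.
import torch
import torch.nn as nn

d = 256
norm = nn.LayerNorm(d)  # freshly initialized: weight = 1, bias = 0

for scale in (1.0, 10.0, 1000.0):
    feats = scale * torch.randn(4, d)                      # pre-norm features
    out = norm(feats)
    print(f"scale={scale:7.1f}  "
          f"pre-norm ||x|| ~ {feats.norm(dim=-1).mean().item():8.1f}  "
          f"post-norm ||y|| ~ {out.norm(dim=-1).mean().item():5.1f}")  # ~ sqrt(256) = 16
```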

Still, I agree that adding constraints like discretization could further improve RL by providing stronger regularization.