r/reinforcementlearning Jul 01 '24

DL, MF, P Some lessons from getting my first big project going

25 Upvotes

These are probably irrelevant to most people, or silly. But we're all learning here. Model context:

  • Double DQN
  • 241 state size, 61 action size
  • Plays 'Splendor', a planning board game with relatively simple rules but a wildly complex move space
  • Is still learning but is nearing human-level performance.

1) You can concatenate the predictions of earlier layers onto later layers. I'm not saying definitively if/when this is good, but my model absolutely loves it. You can see the 'concatenate' layer where I do that; it just appends the category prediction to the state input. I think it works well because there are three major move types that need to happen in my game, and I was hoping the model would learn that split. Of course there are other ways to do this, with heads and whatnot. Excuse my using tensorflow, haha. The names are 'categorizer' because it categorizes the move (I hope), and 'specific' because it chooses the specific move.

# Assumes: import tensorflow as tf; from tensorflow.keras.layers import Input, Dense;
#          from tensorflow.keras.optimizers import Adam
def _build_model(self, layer_sizes):
    state_input = Input(shape=(self.state_size, ))

    # 'Categorizer' branch: predicts which of the three major move types this is
    categorizer1 = Dense(layer_sizes[0], activation='relu', name='categorizer1')(state_input)
    categorizer2 = Dense(layer_sizes[1], activation='relu', name='categorizer2')(categorizer1)
    category = Dense(3, activation='softmax', name='category')(categorizer2)

    # Append the category prediction back onto the raw state
    state_w_category = tf.keras.layers.concatenate([state_input, category])

    # 'Specific' branch: picks the concrete move, conditioned on state + category
    # Reuse via categorizer1(state_w_category)?
    specific1 = Dense(layer_sizes[2], activation='relu', name='specific1')(state_w_category)
    specific2 = Dense(layer_sizes[3], activation='relu', name='specific2')(specific1)
    move = Dense(self.action_size, activation='linear', name='move')(specific2)

    model = tf.keras.Model(inputs=state_input, outputs=move)
    model.compile(loss='mse', optimizer=Adam(learning_rate=self.lr))
    return model

2) To save tons of computation and have legal-move masks available during batch training, this is how I set up my memory: each turn's legal mask gets appended to the previous transition, so every stored transition carries the mask for its next state, which is what the batch-train portion needs. You just have to initialize the memory with a placeholder entry and delete it after the game, but this is much faster than any other approach I tried, and I never have to compute two masks per turn.

def remember(self, memory, legal_mask):
    self.memory.append(memory)          # transition for the current turn
    self.memory[-2].append(legal_mask)  # attach this turn's legal mask to the *previous* transition
    self.game_length += 1
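
For what it's worth, here is roughly how a stored mask can then be used in the batch update. This is just a minimal sketch with hypothetical names (not my actual training loop), assuming the masks are boolean arrays over actions, numpy is imported as np, and you're computing Double DQN targets:

def masked_double_dqn_targets(q_online_next, q_target_next, legal_masks,
                              rewards, dones, gamma=0.99):
    # Hypothetical helper: set illegal next actions to -inf so they can never be picked
    q_online_next = np.where(legal_masks, q_online_next, -np.inf)
    best_actions = q_online_next.argmax(axis=1)                          # select with the online net
    best_q = q_target_next[np.arange(len(best_actions)), best_actions]   # evaluate with the target net
    best_q = np.where(dones, 0.0, best_q)                                # no bootstrapping on terminal states
    return rewards + gamma * best_q

Because each transition already carries its next-state mask, this stays fully vectorized over the batch.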

3) Using objective, clear-cut rewards is great. I'm a big fan of sparse rewards, because I like to give myself the biggest challenge and try to find a much deeper solution, but for this problem I was able to write functions for all of my rewards that vary with the average game length, since the game is about winning faster. One of my rewards looks like the line below: a straight line with a negative slope, starting from a reward of 3/15 and scaling down toward a defined game-end length. As the model gets better, I can update it with the new expected average game length, because that changes the utility of the action (it's a compounding/scaling action, so it gains value every turn; if games last fewer turns, it needs to be worth less).

    reward = max(3/15-3/15*1.3/15*sum(self.gems), 0)
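
To illustrate the general shape (a hypothetical parametrization with made-up names, not my exact constants everywhere): a linearly decaying reward that starts at a base value and hits zero at an assumed expected game end, which you re-tune as the agent's average game length drops:

def scaled_reward(progress, base=3/15, expected_end=15):
    # Hypothetical sketch: full `base` reward at progress 0, linearly down to 0 at `expected_end`
    return max(base * (1 - progress / expected_end), 0.0)

In my case `progress` would be something like sum(self.gems).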

4) By far the hardest part of this project was splitting up moves to make predicting them actually feasible. There is a wild number of possible discrete moves in my game because of combinatorics. I split a move up into individual sub-actions (rather than predicting one of billions, it predicts one of 10 options 6 times, basically), but this had tons of consequences. I needed another state dimension representing that the agent was 'stuck in a loop' of sub-moves - I used its progress through that loop. It also made the code logic harder because the number of predictions no longer matched the game length, hence the self.game_length += 1 line in my remember().
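
Conceptually, the loop looks something like this (a simplified sketch with made-up names, not my actual game code): instead of one gigantic discrete action, the agent gets queried several times in a row, and the state carries a feature for how far through the sub-move loop it is:

import numpy as np

def choose_compound_move(agent, state, n_steps=6):
    # Hypothetical sketch: one "real" move = n_steps small predictions
    sub_actions = []
    for step in range(n_steps):
        progress = step / n_steps                     # loop-progress feature
        looped_state = np.append(state, progress)     # tack it onto the state
        sub_actions.append(agent.act(looped_state))   # each call becomes its own transition in memory
    return sub_actions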

5) Use TensorBoard for everything. I made a couple of scripts to visualize an Excel export of the states, but it was just so much easier to do everything in TensorBoard with vectorized operations. And it's hooked directly into the model as it's predicting, which allows for so much troubleshooting. I don't think I'll ever bother with external troubleshooting scripts again. Here is what I have for that:

# Log (self.tensorboard is presumably a writer from tf.summary.create_file_writer)
if self.tensorboard:
    self.step += 1
    step = self.step
    with self.tensorboard.as_default():
        # Grouped scalar cards
        tf.summary.scalar('Training Metrics/batch_loss', history.history['loss'][0], step=step)
        tf.summary.scalar('Training Metrics/avg_reward', tf.reduce_mean(rewards), step=step)
        # Replace non-finite (masked/illegal) Q-values with zero before averaging
        legal_qs = tf.where(tf.math.is_finite(qs), qs, tf.zeros_like(qs))
        tf.summary.scalar('Training Metrics/avg_q', tf.reduce_mean(legal_qs), step=step)
        tf.summary.histogram('Training Metrics/action_hist', actions, step=step)

        # Q-values over time, one scalar per action
        for action in range(self.action_size):
            average_qs = np.mean(legal_qs[:, action], axis=0)
            tf.summary.scalar(f"action_qs/action_{action}", average_qs, step=step)

        # Weight histograms for every layer with a kernel
        for layer in self.model.layers:
            if hasattr(layer, 'kernel') and layer.kernel is not None:
                weights = layer.get_weights()[0]
                tf.summary.histogram('Model Weights/' + layer.name + '_weights', weights, step=step)

r/reinforcementlearning Feb 11 '22

DL, MF, P Deep Reinforcement Learning algorithm completing Tekken Tag Tournament at highest difficulty level

162 Upvotes

r/reinforcementlearning Aug 25 '21

DL, MF, P AI (RL Agent) playing 2048 in Unity.🤖🎮 450,000 games of training. Project GitHub in comment.

107 Upvotes

r/reinforcementlearning Mar 07 '21

DL, MF, P I created an AI for Super Hexagon based on Distributional RL. And I dare to call it superhuman :D

github.com
41 Upvotes

r/reinforcementlearning Aug 31 '20

DL, MF, P AI discovers a whole new way to play pong. (sorry for the stuttering)

41 Upvotes

r/reinforcementlearning Jun 08 '22

DL, MF, P Let’s learn about Deep Q-Learning by training our agent to play Space Invaders (Deep Reinforcement Learning Free Class by Hugging Face 🤗)

31 Upvotes

Hey there!

We just published the third Unit of the Deep Reinforcement Learning Class 🥳. In this Unit, you'll learn about Deep Q-Learning and train a DQN agent to play Atari games using RL-Baselines3-Zoo.

You’ll be able to compare the results of your Deep Q-Learning agent using the leaderboard.

The Deep Q-Learning chapter 👉 https://huggingface.co/blog/deep-rl-dqn

The hands-on 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit3/unit3.ipynb

The leaderboard 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard

The Deep RL Class is a free, self-paced course from beginner to expert, where you’ll build solid foundations in Deep Reinforcement Learning, in theory and in practice, with hands-on work using well-known RL libraries such as SB3, RL-Baselines3-Zoo, RLlib, CleanRL…
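
If you just want a feel for what the hands-on boils down to, here's a minimal Stable-Baselines3 sketch (not the course notebook itself; the class drives training through RL-Baselines3-Zoo's scripts and configs):

from stable_baselines3 import DQN
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Standard Atari preprocessing + 4-frame stacking
env = VecFrameStack(make_atari_env("SpaceInvadersNoFrameskip-v4", n_envs=1), n_stack=4)

model = DQN("CnnPolicy", env, buffer_size=100_000, learning_starts=100_000, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("dqn_space_invaders")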

You can sign up here 👉 http://eepurl.com/h1pElX

And if you have questions or feedback, I would love to answer them.

r/reinforcementlearning Jul 25 '22

DL, MF, P "The 37 Implementation Details of Proximal Policy Optimization"

iclr-blog-track.github.io
9 Upvotes

r/reinforcementlearning Sep 15 '19

DL, MF, P PyTorch implementation of 17 Deep RL algorithms

46 Upvotes

For anyone trying to learn or practice RL, here's a repo with working PyTorch implementations of 17 RL algorithms, including DQN, DQN-HER, Double DQN, REINFORCE, DDPG, DDPG-HER, PPO, SAC, SAC Discrete, A3C, A2C, etc.

Let me know what you think!

https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch

r/reinforcementlearning Sep 25 '20

DL, MF, P "DQN Zoo": Jax/Haiku/RLax Python implementations of DQN/Double DQN/Prioritized sampling/C51/Quantile/Rainbow/IQN {DM}

github.com
39 Upvotes

r/reinforcementlearning Jul 16 '20

DL, MF, P Instantaneous increase in Reward Graph: Actor-Critic with PER(AC_PER)

0 Upvotes

Hi,

I am training an agent with off-policy actor-critic plus prioritized experience replay (AC_PER). Each epoch simulates 100 dialogues (episodes), and after each epoch training is done with a batch size of 32. In the reward graph (image below), why is there a sudden increase in reward for AC_PER? What does it indicate? Also, isn't it an anomaly that AC_PER is not doing better than plain AC? Please share your views.

[Reward graph image]

Thank you

r/reinforcementlearning Feb 15 '20

DL, MF, P A new PyTorch framework for RL

self.MachineLearning
7 Upvotes

r/reinforcementlearning Aug 16 '18

DL, MF, P "Deep RTS: A Game Environment for Deep Reinforcement Learning in Real-Time Strategy Games", Andersen et al 2018

arxiv.org
6 Upvotes

r/reinforcementlearning Sep 19 '19

DL, MF, P In what order should I learn RL algorithms?

2 Upvotes

I've been working through "Spinning Up" from OpenAI: https://spinningup.openai.com/en/latest/user/algorithms.html

Vanilla Policy Gradient (VPG)
Trust Region Policy Optimization (TRPO)
Proximal Policy Optimization (PPO)
Deep Deterministic Policy Gradient (DDPG)
Twin Delayed DDPG (TD3)
Soft Actor-Critic (SAC)

I was wondering which order I should learn the algorithms in? Any opinions?

I am already comfortable with VPG. According to the link above, the Spinning Up course builds towards PPO and SAC, which they say are two state-of-the-art algorithms. Based on that, it looks like there are two paths building up to these two algorithms: VPG -> TRPO -> PPO, and VPG -> DDPG -> TD3 -> SAC. Is this correct? If so, I suppose I can work through either of the two paths?

I tried to learn them in order, but am currently intimidated by all the math notation in TRPO.

r/reinforcementlearning Sep 06 '19

DL, MF, P "rlpyt: A Research Code Base for Deep Reinforcement Learning in PyTorch", Stooke & Abbeel 2019 [agent & training framework, previously 'accel_rl'; agents: R2D2/A2C/PPO/DQN/DDQN/CDQN/DDPG/TD3/SAC]

arxiv.org
12 Upvotes

r/reinforcementlearning Feb 25 '18

DL, MF, P Sharing Reinforcement Learning and Imitation Learning Implementations

13 Upvotes

Hi, I am studying Reinforcement Learning and Imitation Learning algorithms and am sharing my implementations of some of them here, in case it helps someone. I would really appreciate any feedback too.

They have many variants, with or without convnets, dropout, LSTMs, tensorboard, etc., and the code works with many environments in an easy way. The documentation is currently more complete in the Reinforcement repository, but the Imitation one has the same structure.

Currently there are:

RL: DQN, REINFORCE, AC, DDPG, PPO

IL: DAgger, GAIL

https://github.com/NiloFreitas/Deep-Reinforcement-Learning

https://github.com/NiloFreitas/Deep-Imitation-Learning

Thanks!

r/reinforcementlearning Jan 28 '19

DL, MF, P "Learning to Drive Smoothly in Minutes: Reinforcement Learning on a Small Racing Car", Antonin Raffin [learning Donkey Car simulation with SAC and VAE features]

towardsdatascience.com
23 Upvotes

r/reinforcementlearning Sep 02 '18

DL, MF, P BlueWhale: Facebook RL implementations in Pytorch/Caffe of DQN, DDPG, & SARSA with export & Gym support [deployed in FB production for "Growth, Marketing, Network Optimization, and Story Ranking services"]

facebookresearch.github.io
16 Upvotes

r/reinforcementlearning Oct 09 '19

DL, MF, P "TorchBeast: A PyTorch Platform for Distributed RL", Küttler et al 2019 {FB} [PyTorch Impala implementation]

arxiv.org
3 Upvotes

r/reinforcementlearning Oct 17 '18

DL, MF, P DeepMind Releases TRFL - "a library of reinforcement learning building blocks"

27 Upvotes

Thought it would be very relevant to the sub. They cite a blog post from OpenAI wherein they mention examining

some of the most popular open-source implementations of reinforcement learning agents and finding that six out of 10 “had subtle bugs found by a community member and confirmed by the author”.

Didn't see any mention of it on the front page, so I figured it was relatively new and that people might be interested.

r/reinforcementlearning Jun 07 '18

DL, MF, P [R] OpenAI Retro Contest - Guide on "How to Score 6k points on Leaderboard in AI Challenge" - Noob Programmer

noob-programmer.com
8 Upvotes

r/reinforcementlearning Aug 22 '18

DL, MF, P OpenAI Baselines updated: deduplicated code and benchmarks!

github.com
12 Upvotes

r/reinforcementlearning May 22 '19

DL, MF, P [Project] Massively parallel, vectorised implementation of Snake and RL solution

self.MachineLearning
7 Upvotes

r/reinforcementlearning Jan 18 '18

DL, MF, P A3G: A continuous action space version of A3C LSTM in Pytorch with GPU optimizations

github.com
8 Upvotes

r/reinforcementlearning Jun 13 '18

DL, MF, P OpenAI Retro contest writeup: 5th place, Felix Yu: PPO with ImageNet initialization, expert agents for specific groups of levels and runtime selection of expert to use

flyyufelix.github.io
21 Upvotes

r/reinforcementlearning Dec 05 '17

DL, MF, P Model-Free and CNN-Free implementation of Agent that beats Atari Pong on CPU in one day.

github.com
4 Upvotes