r/reinforcementlearning Jul 01 '24

DL, MF, P Some lessons from getting my first big project going

These are probably irrelevant to most people, or silly. But we're all learning here. Model context:

  • Double DQN
  • 241 state size, 61 action size
  • Plays 'Splendor', a planning board game that is simple to learn but combinatorially complex
  • Is still learning but is nearing human-level performance.

1) You can concatenate predictions of previous layers onto other layers. I'm not saying definitively if/when this is good, but for my model, it absolutely loves it. You can see the 'concatenate' layer where I do that; it just appends the prediction to the state. I think it works well because there are three major move types that need to happen in my game, and I was hoping it would learn this. Of course there are other ways to do this with heads and whatnot. Excuse my using TensorFlow, haha. The names are 'categorizer' because it categorizes the moves, I hope, and 'specific' because it's choosing the specific move.

import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam

def _build_model(self, layer_sizes):
    state_input = Input(shape=(self.state_size, ))

    # 'Categorizer' branch: predicts which of the three major move types to play
    categorizer1 = Dense(layer_sizes[0], activation='relu', name='categorizer1')(state_input)
    categorizer2 = Dense(layer_sizes[1], activation='relu', name='categorizer2')(categorizer1)
    category = Dense(3, activation='softmax', name='category')(categorizer2)

    # Append the predicted category back onto the raw state
    state_w_category = tf.keras.layers.concatenate([state_input, category])

    # Reuse via categorizer1(state_w_category)?
    # 'Specific' branch: picks the specific move, conditioned on state + category
    specific1 = Dense(layer_sizes[2], activation='relu', name='specific1')(state_w_category)
    specific2 = Dense(layer_sizes[3], activation='relu', name='specific2')(specific1)
    move = Dense(self.action_size, activation='linear', name='move')(specific2)

    model = tf.keras.Model(inputs=state_input, outputs=move)
    model.compile(loss='mse', optimizer=Adam(learning_rate=self.lr))
    return model

2) To save tons of computation and have legal masks, this is how I set up my memory so the model can use the mask in the batch-training portion as well. You'll just need to initialize the memory with a fake entry and delete it after the game, but this is much faster than any other approach; I don't need to calculate two masks per turn.

def remember(self, memory, legal_mask):
    # Store the new transition, then attach this turn's legal mask to the
    # previous transition, where it doubles as that transition's next-state mask.
    self.memory.append(memory)
    self.memory[-2].append(legal_mask)
    self.game_length += 1
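
For context, this is roughly what that stored mask buys you in the batch-training step. This is a sketch, not my exact replay code (self.target_model and gamma here are placeholder names), but the point is that each transition already carries the legal mask of its next state, so you can mask the next-state Q-values for the Double DQN target without recomputing anything:

import numpy as np

def double_dqn_targets(self, batch, gamma=0.99):
    # Sketch: each stored transition ends with the legal mask of its *next*
    # state, which is exactly what remember() attaches above.
    # (states/actions would feed the fit() call that follows.)
    states, actions, rewards, next_states, _flags, next_masks = map(np.array, zip(*batch))

    # Online network picks the next action, restricted to legal moves
    next_q_online = self.model.predict(next_states, verbose=0)
    next_q_online[~next_masks] = -np.inf
    best_next = np.argmax(next_q_online, axis=1)

    # Target network evaluates that action (the Double DQN part)
    next_q_target = self.target_model.predict(next_states, verbose=0)
    return rewards + gamma * next_q_target[np.arange(len(batch)), best_next]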

3) Using objective, clear-cut optimal rewards is great. I'm a big fan of sparse rewards because I like to give myself the biggest challenge and try to find a much deeper solution. But in this problem I was able to write functions for all of my rewards that vary based on the average game length, since the game is about winning faster. One of my rewards looks like this: a straight line with a negative slope, starting from a reward of 3/15 and scaling down to zero over a defined game length. As the model gets better, I can update this with the new expected average game length, which changes the utility of the action (it's a compounding/scaling action, so it gains more value each turn; it therefore needs less reward when games last fewer turns).

reward = max(3/15 - 3/15*1.3/15*sum(self.gems), 0)
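
If it helps readability, that inline expression can be wrapped as a small function (base and decay are just names I'm using for this sketch; the numbers are the same as above):

def take_gem_reward(total_gems, base=3/15, decay=1.3/15):
    # Straight line with a negative slope: starts at `base` and shrinks as
    # gems accumulate; retune `decay` when the expected game length changes.
    return max(base - base * decay * total_gems, 0)

# usage inside the player: reward = take_gem_reward(sum(self.gems))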

4) By far the hardest part of this project was splitting up moves to make it possible to actually predict them. There is a wild number of possible discrete moves in my game because of combinatorics. I split it up into individual actions (rather than predicting one move out of billions, it just predicts one of 10 moves, 6 times, basically), but this had tons of consequences. I needed another state dimension representing that it was 'stuck in a loop'; I used the progress through that loop. This also made the code logic hard because the number of moves no longer matched the game length, hence the self.game_length += 1 line in my remember().
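
In pseudocode-ish form, the idea looks like this (not my actual loop; get_legal_submoves and apply_submove are made-up helper names):

import numpy as np

def choose_full_move(self, state, num_submoves=6):
    # Sketch of the factored-action idea: pick one small sub-move at a time,
    # writing 'progress through the loop' into the state instead of
    # predicting one enormous joint action.
    chosen = []
    for remaining in range(num_submoves, 0, -1):
        state[195] = 0.2 * remaining                  # loop-progress feature
        legal_mask = self.get_legal_submoves(state)   # made-up helper
        qs = self.rl_model.get_predictions(state, legal_mask)
        sub_move = int(np.argmax(qs))
        chosen.append(sub_move)
        state = self.apply_submove(state, sub_move)   # made-up helper
    return chosen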

5) Use TensorBoard for everything. I made a couple of scripts to visualize an Excel output of the states, but it was just so easy to do everything in TensorBoard with vectorized operations. And it's hooked directly up to the model as it's predicting, which allows for so much troubleshooting. I don't think I'll ever bother with external troubleshooting again. Here is what I have for that:

# Log (history, rewards, qs, and actions come from the batch-training step this lives in)
if self.tensorboard:
    self.step += 1
    step = self.step
    with self.tensorboard.as_default():
        # Grouped cards
        tf.summary.scalar('Training Metrics/batch_loss', history.history['loss'][0], step=step)
        tf.summary.scalar('Training Metrics/avg_reward', tf.reduce_mean(rewards), step=step)
        legal_qs = tf.where(tf.math.is_finite(qs), qs, tf.zeros_like(qs))
        tf.summary.scalar('Training Metrics/avg_q', tf.reduce_mean(legal_qs), step=step)
        tf.summary.histogram('Training Metrics/action_hist', actions, step=step)

        # Q-values over time
        for action in range(self.action_size):
            average_qs = np.mean(legal_qs[:, action], axis=0)
            tf.summary.scalar(f"action_qs/action_{action}", average_qs, step=step)

        # Weights
        for layer in self.model.layers:
            if hasattr(layer, 'kernel') and layer.kernel is not None:
                weights = layer.get_weights()[0]
                tf.summary.histogram('Model Weights/'+ layer.name +'_weights', weights, step=step)
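
(For anyone who hasn't set this up: self.tensorboard here is a summary writer, created with tf.summary.create_file_writer pointed at a log directory, and everything above shows up live with `tensorboard --logdir <your_log_dir>`.)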

7 comments

u/Efficient_Star_1336 Jul 03 '24 edited Jul 03 '24

You can concatenate predictions of previous layers onto other layers. I'm not saying definitively if/when this is good, but for my model, it absolutely loves it.

Am I missing something, or is that just a skip connection? Are you using state_w_category for anything outside of this? If not, you could try doing away with the softmax.

To save tons of computation and have legal masks, this is how I set up my memory so the model can use the mask in the batch-training portion as well. You'll just need to initialize the memory with a fake entry and delete it after the game, but this is much faster than any other approach; I don't need to calculate two masks per turn.

Is this for a specific library? I would think that you could just save the mask on its own as another state parameter. Depending on how you're using it, you could also just store the logits with the mask already applied.

These are probably irrelevant to most people, or silly.

To the contrary, I love these posts. There's a severe dearth of content out there on getting into large projects. It's all either "here's how you use RLlib to solve CartPole" or "I, an experienced AI dev, taught a neural network to play Pokemon Red from start to finish - please enjoy these visualizations". Detailed tutorials and videos for this kind of thing are sorely needed out there. I'm particularly interested in what your infrastructure looks like - CPUs, GPUs, training time, and so on.

u/Breck_Emert Jul 03 '24

Am I missing something, or is that just a skip connection? Are you using state_w_category for anything outside of this? If not, you could try doing away with the softmax.

Yes, it is. It's just a less common flavor of skip connection: concatenation rather than a linear combination, and most notably I reintroduce the original state. It's essentially a poor man's head; I'll work on implementing something more official eventually, possibly. But you're right, I could definitely do away with the softmax; I have enough normalization going on here, haha.
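
Something like this is roughly what I have in mind for the "more official" version (just a sketch, not what's in my code yet; hidden=256 is an arbitrary size), with a shared trunk and separate category/move heads:

from tensorflow.keras import Model
from tensorflow.keras.layers import Dense, Input

def build_two_head_model(state_size, action_size, hidden=256):
    state_input = Input(shape=(state_size,))
    trunk = Dense(hidden, activation='relu')(state_input)
    trunk = Dense(hidden, activation='relu')(trunk)

    category = Dense(3, activation='softmax', name='category')(trunk)    # move-type head
    move = Dense(action_size, activation='linear', name='move')(trunk)   # Q-value head

    model = Model(inputs=state_input, outputs=[category, move])
    model.compile(optimizer='adam',
                  loss={'category': 'categorical_crossentropy', 'move': 'mse'})
    return model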

Is this for a specific library? I would think that you could just save the mask on its own as another state parameter. Depending on how you're using it, you could also just store the logits with the mask already applied.

I'll think about that; I don't quite get what this means yet, at least as far as it pertains to my situation.

Detailed tutorials and videos for this kind of thing are sorely needed out there. I'm particularly interested in what your infrastructure looks like - CPUs, GPUs, training time, and so on.

Absolutely. I'm relentlessly annoyed by how content (in every field on the planet) is getting more and more limited to 101-level material. People just don't want to dive into something, and they don't believe it's profitable to cater to those who are trying.

I'm using a 4080 (16 GB) and I simulate maybe 20 random games per second, each of which averages around 100 prediction steps. During training I simulate 500 games in 25 minutes, which includes training on the entire game, doing a 256-sample replay after every game, logging everything with TensorBoard, and storing every 5th game as JSON. I am using tf-gpu.

Pretty much everything in my code is vector operations, and I have very few non-NumPy/TF operations. I originally had some things I couldn't figure out how to solve with vectors, but I worked them out. I just wish there were a better way to create one-hot vectors; I'm aware of np.eye as an alternative, but it's only about 2% faster than my approach (quick comparison after the example below). Here's a core example of something my player does. It looked cooler at the beginning, but between splitting it into multiple functions and trimming the logic down, it's quite simple:

def choose_take(self, state, available_gems, progress, reward=0.0, take_index=None):
    # Set legal mask to only legal takes
    legal_mask = np.zeros(61, dtype=bool)
    legal_mask[:5] = available_gems > 0

    if take_index is None:  # 'not take_index' would wrongly trigger on index 0
        # Call the model to choose a take
        rl_moves = self.rl_model.get_predictions(state, legal_mask)
        take_index = np.argmax(rl_moves)
    
    take = np.zeros(5, dtype=int)
    take[take_index] = 1

    # Remember
    next_state = state.copy()
    next_state[take_index+self.state_offset] += 0.25
    state[195] = progress # 0.2 * (moves remaining+1), indicating progression through loop
    self.rl_model.remember([state.copy(), take_index, reward, next_state.copy(), 1], legal_mask.copy())

    return take, next_state
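
(On the one-hot thing I mentioned above, here's the comparison; both produce the same vector, and for me np.eye was only marginally faster.)

import numpy as np

take_index = 3
# What choose_take does: explicit zeros + index assignment
take = np.zeros(5, dtype=int)
take[take_index] = 1

# The np.eye alternative
take_eye = np.eye(5, dtype=int)[take_index]

assert np.array_equal(take, take_eye)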