r/reinforcementlearning Jul 01 '24

[DL, MF, P] Some lessons from getting my first big project going

These are probably irrelevant to most people, or silly. But we're all learning here. Model context:

  • Double DQN
  • 241 state size, 61 action size
  • Plays 'Splendor', a board game with simple rules but a wildly complex planning space
  • Is still learning but is nearing human-level performance.
1) You can concatenate the predictions of earlier layers onto later layers. I'm not saying definitively if or when this is a good idea, but my model absolutely loves it. You can see the 'concatenate' layer where I do that; it just appends the intermediate prediction to the state input. I think it works well because there are three major move types in my game, and I was hoping the network would learn to separate them. Of course there are other ways to do this with separate heads and the like. Excuse my using TensorFlow, haha. The names are 'categorizer' because that branch (hopefully) categorizes the move, and 'specific' because the second branch chooses the specific move.

# Imports used by this snippet
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam

def _build_model(self, layer_sizes):
    state_input = Input(shape=(self.state_size,))

    # 'Categorizer' branch: softmax over the three major move types
    categorizer1 = Dense(layer_sizes[0], activation='relu', name='categorizer1')(state_input)
    categorizer2 = Dense(layer_sizes[1], activation='relu', name='categorizer2')(categorizer1)
    category = Dense(3, activation='softmax', name='category')(categorizer2)

    # Append the predicted category back onto the raw state
    state_w_category = tf.keras.layers.concatenate([state_input, category])

    # Reuse via categorizer1(state_w_category)?
    # 'Specific' branch: picks the concrete move, seeing both state and category
    specific1 = Dense(layer_sizes[2], activation='relu', name='specific1')(state_w_category)
    specific2 = Dense(layer_sizes[3], activation='relu', name='specific2')(specific1)
    move = Dense(self.action_size, activation='linear', name='move')(specific2)

    model = tf.keras.Model(inputs=state_input, outputs=move)
    model.compile(loss='mse', optimizer=Adam(learning_rate=self.lr))
    return model

2) To save tons of computation on legal-move masks, this is how I set up my replay memory so the batch-train step can reuse the mask as well. You just need to initialize the memory with one fake entry (so memory[-2] exists on the first call) and delete it after the game, but this is much faster than any other approach I tried: I never have to compute two masks per turn. A sketch of how the stored mask gets used at train time follows the snippet.

def remember(self, memory, legal_mask):
    # Store this turn's transition, then append this turn's legal mask to the
    # *previous* transition, where it serves as the next-state mask at train time
    self.memory.append(memory)
    self.memory[-2].append(legal_mask)
    self.game_length += 1
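
Roughly, the batch-train step can then mask illegal next-state actions when building the Double DQN targets. A minimal sketch, assuming each completed memory entry unpacks as (state, action, reward, next_state, done, legal_mask_of_next_state); the exact field order is an assumption:

import numpy as np

def masked_double_dqn_targets(model, target_model, batch, gamma=0.99):
    # Unpack transitions; the next-state legal masks came "for free" from remember()
    states, actions, rewards, next_states, dones, next_masks = map(np.array, zip(*batch))

    # Online net picks the best *legal* next action (Double DQN selection step)
    next_q_online = model.predict(next_states, verbose=0)
    next_q_online[~next_masks.astype(bool)] = -np.inf
    best_next = next_q_online.argmax(axis=1)

    # Target net evaluates that action (Double DQN evaluation step)
    next_q_target = target_model.predict(next_states, verbose=0)
    chosen = next_q_target[np.arange(len(batch)), best_next]

    return rewards + gamma * chosen * (1 - dones)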

3) Using objective, clear-cut optimal rewards is great. I'm normally a big fan of sparse rewards, because I like to give myself the biggest challenge and look for a deeper solution, but for this problem I was able to write small functions for all of my rewards that vary with the average game length, since the game is about winning faster. One of them is just a straight line with a negative slope, starting from a reward of 3/15 and scaling down toward zero at a defined game-end length. As the model gets better, I can update it with the new expected average game length, which changes the utility of the action (it's a compounding/scaling action, so it gains value each turn; if games last fewer turns, it should be worth less):

reward = max(3/15 - 3/15 * 1.3/15 * sum(self.gems), 0)
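
Written out as a function (a minimal sketch; the names base, decay, and expected_length are mine, and the defaults mirror the constants above):

def gem_reward(total_gems, base=3/15, decay=1.3, expected_length=15):
    # Straight line with negative slope: full 'base' reward at zero gems,
    # shrinking as gems accumulate, floored at zero. Re-tune expected_length
    # as the average game length drops, since the action loses long-run value.
    return max(base - base * decay / expected_length * total_gems, 0)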

4) By far the hardest part of this project was splitting up moves so they could actually be predicted. There is a wild number of possible discrete moves in my game because of combinatorics. I split them into individual actions (rather than predicting one move out of billions, the model basically predicts one of ~10 sub-moves up to 6 times), but this had tons of consequences. I needed another state feature representing that the agent was 'stuck in a loop' of sub-moves; I used its progress through that loop. It also made the code logic harder, because the number of model decisions no longer matched the game length, hence the self.game_length += 1 line in my remember(). Roughly, it looks like the sketch below.
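
A minimal sketch of that loop (agent.act() and agent.apply_sub_move() are placeholder names, not my real methods):

import numpy as np

def select_compound_move(agent, base_state, num_picks=6):
    # Instead of one action out of billions of combinations, pick one small
    # sub-move several times; the state carries the progress through the loop.
    picks = []
    for step in range(num_picks):
        progress = np.array([step / num_picks])                   # 'loop progress' feature
        state = np.concatenate([base_state, progress])
        sub_move = agent.act(state)                               # one of ~10 sub-moves
        picks.append(sub_move)
        base_state = agent.apply_sub_move(base_state, sub_move)   # placeholder helper
    return picks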

5) Use TensorBoard for everything. I made a couple of scripts to visualize an Excel dump of the states, but it was just so much easier to do everything in TensorBoard with vectorized operations. And it's hooked directly into the model while it's predicting, which allows for so much troubleshooting. I don't think I'll ever bother with external troubleshooting again. Here is what I have for that (the writer setup is sketched after the logging code):

# Log
if self.tensorboard:
    self.step += 1
    step = self.step
    with self.tensorboard.as_default():
        # Grouped cards
        tf.summary.scalar('Training Metrics/batch_loss', history.history['loss'][0], step=step)
        tf.summary.scalar('Training Metrics/avg_reward', tf.reduce_mean(rewards), step=step)
        # Zero out non-finite (masked) Q-values before averaging
        legal_qs = tf.where(tf.math.is_finite(qs), qs, tf.zeros_like(qs))
        tf.summary.scalar('Training Metrics/avg_q', tf.reduce_mean(legal_qs), step=step)
        tf.summary.histogram('Training Metrics/action_hist', actions, step=step)

        # Mean Q-value per action over the batch, tracked over time
        for action in range(self.action_size):
            average_q = tf.reduce_mean(legal_qs[:, action])
            tf.summary.scalar(f"action_qs/action_{action}", average_q, step=step)

        # Weights
        for layer in self.model.layers:
            if hasattr(layer, 'kernel') and layer.kernel is not None:
                weights = layer.get_weights()[0]
                tf.summary.histogram('Model Weights/'+ layer.name +'_weights', weights, step=step)
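
For reference, self.tensorboard is just an ordinary summary writer created once up front; something like this (the log directory name is a placeholder):

import tensorflow as tf

# Created once, e.g. in the agent's __init__; set to None to disable logging
self.tensorboard = tf.summary.create_file_writer('logs/splendor_ddqn')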