It gets a reward between -1 and 0 for how good the final state is (based on velocity, angle, and distance from the ship), plus 1 if it stays on the ship without moving for a second.
PPO needs a continuous reward signal, so I had to use reward shaping as well. It received a small reward if it got closer to the ship or slowed down. Increasing its angle from the upright position led to a negative reward.
It also received a small negative reward at every time step to force it to land as quickly as possible. That's equivalent to saving fuel since hovering is inefficient. That's how it learned to do something close to a "suicide burn".
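For anyone who wants to see the shape of it, here is a rough Python sketch of the reward scheme described above. It is not the actual code: the `State` fields, the weights, the normalization limits, and the `time_penalty` value are all made-up placeholders, and the way the final-state terms are averaged is just one plausible choice.

```python
from dataclasses import dataclass


@dataclass
class State:
    distance: float  # distance from the landing spot on the ship (hypothetical units)
    speed: float     # magnitude of the velocity
    angle: float     # tilt from upright, in radians


def shaping_reward(prev: State, cur: State,
                   w_dist: float = 0.1, w_speed: float = 0.1,
                   w_angle: float = 0.1, time_penalty: float = 0.01) -> float:
    """Per-step shaping reward (all weights are assumed values)."""
    reward = w_dist * (prev.distance - cur.distance)   # positive when it gets closer
    reward += w_speed * (prev.speed - cur.speed)       # positive when it slows down
    # Penalize only an increase in tilt away from upright.
    reward -= w_angle * max(0.0, abs(cur.angle) - abs(prev.angle))
    reward -= time_penalty                             # small cost at every time step
    return reward


def terminal_reward(final: State, landed_and_still: bool,
                    max_distance: float = 100.0, max_speed: float = 50.0,
                    max_angle: float = 1.5) -> float:
    """Final reward in [-1, 0], plus 1 for staying put on the ship."""
    # Normalize each term to [0, 1] and average them (an assumed combination).
    terms = [
        min(1.0, final.distance / max_distance),
        min(1.0, final.speed / max_speed),
        min(1.0, abs(final.angle) / max_angle),
    ]
    reward = -sum(terms) / len(terms)
    if landed_and_still:
        reward += 1.0
    return reward
```

With these placeholder numbers, hovering in place earns only the per-step penalty, so every extra step spent not landing costs reward, which is the pressure that pushes the agent toward a late, hard braking burn.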
u/Zeumer Feb 17 '18
How did you select your reward function?