Chapter 2, exercise 2.4: give more reward for going to the left than going to the right

Hello,

I have a functional v1 of the cartpole environment and I am now looking to change the reward system.

Guessing that I need to use the base velocity (positive or negative) as a condition for a better reward (let's say 1.5), and guessing that the velocity is contained in the observation variable, I have not succeeded in implementing it in the code below …

I tried to use condition[1] as the velocity of the base, without success …

Could you please help me with this one?

def _compute_reward(self, observations, done):
    """
    Gives more points for staying upright; reads its data from the given
    observations so that it works on the same data as the previous functions.
    :return: reward
    """
    if not done:
        # Pole is still up: constant reward for this step
        reward = 1.0
    elif self.steps_beyond_done is None:
        # Pole just fell!
        self.steps_beyond_done = 0
        reward = 1.0
    else:
        if self.steps_beyond_done == 0:
            logger.warning("You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.")
        self.steps_beyond_done += 1
        reward = 0.0

    return reward

Hi Laurent,
In the current implementation of the _compute_reward function, the function provides 1 point if the pole is still up. This is checked in the condition:
if not done:
That means: if the pole is still up (not done), then reward = 1.

Now you have to change that statement, reward = 1, so the reward is not always 1 regardless of the direction the robot is compensating in. So the reward should be, let's say, 1 if the robot went to the right and 2 if the robot went to the left.

How do you know whether the robot went to the right or to the left on the last action? By checking the observations variable that the _compute_reward function receives.
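If you are unsure which index of the vector holds the velocity, a quick temporary check (just a debugging aid, not part of the final solution) is to log the whole vector inside the function:

    # Temporary: print the full observations vector each step to see
    # which index holds the base velocity.
    print("observations:", observations)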

If you check the notes above that example, you will see how the observations vector is constructed (what it contains). observations[1] contains the velocity of the base from the last step, so that is the value you should check in order to assign the reward accordingly.
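For example, here is a minimal sketch of that change in the 'if not done:' branch (assuming observations[1] holds the base velocity and that a negative value means the base moved to the left; flip the comparison if your environment uses the opposite sign convention):

    if not done:
        # Assumption: observations[1] is the base velocity from the last
        # step, with negative values meaning the base moved to the left.
        if observations[1] < 0.0:
            reward = 2.0  # base went to the left: higher reward
        else:
            reward = 1.0  # base went to the right: normal reward

The rest of the function (the steps_beyond_done handling) stays exactly as it is.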

Let me know if it is clear.