Chapter 2, exercise 2.4: give more reward for going to the left than for going to the right

Hello,

I have a functional v1 of the cartpole environment and I am now looking to change the reward system.

I am guessing that I need to use the base velocity (positive or negative) as the condition for a better reward (let's say 1.5), and that the velocity is contained in the observations variable, but I have not managed to implement this in the code below.

I tried to use condition[1] as the velocity of the base, without success.

Could you please help me with this one?

    def _compute_reward(self, observations, done):
        """
        Gives more points for staying upright, gets data from given observations to avoid
        having different data than other previous functions
        :return:reward
        """

        if not done:
            reward = 1.0
        elif self.steps_beyond_done is None:
            # Pole just fell!
            self.steps_beyond_done = 0
            reward = 1.0
        
        else:
            if self.steps_beyond_done == 0:
                logger.warning("You are calling 'step()' even though this environment has already returned done = True. You should always call 'reset()' once you receive 'done = True' -- any further steps are undefined behavior.")
            self.steps_beyond_done += 1
            reward = 0.0

        return reward

Hi Laurent,
In the current implementation, the _compute_reward function gives 1 point if the pole is still up. This is checked in the condition:
if not done:
That means that if the pole is still up (not done), then reward=1.

Now you have to change the statement reward=1 so that the reward is no longer always 1 regardless of the direction in which the robot moves to compensate. The reward should be, let's say, 1 if the robot went to the right and 2 if the robot went to the left.

How do you know if the robot went to the right or left on the last action? By checking the observations variable that the _compute_reward function is receiving.

If you check the notes above that example, you will see how the observations vector is constructed (what it contains). observations[1] contains the speed applied on the last step, so that is the value you should check before assigning the reward accordingly.
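
As a minimal sketch of that check (not the course solution): the helper name directional_reward and its parameters are hypothetical, and the sketch assumes that observations[1] is the base velocity and that a negative value means the robot is moving to the left. Please verify the sign convention in your own simulation before relying on it.

```python
def directional_reward(base_velocity, left_reward=2.0, right_reward=1.0):
    """Return the per-step reward while the pole is still up.

    Assumption: a negative base velocity means the robot moved left.
    A velocity of exactly zero is treated like a move to the right.
    """
    if base_velocity < 0.0:
        return left_reward
    return right_reward


# Made-up observation vector, only to show the indexing;
# index 1 is assumed to hold the base velocity.
example_observations = [0.02, -0.35, 0.01, 0.40]
example_reward = directional_reward(example_observations[1])
```

Inside _compute_reward, only the `if not done:` branch would change, to something like `reward = directional_reward(observations[1])`. The rest of the function (the steps_beyond_done handling) can stay as it is.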

Let me know if it is clear.

Hi rtellez,

Thanks a lot for your reply, it is crystal clear!

I'll be able to move on to the next step.

Cheers!
