Hi, what exactly do I have to do? Do I have to rewrite the start_training.py file?
No. You only have to change the start_training.py file to use the Sarsa algorithm instead of the Q-learn one.
It is very easy, please do not overthink it. Go and see how the Q-learn Python code is imported in start_training.py. Then identify where in start_training.py it is instantiated. The only thing you need to do is change those two points to the Sarsa case.
Hi, for the Sarsa algorithm there is an open point that is not clear to me.
When we slightly tweak the original start_training.py, we only have to do:
import sarsa  # instead of: import qlearn
qlearn = sarsa.Sarsa(actions=range(env.action_space.n),
                     alpha=Alpha, gamma=Gamma, epsilon=Epsilon)
QLearn and Sarsa have the same parameters, so we can use the same ones defined in the YAML file.
My doubt is about the methods:
QLearn::learn(self, state1, action1, reward, state2)
In the example: qlearn.learn(state, action, reward, nextState) --> makes sense, and all arguments are preprocessed before the call.
Sarsa::learn(self, state1, action1, reward, state2, action2)
Here, if we use the same analogy as start_training.py, we don't have the preprocessed action2 that should correspond to the next action.
I would expect a "nextAction" to be preprocessed here, just as nextState = ''.join(map(str, observation)) is.
In other words, I would expect to make a call like this for Sarsa:
qlearn.learn(state, action, reward, nextState, nextAction), but how do I preprocess nextAction?
Where is my misunderstanding? Do you have the solution for Exercise 2.2?
thanks in advance
Very intelligent question, which actually escaped us during the creation of the content.
As you indicate, the Sarsa algorithm shares almost everything with Q-learn, except that Sarsa takes an additional step before updating the policy:
- In Q-learn, once we perform the action a and receive the reward r and the next state s', we move straight to the update of the policy, so we provide (s, a, r, s') to the learn function.
- In Sarsa, after executing the action a and getting r and s', before updating the policy we ask the current policy which next action a' it would choose in the new state s'. Only then do we update the policy, providing (s, a, r, s', a') to the learn function.
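To make the difference concrete, here is a minimal tabular Sarsa sketch. This is a hypothetical illustration, not the course's actual sarsa.py; the method names chooseAction and learn simply mirror the ones discussed above, and the Q-learn variant would differ only in the update target, as noted in the comments:

```python
import random

class Sarsa:
    """Minimal tabular Sarsa sketch (hypothetical, for illustration only)."""
    def __init__(self, actions, alpha, gamma, epsilon):
        self.q = {}              # (state, action) -> Q-value
        self.actions = list(actions)
        self.alpha = alpha       # learning rate
        self.gamma = gamma       # discount factor
        self.epsilon = epsilon   # exploration rate

    def getQ(self, state, action):
        return self.q.get((state, action), 0.0)

    def chooseAction(self, state):
        # epsilon-greedy: explore with probability epsilon, else act greedily
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.getQ(state, a))

    def learn(self, state1, action1, reward, state2, action2):
        # Sarsa target uses the action a' actually chosen by the policy:
        #   Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a))
        # Q-learn would instead use max over actions:
        #   Q(s,a) += alpha * (r + gamma * max_a Q(s',a) - Q(s,a))
        target = reward + self.gamma * self.getQ(state2, action2)
        self.q[(state1, action1)] = self.getQ(state1, action1) + \
            self.alpha * (target - self.getQ(state1, action1))
```

Note that the extra action2 argument is the whole difference: Sarsa is on-policy, so it learns from the action the policy will actually take next.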
Very well spotted Sugreev!!!
Now, how do we implement this in the start_training.py script?
You need to make two changes:
- After you get your nextState, and before calling the learn function, call the chooseAction function with the nextState to obtain the next action a' (a nextAction variable).
- Modify the call to the learn function to include a' (nextAction) as a parameter.
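Putting the two changes together, the inner loop of start_training.py would look roughly like the following sketch. The run_episode wrapper and the env/qlearn objects are assumptions for illustration (I don't have the course's exact script in front of me); the nextState preprocessing is copied from the original:

```python
def run_episode(env, qlearn):
    """One Sarsa episode: choose a' before each update and pass it to learn().
    Hypothetical sketch; env is a Gym-style environment, qlearn a Sarsa instance."""
    observation = env.reset()
    state = ''.join(map(str, observation))
    action = qlearn.chooseAction(state)             # a : from the current policy

    while True:
        observation, reward, done, info = env.step(action)
        nextState = ''.join(map(str, observation))  # s' : same preprocessing as before

        # Change 1: ask the policy for a' on s' BEFORE updating
        nextAction = qlearn.chooseAction(nextState)

        # Change 2: pass a' to learn() -> Sarsa update with (s, a, r, s', a')
        qlearn.learn(state, action, reward, nextState, nextAction)

        state, action = nextState, nextAction       # reuse a' as the next step's a
        if done:
            return
```

A detail worth keeping: the nextAction you computed is reused as the action executed on the next iteration, so each action is chosen exactly once, which is what makes Sarsa on-policy.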
We did not have this modification in the exercise solutions, so we will update them thanks to you, Sugreev.
Let me know if something is still not clear.