Ankita Sinha
Jan 31, 2024

--

Yes, the options sampled from the policy over options can and should change during learning. This adaptability is a fundamental part of the Hierarchical Reinforcement Learning (HRL) framework, and of the Option-Critic architecture discussed in the blog in particular. The higher-level policy over options picks options epsilon-greedily with respect to the option-value table Q_Omega, and the chosen option then hands control to its lower-level intra-option policy. As learning progresses, Q_Omega is updated from environment feedback, i.e. from how well each option actually performs. Since option selection is greedy with respect to these updated values (apart from the epsilon exploration), options that lead to better outcomes are chosen more and more often, so the set of options being sampled shifts over the course of training. This dynamic adjustment is what allows the agent to keep improving its decision-making and reach its goals efficiently, especially in complex environments where a static policy over options would fall short.
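
For concreteness, here is a minimal tabular sketch of that loop, not the blog's code: names such as n_states, n_options, alpha, gamma, and epsilon are illustrative assumptions. The policy over options reads the current Q_Omega table epsilon-greedily, and an SMDP-style update nudges Q_Omega toward the return each option actually earned, so the options that get sampled drift toward the better-performing ones.

```python
import numpy as np

n_states, n_options = 25, 4          # illustrative sizes, not from the blog
alpha, gamma, epsilon = 0.1, 0.99, 0.1

# Q_Omega[s, w] = estimated value of choosing option w in state s
Q_Omega = np.zeros((n_states, n_options))

def sample_option(state):
    """Epsilon-greedy policy over options: as Q_Omega is updated,
    the option returned for a given state can change."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_options)
    return int(np.argmax(Q_Omega[state]))

def update_Q_Omega(state, option, discounted_reward, k, next_state, done):
    """Simplified SMDP-style update after an option that ran for k steps:
    move Q_Omega(state, option) toward the reward it collected plus the
    value of the best option available where it terminated."""
    target = discounted_reward
    if not done:
        target += (gamma ** k) * np.max(Q_Omega[next_state])
    Q_Omega[state, option] += alpha * (target - Q_Omega[state, option])
```

Early in training Q_Omega is roughly uniform, so exploration dominates; once the updates start separating good options from bad ones, the argmax in sample_option returns different options than it did at the start, which is exactly the shift in sampled options described above.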
