Soft Q-Learning

Ankita Sinha
Apr 27, 2022

Let us start by understanding Q-Learning and then extend it to Soft Q-Learning.

What is Q-Learning?

(Figure: Q-learning; image by author)

Q-Learning is one of the most popular RL algorithms used to solve Markov Decision Processes. In an RL environment, the agent, starting from some state, takes an action; the environment returns a reward and moves the agent to the next state according to its rules.

Since this goes on in a cycle, the agent must choose actions that maximise its cumulative reward by the end of the episode. Q-Learning learns how much long-term reward the agent will get for each state-action pair (s, a).

Q-Learning solves this problem using the Bellman equation, which is as follows.

Q(Sₜ, Aₜ) ← Q(Sₜ, Aₜ) + α [Rₜ₊₁ + γ maxₐ Q(Sₜ₊₁, a) − Q(Sₜ, Aₜ)]

Q(Sₜ, Aₜ) is the state-action value for a particular state and action. It is updated toward the sum of the reward the agent gets when it moves to the next state and the discounted maximum value over all actions that can be taken in the next state. α is the step size and γ is the discount factor. The agent gathers these experiences by following a policy, for example an epsilon-greedy policy, which takes the action with the maximum estimated value except with a small probability epsilon, with which it explores by taking random actions to discover new states.
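As a concrete illustration, here is a minimal tabular sketch of this update and of epsilon-greedy selection; the table shape and the hyperparameter defaults are illustrative, not taken from any specific environment.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    # Explore with probability epsilon, otherwise exploit the best action.
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Move Q(s, a) one step toward the Bellman target: the reward
    # plus the discounted best value of the next state.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```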

How is Soft Q-Learning different from Q-Learning?

Instead of always taking the optimal action, i.e. the action with the maximum estimated value, Soft Q-Learning chooses an action with weighted probabilities. This is done by applying a softmax function over the value estimates of each action. The action with the maximum value is thus the most likely to be chosen, but it is not guaranteed to be chosen every time. This improves exploration during the training phase.
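A minimal sketch of this selection rule; the Q-values passed in at the bottom are made-up numbers for illustration.

```python
import numpy as np

def softmax_action(q_values, tau=1.0):
    """Sample an action with probability proportional to exp(Q / tau)."""
    # Subtract the max before exponentiating for numerical stability.
    z = (q_values - np.max(q_values)) / tau
    probs = np.exp(z) / np.sum(np.exp(z))
    return np.random.choice(len(q_values), p=probs), probs

action, probs = softmax_action(np.array([1.0, 2.0, 3.0]))
print(action, probs)  # the highest-valued action is most likely, not certain
```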

The Temperature Parameter

The temperature parameter (τ) controls the spread of the softmax distribution. It is generally decreased over time, so the policy explores broadly early in training and becomes closer to greedy as the value estimates improve.
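A quick illustration of this effect with hypothetical Q-values: a large τ flattens the distribution toward uniform, while a small τ concentrates it on the greedy action.

```python
import numpy as np

q = np.array([1.0, 2.0, 3.0])  # hypothetical action values

for tau in (10.0, 1.0, 0.1):
    z = (q - q.max()) / tau
    probs = np.exp(z) / np.exp(z).sum()
    print(tau, probs.round(3))
# tau = 10.0 -> ~[0.30, 0.33, 0.37]  (close to uniform: more exploration)
# tau =  1.0 -> ~[0.09, 0.24, 0.67]
# tau =  0.1 -> ~[0.00, 0.00, 1.00]  (nearly greedy: pure exploitation)
```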

This can be implemented as a SoftQLearner class in a tabular setting. Alpha is the temperature parameter and lr is the learning rate. While choosing an action (the choose_action function), we read the current Q-values for the state from the Q-value table. The soft value of a state is alpha * logsumexp(q_values / alpha), a softmax over the Q-values gives the final action probabilities, and a soft Bellman update moves Q(s, a) toward the reward plus the discounted soft value of the next state.
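Below is a minimal sketch of what such a class might look like; the gamma default, the update method name, and the hyperparameter values are illustrative assumptions, not a definitive implementation.

```python
import numpy as np
from scipy.special import logsumexp

class SoftQLearner:
    """Tabular soft Q-learning (sketch; hyperparameters are illustrative)."""

    def __init__(self, n_states, n_actions, alpha=1.0, lr=0.1, gamma=0.99):
        self.q_table = np.zeros((n_states, n_actions))
        self.alpha = alpha  # temperature: controls the softmax spread
        self.lr = lr        # learning rate for the Bellman update
        self.gamma = gamma  # discount factor (assumed value)

    def soft_value(self, state):
        # Soft state value: V(s) = alpha * logsumexp(Q(s, .) / alpha).
        return self.alpha * logsumexp(self.q_table[state] / self.alpha)

    def choose_action(self, state):
        # pi(a|s) = exp((Q(s, a) - V(s)) / alpha), i.e. a softmax over Q,
        # computed via logsumexp for numerical stability.
        q_values = self.q_table[state]
        probs = np.exp((q_values - self.soft_value(state)) / self.alpha)
        probs /= probs.sum()  # renormalize against floating-point drift
        return np.random.choice(len(probs), p=probs)

    def update(self, state, action, reward, next_state):
        # Soft Bellman update: the target uses the soft value of the next
        # state instead of the hard max used in vanilla Q-learning.
        target = reward + self.gamma * self.soft_value(next_state)
        self.q_table[state, action] += self.lr * (target - self.q_table[state, action])
```

Note that as alpha approaches 0, the soft value approaches max Q(s, a) and the softmax policy becomes greedy, recovering standard Q-learning; a larger alpha keeps the policy closer to uniform.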

Reference:

Haarnoja, T., Tang, H., Abbeel, P., Levine, S. Reinforcement Learning with Deep Energy-Based Policies. ICML 2017.
