P-tuning vs Prefix-tuning vs Prompt-tuning.

Ankita Sinha
6 min read · Feb 19, 2024

With the advent of Large Language Models, which are trained to perform a wide variety of general tasks, there has been a surge of interest in making these models perform well on domain-specific data and adapting them to downstream tasks.

Fine-tuning might be the first solution that comes to mind, but fine-tuning requires updating and storing all the parameters of the LM. This means keeping a separate, modified copy of all the LM's parameters for each task, which can be extremely expensive given that models like GPT-2 have 774M parameters and GPT-3 has 175B parameters.
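To put that in perspective: a single copy of 175B parameters stored at 2 bytes each (16-bit precision) is roughly 350 GB, and full fine-tuning would need one such copy for every downstream task.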

And so new research areas have sprung up that sit between hand-writing (or searching for) hard discrete prompts and fine-tuning the entire model. These fall into 2 broad categories.

  1. Soft-Prompts.
  2. Adapters (LoRA is the most commonly used amongst them)

Soft-Prompts

Prompting is the approach of adding extra information for the model to condition on during its generation of the output.

example: in “generate the sentiment of the following review: I really loved the way the actors act”.

“generate the sentiment of the following review :” is the prompt and “I really loved …… “ is the input.

In Soft-Prompts, the weights of the LM are frozen, and separate learnable tensors are concatenated with the input embeddings and trained for the specific downstream task. We want to learn the prompts that give us the best results. All of these methods work well even with a fairly small labeled dataset.

Today we will look at a couple of these methods.

Prefix Tuning:

Prefix tuning has been specifically created for Natural Language Generation tasks.

https://arxiv.org/abs/2101.00190

In contrast with fine-tuning, prefix tuning stores only 1 copy of the transformer plus a small set of prefix parameters per task. Thus 1 LM can be reused across tasks by prepending the learnt prefix parameters to it.

Prefix tuning prepends a sequence of continuous task-specific vectors to the input and all the model layers.

https://arxiv.org/abs/2101.00190

This image from the prefix-tuning paper helps us understand this. In an auto-regressive model like GPT, it only prepends the prefix once. [PREFIX; x; y]. In an encoder-decoder model, it prepends the prefix to both the encoder and decoder. [PREFIX; x; PREFIX′; y].

Here h_i is the activation at position i. For the prefix positions (i ∈ P_idx, the set of prefix indices), h_i is taken directly from the trainable matrix P_θ; for every other position it is computed by the frozen transformer as usual:

h_i = P_θ[i, :] if i ∈ P_idx, otherwise h_i = LM_φ(z_i, h_<i)

So, as the image above shows, the learnable vectors are prepended not just to the input but to every layer of the model.

Directly optimizing the matrix P_θ is unstable, so it is reparametrized: a smaller matrix is fed through an MLP that produces P_θ, which stabilizes the learning.

Example: if the hidden dimension is 1000, P_θ would be an n * 1000 matrix, where n is the number of prefix tokens. We create a smaller matrix of size, say, n * 10 and pass it through an MLP that expands it to an n * 1000 matrix. After training, the n * 1000 matrix is stored and the n * 10 matrix (along with the MLP) is discarded.

P_θ[i, :] = MLP_θ(P_θ′[i, :])
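To make this concrete, here is a minimal PyTorch-style sketch of the reparametrization (my own illustration, not code from the paper; all names and dimensions are made up, and the per-layer key/value detail of real implementations is collapsed into a single vector per layer):

```python
import torch
import torch.nn as nn

class PrefixReparam(nn.Module):
    """Sketch of prefix-tuning's reparametrization: a small trainable matrix
    P_theta' is expanded by an MLP into the full prefix P_theta, which supplies
    the prefix activations h_i prepended at every transformer layer."""

    def __init__(self, prefix_len=10, small_dim=64, hidden_dim=1024, num_layers=24):
        super().__init__()
        # P_theta' : the small matrix that is actually optimized
        self.p_small = nn.Parameter(torch.randn(prefix_len, small_dim))
        # MLP_theta : maps each row of P_theta' to one prefix vector per layer
        self.mlp = nn.Sequential(
            nn.Linear(small_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, num_layers * hidden_dim),
        )
        self.num_layers, self.hidden_dim = num_layers, hidden_dim

    def forward(self):
        # P_theta[i, :] = MLP_theta(P_theta'[i, :])
        p_full = self.mlp(self.p_small)                            # (prefix_len, layers * hidden)
        return p_full.view(-1, self.num_layers, self.hidden_dim)   # (prefix_len, layers, hidden)

# After training we can run the module once, keep the expanded prefix,
# and discard p_small and the MLP:
# prefix = PrefixReparam()()   # shape (prefix_len, num_layers, hidden_dim)
```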

Prompt-tuning:

Prompt-tuning simplifies prefix tuning: it prepends learnable parameters only to the input, not to every layer.

This lets us treat the LLM as a black box: its weights are never touched or copied, and only the small prompt prepended to its input is learned.

They show that prompt-tuning alone is sufficient to be competitive with model-tuning (mind-blowing right !!!!).

It is also important to note that prompt-tuning becomes more competitive with scale i.e. size of the model used.

Similar to prefix-tuning, prompt-tuning considers all tasks as text generation tasks.

The weights of the model remain frozen.

Let's understand how this works:

As we saw above (section: Soft-Prompts), the input given to an LLM has 2 parts: the prompt (an instruction and maybe some examples) and the input text. We encode the input text to get its embeddings and prepend a learnable matrix of a fixed size, which is our learnable prompt. The whole sequence then goes through the transformer as usual; using the labeled dataset we compute a loss (cross-entropy in most cases), backpropagate, and update only the weights of the prompt vectors.
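Here is a minimal sketch of that loop in PyTorch, assuming a Hugging Face-style model that accepts `inputs_embeds`; the wrapper name, dimensions, and learning rate are all illustrative:

```python
import torch
import torch.nn as nn

class PromptTuningWrapper(nn.Module):
    """Sketch of prompt tuning: learnable prompt vectors are prepended to the
    input token embeddings; only these vectors receive gradient updates."""

    def __init__(self, model, embed_layer, prompt_len=20, hidden_dim=768):
        super().__init__()
        self.model, self.embed = model, embed_layer
        # the soft prompt -- the only trainable parameters
        self.prompt = nn.Parameter(torch.randn(prompt_len, hidden_dim) * 0.02)
        for p in self.model.parameters():          # freeze the LM
            p.requires_grad = False

    def forward(self, input_ids):
        tok_emb = self.embed(input_ids)                            # (batch, seq, hidden)
        prompt = self.prompt.unsqueeze(0).expand(tok_emb.size(0), -1, -1)
        inputs_embeds = torch.cat([prompt, tok_emb], dim=1)        # [PROMPT; x]
        # assumes a Hugging Face-style forward that accepts inputs_embeds
        return self.model(inputs_embeds=inputs_embeds)

# Training sketch: cross-entropy loss on the labels, optimizer over the prompt only.
# optimizer = torch.optim.AdamW([wrapper.prompt], lr=1e-3)
```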

P-Tuning:

In P-tuning, the prompt is modified so that trainable continuous vectors are mixed in with the discrete prompt tokens. P-Tuning was designed for NLU tasks rather than NLG tasks.

P-tuning treats the prompt as a set of learnable parameters that are updated by backpropagation. It differs from both methods above and is closer in spirit to prompt optimization, except that the prompts are continuous vectors instead of discrete tokens. A prompt encoder, which can be an LSTM or a Multi-Layer Perceptron, is used to produce these vectors.

Let us understand how this works by an example.

Where is DELHI located? INDIA.

Here INDIA is the label (y) and DELHI (x) is the other information that is very important and should remain constant. We can tweak the surrounding language, and each tweak will give us a different result, as we see in the image below.

https://arxiv.org/abs/2103.10385

We convert every discrete prompt into a template.

T = {[P0:i] , x , [P(i+1):j] , y , [P(j+1):k]}.

Thus, the template for the above example will be —

T = {[P0:i] , DELHI , [P(i+1):j] , INDIA , [P(j+1):k]}

This converts the task into filling in the [P] blanks in the input text so as to find the prompt that gives the best result.
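As a toy illustration (my own, not from the paper), the template can be thought of as a list that interleaves pseudo-token placeholders with the fixed context and label:

```python
# Toy representation of T = {[P0:i], x, [P(i+1):j], y, [P(j+1):k]}.
# The [P...] entries are pseudo tokens whose embeddings will be learned;
# x and y are the real (fixed) tokens.
def build_template(x_tokens, y_tokens, n_before=3, n_middle=3, n_after=3):
    pseudo = lambda n, start: [f"[P{start + i}]" for i in range(n)]
    return (pseudo(n_before, 0)
            + x_tokens
            + pseudo(n_middle, n_before)
            + y_tokens
            + pseudo(n_after, n_before + n_middle))

print(build_template(["DELHI"], ["INDIA"]))
# ['[P0]', '[P1]', '[P2]', 'DELHI', '[P3]', '[P4]', '[P5]', 'INDIA', '[P6]', '[P7]', '[P8]']
```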

https://arxiv.org/abs/2103.10385

In the above figure, the [P_i] are the pseudo prompt tokens and the h_i are the continuous prompt embeddings that are actually fed into the model. A prompt encoder maps the [P_i] to the h_i, and it is these embeddings that are learned.

They experimented with various encoder models (an identity function, i.e. using the embeddings directly; an MLP; and an LSTM, with the LSTM giving the best performance). The encoder maps the learnable vectors into the transformer's input space, the model generates the output as usual, and the error is backpropagated to update only the prompt parameters.
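A rough sketch of such a prompt encoder in PyTorch, assuming the bidirectional-LSTM-plus-MLP setup that performed best in the paper (all dimensions and names here are illustrative):

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Sketch of P-tuning's prompt encoder: raw pseudo-token embeddings are
    passed through a bidirectional LSTM and an MLP to produce the continuous
    prompts h_i that are fed into the frozen LM."""

    def __init__(self, num_pseudo_tokens=9, hidden_dim=768):
        super().__init__()
        self.raw = nn.Parameter(torch.randn(num_pseudo_tokens, hidden_dim))
        self.lstm = nn.LSTM(hidden_dim, hidden_dim // 2, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),
                                 nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))

    def forward(self):
        out, _ = self.lstm(self.raw.unsqueeze(0))    # (1, num_tokens, hidden)
        return self.mlp(out).squeeze(0)              # one h_i per pseudo token

# These h_i vectors replace the embeddings of the [P] slots in the template;
# the real tokens (e.g. DELHI) keep their ordinary embeddings from the frozen LM.
```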

This method is extremely similar to prompt-tuning, since both keep the LLM as a black box and don't add any parameters to it. The differences are the extra prompt encoder and the ability to interleave discrete and learnable prompts.

As the paper states at the beginning, framing a prompt in different ways leads to a huge change in output quality. Keeping the most important information constant, we want to learn the remaining tokens in a way that gives us the highest accuracy.

We don’t touch the model at all. The model parameters are frozen and only the continuous prompt is tuned.

For example, if the prompt is "What is the capital of Britain?", then the label (y) is "London"; everything around these fixed pieces can be replaced by learnable vectors.

LoRA:

LoRA is an adapter-based technique that is quite different from everything we discussed above. In LoRA, we add small trainable low-rank matrices inside the transformer layers themselves, modifying the model rather than the prompt.
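For completeness, here is a minimal sketch of the core idea, a frozen weight matrix plus a trainable low-rank update, with made-up dimensions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer: the pretrained weight W is frozen
    and only a low-rank update B @ A (rank r) is trained on top of it."""

    def __init__(self, in_features=768, out_features=768, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():                 # freeze the pretrained layer
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # y = W x + scale * B (A x)
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)
```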

I am not going to go into a lot of detail here, since I feel this blog already does a very good job of it.
