Intuition for Encoder-Based Models vs Decoder-Based Models

Ankita Sinha
8 min read · Feb 5, 2024

We will discuss the difference between encoder-based models and decoder-based models, and everything in between.

Transformer Architecture

Both Encoders and Decoders have their different use cases and very specific purposes. With the advent of so many new LLMs, it is becoming difficult to know how they are different from each other (Apart from being trained on bigger and bigger datasets). In this blog, I will try to explain when we need an encoder-based model and when a decoder-based model comes to our rescue. I will also delve into how the training and inference of both these models are different.

Encoder

Encoders are generally used for tasks that require understanding the text. BERT (Bidirectional Encoder Representations from Transformers) is one common model that comes to mind when we think about Encoders. Let’s use BERT to understand the training and inference for encoder-based models.

BERT is trained on two objectives within the same training procedure:

  1. Predicting tokens that have been masked.
  2. Predicting if two sentences follow each other.

To achieve this, the input contains two other special tokens apart from [PAD] (padding is added so that all inputs in a batch have the same length):

  1. [CLS] — This token marks the beginning of the input (a single sentence or a sentence pair). During training, this token learns a representation of the whole sentence and is used for sentence-level tasks like classification and sentiment analysis.
  2. [SEP] — This token is used to separate different sentences when the input is a sentence pair.

Example — Consider “What is BERT?” and “BERT is a model developed by Google.” The formatted input would look something like this:

[CLS] What is BERT? [SEP] BERT is a model developed by Google. [SEP] [PAD] … [PAD]
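To make this concrete, here is a minimal sketch using the Hugging Face transformers library and the bert-base-uncased tokenizer (both chosen only for illustration; any BERT-style tokenizer behaves the same way):

```python
# A minimal sketch, assuming the Hugging Face "transformers" library and the
# bert-base-uncased checkpoint (both chosen only for illustration).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    "What is BERT?",                          # first sentence
    "BERT is a model developed by Google.",   # second sentence of the pair
    padding="max_length",                     # pad to a fixed length with [PAD]
    max_length=24,
    return_tensors="pt",
)

# Convert the ids back to tokens to see the special tokens the tokenizer added.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
# ['[CLS]', 'what', 'is', 'bert', '?', '[SEP]', 'bert', 'is', 'a', 'model', ...,
#  '[SEP]', '[PAD]', '[PAD]', ...]
```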

Training

We first tokenize the data using a tokenizer (WordPiece, in BERT's case) and then create a randomly initialized embedding weight matrix. An embedding matrix maps each token to a vector that represents it, and we learn an optimized representation of each token during training. (A side effect is that similar words, like King and Queen or Good and Better, end up with vector representations that are close to each other.) In BERT-base, each token in the vocabulary is represented by a vector of length 768.
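Conceptually, the embedding matrix is just a learnable lookup table of shape (vocab_size, hidden_size). A rough sketch in PyTorch (the token ids below are illustrative):

```python
# A minimal sketch of the embedding lookup described above: a learnable matrix
# of shape (vocab_size, hidden_size) mapping each token id to a 768-d vector.
# The sizes match bert-base-uncased; the weights here are randomly initialized.
import torch
import torch.nn as nn

vocab_size, hidden_size = 30522, 768
embedding = nn.Embedding(vocab_size, hidden_size)

# Illustrative token ids (in practice these come from the tokenizer).
token_ids = torch.tensor([[101, 2054, 2003, 14324, 102]])
vectors = embedding(token_ids)
print(vectors.shape)   # torch.Size([1, 5, 768]) -> one 768-d vector per token
```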

During pre-training, the model is fed a large corpus of text and asked to predict the randomly masked words. A fully connected layer sits on top of the encoder, followed by a softmax layer that gives a probability distribution over the vocabulary for each masked position. We take the probability assigned to the correct token and compute the loss as its negative log. We then back-propagate the error, refining the weights of the embedding matrix and all the other weights (the attention weights K, Q, and V, and the weights of the linear layers; I haven't gone in-depth here, but there are good blogs that explain the entire transformer architecture). The encoder processes the entire sentence at once, with words masked at random, instead of processing tokens one by one, which makes the computation efficient.
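Here is a hedged sketch of a single masked-language-modelling step with Hugging Face's BertForMaskedLM; the sentence and the checkpoint are illustrative, and a real pre-training run would of course loop over a huge corpus with an optimizer and random masking:

```python
# A hedged sketch of one masked-language-modelling step with Hugging Face's
# BertForMaskedLM. The sentence is illustrative; real pre-training loops over a
# huge corpus with an optimizer and random masking.
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT is a [MASK] developed by Google.", return_tensors="pt")

# Labels hold the original token ids; positions set to -100 are ignored, so the
# cross-entropy (negative log-likelihood) is computed only at the masked position.
labels = tokenizer("BERT is a model developed by Google.",
                   return_tensors="pt")["input_ids"].clone()
labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100

outputs = model(**inputs, labels=labels)
outputs.loss.backward()   # refines the embedding matrix, Q/K/V and linear layers
```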

The entire sentence can be processed at once by the encoder.

Fine-Tuning

A pre-trained BERT model cannot be used zero-shot, since its pre-training objectives (masked-token prediction and next-sentence prediction) are not the tasks we care about in practice. We always need to fine-tune it.

Once the model is pre-trained on a large corpus, we can fine-tune it for our specific task. Before fine-tuning, we remove the pre-training head (the top dense layer) and add a new dense layer on top of the encoder.

There are two ways to fine-tune BERT:

  1. We fine-tune the entire model (BERT + newly added dense layer) and modify the weights according to our task-specific dataset.
  2. We freeze the BERT model weights and only train the newly added dense layer.

While fine-tuning, if the task is a sentence-level task, we only use the output corresponding to the [CLS] token. (Sentence-level tasks are classification tasks like Sentiment Analysis or Spam Detection).

For a token-level task like POS tagging, we instead use the outputs corresponding to all the other tokens, excluding the [CLS] token. We of course need a considerable amount of data specific to the task.
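As a rough sketch of option 2 above (freeze BERT, train only the new head), here is what a sentiment-classification fine-tuning step could look like with BertForSequenceClassification; the example sentences, labels, and hyper-parameters are made up:

```python
# A hedged sketch of option 2 above: freeze BERT and train only the new
# classification head, which reads the [CLS] representation. The two example
# sentences, labels and hyper-parameters are made up.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

for param in model.bert.parameters():   # freeze the pre-trained encoder
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

batch = tokenizer(["I loved this movie", "Terrible service"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])            # 1 = positive, 0 = negative

outputs = model(**batch, labels=labels)  # head sits on top of the [CLS] output
outputs.loss.backward()
optimizer.step()
```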

Inference

Now that we have fine-tuned our model on the specific data, our model is ready for use.

Let's now delve into Decoders and understand what makes decoders useful for Generation Tasks.

Decoders

Decoders are used for tasks that require text generation. The goal of the decoder, as the name suggests, is to decode, i.e., convert vectors and embeddings back into language that makes sense to humans: for example, next-sentence generation or language translation. Let's understand why decoders are good at these jobs by looking at how and what they are trained for.

A decoder can be used in two types of architectures.

  1. A decoder-only architecture like GPT.
GPT Architecture

As you can see, the GPT decoder block has only one masked self-attention module.

2. A decoder used together with an encoder block, for example Google's T5.

Model structure. From: Jay Alammar’s blog

T5 has the standard transformer architecture with N Encoder Blocks and N Decoder Blocks. The Encoder block has 1 self-attention module while the decoder block has 1 self-attention module and 1 encoder-decoder attention module.

Both of these architectures can be applied to a wide range of generation tasks.

Training

The training process for decoders in both these architectures involves:

  1. Causal masking (also called causal language modelling): We mask the future tokens so that the transformer doesn't see them and can't base its predictions on them. This happens in the self-attention module.
  2. Teacher Forcing: To predict the next token, the model takes as input the correct (target) previous token. This makes sure the output is conditioned on the correct previous tokens and prevents compounding errors.
Causal Masking and Teacher forcing

As we can see in the image, at the beginning of training the decoder only gets the [START] token and generates the first word, let's say "Me". Cross-entropy loss is calculated between the predicted word and the actual word in the target, and is then back-propagated to improve the weights. Then "I" from the target sequence is sent as input (teacher forcing). But at each step we actually feed the entire sentence and mask the future words, which lets the decoder process inputs in batches and improves computation (causal masking).

Putting these two concepts together: while training a decoder, we feed it the entire sentence but mask all the future tokens. This allows the whole sentence to be processed in one go, making the computation faster. The decoder only sees the past tokens and predicts the current one based on them. Following the teacher forcing method, those past tokens are the correct target tokens. Once the decoder predicts the current token, we calculate the loss against the actual token, back-propagate it, and update all the weights. As in an encoder model, a decoder also has an embedding matrix, the attention weights, and the weights of the linear projection layers.
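A minimal sketch of one such training step, using GPT-2 as a stand-in decoder (the sentence is illustrative): passing labels equal to the input ids gives teacher forcing, since the library shifts the targets internally, and the attention is already causally masked.

```python
# A hedged sketch of one decoder training step, using GPT-2 as a stand-in
# decoder (the sentence is illustrative). The model masks future tokens
# internally (causal masking); passing labels = input_ids gives teacher
# forcing, because the library shifts the targets by one position for us.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("I love reading blogs on Medium", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])   # teacher forcing
outputs.loss.backward()   # one cross-entropy loss over all positions at once

# The causal mask itself is just a lower-triangular matrix: position i may
# only attend to positions <= i.
seq_len = inputs["input_ids"].shape[1]
print(torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool)))
```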

GPT Architecture

Now let's go in-depth on what happens when we train an encoder-decoder transformer model vs a decoder-only model.

Encoder-Decoder Model

Encoder-Decoder Model.

In an encoder-decoder model, attention is applied in two places:

  1. Masked Multi-Headed Attention — Applies self-attention over the target tokens that are fed to the decoder as inputs after shifting them one position to the right. The future tokens are masked, so each position attends only to the tokens up to and including itself.
  2. Multi-Headed Attention — Also known as encoder-decoder (cross) attention, because it applies attention between the encoder outputs and the decoder states.

This enables the decoder to first understand the relationships between the tokens in its own input (the output we want it to produce), and then relate them to the encoder's embeddings of the input through encoder-decoder cross-attention.

These models are mostly used for language translation tasks. For example, suppose we want to translate an English sentence into French. We feed the English sentence to the encoder, which understands it and produces its embeddings.

The decoder takes the French sentence as input (shifted right). The first attention module models the French sentence itself, and the second attention module learns to map it to the English embeddings. The decoder then outputs the French sentence one token at a time; we calculate the loss and keep improving the model. Here, the decoder has its own embedding matrix that learns the embeddings of the French vocabulary, while the encoder's embedding matrix holds the embeddings of the English vocabulary.

If the task is to generate tokens in the same language, the embedding matrix can be shared between the encoder and the decoder, which saves space.
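Here is a hedged sketch of an encoder-decoder translation step using Hugging Face's t5-small (the checkpoint and the sentence pair are illustrative):

```python
# A hedged sketch of a translation training step with an encoder-decoder model,
# using Hugging Face's t5-small (checkpoint and sentence pair are illustrative).
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Encoder input: the source sentence (T5 expects a task prefix).
inputs = tokenizer("translate English to French: Hello, how are you?",
                   return_tensors="pt")
# Decoder target: the French sentence. The library builds the right-shifted
# decoder input from these labels, i.e. teacher forcing.
labels = tokenizer("Bonjour, comment ça va?", return_tensors="pt")["input_ids"]

outputs = model(**inputs, labels=labels)
outputs.loss.backward()   # loss over the decoder's token predictions

# At inference time the decoder generates one token at a time.
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```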

Inference

If we want to use the model for the same task it has been trained on, we can now use it zero-shot.

If we want to add domain-specific data or use it for a different task, we need to fine-tune the model which is done in a very similar way to an Encoder model.

Now, you might think: GPT is also good at translation, but it is a decoder-only model. How does that work? Let's look into that.

Decoder-Only Model

A decoder-only model like GPT has only one attention layer per block, which applies attention over its own input. The self-attention block uses causal attention, masking the future words so that it can only "see" the present and past words. GPT is trained to generate the next word by looking at the previous words. It has its own embedding matrix, which updates its weights and learns the embeddings as the model improves its accuracy on the generation task.
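A quick sketch of what that looks like in practice with GPT-2 (the prompt is illustrative): the model is given a prefix and keeps predicting the next token.

```python
# A quick, hedged sketch of decoder-only generation with GPT-2: the model
# repeatedly predicts the next token given only the tokens produced so far.
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Encoders are good at understanding text, while decoders",
                   return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(generated[0]))
```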

If we want to use GPT for a language translation task, we need to fine-tune it with a sufficient amount of labeled translation data, for example pairs like:

English: Hello, how are you?
French: Bonjour, comment ça va?

During fine-tuning, the embedding matrix needs to cover the tokens of both the input and the target languages. As we fine-tune the model, the embedding weights, and thus the representations of the French tokens, keep improving along with those of the other layers (the attention modules and the linear projection layers).
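A hedged sketch of what such a fine-tuning step could look like with GPT-2 (the prompt template and the single example pair are illustrative; a real run needs a full dataset, batching, and many optimization steps):

```python
# A hedged sketch of causal-LM fine-tuning on translation pairs formatted as
# "English: ... / French: ...". The template and the single example are
# illustrative; a real run needs a full dataset, batching and many steps.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

example = "English: Hello, how are you?\nFrench: Bonjour, comment ça va?"
batch = tokenizer(example, return_tensors="pt")

# The model learns to continue the "English: ..." prompt with the correct
# "French: ..." line; the embedding matrix, attention weights and projection
# layers all get updated.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
```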

In conclusion, an encoder is good at tasks that need an understanding of the data (it encodes the data), while a decoder is good at decoding, i.e. generating data. It is extremely important to remember that a pre-trained model will work well only on the task it has been pre-trained on, and it is always better to fine-tune your model for a new task.
