Under the Hood: Transformer Architecture

Drew Worden
11 min read · Apr 28, 2024


Introduction

In the realm of natural language processing (NLP) and machine learning, the transformer architecture has emerged as a game-changer, revolutionizing the way we approach sequence-to-sequence tasks such as machine translation, text summarization, and language modeling. Introduced in the seminal paper “Attention is All You Need” by Vaswani et al. (2017), the transformer has rapidly become the go-to architecture for a wide range of NLP applications.

This article aims to provide a comprehensive understanding of the transformer architecture, delving into the mathematical underpinnings, implementing it from scratch using Python, and finally, walking through a real-world project to solidify your understanding.

Part I: The Mathematics Behind the Transformer

Attention Mechanism

At the heart of the transformer lies the attention mechanism, a powerful concept that allows the model to selectively focus on relevant parts of the input sequence when computing the output. Unlike traditional sequence models like recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, which process the input sequentially, the attention mechanism allows the transformer to consider the entire input sequence at once, enabling better capture of long-range dependencies.

The attention mechanism can be formulated as a weighted sum of the input sequence, where the weights are determined by the relevance of each input element to the current output element being computed. Mathematically, the attention score for each input element is calculated using a compatibility function, typically a dot product or a scaled dot product between the query (current output element) and the key (input element).

Given an input sequence X = (x₁, x₂, …, xₙ) and a query vector q, the attention scores are computed as:

e_i = (q · k_i) / √d_k

where k_i is the key vector corresponding to the i-th input element x_i and d_k is the dimensionality of the key vectors.

The attention scores are then normalized using a softmax function to obtain the attention weights:

α_i = exp(e_i) / Σ_j exp(e_j)
The output of the attention mechanism is a weighted sum of the value vectors v_i corresponding to each input element, weighted by their respective attention weights α_i:

output = Σ_i α_i v_i
The attention mechanism allows the transformer to dynamically adjust the focus on different parts of the input sequence, enabling better modeling of long-range dependencies and capturing semantic relationships more effectively.
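To make this concrete, here is a small worked example (the numbers are made up) that computes the attention weights for a single query over three keys and takes the corresponding weighted sum of the values:

import torch

q = torch.tensor([1.0, 0.0])               # query vector
K = torch.tensor([[1.0, 0.0],              # three key vectors
                  [0.0, 1.0],
                  [1.0, 1.0]])
V = torch.tensor([[10.0, 0.0],             # corresponding value vectors
                  [0.0, 10.0],
                  [5.0, 5.0]])

scores = K @ q / (2 ** 0.5)                # scaled dot products e_i
weights = torch.softmax(scores, dim=0)     # attention weights α_i
output = weights @ V                       # weighted sum of the values

print(weights)  # keys most similar to the query receive the largest weights
print(output)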

Multi-Head Attention

While the basic attention mechanism is powerful, the transformer employs a more sophisticated variant called multi-head attention. In multi-head attention, the input sequence is projected into multiple subspaces, and attention is computed in parallel for each subspace. This allows the model to attend to different aspects of the input sequence simultaneously, capturing diverse features and relationships.

Mathematically, multi-head attention is computed as follows:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O

where Q, K, and V are the query, key, and value matrices, respectively, and h is the number of heads. Each head_i is computed as:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q, W_i^K, and W_i^V are learnable projection matrices for the i-th head.

The outputs from all heads are concatenated and linearly transformed using the weight matrix W^O to produce the final multi-head attention output.

Multi-head attention allows the model to capture different types of relationships and dependencies within the input sequence, enhancing the overall representational power of the transformer.
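For example (with hypothetical sizes), a model with d_model = 512 and h = 8 heads projects each token into eight 64-dimensional subspaces, computes attention independently in each, and concatenates the results back to 512 dimensions. The reshaping looks like this:

import torch

batch, seq_len, d_model, h = 2, 10, 512, 8
head_dim = d_model // h                     # 64 dimensions per head

x = torch.randn(batch, seq_len, d_model)
# Split the model dimension into h heads: (batch, h, seq_len, head_dim)
heads = x.view(batch, seq_len, h, head_dim).transpose(1, 2)
print(heads.shape)                          # torch.Size([2, 8, 10, 64])

# After per-head attention, the heads are concatenated back together
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(merged.shape)                         # torch.Size([2, 10, 512])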

Positional Encoding

Since the transformer does not process the input sequence sequentially like RNNs, it lacks an inherent notion of positional information. To address this, the transformer incorporates positional encoding, which adds a positional signal to the input embeddings, enabling the model to capture the order and position of elements in the sequence.

The positional encoding is typically computed using sine and cosine functions of different frequencies, which produce a unique pattern for each position and have desirable properties: relative offsets between positions can be expressed as simple transformations of the encodings, and the scheme extends to sequence lengths not seen during training.

Mathematically, the positional encoding for position pos and dimension pair j is computed as:

PE(pos, 2j) = sin(pos / 10000^(2j / d_model))
PE(pos, 2j+1) = cos(pos / 10000^(2j / d_model))

where pos is the position, d_model is the dimensionality of the input embeddings, and j ranges from 0 to d_model/2 − 1.

The positional encoding vectors are added to the input embeddings, enabling the transformer to capture positional information and model sequential relationships effectively.

Feed-Forward Network

In addition to the attention mechanism, the transformer employs a feed-forward network (FFN) as part of its architecture. The FFN is applied to each position in the sequence, independently and identically, allowing the model to capture non-linear transformations and interactions within each position.

The FFN typically consists of two linear layers with a rectified linear unit (ReLU) activation function in between:

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2

where W_1, b_1, W_2, and b_2 are learnable parameters.

The FFN introduces non-linearity and allows the transformer to model complex relationships within each position, enhancing its representational power and ability to handle diverse input sequences.

Encoder and Decoder Stacks

The transformer architecture is composed of two main components: the encoder and the decoder. The encoder processes the input sequence and generates a sequence of encoded representations, while the decoder generates the output sequence based on the encoded representations and the previous outputs.

Both the encoder and decoder are built using stacks of identical layers, where each layer consists of a multi-head attention sublayer and a feed-forward sublayer. In the encoder, the multi-head attention sublayer performs self-attention, attending to different positions within the same input sequence. In the decoder, there are two multi-head attention sublayers: one for self-attention on the previously generated outputs, and another for attending to the encoded representations from the encoder (known as encoder-decoder attention).

The stacked layers in the encoder and decoder allow the transformer to capture hierarchical representations and refine the understanding of the input and output sequences iteratively.

Part II: Implementing the Transformer from Scratch in Python

Now that we have explored the mathematical foundations of the transformer, let’s dive into implementing it from scratch using Python. We will build a simplified version of the transformer for the task of machine translation.

Importing Required Modules

import math

import numpy as np
import torch
import torch.nn as nn

Positional Encoding

We’ll start by implementing the positional encoding function, which adds positional information to the input embeddings.

def positional_encoding(max_len, d_model):
    """Return a (1, max_len, d_model) tensor of sinusoidal position encodings."""
    pe = np.zeros((max_len, d_model))
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            pe[pos, i] = math.sin(pos / (10000 ** (i / d_model)))
            pe[pos, i + 1] = math.cos(pos / (10000 ** (i / d_model)))
    # Add a leading batch dimension so the encoding broadcasts over a batch of embeddings
    return torch.from_numpy(pe).float().unsqueeze(0)
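As a quick check (the sizes here are purely illustrative), the encoding for a 50-position sequence with d_model = 128 carries a leading batch dimension so it can be added directly to a batch of embeddings:

pe = positional_encoding(50, 128)
print(pe.shape)  # torch.Size([1, 50, 128])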

Scaled Dot-Product Attention

Next, we’ll implement the scaled dot-product attention mechanism, which is the core component of the transformer’s attention mechanism.

def scaled_dot_product_attention(q, k, v, mask=None):
    # Compute the attention scores, scaled by the square root of the key dimension
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(k.shape[-1])

    # Apply masking (optional): blocked positions receive a large negative score
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)

    # Normalize the scores into attention weights
    attn_weights = torch.softmax(scores, dim=-1)

    # Compute the attention output as a weighted sum of the values
    output = torch.matmul(attn_weights, v)

    return output, attn_weights
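A minimal usage sketch with random tensors (the sizes are arbitrary) shows the expected output shapes:

q = torch.randn(2, 5, 64)  # (batch, query length, d_k)
k = torch.randn(2, 7, 64)  # (batch, key length, d_k)
v = torch.randn(2, 7, 64)  # (batch, key length, d_v)

out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape, weights.shape)  # torch.Size([2, 5, 64]) torch.Size([2, 5, 7])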

Multi-Head Attention

Building upon the scaled dot-product attention, we’ll implement the multi-head attention mechanism.

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads

        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.out_linear = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        # Perform linear projections and split into heads:
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, head_dim)
        q = self.q_linear(q).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_linear(k).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_linear(v).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)

        # Compute scaled dot-product attention per head
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        attn_weights = torch.softmax(scores, dim=-1)

        # Concatenate the heads back into a (batch, seq_len, d_model) tensor
        output = torch.matmul(attn_weights, v).transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)

        # Apply output linear transformation
        output = self.out_linear(output)

        return output, attn_weights
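A quick shape check (again with arbitrary sizes): self-attention over a batch of two 8-token sequences returns one attention map per head.

mha = MultiHeadAttention(d_model=128, num_heads=8)
x = torch.randn(2, 8, 128)

out, weights = mha(x, x, x)
print(out.shape, weights.shape)  # torch.Size([2, 8, 128]) torch.Size([2, 8, 8, 8])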

Encoder Layer

The encoder layer consists of a multi-head attention sublayer and a feed-forward sublayer, with residual connections and layer normalization applied after each sublayer.

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super(EncoderLayer, self).__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dff),
            nn.ReLU(),
            nn.Linear(dff, d_model)
        )
        self.layernorm1 = nn.LayerNorm(d_model)
        self.layernorm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout_rate)
        self.dropout2 = nn.Dropout(dropout_rate)

    def forward(self, x, mask=None):
        attn_output, _ = self.mha(x, x, x, mask)
        attn_output = self.dropout1(attn_output)
        out1 = self.layernorm1(x + attn_output)

        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)
        out2 = self.layernorm2(out1 + ffn_output)

        return out2

Decoder Layer

The decoder layer is similar to the encoder layer but includes an additional multi-head attention sublayer for attending to the encoded representations from the encoder (encoder-decoder attention).

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super(DecoderLayer, self).__init__()
        self.mha1 = MultiHeadAttention(d_model, num_heads)
        self.mha2 = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dff),
            nn.ReLU(),
            nn.Linear(dff, d_model)
        )
        self.layernorm1 = nn.LayerNorm(d_model)
        self.layernorm2 = nn.LayerNorm(d_model)
        self.layernorm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout_rate)
        self.dropout2 = nn.Dropout(dropout_rate)
        self.dropout3 = nn.Dropout(dropout_rate)

    def forward(self, x, enc_output, look_ahead_mask=None, padding_mask=None):
        # Masked self-attention over the previously generated outputs
        attn1, attn_weights_1 = self.mha1(x, x, x, look_ahead_mask)
        attn1 = self.dropout1(attn1)
        out1 = self.layernorm1(attn1 + x)

        # Encoder-decoder attention: queries come from the decoder,
        # keys and values come from the encoder output
        attn2, attn_weights_2 = self.mha2(out1, enc_output, enc_output, padding_mask)
        attn2 = self.dropout2(attn2)
        out2 = self.layernorm2(attn2 + out1)

        ffn_output = self.ffn(out2)
        ffn_output = self.dropout3(ffn_output)
        out3 = self.layernorm3(ffn_output + out2)

        return out3, attn_weights_1, attn_weights_2

Encoder

The encoder consists of a stack of identical encoder layers, with an embedding layer and positional encoding applied to the input sequence.

class Encoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
                 maximum_position_encoding, dropout_rate=0.1):
        super(Encoder, self).__init__()
        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = nn.Embedding(input_vocab_size, d_model)
        # Register as a buffer so the encoding moves to the right device with the model
        self.register_buffer('pos_encoding',
                             positional_encoding(maximum_position_encoding, d_model))
        self.dropout = nn.Dropout(dropout_rate)
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, dff, dropout_rate)
                                     for _ in range(num_layers)])

    def forward(self, x, mask=None):
        seq_len = x.shape[1]
        x = self.embedding(x) * math.sqrt(self.d_model)
        x = x + self.pos_encoding[:, :seq_len, :]
        x = self.dropout(x)

        for layer in self.layers:
            x = layer(x, mask)

        return x

Decoder

The decoder also consists of a stack of identical decoder layers, with an embedding layer and positional encoding applied to its input sequence. Additionally, it incorporates a final linear layer that projects each position onto the target vocabulary; the softmax over that projection is applied later, by the loss function during training or during decoding at inference time.

class Decoder(nn.Module):
    def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size,
                 maximum_position_encoding, dropout_rate=0.1):
        super(Decoder, self).__init__()
        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = nn.Embedding(target_vocab_size, d_model)
        self.register_buffer('pos_encoding',
                             positional_encoding(maximum_position_encoding, d_model))
        self.dropout = nn.Dropout(dropout_rate)
        self.layers = nn.ModuleList([DecoderLayer(d_model, num_heads, dff, dropout_rate)
                                     for _ in range(num_layers)])
        self.fc = nn.Linear(d_model, target_vocab_size)

    def forward(self, x, enc_output, look_ahead_mask=None, padding_mask=None):
        seq_len = x.shape[1]
        x = self.embedding(x) * math.sqrt(self.d_model)
        x = x + self.pos_encoding[:, :seq_len, :]
        x = self.dropout(x)

        attn_weights = []
        for layer in self.layers:
            x, attn_weight_1, attn_weight_2 = layer(x, enc_output, look_ahead_mask, padding_mask)
            attn_weights.append(attn_weight_1)
            attn_weights.append(attn_weight_2)

        # Project to the target vocabulary (the softmax is applied by the loss or during decoding)
        output = self.fc(x)

        return output, attn_weights

Transformer

Finally, we combine the encoder and decoder to create the complete transformer model.

class Transformer(nn.Module):
    def __init__(self, num_encoder_layers, num_decoder_layers, d_model, num_heads,
                 dff, input_vocab_size, target_vocab_size, max_len_input,
                 max_len_output, dropout_rate=0.1):
        super(Transformer, self).__init__()
        self.encoder = Encoder(num_encoder_layers, d_model, num_heads, dff,
                               input_vocab_size, max_len_input, dropout_rate)
        self.decoder = Decoder(num_decoder_layers, d_model, num_heads, dff,
                               target_vocab_size, max_len_output, dropout_rate)

    def forward(self, inp, tar, enc_padding_mask, look_ahead_mask, dec_padding_mask):
        enc_output = self.encoder(inp, enc_padding_mask)
        dec_output, attn_weights = self.decoder(tar, enc_output, look_ahead_mask, dec_padding_mask)

        return dec_output, attn_weights
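As a quick sanity check before moving on (the hyperparameters here are arbitrary), we can run a forward pass on random token IDs without any masks:

model = Transformer(num_encoder_layers=2, num_decoder_layers=2, d_model=64,
                    num_heads=4, dff=128, input_vocab_size=1000,
                    target_vocab_size=1000, max_len_input=40,
                    max_len_output=40)

inp = torch.randint(0, 1000, (2, 12))   # (batch, source length)
tar = torch.randint(0, 1000, (2, 9))    # (batch, target length)
logits, _ = model(inp, tar, None, None, None)
print(logits.shape)                     # torch.Size([2, 9, 1000])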

This completes the implementation of the transformer architecture. In the next section, we’ll explore how to use this implementation in a real-world project.

Part III: Using the Transformer in a Real-World Project

Data Preparation

We’ll start by loading and preparing the English-German Multi30k translation dataset using torchtext’s (legacy) Field and BucketIterator utilities.

import torch
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

# Define the text preprocessing pipeline for source (English) and target (German)
src = Field(tokenize='spacy', tokenizer_language='en_core_web_sm',
            init_token='<sos>', eos_token='<eos>', lower=True, batch_first=True)
trg = Field(tokenize='spacy', tokenizer_language='de_core_news_sm',
            init_token='<sos>', eos_token='<eos>', lower=True, batch_first=True)

# Load the English-German Multi30k translation dataset
train_data, valid_data, test_data = Multi30k.splits(exts=('.en', '.de'),
                                                    fields=(src, trg))

# Build vocabularies from the training split
src.build_vocab(train_data, max_size=10000)
trg.build_vocab(train_data, max_size=10000)

# Create iterator objects
train_iter, val_iter = BucketIterator.splits((train_data, valid_data),
                                             batch_sizes=(64, 64),
                                             device='cuda',
                                             sort_key=lambda x: len(x.src),
                                             sort_within_batch=True)
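A quick look at one batch confirms the layout we expect (batch_first=True puts the batch dimension first); the exact sequence lengths vary from batch to batch:

batch = next(iter(train_iter))
print(batch.src.shape, batch.trg.shape)  # e.g. torch.Size([64, 24]) torch.Size([64, 26])
print(len(src.vocab), len(trg.vocab))    # vocabulary sizes used below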

Model Configuration

We’ll define the configuration parameters for our transformer model.

num_encoder_layers = 4
num_decoder_layers = 4
d_model = 128
num_heads = 8
dff = 512
input_vocab_size = len(src.vocab)
target_vocab_size = len(trg.vocab)
max_len_input = 100   # must cover the longest source sequence (sets the positional encoding length)
max_len_output = 100  # must cover the longest target sequence
dropout_rate = 0.1

Model Instantiation and Training

With our data prepared and model configuration defined, we can instantiate our transformer model and train it on the English-German translation task.

transformer = Transformer(num_encoder_layers, num_decoder_layers, d_model,
                          num_heads, dff, input_vocab_size, target_vocab_size,
                          max_len_input, max_len_output, dropout_rate)
transformer.cuda()

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss(ignore_index=trg.vocab.stoi['<pad>'])
optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
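The training loop below relies on two masking helpers that the transformer expects but that we have not defined yet. Here is one possible implementation (a minimal sketch): the mask shapes are chosen to broadcast against the (batch, heads, query length, key length) attention scores used in MultiHeadAttention above, positions equal to 0 are blocked, and for brevity the look-ahead mask is not combined with a target padding mask.

def create_padding_mask(seq, pad_idx):
    # 1 where there is a real token, 0 where there is padding
    # Shape (batch, 1, 1, seq_len) so it broadcasts over heads and query positions
    return (seq != pad_idx).unsqueeze(1).unsqueeze(2)

def create_look_ahead_mask(size):
    # Lower-triangular matrix: position i may only attend to positions <= i
    # Shape (1, 1, size, size); built on the GPU to match the rest of this example
    return torch.tril(torch.ones(size, size, device='cuda')).unsqueeze(0).unsqueeze(0)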

# Training loop
num_epochs = 10  # illustrative; adjust for your setup
src_pad_idx = src.vocab.stoi['<pad>']

transformer.train()
for epoch in range(num_epochs):
    for batch in train_iter:
        inp, tar = batch.src, batch.trg

        # The decoder receives the target shifted right (tar[:, :-1]),
        # so the look-ahead mask is built for that shortened sequence
        enc_padding_mask = create_padding_mask(inp, src_pad_idx)
        look_ahead_mask = create_look_ahead_mask(tar.shape[1] - 1)
        dec_padding_mask = create_padding_mask(inp, src_pad_idx)  # masks padded encoder positions

        predictions, _ = transformer(inp, tar[:, :-1], enc_padding_mask,
                                     look_ahead_mask, dec_padding_mask)
        loss = criterion(predictions.permute(0, 2, 1), tar[:, 1:])

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Inference and Evaluation

After training our transformer model, we can use it for inference and evaluate its performance on the validation set.

# Inference loop
transformer.eval()
with torch.no_grad():
    for batch in val_iter:
        inp, tar = batch.src, batch.trg

        enc_padding_mask = create_padding_mask(inp, src_pad_idx)
        look_ahead_mask = create_look_ahead_mask(tar.shape[1] - 1)
        dec_padding_mask = create_padding_mask(inp, src_pad_idx)

        predictions, _ = transformer(inp, tar[:, :-1], enc_padding_mask,
                                     look_ahead_mask, dec_padding_mask)

        # Compute evaluation metrics (e.g., BLEU score)
        ...
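Note that the loop above scores the validation set with teacher forcing: the gold target is fed to the decoder. To translate an unseen sentence, the decoder has to generate tokens one at a time. Below is a minimal greedy-decoding sketch, assuming the fields, pad index, and mask helpers defined earlier; for brevity it does not stop early at <eos>.

def greedy_translate(model, src_tensor, max_len=40):
    model.eval()
    with torch.no_grad():
        enc_mask = create_padding_mask(src_tensor, src_pad_idx)
        enc_output = model.encoder(src_tensor, enc_mask)

        # Start with <sos> and append the most likely next token at each step
        ys = torch.full((src_tensor.size(0), 1), trg.vocab.stoi['<sos>'],
                        dtype=torch.long, device=src_tensor.device)
        for _ in range(max_len - 1):
            look_ahead = create_look_ahead_mask(ys.size(1))
            out, _ = model.decoder(ys, enc_output, look_ahead, enc_mask)
            next_token = out[:, -1, :].argmax(dim=-1, keepdim=True)
            ys = torch.cat([ys, next_token], dim=1)
        return ys

In practice you would also stop once every sequence has produced <eos> and map the resulting token IDs back to words with trg.vocab.itos before computing metrics such as BLEU.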

Model Deployment and Integration

After training and evaluating our transformer model, we can deploy it in a production environment or integrate it into larger applications and pipelines using PyTorch’s built-in export options, such as TorchScript or ONNX export, or serving tools like TorchServe.

Conclusion

The impact of the transformer architecture has been far-reaching, revolutionizing the field of natural language processing and serving as the foundation for numerous state-of-the-art language models and applications, including the groundbreaking work from OpenAI.

One of the most notable examples of transformer-based models is OpenAI’s GPT (Generative Pre-trained Transformer) series, which has been instrumental in advancing the field of large language models. The GPT architecture, built upon the transformer, has enabled the development of powerful language models capable of understanding and generating human-like text across a wide range of domains and tasks.

The success of GPT has paved the way for subsequent iterations, such as GPT-2, GPT-3, and GPT-4, each pushing the boundaries of what is possible with transformer-based language models. GPT-3, in particular, has garnered significant attention for its impressive performance on various natural language tasks, including question answering, text generation, and language translation.

Beyond OpenAI, the transformer architecture has been widely adopted and adapted by numerous organizations and researchers, leading to the development of various other powerful language models and tools. For instance, Hugging Face’s Transformers library has democratized access to pre-trained transformer models, enabling researchers and developers to leverage these cutting-edge models for a wide range of NLP tasks.

As the field of machine learning continues to evolve, the transformer architecture will undoubtedly play an increasingly pivotal role, enabling researchers and practitioners to tackle ever more complex problems and push the boundaries of what is possible with sequential data processing and language understanding.
