In recent years, attention mechanisms have transformed the landscape of AI by powering innovations in natural language processing, computer vision, and audio applications such as text-to-speech. From OpenAI’s ChatGPT models to Google’s BERT, attention has become the backbone of some of the most powerful models shaping today’s AI-driven world.
The concept of attention was first introduced by Bahdanau, Cho, and Bengio in their 2014 paper, "Neural Machine Translation by Jointly Learning to Align and Translate." Their work enhanced recurrent neural networks (RNNs) by dynamically focusing, or "attending," on relevant parts of the input, addressing the limitations of fixed-length context encodings in tasks like machine translation.
Building on the work of the "Bahdanau Attention" paper, Vaswani et al. introduced the transformer architecture in their 2017 paper, "Attention is All You Need." Their work extended the attention mechanism to self-attention and multi-head attention while eliminating reliance on recurrent architectures. While Bahdanau et al. introduced the foundational idea of attention, Vaswani et al.’s work on the transformer architecture popularized and generalized it for broader use in modern AI systems.
At its core, attention gives the model the ability to determine which parts of the input are most relevant when processing data. Unlike traditional approaches that treat all inputs equally or rely on rigid fixed contexts, attention assigns weights to different inputs, creating a richer representation of the data and enabling a more nuanced understanding of relationships. This allows AI models to grasp context, manage long-range dependencies, and make more informed decisions.
What makes attention so remarkable is its straightforward and efficient mathematical foundation. By relying on basic operations like dot products, normalization, and weighted averages, it achieves both computational efficiency and interpretability. Yet these simple calculations enable the sophisticated behavior seen in models like transformers.
In this post, I’ll dig into the math behind attention, exploring how queries, keys, and values come together to generate these context-aware representations.
Self-Attention Computation
The self-attention mechanism in a transformer model can be expressed in matrix form as follows:
Inputs
Query matrix Q: Derived from the linear transformation of input embeddings.
Key matrix K: Derived from the linear transformation of input embeddings.
Value matrix V: Derived from the linear transformation of input embeddings.
These matrices are computed as linear transformations of the input X:
\(\mathbf{Q = \beta_{q}\mathbf{1}^{T} + \Omega_{q}X}\)
\(\mathbf{K = \beta_{k}\mathbf{1}^{T} + \Omega_{k}X}\)
\(\mathbf{V = \beta_{v}\mathbf{1}^{T} + \Omega_{v}X}\)
where:
Input matrix X: A D×N matrix whose N columns are input embedding vectors created from text, image, video, or audio data
Weights matrices Ω (Ωv, Ωq, Ωk): Matrices containing the weights for the linear transformations of X
Bias matrices β (βv, βq, βk): Matrices containing the biases for the linear transformations of X
Ones vector 1: An N×1 vector of ones used to broadcast each bias across all N inputs
For each of the matrices V, Q, and K, the same weights Ω and biases β are applied to every input vector (column) of X.
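To make the shapes concrete, here is a minimal NumPy sketch of these linear transformations (the dimensions and variable names below are illustrative, not taken from any particular model):

```python
import numpy as np

D, N = 8, 5                          # embedding dimension, number of input vectors
rng = np.random.default_rng(0)

X = rng.normal(size=(D, N))          # input matrix: one column per input embedding

# One set of weights and biases per transformation (queries, keys, values)
omega_q, omega_k, omega_v = (rng.normal(size=(D, D)) for _ in range(3))
beta_q, beta_k, beta_v = (rng.normal(size=(D, 1)) for _ in range(3))

ones = np.ones((N, 1))               # N x 1 ones vector broadcasts each bias to all N columns

Q = beta_q @ ones.T + omega_q @ X    # queries, shape (D, N)
K = beta_k @ ones.T + omega_k @ X    # keys,    shape (D, N)
V = beta_v @ ones.T + omega_v @ X    # values,  shape (D, N)
```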
Scaled Self-Attention Computation
Self-attention is calculated using the following steps:
Similarity Scores: Compute the dot product between K and Q:
\(\mathbf{K^{T}Q}\)
Scale the Scores: Scale the dot products to stabilize gradients by dividing by the square root of the dimension of the queries and keys:
\(\mathbf{\frac{K^{T}Q}{\sqrt{D_{q}}}}\)
Apply Softmax: Normalize the scaled scores to probabilities using the softmax function (applied column-wise so that each column sums to one):
\(\mathbf{Softmax\Bigg[\frac{K^{T}Q}{\sqrt{D_{q}}}\Bigg]}\)
Compute the Weighted Sum: Use the attention weights to compute the weighted sum of the values V:
\(\mathbf{Sa[X] = V \cdot Softmax\Bigg[\frac{K^{T}Q}{\sqrt{D_{q}}}\Bigg]}\)
For multi-head attention, the self-attention mechanism is applied independently across H different heads, each with its own queries, keys, and values:
\(\mathbf{Q_{h} = \beta_{qh}\mathbf{1}^{T} + \Omega_{qh}X}\)
\(\mathbf{K_{h} = \beta_{kh}\mathbf{1}^{T} + \Omega_{kh}X}\)
\(\mathbf{V_{h} = \beta_{vh}\mathbf{1}^{T} + \Omega_{vh}X}\)
For each head, the dot products are scaled, softmax is applied, and the weighted sum of Vh is calculated:
\(\mathbf{Sa_{h}[X] = V_{h} \cdot Softmax\Bigg[\frac{K_{h}^{T}Q_{h}}{\sqrt{D_{q}}}\Bigg]}\)
To produce the final output, the outputs from all heads are concatenated and linearly transformed with Ωc:
\(\mathbf{MhSa[X] = \Omega_{c}\Big[Sa_{1}[X]^{T}, Sa_{2}[X]^{T}, \ldots, Sa_{H}[X]^{T}\Big]^{T}}\)
To see how the math works, let’s start by walking through a couple of simple examples and then see how a pretrained model’s attention mechanism attends to the various input tokens.
To start, let’s take a simple sentence and encode it using one-hot encoding.
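Here is a sketch of that encoding in NumPy. The sentence is an illustrative example built around the tokens we will look at later ("otter", "river", "swam", "bank"), and each token becomes a one-hot column of X:

```python
import numpy as np

sentence = "the otter swam across the river to the other bank"
tokens = sentence.split()

# Build a small vocabulary and map each token to an index
vocab = sorted(set(tokens))
token_to_id = {tok: i for i, tok in enumerate(vocab)}

D, N = len(vocab), len(tokens)       # embedding dimension, sequence length
X = np.zeros((D, N))
for n, tok in enumerate(tokens):
    X[token_to_id[tok], n] = 1.0     # one-hot encode token n as column n

print(X.shape)                       # (D, N) = (8, 10)
```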
Next, let’s create a function to calculate the scaled self-attention of the encoded sentence.
Using the methods described above, the query, key, and value matrices are generated from the input X and their respective weights and biases. The dot product is scaled by the square root of the query dimension to help prevent excessively large values in the attention scores, which can lead to issues with numerical stability and gradient computation. Scaling the dot product helps improve the convergence and stability of the model during training.
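A minimal sketch of such a function in NumPy (the function name, argument order, and the decision to also return the attention weights are my own choices):

```python
def softmax_columns(scores):
    """Column-wise softmax with max subtraction for numerical stability."""
    scores = scores - scores.max(axis=0, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=0, keepdims=True)


def scaled_self_attention(X, omega_q, omega_k, omega_v, beta_q, beta_k, beta_v):
    """Scaled dot-product self-attention for a D x N input matrix X."""
    N = X.shape[1]
    ones = np.ones((N, 1))

    # Linear transformations of the input
    Q = beta_q @ ones.T + omega_q @ X
    K = beta_k @ ones.T + omega_k @ X
    V = beta_v @ ones.T + omega_v @ X

    # Scaled similarity scores and column-wise softmax
    d_q = Q.shape[0]
    attn_weights = softmax_columns(K.T @ Q / np.sqrt(d_q))   # (N, N)

    # Weighted sum of the values
    return V @ attn_weights, attn_weights                    # (D, N), (N, N)
```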
To see how this works, we can randomly generate the weights (Ω and β) then pass them and the encoded sentence into our function to investigate the shapes of the output and the attention weights. Hint: The input and output shapes should be the same.
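A sketch of that experiment, reusing the one-hot X and the function above:

```python
rng = np.random.default_rng(42)
D = X.shape[0]

# Randomly initialize the weights and biases
omega_q, omega_k, omega_v = (rng.normal(size=(D, D)) for _ in range(3))
beta_q, beta_k, beta_v = (rng.normal(size=(D, 1)) for _ in range(3))

output, attn_weights = scaled_self_attention(
    X, omega_q, omega_k, omega_v, beta_q, beta_k, beta_v
)

print("X:                 ", X.shape)             # (8, 10)
print("attention output:  ", output.shape)        # (8, 10) -- same as X
print("attention weights: ", attn_weights.shape)  # (10, 10)
```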
Looking at the shapes of our encoded sentence X and the attention output, we can see that both are the same, while the shape of the attention weights aligns with the description above.
Note: if we were to visualize the attention weights, we would be able to see which tokens were being attended to. Since we are using random initializations in this example, our attention weights are essentially random. We will see later that if we use a pretrained BERT model, we can see which tokens are receiving the most attention.
Next, let’s take this example and extend it to multi-headed attention.
The key difference between the single-headed attention mechanism and the multi-headed one is the inclusion of… multiple heads! With multiple heads, we simply concatenate the attention outputs from the various heads and then apply a final linear transformation with the weight matrix Ωc.
To see this in action, we will again randomly generate the weights (Ω and β) as well as a final weight matrix Ωc for the concatenated attention head outputs. We will also use D/H for the internal dimensions, as it allows for an efficient implementation.
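Here is a sketch of a multi-headed version, reusing softmax_columns from above and assuming the embedding dimension D is divisible by the number of heads H (the list-of-matrices interface is just one convenient way to pass the per-head weights):

```python
def multihead_self_attention(X, omega_qs, omega_ks, omega_vs,
                             beta_qs, beta_ks, beta_vs, omega_c):
    """Multi-headed self-attention for a D x N input matrix X.

    Each omega_*s / beta_*s argument is a list with one (D/H x D) weight
    matrix or (D/H x 1) bias vector per head; omega_c is the D x D matrix
    applied to the concatenated head outputs.
    """
    N = X.shape[1]
    ones = np.ones((N, 1))
    head_outputs = []

    for omega_q, omega_k, omega_v, beta_q, beta_k, beta_v in zip(
            omega_qs, omega_ks, omega_vs, beta_qs, beta_ks, beta_vs):
        Q = beta_q @ ones.T + omega_q @ X            # (D/H, N)
        K = beta_k @ ones.T + omega_k @ X
        V = beta_v @ ones.T + omega_v @ X

        d_q = Q.shape[0]
        attn = softmax_columns(K.T @ Q / np.sqrt(d_q))
        head_outputs.append(V @ attn)                # (D/H, N)

    # Concatenate the head outputs along the embedding dimension,
    # then apply the final linear transformation omega_c
    return omega_c @ np.concatenate(head_outputs, axis=0)   # (D, N)
```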
When we pass the encoded sentence as well as the weights and number of heads into the function, we can see that the shape of the input X is the same as the shape of the multi-headed attention output.
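A sketch of that shape check, again with randomly generated weights (here H = 2, so each head uses an internal dimension of D/H = 4 for our 8-dimensional one-hot encoding):

```python
H = 2
D = X.shape[0]
D_head = D // H                      # internal dimension D/H per head
rng = np.random.default_rng(7)

omega_qs = [rng.normal(size=(D_head, D)) for _ in range(H)]
omega_ks = [rng.normal(size=(D_head, D)) for _ in range(H)]
omega_vs = [rng.normal(size=(D_head, D)) for _ in range(H)]
beta_qs = [rng.normal(size=(D_head, 1)) for _ in range(H)]
beta_ks = [rng.normal(size=(D_head, 1)) for _ in range(H)]
beta_vs = [rng.normal(size=(D_head, 1)) for _ in range(H)]
omega_c = rng.normal(size=(D, D))    # final transformation of the concatenated heads

mh_output = multihead_self_attention(X, omega_qs, omega_ks, omega_vs,
                                     beta_qs, beta_ks, beta_vs, omega_c)

print("X:                    ", X.shape)          # (8, 10)
print("multi-head attention: ", mh_output.shape)  # (8, 10) -- same as X
```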
All of this is great because it shows that the math works but if we were to visualize the attention weights, they would essentially be random. To help us visualize the tokens that the attention heads are focused on, we can use a pretrained BERT model to encode our sentence.
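One way to do this is sketched below using the Hugging Face transformers library; the bert-base-uncased checkpoint is a standard choice, and averaging the heads of the last layer is just one of several reasonable ways to summarize the attention weights:

```python
import torch
import matplotlib.pyplot as plt
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

sentence = "the otter swam across the river to the other bank"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
avg_attention = outputs.attentions[-1][0].mean(dim=0).numpy()  # last layer, averaged over heads

# Heatmap of the attention weights between tokens
plt.imshow(avg_attention, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.colorbar()
plt.tight_layout()
plt.show()
```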
Grabbing the attention weights from a pretrained BERT model, we can see that it’s focusing on, or “attending” to, tokens such as “otter”, “river”, “swam”, and “bank”, with the largest weights between “otter”/“swam” and “river”/“swam”. This makes sense and shows the power of the attention mechanism in a transformer model.
Conclusion
The attention mechanism has completely changed how models handle sequences by letting them focus on the most relevant parts of the input. While the math behind it—dot products, scaling, and softmax—is surprisingly simple, it unlocks powerful ways to capture relationships between tokens. Visualizing attention weights, as we did here, gives us a glimpse into what the model "cares about" and how it understands context. From its early days in the Bahdanau et al. paper to its role as the foundation of the Transformer, to say that attention is a game-changer would be an understatement.