
Transformer Roadmap

The problem this chapter solves is:

The repository name points toward Transformers, but the current code is a foundation course. This chapter explains exactly how the current objects and morphisms point toward a future attention-based model.

The current code is not a full Transformer.

It teaches the typed pieces you need first:

tokens
vectors
logits
probabilities
loss
training updates
composition

This distinction matters. A roadmap should not pretend the current crate already implements attention, residual blocks, or a full sequence model. It should show how the current typed skeleton can grow without losing the discipline that made the small examples understandable.

Reader orientation: Read this chapter as an engineering migration plan, not as a promise that the current code already contains every Transformer component.

What You Already Know

If you understand the current prediction path, you already know the skeleton a Transformer will extend. Tokens become vectors, vectors move through typed transformations, and probabilities feed a loss. The future work is to replace the one-token middle with sequence-aware structure.

What Exists Now

The current model has this prediction path:

TokenId -> Vector -> Logits -> Distribution

Rust Syntax

The path is implemented with:

Embedding
LinearToLogits
Softmax
Compose

The main domain objects are:

TokenId
Vector
Logits
Distribution
Parameters

The training update is:

TrainStep : Parameters -> Parameters
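The shape of that path can be sketched in a few lines. The real crate uses named morphism types (Embedding, LinearToLogits, Softmax, Compose); here plain functions and a generic `compose` helper stand in for them, so every signature and the toy arithmetic are illustrative, not the crate's API:

```rust
// Illustrative newtypes standing in for the book's domain objects.
#[derive(Debug, Clone, Copy, PartialEq)]
struct TokenId(usize);
#[derive(Debug, Clone, PartialEq)]
struct Vector(Vec<f64>);
#[derive(Debug, Clone, PartialEq)]
struct Logits(Vec<f64>);

// Compose two arrows whose middle type matches; mismatched types fail to compile.
fn compose<A, B, C>(f: impl Fn(A) -> B, g: impl Fn(B) -> C) -> impl Fn(A) -> C {
    move |a| g(f(a))
}

// Toy embedding: a token id becomes a small dense vector.
fn embed(t: TokenId) -> Vector {
    Vector(vec![t.0 as f64, 1.0])
}

// Toy linear map into vocabulary scores.
fn to_logits(v: Vector) -> Logits {
    Logits(v.0.iter().map(|x| 2.0 * x).collect())
}

fn main() {
    let path = compose(embed, to_logits);
    assert_eq!(path(TokenId(3)), Logits(vec![6.0, 2.0]));
}
```

The point is the shape, not the arithmetic: each intermediate object is a named type, and only type-matching arrows compose.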

ML Concept

This is a tiny next-token model.

It predicts from one token at a time.

It does not yet model attention across a sequence.

Still, it already teaches the core path:

discrete token
  -> dense representation
  -> vocabulary scores
  -> next-token probabilities

Category Theory Concept

The current system teaches composition:

TokenId -> Vector -> Logits -> Distribution

and endomorphism:

Parameters -> Parameters

Those two shapes remain central in Transformers.

Step 1: Sequences As First-Class Objects

The future problem:

Attention does not operate on one token alone. It operates on a sequence of hidden states.

Worked Example: Validating Sequence Length

The first-principles Rust move is the same one used throughout the book: do not let a meaningful value travel as a raw primitive once it crosses a conceptual boundary. A future sequence length can start as a small validating type:

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SequenceLength(usize);

impl SequenceLength {
    fn new(value: usize) -> Result<Self, &'static str> {
        if value == 0 {
            return Err("sequence length must be positive");
        }

        Ok(Self(value))
    }

    fn value(self) -> usize {
        self.0
    }
}

fn main() -> Result<(), &'static str> {
    assert_eq!(SequenceLength::new(3)?.value(), 3);
    assert!(SequenceLength::new(0).is_err());
    Ok(())
}

Self-Check

Before reading the roadmap steps, explain why a future SequenceLength should not be passed around as a bare usize.

Rust Syntax

A future extension should introduce types such as:

pub struct Position(usize);
pub struct SequenceLength(usize);
pub struct HiddenSequence(Vec<Vector>);
pub struct AttentionMask(/* validated mask representation */);

The important rule is the same as this course:

do not pass raw vectors across architectural boundaries

ML Concept

Attention needs a representation like:

[hidden_0, hidden_1, hidden_2, ...]

plus position and mask information.

Category Theory Concept

The object changes from:

Vector

to:

Sequence(Vector)

The next morphisms operate on structured sequences.
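The same validating-constructor move extends to the sequence object itself. This is a sketch under stated assumptions: the `Vector` stand-in and the non-empty rule are illustrative choices, not the future crate's final API:

```rust
// Illustrative stand-in for the book's Vector type.
#[derive(Debug, Clone, PartialEq)]
struct Vector(Vec<f64>);

// A hidden sequence as a first-class object, not a bare Vec.
#[derive(Debug, Clone, PartialEq)]
struct HiddenSequence(Vec<Vector>);

impl HiddenSequence {
    // Reject empty sequences the same way SequenceLength rejects zero.
    fn new(states: Vec<Vector>) -> Result<Self, &'static str> {
        if states.is_empty() {
            return Err("hidden sequence must be non-empty");
        }
        Ok(Self(states))
    }

    fn len(&self) -> usize {
        self.0.len()
    }
}

fn main() {
    let seq = HiddenSequence::new(vec![
        Vector(vec![0.1, 0.2]),
        Vector(vec![0.3, 0.4]),
    ])
    .unwrap();
    assert_eq!(seq.len(), 2);
    assert!(HiddenSequence::new(vec![]).is_err());
}
```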

Step 2: Query, Key, And Value Projections

The future problem:

Attention compares tokens by projecting hidden states into query, key, and value spaces.

Rust Syntax

Future morphisms might have shapes:

HiddenSequence -> QuerySequence
HiddenSequence -> KeySequence
HiddenSequence -> ValueSequence

Each output type should be distinct.

Queries, keys, and values are all vectors underneath, but they have different roles.

ML Concept

Queries ask:

what am I looking for?

Keys answer:

what do I contain?

Values provide:

what information should be mixed?

Category Theory Concept

These are parallel morphisms out of the same object:

HiddenSequence -> QuerySequence
HiddenSequence -> KeySequence
HiddenSequence -> ValueSequence

The future attention block combines their results.
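The parallel-morphisms shape can be sketched with distinct role-carrying output types. The projections here are toy scalings rather than learned matrices, and every function name is an illustrative assumption:

```rust
#[derive(Debug, Clone)]
struct Vector(Vec<f64>);
#[derive(Debug, Clone)]
struct HiddenSequence(Vec<Vector>);

// Distinct output types: the same data underneath, but different roles,
// so the compiler stops a key from being used where a query belongs.
#[derive(Debug)]
struct QuerySequence(Vec<Vector>);
#[derive(Debug)]
struct KeySequence(Vec<Vector>);
#[derive(Debug)]
struct ValueSequence(Vec<Vector>);

// Toy projection: scale every component; a real model uses a learned matrix.
fn project(h: &HiddenSequence, scale: f64) -> Vec<Vector> {
    h.0.iter()
        .map(|v| Vector(v.0.iter().map(|x| x * scale).collect()))
        .collect()
}

// Three parallel morphisms out of the same object.
fn to_queries(h: &HiddenSequence) -> QuerySequence { QuerySequence(project(h, 1.0)) }
fn to_keys(h: &HiddenSequence) -> KeySequence { KeySequence(project(h, 2.0)) }
fn to_values(h: &HiddenSequence) -> ValueSequence { ValueSequence(project(h, 3.0)) }

fn main() {
    let h = HiddenSequence(vec![Vector(vec![1.0, 2.0])]);
    assert_eq!(to_queries(&h).0[0].0, vec![1.0, 2.0]);
    assert_eq!(to_keys(&h).0[0].0, vec![2.0, 4.0]);
    assert_eq!(to_values(&h).0[0].0, vec![3.0, 6.0]);
}
```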

Step 3: Scaled Dot-Product Attention

The future problem:

Convert query-key similarity into a probability distribution over positions, then use it to mix values.

Rust Syntax

A typed shape could be:

QuerySequence x KeySequence -> AttentionScores
AttentionScores -> AttentionWeights
AttentionWeights x ValueSequence -> AttentionOutput

AttentionWeights should be validated like Distribution, but over sequence positions instead of vocabulary tokens.

ML Concept

Attention computes:

scores = QK^T / sqrt(d)
weights = softmax(scores)
output = weights V

This is softmax again, but applied to token-to-token interaction scores.
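The three formulas can be traced numerically. This sketch works over plain `Vec<f64>`s so the arithmetic stays visible; the roadmap's typed wrappers are deliberately omitted, and the test vectors are invented for illustration:

```rust
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn softmax(scores: &[f64]) -> Vec<f64> {
    // Subtract the max for numerical stability, as with the vocabulary softmax.
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

// One query attends over all keys, then mixes the values:
// scores = q.k / sqrt(d), weights = softmax(scores), output = weights . values.
fn attend(query: &[f64], keys: &[Vec<f64>], values: &[Vec<f64>]) -> Vec<f64> {
    let d = query.len() as f64;
    let scores: Vec<f64> = keys.iter().map(|k| dot(query, k) / d.sqrt()).collect();
    let weights = softmax(&scores);
    let dim = values[0].len();
    let mut out = vec![0.0; dim];
    for (w, v) in weights.iter().zip(values) {
        for i in 0..dim {
            out[i] += w * v[i];
        }
    }
    out
}

fn main() {
    let keys = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let values = vec![vec![10.0, 0.0], vec![0.0, 10.0]];
    // A query aligned with the first key weights the first value more heavily.
    let out = attend(&[1.0, 0.0], &keys, &values);
    assert!(out[0] > out[1]);
    // The weights sum to 1, so the output components sum to 10 here.
    assert!((out[0] + out[1] - 10.0).abs() < 1e-9);
}
```

Because the weights are a probability distribution over positions, a future `AttentionWeights` type can reuse the same validation law as `Distribution`.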

Category Theory Concept

The attention block is a composition of typed maps with a product input:

(Q, K, V) -> scores -> weights -> mixed values

Step 4: Multi-Head Attention

The future problem:

One attention head sees one interaction pattern. Multiple heads let the model learn several patterns in parallel.

Rust Syntax

Future types might include:

pub struct AttentionHead(/* head parameters */);
pub struct HeadCount(usize);
pub struct MultiHeadOutput(/* concatenated or projected heads */);

HeadCount should reject zero.

ML Concept

Each head performs attention separately.

The outputs are combined and projected back into the model dimension.

Category Theory Concept

This is parallel composition followed by recombination:

head_1 x head_2 x ... x head_n -> combined output
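The parallel-then-recombine shape, plus the `HeadCount` validation rule, can be sketched as follows. The per-head computation here is a toy transform whose only job is to show the shape; the real heads would each run attention:

```rust
// A validating head count, following the SequenceLength pattern.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct HeadCount(usize);

impl HeadCount {
    fn new(value: usize) -> Result<Self, &'static str> {
        if value == 0 {
            return Err("head count must be positive");
        }
        Ok(Self(value))
    }
}

// Run n toy heads in parallel and concatenate their outputs.
fn multi_head(input: &[f64], heads: HeadCount) -> Vec<f64> {
    (0..heads.0)
        .map(|h| {
            // Each "head" sees the same input but applies its own toy transform.
            input.iter().sum::<f64>() * (h as f64 + 1.0)
        })
        .collect()
}

fn main() {
    assert!(HeadCount::new(0).is_err());
    let heads = HeadCount::new(3).unwrap();
    // Three heads over input summing to 3.0 produce [3.0, 6.0, 9.0].
    assert_eq!(multi_head(&[1.0, 2.0], heads), vec![3.0, 6.0, 9.0]);
}
```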

Step 5: Residual Blocks And Normalization

The future problem:

Transformer blocks repeatedly map a hidden sequence back to a hidden sequence.

Rust Syntax

A future block should have shape:

HiddenSequence -> HiddenSequence

That means it can implement an endomorphism-like trait or reuse the existing Morphism<HiddenSequence, HiddenSequence> shape.

ML Concept

Transformer blocks use:

attention
residual connection
normalization
feed-forward network

The block output has the same shape as the input.

Category Theory Concept

This is another endomorphism:

HiddenSequence -> HiddenSequence

Stacking layers is repeated endomorphism application.
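Repeated endomorphism application can be sketched directly. The "block" below is a toy residual-style update (x + f(x) with f as the identity), not a real attention block, and `HiddenSequence` is simplified to raw vectors for brevity:

```rust
#[derive(Debug, Clone, PartialEq)]
struct HiddenSequence(Vec<Vec<f64>>);

// A toy block with the required shape: HiddenSequence -> HiddenSequence.
fn block(h: HiddenSequence) -> HiddenSequence {
    HiddenSequence(
        h.0.into_iter()
            // Toy residual update: each state plus an identity transform of itself.
            .map(|state| state.into_iter().map(|x| x + x).collect())
            .collect(),
    )
}

// Stacking n layers is n-fold application of the same endomorphism.
fn stack(mut h: HiddenSequence, layers: usize) -> HiddenSequence {
    for _ in 0..layers {
        h = block(h);
    }
    h
}

fn main() {
    let h = HiddenSequence(vec![vec![1.0, 2.0]]);
    // Each toy layer doubles every component, so three layers multiply by 8.
    assert_eq!(stack(h, 3), HiddenSequence(vec![vec![8.0, 16.0]]));
}
```

Because input and output types match, the compiler permits any stack depth; that is the practical payoff of the endomorphism shape.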

Step 6: Training And Evaluation

The future problem:

Once the model has attention parameters, training must update a larger parameter object without losing type structure.

Rust Syntax

The current:

Parameters

would need to grow into a structured parameter type:

pub struct TransformerParameters {
    token_embedding: ...,
    attention_blocks: ...,
    lm_head: ...,
}

The update should still have the shape:

TransformerParameters -> TransformerParameters

ML Concept

The same high-level training loop remains:

predict
compute loss
backpropagate
update parameters

The internal model becomes richer.

Category Theory Concept

The training endomorphism generalizes:

Parameters -> Parameters

to:

TransformerParameters -> TransformerParameters
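The generalized endomorphism can be sketched as well. The field names and the gradient-step arithmetic below are illustrative placeholders; only the shape of the update is the point:

```rust
#[derive(Debug, Clone, PartialEq)]
struct TransformerParameters {
    token_embedding: Vec<f64>,
    lm_head: Vec<f64>,
}

// TransformerParameters -> TransformerParameters: the same endomorphism
// shape as the current Parameters update, over a richer object.
fn train_step(p: TransformerParameters, learning_rate: f64) -> TransformerParameters {
    // Toy "gradient": nudge every parameter toward zero.
    let step = |w: &[f64]| -> Vec<f64> {
        w.iter().map(|x| x - learning_rate * x).collect()
    };
    TransformerParameters {
        token_embedding: step(&p.token_embedding),
        lm_head: step(&p.lm_head),
    }
}

fn main() {
    let p = TransformerParameters {
        token_embedding: vec![1.0],
        lm_head: vec![2.0],
    };
    let updated = train_step(p, 0.5);
    assert_eq!(updated.token_embedding, vec![0.5]);
    assert_eq!(updated.lm_head, vec![1.0]);
}
```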

Core Mental Model

The current course teaches the typed skeleton:

TokenId -> Vector -> Logits -> Distribution
Distribution x TokenId -> Loss
Parameters -> Parameters

A Transformer extension grows the middle:

TokenSequence
  -> HiddenSequence
  -> AttentionOutput
  -> HiddenSequence
  -> Logits
  -> Distribution

The practical rule stays the same:

Make every intermediate object explicit, then compose only arrows whose types actually match.

Where This Leaves Us

The roadmap keeps the book honest. The current implementation is a tiny next-token system, not a production Transformer. Its value is that it gives the future system a typed foundation: tokens become vectors, vectors become logits, logits become probabilities, probabilities become loss, and training updates parameters through a repeatable endomorphism.

A future Transformer should extend that foundation by adding sequence objects, position information, query/key/value projections, attention weights, multi-head structure, residual blocks, normalization, and richer training state. Each new concept should enter the codebase the same way the current concepts did: as a named type, a validated boundary, a typed morphism, a compiled example, and a law or regression test where the concept has a law worth checking.

Retrieval Practice

Recall

Which current pipeline objects would still exist in a future Transformer?

Explain

Why should attention weights be validated like probability distributions?

Apply

Sketch one future morphism for query/key/value attention and name its input and output types.