Transformer Roadmap
The problem this chapter solves is:
The repository name points toward Transformers, but the current code is a foundation course. This chapter explains exactly how the current objects and morphisms point toward a future attention-based model.
The current code is not a full Transformer.
It teaches the typed pieces you need first:
tokens
vectors
logits
probabilities
loss
training updates
composition
This distinction matters. A roadmap should not pretend the current crate already implements attention, residual blocks, or a full sequence model. It should show how the current typed skeleton can grow without losing the discipline that made the small examples understandable.
Reader orientation: Read this chapter as an engineering migration plan, not as a promise that the current code already contains every Transformer component.
What You Already Know
If you understand the current prediction path, you already know the skeleton a Transformer will extend. Tokens become vectors, vectors move through typed transformations, and probabilities feed a loss. The future work is to replace the one-token middle with sequence-aware structure.
What Exists Now
The current model has this prediction path:
TokenId -> Vector -> Logits -> Distribution
Rust Syntax
The path is implemented with:
Embedding
LinearToLogits
Softmax
Compose
The main domain objects are:
TokenId
Vector
Logits
Distribution
Parameters
The training update is:
TrainStep : Parameters -> Parameters
ML Concept
This is a tiny next-token model.
It predicts from one token at a time.
It does not yet model attention across a sequence.
Still, it already teaches the core path:
discrete token
-> dense representation
-> vocabulary scores
-> next-token probabilities
Category Theory Concept
The current system teaches composition:
TokenId -> Vector -> Logits -> Distribution
and endomorphism:
Parameters -> Parameters
Those two shapes remain central in Transformers.
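Those names suggest a concrete shape. Here is a minimal sketch of a Morphism trait and a Compose combinator, written only from the signatures above; the crate's actual definitions may differ, so treat every name here as illustrative.

use std::marker::PhantomData;

pub trait Morphism<A, B> {
    fn apply(&self, input: A) -> B;
}

/// Compose two arrows whose types line up: A -> B, then B -> C.
pub struct Compose<F, G, B> {
    first: F,
    second: G,
    mid: PhantomData<B>,
}

impl<A, B, C, F, G> Morphism<A, C> for Compose<F, G, B>
where
    F: Morphism<A, B>,
    G: Morphism<B, C>,
{
    fn apply(&self, input: A) -> C {
        self.second.apply(self.first.apply(input))
    }
}

The PhantomData field only records the intermediate type B, so the compiler can verify that the two arrows actually meet in the middle.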
Step 1: Sequences As First-Class Objects
The future problem:
Attention does not operate on one token alone. It operates on a sequence of hidden states.
Worked Example: Validating Sequence Length
The first-principles Rust move is the same one used throughout the book: do not let a meaningful value travel as a raw primitive once it crosses a conceptual boundary. A future sequence length can start as a small validating type:
#![allow(unused)]
fn main() -> Result<(), &'static str> {
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    struct SequenceLength(usize);

    impl SequenceLength {
        fn new(value: usize) -> Result<Self, &'static str> {
            if value == 0 {
                return Err("sequence length must be positive");
            }
            Ok(Self(value))
        }

        fn value(self) -> usize {
            self.0
        }
    }

    assert_eq!(SequenceLength::new(3)?.value(), 3);
    assert!(SequenceLength::new(0).is_err());
    Ok(())
}
Self-Check
Before reading the roadmap steps, explain why a future SequenceLength should not be passed around as a bare usize.
Rust Syntax
A future extension should introduce types such as:
pub struct Position(usize);
pub struct SequenceLength(usize);
pub struct HiddenSequence(Vec<Vector>);
pub struct AttentionMask(/* validated mask representation */);
The important rule is the same one this course has applied throughout:
do not pass raw vectors across architectural boundaries.
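As a sketch of how that rule extends to sequences, a HiddenSequence constructor can reject empty input and mismatched dimensions. The Vector newtype below is a stand-in, since the crate's real Vector type is not reproduced here:

#[derive(Debug, Clone, PartialEq)]
pub struct Vector(pub Vec<f32>);

#[derive(Debug, Clone, PartialEq)]
pub struct HiddenSequence(Vec<Vector>);

impl HiddenSequence {
    /// Accept only a non-empty sequence of same-dimension hidden states.
    pub fn new(vectors: Vec<Vector>) -> Result<Self, &'static str> {
        let first_len = match vectors.first() {
            None => return Err("hidden sequence must be non-empty"),
            Some(v) => v.0.len(),
        };
        if vectors.iter().any(|v| v.0.len() != first_len) {
            return Err("all hidden states must share one dimension");
        }
        Ok(Self(vectors))
    }

    pub fn len(&self) -> usize {
        self.0.len()
    }
}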
ML Concept
Attention needs a representation like:
[hidden_0, hidden_1, hidden_2, ...]
plus position and mask information.
Category Theory Concept
The object changes from:
Vector
to:
Sequence(Vector)
The next morphisms operate on structured sequences.
Step 2: Query, Key, And Value Projections
The future problem:
Attention compares tokens by projecting hidden states into query, key, and value spaces.
Rust Syntax
Future morphisms might have shapes:
HiddenSequence -> QuerySequence
HiddenSequence -> KeySequence
HiddenSequence -> ValueSequence
Each output type should be distinct.
Queries, keys, and values are all vectors underneath, but they have different roles.
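A hedged sketch of that separation, using plain Vec<Vec<f32>> storage as a stand-in for the crate's real sequence types (every name here is illustrative):

#[derive(Debug, Clone)]
pub struct HiddenSequence(pub Vec<Vec<f32>>);

#[derive(Debug, Clone)]
pub struct QuerySequence(pub Vec<Vec<f32>>);

/// A projection weight matrix, stored row-major.
pub struct QueryProjection {
    weights: Vec<Vec<f32>>, // rows: query dimension, cols: hidden dimension
}

impl QueryProjection {
    /// HiddenSequence -> QuerySequence: project every hidden state.
    pub fn apply(&self, input: &HiddenSequence) -> QuerySequence {
        let projected = input
            .0
            .iter()
            .map(|hidden| {
                self.weights
                    .iter()
                    .map(|row| row.iter().zip(hidden).map(|(w, x)| w * x).sum())
                    .collect()
            })
            .collect();
        QuerySequence(projected)
    }
}

// KeyProjection and ValueProjection follow the same pattern, returning
// the distinct KeySequence and ValueSequence types.

Because the three outputs are distinct types, the compiler rejects code that feeds a KeySequence where a QuerySequence is expected.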
ML Concept
Queries ask:
what am I looking for?
Keys answer:
what do I contain?
Values provide:
what information should be mixed?
Category Theory Concept
These are parallel morphisms out of the same object:
HiddenSequence -> QuerySequence
HiddenSequence -> KeySequence
HiddenSequence -> ValueSequence
The future attention block combines their results.
Step 3: Scaled Dot-Product Attention
The future problem:
Convert query-key similarity into a probability distribution over positions, then use it to mix values.
Rust Syntax
A typed shape could be:
QuerySequence x KeySequence -> AttentionScores
AttentionScores -> AttentionWeights
AttentionWeights x ValueSequence -> AttentionOutput
AttentionWeights should be validated like Distribution, but over sequence positions instead of vocabulary tokens.
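A minimal sketch of that validation, mirroring the Distribution pattern; the row-major representation and the tolerance are assumptions:

// One row of weights per query position; each row is a distribution
// over key positions.
#[derive(Debug, Clone, PartialEq)]
pub struct AttentionWeights(Vec<Vec<f32>>);

impl AttentionWeights {
    pub fn new(rows: Vec<Vec<f32>>) -> Result<Self, &'static str> {
        for row in &rows {
            if row.iter().any(|w| !(0.0..=1.0).contains(w)) {
                return Err("attention weights must lie in [0, 1]");
            }
            let total: f32 = row.iter().sum();
            if (total - 1.0).abs() > 1e-5 {
                return Err("each row of attention weights must sum to 1");
            }
        }
        Ok(Self(rows))
    }
}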
ML Concept
Attention computes:
scores = QK^T / sqrt(d)
weights = softmax(scores)
output = weights V
This is softmax again, but applied to token-to-token interaction scores.
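To make those three formulas concrete, here is a single-head sketch over raw Vec<f32> values. The untyped representation is an assumption for illustration; a real extension would use the validated sequence types above:

// Single-head scaled dot-product attention, one output row per query.
fn attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let d = q[0].len() as f32;
    q.iter()
        .map(|query| {
            // scores = q . k_j / sqrt(d) for every key position j
            let scores: Vec<f32> = k
                .iter()
                .map(|key| {
                    let dot: f32 = query.iter().zip(key).map(|(a, b)| a * b).sum();
                    dot / d.sqrt()
                })
                .collect();
            // weights = softmax(scores), stabilized by subtracting the max
            let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
            let total: f32 = exps.iter().sum();
            let weights: Vec<f32> = exps.iter().map(|e| e / total).collect();
            // output = weights V, a convex mix of the value vectors
            let mut mixed = vec![0.0; v[0].len()];
            for (w, value) in weights.iter().zip(v) {
                for (m, x) in mixed.iter_mut().zip(value) {
                    *m += w * x;
                }
            }
            mixed
        })
        .collect()
}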
Category Theory Concept
The attention block is a composition of typed maps with a product input:
(Q, K, V) -> scores -> weights -> mixed values
Step 4: Multi-Head Attention
The future problem:
One attention head sees one interaction pattern. Multiple heads let the model learn several patterns in parallel.
Rust Syntax
Future types might include:
pub struct AttentionHead(/* head parameters */);
pub struct HeadCount(usize);
pub struct MultiHeadOutput(/* concatenated or projected heads */);
HeadCount should reject zero.
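A minimal validating sketch, following the same pattern as SequenceLength earlier in this chapter:

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct HeadCount(usize);

impl HeadCount {
    pub fn new(value: usize) -> Result<Self, &'static str> {
        if value == 0 {
            return Err("head count must be positive");
        }
        Ok(Self(value))
    }
}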
ML Concept
Each head performs attention separately.
The outputs are combined and projected back into the model dimension.
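As a hedged sketch of the combination step, assuming each head yields one output vector per position, per-position concatenation looks like this (a real block would follow it with an output projection):

// head_outputs[h][p] is head h's output vector at sequence position p.
fn concat_heads(head_outputs: &[Vec<Vec<f32>>]) -> Vec<Vec<f32>> {
    let positions = head_outputs[0].len();
    (0..positions)
        .map(|p| {
            head_outputs
                .iter()
                .flat_map(|head| head[p].iter().copied())
                .collect()
        })
        .collect()
}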
Category Theory Concept
This is parallel composition followed by recombination:
head_1 x head_2 x ... x head_n -> combined output
Step 5: Residual Blocks And Normalization
The future problem:
Transformer blocks repeatedly map a hidden sequence back to a hidden sequence.
Rust Syntax
A future block should have shape:
HiddenSequence -> HiddenSequence
That means it can implement an endomorphism-like trait or reuse the existing Morphism<HiddenSequence, HiddenSequence> shape.
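A hedged sketch of that endomorphism shape, with the attention and normalization internals left as stubs (every name here is illustrative, not the crate's):

#[derive(Debug, Clone)]
pub struct HiddenSequence(pub Vec<Vec<f32>>);

pub struct TransformerBlock {/* attention and feed-forward parameters */}

impl TransformerBlock {
    /// HiddenSequence -> HiddenSequence, the residual pattern:
    /// output = normalize(input + sublayer(input)).
    pub fn apply(&self, input: HiddenSequence) -> HiddenSequence {
        let mixed = self.attention(&input);
        let summed: Vec<Vec<f32>> = input
            .0
            .iter()
            .zip(&mixed.0)
            .map(|(a, b)| a.iter().zip(b).map(|(x, y)| x + y).collect())
            .collect();
        self.normalize(HiddenSequence(summed))
    }

    fn attention(&self, input: &HiddenSequence) -> HiddenSequence {
        // Stub: a real block would run multi-head attention here.
        input.clone()
    }

    fn normalize(&self, input: HiddenSequence) -> HiddenSequence {
        // Stub: a real block would apply layer normalization here.
        input
    }
}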
ML Concept
Transformer blocks use:
attention
residual connection
normalization
feed-forward network
The block output has the same shape as the input.
Category Theory Concept
This is another endomorphism:
HiddenSequence -> HiddenSequence
Stacking layers is repeated endomorphism application.
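A minimal sketch of that repeated application, assuming some endomorphism trait (the names are hypothetical):

trait Endomorphism<T> {
    fn apply(&self, value: T) -> T;
}

// Stacking layers is a fold of endomorphism applications.
fn run_stack<T>(layers: &[Box<dyn Endomorphism<T>>], input: T) -> T {
    layers.iter().fold(input, |hidden, layer| layer.apply(hidden))
}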
Step 6: Training And Evaluation
The future problem:
Once the model has attention parameters, training must update a larger parameter object without losing type structure.
Rust Syntax
The current:
Parameters
would need to grow into a structured parameter type:
pub struct TransformerParameters {
token_embedding: ...,
attention_blocks: ...,
lm_head: ...,
}
The update should still have the shape:
TransformerParameters -> TransformerParameters
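A hedged sketch of that shape, with every field and the update body elided, since the optimizer details are not part of the current crate:

pub struct TransformerParameters {/* token_embedding, attention_blocks, lm_head */}

pub struct TrainStep {/* learning rate, optimizer state, batch source */}

impl TrainStep {
    /// TransformerParameters -> TransformerParameters.
    pub fn apply(&self, params: TransformerParameters) -> TransformerParameters {
        // Stub: predict, compute loss, backpropagate, update, then return
        // the new parameters so steps compose like any other endomorphism.
        params
    }
}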
ML Concept
The same high-level training loop remains:
predict
compute loss
backpropagate
update parameters
The internal model becomes richer.
Category Theory Concept
The training endomorphism generalizes:
Parameters -> Parameters
to:
TransformerParameters -> TransformerParameters
Core Mental Model
The current course teaches the typed skeleton:
TokenId -> Vector -> Logits -> Distribution
Distribution x TokenId -> Loss
Parameters -> Parameters
A Transformer extension grows the middle:
TokenSequence
-> HiddenSequence
-> AttentionOutput
-> HiddenSequence
-> Logits
-> Distribution
The practical rule stays the same:
Make every intermediate object explicit, then compose only arrows whose types actually match.
Where This Leaves Us
The roadmap keeps the book honest. The current implementation is a tiny next-token system, not a production Transformer. Its value is that it gives the future system a typed foundation: tokens become vectors, vectors become logits, logits become probabilities, probabilities become loss, and training updates parameters through a repeatable endomorphism.
A future Transformer should extend that foundation by adding sequence objects, position information, query/key/value projections, attention weights, multi-head structure, residual blocks, normalization, and richer training state. Each new concept should enter the codebase the same way the current concepts did: as a named type, a validated boundary, a typed morphism, a compiled example, and a law or regression test where the concept has a law worth checking.
Retrieval Practice
Recall
Which current pipeline objects would still exist in a future Transformer?
Explain
Why should attention weights be validated like probability distributions?
Apply
Sketch one future morphism for query/key/value attention and name its input and output types.