Glossary

The problem this chapter solves is:

Abstract terms are easier to remember when each term is tied to a Rust type, an ML role, and a category-theory shape.

Use this glossary as a lookup table while reading the source snapshots.

Do not read it as a separate dictionary. Each entry is deliberately anchored to the codebase. If a definition sounds abstract, jump from the term to the Rust syntax and then back to the chapter where the type or trait appears.

Reader orientation: The glossary uses compact entries, but the entries still follow the book’s main discipline: first the Rust handle, then the ML or software role, then the categorical shape.

How To Use This Glossary

Use each entry as a bridge, not as a final definition.

term -> Rust handle -> ML or software role -> category-theory shape

If a term has no Rust handle in this repository, it is not a core term for this book yet. The goal is not to collect impressive vocabulary. The goal is to make the vocabulary already used by the chapters easier to retrieve and transfer.

When a term appears in a chapter, ask:

What value, function, trait, constructor, method, test, or command makes this
term concrete?

That question keeps the glossary grounded.

Source-Backed Recovery Rules

Use this section when a term feels impressive but not usable yet. The glossary is strongest when a definition can be recovered through four anchors:

term -> source anchor -> Rust evidence -> learner evidence signal

The outside source gives the term a trustworthy boundary. The repository evidence shows the smaller claim this book actually makes. The learner evidence signal tells you what to run, inspect, or explain before moving on.

| If this term family is unclear | Source anchor | Local evidence | Learner evidence signal |
| --- | --- | --- | --- |
| domain object, invariant, smart constructor | Rust structs, Rust enums, Rust API Guidelines, recoverable Result | TokenId, TokenSequence::new, Distribution::new, Loss::new, and LearningRate::new in src/domain.rs | cargo run --example 01_domain_objects; cargo test domain::tests; explain which invalid state a constructor rejects |
| morphism, identity, composition | Rust traits, Rust generics, Seven Sketches, Category Theory for Programming | Morphism<Input, Output>, Identity<T>, and Compose<F, G, Middle> in src/category.rs | cargo run --example 02_morphism_composition; cargo test category::tests; name the middle object that makes composition legal |
| logits, distribution, cross entropy, loss | Dive into Deep Learning: Softmax Regression, CS231n Linear Classification, PyTorch CrossEntropyLoss, On Calibration of Modern Neural Networks | Logits -> Distribution -> Product<Distribution, TokenId> -> Loss in src/ml.rs | cargo run --bin category_ml; cargo test ml::tests; point to the line where target token and prediction meet, then explain why normalized probability is not automatically calibrated confidence |
| training step, parameters, endomorphism | Backprop as Functor, D2L Backpropagation, PyTorch optimizers | TrainStep : Parameters -> Parameters in src/training.rs and TransformerTrainingState -> TransformerTrainingState in src/attention.rs | cargo run --example 03_training_endomorphism; cargo run --example 07_transformer_training_state; separate measurement from update |
| functor, naturality, monoid, chain rule | Categories for the Working Mathematician, Seven Sketches, Category Theory for Programming, D2L Computational Graphs | VecFunctor, OptionFunctor, first_or_none_naturality_square, PipelineTrace, and MulOp::backward in src/structure.rs and src/calculus.rs | cargo run --example 04_structure_and_calculus; cargo test structure::tests --lib; cargo test calculus::tests --lib; explain which small law or local derivative the output checks, and which formal claim the local tests do not prove |
| query, key, value, mask, attention weights | Attention Is All You Need, PyTorch MultiheadAttention, PyTorch Transformer, PyTorch scaled dot product attention, Hugging Face Transformer course | QuerySequence, KeySequence, ValueSequence, AttentionMask, AttentionScores, and AttentionWeights in src/attention.rs | cargo run --example 06_attention_scores; cargo test attention::tests; explain why the mask is applied before softmax and why this book’s true -> allowed polarity must not be confused with APIs where true -> blocked |
| fixed module instance, parameter context, training-state update | D2L Parameter Management, PyTorch optimizers, Rust Book closures | LayerNormalization, PositionWiseFeedForward, and TransformerTrainingState in src/attention.rs | cargo run --example 07_transformer_training_state; explain why a forward sublayer is an endomorphism only for a fixed module value, while parameter changes belong to training state |
| finite difference, gradient check, local update evidence | CS231n numerical gradients, CS231n Neural Networks Part 3, PyTorch gradcheck | finite-difference tests for transformer readout, feed-forward, layer norm, attention projection, and block updates in src/attention.rs | run cargo test attention::tests::transformer_block_train_step_matches_finite_difference_for_readout_weight; state that this is local evidence, not a proof of all training |
| challenge completion, evidence signal, Paper-To-Rust, optimizer state | Rustlings Usage, Rustlings Community Exercises, Adam, PyTorch Adam, Rust Book tests | challenges/typed-ai-rustlings/, src/challenges/papers/adam.rs, examples/challenge_adam.rs, tests/challenge_typed_ai.rs, and tests/paper_to_rust_adam.rs | cargo test --test challenge_typed_ai; cargo run --example challenge_adam; explain the source claim, Rust boundary, invariant, and visible compiler, output, or test signal |
| retrieval, transfer, and misconception repair | How People Learn II, Test-Enhanced Learning, worked-example transition | the worked examples, partial examples, common misreadings, and exercise evidence map in this book | recover one term by writing the Rust handle, the protected ML role, and the exact command or test that checks it |

These source anchors do not make the glossary a substitute for the chapters. They protect the smaller local claim:

If a term matters here, the reader should be able to point to code, run a
command, inspect a failure signal, or explain a checked boundary.

If you cannot name the Rust handle or evidence signal, treat the term as unrecovered and return to the chapter or source file where it first appears.

Core Term Alignment

Some ideas have a public phrase, a Rust type, and a category-theory reading. Use this table to keep them separate.

| Public phrase | Rust handle | Use this wording when precision matters |
| --- | --- | --- |
| training pairs | TrainingExample values inside TrainingSet | “adjacent input-target pairs” for the examples, TrainingSet for the validated Rust object |
| model state | Parameters | “parameters” when naming the Rust object, “model state” when explaining the ML role |
| probabilities | Distribution | “probabilities” for intuition, Distribution when the constructor invariant matters |
| query sequence | QuerySequence | “queries” for intuition, QuerySequence when the attention role matters |
| key sequence | KeySequence | “keys” for intuition, KeySequence when score construction needs the matching head dimension |
| value sequence | ValueSequence | “values” for intuition, ValueSequence when attention weights need source rows to mix |
| target sequence length | QuerySequence row count | “target length” for intuition, L when contrasting target positions with source positions |
| source sequence length | KeySequence and ValueSequence row count | “source length” for intuition, S when the positions being read may differ from target positions |
| attention score rows | AttentionScores | “scores” for intuition, AttentionScores when the row shape must be validated |
| attention mask | AttentionMask | “allowed positions” for intuition, AttentionMask when illegal score positions must be removed before softmax |
| mask polarity | AttentionMask | “true means allowed in this Rust type” when comparing with framework APIs whose boolean masks may use the opposite convention |
| attention weights | AttentionWeights | “weights” for intuition, AttentionWeights when each query row must sum to one |
| attention output | AttentionOutput | “mixed values” for intuition, AttentionOutput when one output row per query matters |
| head count | HeadCount | “number of heads” for intuition, HeadCount when zero heads must be rejected |
| head outputs | AttentionHeadOutputs | “outputs from several heads” for intuition, AttentionHeadOutputs when all heads must share sequence length and width |
| multi-head output | MultiHeadOutput | “concatenated heads” for intuition, MultiHeadOutput when the combined model dimension matters |
| attention output projection | AttentionOutputProjection | “projection after head concatenation” for intuition, AttentionOutputProjection when matrix shape must be validated |
| projected attention output | ProjectedAttentionOutput | “projected attention sequence” for intuition, ProjectedAttentionOutput when the post-projection width matters |
| hidden sequence | HiddenSequence | “sequence of hidden vectors” for intuition, HiddenSequence when residual shape must be protected |
| hidden-to-query projection | HiddenToQuery | “make query vectors from hidden rows” for intuition, HiddenToQuery when projection shape must be validated |
| hidden-to-key projection | HiddenToKey | “make key vectors from hidden rows” for intuition, HiddenToKey when projection shape must be validated |
| hidden-to-value projection | HiddenToValue | “make value vectors from hidden rows” for intuition, HiddenToValue when projection shape must be validated |
| residual connection | ResidualConnection | “add the sublayer output back” for intuition, ResidualConnection when sequence length and width must match |
| layer normalization | LayerNormalization | “normalize each hidden vector” for intuition, LayerNormalization when feature-wise normalization must preserve shape |
| layer norm parameters | LayerNormParameters | “scale, shift, epsilon” for intuition, LayerNormParameters when parameter dimensions must be validated |
| position-wise feed-forward | PositionWiseFeedForward | “same non-linear map at each sequence position” for intuition, PositionWiseFeedForward when two-layer shape checks must preserve hidden width |
| positional encoding | PositionalEncoding | “add position rows” for intuition, PositionalEncoding when sequence length and model width must be checked |
| self-attention | SelfAttentionHead, MultiHeadTransformerBlock | “same hidden sequence supplies query, key, and value roles” for intuition, self-attention when source ownership matters |
| cross-attention | QuerySequence, KeySequence, ValueSequence | “target sequence reads a separate source sequence” for intuition; the repository names the boundary but does not implement a full cross-attention block yet |
| single-head block | SingleHeadTransformerBlock | “one block-shaped sketch” for intuition, SingleHeadTransformerBlock when the whole boundary should preserve hidden sequence shape |
| self-attention head | SelfAttentionHead | “one query/key/value projection triple” for intuition, SelfAttentionHead when one head’s role dimensions must be validated |
| multi-head block | MultiHeadTransformerBlock | “several heads as one block” for intuition, MultiHeadTransformerBlock when head count and output-projection shape must be validated |
| masked multi-head block | MaskedMultiHeadTransformerBlock | “block with allowed attention positions” for intuition, MaskedMultiHeadTransformerBlock when the mask joins hidden state at the block boundary |
| fixed mask context | AttentionMask selected before a block call | “same mask reused for this run” for intuition, fixed context when an open masked block is viewed as HiddenSequence -> HiddenSequence |
| fixed module instance | LayerNormalization, PositionWiseFeedForward, or a block value with stored parameters | “this specific layer value” for intuition, fixed module instance when a forward call is named HiddenSequence -> HiddenSequence |
| parameter-changing update | TransformerTrainingState | “learning changed the stored parameters” for intuition, training-state endomorphism when scale, shift, weights, biases, learning rate, or step count must stay together |
| sequence logits | SequenceLogits | “vocabulary scores at each sequence position” for intuition, SequenceLogits when sequence length and vocabulary width must be explicit |
| Transformer readout | TransformerReadout | “sequence language-model head” for intuition, TransformerReadout when hidden width and vocabulary width must be validated |
| tiny Transformer parameters | TinyTransformerParameters | “position plus block plus readout” for intuition, TinyTransformerParameters when named model roles should move together |
| Transformer training state | TransformerTrainingState | “parameters plus optimizer metadata” for intuition, TransformerTrainingState when step count and learning rate matter |
| Transformer readout training example | TransformerReadoutTrainingExample | “one fixed hidden sequence with target tokens” for intuition, TransformerReadoutTrainingExample when hidden, mask, and target lengths must match |
| Transformer readout train step | TransformerReadoutTrainStep | “readout-only update” for intuition, TransformerReadoutTrainStep when the state endomorphism matters |
| Transformer feed-forward training example | TransformerFeedForwardTrainingExample | “one hidden-sequence input and target” for intuition, TransformerFeedForwardTrainingExample when feed-forward local training shape must match |
| Transformer feed-forward train step | TransformerFeedForwardTrainStep | “local feed-forward update” for intuition, TransformerFeedForwardTrainStep when the state endomorphism matters |
| Transformer block training example | TransformerBlockTrainingExample | “one sequence-to-token supervised example” for intuition, TransformerBlockTrainingExample when hidden, mask, and target lengths must match |
| Transformer block train step | TransformerBlockTrainStep | “composed readout-plus-feed-forward update” for intuition, TransformerBlockTrainStep when a sequence loss updates more than one parameter group |
| evidence signal | command output, compiler error, test name, constructor result, or table row | “visible evidence” when reporting what happened, evidence signal when the report must point to something inspectable |
| challenge completion evidence | challenge issue fields and challenge commands | “I completed this practice loop” for challenge progress, not “accepted textbook reader feedback” unless it names the first unclear point and smallest useful fix |
| source claim | a narrow statement from a source link | “the outside claim being translated” before Rust code, source claim when a challenge must name what the paper or documentation actually supports |
| Rust boundary | a named type, function, trait, constructor, test, or command | “what the repository actually implements” when separating the local exercise from the larger source |
| optimizer state | AdamOptimizerState, AdamModelState, or TransformerTrainingState | “memory carried between updates” when explaining Adam-style moment estimates, step count, parameters, and learning metadata |
| Typed AI Rustlings | challenges/typed-ai-rustlings/ and tests/challenge_typed_ai.rs | “compiler-fix AI exercise” before abstraction, Typed AI Rustlings when one type mistake is meant to fail visibly |
| Paper-To-Rust | src/challenges/papers/adam.rs, examples/challenge_adam.rs, and tests/paper_to_rust_adam.rs | “compile one paper idea” before abstraction, Paper-To-Rust when a source claim becomes a Rust boundary, invariant, and test signal |
| larger claim not implemented | limitation notes in a chapter or challenge | “what this tiny example does not prove” when keeping a source-backed claim modest |
| typed transformation | Morphism<Input, Output> | “typed transformation” before abstraction, “morphism” once the Rust trait is in view |
| product-input morphism | Product<A, B> at the input boundary | “needs two named inputs” before abstraction, product-input morphism when the arrow shape is A x B -> C |
| update step | TrainStep | “training step” for ML behavior, “endomorphism” for the Parameters -> Parameters shape |

This alignment prevents two common confusions. First, not every prose phrase is a Rust type. Second, not every Rust type is a new mathematical concept. The book uses plain phrases for intuition, Rust names for exact code, and category-theory words only when the shape is visible.

Common Misreadings Index

Use this as a small contrast drill. Each row starts with a sentence that sounds plausible, then puts the corrected boundary next to it. The point is not to memorize the table. The point is to notice which Rust object, ML role, or category-theory shape the misreading erased.

| Plausible misreading | Corrected boundary | Rust evidence | What to say instead |
| --- | --- | --- | --- |
| TokenId is just a usize. | TokenId is a domain object for vocabulary positions. | TokenId is a named type consumed by token and embedding stages. | The raw number is local machinery; the boundary value says “vocabulary item.” |
| Logits are probabilities. | Logits -> Distribution is a required stage. | Softmax consumes Logits and produces Distribution. | Scores become probabilities only after row or vocabulary normalization. |
| A normalized softmax probability is calibrated confidence. | Calibration is an empirical reliability claim, not just a Distribution constructor invariant. | Distribution::new validates a local probability vector; calibration needs population-level evidence outside this tiny example. | Say “normalized model probability” unless you have checked empirical calibration. |
| Loss only needs the prediction. | Distribution x TokenId -> Loss is a product-input boundary. | CrossEntropy consumes prediction and target together. | The target token tells the loss which probability to judge. |
| A training step can return changed weights only. | Parameters -> Parameters or TransformerTrainingState -> TransformerTrainingState preserves the next update shape. | TrainStep and Transformer train steps return complete state objects. | The updated object must be ready for the next step without reconstruction. |
| fmap means any function call. | fmap changes inside values while preserving wrapper shape. | VecFunctor::fmap returns Vec<B> and OptionFunctor::fmap returns Option<B>. | The operation maps the contents and keeps the outer structure. |
| Returning the left object makes a boundary an endomorphism. | Count inputs first: A x B -> A is still product-input. If the product is named as one source object, (A x B) -> A is unary from the product but still not an endomorphism. | HiddenSequence x ProjectedAttentionOutput -> HiddenSequence needs two inputs. | A unary endomorphism has shape A -> A; an endomorphism on the product would have shape (A x B) -> (A x B). |
| Self-attention makes Q, K, and V the same role. | Self-attention shares source ownership before projection. | HiddenToQuery, HiddenToKey, and HiddenToValue produce separate role objects. | The same hidden sequence may feed all three projections, but the roles remain distinct. |
| Masking after softmax is equivalent. | AttentionScores x AttentionMask -> AttentionScores -> AttentionWeights. | The mask is applied before AttentionSoftmax. | Illegal positions should not receive probability mass. |
| A masked block is automatically an endomorphism because it returns HiddenSequence. | MaskedMultiHeadTransformerBlock : HiddenSequence x AttentionMask -> HiddenSequence while the mask is open. | The block consumes AttentionMask at the boundary. | Keep the mask visible, or explicitly say a fixed mask induces a HiddenSequence -> HiddenSequence view for that run. |
| A layer endomorphism means the parameters are not part of the story. | LayerNormalization : HiddenSequence -> HiddenSequence is a forward call for one fixed layer value; parameter learning is TransformerTrainingState -> TransformerTrainingState. | LayerNormalization stores scale and shift; train steps return full TransformerTrainingState. | Fixed module context makes a forward endomorphism; changing parameters moves the boundary to training state. |
| MultiHeadOutput can be added directly to HiddenSequence. | MultiHeadOutput -> ProjectedAttentionOutput must happen first. | ResidualConnection expects projected model-width rows. | Concatenated heads must return to model width before residual addition. |
| One finite-difference match proves training is correct. | A finite-difference check is local evidence for one selected parameter path. | Tests compare one inferred update gradient with one numerical slope. | The check supports the local implementation; it does not prove every parameter, dataset, or optimizer. |
| Challenge completion means the textbook section is clear. | Challenge completion is practice evidence; textbook feedback needs the first unclear point or an explicit “none.” | Challenge completion issues ask for evidence, lesson learned, first unclear point, and smallest useful fix. | Say “the challenge ran” for completion; say “this section became clearer because…” for reader feedback. |
| Paper-To-Rust means reimplement the whole paper. | Paper-To-Rust compiles one source claim into one Rust boundary, invariant, and test signal. | The Adam challenge uses AdamModelState -> AdamModelState for optimizer memory. | Keep the source claim narrow, then name the larger claim not implemented. |
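
The masking misreading can be checked with a few numbers. The sketch below is not the repository's AttentionMask or AttentionSoftmax code; the free functions masked_softmax and zero_blocked are hypothetical names, and plain slices stand in for the validated types. It keeps this book's polarity, where true means allowed.

```rust
/// Softmax restricted to allowed positions (hypothetical helper).
/// `allowed[i] == true` means position `i` may receive probability mass.
/// This is equivalent to setting blocked scores to negative infinity
/// before an ordinary softmax. Scores are assumed to be finite.
pub fn masked_softmax(scores: &[f32], allowed: &[bool]) -> Option<Vec<f32>> {
    // Reject mismatched shapes and rows with no allowed position.
    if scores.len() != allowed.len() || !allowed.iter().any(|a| *a) {
        return None;
    }
    // Subtract the largest allowed score for numerical stability.
    let max = scores
        .iter()
        .zip(allowed)
        .filter(|(_, a)| **a)
        .map(|(s, _)| *s)
        .fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores
        .iter()
        .zip(allowed)
        .map(|(s, a)| if *a { (s - max).exp() } else { 0.0 })
        .collect();
    let total: f32 = exps.iter().sum();
    Some(exps.iter().map(|e| e / total).collect())
}

/// The misreading: zero out blocked positions after softmax has already run.
pub fn zero_blocked(weights: &[f32], allowed: &[bool]) -> Vec<f32> {
    weights
        .iter()
        .zip(allowed)
        .map(|(w, a)| if *a { *w } else { 0.0 })
        .collect()
}
```

With three equal scores and the last position blocked, masking first gives weights 0.5, 0.5, 0.0, which sum to one. Running softmax first and zeroing afterwards leaves roughly 1/3, 1/3, 0.0, so the row no longer sums to one.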

When one of these misreadings appears in your own answer, repair it with three questions:

Which object did I erase?
Which ML or software role did that object protect?
Which category-theory shape did I name too early or too loosely?

Category-Theory Terms

Object

Rust syntax:

TokenId
Vector
Logits
Distribution
Loss
Parameters

ML concept:

An object is one kind of value in the pipeline, such as a token, vector, probability distribution, loss, or model state.

Category theory concept:

An object is something a morphism can start from or end at.

First-principles reading:

An object is the kind of thing an arrow is allowed to receive or return. In this book, TokenId and Vector are different objects because the pipeline should not confuse a vocabulary index with a dense numeric representation.

Morphism

Rust syntax:

pub trait Morphism<Input, Output>

ML concept:

A morphism is one transformation stage, such as embedding lookup or softmax.

Category theory concept:

A morphism is a typed arrow:

Input -> Output

First-principles reading:

In this book, “morphism” usually means “a named transformation with an input type, an output type, and a possible typed error.” The abstract name is useful only because the Rust code makes the boundary inspectable.
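
As a minimal sketch of that reading, the trait below gives one stage an input type, an output type, and an error. The method name apply, the String error type, and the field layout of Embedding are assumptions made for illustration; the repository's Morphism trait and its typed error may look different.

```rust
// Hypothetical sketch: `apply` and the String error are assumptions,
// not the repository's actual signature.
pub trait Morphism<Input, Output> {
    fn apply(&self, input: Input) -> Result<Output, String>;
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TokenId(pub usize);

#[derive(Debug, Clone, PartialEq)]
pub struct Vector(pub Vec<f32>);

/// A toy embedding table: row `i` is the vector for token `i`.
pub struct Embedding {
    pub rows: Vec<Vec<f32>>,
}

impl Morphism<TokenId, Vector> for Embedding {
    fn apply(&self, input: TokenId) -> Result<Vector, String> {
        self.rows
            .get(input.0)
            .cloned()
            .map(Vector)
            .ok_or_else(|| format!("token {} is outside the vocabulary", input.0))
    }
}
```

The signature alone states the arrow TokenId -> Vector, and the Err branch is the inspectable boundary for a token outside the table.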

Identity Morphism

Rust syntax:

Identity<T>

ML concept:

Identity is a stage that leaves a value unchanged. It is useful for testing the idea of neutral transformations.

Category theory concept:

Every object has an identity arrow:

id_A : A -> A

Composition

Rust syntax:

Compose<F, G, Middle>

ML concept:

Composition connects stages:

Embedding then LinearToLogits then Softmax

Category theory concept:

If:

f : A -> B
g : B -> C

then:

g after f : A -> C

First-principles reading:

Composition is the reason the middle type matters. If the first stage produces Vector, the next stage must accept Vector. A compiler error at this point is useful evidence: the pipeline is missing or misordering a stage.
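
The role of the middle type can be sketched as follows. Only the names Morphism, Identity, and Compose come from the book; the apply method, the String error, the PhantomData fields, and the toy stages Double and Describe are assumptions for illustration.

```rust
use std::marker::PhantomData;

// Hypothetical trait shape; the repository's signature may differ.
pub trait Morphism<Input, Output> {
    fn apply(&self, input: Input) -> Result<Output, String>;
}

/// Identity leaves its input unchanged.
pub struct Identity<T>(PhantomData<T>);

impl<T> Identity<T> {
    pub fn new() -> Self {
        Identity(PhantomData)
    }
}

impl<T> Morphism<T, T> for Identity<T> {
    fn apply(&self, input: T) -> Result<T, String> {
        Ok(input)
    }
}

/// Runs `first`, then feeds its output to `second`.
pub struct Compose<F, G, Middle> {
    first: F,
    second: G,
    _middle: PhantomData<Middle>,
}

impl<F, G, Middle> Compose<F, G, Middle> {
    pub fn new(first: F, second: G) -> Self {
        Compose { first, second, _middle: PhantomData }
    }
}

// Composition is only defined when the output of F is the input of G.
impl<A, C, Middle, F, G> Morphism<A, C> for Compose<F, G, Middle>
where
    F: Morphism<A, Middle>,
    G: Morphism<Middle, C>,
{
    fn apply(&self, input: A) -> Result<C, String> {
        let middle = self.first.apply(input)?;
        self.second.apply(middle)
    }
}

/// Toy stage: i32 -> i32.
pub struct Double;
impl Morphism<i32, i32> for Double {
    fn apply(&self, input: i32) -> Result<i32, String> {
        Ok(input * 2)
    }
}

/// Toy stage: i32 -> String.
pub struct Describe;
impl Morphism<i32, String> for Describe {
    fn apply(&self, input: i32) -> Result<String, String> {
        Ok(format!("value {}", input))
    }
}
```

In this sketch, Compose::new(Double, Describe) with Middle = i32 can be applied to an i32. Swapping the stages still constructs a value, but calling apply on it would not compile, because Describe produces a String and Double accepts only i32.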

Product Object

Rust syntax:

Product<A, B>

ML concept:

A product stores paired values, such as:

input token x target token
prediction distribution x target token

Category theory concept:

The product object is written:

A x B

Its projections correspond to first() and second().
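
A small sketch of a product with its two projections is shown below. The field names, the constructor, and the reference return types are assumptions; only Product, first(), second(), TokenId, and TrainingExample are names taken from the book.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TokenId(pub usize);

/// A pair that keeps both components and their order.
#[derive(Debug, Clone, PartialEq)]
pub struct Product<A, B> {
    left: A,
    right: B,
}

impl<A, B> Product<A, B> {
    pub fn new(left: A, right: B) -> Self {
        Product { left, right }
    }

    /// Projection onto the first component.
    pub fn first(&self) -> &A {
        &self.left
    }

    /// Projection onto the second component.
    pub fn second(&self) -> &B {
        &self.right
    }
}

/// An input token paired with the target token that follows it.
pub type TrainingExample = Product<TokenId, TokenId>;
```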

Product-Input Morphism

Rust syntax:

Product<A, B> -> C
ScaledDotProductScores : QuerySequence x KeySequence -> AttentionScores
WeightedValueMixing : AttentionWeights x ValueSequence -> AttentionOutput

ML or software concept:

Some transformations need two meaningful inputs at the boundary. Attention scoring needs target-side queries and source-side keys. Value mixing needs attention weights and the source values being mixed.

Category theory concept:

A product-input morphism has a product object as its input:

A x B -> C

First-principles reading:

Do not erase the product just because the output has a familiar type. The product names the fact that two inputs must agree before the transformation is legal.

Law

Rust syntax:

assert_eq!(...)
information_order_obeys_preorder_laws()
pipeline_trace_obeys_monoid_laws()

ML or software concept:

A law is expected behavior that should keep working after implementation details change.

Category theory concept:

A law states the structure a model must preserve, such as identity, associativity, reflexivity, transitivity, or composition preservation.

First-principles reading:

A law is not decoration. In this repository, a law should have a nearby test or check. Otherwise the reader has no executable reason to trust the word.

Endomorphism

Rust syntax:

Endomorphism<T>
TrainStep : Parameters -> Parameters

ML concept:

A training step updates parameters and returns parameters again.

Category theory concept:

An endomorphism is an arrow from an object back to itself:

A -> A
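
Because the output type equals the input type, an endomorphism can be applied again to its own result. The sketch below assumes a fixed gradient and a plain learning-rate field to keep the example tiny; the repository's TrainStep computes its update from training data and may have a different signature.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Parameters {
    pub weights: Vec<f32>,
}

/// Hypothetical step with a precomputed gradient (an assumption for brevity).
pub struct TrainStep {
    pub learning_rate: f32,
    pub gradient: Vec<f32>,
}

impl TrainStep {
    /// Parameters -> Parameters: one gradient-descent update.
    pub fn apply(&self, params: Parameters) -> Result<Parameters, String> {
        if params.weights.len() != self.gradient.len() {
            return Err("gradient and weights have different lengths".to_string());
        }
        let weights: Vec<f32> = params
            .weights
            .iter()
            .zip(&self.gradient)
            .map(|(w, g)| w - self.learning_rate * g)
            .collect();
        Ok(Parameters { weights })
    }
}
```

Starting from weights 1.0 and 2.0 with learning rate 0.5 and gradient 0.5 and -0.5, one step gives 0.75 and 2.25, and feeding that result back in gives 0.5 and 2.5.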

Functor

Rust syntax:

Functor<A, B>
VecFunctor
OptionFunctor

ML concept:

Apply a transformation inside a wrapper such as a batch or optional value.

Category theory concept:

A functor maps objects and arrows while preserving structure.

First-principles reading:

For this book, the simplest functor intuition is map: apply a function inside a context without destroying the context. VecFunctor preserves the list shape. OptionFunctor preserves the difference between Some and None.
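
The shape-preserving behaviour can be sketched with two associated functions. This is an illustration only: the book's Functor<A, B> is a trait, and the fmap signatures below are assumptions rather than the repository's definitions.

```rust
pub struct VecFunctor;
pub struct OptionFunctor;

impl VecFunctor {
    /// Maps every element; the length and order of the list are preserved.
    pub fn fmap<A, B>(input: Vec<A>, f: impl Fn(A) -> B) -> Vec<B> {
        input.into_iter().map(f).collect()
    }
}

impl OptionFunctor {
    /// Maps the inner value if present; `None` stays `None`.
    pub fn fmap<A, B>(input: Option<A>, f: impl Fn(A) -> B) -> Option<B> {
        input.map(f)
    }
}
```

Mapping the identity function returns the input unchanged in both cases, which is the first functor law in miniature.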

Functor Map

Rust syntax:

fn map<U>(self, f: impl Fn(T) -> U) -> Distribution<U>

ML concept:

For a probabilistic output, map transforms every possible outcome while leaving the attached probabilities unchanged.

Category theory concept:

map lifts a deterministic function:

T -> U

into a context-aware transformation:

Distribution<T> -> Distribution<U>
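
A sketch of that lifting follows. It assumes a generic Distribution<T> stored as outcome-probability pairs, skips validation, and uses the hypothetical helpers from_pairs and outcomes; only the map signature is taken from the entry above.

```rust
/// Hypothetical representation: each outcome paired with its probability.
#[derive(Debug, Clone, PartialEq)]
pub struct Distribution<T> {
    outcomes: Vec<(T, f64)>,
}

impl<T> Distribution<T> {
    /// Unvalidated constructor, for illustration only.
    pub fn from_pairs(outcomes: Vec<(T, f64)>) -> Self {
        Distribution { outcomes }
    }

    pub fn outcomes(&self) -> &[(T, f64)] {
        &self.outcomes
    }

    /// Transforms every outcome and leaves each probability untouched.
    pub fn map<U>(self, f: impl Fn(T) -> U) -> Distribution<U> {
        Distribution {
            outcomes: self.outcomes.into_iter().map(|(t, p)| (f(t), p)).collect(),
        }
    }
}
```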

Natural Transformation

Rust syntax:

VecToFirstOption : Vec<A> -> Option<A>

ML concept:

Convert one container shape into another consistently, such as many candidates to maybe one selected candidate.

Category theory concept:

A natural transformation converts one functor shape into another and commutes with mapping.
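
"Commutes with mapping" can be checked directly: mapping and then converting should agree with converting and then mapping. The free functions below are hypothetical stand-ins for the repository's VecToFirstOption and its naturality test.

```rust
/// Vec<A> -> Option<A>: keep the first element, if any.
pub fn vec_to_first_option<A>(items: Vec<A>) -> Option<A> {
    items.into_iter().next()
}

/// Checks one naturality square for a given list and function.
pub fn naturality_square_commutes<A: Clone, B: PartialEq>(
    items: Vec<A>,
    f: impl Fn(A) -> B,
) -> bool {
    // Path 1: map inside the Vec, then convert to Option.
    let mapped: Vec<B> = items.clone().into_iter().map(&f).collect();
    let map_then_convert = vec_to_first_option(mapped);
    // Path 2: convert to Option, then map inside the Option.
    let convert_then_map = vec_to_first_option(items).map(&f);
    map_then_convert == convert_then_map
}
```

A passing check is evidence for the chosen list and function only; it does not prove naturality for every input.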

Monoid

Rust syntax:

PipelineTrace
Monoid::empty()
Monoid::combine()

ML concept:

Traces, logs, batches, and metric accumulators often need an empty value and a combine operation.

Category theory concept:

A monoid has an identity element and an associative binary operation.

First-principles reading:

A monoid is the structure behind “start empty, then combine many pieces.” That is why traces, logs, resource bundles, and accumulated updates are good software examples.
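
A minimal sketch of "start empty, then combine" is shown below. The trait method signatures, the Vec<String> representation of PipelineTrace, and the combine_all helper are assumptions for illustration.

```rust
pub trait Monoid {
    fn empty() -> Self;
    fn combine(self, other: Self) -> Self;
}

/// A trace is an ordered list of stage names (assumed representation).
#[derive(Debug, Clone, PartialEq)]
pub struct PipelineTrace(pub Vec<String>);

impl Monoid for PipelineTrace {
    fn empty() -> Self {
        PipelineTrace(Vec::new())
    }

    /// Concatenation: associative, with the empty trace as identity.
    fn combine(mut self, other: Self) -> Self {
        self.0.extend(other.0);
        self
    }
}

/// Folds any number of pieces, starting from the empty value.
pub fn combine_all<M: Monoid>(items: Vec<M>) -> M {
    items.into_iter().fold(M::empty(), M::combine)
}
```

Associativity is what makes the grouping of combine calls irrelevant, so combine_all can fold left to right without changing the result.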

Preorder

Rust syntax:

InformationLevel::can_flow_to

ML or software concept:

Information can flow from observation to feature to score to decision.

Category theory concept:

A preorder is reflexive and transitive.

First-principles reading:

In code, a preorder often appears as a “can flow to,” “can supply,” or “is no more than” relation. The important part is not sorting. The important part is that repeated comparisons remain coherent.
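
The sketch below shows one way such a relation could be written. The four variants follow the flow named in this entry; the private rank helper and the numeric encoding are assumptions, not the repository's implementation.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InformationLevel {
    Observation,
    Feature,
    Score,
    Decision,
}

impl InformationLevel {
    // Hypothetical encoding of the flow order.
    fn rank(self) -> u8 {
        match self {
            InformationLevel::Observation => 0,
            InformationLevel::Feature => 1,
            InformationLevel::Score => 2,
            InformationLevel::Decision => 3,
        }
    }

    /// Reflexive and transitive because `<=` on integers is.
    pub fn can_flow_to(self, other: Self) -> bool {
        self.rank() <= other.rank()
    }
}
```

With only four levels, both laws can be checked exhaustively in a test by looping over every pair and every triple.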

Galois Connection

Rust syntax:

abstract_to_layer_budget
concretize_layer_budget

ML or software concept:

Concrete feature counts and abstract layer budgets can be coordinated.

Category theory concept:

Two order-preserving views are connected by a law:

abstract(x) <= y iff x <= concretize(y)
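
One concrete pair of functions that satisfies this law is ceiling division paired with multiplication. The constant FEATURES_PER_LAYER and both function bodies are assumptions chosen for illustration; the repository's functions share only their names with this sketch.

```rust
/// Assumed capacity of one layer, for illustration only.
const FEATURES_PER_LAYER: u32 = 4;

/// Smallest layer budget that can hold `features` (ceiling division).
pub fn abstract_to_layer_budget(features: u32) -> u32 {
    (features + FEATURES_PER_LAYER - 1) / FEATURES_PER_LAYER
}

/// Largest feature count that `layers` layers can hold.
pub fn concretize_layer_budget(layers: u32) -> u32 {
    layers * FEATURES_PER_LAYER
}

/// The Galois law for one pair (x, y).
pub fn galois_law_holds(features: u32, layers: u32) -> bool {
    (abstract_to_layer_budget(features) <= layers)
        == (features <= concretize_layer_budget(layers))
}
```

For example, 9 features abstract to a budget of 3 layers, and 9 <= 4 * y holds exactly when 3 <= y.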

Monoidal Preorder

Rust syntax:

ResourceBundle::tensor
ResourceBundle::can_supply

ML or software concept:

Independent compute and memory resources can be combined.

Category theory concept:

A monoidal preorder is a preorder with a product-like combining operation that preserves order.

Profunctor

Rust syntax:

FeasibilityRelation::relates(requirement, offer)

ML or software concept:

A requirement and implementation offer are related if constraints are satisfied.

Category theory concept:

A profunctor generalizes a relationship between categories. This course uses a small Bool-valued relation as the practical handle.

Functorial Semantics

Rust syntax:

SignalMatrix::compose_after

ML or software concept:

Composed signal-flow stages should have the same meaning as composing their matrix interpretations.

Category theory concept:

Interpretation preserves composition.

Open System

Rust syntax:

OpenCircuit
OpenCircuit::then
OpenCircuit::parallel

ML or software concept:

A component has an external interface plus internal implementation details.

Category theory concept:

An open system composes through typed boundaries.

Commutative Diagram

Rust syntax:

composed_and_direct_prediction_match()
naturality_square_commutes()

ML or software concept:

Two different implementation paths should produce the same result.

Category theory concept:

A commutative diagram says that following one route through a diagram has the same meaning as following another route with the same start and end.

First-principles reading:

In this book, do not imagine a diagram first. Imagine two Rust expressions that should agree. The diagram is the picture of that agreement.

Sheaf-Style Locality

Rust syntax:

SafetyCover::global_truth

ML or software concept:

Local safety checks over time intervals combine into a global safety result.

Category theory concept:

Local facts can determine a global fact when they glue coherently.

Boundary

Rust syntax:

Distribution::new
TrainingSet::new
SignalMatrix::compose_after
OpenCircuit::then

ML or software concept:

A boundary is where invalid structure should be rejected before it spreads through the pipeline.

Category theory concept:

A boundary protects the intended object, morphism, relation, or composition from accepting values outside its domain.

First-principles reading:

Many exercises ask what a type or method prevents. That is a boundary question. Good boundaries make wrong connections hard to express.

Rust Terms

Newtype

Rust syntax:

pub struct TokenId(usize);

ML concept:

The same raw number type can represent different concepts. Newtypes prevent accidental mixing.

Category theory concept:

A newtype names a specific object instead of treating all raw representations as the same object.

First-principles reading:

A newtype is the smallest move from “just data” to “data with a role.” The runtime representation can stay cheap, but the type checker now knows that a token id, vocabulary size, and model dimension are not the same concept.
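
A short sketch of that move: three wrappers over the same raw usize that the compiler treats as unrelated types. VocabSize, ModelDim, and in_vocabulary are hypothetical names, and the public fields are a simplification; the entry above shows TokenId with a private field.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TokenId(pub usize);

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct VocabSize(pub usize);

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ModelDim(pub usize);

/// A token is valid only if it indexes into the vocabulary.
pub fn in_vocabulary(token: TokenId, vocab: VocabSize) -> bool {
    token.0 < vocab.0
}

// Swapping the arguments, or passing a ModelDim where a VocabSize is
// expected, is a compile error even though all three wrap a usize:
// in_vocabulary(VocabSize(10), TokenId(3)); // does not compile
```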

Smart Constructor

Rust syntax:

pub fn new(value: Raw) -> CtResult<Self>

ML concept:

Invalid training inputs, probabilities, dimensions, or hyperparameters should be rejected early.

Category theory concept:

A smart constructor maps raw data into a validated subobject, using Result when the mapping can fail.

Invariant

Rust syntax:

Distribution must be non-empty, finite, non-negative, and sum to one.

ML concept:

The model can trust a value only if the type protects the rule that makes it meaningful.

Category theory concept:

An invariant describes the subset or structure the object is meant to inhabit.
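
The four rules can be enforced in one smart constructor, sketched below. The CtError variants, the f32 element type, and the 1e-5 tolerance on the sum are assumptions; the repository's Distribution::new may check the same rules with different errors and thresholds.

```rust
// Hypothetical error variants, one per violated rule.
#[derive(Debug, Clone, PartialEq)]
pub enum CtError {
    EmptyDistribution,
    NonFiniteProbability,
    NegativeProbability,
    DoesNotSumToOne,
}

pub type CtResult<T> = Result<T, CtError>;

#[derive(Debug, Clone, PartialEq)]
pub struct Distribution(Vec<f32>);

impl Distribution {
    pub fn new(probabilities: Vec<f32>) -> CtResult<Self> {
        if probabilities.is_empty() {
            return Err(CtError::EmptyDistribution);
        }
        if probabilities.iter().any(|p| !p.is_finite()) {
            return Err(CtError::NonFiniteProbability);
        }
        if probabilities.iter().any(|p| *p < 0.0) {
            return Err(CtError::NegativeProbability);
        }
        let total: f32 = probabilities.iter().sum();
        // Assumed tolerance for floating-point rounding.
        if (total - 1.0).abs() > 1e-5 {
            return Err(CtError::DoesNotSumToOne);
        }
        Ok(Distribution(probabilities))
    }

    pub fn probabilities(&self) -> &[f32] {
        &self.0
    }
}
```

Because the field is private, every Distribution value outside this module must have passed through new, which is what lets later stages trust the invariant without rechecking it.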

Typed Error

Rust syntax:

CtError
CtResult<T>

ML concept:

Bad data should fail with a meaningful cause, not with a vague panic later.

Category theory concept:

Result turns a partial construction or morphism into a total error-aware mapping.

Negative Test

Rust syntax:

assert!(matches!(..., Err(...)))

ML or software concept:

A negative test checks that a specific invalid input or invalid connection is rejected.

Category theory concept:

It checks that a proposed object, relation, or composition is not admitted when the required structure is missing.

First-principles reading:

Positive tests show what works. Negative tests show what the boundary protects. Both are needed when a chapter claims that types make structure explicit.
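
A positive and a negative check can sit side by side, as in the sketch below. It assumes that a learning rate must be finite and strictly positive and uses a single hypothetical error variant; the repository's LearningRate::new may enforce a different rule.

```rust
#[derive(Debug)]
pub enum CtError {
    // Hypothetical variant name.
    InvalidLearningRate,
}

#[derive(Debug, Clone, Copy)]
pub struct LearningRate(f32);

impl LearningRate {
    /// Assumed invariant: finite and strictly positive.
    pub fn new(value: f32) -> Result<Self, CtError> {
        if value.is_finite() && value > 0.0 {
            Ok(LearningRate(value))
        } else {
            Err(CtError::InvalidLearningRate)
        }
    }

    pub fn value(self) -> f32 {
        self.0
    }
}

/// Positive check: a valid rate is accepted and preserved.
pub fn accepts_valid_rate() -> bool {
    matches!(LearningRate::new(0.1), Ok(rate) if rate.value() == 0.1)
}

/// Negative check: zero, negative, and NaN rates are all rejected.
pub fn rejects_invalid_rates() -> bool {
    [0.0, -0.1, f32::NAN]
        .iter()
        .all(|v| matches!(LearningRate::new(*v), Err(CtError::InvalidLearningRate)))
}
```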

Machine-Learning Terms

Token

Rust syntax:

TokenId

ML concept:

A token is a discrete symbol from a vocabulary.

Category theory concept:

The vocabulary is a finite discrete set, and each TokenId names one of its elements.

Training Example

Rust syntax:

pub type TrainingExample = Product<TokenId, TokenId>;

ML concept:

A training example pairs an input token with the target token that follows it.

Category theory concept:

It is a product object:

TokenId x TokenId

First-principles reading:

The product matters because the loss function needs both parts: the prediction derived from the first token and the target represented by the second token.

Training Set

Rust syntax:

TrainingSet
DatasetWindowing : TokenSequence -> TrainingSet

ML concept:

A training set is a non-empty collection of adjacent next-token examples.

Category theory concept:

It is an object produced by a data-preparation morphism and consumed by the training update.
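A minimal sketch of the windowing step, assuming a plain tuple in place of Product<TokenId, TokenId> and an Option in place of the crate's error type. The function name window_adjacent is hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TokenId(pub usize);

/// Stand-in for Product<TokenId, TokenId>: (input token, target token).
pub type TrainingExample = (TokenId, TokenId);

/// Pairs every token with its successor.
/// Returns None when fewer than two tokens are available, because the
/// resulting training set would be empty.
pub fn window_adjacent(tokens: &[TokenId]) -> Option<Vec<TrainingExample>> {
    if tokens.len() < 2 {
        return None;
    }
    Some(tokens.windows(2).map(|pair| (pair[0], pair[1])).collect())
}
```

A sequence of n tokens yields n - 1 examples, which is why a single-token sequence cannot produce a non-empty training set.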

Embedding

Rust syntax:

Embedding : TokenId -> Vector

ML concept:

An embedding maps a discrete token to a dense numerical representation.

Category theory concept:

It is a morphism from a finite token object into a vector-space-like object.

Logits

Rust syntax:

Logits(Vec<f32>)

ML concept:

Logits are raw scores before softmax.

Category theory concept:

They live in a vector-space-like object:

R^vocab_size

Softmax

Rust syntax:

Softmax : Logits -> Distribution

ML concept:

Softmax turns raw scores into probabilities.

Category theory concept:

It maps from a score vector into the probability simplex.
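The following sketch shows one common way to compute softmax over a slice of scores. It uses bare f32 slices rather than the book's Logits and Distribution types, and subtracting the maximum before exponentiating is a standard numerical-stability choice, not something confirmed from the repository.

```rust
/// Softmax over raw scores.
/// Returns None for empty input or non-finite scores (assumed policy).
pub fn softmax(logits: &[f32]) -> Option<Vec<f32>> {
    if logits.is_empty() || logits.iter().any(|x| !x.is_finite()) {
        return None;
    }
    // Shifting by the maximum leaves the result unchanged but avoids overflow.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|x| (x - max).exp()).collect();
    // The largest score contributes exp(0) = 1, so the total is at least 1.
    let total: f32 = exps.iter().sum();
    Some(exps.iter().map(|e| e / total).collect())
}
```

The output is non-negative and sums to one up to floating-point rounding, which is what places it in the probability simplex.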

Distribution

Rust syntax:

Distribution
Distribution::new

ML concept:

A distribution is a probability vector over possible next tokens. It must be non-empty, and its values must be finite, non-negative, and sum to one.

Category theory concept:

It is the object produced by softmax and consumed with a target token to produce loss.

First-principles reading:

A raw vector can contain any numbers. A Distribution is a vector that has earned the right to be read as probabilities.
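One way such a constructor could enforce the four rules is sketched below. The error variants and the 1e-5 tolerance on the sum are assumptions; the repository's Distribution::new may use different names and thresholds.

```rust
// Hypothetical error variants for illustration only.
#[derive(Debug, PartialEq)]
pub enum CtError {
    EmptyDistribution,
    InvalidProbability(f32),
    DoesNotSumToOne(f32),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Distribution(Vec<f32>);

impl Distribution {
    pub fn new(probs: Vec<f32>) -> Result<Self, CtError> {
        if probs.is_empty() {
            return Err(CtError::EmptyDistribution);
        }
        // Reject NaN, infinities, and negative entries.
        if let Some(&bad) = probs.iter().find(|p| !p.is_finite() || **p < 0.0) {
            return Err(CtError::InvalidProbability(bad));
        }
        // Floating-point sums are compared with a tolerance (assumed 1e-5).
        let total: f32 = probs.iter().sum();
        if (total - 1.0).abs() > 1e-5 {
            return Err(CtError::DoesNotSumToOne(total));
        }
        Ok(Self(probs))
    }

    pub fn probabilities(&self) -> &[f32] {
        &self.0
    }
}
```

Each rejected case maps to its own variant, so a caller can tell an empty vector from a vector that merely fails to sum to one.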

Cross Entropy

Rust syntax:

CrossEntropy : Product<Distribution, TokenId> -> Loss

ML concept:

Cross entropy penalizes the model according to how little probability it assigned to the correct target. With a single target token, the loss is the negative log of that probability.

Category theory concept:

It is a morphism from prediction-target product into non-negative scalar loss.
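For a single target token, this morphism reduces to a negative log of one probability. The sketch below uses a bare slice and index instead of Product<Distribution, TokenId>, and it returns None for a zero-probability target; that policy is an assumption, since an implementation could equally clamp the probability.

```rust
/// Cross entropy against one target index: -ln(p[target]).
/// Returns None when the index is out of range or the probability is not
/// in (0, 1] (assumed policy; the loss would otherwise be infinite or undefined).
pub fn cross_entropy(probs: &[f32], target: usize) -> Option<f32> {
    let p = *probs.get(target)?;
    if !(p > 0.0 && p <= 1.0) {
        return None;
    }
    Some(-p.ln())
}
```

A probability of 0.5 on the target gives a loss of ln 2, and the loss falls toward zero as that probability approaches one.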

Loss

Rust syntax:

Loss
Loss::new

ML concept:

Loss is a scalar penalty. Lower loss means the model assigned more probability to the correct target in this tiny pipeline.

Category theory concept:

Loss is the output object of the evaluation morphism:

Distribution x TokenId -> Loss

Parameters

Rust syntax:

Parameters

ML concept:

The trainable state of the model: embedding table, output head, and bias.

Category theory concept:

The object transformed by the training endomorphism.

First-principles reading:

The word “state” can be vague. In this book, the model state is concrete: embedding table, output head, and bias. Training means returning a new value of the same Parameters type.

Gradient

Rust syntax:

LocalGradient
grad_embedding
grad_lm_head
grad_bias

ML concept:

A gradient records how the loss changes with respect to each parameter. Stepping against the gradient reduces the loss.

Category theory concept:

Gradient flow is local derivative information passed backward, stage by stage, through a composite computation.

Learning Rate

Rust syntax:

LearningRate

ML concept:

The scalar step size in gradient descent.

Category theory concept:

It chooses a specific update morphism from a family of parameter endomorphisms.
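The idea of a family of updates indexed by the step size can be sketched with a function that fixes the learning rate and returns the resulting update. Flat f32 slices stand in for Parameters and LocalGradient, and the names sgd_step and update_for are hypothetical.

```rust
/// One gradient-descent step, elementwise: p <- p - lr * g.
/// Returns None on a length mismatch or a non-positive, non-finite rate.
pub fn sgd_step(params: &[f32], grads: &[f32], lr: f32) -> Option<Vec<f32>> {
    if params.len() != grads.len() || !(lr.is_finite() && lr > 0.0) {
        return None;
    }
    Some(params.iter().zip(grads).map(|(p, g)| p - lr * g).collect())
}

/// Fixing the learning rate selects one update from the family.
pub fn update_for(lr: f32) -> impl Fn(&[f32], &[f32]) -> Option<Vec<f32>> {
    move |params, grads| sgd_step(params, grads, lr)
}
```

The returned closure has the same input and output parameter shape for every learning rate; only the size of the step differs.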

End-To-End Pipeline

Rust syntax:

TokenSequence -> TrainingSet
TokenId -> Vector -> Logits -> Distribution
Distribution x TokenId -> Loss
Parameters -> Parameters

ML concept:

The full tiny system turns text into training examples, predicts a next-token distribution, evaluates loss, and updates parameters.

Category theory concept:

The full pipeline is a collection of composable typed transformations, with training represented as a repeatable endomorphism on model state.

Chain Rule

Rust syntax:

MulOp::backward

ML concept:

The chain rule lets local derivatives combine into gradients for a larger computation.

Category theory concept:

It is composition of local derivative maps.
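A small worked case: for L = (a * b) * c, each multiplication only knows its own local derivatives, and the chain rule composes them. The MulOp below is a hypothetical stand-in with an assumed interface, not the repository's type.

```rust
/// Hypothetical multiplication node: z = x * y.
pub struct MulOp {
    x: f32,
    y: f32,
}

impl MulOp {
    pub fn new(x: f32, y: f32) -> Self {
        Self { x, y }
    }

    pub fn forward(&self) -> f32 {
        self.x * self.y
    }

    /// Given the upstream derivative dL/dz, returns (dL/dx, dL/dy).
    /// Locally dz/dx = y and dz/dy = x.
    pub fn backward(&self, upstream: f32) -> (f32, f32) {
        (upstream * self.y, upstream * self.x)
    }
}

/// Gradients of L = (a * b) * c with respect to a, b, and c.
pub fn grad_of_product_chain(a: f32, b: f32, c: f32) -> (f32, f32, f32) {
    let inner = MulOp::new(a, b);
    let outer = MulOp::new(inner.forward(), c);
    // dL/dL = 1 seeds the backward pass.
    let (d_inner, d_c) = outer.backward(1.0);
    // The inner node receives the outer node's derivative as its upstream.
    let (d_a, d_b) = inner.backward(d_inner);
    (d_a, d_b, d_c)
}
```

With a = 2, b = 3, c = 4 the result is (12, 8, 6), matching the hand derivatives b*c, a*c, and a*b.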

Target And Source Sequence Length

Rust syntax:

QuerySequence
KeySequence
ValueSequence
AttentionScores
AttentionMask

ML concept:

The target sequence length is the number of query positions that ask for information. The source sequence length is the number of key-value positions that can be read. In self-attention they are often the same sequence. In cross-attention they can come from different sequences.

Category theory concept:

The attention boundary keeps two roles visible:

Target positions x Source positions -> attention weights

First-principles reading:

This is why the book uses role-specific names instead of one generic matrix name. A mask of shape L x S answers a concrete question: for each target position, which source positions may be read?

Attention Scores

Rust syntax:

QuerySequence
KeySequence
ScaledDotProductScores : QuerySequence x KeySequence -> AttentionScores
AttentionScores

ML concept:

Attention scores are query-by-key compatibility values before softmax. The scaled dot-product boundary computes one score for each query and key pair.

Category theory concept:

ScaledDotProductScores is a morphism from a product of role-specific sequence objects into a score table. AttentionScores is an object whose rows can be transformed into probability-like attention weights.

First-principles reading:

The shape matters. Query and key sequences may have different lengths, but they must share the same head dimension before dot products make sense. A score table must have at least one query row, at least one key column, and the same number of key columns in every row.
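The shape rules above can be sketched with nested vectors in place of QuerySequence, KeySequence, and AttentionScores. The function name and the Option return type are assumptions for this illustration.

```rust
/// Scaled dot-product scores: one row per query, one column per key.
/// Returns None when either sequence is empty, the head dimension is zero,
/// or any row disagrees with the shared head dimension.
pub fn scaled_dot_product_scores(
    queries: &[Vec<f32>],
    keys: &[Vec<f32>],
) -> Option<Vec<Vec<f32>>> {
    let head_dim = queries.first()?.len();
    if head_dim == 0 || keys.is_empty() {
        return None;
    }
    if queries.iter().chain(keys).any(|row| row.len() != head_dim) {
        return None;
    }
    let scale = (head_dim as f32).sqrt();
    let scores = queries
        .iter()
        .map(|q| {
            keys.iter()
                .map(|k| q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() / scale)
                .collect()
        })
        .collect();
    Some(scores)
}
```

Two queries against three keys produce a 2 x 3 table, so the query and key lengths stay independent while the head dimension is shared.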

Hidden-To-Role Projections

Rust syntax:

HiddenToQuery : HiddenSequence -> QuerySequence
HiddenToKey : HiddenSequence -> KeySequence
HiddenToValue : HiddenSequence -> ValueSequence

ML concept:

Self-attention begins by projecting hidden states into query, key, and value roles. The rows may all be numbers, but the roles are not interchangeable.

Category theory concept:

These are parallel morphisms from one source object:

HiddenSequence -> QuerySequence
HiddenSequence -> KeySequence
HiddenSequence -> ValueSequence

First-principles reading:

The projection constructors validate matrix shape and finite values. The application step checks that the hidden sequence width matches the projection input width before producing role-specific sequence objects.

Self-Attention

Rust syntax:

SelfAttentionHead
MultiHeadTransformerBlock : HiddenSequence -> HiddenSequence

ML concept:

Self-attention means the query, key, and value roles all come from the same hidden sequence. The roles are still distinct after projection, but their source ownership is shared.

Category theory concept:

The internal attention path still contains product-input boundaries:

QuerySequence x KeySequence -> AttentionScores
AttentionWeights x ValueSequence -> AttentionOutput

The surrounding block can have endomorphism shape only after the internal composition returns to the same public object:

HiddenSequence -> HiddenSequence

First-principles reading:

Self-attention is not permission to call every internal step an endomorphism. It is the case where one source hidden sequence is projected into the query, key, and value roles before scoring and mixing.

Cross-Attention

Rust syntax:

QuerySequence
KeySequence
ValueSequence

ML concept:

Cross-attention means the target-side query sequence reads from a separate source-side key-value sequence. The current repository names this boundary for precision, but it does not yet implement a full cross-attention block.

Category theory concept:

The source split makes the product input impossible to hide:

TargetHiddenSequence -> QuerySequence
SourceHiddenSequence -> KeySequence
SourceHiddenSequence -> ValueSequence
QuerySequence x KeySequence -> AttentionScores
AttentionWeights x ValueSequence -> AttentionOutput

First-principles reading:

When the target sequence and source sequence are not the same object, the attention map has target rows and source columns. That is the shape reason to keep L and S separate in explanations, masks, and tests.

Attention Mask

Rust syntax:

AttentionMask
MaskedAttentionScores : AttentionScores x AttentionMask -> AttentionScores

ML concept:

An attention mask marks which key positions each query is allowed to attend to. Disallowed score positions become a large negative value before softmax, so their probability becomes negligible.

Read the mask as a permission table, not as a shorter token sequence. A mask cell answers:

may this query row read this source column?

It selects legal score cells before probability normalization. It does not directly produce AttentionWeights; softmax still turns the remaining score row into weights.

Category theory concept:

MaskedAttentionScores is a typed morphism from a product object back to the score object:

AttentionScores x AttentionMask -> AttentionScores

First-principles reading:

Every mask row must allow at least one key. Otherwise softmax would be asked to choose among no legal positions.

Recovery rule:

mask cells select legal score cells
softmax turns remaining score rows into weights
weights read value rows
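The first line of the recovery rule can be sketched as a function over nested vectors. The constant -1.0e9 is an assumed choice of large negative value, and the boolean mask encoding (true means the query may read the key) is also an assumption.

```rust
/// Assumed stand-in for the large negative score given to disallowed cells.
pub const MASKED_SCORE: f32 = -1.0e9;

/// Replaces disallowed score cells with MASKED_SCORE.
/// Returns None when shapes disagree or a mask row allows no key at all.
pub fn apply_mask(scores: &[Vec<f32>], mask: &[Vec<bool>]) -> Option<Vec<Vec<f32>>> {
    if scores.is_empty() || scores.len() != mask.len() {
        return None;
    }
    let mut masked: Vec<Vec<f32>> = Vec::with_capacity(scores.len());
    for (score_row, mask_row) in scores.iter().zip(mask) {
        let any_allowed = mask_row.iter().any(|&allowed| allowed);
        if score_row.len() != mask_row.len() || !any_allowed {
            return None;
        }
        let row: Vec<f32> = score_row
            .iter()
            .zip(mask_row)
            .map(|(&score, &allowed)| if allowed { score } else { MASKED_SCORE })
            .collect();
        masked.push(row);
    }
    Some(masked)
}
```

The output is still a score table of the same shape. Turning it into weights remains the job of the softmax step.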

Attention Weights

Rust syntax:

AttentionWeights
AttentionSoftmax : AttentionScores -> AttentionWeights

ML concept:

Attention weights are row-wise probabilities over key positions. Each query position receives its own distribution over the positions it can attend to.

Category theory concept:

AttentionSoftmax is a typed morphism from raw score rows to validated probability rows.

First-principles reading:

This is one Transformer-roadmap boundary made executable in the crate. It validates the probability-like score-to-weight step after query-key scoring and masking have produced legal score rows.

Value Mixing

Rust syntax:

ValueSequence
WeightedValueMixing : AttentionWeights x ValueSequence -> AttentionOutput
AttentionOutput

ML concept:

Value mixing uses each query row of attention weights to compute a weighted sum of value vectors. The result has one output vector per query position.

Category theory concept:

WeightedValueMixing is a morphism from a product object to an output object:

AttentionWeights x ValueSequence -> AttentionOutput

First-principles reading:

The key length of the weights must match the number of value rows. If a query has weights over three source positions, the value sequence must provide three source vectors to mix.
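A minimal sketch of the mixing step, with nested vectors standing in for AttentionWeights, ValueSequence, and AttentionOutput. The function name mix_values is hypothetical.

```rust
/// Weighted sum of value rows, one output row per query row of weights.
/// Returns None when the weight row length differs from the number of
/// value rows, or when value rows have inconsistent widths.
pub fn mix_values(weights: &[Vec<f32>], values: &[Vec<f32>]) -> Option<Vec<Vec<f32>>> {
    let width = values.first()?.len();
    if weights.is_empty() || values.iter().any(|v| v.len() != width) {
        return None;
    }
    if weights.iter().any(|row| row.len() != values.len()) {
        return None;
    }
    let mixed = weights
        .iter()
        .map(|row| {
            let mut out = vec![0.0_f32; width];
            for (w, value) in row.iter().zip(values) {
                for (o, v) in out.iter_mut().zip(value) {
                    *o += w * v;
                }
            }
            out
        })
        .collect();
    Some(mixed)
}
```

Two query rows over three source positions give two output vectors, each as wide as one value vector.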

Multi-Head Concatenation

Rust syntax:

HeadCount
AttentionHeadOutputs
ConcatenateHeads : AttentionHeadOutputs -> MultiHeadOutput
MultiHeadOutput

ML concept:

Several attention heads can produce one output sequence each. Concatenation combines the feature vectors at each sequence position so later layers can read all head outputs together.

Category theory concept:

ConcatenateHeads is a recombination morphism:

AttentionHeadOutputs -> MultiHeadOutput

First-principles reading:

The constructor checks that every head has the same sequence length and head dimension before concatenation. The resulting model dimension is the head count multiplied by the head dimension. This is the typed boundary where separate head outputs become one combined object.
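The shape check and the recombination can be sketched together. Each head is represented here as a nested vector of shape sequence length by head dimension; the function name and Option return are assumptions.

```rust
/// Concatenates per-head outputs along the feature axis.
/// Every head must have the same sequence length and head dimension.
pub fn concatenate_heads(heads: &[Vec<Vec<f32>>]) -> Option<Vec<Vec<f32>>> {
    let first = heads.first()?;
    let seq_len = first.len();
    let head_dim = first.first()?.len();
    let same_shape = heads
        .iter()
        .all(|h| h.len() == seq_len && h.iter().all(|row| row.len() == head_dim));
    if head_dim == 0 || !same_shape {
        return None;
    }
    // For each position, lay the head rows side by side.
    let combined = (0..seq_len)
        .map(|pos| heads.iter().flat_map(|h| h[pos].iter().copied()).collect())
        .collect();
    Some(combined)
}
```

With two heads of dimension 2, each output row has width 2 * 2 = 4, and the sequence length is unchanged.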

Attention Output Projection

Rust syntax:

AttentionOutputProjection
AttentionOutputProjection : MultiHeadOutput -> ProjectedAttentionOutput
ProjectedAttentionOutput

ML concept:

After head outputs are concatenated, a learned linear projection mixes features across heads and returns the sequence to the width expected by the surrounding model block.

Category theory concept:

AttentionOutputProjection is a morphism:

MultiHeadOutput -> ProjectedAttentionOutput

First-principles reading:

The projection validates its matrix and bias before use. It also checks that the MultiHeadOutput width matches the projection input width. This keeps the post-concatenation linear map from becoming an untyped matrix multiply hidden inside the example.

Residual Connection

Rust syntax:

HiddenSequence
ResidualConnection : HiddenSequence x ProjectedAttentionOutput -> HiddenSequence

ML concept:

A residual connection adds a sublayer output back to the hidden sequence it came from. The addition is only meaningful when every sequence position has the same hidden width on both sides.

Category theory concept:

ResidualConnection is a product-to-object morphism:

HiddenSequence x ProjectedAttentionOutput -> HiddenSequence

The larger Transformer block can still have endomorphism shape:

HiddenSequence -> HiddenSequence

First-principles reading:

Residual addition is not just vector arithmetic. It is a shape contract. The sequence length and model dimension must match before addition can preserve the hidden sequence object.
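The shape contract can be sketched as a checked elementwise addition. Nested vectors stand in for HiddenSequence and ProjectedAttentionOutput, and the function name residual_add is hypothetical.

```rust
/// Adds a sublayer output back onto the hidden sequence it came from.
/// Returns None unless both sides share sequence length and row width.
pub fn residual_add(hidden: &[Vec<f32>], sublayer: &[Vec<f32>]) -> Option<Vec<Vec<f32>>> {
    if hidden.is_empty() || hidden.len() != sublayer.len() {
        return None;
    }
    hidden
        .iter()
        .zip(sublayer)
        .map(|(h, s)| {
            if h.len() != s.len() {
                return None;
            }
            let row: Vec<f32> = h.iter().zip(s).map(|(a, b)| a + b).collect();
            Some(row)
        })
        .collect()
}
```

Collecting an iterator of Option rows into one Option means a single mismatched row rejects the whole addition.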

Layer Normalization

Rust syntax:

LayerNormParameters
LayerNormalization : HiddenSequence -> HiddenSequence

ML concept:

Layer normalization normalizes each hidden vector across its feature dimension. It keeps the sequence length and model dimension unchanged.

Category theory concept:

LayerNormalization is an endomorphism:

HiddenSequence -> HiddenSequence

First-principles reading:

The operation changes values, not the object type. The parameter object protects the scale, shift, and epsilon invariants before a hidden sequence can be normalized.
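For one hidden vector, the normalization can be sketched as below. The scale, shift, and epsilon arguments mirror the parameters named above, but the function signature and the use of the biased variance are assumptions about a typical layer-norm formulation, not the repository's code.

```rust
/// Normalizes one hidden vector across its feature dimension:
/// out_i = scale_i * (x_i - mean) / sqrt(variance + epsilon) + shift_i.
/// Returns None on empty input, mismatched parameter widths, or a
/// non-positive epsilon.
pub fn layer_norm_row(
    row: &[f32],
    scale: &[f32],
    shift: &[f32],
    epsilon: f32,
) -> Option<Vec<f32>> {
    let n = row.len();
    if n == 0 || scale.len() != n || shift.len() != n || !(epsilon > 0.0) {
        return None;
    }
    let mean = row.iter().sum::<f32>() / n as f32;
    let variance = row.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n as f32;
    let denom = (variance + epsilon).sqrt();
    Some(
        row.iter()
            .zip(scale)
            .zip(shift)
            .map(|((x, g), b)| g * (x - mean) / denom + b)
            .collect(),
    )
}
```

The output has the same width as the input, so applying the function to every row keeps the hidden sequence shape unchanged.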

Position-Wise Feed-Forward

Rust syntax:

PositionWiseFeedForward : HiddenSequence -> HiddenSequence

ML concept:

A position-wise feed-forward network applies the same two-layer non-linear map to every hidden vector in the sequence. It can expand the feature dimension internally, apply an activation, then project back to the original model dimension.

Category theory concept:

PositionWiseFeedForward is an endomorphism:

HiddenSequence -> HiddenSequence

First-principles reading:

The internal feed-forward width is allowed to differ from the model dimension, but the public output must return to the same hidden sequence shape. The type protects that shape before later blocks try to compose with it.
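The expand, activate, and project-back pattern can be sketched as follows. The FeedForward struct, its field names, the row-per-output weight layout, and the ReLU activation are assumptions for illustration.

```rust
/// Hypothetical parameter bundle. Each weight matrix stores one row per
/// output feature.
pub struct FeedForward {
    pub w1: Vec<Vec<f32>>,
    pub b1: Vec<f32>,
    pub w2: Vec<Vec<f32>>,
    pub b2: Vec<f32>,
}

/// Affine map y = W x + b with shape checks.
fn linear(input: &[f32], weights: &[Vec<f32>], bias: &[f32]) -> Option<Vec<f32>> {
    if weights.len() != bias.len() || weights.iter().any(|w| w.len() != input.len()) {
        return None;
    }
    Some(
        weights
            .iter()
            .zip(bias)
            .map(|(w, b)| w.iter().zip(input).map(|(wi, xi)| wi * xi).sum::<f32>() + b)
            .collect(),
    )
}

/// Applies the same two-layer map to every row of the hidden sequence.
/// Returns None if the output width does not return to the input width.
pub fn feed_forward(hidden: &[Vec<f32>], ff: &FeedForward) -> Option<Vec<Vec<f32>>> {
    hidden
        .iter()
        .map(|row| {
            // Expand to the inner width, then apply ReLU.
            let expanded: Vec<f32> = linear(row, &ff.w1, &ff.b1)?
                .into_iter()
                .map(|x| x.max(0.0))
                .collect();
            // Project back to the model dimension.
            let projected = linear(&expanded, &ff.w2, &ff.b2)?;
            if projected.len() == row.len() {
                Some(projected)
            } else {
                None
            }
        })
        .collect()
}
```

The inner width (the number of rows in w1) is free to differ from the model dimension. The final length check is what restores the HiddenSequence -> HiddenSequence shape.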

Positional Encoding

Rust syntax:

PositionalEncoding : HiddenSequence -> HiddenSequence

ML concept:

Position information lets a sequence model distinguish the first token from the second token even when their content vectors are otherwise similar.

Category theory concept:

PositionalEncoding is an endomorphism:

HiddenSequence -> HiddenSequence

First-principles reading:

The encoding table must have enough rows for the hidden sequence and the same model width. Adding position changes the values at each row, not the public shape of the hidden sequence.

Single-Head Transformer Block

Rust syntax:

SingleHeadTransformerBlock : HiddenSequence -> HiddenSequence

ML concept:

The single-head block sketch composes hidden-to-role projections, attention, output projection, residual addition, normalization, and a feed-forward sublayer. It is intentionally small: one head and no production training machinery.

Category theory concept:

SingleHeadTransformerBlock is an endomorphism:

HiddenSequence -> HiddenSequence

First-principles reading:

The block is useful because it hides internal steps without hiding shape contracts. The caller sees one sequence-preserving transformation; the constructor still checks the dimensions that make the internal composition legal.

Multi-Head Transformer Block

Rust syntax:

SelfAttentionHead
MultiHeadTransformerBlock : HiddenSequence -> HiddenSequence

ML concept:

A multi-head block applies several self-attention heads in parallel, concatenates their outputs, projects back to the model dimension, then applies the same residual, normalization, and feed-forward shape-preserving pattern.

Category theory concept:

MultiHeadTransformerBlock is an endomorphism:

HiddenSequence -> HiddenSequence

First-principles reading:

The block checks that every head accepts the same hidden width, every value head has the same output width, and the output projection expects exactly:

head_count * value_head_dimension

Those checks keep multi-head attention as explicit structure rather than an unlabeled matrix pile.

Masked Multi-Head Transformer Block

Rust syntax:

MaskedMultiHeadTransformerBlock : HiddenSequence x AttentionMask -> HiddenSequence

ML concept:

A masked block runs the same multi-head path while preventing disallowed query-key positions from receiving attention probability.

Category theory concept:

MaskedMultiHeadTransformerBlock consumes a product object:

HiddenSequence x AttentionMask -> HiddenSequence

First-principles reading:

The mask is not a side channel. It is an explicit input to the block. The mask shape must match the query-by-key score table produced inside each head.

Fixed Mask View

Rust syntax:

AttentionMask
MaskedMultiHeadTransformerBlock : HiddenSequence x AttentionMask -> HiddenSequence

ML concept:

A fixed mask view means a particular mask has already been chosen for this run. For example, one training example may reuse the same allowed-position pattern every time the block is applied to its hidden sequence.

Category theory concept:

The open boundary is product-input:

HiddenSequence x AttentionMask -> HiddenSequence

After choosing one concrete mask as context, that specific run can induce a unary map:

HiddenSequence -> HiddenSequence

First-principles reading:

Do not erase the mask to get a cleaner category name. Either keep the open product-input boundary visible, or say exactly which AttentionMask was fixed before calling the result a HiddenSequence -> HiddenSequence view.

Sequence Logits

Rust syntax:

SequenceLogits

ML concept:

Sequence logits are unnormalized vocabulary scores for each position in a hidden sequence.

Category theory concept:

They are the output object of a sequence-level readout morphism:

HiddenSequence -> SequenceLogits

First-principles reading:

The object keeps sequence length and vocabulary size explicit. That prevents a sequence readout from becoming an unlabeled table of floats.

Transformer Readout

Rust syntax:

TransformerReadout : HiddenSequence -> SequenceLogits

ML concept:

A readout maps each final hidden vector to vocabulary scores. It is the sequence-level version of the earlier Vector -> Logits language-model head.

Category theory concept:

TransformerReadout is a morphism from hidden sequence object to sequence logit object:

HiddenSequence -> SequenceLogits

First-principles reading:

The readout validates the input model dimension and vocabulary width before projecting rows. The model should fail at the boundary, not inside an indexing loop.

Tiny Transformer Parameters

Rust syntax:

TinyTransformerParameters : HiddenSequence x AttentionMask -> SequenceLogits

ML concept:

The parameter object owns the position table, masked block, and sequence readout needed for the tiny Transformer forward path.

Category theory concept:

It is a product-to-object morphism:

HiddenSequence x AttentionMask -> SequenceLogits

First-principles reading:

The object groups named roles. The point is not to claim a production Transformer; the point is to stop passing unrelated matrices as loose arguments.

Transformer Training State

Rust syntax:

TransformerTrainingState

ML concept:

A training state owns parameters, a learning rate, and a step count. The current code can evaluate through the structured state and record that a new parameter object belongs to the next step.

Category theory concept:

The forward path has shape:

HiddenSequence x AttentionMask -> SequenceLogits

The optimizer updates the state with endomorphism shape:

TransformerTrainingState -> TransformerTrainingState

First-principles reading:

This is honest scaffolding. It models the state boundary the current tiny optimizer updates, without pretending that the teaching implementation is a production Transformer trainer.

Transformer Readout Training Example

Rust syntax:

TransformerReadoutTrainingExample

ML concept:

A readout training example pairs one hidden sequence and attention mask with a target token at every sequence position.

Category theory concept:

It is a validated product-like learning object that feeds a training endomorphism:

TransformerTrainingState -> TransformerTrainingState

First-principles reading:

The hidden sequence length, mask shape, and target-token count must agree before a training step can compute a meaningful loss.

Transformer Readout Train Step

Rust syntax:

TransformerReadoutTrainStep : TransformerTrainingState -> TransformerTrainingState

ML concept:

This step updates only the sequence readout. It keeps the position table and attention block fixed, computes softmax cross-entropy gradients at each sequence position, updates the readout weights and bias, and increments the step count.

Category theory concept:

The update is an endomorphism:

TransformerTrainingState -> TransformerTrainingState

First-principles reading:

This is a real update with a narrow scope. It teaches how the structured state can change without claiming that gradients already flow through every Transformer block parameter.

Transformer Feed-Forward Training Example

Rust syntax:

TransformerFeedForwardTrainingExample

ML concept:

A local feed-forward training example pairs a hidden-sequence input with a hidden-sequence target. It trains the feed-forward sublayer as a small supervised map before the book attempts full block gradients.

Category theory concept:

It is a validated training object for an endomorphism:

TransformerTrainingState -> TransformerTrainingState

First-principles reading:

The input and target must have the same sequence length and model dimension. Otherwise the squared-error training signal would compare incompatible hidden objects.

Transformer Feed-Forward Train Step

Rust syntax:

TransformerFeedForwardTrainStep : TransformerTrainingState -> TransformerTrainingState

ML concept:

This step updates the position-wise feed-forward sublayer. It computes a local squared-error gradient through the second linear layer, the ReLU gate, and the first linear layer. It leaves attention and readout parameters fixed.

Category theory concept:

The update is another endomorphism:

TransformerTrainingState -> TransformerTrainingState

First-principles reading:

This is one layer deeper than readout-only training, but it is still not full Transformer backpropagation. It is a deliberately scoped way to show that a structured state can update an internal block component without erasing the roles of the other components.

Transformer Block Training Example

Rust syntax:

TransformerBlockTrainingExample

ML concept:

A block training example pairs an input hidden sequence and attention mask with target tokens. The loss starts at sequence logits, not at a hand-written hidden target.

Category theory concept:

It is a supervised object for a state endomorphism:

TransformerTrainingState -> TransformerTrainingState

First-principles reading:

The hidden sequence length, mask shape, and target-token count must agree because the training signal is position-wise. Every row in the hidden sequence produces one vocabulary score row and expects one target token.

Transformer Block Train Step

Rust syntax:

TransformerBlockTrainStep : TransformerTrainingState -> TransformerTrainingState

ML concept:

This step updates the sequence readout, position-wise feed-forward sublayer, and attention output projection from the same token-level loss. It computes the softmax cross-entropy gradient at the readout, backpropagates through the final layer-normalization boundary and residual addition, updates the feed-forward layers through the ReLU gate, then carries the signal through the attention normalization and residual boundary to the attention output projection.

Category theory concept:

The update is a composed endomorphism:

TransformerTrainingState -> TransformerTrainingState

First-principles reading:

This is the first update in the repository where a token prediction loss reaches inside the Transformer block and updates the attention output projection, query/key/value projections, and both layer-normalization scale/shift parameter sets. It still keeps position encodings fixed. That boundary is deliberate: the implemented step is real, but it is not pretending to be a production training algorithm.

Where This Leaves Us

The glossary is not a substitute for the chapters. It is the index of the book’s repeated translation habit. When a term feels unfamiliar, connect it back to one of three things: the Rust syntax that names it, the ML or software role that motivates it, and the categorical shape that explains how it composes.