References

The problem this chapter solves is:

The course uses small Rust examples. These references point to the larger Rust, ML, category-theory, Transformer, and learning-science treatments behind those examples.

Use references as a source map, not as decoration. A useful reference should help answer at least one of the book’s three recurring questions:

Rust syntax:
which source file in this course uses the idea?

ML concept:
which model, training, or learning behavior does the source explain?

Category theory concept:
which object, morphism, composition, product, endomorphism, functor, monoid, or law does it deepen?

How To Read Source Roles

Not every reference has the same job. Use this order when deciding what a source can support:

| Source role | Examples in this chapter | Use it for | Do not use it for |
| --- | --- | --- | --- |
| repository code and tests | src/domain.rs, src/ml.rs, src/attention.rs, examples/ | the executable claim this book actually makes | replacing the larger math or framework source |
| official documentation | Rust Book, PyTorch docs, TensorFlow/Keras docs, Hugging Face docs | language behavior, API shape, framework boundary checks | proving a category-theory law by itself |
| academic papers | Seven Sketches, Attention Is All You Need, Layer Normalization, Backprop as Functor | original claims, formal scope, research vocabulary | claiming the tiny Rust code implements the whole paper |
| open textbooks and university material | Dive into Deep Learning, CS231n, MIT Applied Category Theory | pedagogy, intuition, course sequence, worked explanations | overriding repository code or official API docs |
| implementation bridges | The Annotated Transformer, The Illustrated Transformer, developer discussions | connecting notation to code and finding likely reader confusions | serving as final authority for definitions or laws |
| learner-friction signals | review reports, public questions, workshop notes | deciding what to explain more slowly | proving technical correctness |

When two sources disagree in vocabulary, prefer the source that owns the boundary. Rust documentation owns Rust syntax. Framework documentation owns framework API shape. Academic papers own the formal claim they introduce. This book’s code owns only the smaller executable teaching claim that the chapter states and tests.

Chapter Reference Map

| Chapter | Central question | Best references to use while rewriting |
| --- | --- | --- |
| Welcome | What is the book’s promise and reading contract? | How People Learn II, Rust By Example, Seven Sketches |
| Course Map | How do the Rust files, ML pipeline, and category-theory vocabulary fit together? | How People Learn II, Rust modules, Rust By Example, Seven Sketches, Category Theory for Programming |
| Domain Objects | Why should meaningful ML values become separate Rust types? | Rust structs, Rust By Example: New Type Idiom, Rust enums, Rust API Guidelines, Result error handling |
| Morphism and Composition | How do typed transformations compose safely? | Rust traits, Rust generics, Stanford Encyclopedia of Philosophy: Category Theory, Seven Sketches, Category Theory for Programming |
| The Tiny ML Pipeline | How do token pairs become prediction, probability, and loss? | Dive into Deep Learning: Softmax Regression, Softmax from Scratch, Accurate Computation of the Log-Sum-Exp and Softmax Functions, On Calibration of Modern Neural Networks, CS231n Linear Classification, PyTorch CrossEntropyLoss, Deep Learning |
| Training as an Endomorphism | Why is one training step a repeatable Parameters -> Parameters update? | D2L Gradient Descent, CS231n Optimization, D2L Backpropagation and Computational Graphs, Automatic differentiation in machine learning: a survey, PyTorch torch.optim, The Matrix Calculus You Need For Deep Learning, Backprop as Functor, Learners’ Languages, Generalized Gradient Descent is a Hypergraph Functor |
| Functors, Naturality, Monoids, and Chain Rule | Which recurring structures appear after the first ML pipeline works? | Categories for the Working Mathematician, Backprop as Functor, Seven Sketches, Category Theory for Programming, Category Theory in Machine Learning, Learners’ Languages, D2L Backpropagation and Computational Graphs, Automatic differentiation in machine learning: a survey, PyTorch Autograd mechanics, The Matrix Calculus You Need For Deep Learning |
| Seven Sketches Through Rust | How can applied category theory become concrete enough to inspect in Rust? | Seven Sketches, MIT Applied Category Theory OCW, Compositional Deep Learning, Categorical Deep Learning, Learning Functors using Gradient Descent, Category Theory in Machine Learning, Category Theory for Programming, Rust Book: Enums, Rust Book: Traits |
| Exercises | How does the reader prove they can transfer the method? | How People Learn II, Improving Students’ Learning With Effective Learning Techniques, Test-Enhanced Learning, Structuring the Transition From Example Study to Problem Solving, Counteracting detrimental effects of misconceptions, Writing Automated Tests, Rust By Example: Tests, CS231n Optimization: numerical gradients, CS231n: Neural Networks Part 3, PyTorch gradcheck, D2L Backpropagation and Computational Graphs, Rust API Guidelines |
| Challenges | How can compiler-fix exercises and paper-to-code translations turn ideas into public practice? | Rustlings Usage, Rustlings Community Exercises, Adam: A Method for Stochastic Optimization, PyTorch Adam, PyTorch torch.optim, Writing Automated Tests |
| Transformer Roadmap | How does the tiny system grow toward attention and Transformer blocks? | Attention Is All You Need, NeurIPS proceedings page, D2L Attention and Transformers, D2L Transformer Architecture, D2L Queries, Keys, and Values, D2L Attention Scoring Functions, D2L Parameter Management, D2L Softmax From Scratch, D2L Gradient Descent, CS231n Neural Networks Part 3, PyTorch gradcheck, Hugging Face Course: How do Transformers work?, Hugging Face Transformers Model Outputs, PyTorch MultiheadAttention, PyTorch Transformer, TensorFlow Keras MultiHeadAttention, PyTorch scaled_dot_product_attention, PyTorch Transformer building blocks tutorial, PyTorch TransformerEncoderLayer, Rust Book: Closures, PyTorch Design Philosophy, PyTorch Numerical Accuracy, PyTorch developer MHA discussion, Hugging Face Performance and Scalability, Layer Normalization, On Layer Normalization in the Transformer Architecture, On the Anatomy of Attention, Self-Attention as a Parametric Endofunctor, Categorical Deep Learning, Seven Sketches, The Annotated Transformer, The Illustrated Transformer |

Rust

Category Theory

Machine Learning

  • Dive into Deep Learning: Softmax Regression explains multiclass classification, logits, softmax, and cross entropy. Use it with src/ml.rs.
  • Dive into Deep Learning: Softmax Regression Implementation from Scratch shows the implementation path behind this course’s smaller Rust version.
  • Accurate Computation of the Log-Sum-Exp and Softmax Functions by Blanchard, Higham, and Higham supports the shifted softmax implementation that subtracts the maximum logit before exponentiation to improve floating-point behavior.
  • On Calibration of Modern Neural Networks by Guo, Pleiss, Sun, and Weinberger supports the distinction between softmax probabilities and calibrated confidence. Use it as a modesty boundary: Distribution means normalized model probabilities in this tiny example, not a guarantee that confidence matches empirical correctness.
  • Dive into Deep Learning: Gradient Descent gives the optimization background for TrainStep.
  • Dive into Deep Learning: Forward Propagation, Backward Propagation, and Computational Graphs supports the chain-rule and training chapters.
  • Automatic differentiation in machine learning: a survey by Baydin, Pearlmutter, Radul, and Siskind separates automatic differentiation, backpropagation, symbolic differentiation, and numerical finite differences. Use it when the book needs to distinguish a tiny hand-written gradient path from a general AD system.
  • PyTorch torch.optim is official framework documentation for optimizer objects, gradient clearing, backward passes, and optimizer steps. Use it to contrast production training loops with the book’s tiny TrainStep(dataset, learning_rate) : Parameters -> Parameters boundary.
  • Adam: A Method for Stochastic Optimization by Kingma and Ba introduces Adam as an adaptive stochastic optimizer based on first-moment and second-moment estimates. Use it for the Paper-To-Rust challenge claim that optimizer state must move with parameters.
  • PyTorch Adam is official framework documentation for Adam’s public optimizer API, moment estimates, bias correction, state_dict, and step() boundary. Use it as a production API sanity check for the smaller AdamModelState -> AdamModelState challenge.
  • Dive into Deep Learning: Numerical Stability and Initialization is useful when explaining broader gradient-scale and initialization stability issues beyond the tiny first softmax example.
  • PyTorch Autograd mechanics is official framework documentation for dynamic graph recording, saved tensors, and backward traversal with the chain rule. Use it to contrast production automatic differentiation with the book’s tiny MulOp::backward boundary.
  • Stanford CS231n: Optimization explains finite differences, numerical gradients, analytic gradients, and gradient checks. Use it with the finite-difference exercise and the TransformerBlockTrainStep tests.
  • Stanford CS231n: Neural Networks Part 3 explains gradient-checking cautions, learning-rate checks, and small-data sanity checks. Use it when exercises ask readers to interpret a failed training or gradient-check signal.
  • PyTorch gradcheck is official framework documentation for checking small finite differences against analytical gradients with tolerance, precision, and differentiability caveats. Use it to keep the finite-difference exercise honest about what a local gradient check can and cannot prove.
  • PyTorch CrossEntropyLoss is official framework documentation for the common production interface where the input is unnormalized logits and the target is a class index or class probability. Use it as an API-shape sanity check for the book’s smaller Logits -> Distribution -> Product<Distribution, TokenId> -> Loss path, not as the implementation target.
  • Stanford CS231n: Linear Classification explains linear classifiers, scores, losses, and the softmax classifier from a widely used university course.
  • Deep Learning by Goodfellow, Bengio, and Courville is a standard textbook reference for the broader ML vocabulary behind the tiny examples.
  • The Matrix Calculus You Need For Deep Learning gives a compact bridge from scalar calculus to the matrix shapes behind neural-network training. Use it as an advanced support reference for the chain-rule and gradient-check sections, not as a prerequisite.
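
Several entries above describe the same small numerical path: a shifted softmax, a cross-entropy loss on the target class, and a finite-difference check of the analytic gradient. The following is a minimal sketch of that path, not the code in src/ml.rs. The function names (`shifted_softmax`, `loss`, `analytic_gradient`, `numerical_gradient`) are hypothetical, the code uses plain `f64` slices instead of the course's domain types, and it assumes non-empty, finite logits.

```rust
/// Hypothetical helper: softmax with the maximum logit subtracted first.
/// The largest exponent becomes exactly 0, so `exp` cannot overflow.
fn shifted_softmax(logits: &[f64]) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&z| (z - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Cross-entropy loss for a single target class index.
fn loss(logits: &[f64], target: usize) -> f64 {
    -shifted_softmax(logits)[target].ln()
}

/// Analytic gradient of the loss with respect to the logits:
/// softmax probabilities minus the one-hot target.
fn analytic_gradient(logits: &[f64], target: usize) -> Vec<f64> {
    let mut grad = shifted_softmax(logits);
    grad[target] -= 1.0;
    grad
}

/// Centered finite differences: (f(z + h) - f(z - h)) / (2h) per coordinate.
fn numerical_gradient(logits: &[f64], target: usize, h: f64) -> Vec<f64> {
    (0..logits.len())
        .map(|i| {
            let mut plus = logits.to_vec();
            let mut minus = logits.to_vec();
            plus[i] += h;
            minus[i] -= h;
            (loss(&plus, target) - loss(&minus, target)) / (2.0 * h)
        })
        .collect()
}

fn main() {
    // Large logits would overflow a naive exp; the shifted version stays finite.
    let large = [1000.0, 1001.0, 1002.0];
    println!("softmax of large logits: {:?}", shifted_softmax(&large));

    let logits = [2.0, -1.0, 0.5];
    let target = 0;
    let analytic = analytic_gradient(&logits, target);
    let numeric = numerical_gradient(&logits, target, 1e-5);
    for (a, n) in analytic.iter().zip(numeric.iter()) {
        println!("analytic = {:+.8}, numeric = {:+.8}", a, n);
    }
}
```

Agreement between the two gradients at one point is local evidence only: it says the hand-written gradient matches the loss near these logits, under this step size and precision, and nothing more.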

Category Theory And Learning Systems

  • Backprop as Functor: A compositional perspective on supervised learning connects supervised learning, parameter updates, gradient descent, and compositional structure. Use it carefully: the book’s TrainStep is a tiny executable analogy, not a full implementation of the paper.
  • Compositional Deep Learning is a research reference for neural-network composition and categorical schemas. Use it as advanced context, not as prerequisite reading.
  • Category Theory in Machine Learning surveys category-theory applications across gradient-based learning, probability, and equivariant learning. Use it to decide whether a new chapter claim belongs to a recognized research theme or should stay a local teaching analogy.
  • Learners’ Languages develops the learner/update perspective around backpropagation, simple lenses, polynomial functors, and dynamical systems. Use it as advanced support for keeping TransformerTrainingState -> TransformerTrainingState modestly framed as a state-update teaching shape.
  • Generalized Gradient Descent is a Hypergraph Functor treats generalized gradient descent as a functor from compositional optimization problems to open dynamical systems. Use it as advanced context for composite objectives and distributed updates, not as a prerequisite for the tiny training loop.
  • Learning Functors using Gradient Descent studies category-shaped learning problems where functorial structure and composition invariants are learned with gradient descent. Use it as an advanced bridge from Seven Sketches-style schemas to learning systems.
  • Categorical Deep Learning is an ICML 2024 position paper about using category theory to connect architecture constraints with implementations. Use it as advanced context for roadmap warnings that a typed implementation boundary and a mathematical architecture constraint are related but not identical.
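
The "state-update teaching shape" these references are asked to support can be sketched in a few lines. This is an illustrative assumption, not the repository's `TrainStep`: the names `SketchParameters` and `SketchTrainStep` are hypothetical, the model is a single weight fitting y = w * x with mean squared error, and the dataset is assumed to be non-empty.

```rust
/// Hypothetical one-weight model for y = w * x.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SketchParameters {
    w: f64,
}

/// Hypothetical training step that holds a fixed dataset and learning rate.
struct SketchTrainStep<'a> {
    dataset: &'a [(f64, f64)],
    learning_rate: f64,
}

impl SketchTrainStep<'_> {
    /// One gradient-descent update on mean squared error.
    /// Input and output share a type, so the step composes with itself.
    fn apply(&self, params: SketchParameters) -> SketchParameters {
        let n = self.dataset.len() as f64;
        let gradient = self
            .dataset
            .iter()
            .map(|&(x, y)| 2.0 * (params.w * x - y) * x)
            .sum::<f64>()
            / n;
        SketchParameters {
            w: params.w - self.learning_rate * gradient,
        }
    }

    /// Composing the step with itself `steps` times is still
    /// a function from parameters to parameters.
    fn repeat(&self, steps: usize, start: SketchParameters) -> SketchParameters {
        (0..steps).fold(start, |params, _| self.apply(params))
    }
}

fn main() {
    let dataset = [(1.0, 2.0), (2.0, 4.0)];
    let step = SketchTrainStep {
        dataset: &dataset,
        learning_rate: 0.1,
    };
    let start = SketchParameters { w: 0.0 };
    println!("after 1 step:   {:?}", step.apply(start));
    println!("after 20 steps: {:?}", step.repeat(20, start));
}
```

The sketch shows only the endomorphism shape: a fixed step applied repeatedly to its own output. It does not implement the lens, functor, or hypergraph constructions of the papers listed above.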

Transformers

  • Attention Is All You Need on arXiv is the original Transformer paper.
  • Attention Is All You Need on the NeurIPS proceedings site is the archival conference listing.
  • Dive into Deep Learning: Attention Mechanisms and Transformers is a practical bridge from softmax and vector operations to attention and Transformer blocks. Use it with src/attention.rs for the query-key scoring, mask, score-to-weight, value-mixing, head-concatenation, output-projection, residual, normalization, and feed-forward boundaries.
  • Dive into Deep Learning: Queries, Keys, and Values supports the role distinction between queries, keys, and values before the code names QuerySequence, KeySequence, and ValueSequence.
  • Dive into Deep Learning: Attention Scoring Functions supports the scaled dot-product, masked-softmax, and value-mixing path used by ScaledDotProductScores, MaskedAttentionScores, WeightedValueMixing, and MaskedMultiHeadTransformerBlock.
  • Dive into Deep Learning: Multi-Head Attention supports the roadmap distinction between separate attention heads, concatenated head outputs, the output projection, and the MultiHeadTransformerBlock shape.
  • PyTorch MultiheadAttention is a framework documentation reference for query, key, and value as separate forward inputs, separate source and target sequence shapes, total embedding dimension split across attention heads, and the convention that boolean attention and key-padding masks mark blocked or ignored positions. Use it as an API-shape sanity check for the book’s typed role split, multi-head shape arithmetic, and mask-polarity warnings.
  • PyTorch Transformer is an official framework reference for encoder/decoder mask arguments where boolean masks mark positions that are not allowed to participate in attention. Use it to keep the roadmap honest that mask polarity is API-specific.
  • TensorFlow Keras MultiHeadAttention is a second official framework reference for the same target/query versus source/key-value distinction: query length T, value/key length S, attention masks over (B, T, S), and an allow-mask convention where 1 means attention is allowed. Use it to keep the roadmap’s product-input boundary and mask-polarity rule from looking like a PyTorch-only convention.
  • PyTorch scaled_dot_product_attention is a framework documentation reference for the implementation order: score, apply mask or bias, row-wise softmax, dropout if used, then value mixing. It is also a useful polarity warning: its boolean attn_mask uses True for participation, while some higher-level PyTorch masks use True for blocking or padding. Use it as an implementation sanity check, not as the book’s primary API target.
  • PyTorch Transformer building blocks tutorial is official tutorial material on composing low-level Transformer pieces such as nested tensors, scaled_dot_product_attention, torch.compile, and FlexAttention. Use it when the roadmap needs production context for variable sequence lengths, padding, masks, fully masked rows, and the distinction between pedagogical boundaries and optimized framework blocks.
  • PyTorch TransformerEncoderLayer is an official framework reference for the original Transformer encoder layer shape and the norm_first switch. Use it to keep the roadmap’s teaching boundary honest: the book can model foundational components while still being explicit that production libraries expose broader and faster variants.
  • PyTorch Developer Mailing List: Understanding Multi-Head Attention for ML Framework Developers is a developer-facing implementation bridge for Q/K/V source ownership, q_len versus kv_len, target/source sequence naming, masks, and the data-flow shape behind PyTorch attention APIs.
  • Dive into Deep Learning: Self-Attention and Positional Encoding supports the need for position information before sequence attention and the PositionalEncoding boundary.
  • Dive into Deep Learning: Transformer Architecture supports the residual-connection, layer-normalization, position-wise feed-forward, block, decoder masking, readout, and training-loop shape requirements used by ResidualConnection, LayerNormalization, PositionWiseFeedForward, SingleHeadTransformerBlock, MultiHeadTransformerBlock, MaskedMultiHeadTransformerBlock, TransformerReadout, and TransformerTrainingState.
  • Dive into Deep Learning: Parameter Management supports the idea that model parameters should be managed as explicit named components rather than scattered unnamed arrays. Use it with TinyTransformerParameters and TransformerTrainingState.
  • Dive into Deep Learning: Softmax Regression Implementation from Scratch supports the readout-only gradient step used by TransformerReadoutTrainStep.
  • Dive into Deep Learning: Backpropagation and Computational Graphs supports the forward-cache and reverse-computation order used by TransformerBlockTrainStep.
  • Dive into Deep Learning: Gradient Descent supports the learning-rate update shape used by TransformerReadoutTrainStep, TransformerFeedForwardTrainStep, and TransformerBlockTrainStep.
  • CS231n: Neural Networks Part 3 supports the roadmap’s gradient-evidence ledger: centered finite differences, relative-error reasoning, and the warning that gradient checks are local implementation checks.
  • PyTorch gradcheck is official framework documentation for comparing finite differences with analytical gradients under tolerance, precision, differentiability, and memory-layout caveats. Use it to keep the roadmap’s finite-difference tests scoped as local evidence.
  • Hugging Face Course: How do Transformers work? is a practitioner-facing course reference for architecture families, attention layers, masks, and the distinction between architecture, checkpoint, and model. Use it when the roadmap needs to explain why this repository builds tiny architecture pieces rather than loading pretrained checkpoints.
  • Hugging Face Transformers: Model outputs is official framework documentation for returned hidden states, attentions, and output structures. Use it as an API-shape sanity check for the roadmap’s HiddenSequence, AttentionWeights, and SequenceLogits boundaries.
  • PyTorch Design Philosophy is an official engineering note about PyTorch’s design trade-offs. Use it only as production-context background when the roadmap contrasts inspectable tiny Rust examples with full framework ergonomics.
  • PyTorch Numerical Accuracy is an official engineering note about numerical behavior, precision, and reproducibility limits. Use it as a boundary reminder when the book moves from tiny deterministic examples toward production-scale floating-point systems.
  • Hugging Face Transformers: Performance and Scalability is official engineering documentation for training and inference constraints in large Transformer systems. Use it as deployment-context background, not as a prerequisite for the tiny first-principles path.
  • Layer Normalization by Ba, Kiros, and Hinton supports the layer-normalization boundary and the per-example mean-and-variance normalization used by the roadmap code.
  • On Layer Normalization in the Transformer Architecture supports the roadmap warning that Post-LN and Pre-LN Transformer variants can share a public HiddenSequence -> HiddenSequence shape while differing in internal order and training behavior.
  • On the Anatomy of Attention is an advanced research reference for using category-theoretic diagrams to decompose attention mechanisms, compare variants, and identify recurring attention components. Use it as support for the roadmap’s component-by-component boundary map, not as a claim that the tiny Rust code implements the paper’s full formalism.
  • Self-Attention as a Parametric Endofunctor is an advanced research reference for categorical structure in the linear query, key, and value portions of self-attention. Use it as precision support when discussing linear attention structure, iterated layers, positional encodings, and the limit of the book’s claims around softmax and layer normalization.
  • The Annotated Transformer is useful when the roadmap needs an implementation-oriented bridge from paper notation to code.
  • The Illustrated Transformer is useful when the roadmap needs visual explanation of attention, encoder/decoder structure, and token-to-vector flow.
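
The mask-polarity warnings in the framework entries above can be made concrete by carrying the polarity in the type instead of in a comment. The following is a minimal sketch under stated assumptions: `MaskPolarity`, `AttentionMaskRow`, and `masked_softmax` are hypothetical names, not items from src/attention.rs, the code handles one query row at a time, and returning all zeros for a fully masked row is a choice made for this sketch, not a description of any framework's behavior.

```rust
/// Which boolean value marks a position that may take part in attention.
#[derive(Clone, Copy, Debug, PartialEq)]
enum MaskPolarity {
    TrueMeansAllowed,
    TrueMeansBlocked,
}

/// Hypothetical mask for one query row, tagged with its convention.
#[derive(Clone, Debug, PartialEq)]
struct AttentionMaskRow {
    polarity: MaskPolarity,
    entries: Vec<bool>,
}

impl AttentionMaskRow {
    /// Polarity-independent question: may this key position participate?
    fn allows(&self, position: usize) -> bool {
        match self.polarity {
            MaskPolarity::TrueMeansAllowed => self.entries[position],
            MaskPolarity::TrueMeansBlocked => !self.entries[position],
        }
    }

    /// Re-express the same mask under another convention.
    fn with_polarity(&self, polarity: MaskPolarity) -> AttentionMaskRow {
        let entries = if polarity == self.polarity {
            self.entries.clone()
        } else {
            self.entries.iter().map(|&e| !e).collect()
        };
        AttentionMaskRow { polarity, entries }
    }
}

/// Row-wise softmax over allowed positions; blocked positions get weight 0.
fn masked_softmax(scores: &[f64], mask: &AttentionMaskRow) -> Vec<f64> {
    assert_eq!(scores.len(), mask.entries.len());
    let max = scores
        .iter()
        .enumerate()
        .filter(|(i, _)| mask.allows(*i))
        .map(|(_, &s)| s)
        .fold(f64::NEG_INFINITY, f64::max);
    if max == f64::NEG_INFINITY {
        // Fully masked row: this sketch returns zeros instead of NaN.
        return vec![0.0; scores.len()];
    }
    let exps: Vec<f64> = scores
        .iter()
        .enumerate()
        .map(|(i, &s)| if mask.allows(i) { (s - max).exp() } else { 0.0 })
        .collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let scores = [1.0, 2.0, 3.0];
    // Causal row for query position 1: the future position 2 is blocked.
    let blocked_style = AttentionMaskRow {
        polarity: MaskPolarity::TrueMeansBlocked,
        entries: vec![false, false, true],
    };
    let allowed_style = blocked_style.with_polarity(MaskPolarity::TrueMeansAllowed);
    println!("allow-style entries: {:?}", allowed_style.entries);
    println!("weights (blocked-style): {:?}", masked_softmax(&scores, &blocked_style));
    println!("weights (allowed-style): {:?}", masked_softmax(&scores, &allowed_style));
}
```

Both conventions produce the same attention weights because `masked_softmax` only ever asks `allows`. Passing a bare `Vec<bool>` across an API boundary loses that information, which is the failure mode the polarity warnings describe.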

Learning Design

Use this section when checking why the chapters use worked examples, retrieval prompts, contrastive mistakes, and transfer exercises. The goal is not to cite learning science on every page. The goal is to make each chapter easier to enter, practice, remember, and transfer.