References
The problem this chapter solves is:
The course uses small Rust examples. These references point to the larger Rust, ML, category-theory, Transformer, and learning-science treatments behind those examples.
Use references as a source map, not as decoration. A useful reference should help answer at least one of the book’s three recurring questions:
- Rust syntax: which source file in this course uses the idea?
- ML concept: which model, training, or learning behavior does the source explain?
- Category theory concept: which object, morphism, composition, product, endomorphism, functor, monoid, or law does it deepen?
How To Read Source Roles
Not every reference has the same job. Use this order when deciding what a source can support:
| Source role | Examples in this chapter | Use it for | Do not use it for |
|---|---|---|---|
| repository code and tests | src/domain.rs, src/ml.rs, src/attention.rs, examples/ | the executable claim this book actually makes | replacing the larger math or framework source |
| official documentation | Rust Book, PyTorch docs, TensorFlow/Keras docs, Hugging Face docs | language behavior, API shape, framework boundary checks | proving a category-theory law by itself |
| academic papers | Seven Sketches, Attention Is All You Need, Layer Normalization, Backprop as Functor | original claims, formal scope, research vocabulary | claiming the tiny Rust code implements the whole paper |
| open textbooks and university material | Dive into Deep Learning, CS231n, MIT Applied Category Theory | pedagogy, intuition, course sequence, worked explanations | overriding repository code or official API docs |
| implementation bridges | The Annotated Transformer, The Illustrated Transformer, developer discussions | connecting notation to code and finding likely reader confusions | serving as final authority for definitions or laws |
| learner-friction signals | review reports, public questions, workshop notes | deciding what to explain more slowly | proving technical correctness |
When two sources disagree in vocabulary, prefer the source that owns the boundary. Rust documentation owns Rust syntax. Framework documentation owns framework API shape. Academic papers own the formal claim they introduce. This book’s code owns only the smaller executable teaching claim that the chapter states and tests.
Chapter Reference Map
Rust
- Category Theory for Tiny ML in Rust GitHub repository is the public source for this book, including Rust modules, examples, exercises, and issue templates.
- Category Theory for Tiny ML in Rust public workshop is the first public workshop for discussing the draft and the tiny ML pipeline.
- The Rust Programming Language: Packages, Crates, and Modules explains how Rust packages are organized into library and binary crates. Use it with `src/lib.rs`, `src/bin/category_ml.rs`, and the `examples/` files.
- The Rust Programming Language: Defining and Instantiating Structs supports the domain-object chapter’s use of named Rust structs.
- The Rust Programming Language: Defining an Enum supports enum-based modeling in `src/sketches.rs` and future Transformer state modeling.
- The Rust Programming Language: Generic Data Types supports the generic shapes in `Product<A, B>`, `Compose<F, G, Middle>`, and the functor examples.
- The Rust Programming Language: Defining Shared Behavior with Traits explains the trait contract behind `Morphism<Input, Output>`, `Functor<A, B>`, and `Monoid`.
- The Rust Programming Language: Recoverable Errors with `Result` explains the error pattern behind `CtResult<T>` and constructors such as `Distribution::new`.
- The Rust Programming Language: Writing Automated Tests supports the exercise design where tests act as executable feedback.
- The Rust Programming Language: Closures supports the fixed-context roadmap analogy: a callable value can capture a mask from its environment before the remaining call receives `HiddenSequence`.
- Rust By Example is useful when a chapter needs a smaller runnable Rust example before the real crate code.
- Rust By Example: New Type Idiom supports the idea that a wrapper type can make the compiler reject values with the wrong semantic role, even when the underlying representation is the same.
- Rust By Example: Tests gives a compact view of unit and integration test organization for readers turning exercises into checks.
- Rustlings Usage supports compiler-feedback practice where learners fix small Rust exercises from the command line.
- Rustlings Community Exercises supports the challenge-track idea that a public project can define focused exercises around one domain-specific topic.
- The rustdoc book: How to write documentation explains the documentation comments used above public types and methods.
- Rust API Guidelines Checklist is a practical review checklist for naming, documentation, type conversions, and error design.
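The New Type Idiom entry above can be made concrete with a short sketch. `TokenId` is a name this chapter uses, but the definition below is an assumption, and `Position` and `embedding_row` are hypothetical names chosen only for illustration; the repository's actual types may look different.

```rust
/// Hypothetical wrapper: an index into the vocabulary.
/// The real crate's `TokenId` may be defined differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TokenId(pub usize);

/// Hypothetical wrapper: an index into a sequence.
/// Same underlying representation as `TokenId`, different semantic role.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Position(pub usize);

/// Looks up one row of an embedding table by token id.
/// Returns `None` when the id falls outside the table.
pub fn embedding_row(table: &[Vec<f64>], token: TokenId) -> Option<&[f64]> {
    table.get(token.0).map(|row| row.as_slice())
}

// `embedding_row(&table, Position(1))` would be rejected at compile time:
// the function expects a `TokenId`, and `Position` is a distinct type even
// though both wrap a `usize`.
```

Calling `embedding_row(&table, TokenId(1))` returns the second row when the table has at least two rows. The design point is that the mix-up between a token index and a sequence position is caught by the compiler rather than by a runtime check.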
Category Theory
- Seven Sketches in Compositionality: An Invitation to Applied Category Theory is the larger applied-category-theory text behind the companion chapter. Use it with `src/sketches.rs`.
- Seven Sketches in Compositionality PDF is the direct paper file for offline reading and page-by-page study.
- MIT OpenCourseWare: Applied Category Theory is a university course built around applied category theory and the Seven Sketches text. Use it when a chapter needs more examples before a formal definition.
- Categories for the Working Mathematician is the classic formal reference for category, functor, natural transformation, duality, adjunctions, limits, monoids, and related structures. Use it as precision support, not as prerequisite reading.
- Stanford Encyclopedia of Philosophy: Category Theory gives a concise academic account of objects, morphisms, identities, composition, associativity, and examples. Use it to keep the local `Morphism<Input, Output>` and `Compose<F, G, Middle>` language aligned with the formal category definition.
- Category Theory for Programming is a programming-oriented academic reference for connecting category-theory ideas to datatype and functional-programming structure.
- Category Theory for Programmers PDF source repository is a programmer-friendly bridge for readers who want a longer informal route from programming to category theory.
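The identity and associativity laws named in the entries above can be spot-checked in code. The sketch below assumes one plausible shape for `Morphism<Input, Output>` and `Compose<F, G, Middle>`; the method name `apply`, the `Identity` type, and the `AddOne`, `Double`, and `Show` morphisms are hypothetical, and the repository's real definitions may differ.

```rust
use std::marker::PhantomData;

/// Assumed trait shape: a morphism maps an `Input` to an `Output`.
pub trait Morphism<Input, Output> {
    fn apply(&self, input: Input) -> Output;
}

/// Hypothetical identity morphism on any object `A`.
pub struct Identity;

impl<A> Morphism<A, A> for Identity {
    fn apply(&self, input: A) -> A {
        input
    }
}

/// Runs `first`, then `second`. `Middle` names the shared object so the
/// compiler can check that the output of `first` is the input of `second`.
pub struct Compose<F, G, Middle> {
    first: F,
    second: G,
    _middle: PhantomData<Middle>,
}

impl<F, G, Middle> Compose<F, G, Middle> {
    pub fn new(first: F, second: G) -> Self {
        Self { first, second, _middle: PhantomData }
    }
}

impl<A, B, C, F, G> Morphism<A, C> for Compose<F, G, B>
where
    F: Morphism<A, B>,
    G: Morphism<B, C>,
{
    fn apply(&self, input: A) -> C {
        self.second.apply(self.first.apply(input))
    }
}

// Three small hypothetical morphisms used only to exercise the laws.
pub struct AddOne;
impl Morphism<i32, i32> for AddOne {
    fn apply(&self, x: i32) -> i32 {
        x + 1
    }
}

pub struct Double;
impl Morphism<i32, i32> for Double {
    fn apply(&self, x: i32) -> i32 {
        x * 2
    }
}

pub struct Show;
impl Morphism<i32, String> for Show {
    fn apply(&self, x: i32) -> String {
        x.to_string()
    }
}
```

Because `Middle` appears only in `PhantomData`, a caller names it explicitly, for example `Compose::<_, _, i32>::new(AddOne, Double)`. Grouping three morphisms as `(AddOne, Double)` then `Show`, or as `AddOne` then `(Double, Show)`, should give the same result on every input. A test on a few inputs is evidence for the law on those inputs, not a proof of it.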
Machine Learning
- Dive into Deep Learning: Softmax Regression explains multiclass classification, logits, softmax, and cross entropy. Use it with `src/ml.rs`.
- Dive into Deep Learning: Softmax Regression Implementation from Scratch shows the implementation path behind this course’s smaller Rust version.
- Accurate Computation of the Log-Sum-Exp and Softmax Functions by Blanchard, Higham, and Higham supports the shifted softmax implementation that subtracts the maximum logit before exponentiation to improve floating-point behavior.
- On Calibration of Modern Neural Networks by Guo, Pleiss, Sun, and Weinberger supports the distinction between softmax probabilities and calibrated confidence. Use it as a modesty boundary: `Distribution` means normalized model probabilities in this tiny example, not a guarantee that confidence matches empirical correctness.
- Dive into Deep Learning: Gradient Descent gives the optimization background for `TrainStep`.
- Dive into Deep Learning: Forward Propagation, Backward Propagation, and Computational Graphs supports the chain-rule and training chapters.
- Automatic differentiation in machine learning: a survey by Baydin, Pearlmutter, Radul, and Siskind separates automatic differentiation, backpropagation, symbolic differentiation, and numerical finite differences. Use it when the book needs to distinguish a tiny hand-written gradient path from a general AD system.
- PyTorch torch.optim is official framework documentation for optimizer objects, gradient clearing, backward passes, and optimizer steps. Use it to contrast production training loops with the book’s tiny `TrainStep(dataset, learning_rate) : Parameters -> Parameters` boundary.
- Adam: A Method for Stochastic Optimization by Kingma and Ba introduces Adam as an adaptive stochastic optimizer based on first-moment and second-moment estimates. Use it for the Paper-To-Rust challenge claim that optimizer state must move with parameters.
- PyTorch Adam is official framework documentation for Adam’s public optimizer API, moment estimates, bias correction, `state_dict`, and `step()` boundary. Use it as a production API sanity check for the smaller `AdamModelState -> AdamModelState` challenge.
- Dive into Deep Learning: Numerical Stability and Initialization is useful when explaining broader gradient-scale and initialization stability issues beyond the tiny first softmax example.
- PyTorch Autograd mechanics is official framework documentation for dynamic graph recording, saved tensors, and backward traversal with the chain rule. Use it to contrast production automatic differentiation with the book’s tiny `MulOp::backward` boundary.
- Stanford CS231n: Optimization explains finite differences, numerical gradients, analytic gradients, and gradient checks. Use it with the finite-difference exercise and the `TransformerBlockTrainStep` tests.
- Stanford CS231n: Neural Networks Part 3 explains gradient-checking cautions, learning-rate checks, and small-data sanity checks. Use it when exercises ask readers to interpret a failed training or gradient-check signal.
- PyTorch gradcheck is official framework documentation for checking small finite differences against analytical gradients with tolerance, precision, and differentiability caveats. Use it to keep the finite-difference exercise honest about what a local gradient check can and cannot prove.
- PyTorch CrossEntropyLoss is official framework documentation for the common production interface where the input is unnormalized logits and the target is a class index or class probability. Use it as an API-shape sanity check for the book’s smaller `Logits -> Distribution -> Product<Distribution, TokenId> -> Loss` path, not as the implementation target.
- Stanford CS231n: Linear Classification explains linear classifiers, scores, losses, and the softmax classifier from a widely used university course.
- Deep Learning by Goodfellow, Bengio, and Courville is a standard textbook reference for the broader ML vocabulary behind the tiny examples.
- The Matrix Calculus You Need For Deep Learning gives a compact bridge from scalar calculus to the matrix shapes behind neural-network training. Use it as an advanced support reference for the chain-rule and gradient-check sections, not as a prerequisite.
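Two ideas from the entries above fit in one short sketch: the shifted softmax that subtracts the maximum logit before exponentiation, and a centered finite-difference check of a hand-written gradient. The function names below are hypothetical and are not taken from `src/ml.rs`; the sketch assumes a single example, a valid `target` index, and `f64` arithmetic.

```rust
/// Softmax that subtracts the maximum logit before exponentiating.
/// The shift does not change the mathematical result, but it keeps
/// `exp` from overflowing when logits are large.
pub fn shifted_softmax(logits: &[f64]) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|z| (z - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Cross-entropy loss for one example: -ln p[target].
/// Panics if `target` is out of range (an assumption of this sketch).
pub fn cross_entropy(logits: &[f64], target: usize) -> f64 {
    -shifted_softmax(logits)[target].ln()
}

/// Analytic gradient of the loss with respect to the logits:
/// softmax(logits) minus the one-hot vector for `target`.
pub fn analytic_gradient(logits: &[f64], target: usize) -> Vec<f64> {
    let mut grad = shifted_softmax(logits);
    grad[target] -= 1.0;
    grad
}

/// Centered finite-difference estimate of the same gradient,
/// perturbing one logit at a time by `h` in each direction.
pub fn numeric_gradient(logits: &[f64], target: usize, h: f64) -> Vec<f64> {
    (0..logits.len())
        .map(|i| {
            let mut plus = logits.to_vec();
            let mut minus = logits.to_vec();
            plus[i] += h;
            minus[i] -= h;
            (cross_entropy(&plus, target) - cross_entropy(&minus, target)) / (2.0 * h)
        })
        .collect()
}
```

Comparing `analytic_gradient(&logits, t)` with `numeric_gradient(&logits, t, 1e-5)` entry by entry is a local implementation check at one point. Agreement there does not show the gradient is right everywhere, which is the caution the CS231n and `gradcheck` entries raise.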
Category Theory And Learning Systems
- Backprop as Functor: A compositional perspective on supervised learning connects supervised learning, parameter updates, gradient descent, and compositional structure. Use it carefully: the book’s `TrainStep` is a tiny executable analogy, not a full implementation of the paper.
- Compositional Deep Learning is a research reference for neural-network composition and categorical schemas. Use it as advanced context, not as prerequisite reading.
- Category Theory in Machine Learning surveys category-theory applications across gradient-based learning, probability, and equivariant learning. Use it to decide whether a new chapter claim belongs to a recognized research theme or should stay a local teaching analogy.
- Learners’ Languages develops the learner/update perspective around backpropagation, simple lenses, polynomial functors, and dynamical systems. Use it as advanced support for keeping `TransformerTrainingState -> TransformerTrainingState` modestly framed as a state-update teaching shape.
- Generalized Gradient Descent is a Hypergraph Functor treats generalized gradient descent as a functor from compositional optimization problems to open dynamical systems. Use it as advanced context for composite objectives and distributed updates, not as a prerequisite for the tiny training loop.
- Learning Functors using Gradient Descent studies category-shaped learning problems where functorial structure and composition invariants are learned with gradient descent. Use it as an advanced bridge from Seven Sketches-style schemas to learning systems.
- Categorical Deep Learning is an ICML 2024 position paper about using category theory to connect architecture constraints with implementations. Use it as advanced context for roadmap warnings that a typed implementation boundary and a mathematical architecture constraint are related but not identical.
Transformers
- Attention Is All You Need on arXiv is the original Transformer paper.
- Attention Is All You Need on the NeurIPS proceedings site is the archival conference listing.
- Dive into Deep Learning: Attention Mechanisms and Transformers is a practical bridge from softmax and vector operations to attention and Transformer blocks. Use it with `src/attention.rs` for the query-key scoring, mask, score-to-weight, value-mixing, head-concatenation, output-projection, residual, normalization, and feed-forward boundaries.
- Dive into Deep Learning: Queries, Keys, and Values supports the role distinction between queries, keys, and values before the code names `QuerySequence`, `KeySequence`, and `ValueSequence`.
- Dive into Deep Learning: Attention Scoring Functions supports the scaled dot-product, masked-softmax, and value-mixing path used by `ScaledDotProductScores`, `MaskedAttentionScores`, `WeightedValueMixing`, and `MaskedMultiHeadTransformerBlock`.
- Dive into Deep Learning: Multi-Head Attention supports the roadmap distinction between separate attention heads, concatenated head outputs, the output projection, and the `MultiHeadTransformerBlock` shape.
- PyTorch `MultiheadAttention` is a framework documentation reference for query, key, and value as separate forward inputs, separate source and target sequence shapes, total embedding dimension split across attention heads, and the convention that boolean attention and key-padding masks mark blocked or ignored positions. Use it as an API-shape sanity check for the book’s typed role split, multi-head shape arithmetic, and mask-polarity warnings.
- PyTorch `Transformer` is an official framework reference for encoder/decoder mask arguments where boolean masks mark positions that are not allowed to participate in attention. Use it to keep the roadmap honest that mask polarity is API-specific.
- TensorFlow Keras `MultiHeadAttention` is a second official framework reference for the same target/query versus source/key-value distinction: query length `T`, value/key length `S`, attention masks over `(B, T, S)`, and an allow-mask convention where `1` means attention is allowed. Use it to keep the roadmap’s product-input boundary and mask-polarity rule from looking like a PyTorch-only convention.
- PyTorch `scaled_dot_product_attention` is a framework documentation reference for the implementation order: score, apply mask or bias, row-wise softmax, dropout if used, then value mixing. It is also a useful polarity warning: its boolean `attn_mask` uses `True` for participation, while some higher-level PyTorch masks use `True` for blocking or padding. Use it as an implementation sanity check, not as the book’s primary API target.
- PyTorch Transformer building blocks tutorial is official tutorial material on composing low-level Transformer pieces such as nested tensors, `scaled_dot_product_attention`, `torch.compile`, and `FlexAttention`. Use it when the roadmap needs production context for variable sequence lengths, padding, masks, fully masked rows, and the distinction between pedagogical boundaries and optimized framework blocks.
- PyTorch `TransformerEncoderLayer` is an official framework reference for the original Transformer encoder layer shape and the `norm_first` switch. Use it to keep the roadmap’s teaching boundary honest: the book can model foundational components while still being explicit that production libraries expose broader and faster variants.
- PyTorch Developer Mailing List: Understanding Multi-Head Attention for ML Framework Developers is a developer-facing implementation bridge for Q/K/V source ownership, `q_len` versus `kv_len`, target/source sequence naming, masks, and the data-flow shape behind PyTorch attention APIs.
- Dive into Deep Learning: Self-Attention and Positional Encoding supports the need for position information before sequence attention and the `PositionalEncoding` boundary.
- Dive into Deep Learning: Transformer Architecture supports the residual-connection, layer-normalization, position-wise feed-forward, block, decoder masking, readout, and training-loop shape requirements used by `ResidualConnection`, `LayerNormalization`, `PositionWiseFeedForward`, `SingleHeadTransformerBlock`, `MultiHeadTransformerBlock`, `MaskedMultiHeadTransformerBlock`, `TransformerReadout`, and `TransformerTrainingState`.
- Dive into Deep Learning: Parameter Management supports the idea that model parameters should be managed as explicit named components rather than scattered unnamed arrays. Use it with `TinyTransformerParameters` and `TransformerTrainingState`.
- Dive into Deep Learning: Softmax Regression Implementation from Scratch supports the readout-only gradient step used by `TransformerReadoutTrainStep`.
- Dive into Deep Learning: Backpropagation and Computational Graphs supports the forward-cache and reverse-computation order used by `TransformerBlockTrainStep`.
- Dive into Deep Learning: Gradient Descent supports the learning-rate update shape used by `TransformerReadoutTrainStep`, `TransformerFeedForwardTrainStep`, and `TransformerBlockTrainStep`.
- CS231n: Neural Networks Part 3 supports the roadmap’s gradient-evidence ledger: centered finite differences, relative-error reasoning, and the warning that gradient checks are local implementation checks.
- PyTorch `gradcheck` is official framework documentation for comparing finite differences with analytical gradients under tolerance, precision, differentiability, and memory-layout caveats. Use it to keep the roadmap’s finite-difference tests scoped as local evidence.
- Hugging Face Course: How do Transformers work? is a practitioner-facing course reference for architecture families, attention layers, masks, and the distinction between architecture, checkpoint, and model. Use it when the roadmap needs to explain why this repository builds tiny architecture pieces rather than loading pretrained checkpoints.
- Hugging Face Transformers: Model outputs is official framework documentation for returned hidden states, attentions, and output structures. Use it as an API-shape sanity check for the roadmap’s `HiddenSequence`, `AttentionWeights`, and `SequenceLogits` boundaries.
- PyTorch Design Philosophy is an official engineering note about PyTorch’s design trade-offs. Use it only as production-context background when the roadmap contrasts inspectable tiny Rust examples with full framework ergonomics.
- PyTorch Numerical Accuracy is an official engineering note about numerical behavior, precision, and reproducibility limits. Use it as a boundary reminder when the book moves from tiny deterministic examples toward production-scale floating-point systems.
- Hugging Face Transformers: Performance and Scalability is official engineering documentation for training and inference constraints in large Transformer systems. Use it as deployment-context background, not as a prerequisite for the tiny first-principles path.
- Layer Normalization by Ba, Kiros, and Hinton supports the layer-normalization boundary and the per-example mean-and-variance normalization used by the roadmap code.
- On Layer Normalization in the Transformer Architecture supports the roadmap warning that Post-LN and Pre-LN Transformer variants can share a public `HiddenSequence -> HiddenSequence` shape while differing in internal order and training behavior.
- On the Anatomy of Attention is an advanced research reference for using category-theoretic diagrams to decompose attention mechanisms, compare variants, and identify recurring attention components. Use it as support for the roadmap’s component-by-component boundary map, not as a claim that the tiny Rust code implements the paper’s full formalism.
- Self-Attention as a Parametric Endofunctor is an advanced research reference for categorical structure in the linear query, key, and value portions of self-attention. Use it as precision support when discussing linear attention structure, iterated layers, positional encodings, and the limit of the book’s claims around softmax and layer normalization.
- The Annotated Transformer is useful when the roadmap needs an implementation-oriented bridge from paper notation to code.
- The Illustrated Transformer is useful when the roadmap needs visual explanation of attention, encoder/decoder structure, and token-to-vector flow.
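Several entries above warn that mask polarity is API-specific: some boolean masks mark blocked positions and others mark allowed ones. One way to keep the two from being confused is to give each polarity its own wrapper type and convert explicitly. The names `AllowMask`, `BlockMask`, and `masked_softmax` below are hypothetical and are not taken from `src/attention.rs`; returning all zeros for a fully masked row is a choice of this sketch, not a framework convention.

```rust
/// Hypothetical wrapper: `true` means the position may be attended to.
#[derive(Debug, Clone, PartialEq)]
pub struct AllowMask(pub Vec<bool>);

/// Hypothetical wrapper: `true` means the position is blocked or ignored.
#[derive(Debug, Clone, PartialEq)]
pub struct BlockMask(pub Vec<bool>);

impl From<BlockMask> for AllowMask {
    /// Flips every entry so the polarity change is explicit in the types.
    fn from(mask: BlockMask) -> Self {
        AllowMask(mask.0.into_iter().map(|blocked| !blocked).collect())
    }
}

/// Masked softmax over one row of attention scores.
/// Blocked positions receive weight 0. A fully masked row returns all zeros.
pub fn masked_softmax(scores: &[f64], allow: &AllowMask) -> Vec<f64> {
    assert_eq!(scores.len(), allow.0.len(), "scores and mask must align");

    // Maximum over allowed positions only, used for the stability shift.
    let mut max = f64::NEG_INFINITY;
    for (score, ok) in scores.iter().zip(allow.0.iter()) {
        if *ok && *score > max {
            max = *score;
        }
    }

    // Exponentiate allowed scores; blocked scores contribute nothing.
    let mut weights = Vec::with_capacity(scores.len());
    let mut sum = 0.0;
    for (score, ok) in scores.iter().zip(allow.0.iter()) {
        let e = if *ok { (*score - max).exp() } else { 0.0 };
        sum += e;
        weights.push(e);
    }

    // Normalize only when at least one position was allowed.
    if sum > 0.0 {
        for w in weights.iter_mut() {
            *w /= sum;
        }
    }
    weights
}
```

With this shape, a block-style mask has to pass through `AllowMask::from` before it reaches `masked_softmax`, so a polarity mistake shows up as a type error instead of silently inverted attention.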
Learning Design
- How People Learn II: Learners, Contexts, and Cultures supports the book’s learning design: prior knowledge activation, worked examples, practice, retrieval, and attention to learner context.
- Improving Students’ Learning With Effective Learning Techniques by Dunlosky, Rawson, Marsh, Nathan, and Willingham is useful when deciding whether a chapter asks readers to practice durable techniques instead of only rereading.
- Test-Enhanced Learning: Taking Memory Tests Improves Long-Term Retention by Roediger and Karpicke supports retrieval-practice prompts that ask readers to recall, explain, and apply without looking back first.
- Structuring the Transition From Example Study to Problem Solving in Cognitive Skill Acquisition by Renkl and Atkinson supports the book’s progression from worked examples to partially completed examples and then transfer exercises.
- Self-Explanations: How Students Study and Use Examples in Learning to Solve Problems by Chi, Bassok, Lewis, Reimann, and Glaser supports self-check prompts that ask readers to explain why a worked example has the shape it has.
- Counteracting detrimental effects of misconceptions on learning and metacomprehension accuracy supports short contrast prompts that place a plausible misconception next to the corrected boundary before asking for transfer.
Use this section when checking why the chapters use worked examples, retrieval prompts, contrastive mistakes, and transfer exercises. The goal is not to cite learning science on every page. The goal is to make each chapter easier to enter, practice, remember, and transfer.