The Tiny ML Pipeline

The problem this chapter solves is:

The abstract Morphism trait needs concrete machine-learning arrows that turn token data into predictions and loss.

The whole prediction-and-loss path is:

TokenSequence -> TrainingSet
TokenId       -> Vector
Vector        -> Logits
Logits        -> Distribution
Distribution x TokenId -> Loss

In ordinary ML language, the path turns a token stream into adjacent training pairs, looks up an embedding vector for the current token, uses a linear layer to score every possible next token, normalizes those scores with softmax, and then measures surprise with cross entropy.

In category-theory language:

Each stage is a morphism, and the legal stages compose.

Reader orientation: This is the first chapter where all three subjects meet at once. When the code feels dense, follow the pipeline order: data preparation first, prediction second, loss third.

First Mental Model

The public shorthand for the whole project is:

Text -> Tokens -> TrainingPairs -> ModelState -> Prediction -> Loss -> Updated ModelState

This chapter zooms into the middle of that path. It explains how token pairs, current parameters, predictions, and loss become separate typed boundaries.

flowchart LR
    A["Text"] --> B["Tokens / TokenSequence"]
    B --> C["TrainingPairs / TrainingSet"]
    C --> F["Loss"]
    M["ModelState / Parameters"] --> P["Prediction / Distribution"]
    P --> F
    M --> U["Updated ModelState / Updated Parameters"]
    F --> U

Read the diagram as orientation, then use the Rust types for precision. The loss boundary needs both current model state and training data:

Parameters x TrainingSet -> Loss

The update boundary returns a complete next model state:

Parameters -> Updated Parameters

The same orientation as a compact rendered math view:

\[
\begin{array}{ccccccc}
\mathrm{Text} & \to & \mathrm{TokenSequence} & \to & \mathrm{TrainingSet} & \to & \mathrm{Loss} \\
&&&& \uparrow && \downarrow \\
&&&& \mathrm{Parameters} & \to & \mathrm{UpdatedParameters}
\end{array}
\]

How to read this diagram:

  • the top row is the data path from text into measured loss,
  • Parameters enters the prediction and loss boundary as model state,
  • the returned object is a complete updated parameter object,
  • the diagram is a map of responsibilities; the later sections name the exact Rust functions that own each arrow.

Chapter Outcomes

By the end of this chapter, you should be able to:

  • trace TokenId -> Vector -> Logits -> Distribution through the concrete Rust morphisms,
  • explain why cross entropy consumes both a prediction and the target token,
  • distinguish the production shortcut CrossEntropyLoss(logits, target) from this book’s explicit Logits -> Distribution -> Product<Distribution, TokenId> -> Loss teaching path.

What You Already Know

If you know ML, you already know the rough path: prepare data, make a prediction, and measure the error. If you know Rust, you already know that each step can have a concrete input and output type. This chapter combines those two habits by making each ML step implement the same morphism interface.

Prediction Trace Before Source

Before reading src/ml.rs, keep this trace in view. It separates raw scores, probabilities, the target token, and the final loss.

| Stage | Rust type | Plain meaning | What to check |
| --- | --- | --- | --- |
| Input token | TokenId | the current token position | Is this a vocabulary index, not a dimension? |
| Embedding | Vector | dense hidden features for that token | Has the token become numeric features? |
| Scores | Logits | unnormalized next-token scores | Are these still raw scores, not probabilities? |
| Normalization | Distribution | probabilities over next tokens | Do probabilities sum to one? |
| Target pairing | Product<Distribution, TokenId> | prediction plus correct next token | Which index is the target token? |
| Loss | Loss | surprise assigned to the target token | Did cross entropy use the target probability? |

The target index is the key detail. Cross entropy does not punish every probability equally. It first selects the probability assigned to the correct next token, then computes:

loss = -ln(probability assigned to target)

So this chapter’s core mental model is:

Logits
  -> Distribution
  -> target probability
  -> Loss

That order matters. If a reader treats logits as probabilities, or forgets that loss uses the target token index, the pipeline becomes hard to debug.

Source Snapshot

This file owns the concrete ML arrows.

Source snapshot: src/ml.rs
use crate::category::{Compose, Morphism};
use crate::domain::{
    Distribution, Logits, Loss, Parameters, Product, TokenId, TokenSequence, TrainingSet, Vector,
    approx_eq,
};
use crate::error::{CtError, CtResult};

/// Turns adjacent tokens into next-token training examples.
#[derive(Debug, Clone)]
pub struct DatasetWindowing;

impl Morphism<TokenSequence, TrainingSet> for DatasetWindowing {
    fn name(&self) -> &'static str {
        "dataset_windowing"
    }

    fn apply(&self, tokens: TokenSequence) -> CtResult<TrainingSet> {
        if tokens.as_slice().len() < 2 {
            return Err(CtError::EmptyInput(
                "dataset windowing requires at least 2 tokens",
            ));
        }

        TrainingSet::new(
            tokens
                .as_slice()
                .windows(2)
                .map(|pair| Product::new(pair[0], pair[1])),
        )
    }
}

/// Morphism from token id to embedding vector.
#[derive(Debug, Clone)]
pub struct Embedding {
    table: Vec<Vec<f32>>,
}

impl Embedding {
    pub fn from_parameters(params: &Parameters) -> Self {
        Self {
            table: params.embedding_table().to_vec(),
        }
    }
}

impl Morphism<TokenId, Vector> for Embedding {
    fn name(&self) -> &'static str {
        "embedding"
    }

    fn apply(&self, token: TokenId) -> CtResult<Vector> {
        let Some(row) = self.table.get(token.index()) else {
            return Err(CtError::OutOfRange {
                kind: "token",
                index: token.index(),
                limit: self.table.len(),
            });
        };

        Ok(Vector::new(row.clone()))
    }
}

/// Linear projection from hidden vector to vocabulary logits.
#[derive(Debug, Clone)]
pub struct LinearToLogits {
    weight: Vec<Vec<f32>>,
    bias: Vec<f32>,
}

impl LinearToLogits {
    pub fn from_parameters(params: &Parameters) -> Self {
        Self {
            weight: params.lm_head().to_vec(),
            bias: params.bias().to_vec(),
        }
    }

    pub(crate) fn from_parts(weight: Vec<Vec<f32>>, bias: Vec<f32>) -> Self {
        Self { weight, bias }
    }
}

impl Morphism<Vector, Logits> for LinearToLogits {
    fn name(&self) -> &'static str {
        "linear_to_logits"
    }

    fn apply(&self, input: Vector) -> CtResult<Logits> {
        let d_model = input.as_slice().len();
        let vocab_size = self.bias.len();

        if self.weight.len() != d_model {
            return Err(CtError::ShapeMismatch {
                op: "linear layer",
                expected: format!("weight rows == input dim {d_model}"),
                got: format!("weight rows {}", self.weight.len()),
            });
        }

        let mut out = self.bias.clone();

        for (feature, input_value) in input.as_slice().iter().enumerate() {
            if self.weight[feature].len() != vocab_size {
                return Err(CtError::ShapeMismatch {
                    op: "linear layer",
                    expected: format!("weight cols == vocab size {vocab_size}"),
                    got: format!("weight cols {}", self.weight[feature].len()),
                });
            }

            for (vocab_id, output_value) in out.iter_mut().enumerate() {
                *output_value += input_value * self.weight[feature][vocab_id];
            }
        }

        Ok(Logits::new(out))
    }
}

/// Converts logits to a probability distribution.
#[derive(Debug, Clone)]
pub struct Softmax;

impl Morphism<Logits, Distribution> for Softmax {
    fn name(&self) -> &'static str {
        "softmax"
    }

    fn apply(&self, logits: Logits) -> CtResult<Distribution> {
        if logits.as_slice().is_empty() {
            return Err(CtError::EmptyInput("softmax"));
        }

        let max_value = logits
            .as_slice()
            .iter()
            .copied()
            .fold(f32::NEG_INFINITY, f32::max);
        let mut exps = Vec::with_capacity(logits.as_slice().len());
        let mut sum = 0.0;

        for value in logits.as_slice() {
            let exp = (*value - max_value).exp();
            exps.push(exp);
            sum += exp;
        }

        if sum <= 0.0 || !sum.is_finite() {
            return Err(CtError::InvalidProbability("softmax"));
        }

        Distribution::new(exps.into_iter().map(|value| value / sum).collect())
    }
}

/// Negative log likelihood for `(distribution, target_token)`.
#[derive(Debug, Clone)]
pub struct CrossEntropy;

impl Morphism<Product<Distribution, TokenId>, Loss> for CrossEntropy {
    fn name(&self) -> &'static str {
        "cross_entropy"
    }

    fn apply(&self, input: Product<Distribution, TokenId>) -> CtResult<Loss> {
        let (distribution, target) = input.into_parts();

        let Some(probability) = distribution.as_slice().get(target.index()).copied() else {
            return Err(CtError::OutOfRange {
                kind: "target",
                index: target.index(),
                limit: distribution.as_slice().len(),
            });
        };

        Loss::new(-probability.max(1e-9).ln())
    }
}

/// Direct path used for a commutative-diagram check.
#[derive(Debug, Clone)]
pub struct DirectPredict {
    params: Parameters,
}

impl DirectPredict {
    pub fn new(params: Parameters) -> Self {
        Self { params }
    }
}

impl Morphism<TokenId, Distribution> for DirectPredict {
    fn name(&self) -> &'static str {
        "direct_predict"
    }

    fn apply(&self, token: TokenId) -> CtResult<Distribution> {
        let embedding = Embedding::from_parameters(&self.params).apply(token)?;
        let logits = LinearToLogits::from_parameters(&self.params).apply(embedding)?;
        Softmax.apply(logits)
    }
}

/// Average cross-entropy over a training set.
pub fn average_loss(params: &Parameters, dataset: &TrainingSet) -> CtResult<Loss> {
    let embedding = Embedding::from_parameters(params);
    let linear = LinearToLogits::from_parameters(params);
    let predict = Compose::<_, _, Vector>::new(embedding, linear);
    let predict = Compose::<_, _, Logits>::new(predict, Softmax);
    let loss_fn = CrossEntropy;

    let mut total = 0.0;

    for example in dataset.examples() {
        let distribution = predict.apply(*example.first())?;
        let loss = loss_fn.apply(Product::new(distribution, *example.second()))?;
        total += loss.value();
    }

    Loss::new(total / dataset.len() as f32)
}

/// Verifies that the composed path and direct path produce the same result.
pub fn composed_prediction_matches_direct_prediction(params: &Parameters) -> CtResult<bool> {
    let token = TokenId::new(1);

    let composed = Compose::<_, _, Vector>::new(
        Embedding::from_parameters(params),
        LinearToLogits::from_parameters(params),
    );
    let composed = Compose::<_, _, Logits>::new(composed, Softmax);
    let direct = DirectPredict::new(params.clone());

    let left_path = composed.apply(token)?;
    let right_path = direct.apply(token)?;

    Ok(left_path
        .as_slice()
        .iter()
        .zip(right_path.as_slice().iter())
        .all(|(a, b)| approx_eq(*a, *b, 1e-6)))
}

#[cfg(test)]
mod tests {
    use super::*;
    use crate::domain::{ModelDimension, VocabSize};

    #[test]
    fn dataset_windowing_builds_adjacent_pairs() -> CtResult<()> {
        let tokens = TokenSequence::from_indices([1, 2, 3])?;
        let dataset = DatasetWindowing.apply(tokens)?;

        assert_eq!(dataset.len(), 2);
        assert_eq!(dataset.examples()[0].first().index(), 1);
        assert_eq!(dataset.examples()[0].second().index(), 2);
        Ok(())
    }

    #[test]
    fn composed_and_direct_prediction_match() -> CtResult<()> {
        let params = Parameters::init(VocabSize::new(5)?, ModelDimension::new(4)?);

        assert!(composed_prediction_matches_direct_prediction(&params)?);
        Ok(())
    }

    #[test]
    fn softmax_normalizes_logits_into_distribution() -> CtResult<()> {
        let distribution = Softmax.apply(Logits::new(vec![1.0, 2.0, 3.0]))?;
        let probabilities = distribution.as_slice();
        let sum: f32 = probabilities.iter().sum();

        assert!(approx_eq(sum, 1.0, 1e-6));
        assert!(probabilities[2] > probabilities[1]);
        assert!(probabilities[1] > probabilities[0]);
        Ok(())
    }

    #[test]
    fn cross_entropy_is_lower_for_more_confident_target_probability() -> CtResult<()> {
        let confident = Distribution::new(vec![0.9, 0.1])?;
        let surprised = Distribution::new(vec![0.1, 0.9])?;

        let confident_loss = CrossEntropy.apply(Product::new(confident, TokenId::new(0)))?;
        let surprised_loss = CrossEntropy.apply(Product::new(surprised, TokenId::new(0)))?;

        assert!(confident_loss.value() < surprised_loss.value());
        Ok(())
    }
}

The Whole File

src/ml.rs defines:

DatasetWindowing
Embedding
LinearToLogits
Softmax
CrossEntropy
DirectPredict
average_loss
composed_prediction_matches_direct_prediction

The chapter reads them in pipeline order.

Read each block through the same three lenses:

Rust syntax:
what struct, trait implementation, loop, or error branch does the code use?

ML concept:
which prediction, loss, or data-preparation step does the block implement?

Category theory concept:
which object, product, morphism, composition, or commutative check appears?

Worked Example: Normalizing Scores

The smallest first-principles version of “normalize scores into probabilities” does not need a model yet:

fn main() {
    let scores = [1.0_f32, 2.0, 3.0];
    let total: f32 = scores.iter().sum();
    let probabilities: Vec<f32> = scores.iter().map(|score| score / total).collect();

    let probability_sum: f32 = probabilities.iter().sum();
    assert!((probability_sum - 1.0).abs() < 1e-6);
}

The real Softmax implementation is more careful than this toy normalization: it uses exponentials, subtracts the maximum score for numerical stability, and validates the result through Distribution::new.
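One reason the real implementation exponentiates first is that plain sum-normalization breaks as soon as a score is negative. The following is a minimal sketch on plain f32 slices, not the book's Logits and Distribution types; the helper names naive_normalize and exp_normalize are hypothetical, and the max-subtraction step is deliberately left out here to isolate the effect of the exponential.

```rust
/// Toy normalization: divide each score by the plain sum of scores.
fn naive_normalize(scores: &[f32]) -> Vec<f32> {
    let total: f32 = scores.iter().sum();
    scores.iter().map(|score| score / total).collect()
}

/// Exponentiate first, so every term is positive before dividing.
/// (Hypothetical helper; it omits the max-subtraction used by the real Softmax.)
fn exp_normalize(scores: &[f32]) -> Vec<f32> {
    let exps: Vec<f32> = scores.iter().map(|score| score.exp()).collect();
    let total: f32 = exps.iter().sum();
    exps.iter().map(|value| value / total).collect()
}

fn main() {
    // A negative raw score is perfectly legal for logits.
    let scores = [-1.0_f32, 0.5, 2.0];

    let naive = naive_normalize(&scores);
    let exponentiated = exp_normalize(&scores);

    // The naive version sums to one but contains a negative "probability".
    println!("naive:         {naive:?}");
    // The exponentiated version is positive everywhere and sums to one.
    println!("exponentiated: {exponentiated:?}");
}
```

The naive output still sums to one, which is why a sum check alone is not enough to call something a distribution.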

Self-Check

Why is it useful for the probability-validation boundary to live in Distribution::new instead of in every caller that uses probabilities?

Scores, Probabilities, And Loss

This chapter becomes easier if you keep three numbers separate.

Logits are raw scores. They can be negative or larger than one, and they do not need to sum to one. A logit says “how strongly the model scores this token before normalization.”

Distribution values are probabilities. They must be finite, non-negative, and sum to one. A distribution says “how much probability the model assigns to each possible next token.”

That is still a local model probability, not a promise that the model’s confidence is calibrated in the outside world. Calibration asks whether events predicted with about 0.90 confidence really happen about ninety percent of the time over a population of predictions. This tiny chapter only builds and validates the normalized probability object.

Loss is a scalar penalty. Cross entropy makes the penalty small when the model assigns high probability to the correct token and large when it assigns low probability to the correct token.

The concrete path is:

raw scores
  -> probabilities
  -> surprise about the target

Here is one small numeric trace:

target token index: 0
probability assigned to target: 0.90
loss = -ln(0.90) = 0.105

target token index: 0
probability assigned to target: 0.10
loss = -ln(0.10) = 2.303

Nothing mysterious happened. The loss only looked at the probability assigned to the correct target token. A confident correct prediction receives a small penalty. A surprised prediction receives a larger penalty.
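The same trace can be reproduced in a few lines. This is a sketch on plain slices, assuming a hypothetical helper named target_loss; it mirrors the 1e-9 floor used by CrossEntropy but returns an Option instead of the crate's CtResult.

```rust
/// Negative log of the probability assigned to the target index.
/// Returns None when the target index is outside the distribution.
fn target_loss(probabilities: &[f32], target: usize) -> Option<f32> {
    let probability = probabilities.get(target).copied()?;
    // Clamp away from zero so ln never sees 0.0, as CrossEntropy does.
    Some(-probability.max(1e-9).ln())
}

fn main() {
    let confident = [0.90_f32, 0.10];
    let surprised = [0.10_f32, 0.90];

    // Both cases use target index 0; only the probability at that index matters.
    println!("{:.3}", target_loss(&confident, 0).unwrap()); // prints 0.105
    println!("{:.3}", target_loss(&surprised, 0).unwrap()); // prints 2.303
}
```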

Worked Example: Do Not Use The Largest Probability

A common mistake is to compute loss from the largest probability in the distribution.

That is wrong.

Cross entropy uses the probability assigned to the actual target token, even when the model assigned a larger probability to some other token.

Consider this prediction:

probabilities over next tokens:
index 0: 0.60
index 1: 0.30
index 2: 0.10

target token index: 1

The largest probability is 0.60, but it belongs to token index 0.

The target probability is 0.30, because the correct next token is index 1.

So the loss is:

loss = -ln(0.30) = 1.204

The incorrect shortcut would be:

loss = -ln(0.60) = 0.511

That shortcut would make the prediction look better than it is. It rewards the model for being confident about the wrong token.

The Rust code prevents that confusion by pairing the distribution with the target:

Product<Distribution, TokenId>

Then CrossEntropy indexes into the distribution with target.index(). The target decides which probability becomes the loss.

The Rust path is:

Logits -> Distribution
Distribution x TokenId -> Loss
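To see the two computations side by side, here is a small sketch using the numbers from this example. The function names loss_from_target and loss_from_largest are hypothetical and exist only to contrast the correct lookup with the rejected shortcut; neither is part of src/ml.rs.

```rust
/// Correct: the target index selects which probability becomes the loss.
/// Panics on an out-of-range target; the real CrossEntropy returns an error instead.
fn loss_from_target(probabilities: &[f32], target: usize) -> f32 {
    -probabilities[target].ln()
}

/// Rejected shortcut: ignores the target and uses the largest probability.
fn loss_from_largest(probabilities: &[f32]) -> f32 {
    let largest = probabilities.iter().copied().fold(0.0_f32, f32::max);
    -largest.ln()
}

fn main() {
    let probabilities = [0.60_f32, 0.30, 0.10];
    let target = 1;

    let correct = loss_from_target(&probabilities, target);
    let shortcut = loss_from_largest(&probabilities);

    println!("target-based loss: {correct:.3}");  // prints 1.204
    println!("largest-based loss: {shortcut:.3}"); // prints 0.511
    // The shortcut under-reports the penalty whenever the model favors a wrong token.
    assert!(shortcut < correct);
}
```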

Framework Shortcut, Teaching Boundary

PyTorch’s CrossEntropyLoss accepts unnormalized logits and a target class index or target probabilities. That production API is efficient and ergonomic: the framework can combine normalization, target selection, reduction, and gradient behavior behind one call.

This book splits the same idea into smaller objects:

Logits -> Distribution -> Product<Distribution, TokenId> -> Loss

That split is not a claim that production frameworks are wrong. It is a teaching boundary. It makes two questions visible before the code becomes compact:

which boundary turns scores into probabilities?
which target index selects the probability used by loss?

| Production API habit | Tiny Rust teaching boundary |
| --- | --- |
| CrossEntropyLoss(logits, target_index) | Logits -> Distribution, then Distribution x TokenId -> Loss |
| logits and target passed together | probability invariant and target selection are separate |
| optimized fused behavior may hide the intermediate probability object | reader can inspect the Distribution constructor and the target probability |

When moving back to frameworks, remember that the compact API still owns both roles: score normalization and target-conditioned loss.
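The two styles agree numerically because -ln(softmax(x)[t]) equals logsumexp(x) - x[t]. The sketch below checks that identity on plain slices. It illustrates the mathematical relationship only and is not a description of how PyTorch implements its kernel; the names explicit_loss and fused_loss are hypothetical.

```rust
/// Max-shifted softmax over a plain slice.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max_value = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|v| (v - max_value).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|v| v / sum).collect()
}

/// Teaching path: build the distribution, then take -ln of the target entry.
fn explicit_loss(logits: &[f32], target: usize) -> f32 {
    -softmax(logits)[target].ln()
}

/// Fused path: log-sum-exp of the logits minus the target logit.
/// No intermediate probability vector is materialized.
fn fused_loss(logits: &[f32], target: usize) -> f32 {
    let max_value = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let sum_exp: f32 = logits.iter().map(|v| (v - max_value).exp()).sum();
    let log_sum_exp = max_value + sum_exp.ln();
    log_sum_exp - logits[target]
}

fn main() {
    let logits = [1.0_f32, 2.0, 3.0];
    let target = 0;

    let explicit = explicit_loss(&logits, target);
    let fused = fused_loss(&logits, target);

    println!("explicit: {explicit:.4}, fused: {fused:.4}");
    assert!((explicit - fused).abs() < 1e-4);
}
```

The fused form never exposes the probability vector, which is exactly the object this chapter wants readers to inspect.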

Target-Probability Responsibility Ledger

This chapter’s most important debugging habit is to keep responsibility in the right place. Each boundary owns one job.

| Pipeline cue | Rust handle | ML responsibility | Category boundary | Unsafe shortcut rejected | Source-backed limit |
| --- | --- | --- | --- | --- | --- |
| raw vocabulary scores | LinearToLogits : Vector -> Logits | produce one unnormalized score per token | Vector -> Logits | treating logits as probabilities | this is a tiny linear projection, not a full classifier stack |
| normalized probabilities | Softmax : Logits -> Distribution and Distribution::new | exponentiate, normalize, and validate a probability vector | Logits -> Distribution | skipping the probability invariant | normalized probability is not calibrated confidence |
| target probability | target.index() and distribution.as_slice().get(...) | select the probability assigned to the correct next token | part of Distribution x TokenId -> Loss | using the largest probability | this checks supervised class-index loss, not every target encoding |
| scalar surprise | Loss::new(-probability.max(1e-9).ln()) | turn the target probability into a non-negative penalty | Distribution x TokenId -> Loss | hiding target selection inside a vague loss word | this is the expanded teaching path, not a fused production kernel |

Use this audit card whenever the loss boundary feels slippery:

pipeline cue:
Rust handle:
ML responsibility:
category boundary:
unsafe shortcut rejected:
source-backed limit:
validation command:

Worked audit:

pipeline cue: target probability
Rust handle: distribution.as_slice().get(target.index())
ML responsibility: select the probability assigned to the correct next token
category boundary: CrossEntropy : Distribution x TokenId -> Loss
unsafe shortcut rejected: using the largest probability
source-backed limit: this checks one local supervised classification boundary,
  not calibration and not full framework equivalence
validation command:
  cargo test cross_entropy_is_lower_for_more_confident_target_probability --lib

The phrase “probability assigned to the target” should now point to one line of Rust, one ML responsibility, and one category-shaped boundary.

Source-Backed Precision Rules

This chapter uses external sources to keep the tiny prediction-and-loss path honest. Each source supports a limited claim; these citations are not proof that this crate is a production classifier, a calibrated probability model, or a framework replacement.

| Source | What the source supports | Local rule in this chapter | Rust evidence |
| --- | --- | --- | --- |
| D2L Softmax Regression | A classifier needs one output per class; softmax turns raw outputs into non-negative probabilities that sum to one. | Logits are raw scores; Softmax is the only boundary that creates a Distribution. | LinearToLogits : Vector -> Logits, Softmax : Logits -> Distribution |
| D2L Softmax From Scratch | Implementing softmax explicitly makes normalization and probability sums visible, and cross entropy selects the probability assigned to the true label. | The local teaching path exposes Distribution before loss so readers can inspect normalization and target selection separately. | Distribution::new, CrossEntropy, target.index() |
| Accurate Computation of the Log-Sum-Exp and Softmax Functions | Softmax and log-sum-exp evaluation can overflow or underflow, and shifted formulas are used to improve floating-point behavior. | Subtract the maximum logit before exponentiation, but keep the local claim to numerical stability of this boundary, not full production numerical analysis. | let max_value = ..., let exp = (*value - max_value).exp() |
| On Calibration of Modern Neural Networks | Confidence calibration asks whether predicted probabilities match empirical correctness frequencies. | A Distribution is a normalized local model output; it is not a guarantee of calibrated confidence. | Distribution::new, softmax_normalizes_logits_into_distribution |
| PyTorch CrossEntropyLoss | The production API accepts unnormalized logits and target class indices, and internally corresponds to log-softmax plus negative log likelihood. | The book deliberately expands that compact API into Logits -> Distribution -> Product<Distribution, TokenId> -> Loss. | Product<Distribution, TokenId>, CrossEntropy.apply |
| CS231n Linear Classification | Softmax treats class scores as unnormalized log probabilities and cross entropy penalizes the probability assigned to the correct class. | Do not compute loss from the largest probability; compute it from the target token’s probability. | cross_entropy_is_lower_for_more_confident_target_probability |

The transfer pattern is:

source claim -> local typed boundary -> validation command or test

For this chapter, that means reading cargo test ml::tests and the src/ml.rs morphisms as evidence for the tiny Logits -> Distribution -> Product<Distribution, TokenId> -> Loss boundary, not as evidence for every production classification stack.

The tests in src/ml.rs protect those claims: softmax normalizes logits into a distribution, and cross entropy is lower when the target token receives higher probability.

Here is the chapter’s full data-preparation and prediction diagram:

TokenSequence
     |
     | DatasetWindowing
     v
TrainingSet = [
  Product<TokenId, TokenId>,
  Product<TokenId, TokenId>,
  ...
]

For each TrainingExample:

input TokenId -------------------------------+
     |                                       |
     | Embedding                             |
     v                                       |
Vector                                      target TokenId
     |                                       |
     | LinearToLogits                        |
     v                                       |
Logits                                      |
     |                                       |
     | Softmax                               |
     v                                       |
Distribution ---------------- Product -------+
     |
     | CrossEntropy
     v
Loss

The left side is the prediction path. The right side carries the target token. CrossEntropy is the first stage that needs both, so the chapter uses Product<Distribution, TokenId> at that boundary.

The loss boundary as a rendered math view:

\[
\begin{array}{ccccc}
\mathrm{TokenId} & \xrightarrow{\mathrm{Embedding}} \mathrm{Vector} & \xrightarrow{\mathrm{LinearToLogits}} \mathrm{Logits} & \xrightarrow{\mathrm{Softmax}} \mathrm{Distribution} \\
&&&& \downarrow \mathrm{Product(-, target)} \\
&&&& \mathrm{Product}\langle \mathrm{Distribution}, \mathrm{TokenId}\rangle \xrightarrow{\mathrm{CrossEntropy}} \mathrm{Loss}
\end{array}
\]

How to read this diagram:

  • the prediction path produces a Distribution,
  • the target token does not become a prediction; it selects which probability becomes the loss,
  • CrossEntropy is the first arrow that needs the product input,
  • redrawing the diagram should make the target side visible, not hidden inside the word “loss”.
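The diagram can also be walked end to end without any of the crate's types. The sketch below is an assumption-laden miniature: plain Vec<f32> and usize stand in for Vector, Logits, Distribution, and TokenId, Option stands in for CtResult, and the function names predict and average_loss are local to this example (the second only shares its name with the function in src/ml.rs).

```rust
/// Max-shifted softmax over a plain slice.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max_value = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|v| (v - max_value).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|v| v / sum).collect()
}

/// Prediction side: token index -> embedding row -> logits -> probabilities.
/// Assumes `weight` has one row per embedding feature and `bias.len()` columns.
fn predict(table: &[Vec<f32>], weight: &[Vec<f32>], bias: &[f32], token: usize) -> Option<Vec<f32>> {
    let row = table.get(token)?;
    let mut logits = bias.to_vec();
    for (input_value, weight_row) in row.iter().zip(weight) {
        for (logit, w) in logits.iter_mut().zip(weight_row) {
            *logit += input_value * w;
        }
    }
    Some(softmax(&logits))
}

/// Loss side: pair each prediction with its target token, then average.
fn average_loss(table: &[Vec<f32>], weight: &[Vec<f32>], bias: &[f32], tokens: &[usize]) -> Option<f32> {
    if tokens.len() < 2 {
        return None; // no adjacent pair, so no training example
    }
    let mut total = 0.0_f32;
    for pair in tokens.windows(2) {
        let distribution = predict(table, weight, bias, pair[0])?;
        // The target (pair[1]) selects the probability; it never becomes a prediction.
        let target_probability = distribution.get(pair[1]).copied()?;
        total += -target_probability.max(1e-9).ln();
    }
    Some(total / (tokens.len() - 1) as f32)
}

fn main() {
    // Vocabulary of 3 tokens, 2 features. An all-zero head gives uniform predictions.
    let table = vec![vec![0.5_f32, -0.5], vec![1.0, 0.0], vec![0.0, 1.0]];
    let weight = vec![vec![0.0_f32; 3]; 2];
    let bias = vec![0.0_f32; 3];

    let loss = average_loss(&table, &weight, &bias, &[0, 1, 2]).unwrap();
    println!("average loss: {loss:.4}"); // prints 1.0986, which is ln(3)
}
```

A uniform prediction over three tokens gives a loss of ln(3) per example, a useful baseline: a model that has learned anything should score below it on its training pairs.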

DatasetWindowing

The problem this block solves is:

A token sequence must become input-target pairs before supervised next-token training can happen.

The block:

/// Turns adjacent tokens into next-token training examples.
#[derive(Debug, Clone)]
pub struct DatasetWindowing;

impl Morphism<TokenSequence, TrainingSet> for DatasetWindowing {
    fn name(&self) -> &'static str {
        "dataset_windowing"
    }

    fn apply(&self, tokens: TokenSequence) -> CtResult<TrainingSet> {
        if tokens.as_slice().len() < 2 {
            return Err(CtError::EmptyInput(
                "dataset windowing requires at least 2 tokens",
            ));
        }

        TrainingSet::new(
            tokens
                .as_slice()
                .windows(2)
                .map(|pair| Product::new(pair[0], pair[1])),
        )
    }
}

Rust Syntax: Unit Struct

pub struct DatasetWindowing;

This is a unit struct.

It stores no fields because the operation has no configuration.

The value itself represents the transformation.

Rust Syntax: Morphism Shape

impl Morphism<TokenSequence, TrainingSet> for DatasetWindowing

This says:

DatasetWindowing : TokenSequence -> TrainingSet

So it consumes the raw sequence stage and produces the training-example stage.

Rust Syntax: Why It Requires At Least Two Tokens

if tokens.as_slice().len() < 2 {
    return Err(CtError::EmptyInput(
        "dataset windowing requires at least 2 tokens",
    ));
}

TokenSequence only guarantees at least one token.

But next-token training requires at least one adjacent pair.

One token:

[7]

produces zero pairs.

Two tokens:

[7, 8]

produce one pair:

7 -> 8

So this morphism owns the stronger validation rule.

Rust Syntax: windows(2)

tokens.as_slice().windows(2)

This walks adjacent pairs:

[1, 2, 3, 4]

becomes:

[1, 2]
[2, 3]
[3, 4]

Each pair becomes:

Product::new(pair[0], pair[1])

That is a TrainingExample.
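The windowing behavior can be tried on its own with the standard library. In this sketch, plain u32 values and tuples stand in for TokenId and Product<TokenId, TokenId>; adjacent_pairs is a hypothetical helper, not part of the crate.

```rust
/// Pairs each token with the token that follows it.
fn adjacent_pairs(tokens: &[u32]) -> Vec<(u32, u32)> {
    tokens
        .windows(2) // overlapping slices of length 2
        .map(|pair| (pair[0], pair[1]))
        .collect()
}

fn main() {
    // n tokens produce n - 1 (input, target) pairs.
    assert_eq!(adjacent_pairs(&[1, 2, 3, 4]), vec![(1, 2), (2, 3), (3, 4)]);

    // A single token yields no windows at all, which is why
    // DatasetWindowing rejects sequences shorter than two tokens.
    assert!(adjacent_pairs(&[7]).is_empty());

    println!("{:?}", adjacent_pairs(&[7, 8])); // prints [(7, 8)]
}
```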

ML Concept

This is the data-preparation step for next-token prediction.

Category Theory Concept

This is a morphism between two structured objects:

non-empty token list -> non-empty product list

The output examples are product objects:

TokenId x TokenId

Embedding

The problem this block solves is:

A discrete token ID needs to become a dense vector before the model can use linear algebra.

The core block:

#[derive(Debug, Clone)]
pub struct Embedding {
    table: Vec<Vec<f32>>,
}

impl Embedding {
    pub fn from_parameters(params: &Parameters) -> Self {
        Self {
            table: params.embedding_table().to_vec(),
        }
    }
}

impl Morphism<TokenId, Vector> for Embedding {
    fn name(&self) -> &'static str {
        "embedding"
    }

    fn apply(&self, token: TokenId) -> CtResult<Vector> {
        let Some(row) = self.table.get(token.index()) else {
            return Err(CtError::OutOfRange {
                kind: "token",
                index: token.index(),
                limit: self.table.len(),
            });
        };

        Ok(Vector::new(row.clone()))
    }
}

Rust Syntax: Stored Table

table: Vec<Vec<f32>>

The embedding table has shape:

vocab_size x model_dimension

Each row is the vector for one token.

Rust Syntax: Constructor From Parameters

pub fn from_parameters(params: &Parameters) -> Self

The embedding morphism is built from model parameters.

It copies the table out of Parameters:

params.embedding_table().to_vec()

The copy gives the morphism its own data, so it carries no lifetime tied to Parameters. That costs an allocation, which is acceptable at this tutorial's scale.

Rust Syntax: Morphism Shape

impl Morphism<TokenId, Vector> for Embedding

This says:

Embedding : TokenId -> Vector

Rust Syntax: Bounds Check

let Some(row) = self.table.get(token.index()) else {
    return Err(CtError::OutOfRange { ... });
};

The code does not assume every TokenId is valid for every embedding table.

It checks the row lookup at the boundary where the table is used.

Rust Syntax: Why Clone The Row

Ok(Vector::new(row.clone()))

The morphism returns an owned Vector.

The row inside the table is borrowed, so the code clones it into the output object.

This is a deliberate ownership boundary.
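The bounds check and the clone can be seen together in a standalone sketch. It assumes a plain Vec<Vec<f32>> table and a String error in place of CtError::OutOfRange; the function name lookup is hypothetical.

```rust
/// Looks up one embedding row and returns an owned copy of it.
fn lookup(table: &[Vec<f32>], token: usize) -> Result<Vec<f32>, String> {
    match table.get(token) {
        // `row` is a borrow into the table; cloning hands the caller its own vector.
        Some(row) => Ok(row.clone()),
        None => Err(format!(
            "token index {token} out of range for table with {} rows",
            table.len()
        )),
    }
}

fn main() {
    // Two tokens in the vocabulary, three features each.
    let table = vec![vec![0.1_f32, 0.2, 0.3], vec![0.4, 0.5, 0.6]];

    let mut vector = lookup(&table, 1).unwrap();
    vector[0] = 99.0; // mutating the copy...
    assert_eq!(table[1][0], 0.4); // ...leaves the table untouched

    // Index 2 is one past the last row, so the lookup reports an error
    // instead of panicking.
    println!("{:?}", lookup(&table, 2));
}
```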

ML Concept

An embedding converts a symbolic token into numerical features.

Category Theory Concept

It is an arrow:

TokenId -> Vector

LinearToLogits

The problem this block solves is:

A hidden vector must be projected into one raw score per vocabulary item.

The shape is:

Vector -> Logits

The core implementation stores:

pub struct LinearToLogits {
    weight: Vec<Vec<f32>>,
    bias: Vec<f32>,
}

The dimensions are:

weight: d_model x vocab_size
bias: vocab_size
input: d_model
output: vocab_size

Rust Syntax: Shape Validation

Inside apply, the code checks:

if self.weight.len() != d_model {
    return Err(CtError::ShapeMismatch { ... });
}

This catches a matrix whose row count does not match the input vector length.

Then each row checks:

if self.weight[feature].len() != vocab_size {
    return Err(CtError::ShapeMismatch { ... });
}

This catches rows whose column count does not match the output vocabulary size.

Rust Syntax: Linear Computation

The output begins as the bias:

let mut out = self.bias.clone();

Then each input feature contributes to every vocabulary score:

for (feature, input_value) in input.as_slice().iter().enumerate() {
    for (vocab_id, output_value) in out.iter_mut().enumerate() {
        *output_value += input_value * self.weight[feature][vocab_id];
    }
}

Mathematically:

logits = input x weight + bias
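A small numeric case makes the shapes concrete. This sketch reimplements the projection on plain slices with a String error in place of CtError::ShapeMismatch; the function name linear and the example numbers are chosen for illustration only.

```rust
/// Computes input x weight + bias, where weight is d_model x vocab_size.
fn linear(input: &[f32], weight: &[Vec<f32>], bias: &[f32]) -> Result<Vec<f32>, String> {
    if weight.len() != input.len() {
        return Err(format!("weight rows {} != input dim {}", weight.len(), input.len()));
    }
    let mut out = bias.to_vec(); // start from the bias
    for (input_value, row) in input.iter().zip(weight) {
        if row.len() != bias.len() {
            return Err(format!("weight cols {} != vocab size {}", row.len(), bias.len()));
        }
        // Each input feature adds its contribution to every vocabulary score.
        for (output_value, w) in out.iter_mut().zip(row) {
            *output_value += input_value * w;
        }
    }
    Ok(out)
}

fn main() {
    let input = [1.0_f32, 2.0]; // d_model = 2
    let weight = vec![vec![1.0_f32, 0.0, -1.0], vec![0.5, 0.5, 0.5]]; // 2 x 3
    let bias = [0.1_f32, 0.2, 0.3]; // vocab_size = 3

    // logit 0 = 0.1 + 1.0*1.0  + 2.0*0.5 = 2.1
    // logit 1 = 0.2 + 1.0*0.0  + 2.0*0.5 = 1.2
    // logit 2 = 0.3 + 1.0*-1.0 + 2.0*0.5 = 0.3
    println!("{:?}", linear(&input, &weight, &bias));
}
```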

ML Concept

This is the language-model head.

It scores each possible next token.

Category Theory Concept

It is a morphism:

Vector -> Logits

It can compose after Embedding because Embedding returns Vector.
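The type-driven nature of that composition can be sketched with a stripped-down trait. Everything below is hypothetical: Arrow and Then are simplified stand-ins, not the crate's Morphism and Compose, whose definitions are not shown in this chapter. The sketch does suggest one plausible reason the source writes Compose::<_, _, Vector>: Rust requires every type parameter of an impl to be constrained, so the middle type has to be recorded somewhere.

```rust
use std::marker::PhantomData;

/// Hypothetical, simplified stand-in for a typed arrow A -> B.
trait Arrow<A, B> {
    fn apply(&self, input: A) -> Result<B, String>;
}

/// Runs `first`, then feeds its output into `second`.
/// `B` records the middle type so the impl below is well-formed.
struct Then<F, G, B> {
    first: F,
    second: G,
    middle: PhantomData<B>,
}

impl<F, G, B> Then<F, G, B> {
    fn new(first: F, second: G) -> Self {
        Self { first, second, middle: PhantomData }
    }
}

impl<A, B, C, F, G> Arrow<A, C> for Then<F, G, B>
where
    F: Arrow<A, B>,
    G: Arrow<B, C>,
{
    fn apply(&self, input: A) -> Result<C, String> {
        let middle = self.first.apply(input)?; // stop at the first error
        self.second.apply(middle)
    }
}

/// usize -> Vec<f32>: a two-row lookup table.
struct Lookup;
impl Arrow<usize, Vec<f32>> for Lookup {
    fn apply(&self, token: usize) -> Result<Vec<f32>, String> {
        let table = [[1.0_f32, 2.0], [3.0, 4.0]];
        table
            .get(token)
            .map(|row| row.to_vec())
            .ok_or_else(|| format!("token {token} out of range"))
    }
}

/// Vec<f32> -> f32: sums the features.
struct SumFeatures;
impl Arrow<Vec<f32>, f32> for SumFeatures {
    fn apply(&self, features: Vec<f32>) -> Result<f32, String> {
        Ok(features.iter().sum())
    }
}

fn main() {
    // Legal because Lookup's output type equals SumFeatures' input type.
    let composed = Then::<_, _, Vec<f32>>::new(Lookup, SumFeatures);

    let ok: Result<f32, String> = composed.apply(1_usize);
    let err: Result<f32, String> = composed.apply(9_usize);
    println!("{ok:?}");  // prints Ok(7.0)
    println!("{err:?}"); // an Err from the first stage, passed through unchanged
}
```

Swapping the two stages would not compile, because no arrow here accepts f32 as input. That compile-time rejection is what "the legal stages compose" means in practice.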

Softmax

The problem this block solves is:

Raw scores are not probabilities. They must be normalized into a valid distribution.

The block:

#[derive(Debug, Clone)]
pub struct Softmax;

impl Morphism<Logits, Distribution> for Softmax {
    fn name(&self) -> &'static str {
        "softmax"
    }

    fn apply(&self, logits: Logits) -> CtResult<Distribution> {
        if logits.as_slice().is_empty() {
            return Err(CtError::EmptyInput("softmax"));
        }

        let max_value = logits
            .as_slice()
            .iter()
            .copied()
            .fold(f32::NEG_INFINITY, f32::max);
        let mut exps = Vec::with_capacity(logits.as_slice().len());
        let mut sum = 0.0;

        for value in logits.as_slice() {
            let exp = (*value - max_value).exp();
            exps.push(exp);
            sum += exp;
        }

        if sum <= 0.0 || !sum.is_finite() {
            return Err(CtError::InvalidProbability("softmax"));
        }

        Distribution::new(exps.into_iter().map(|value| value / sum).collect())
    }
}

Rust Syntax: Unit Struct

Softmax stores no state.

It is the operation itself.

Rust Syntax: Morphism Shape

impl Morphism<Logits, Distribution> for Softmax

This says:

Softmax : Logits -> Distribution

Rust Syntax: Empty Check

Softmax over no scores is meaningless.

So the code rejects empty logits.

Rust Syntax: Numerical Stability

let max_value = ...
let exp = (*value - max_value).exp();

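Subtracting the maximum value makes every exponent zero or negative. For finite logits, each exponential then lies in (0, 1] and the largest is exactly 1, so large logits cannot overflow to infinity and the sum is at least 1.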

It does not change the final probabilities because softmax is invariant under adding or subtracting the same constant from every logit.
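The effect of the shift can be seen by comparing it with an unshifted version. This is a sketch over plain `f32` slices rather than the chapter's Logits type; `softmax_naive` and `softmax_shifted` are illustrative names, and both assume non-empty input.

```rust
/// Softmax without the max shift. Shown only for contrast.
pub fn softmax_naive(logits: &[f32]) -> Vec<f32> {
    let exps: Vec<f32> = logits.iter().map(|v| v.exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Softmax with the max shift, following the same steps as the chapter's code.
pub fn softmax_shifted(logits: &[f32]) -> Vec<f32> {
    let max_value = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Every exponent is <= 0 here, so every exponential is in (0, 1].
    let exps: Vec<f32> = logits.iter().map(|v| (v - max_value).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

For logits `[1000.0, 1001.0]`, the naive version computes `exp(1000.0)`, which overflows `f32` to infinity, and the division then yields NaN. The shifted version sees the exponents `-1.0` and `0.0`, so it returns the same probabilities it would return for `[0.0, 1.0]`.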

Rust Syntax: Normalization

Distribution::new(exps.into_iter().map(|value| value / sum).collect())

The raw exponentials are divided by their sum.

Then the Distribution constructor validates the probability invariant.

This is good boundary design: softmax computes, and Distribution::new enforces the distribution contract.

ML Concept

Softmax turns raw model scores into probabilities. In softmax regression and classification models, this is the step that makes one score per class interpretable as a probability distribution.

High logits become high probabilities.

Low logits become low probabilities.

The output can be interpreted as:

P(next token | current token)

Category Theory Concept

Softmax is a morphism:

Logits -> Distribution

It changes the object from an unconstrained score vector into a point on the probability simplex: non-negative entries that sum to 1, up to floating-point tolerance.

CrossEntropy

The problem this block solves is:

A model prediction must be compared to the actual target token to produce a scalar loss.

The block:

#[derive(Debug, Clone)]
pub struct CrossEntropy;

impl Morphism<Product<Distribution, TokenId>, Loss> for CrossEntropy {
    fn name(&self) -> &'static str {
        "cross_entropy"
    }

    fn apply(&self, input: Product<Distribution, TokenId>) -> CtResult<Loss> {
        let (distribution, target) = input.into_parts();

        let Some(probability) = distribution.as_slice().get(target.index()).copied() else {
            return Err(CtError::OutOfRange {
                kind: "target",
                index: target.index(),
                limit: distribution.as_slice().len(),
            });
        };

        Loss::new(-probability.max(1e-9).ln())
    }
}

Rust Syntax: Input Type

Product<Distribution, TokenId>

Cross entropy needs both:

  • the predicted distribution
  • the correct target token

That pair is a product object.

Rust Syntax: Splitting The Product

let (distribution, target) = input.into_parts();

This consumes the product and extracts both values.

Rust Syntax: Target Bounds Check

distribution.as_slice().get(target.index())

The target token must be inside the probability vector.

If the distribution has 5 entries, target index 7 is invalid.

This error belongs here because this is the first place the target is used as an index into the predicted distribution.

Rust Syntax: Negative Log Likelihood

Loss::new(-probability.max(1e-9).ln())

The loss is:

-ln(probability assigned to the correct token)

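The max(1e-9) clamps the probability to at least 1e-9. This avoids taking the log of zero and caps the loss for a single example at -ln(1e-9), roughly 20.7.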

Then Loss::new validates the loss scalar.
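The bounds check and the clamped negative log can be tried with plain numbers. The sketch below is an assumption-labeled simplification: it takes a slice and a `usize` instead of `Product<Distribution, TokenId>`, and it returns `None` where the chapter's code returns `CtError::OutOfRange`.

```rust
/// Negative log likelihood of `target` under `distribution`.
/// Returns `None` when `target` is not a valid index.
pub fn cross_entropy(distribution: &[f32], target: usize) -> Option<f32> {
    // `get` performs the bounds check; `?` returns None when it fails.
    let probability = distribution.get(target).copied()?;
    // Clamp to 1e-9 before the log, as in the chapter's code.
    Some(-probability.max(1e-9).ln())
}
```

With the distribution `[0.5, 0.25, 0.25]`, target 0 gives -ln(0.5), about 0.693, and target 1 gives -ln(0.25), about 1.386. A target probability of exactly 0 is clamped and gives about 20.7. A target index of 7 against a 5-entry distribution returns `None`.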

ML Concept

Cross entropy measures how surprised the model was by the true target.

If the model assigns high probability to the target, the loss is small.

If the model assigns low probability to the target, the loss is large.

This is why the chapter says loss is a training signal. It turns a probability assigned to the correct token into a number the optimizer can try to reduce.

Category Theory Concept

Cross entropy consumes a product object:

Distribution x TokenId

and maps it into:

Loss

So its shape is:

Product<Distribution, TokenId> -> Loss

DirectPredict

The problem this block solves is:

The course needs a direct implementation to compare against the composed prediction path.

DirectPredict stores parameters and implements:

TokenId -> Distribution

Internally, it still performs:

Embedding
LinearToLogits
Softmax

but it writes the steps directly.

This allows the code to test:

composed path == direct path

That is the program’s tiny commutative diagram check.

Rust Syntax

DirectPredict is a struct that owns Parameters.

Its apply method calls the prediction steps directly instead of using Compose.

ML Concept

This is the direct prediction implementation.

It exists so the composed path can be checked against a straightforward path.

Category Theory Concept

It provides the second path in a commutative diagram:

composed path
direct path

average_loss

The problem this function solves is:

Training needs one scalar loss over the whole training set.

The function builds the composed prediction path:

let embedding = Embedding::from_parameters(params);
let linear = LinearToLogits::from_parameters(params);
let predict = Compose::<_, _, Vector>::new(embedding, linear);
let predict = Compose::<_, _, Logits>::new(predict, Softmax);

The resulting shape is:

TokenId -> Distribution

Then each training example is evaluated:

let distribution = predict.apply(*example.first())?;
let loss = loss_fn.apply(Product::new(distribution, *example.second()))?;

Finally, the average is wrapped in Loss::new.

The function does not return a raw f32.

It returns a validated Loss.

Rust Syntax

The function takes borrowed parameters and a borrowed dataset:

pub fn average_loss(params: &Parameters, dataset: &TrainingSet) -> CtResult<Loss>

It does not consume either one.

The function loops through examples, accumulates scalar losses, and divides by the dataset length.
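The accumulate-and-divide loop can be sketched without the chapter's types. Everything here is illustrative: `predict` is any closure from a token index to a probability vector, examples are `(input, target)` index pairs, and `None` stands in for the error cases. Returning `None` for an empty dataset is an assumption of this sketch, not a statement about what the book's function does.

```rust
/// Averages the negative log likelihood over `(input, target)` pairs.
/// Assumption: an empty dataset is treated as an error (`None`).
pub fn average_loss<F>(predict: F, examples: &[(usize, usize)]) -> Option<f32>
where
    F: Fn(usize) -> Vec<f32>,
{
    if examples.is_empty() {
        return None;
    }
    let mut total = 0.0_f32;
    for &(input, target) in examples {
        // Predict a distribution from the input token.
        let distribution = predict(input);
        // Pick the probability assigned to the correct next token.
        let probability = distribution.get(target).copied()?;
        // Accumulate the per-example loss.
        total += -probability.max(1e-9).ln();
    }
    // Divide by the dataset length to get one scalar.
    Some(total / examples.len() as f32)
}
```

Because the function only borrows `examples` and calls `predict` by reference, nothing is consumed, which matches the borrowing pattern of the signature above.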

ML Concept

Average loss summarizes model performance over the full training set.

Category Theory Concept

It folds the per-example results of the loss morphism into one scalar Loss object.

composed_prediction_matches_direct_prediction

The problem this function solves is:

The code should check that the composed prediction pipeline and the direct implementation agree.

The composed path is:

TokenId -> Vector -> Logits -> Distribution

The direct path is:

TokenId -> Distribution

The function runs both on the same token and compares every probability with a small floating-point tolerance.

Category-theoretically, this is a commutative diagram test:

          composed
TokenId ------------> Distribution
   \                      ^
    \ direct              |
     ---------------------

The exact drawing is less important than the idea:

Two paths through the system should produce the same meaning.

Rust Syntax

The function builds one composed path with Compose and one direct path with DirectPredict.

It compares probabilities pairwise with approx_eq.

ML Concept

This verifies that refactoring the prediction path into smaller stages did not change the predicted probabilities.

Category Theory Concept

This is a commutative-diagram check in code.
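A self-contained version of the same check can be written with ordinary functions. This is a sketch under stated assumptions: `compose` is a hypothetical helper standing in for the book's Compose, the parameters are hard-coded toy constants, and token indices are assumed to be in range (0, 1, or 2).

```rust
// Toy parameters: 3 tokens, d_model = 2, vocab_size = 3. Illustrative only.
const EMBEDDINGS: [[f32; 2]; 3] = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]];
const WEIGHT: [[f32; 3]; 2] = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]];
const BIAS: [f32; 3] = [0.1, 0.2, 0.3];

/// Hypothetical composition helper: run `first`, then `second`.
pub fn compose<A, B, C>(
    first: impl Fn(A) -> B,
    second: impl Fn(B) -> C,
) -> impl Fn(A) -> C {
    move |input| second(first(input))
}

pub fn embed(token: usize) -> Vec<f32> {
    EMBEDDINGS[token].to_vec()
}

pub fn linear(vector: Vec<f32>) -> Vec<f32> {
    let mut out = BIAS.to_vec();
    for (feature, input_value) in vector.iter().enumerate() {
        for (vocab_id, output_value) in out.iter_mut().enumerate() {
            *output_value += input_value * WEIGHT[feature][vocab_id];
        }
    }
    out
}

pub fn softmax(logits: Vec<f32>) -> Vec<f32> {
    let max_value = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|v| (v - max_value).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Direct path: the same three steps written inline in one function.
pub fn direct_predict(token: usize) -> Vec<f32> {
    let vector = EMBEDDINGS[token];
    let mut logits = BIAS.to_vec();
    for (vocab_id, logit) in logits.iter_mut().enumerate() {
        for (feature, value) in vector.iter().enumerate() {
            *logit += value * WEIGHT[feature][vocab_id];
        }
    }
    let max_value = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|v| (v - max_value).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Compares both paths entry by entry with a small tolerance.
pub fn paths_agree(token: usize) -> bool {
    let composed = compose(compose(embed, linear), softmax);
    let left = composed(token);
    let right = direct_predict(token);
    left.len() == right.len()
        && left.iter().zip(&right).all(|(a, b)| (a - b).abs() < 1e-6)
}
```

The comparison uses a tolerance instead of `==` because two implementations of the same floating-point calculation may sum terms in a different order.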

Run The Demo

Run:

cargo run --bin category_ml

Look at sections 2 through 5 in the output.

You should see:

TokenSequence -> TrainingSet
prediction probabilities
loss for a target token

Demo Output Transfer Checklist

Sections 2 through 5 of the demo are the smallest complete ML story in the book. Read them as a boundary report.

| Demo output | Boundary to own | Shortcut to reject |
| --- | --- | --- |
| Dataset morphism: TokenSequence -> TrainingSet | DatasetWindowing turns a token stream into adjacent input-target pairs. | Treating a raw token sequence as if it were already supervised data. |
| "I" -> "love" | Each pair is Product<TokenId, TokenId>. | Forgetting which token is the input and which token is the target. |
| Composition: Softmax after Linear after Embedding | Prediction is TokenId -> Vector -> Logits -> Distribution. | Skipping Logits and pretending vectors are probabilities. |
| `P(next token \| 'I') = [...]` | The printed vector is a validated Distribution. | |
| Product object: Prediction x Target -> Loss | Loss needs both the prediction and the correct next token. | Calling loss on Distribution alone. |
| loss for target 'love' = ... | Cross entropy uses the probability at the target token index. | Using the largest probability instead of the target probability. |

This checklist compresses the chapter into one reader habit:

visible output -> typed boundary -> invalid shortcut rejected

The ML idea is that a training example is not just an input. It is an input paired with the answer the model should have predicted. The category-theory idea is that the answer enters through a product boundary:

Distribution x TokenId -> Loss

The Rust idea is that the boundary is not only prose. It appears as a concrete type:

Product<Distribution, TokenId>
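The "input paired with the answer" idea starts at the dataset boundary, where each adjacent pair of tokens becomes one example. A minimal sketch, assuming plain `usize` token ids and tuples in place of TokenId and Product; `adjacent_pairs` is an illustrative name, and how the book's DatasetWindowing treats sequences shorter than two tokens is not shown here.

```rust
/// Turns a token stream into adjacent (input, target) pairs.
/// A stream with fewer than two tokens yields no pairs in this sketch.
pub fn adjacent_pairs(tokens: &[usize]) -> Vec<(usize, usize)> {
    // `windows(2)` yields every overlapping slice of length 2, in order.
    tokens
        .windows(2)
        .map(|pair| (pair[0], pair[1]))
        .collect()
}
```

For the stream `[1, 5, 3, 7]` this returns `[(1, 5), (5, 3), (3, 7)]`. The first element of each pair is the input token and the second is the target the model should have predicted.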

Why This Matters

This chapter is where the course stops being abstract.

The code implements a real, tiny version of the common language-model training path:

context token -> hidden vector -> next-token probabilities -> loss

The implementation is small, but the boundaries are real. Invalid token lookup returns OutOfRange, invalid matrix shape returns ShapeMismatch, empty logits return EmptyInput, invalid probabilities return InvalidProbability, and invalid loss returns InvalidLoss.

Errors are caught where the invalid data first becomes meaningful.

Core Mental Model

In Rust terms:

each ML operation implements Morphism<Input, Output>

In ML terms:

prediction is embedding + linear projection + softmax
loss is negative log probability of the target

In category-theory terms:

prediction is composition of arrows
loss consumes a product object
the direct and composed paths should commute

Checkpoint

Where should an out-of-range target token be caught?

Correct answer:

Inside CrossEntropy, because that is where the target is used to index the predicted distribution.

Where This Leaves Us

This chapter assembled the first complete tiny ML path. A token sequence becomes training examples, a token becomes a vector, a vector becomes logits, logits become probabilities, and a probability distribution plus a target token becomes loss.

The next chapter, Training as an Endomorphism, changes the question from “how do we evaluate one prediction?” to “how do repeated updates change the model state?” That is where training enters as an endomorphism.

Further Reading

The problem this section solves is transfer. If you only read the tiny Rust implementation, larger framework APIs may still look unrelated. If you only read a framework reference, the explicit typed boundaries in this chapter may feel unnecessarily small. Use the references to connect the two views without collapsing them.

Start from the local Rust evidence:

DatasetWindowing.apply : TokenSequence -> TrainingSet
Embedding.apply        : TokenId -> Vector
LinearToLogits.apply   : Vector -> Logits
Softmax.apply          : Logits -> Distribution
CrossEntropy.apply     : Distribution x TokenId -> Loss
average_loss           : Parameters x TrainingSet -> Loss

Then read the sources in this order:

| Source | What to transfer back into this chapter | Local evidence to inspect |
| --- | --- | --- |
| D2L Softmax Regression | Multiclass classification uses raw scores, softmax probabilities, and cross entropy as one connected prediction-and-loss story. | LinearToLogits.apply, Softmax.apply, CrossEntropy.apply |
| D2L Softmax From Scratch | Implementing the pieces from scratch reveals the roles hidden by concise framework calls. | src/ml.rs, average_loss, cargo test ml::tests --lib |
| Accurate Computation of the Log-Sum-Exp and Softmax Functions | Floating-point softmax implementations use shifted formulas to reduce overflow and harmful underflow. | let max_value = ..., (*value - max_value).exp() |
| On Calibration of Modern Neural Networks | A normalized probability vector is not automatically a calibrated confidence estimate over future predictions. | Distribution::new, softmax_normalizes_logits_into_distribution |
| PyTorch CrossEntropyLoss | A production API can accept unnormalized logits and target class indices while internally combining log-softmax and negative log likelihood. | Logits -> Distribution, Product<Distribution, TokenId>, CrossEntropy.apply |
| CS231n Linear Classification | Scores, classifiers, and losses should be kept conceptually separate before optimization is discussed. | Vector -> Logits, Distribution x TokenId -> Loss |

The ML bridge is:

framework call
  -> raw scores plus target index
  -> probability assigned to the target
  -> loss

The category-theory bridge is:

Logits -> Distribution
Distribution x TokenId -> Loss

The first arrow is an ordinary morphism. The second is a product-input morphism because loss needs both the model’s prediction and the correct target token.
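To see how a fused framework call relates to the two arrows, compare a two-step loss with a one-step loss computed straight from logits. This is a sketch of one common formulation (log-sum-exp minus the target logit), not the implementation of any particular library. The function names are illustrative, the target index is assumed to be in range, and the 1e-9 clamp from the chapter is left out.

```rust
/// Step 1 of the expanded path: Logits -> Distribution.
pub fn softmax(logits: &[f32]) -> Vec<f32> {
    let max_value = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|v| (v - max_value).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Expanded path: softmax first, then negative log of the target probability.
pub fn two_step_loss(logits: &[f32], target: usize) -> f32 {
    -softmax(logits)[target].ln()
}

/// Fused path: -ln(softmax(logits)[target]) rewritten as
/// log_sum_exp(logits) - logits[target], with the same max shift.
pub fn fused_loss(logits: &[f32], target: usize) -> f32 {
    let max_value = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let shifted_sum: f32 = logits.iter().map(|v| (v - max_value).exp()).sum();
    let log_sum_exp = max_value + shifted_sum.ln();
    log_sum_exp - logits[target]
}
```

Both functions agree up to floating-point rounding. The fused form never materializes the distribution, which is the shortcut the tiny Rust path expands into `Logits -> Distribution` followed by `Distribution x TokenId -> Loss`.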

After reading one source, answer four questions:

  1. Which local boundary did it clarify?
  2. Which value is raw score, probability, target, or loss?
  3. Which shortcut did the source use that the tiny Rust path expands?
  4. Which command or test shows the local evidence?

For this chapter, the commands are:

cargo run --bin category_ml
cargo test ml::tests --lib

Checkpoint:

When reading an external loss API, can you name which part corresponds to
Logits -> Distribution and which part corresponds to Distribution x TokenId
-> Loss?

For terminology recovery, use:

  • Glossary: logits, softmax, probability distribution, cross entropy
  • References: softmax regression and linear classifiers

If a source does not help you point to one local boundary and one output or test signal, it has not transferred back into this chapter yet.

Practice After This Chapter

Use Exercise 3 to trace adjacent training pairs and Exercise 9 to connect this tiny implementation to a larger ML reference. The pair checks both local code understanding and source-backed transfer.

Retrieval Practice

Recall

Recover the path before explaining the calculations.

  1. What morphism turns a TokenSequence into a TrainingSet?
  2. Which three arrows turn a TokenId into a Distribution?
  3. Which two objects are paired before CrossEntropy can produce Loss?

Explain

Use the target token to explain why the loss boundary needs a product object.

  1. Why are Logits not the same object as Distribution?
  2. Why does CrossEntropy use the probability at target.index() instead of the largest probability in the distribution?
  3. Why does an out-of-range target error belong inside CrossEntropy?

Apply

Use the demo output and the numeric examples in this chapter.

  1. Given TokenId -> Vector -> Logits -> Distribution, write the Rust type that must appear between Embedding and Softmax.
  2. A distribution is [0.70, 0.20, 0.10] and the target token index is 1. Which probability should cross entropy use?
  3. A token sequence is [4, 9, 2]. Which adjacent training pairs should DatasetWindowing produce?

Debug

For each invalid shortcut, name the missing boundary or wrong object:

Logits -> Loss
Distribution -> Loss
using the maximum probability instead of the target probability

A strong answer should mention the exact typed path:

Logits -> Distribution
Distribution x TokenId -> Loss

The point is not to memorize the formula. The point is to know which object owns the probability invariant and which object selects the target probability.