Cover

Category Theory for Tiny ML in Rust

A practical bridge between compositional mathematics, Rust types, and tiny machine-learning systems

Working Draft · Public Feedback Edition

Coauthored by
Hamze Ghalebi
Farzad Jafarranmani

About This Book

Category Theory for Tiny ML in Rust is a working draft that develops a small, explicit machine-learning system through the lens of category theory and Rust.

The book is designed for readers who want to understand machine learning not only as numerical computation, but as a structured pipeline of objects, transformations, composition, and constraints.

Rather than treating category theory as decorative abstraction, this book uses it as an engineering tool:

domain objects become Rust types,
morphisms become typed transformations,
composition becomes executable program structure,
training becomes repeated transformation of model state,
and tiny ML systems become a way to make mathematical structure concrete.

This is not a finalized edition. Chapters, examples, terminology, diagrams, code, and references may evolve as the work continues.

Hamze Ghalebi is a Paris-based AI architect, CTO, and software builder associated with Remo Lab. His work focuses on production GenAI, regulated AI systems, auditable AI products, Rust systems, and the transition from AI prototypes to reliable production architectures.

His background includes advanced study at Institut Polytechnique de Paris across statistics, optimization, machine learning, artificial intelligence, distributed systems, cloud computing, and data science.

Hamze brings the engineering and product perspective of the book: how to turn mathematical and machine-learning ideas into understandable, typed, maintainable systems. His current work is especially concerned with AI systems that can be evaluated, monitored, audited, and kept under human accountability in real operational environments.

In this book, his role is to connect tiny ML, Rust implementation, and production-minded software architecture — because apparently making category theory executable was not ambitious enough already.

Farzad Jafarranmani

Farzad Jafarranmani is a researcher and engineer in the Paris area, associated with Huawei and the Lagrange Mathematics and Computing Research Center. His work sits at the intersection of mathematics, computer science, logic, semantics, proof theory, and category theory.

He holds a PhD in Mathematics and Computer Science from Université Paris Cité, where his doctoral work focused on fixpoints of types in linear logic from a Curry–Howard–Lambek perspective. He also studied Mathematics and Computer Science at ENS Paris-Saclay, with work including induction in fibred multicategories and denotational semantics of linear logic with least and greatest fixpoints.

His previous research experience includes postdoctoral work at LIP6, Laboratoire d’Informatique de Sorbonne Université / CNRS, as well as a visiting research position at the University of Cambridge.

Farzad brings the mathematical and theoretical foundation of the book: category theory, denotational semantics, proof theory, type-theoretic structure, and the discipline required to keep abstractions precise instead of merely fashionable.

Public Feedback

Public feedback is welcome while the book is still growing.

Useful feedback includes unclear explanations, broken examples, missing references, awkward terminology, incorrect or overloaded mathematical language, Rust examples that could be clearer or more idiomatic, and places where the connection between Rust, machine learning, and category theory should be made more explicit.

This edition is intentionally public before it is final.

Use With Reference

Original material from this book may be quoted, reused, adapted, or taught from when the source is clearly referenced and both coauthors are credited.

Suggested reference format:

Ghalebi, H., & Jafarranmani, F.
Category Theory for Tiny ML in Rust.
Working Draft, Public Feedback Edition.

External works cited or referenced by this book remain under their own licenses, terms, and attribution requirements.

Category Theory for Tiny ML in Rust

This course teaches category-theory ideas through a tiny Rust language-model pipeline.

The goal is not to memorize abstract vocabulary.

The goal is to connect each abstract word to a concrete machine-learning operation, a Rust type or trait, an invariant the code protects, and a command you can run.

The whole course follows one central idea:

A useful ML system is a chain of typed transformations.

What You Already Know

If you have written a Rust function, you already know the informal shape behind much of this book. A function receives a value of one type and returns a value of another type. If you have seen an ML pipeline, you already know that data moves through staged transformations. Category theory asks us to look at that movement structurally.

Worked Example: One Typed Transformation

Start with the smallest version:

#![allow(unused)]
fn main() {
fn token_to_position(token_id: usize) -> usize {
    token_id + 100
}

assert_eq!(token_to_position(3), 103);
}

Rust reads this as a function from usize to usize. The book’s real examples make the same movement safer by replacing raw usize values with named domain types such as TokenId, VocabSize, and ModelDimension.

In this repository, that chain is small enough to read completely:

raw text idea
  -> token ids
  -> token sequence
  -> next-token training pairs
  -> prediction distribution
  -> loss
  -> updated parameters

Rust gives those stages names.

Category theory gives those stages shapes.

Machine learning gives those stages a reason to exist.

The Explanation Standard

Every major chapter now explains code at four levels.

First, it says what problem a block solves.

Second, it places the block in the ML pipeline.

Third, it reads the Rust syntax directly.

Fourth, it explains the category-theory shape behind the code.

For example, when you see:

pub struct TokenSequence(Vec<TokenId>);

do not read it as only:

a struct containing a vector

Read it as:

a validated, owned, non-empty list of token IDs

That one type carries several meanings at once. In Rust, it is a private tuple struct wrapping Vec<TokenId>. In the ML pipeline, it is tokenized text before it becomes examples. At the API boundary, it prevents callers from constructing an empty sequence directly. Categorically, it behaves like a non-empty list-like object.

This is the level of reading used throughout the course.

Self-Check

Before continuing, explain this in your own words: what changes when a raw number becomes a named type such as TokenId?

Learning Contract

Use the same loop for every chapter. Start with the concrete problem, study the code block or source snapshot, translate each type into plain English, and then translate each method into the pipeline stage it serves. After that, run the chapter command and answer the checkpoint without looking back.

The chapters are deliberately repetitive in structure. That repetition is part of the learning design. You should start to recognize the same pattern:

raw representation
  -> validated domain object
  -> typed morphism
  -> composed pipeline
  -> tested law

Fast Start

From the repository root:

cargo run --bin category_ml

That command runs the full guided walkthrough.

You should see token IDs becoming training pairs, a prediction path built from embedding, linear projection, and softmax, cross entropy producing a loss, repeated training lowering the loss, and small examples for functors, naturality, monoids, and the chain rule.

The Main Picture

The tiny model is a chain of typed arrows:

TokenSequence -> TrainingSet
TokenId       -> Vector
Vector        -> Logits
Logits        -> Distribution
Distribution x TokenId -> Loss
Parameters    -> Parameters

The first line prepares examples.

The middle lines make predictions and measure error.

The last line updates the model.

The category-theory reading is:

objects + morphisms + composition + laws

The Rust reading is:

types + traits + smart constructors + tests

The ML reading is:

data + model + probabilities + loss + training

Reading Path

Read the chapters in order. The Course Map gives the whole pipeline shape. Domain Objects names the typed nouns, and Morphism and Composition names the typed arrows between them. The Tiny ML Pipeline turns those arrows into prediction and loss, while Training as an Endomorphism shows why repeated updates have the shape Parameters -> Parameters.

After the core pipeline, Functors, Naturality, Monoids, and Chain Rule introduces reusable structure, and Seven Sketches Through Rust widens the same style to applied category theory. The Exercises, Glossary, References, and Transformer Roadmap are there for practice, review, deeper reading, and the path toward attention.

What To Remember

The central discipline is:

Do not let raw values travel farther than they should.

A raw usize becomes TokenId.

A raw Vec<TokenId> becomes TokenSequence.

A raw Vec<f32> becomes Distribution only after probability validation.

A raw optimizer update becomes TrainStep, a typed endomorphism:

Parameters -> Parameters

The result is a small codebase where every concept has a name, every boundary has a type, and every composition has to make sense before Rust lets it run.

Where This Leaves Us

The welcome page sets the reading contract. You will see the same idea through three lenses: Rust syntax, tiny ML behavior, and category-theory shape. The next chapter gives the full map before the book starts reading individual source files.

Retrieval Practice

Recall

What is the central pipeline shape this book keeps returning to?

Explain

Why does the book connect every concept to Rust syntax, ML meaning, and category-theory shape?

Apply

Pick one raw value from the pipeline, such as a token index or probability vector. Give it a domain-type name and explain what confusion the name prevents.

Course Map

The problem this chapter solves is:

Before reading individual Rust files, you need one map of how the whole machine-learning pipeline, Rust type system, and category-theory vocabulary fit together.

The repository is small, but it contains several layers:

domain objects
  -> typed morphisms
  -> concrete ML morphisms
  -> training endomorphism
  -> reusable structure patterns
  -> applied category-theory sketches

This chapter is the index of those layers.

Reader orientation: The map is not a list of things to memorize. It is a promise about how the book will move: first name the values, then name the arrows, then compose the arrows into a tiny learning system.

What You Already Know

If you already read programs from top to bottom, you know how to follow a flow. If you know Rust function signatures, you know that each step has an input type and an output type. If you know ML pipelines, you know that raw data becomes features, predictions, loss, and updates. This chapter puts those familiar habits on one map.

The Whole Pipeline

The central pipeline is:

TokenSequence -> TrainingSet
TokenId       -> Vector
Vector        -> Logits
Logits        -> Distribution
Distribution x TokenId -> Loss
Parameters    -> Parameters

Read this as three stories at once.

In ML terms:

tokenized text
  -> prediction examples
  -> embeddings
  -> vocabulary scores
  -> probabilities
  -> error measurement
  -> updated weights

In Rust terms:

validated input types
  -> trait implementations
  -> explicit error handling
  -> private fields
  -> read-only accessors
  -> tests

In category-theory terms:

objects
  -> morphisms
  -> products
  -> composition
  -> endomorphisms
  -> laws

The course is about learning to see the same pipeline through all three views.

Worked Example: A Tiny Typed Movement

Here is the smallest Rust idea behind that map. A function has an input type and an output type:

#![allow(unused)]
fn main() {
fn token_to_vector_id(token_id: usize) -> usize {
    token_id + 100
}

assert_eq!(token_to_vector_id(7), 107);
}

The real code does not leave those values as raw usize forever. It gives each pipeline stage a domain type, then uses morphisms to make the connections explicit.

Self-Check

Before moving into the file map, explain why TokenId -> Vector is easier to reason about than usize -> Vec<f32>.

Code Map

Each Rust file owns one part of the idea.

`src/domain.rs`

This file defines the nouns.

The main examples are TokenId, TokenSequence, Vector, Logits, Distribution, Loss, TrainingSet, and Parameters.

The problem this file solves is:

Raw numbers are too ambiguous for a training pipeline.

For example, these are all machine numbers:

token index
vocabulary size
model dimension
loss value
learning rate

But they are not the same concept.

src/domain.rs gives each concept a separate type.

`src/category.rs`

This file defines the arrows.

The central trait is:

pub trait Morphism<Input, Output> {
    fn name(&self) -> &'static str;
    fn apply(&self, input: Input) -> CtResult<Output>;
}

This says:

A morphism is something that knows how to transform an Input into an Output, possibly failing with CtError.

The rest of the file defines identity, composition, endomorphism, and repeated application.

`src/ml.rs`

This file defines concrete ML arrows.

The main transformations are:

DatasetWindowing : TokenSequence -> TrainingSet
Embedding        : TokenId -> Vector
LinearToLogits   : Vector -> Logits
Softmax          : Logits -> Distribution
CrossEntropy     : Distribution x TokenId -> Loss

This file is where the abstract Morphism trait becomes a tiny learning system.

`src/training.rs`

This file defines:

TrainStep : Parameters -> Parameters

That shape is important.

Because the output type is the same as the input type, training can be repeated:

Parameters0 -> Parameters1 -> Parameters2 -> ... -> ParametersN

That is why training is taught as an endomorphism.

`src/structure.rs`

This file teaches reusable structure:

functor: map inside a wrapper
natural transformation: convert wrapper shape consistently
monoid: combine values with an empty value

These are not extra theory for decoration. They name patterns that appear in ordinary ML systems: batches, optional values, traces, logs, and composed workflows.

`src/calculus.rs`

This file shows the smallest useful backpropagation idea:

z = x * y
dL/dx = dL/dz * y
dL/dy = dL/dz * x

The code does not implement a full automatic differentiation engine. It gives you the local rule that larger systems compose.

`src/sketches.rs`

This file connects the course to seven applied category-theory themes: orders, resources, databases, co-design, signal flow, circuits, and behavior logic.

Each theme is represented as typed Rust values plus law-checking tests.

Guided Walkthrough Snapshot

The terminal demo is the spine of the course.

The problem this block solves is:

A learner should be able to run one command and see every major concept used once in a concrete order.

Source snapshot: src/demo.rs

use crate::calculus::{LocalGradient, MulOp, Scalar};
use crate::category::{Compose, StepCount, apply_endomorphism_n_times};
use crate::domain::{
    LearningRate, Logits, ModelDimension, Parameters, Product, TokenId, TokenSequence, Vector,
    VocabSize,
};
use crate::error::CtResult;
use crate::ml::{
    CrossEntropy, DatasetWindowing, Embedding, LinearToLogits, Softmax, average_loss,
    composed_prediction_matches_direct_prediction,
};
use crate::structure::{
    Functor, Monoid, OptionFunctor, PipelineTrace, TraceStep, VecFunctor,
    monoid_laws_hold_for_pipeline_trace, naturality_square_holds_for_first_option,
};
use crate::training::TrainStep;
use crate::{Identity, Morphism};

/// Run the full terminal walkthrough used by `cargo run --bin category_ml`.
pub fn run_demo() -> CtResult<()> {
    println!("Category theory concepts implemented in Rust 2024");
    println!("=================================================\n");

    let vocab = ["<pad>", "I", "love", "Rust", "."];
    let raw_text = TokenSequence::from_indices([1, 2, 3, 4, 1, 2, 3, 4])?;

    println!("1. Object examples");
    println!("   TokenId(1) means {:?}\n", vocab[1]);

    println!("2. Dataset morphism: TokenSequence -> TrainingSet");
    let dataset = DatasetWindowing.apply(raw_text)?;
    for example in dataset.examples() {
        println!(
            "   {:?} -> {:?}",
            vocab[example.first().index()],
            vocab[example.second().index()]
        );
    }
    println!();

    println!("3. Identity morphism: id_Vector : Vector -> Vector");
    let v = Vector::new(vec![1.0, 2.0, 3.0]);
    let same_v = Identity::<Vector>::new().apply(v.clone())?;
    println!("   input  = {:?}", v);
    println!("   output = {:?}\n", same_v);

    println!("4. Composition: Softmax after Linear after Embedding");
    let params = Parameters::init(VocabSize::new(vocab.len())?, ModelDimension::new(4)?);
    let embedding = Embedding::from_parameters(&params);
    let linear = LinearToLogits::from_parameters(&params);
    let token_to_logits = Compose::<_, _, Vector>::new(embedding, linear);
    let token_to_distribution = Compose::<_, _, Logits>::new(token_to_logits, Softmax);
    let distribution = token_to_distribution.apply(TokenId::new(1))?;
    println!("   P(next token | 'I') = {:?}\n", distribution.as_slice());

    println!("5. Product object: Prediction x Target -> Loss");
    let loss = CrossEntropy.apply(Product::new(distribution, TokenId::new(2)))?;
    println!("   loss for target 'love' = {:.6}\n", loss.value());

    println!("6. Endomorphism: TrainStep : Parameters -> Parameters");
    let before = average_loss(&params, &dataset)?;
    let train_step = TrainStep::new(dataset.clone(), LearningRate::new(1.0)?);
    let trained_params =
        apply_endomorphism_n_times(&train_step, params.clone(), StepCount::new(80))?;
    let after = average_loss(&trained_params, &dataset)?;
    println!("   average loss before training = {:.6}", before.value());
    println!("   average loss after  training = {:.6}\n", after.value());

    println!("7. Functor: fmap over Vec and Option");
    let xs = vec![1, 2, 3];
    let ys = VecFunctor::fmap(xs, |x| x * x);
    let maybe = OptionFunctor::fmap(Some(7), |x| x + 1);
    println!("   VecFunctor fmap square: {:?}", ys);
    println!("   OptionFunctor fmap +1: {:?}\n", maybe);

    println!("8. Natural transformation: Vec<A> -> Option<A>");
    println!(
        "   naturality square holds: {}\n",
        naturality_square_holds_for_first_option()
    );

    println!("9. Monoid: pipeline traces compose associatively with identity");
    let trace = PipelineTrace::from_steps(vec![TraceStep::new("embedding")])
        .combine(&PipelineTrace::from_steps(vec![TraceStep::new("linear")]))
        .combine(&PipelineTrace::from_steps(vec![TraceStep::new("softmax")]));
    println!("   trace = {:?}", trace.names());
    println!(
        "   monoid laws hold for this trace type: {}\n",
        monoid_laws_hold_for_pipeline_trace()
    );

    println!("10. Commutative diagram check");
    println!(
        "   composed prediction == direct prediction: {}\n",
        composed_prediction_matches_direct_prediction(&params)?
    );

    println!("11. Chain rule / local derivative morphism");
    let mul = MulOp;
    let x = Scalar::new(2.0)?;
    let y = Scalar::new(3.0)?;
    let z = mul.forward(x, y)?;
    let upstream = LocalGradient::new(1.0)?;
    let (dl_dx, dl_dy) = mul.backward(x, y, upstream)?;
    println!("   z = x * y = {}", z.value());
    println!(
        "   if dL/dz = {}, then dL/dx = {}, dL/dy = {}\n",
        upstream.value(),
        dl_dx.value(),
        dl_dy.value()
    );

    println!("Compressed categorical training view:");
    println!("   Dataset x Parameters -> Prediction -> Loss -> Gradients -> Updated Parameters");
    println!("   TrainStep is repeated as Parameters0 -> Parameters1 -> ... -> ParametersN");

    Ok(())
}

How To Read The Demo

The demo is not random output. It is a staged proof that the pieces connect.

Section 1 introduces an object:

TokenId(1)

Section 2 applies a data-preparation morphism:

TokenSequence -> TrainingSet

Section 3 applies identity:

Vector -> Vector

Section 4 composes prediction:

TokenId -> Vector -> Logits -> Distribution

Section 5 uses a product object:

Distribution x TokenId -> Loss

Section 6 repeats an endomorphism:

Parameters -> Parameters

Sections 7 through 11 add the structural patterns:

Functor
NaturalTransformation
Monoid
Commutative diagram check
Chain rule

So the demo is a miniature course outline in executable form.

Binary Entrypoint

The binary entrypoint is deliberately tiny:

Source snapshot: src/bin/category_ml.rs

fn main() -> category_theory_transformer_rs::CtResult<()> {
    category_theory_transformer_rs::run_demo()
}

The whole file is:

use category_theory_transformer_rs::run_demo;

fn main() {
    run_demo().unwrap();
}

Line by line:

use category_theory_transformer_rs::run_demo;

This imports the library function that owns the walkthrough.

fn main()

This is the process entrypoint. When you run the binary, Rust starts here.

run_demo().unwrap();

This runs the walkthrough and panics if it fails. In the library code, fallible work uses CtResult. The binary keeps the entrypoint short because the course focus is the library, not command-line error reporting.

First Run

Run:

cargo run --bin category_ml

You should see a tiny language-model pipeline and the loss decreasing after training.

The important part is not the exact floating-point numbers.

The important part is the shape:

before training: higher loss
after training:  lower loss

That means repeated TrainStep applications moved the parameters in a useful direction on the tiny dataset.

Core Mental Model

Every chapter after this one zooms into one row of the map.

Remember:

object = typed thing
morphism = typed transformation
composition = legal connection of transformations
endomorphism = transformation from a type back to itself
law = property the code checks so composition remains trustworthy

Checkpoint

Explain this line in your own words:

TokenId -> Vector -> Logits -> Distribution

A strong answer should mention token lookup, the embedding vector, vocabulary scores, the probability distribution, and the fact that the whole path is a composition of typed morphisms.

Where This Leaves Us

This chapter gave the whole shape before the details. You now know the names of the source files, the major pipeline objects, and the difference between objects, morphisms, composition, endomorphisms, and laws.

The next chapter slows down and studies the objects themselves. Before a pipeline can compose arrows safely, it needs values whose meanings are clear enough for arrows to start and end at them.

Retrieval Practice

Recall

Name the three readings used throughout the course.

Explain

Why does the course start with a whole-pipeline map before reading individual source files?

Apply

Write a one-line diagram for a pipeline you already know, then label the input object, arrow, and output object.

Domain Objects

The problem this chapter solves is:

A machine-learning pipeline should not pass raw numbers around and hope everyone remembers what each number means.

Before this code talks about arrows, composition, loss, or training, it defines the objects those arrows will connect.

In this course, a domain object means:

raw representation
  + a meaningful name
  + optional validation
  + controlled access

For example:

usize

could mean:

a token index
a vocabulary size
a model dimension
a matrix row count
a training step count

Those are different concepts.

So the code creates different types.

Reader orientation: In this chapter, focus on why a type exists before focusing on its syntax. A tuple struct, private field, constructor, or accessor is not decoration. It is a small boundary that tells the rest of the pipeline which states it may trust.

What You Already Know

If you have used a Rust struct, you already know that a value can carry a name instead of floating around as raw data. If you have used an ML pipeline, you already know that a token index, a vector, a probability distribution, and a loss value play different roles. This chapter turns that familiar separation into explicit domain types.

Worked Example: Naming One Number

The smallest version of the pattern looks like this:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TokenId(usize);

impl TokenId {
    fn new(index: usize) -> Self {
        Self(index)
    }

    fn index(self) -> usize {
        self.0
    }
}

assert_eq!(TokenId::new(3).index(), 3);
}

The real source file repeats that pattern with stronger validation where the value has an invariant, such as “a distribution must contain probabilities that sum to one.”

Self-Check

Before reading the full source snapshot, explain why TokenId(3) communicates more than the raw number 3.

Source Snapshot

This is the domain layer used by the whole tutorial.

Source snapshot: src/domain.rs

use crate::error::{CtError, CtResult};

/// A vocabulary index. It is intentionally not a raw `usize` in public APIs.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct TokenId(usize);

impl TokenId {
    pub fn new(index: usize) -> Self {
        Self(index)
    }

    pub fn index(&self) -> usize {
        self.0
    }
}

impl From<usize> for TokenId {
    fn from(value: usize) -> Self {
        Self::new(value)
    }
}

/// A sequence of tokens before it has been converted into training pairs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TokenSequence(Vec<TokenId>);

impl TokenSequence {
    pub fn new(tokens: impl IntoIterator<Item = TokenId>) -> CtResult<Self> {
        let tokens = tokens.into_iter().collect::<Vec<_>>();

        if tokens.is_empty() {
            return Err(CtError::EmptyInput("token sequence"));
        }

        Ok(Self(tokens))
    }

    pub fn from_indices(indices: impl IntoIterator<Item = usize>) -> CtResult<Self> {
        Self::new(indices.into_iter().map(TokenId::new))
    }

    pub fn as_slice(&self) -> &[TokenId] {
        &self.0
    }
}

/// A dense feature vector.
#[derive(Debug, Clone, PartialEq)]
pub struct Vector(Vec<f32>);

impl Vector {
    pub fn new(values: Vec<f32>) -> Self {
        Self(values)
    }

    pub fn as_slice(&self) -> &[f32] {
        &self.0
    }
}

/// Unnormalized model scores.
#[derive(Debug, Clone, PartialEq)]
pub struct Logits(Vec<f32>);

impl Logits {
    pub fn new(values: Vec<f32>) -> Self {
        Self(values)
    }

    pub fn as_slice(&self) -> &[f32] {
        &self.0
    }
}

/// A validated probability distribution.
#[derive(Debug, Clone, PartialEq)]
pub struct Distribution(Vec<f32>);

impl Distribution {
    pub fn new(probabilities: Vec<f32>) -> CtResult<Self> {
        if probabilities.is_empty() {
            return Err(CtError::EmptyInput("distribution"));
        }

        let sum: f32 = probabilities.iter().sum();
        let all_valid = probabilities
            .iter()
            .all(|probability| probability.is_finite() && *probability >= 0.0);

        if !all_valid || !approx_eq(sum, 1.0, 1e-4) {
            return Err(CtError::InvalidProbability("distribution constructor"));
        }

        Ok(Self(probabilities))
    }

    pub fn as_slice(&self) -> &[f32] {
        &self.0
    }
}

/// A non-negative scalar objective value.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Loss(f32);

impl Loss {
    pub fn new(value: f32) -> CtResult<Self> {
        if !value.is_finite() || value < 0.0 {
            return Err(CtError::InvalidLoss(value));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> f32 {
        self.0
    }
}

/// Number of vocabulary entries.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VocabSize(usize);

impl VocabSize {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::EmptyInput("vocabulary"));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Width of each embedding vector.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ModelDimension(usize);

impl ModelDimension {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::EmptyInput("model dimension"));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Positive optimizer step size.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct LearningRate(f32);

impl LearningRate {
    pub fn new(value: f32) -> CtResult<Self> {
        if !value.is_finite() || value <= 0.0 {
            return Err(CtError::InvalidLearningRate(value));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> f32 {
        self.0
    }
}

/// Categorical product object: `A x B`.
#[derive(Debug, Clone, PartialEq)]
pub struct Product<A, B> {
    first: A,
    second: B,
}

impl<A, B> Product<A, B> {
    pub fn new(first: A, second: B) -> Self {
        Self { first, second }
    }

    pub fn first(&self) -> &A {
        &self.first
    }

    pub fn second(&self) -> &B {
        &self.second
    }

    pub fn into_parts(self) -> (A, B) {
        (self.first, self.second)
    }
}

pub type TrainingExample = Product<TokenId, TokenId>;

/// Non-empty next-token training pairs.
#[derive(Debug, Clone, PartialEq)]
pub struct TrainingSet(Vec<TrainingExample>);

impl TrainingSet {
    pub fn new(examples: impl IntoIterator<Item = TrainingExample>) -> CtResult<Self> {
        let examples = examples.into_iter().collect::<Vec<_>>();

        if examples.is_empty() {
            return Err(CtError::EmptyInput("training set"));
        }

        Ok(Self(examples))
    }

    pub fn examples(&self) -> &[TrainingExample] {
        &self.0
    }

    pub fn len(&self) -> usize {
        self.0.len()
    }

    pub fn is_empty(&self) -> bool {
        self.0.is_empty()
    }
}

/// Tiny model parameters for an embedding plus language-model head.
#[derive(Debug, Clone, PartialEq)]
pub struct Parameters {
    pub(crate) embedding: Vec<Vec<f32>>,
    pub(crate) lm_head: Vec<Vec<f32>>,
    pub(crate) bias: Vec<f32>,
}

impl Parameters {
    pub fn init(vocab_size: VocabSize, d_model: ModelDimension) -> Self {
        let vocab_size = vocab_size.value();
        let d_model = d_model.value();

        Self {
            embedding: init_matrix(vocab_size, d_model, 0.2, 1),
            lm_head: init_matrix(d_model, vocab_size, 0.2, 2),
            bias: vec![0.0; vocab_size],
        }
    }

    pub fn vocab_size(&self) -> usize {
        self.bias.len()
    }

    pub fn d_model(&self) -> usize {
        self.embedding.first().map_or(0, Vec::len)
    }

    pub fn embedding_table(&self) -> &[Vec<f32>] {
        &self.embedding
    }

    pub fn lm_head(&self) -> &[Vec<f32>] {
        &self.lm_head
    }

    pub fn bias(&self) -> &[f32] {
        &self.bias
    }
}

pub(crate) fn init_matrix(rows: usize, cols: usize, scale: f32, seed: usize) -> Vec<Vec<f32>> {
    let mut out = vec![vec![0.0; cols]; rows];

    for (row_index, row) in out.iter_mut().enumerate() {
        for (col_index, value) in row.iter_mut().enumerate() {
            let raw = ((row_index * cols + col_index) * 37 + seed * 101) % 1000;
            let unit = raw as f32 / 1000.0;
            *value = (unit - 0.5) * scale;
        }
    }

    out
}

pub(crate) fn approx_eq(a: f32, b: f32, eps: f32) -> bool {
    (a - b).abs() <= eps
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn distribution_rejects_non_normalized_values() {
        let result = Distribution::new(vec![0.4, 0.4]);

        assert!(matches!(result, Err(CtError::InvalidProbability(_))));
    }

    #[test]
    fn token_sequence_rejects_empty_input() {
        let result = TokenSequence::new(vec![]);

        assert!(matches!(result, Err(CtError::EmptyInput("token sequence"))));
    }
}

The Whole File

src/domain.rs defines the nouns in the tiny ML system:

TokenId
TokenSequence
Vector
Logits
Distribution
Loss
VocabSize
ModelDimension
LearningRate
Product
TrainingExample
TrainingSet
Parameters

The ML pipeline needs all of them:

TokenSequence -> TrainingSet
TokenId       -> Vector
Vector        -> Logits
Logits        -> Distribution
Distribution x TokenId -> Loss
Parameters    -> Parameters

The category-theory reading is:

These are the objects that morphisms start from and end at.

The Rust reading is:

These are wrapper types that prevent raw representation from leaking through the whole program.

Each major block below is meant to be read through three lenses:

Rust syntax:
what does the code literally declare?

ML concept:
why does the training pipeline need this value?

Category theory concept:
what object, product, list, distribution, or morphism endpoint does it model?

`TokenId`

The problem this block solves is:

A token index should not be confused with any other usize.

The block:

/// A vocabulary index. It is intentionally not a raw `usize` in public APIs.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct TokenId(usize);

impl TokenId {
    pub fn new(index: usize) -> Self {
        Self(index)
    }

    pub fn index(&self) -> usize {
        self.0
    }
}

impl From<usize> for TokenId {
    fn from(value: usize) -> Self {
        Self::new(value)
    }
}

Rust Syntax

TokenId is a tuple struct with one private field:

pub struct TokenId(usize);

The struct is public, but the field is private.

That means other modules can name TokenId, pass it around, and call its methods, but they cannot directly reach inside and mutate the raw usize.

Why `new` Cannot Fail

pub fn new(index: usize) -> Self

Every usize is a valid token index at this layer.

The code does not know yet whether the token is inside a particular vocabulary. That check happens later when a morphism tries to look up an embedding row.

So TokenId::new is infallible.

Why `index` Exists

pub fn index(&self) -> usize {
    self.0
}

This accessor gives read-only access to the raw index when low-level code needs it.

The type still prevents accidental mixing at the API boundary.

ML Concept

In ML terms, TokenId is a vocabulary position.

If the vocabulary is:

0 = <pad>
1 = I
2 = love
3 = Rust
4 = .

then:

TokenId::new(3)

means the token Rust.

Category Theory Concept

TokenId is one object in the category of this program’s typed values.

Arrows such as Embedding start from this object:

TokenId -> Vector

`TokenSequence`

The problem this block solves is:

A language model does not train directly on raw text. First, text becomes a sequence of token IDs. Then that sequence becomes input-target training pairs.

This block represents the middle stage:

raw text
  -> tokens
  -> token sequence
  -> training examples
  -> model training

The block:

/// A sequence of tokens before it has been converted into training pairs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TokenSequence(Vec<TokenId>);

impl TokenSequence {
    pub fn new(tokens: impl IntoIterator<Item = TokenId>) -> CtResult<Self> {
        let tokens = tokens.into_iter().collect::<Vec<_>>();

        if tokens.is_empty() {
            return Err(CtError::EmptyInput("token sequence"));
        }

        Ok(Self(tokens))
    }

    pub fn from_indices(indices: impl IntoIterator<Item = usize>) -> CtResult<Self> {
        Self::new(indices.into_iter().map(TokenId::new))
    }

    pub fn as_slice(&self) -> &[TokenId] {
        &self.0
    }
}

Rust Syntax: Documentation Comment

/// A sequence of tokens before it has been converted into training pairs.

This tells you the pipeline stage.

TokenSequence is not raw text.

It is also not yet training data.

It is the ordered token stream before adjacent pairs are created.

Example:

[TokenId(1), TokenId(2), TokenId(3)]

can later become:

TokenId(1) -> TokenId(2)
TokenId(2) -> TokenId(3)

Rust Syntax: Derived Traits

#[derive(Debug, Clone, PartialEq, Eq)]

Debug allows test and debugging output.

Clone allows an explicit copy of the sequence.

PartialEq allows equality checks.

Eq says equality is total and well-behaved.

Order matters. These are not equal:

[TokenId(1), TokenId(2)]
[TokenId(2), TokenId(1)]

Rust Syntax: Private Vector

pub struct TokenSequence(Vec<TokenId>);

This wraps:

Vec<TokenId>

but does not expose the vector directly.

That is important because the type’s invariant is:

TokenSequence is non-empty.

If the field were public, a caller could construct:

TokenSequence(vec![])

and bypass validation.

The private field forces construction through TokenSequence::new or TokenSequence::from_indices.

Rust Syntax: Constructor

pub fn new(tokens: impl IntoIterator<Item = TokenId>) -> CtResult<Self>

This accepts any input that can produce TokenId values:

a vector
an array
a mapped iterator

The return type is:

CtResult<TokenSequence>

So construction can succeed or fail.

Rust Syntax: Collection

let tokens = tokens.into_iter().collect::<Vec<_>>();

This turns the flexible input into the concrete representation stored inside the struct.

The _ means Rust infers the element type as TokenId.

Rust Syntax: Empty Check

if tokens.is_empty() {
    return Err(CtError::EmptyInput("token sequence"));
}

This is the invariant boundary.

An empty token stream cannot carry useful sequence information.

The error happens immediately, before invalid data enters the rest of the pipeline.

Rust Syntax: Successful Construction

Ok(Self(tokens))

Inside the impl, Self means TokenSequence.

So this is equivalent to:

Ok(TokenSequence(tokens))

The vector has already been validated, so the object is safe for later code to trust.

Rust Syntax: Convenience Constructor

pub fn from_indices(indices: impl IntoIterator<Item = usize>) -> CtResult<Self> {
    Self::new(indices.into_iter().map(TokenId::new))
}

This accepts raw indices and converts each one into TokenId.

The important design choice is delegation:

from_indices -> new

Validation is not duplicated.

All construction still passes through the same non-empty check.

Rust Syntax: Read-Only Access

pub fn as_slice(&self) -> &[TokenId] {
    &self.0
}

This returns a borrowed slice.

Callers can inspect the sequence, but they cannot clear it, push to it, or replace the internal vector.

That preserves the invariant after construction.

ML Concept

TokenSequence is tokenized text before next-token examples are created.

A sequence of length n can produce n - 1 adjacent prediction pairs.

Category Theory Concept

TokenSequence behaves like:

List+ TokenId

where List+ means a non-empty finite list.

Its constructor is not:

List TokenId -> TokenSequence

because the empty list is invalid.

It is:

List TokenId -> Result TokenSequence CtError

Rust turns the partial construction into a total function by using Result.

`Vector` and `Logits`

The problem these blocks solve is:

A dense hidden vector and raw vocabulary scores are both Vec<f32>, but they do not mean the same thing.

The blocks:

#[derive(Debug, Clone, PartialEq)]
pub struct Vector(Vec<f32>);

impl Vector {
    pub fn new(values: Vec<f32>) -> Self {
        Self(values)
    }

    pub fn as_slice(&self) -> &[f32] {
        &self.0
    }
}

#[derive(Debug, Clone, PartialEq)]
pub struct Logits(Vec<f32>);

impl Logits {
    pub fn new(values: Vec<f32>) -> Self {
        Self(values)
    }

    pub fn as_slice(&self) -> &[f32] {
        &self.0
    }
}

Rust Syntax

Vector means hidden features.

Logits means unnormalized scores.

Both wrap Vec<f32>.

The distinction matters because only this arrow should produce logits:

Vector -> Logits

and only this arrow should normalize logits:

Logits -> Distribution

If both were plain Vec<f32>, the compiler could not help keep those stages separate.

These types derive PartialEq, but not Eq, because they contain f32.

Floating-point values do not have total equality because NaN is not equal to itself.

ML Concept

A Vector is the dense representation used after embedding lookup.

Example:

TokenId(3) -> [0.12, -0.44, 0.88, 0.03]

Logits are raw vocabulary scores.

Example:

[3.0, 1.0, -2.0]

They can be negative, larger than one, and do not need to sum to one.

The pipeline is:

TokenId -> Vector -> Logits -> Distribution

Category Theory Concept

If the model dimension is d, a vector lives in a vector-space-like object:

R^d

If the vocabulary size is V, logits live in:

R^V

The output projection is an arrow:

R^d -> R^V

and softmax maps:

R^V -> probability distributions over TokenId

`Distribution`

The problem this block solves is:

Probabilities are not just floats. A probability distribution must be non-empty, finite, non-negative, and sum to one.

The core block:

#[derive(Debug, Clone, PartialEq)]
pub struct Distribution(Vec<f32>);

impl Distribution {
    pub fn new(probabilities: Vec<f32>) -> CtResult<Self> {
        if probabilities.is_empty() {
            return Err(CtError::EmptyInput("distribution"));
        }

        let sum: f32 = probabilities.iter().sum();
        let all_valid = probabilities
            .iter()
            .all(|probability| probability.is_finite() && *probability >= 0.0);

        if !all_valid || !approx_eq(sum, 1.0, 1e-4) {
            return Err(CtError::InvalidProbability("distribution constructor"));
        }

        Ok(Self(probabilities))
    }
}

Rust Syntax: Why Construction Can Fail

This is invalid:

[]

This is invalid:

[0.4, 0.4]

because it sums to 0.8, not 1.0.

This is invalid:

[1.2, -0.2]

because probabilities cannot be negative.

So Distribution::new returns CtResult<Self>.

Rust Syntax: The Sum Check

let sum: f32 = probabilities.iter().sum();

This computes the total probability mass.

The code uses approximate equality:

approx_eq(sum, 1.0, 1e-4)

because floating-point arithmetic is not exact.

ML Concept

This is the output of softmax:

Logits -> Distribution

The rest of the model can treat a Distribution as real probabilities because the constructor checked the rule.

Category Theory Concept

Distribution is an object with a stronger invariant than Vec<f32>.

The softmax morphism lands in this object only if it can produce valid probability mass.

`Loss`

The problem this block solves is:

A loss value must be a real, non-negative scalar.

The block:

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Loss(f32);

impl Loss {
    pub fn new(value: f32) -> CtResult<Self> {
        if !value.is_finite() || value < 0.0 {
            return Err(CtError::InvalidLoss(value));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> f32 {
        self.0
    }
}

Rust Syntax

Loss::new rejects:

infinity
not-a-number values
negative values

Cross entropy should not produce a negative loss. If it does, something has gone wrong before or during loss calculation.

Loss derives Copy because it wraps one small scalar.

Calling value() returns the raw f32 for printing, comparison, or averaging.

ML Concept

Loss is the training signal.

For next-token prediction:

loss = -log(probability assigned to the correct token)

Lower loss means the model assigned more probability to the correct answer.

Training tries to reduce this value.

Category Theory Concept

Loss is the codomain of an evaluation morphism:

Distribution x TokenId -> Loss

It maps prediction plus truth into a non-negative scalar objective.

Shape and Training Hyperparameter Types

The problem these blocks solve is:

Dimensions and learning rates need boundary checks before they are used to allocate matrices or update parameters.

The types are:

VocabSize
ModelDimension
LearningRate

Rust Syntax

VocabSize::new(0) fails because a vocabulary with zero entries is unusable.

ModelDimension::new(0) fails because an embedding vector with zero width cannot carry features.

LearningRate::new(value) fails when the value is not finite or is not positive.

These checks keep bad configuration from becoming strange matrix behavior later.

ML Concept

VocabSize controls:

embedding rows
logit length
distribution length
bias length

ModelDimension controls embedding width:

R^d

LearningRate controls optimizer step size:

parameter = parameter - learning_rate * gradient

Category Theory Concept

VocabSize describes the cardinality of the finite token object.

ModelDimension chooses the intermediate vector-space-like object.

LearningRate chooses one update morphism from a family of training endomorphisms.

`Product<A, B>`

The problem this block solves is:

Some ML operations need two inputs that belong together.

The block:

#[derive(Debug, Clone, PartialEq)]
pub struct Product<A, B> {
    first: A,
    second: B,
}

This is a generic pair.

It is used in two important places:

pub type TrainingExample = Product<TokenId, TokenId>;

and:

Product<Distribution, TokenId> -> Loss

Rust Syntax: Why Not A Tuple Everywhere?

Rust tuples like (A, B) would work mechanically.

Product<A, B> makes the category-theory idea visible:

A x B

It also gives named methods:

first()
second()
into_parts()

Those methods make call sites easier to read during the course.

ML Concept

Product<TokenId, TokenId> is one supervised next-token example:

input token x target token

Product<Distribution, TokenId> is the input to cross entropy:

prediction x target

Category Theory Concept

Product<A, B> is the course’s named version of:

A x B

The accessors are projection-like operations:

first  ~ pi_1
second ~ pi_2

`TrainingSet`

The problem this block solves is:

Training should not run on an empty collection of examples.

The block:

#[derive(Debug, Clone, PartialEq)]
pub struct TrainingSet(Vec<TrainingExample>);

impl TrainingSet {
    pub fn new(examples: impl IntoIterator<Item = TrainingExample>) -> CtResult<Self> {
        let examples = examples.into_iter().collect::<Vec<_>>();

        if examples.is_empty() {
            return Err(CtError::EmptyInput("training set"));
        }

        Ok(Self(examples))
    }
}

This mirrors TokenSequence.

The internal vector is private.

Construction validates non-emptiness.

Callers get read-only access through:

pub fn examples(&self) -> &[TrainingExample]

Rust Syntax: Why `is_empty` Exists If Empty Is Impossible

TrainingSet includes:

pub fn is_empty(&self) -> bool {
    self.0.is_empty()
}

For values constructed through TrainingSet::new, this should always be false.

The method exists because collection-like types conventionally expose both len and is_empty, and tests or generic code may use it.

The invariant is still protected by private storage and the constructor.

ML Concept

A TrainingSet is a non-empty list of next-token examples.

For:

[10, 25, 31, 7]

the training set is:

(10, 25)
(25, 31)
(31, 7)

Category Theory Concept

The shape is:

non-empty list of (TokenId x TokenId)

or:

List+ (TokenId x TokenId)

`Parameters`

The problem this block solves is:

Training needs one object that owns all trainable model state.

The block:

#[derive(Debug, Clone, PartialEq)]
pub struct Parameters {
    pub(crate) embedding: Vec<Vec<f32>>,
    pub(crate) lm_head: Vec<Vec<f32>>,
    pub(crate) bias: Vec<f32>,
}

The model has three pieces:

embedding table
lm head matrix
bias vector

The fields are pub(crate), not fully public.

That means code inside this crate can update parameters during training, but external callers use accessors.

Rust Syntax: Initialization

pub fn init(vocab_size: VocabSize, d_model: ModelDimension) -> Self

This takes validated domain values, not raw usize.

That means matrix allocation starts from:

non-empty vocabulary
positive model dimension

The initialized shapes are:

embedding: vocab_size x d_model
lm_head:   d_model x vocab_size
bias:      vocab_size

ML Concept

Parameters is the trainable state.

Prediction reads it.

Training maps it back to a new Parameters value:

Parameters -> Parameters

Category Theory Concept

Parameters is the object of the training endomorphism.

The important point is not that the numbers change.

The important point is that the type remains the same.

Utility Functions

The file ends with:

pub(crate) fn init_matrix(...)
pub(crate) fn approx_eq(...)

init_matrix is local deterministic initialization for the teaching model.

approx_eq is a small floating-point helper used by probability checks and composition tests.

Both are crate-internal implementation details, not learner-facing domain objects.

Runnable Example

The domain example shows token IDs becoming training pairs:

Source snapshot: examples/01_domain_objects.rs

use category_theory_transformer_rs::{CtResult, DatasetWindowing, Morphism, TokenSequence};

fn main() -> CtResult<()> {
    let tokens = TokenSequence::from_indices([1, 2, 3, 4])?;
    let dataset = DatasetWindowing.apply(tokens)?;

    println!("training pairs:");
    for example in dataset.examples() {
        println!(
            "{} -> {}",
            example.first().index(),
            example.second().index()
        );
    }

    Ok(())
}

Run:

cargo run --example 01_domain_objects

Expected shape:

training pairs:
1 -> 2
2 -> 3
3 -> 4

Why This Matters

The main design rule is:

Use raw primitives only at the edge where they are created or displayed.

After that, use domain types.

This prevents mistakes like:

passing a model dimension where a token ID was expected
passing logits where probabilities were expected
training on an empty dataset
using a negative learning rate

Core Mental Model

src/domain.rs turns raw storage into trustworthy objects.

In Rust terms:

private fields + smart constructors + accessors

In ML terms:

tokens, vectors, probabilities, loss, and model weights

In category-theory terms:

objects that morphisms can safely connect

Checkpoint

Pick one type from this file and explain:

What raw representation it wraps.
What invalid state it prevents.
Which morphism consumes or produces it.

Example:

Distribution wraps Vec<f32>, rejects invalid probability mass, and is produced
by Softmax before CrossEntropy consumes it.

Where This Leaves Us

This chapter gave names to the values in the system. A token id is not a model dimension, logits are not probabilities, and a training set is not just any vector of pairs. Each type marks a stage where raw storage becomes a meaningful object.

The next chapter adds arrows between those objects. Once the arrows exist, the book can talk about identity, composition, and repeated transformations without falling back to loose wiring conventions.

Retrieval Practice

Recall

What is a domain object in this book?

Explain

Why does Distribution::new validate probability mass at construction time instead of leaving that check to CrossEntropy?

Apply

Design a one-field newtype for a future SequenceLength. State one invariant its constructor should protect.

Morphism and Composition

The problem this chapter solves is:

Once the system has typed objects, it needs typed transformations between them.

In the previous chapter, the code created objects such as:

TokenId
Vector
Logits
Distribution
Loss
Parameters

This chapter explains the arrows that connect them.

The central category-theory sentence is:

A morphism is a typed arrow from one object to another.

The central Rust sentence is:

A morphism is a trait implementation with an input type, output type, and typed error result.

Reader orientation: The previous chapter defined the objects of the tiny ML system. This chapter explains how values move between those objects. That movement is the bridge between ordinary Rust functions and the categorical idea of morphisms.

What You Already Know

If you know Rust functions, you already know that computation moves from an input type to an output type. If you know ML pipelines, you already know that a prediction path is built from stages. This chapter gives that familiar movement a shared interface: Morphism<Input, Output>.

Source Snapshot

This file defines the typed arrow interface and the composition adapter.

Source snapshot: src/category.rs

use std::marker::PhantomData;

use crate::error::CtResult;

/// A typed category-theory arrow: `Input -> Output`.
pub trait Morphism<Input, Output> {
    fn name(&self) -> &'static str;
    fn apply(&self, input: Input) -> CtResult<Output>;
}

/// Identity morphism: `id_A : A -> A`.
#[derive(Debug, Clone, Copy)]
pub struct Identity<T> {
    _marker: PhantomData<T>,
}

impl<T> Identity<T> {
    pub fn new() -> Self {
        Self {
            _marker: PhantomData,
        }
    }
}

impl<T> Default for Identity<T> {
    fn default() -> Self {
        Self::new()
    }
}

impl<T> Morphism<T, T> for Identity<T> {
    fn name(&self) -> &'static str {
        "identity"
    }

    fn apply(&self, input: T) -> CtResult<T> {
        Ok(input)
    }
}

/// Composition of two morphisms: if `f : A -> B` and `g : B -> C`, this is
/// `g after f : A -> C`.
#[derive(Debug, Clone)]
pub struct Compose<F, G, Middle> {
    first: F,
    second: G,
    _middle: PhantomData<Middle>,
}

impl<F, G, Middle> Compose<F, G, Middle> {
    pub fn new(first: F, second: G) -> Self {
        Self {
            first,
            second,
            _middle: PhantomData,
        }
    }
}

impl<Input, Middle, Output, F, G> Morphism<Input, Output> for Compose<F, G, Middle>
where
    F: Morphism<Input, Middle>,
    G: Morphism<Middle, Output>,
{
    fn name(&self) -> &'static str {
        "composition"
    }

    fn apply(&self, input: Input) -> CtResult<Output> {
        let middle = self.first.apply(input)?;
        self.second.apply(middle)
    }
}

/// Endomorphism: a morphism from a type back to itself.
pub trait Endomorphism<T>: Morphism<T, T> {}

impl<T, M> Endomorphism<T> for M where M: Morphism<T, T> {}

/// How many times to repeat an endomorphism.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StepCount(usize);

impl StepCount {
    pub fn new(value: usize) -> Self {
        Self(value)
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Repeatedly apply an endomorphism: `A0 -> A1 -> ... -> An`.
pub fn apply_endomorphism_n_times<T, E>(endo: &E, mut value: T, count: StepCount) -> CtResult<T>
where
    E: Endomorphism<T>,
{
    for _ in 0..count.value() {
        value = endo.apply(value)?;
    }

    Ok(value)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn identity_returns_the_same_value() -> CtResult<()> {
        let value = String::from("same");

        assert_eq!(Identity::<String>::new().apply(value.clone())?, value);
        Ok(())
    }
}

The Whole File

src/category.rs defines:

Morphism<Input, Output>
Identity<T>
Compose<F, G, Middle>
Endomorphism<T>
StepCount
apply_endomorphism_n_times

These are the abstract shapes used by the ML code.

Without this file, prediction could still be written as ordinary functions.

With this file, the course can name and test the structure:

identity
composition
endomorphism
repeated application

Read each block through the same three lenses:

Rust syntax:
what trait, struct, generic parameter, or bound is declared?

ML concept:
which model pipeline behavior does the shape support?

Category theory concept:
which arrow, identity, composition, or endomorphism idea is being modeled?

Worked Example: A Function As An Arrow

Before reading the generic trait, start with an ordinary Rust function:

#![allow(unused)]
fn main() {
fn add_one(input: i32) -> i32 {
    input + 1
}

assert_eq!(add_one(41), 42);
}

That function already has an arrow shape:

i32 -> i32

The real Morphism<Input, Output> trait makes that shape explicit, gives the arrow a name, and lets the arrow fail with a typed error when the input cannot be transformed safely.

Self-Check

Before reading the trait, explain why i32 -> i32 and TokenId -> Vector have the same arrow shape even though they mean very different things.

`Morphism<Input, Output>`

The problem this block solves is:

The code needs one shared contract for typed transformations.

The block:

/// A typed category-theory arrow: `Input -> Output`.
pub trait Morphism<Input, Output> {
    fn name(&self) -> &'static str;
    fn apply(&self, input: Input) -> CtResult<Output>;
}

Rust Syntax: Documentation Comment

/// A typed category-theory arrow: `Input -> Output`.

This tells you how to read the trait.

For example:

Embedding : TokenId -> Vector

means:

impl Morphism<TokenId, Vector> for Embedding

Rust Syntax: Trait Definition

pub trait Morphism<Input, Output>

Input and Output are type parameters.

They are not values.

They describe the type-level shape of the arrow.

This allows the same trait to model:

TokenSequence -> TrainingSet
TokenId -> Vector
Vector -> Logits
Logits -> Distribution
Distribution x TokenId -> Loss
Parameters -> Parameters

Rust Syntax: `name`

fn name(&self) -> &'static str;

This gives a stable human-readable name.

It is useful for demonstrations, diagnostics, and teaching.

The return type &'static str means the string is known for the whole program lifetime. Names such as "softmax" and "embedding" are static literals.

Rust Syntax: `apply`

fn apply(&self, input: Input) -> CtResult<Output>;

This is the actual transformation.

It consumes an Input and returns either:

Ok(Output)

or:

Err(CtError)

This is important because many arrows can fail. Embedding can receive an out-of-range token, softmax can receive empty logits, cross entropy can receive an invalid target, and training can receive malformed parameters. The shared return type keeps those failures explicit instead of hiding them behind a panic.

ML Concept

Every ML stage becomes an implementation of the same contract.

That makes the pipeline inspectable as arrows, not just function calls.

Category Theory Concept

This trait is the course’s concrete model of a morphism.

It is not trying to implement all category theory. It gives enough structure to talk about typed arrows and composition in ordinary Rust.

`Identity<T>`

The problem this block solves is:

Every object should have an arrow that returns the object unchanged.

The block:

/// Identity morphism: `id_A : A -> A`.
#[derive(Debug, Clone, Copy)]
pub struct Identity<T> {
    _marker: PhantomData<T>,
}

Rust Syntax: Why The Struct Has No Real Data

Identity<T> does not need to store a T.

It only needs to remember the type T.

That is why it stores:

_marker: PhantomData<T>

PhantomData<T> tells Rust:

This struct is logically connected to T, even though it does not own a real T value.

Rust Syntax: Constructor

pub fn new() -> Self {
    Self {
        _marker: PhantomData,
    }
}

This creates the identity arrow for a type.

Example:

Identity::<Vector>::new()

means:

id_Vector : Vector -> Vector

Rust Syntax: Default

impl<T> Default for Identity<T> {
    fn default() -> Self {
        Self::new()
    }
}

This follows Rust convention: if a type has an obvious empty constructor, it can implement Default.

Rust Syntax: Morphism Implementation

impl<T> Morphism<T, T> for Identity<T> {
    fn name(&self) -> &'static str {
        "identity"
    }

    fn apply(&self, input: T) -> CtResult<T> {
        Ok(input)
    }
}

This is the key:

T -> T

The input and output type are the same.

The implementation simply returns the input.

ML Concept

Identity is a no-op transformation.

In a model pipeline, no-op stages are useful for tests and for understanding what it means for composition to have a neutral element.

Category Theory Concept

Identity matters because composition has laws:

id after f = f
f after id = f

This code does not prove those laws generally, but it gives the object you need to talk about them in Rust.

`Compose<F, G, Middle>`

The problem this block solves is:

If one morphism produces the type another morphism consumes, the code should be able to build a larger morphism.

The block:

/// Composition of two morphisms: if `f : A -> B` and `g : B -> C`, this is
/// `g after f : A -> C`.
#[derive(Debug, Clone)]
pub struct Compose<F, G, Middle> {
    first: F,
    second: G,
    _middle: PhantomData<Middle>,
}

Rust Syntax: The Shape

The category-theory shape is:

f : A -> B
g : B -> C
g after f : A -> C

The Rust type is:

Compose<F, G, Middle>

where:

F is the first morphism
G is the second morphism
Middle is the bridge type

The middle type is explicit because Rust needs to know what connects the two arrows.

Rust Syntax: Fields

first: F,
second: G,
_middle: PhantomData<Middle>,

first stores the first arrow.

second stores the second arrow.

_middle records the bridge type without storing a value of that type.

Rust Syntax: Constructor

pub fn new(first: F, second: G) -> Self

This builds the composed morphism.

It does not run the morphisms yet.

It only stores them.

Rust Syntax: Morphism Implementation

impl<Input, Middle, Output, F, G> Morphism<Input, Output>
    for Compose<F, G, Middle>
where
    F: Morphism<Input, Middle>,
    G: Morphism<Middle, Output>,
{
    fn apply(&self, input: Input) -> CtResult<Output> {
        let middle = self.first.apply(input)?;
        self.second.apply(middle)
    }
}

This is the most important block in the chapter.

The where clause says:

F must be Input -> Middle
G must be Middle -> Output

Only then can Compose<F, G, Middle> be:

Input -> Output

Rust Syntax: The `?` Operator

let middle = self.first.apply(input)?;

This applies the first arrow.

If it fails, the error returns immediately.

If it succeeds, the successful value is bound to middle.

Then the second arrow runs:

self.second.apply(middle)

So composition preserves failure.

It does not hide invalid states.

ML Concept

Prediction uses composition:

TokenId -> Vector -> Logits -> Distribution

The code builds that in two steps:

let token_to_logits = Compose::<_, _, Vector>::new(embedding, linear);
let token_to_distribution = Compose::<_, _, Logits>::new(token_to_logits, Softmax);

The bridge types are:

Vector
Logits

If you try to compose Embedding directly with Softmax, the middle type does not match:

Embedding : TokenId -> Vector
Softmax   : Logits -> Distribution

Vector is not Logits, so Rust rejects the composition.

Category Theory Concept

Compose is function composition with types made explicit.

It is the course’s main example of:

small legal arrows -> larger legal arrow

`Endomorphism<T>`

The problem this block solves is:

Some arrows start and end at the same type, and those arrows can be repeated.

The block:

/// Endomorphism: a morphism from a type back to itself.
pub trait Endomorphism<T>: Morphism<T, T> {}

impl<T, M> Endomorphism<T> for M where M: Morphism<T, T> {}

An endomorphism has shape:

T -> T

The trait has no methods of its own.

It is a marker trait:

if something implements Morphism<T, T>, it is an Endomorphism<T>

The blanket implementation says exactly that:

impl<T, M> Endomorphism<T> for M where M: Morphism<T, T> {}

ML Concept

Training has this shape:

Parameters -> Parameters

One training step consumes parameters and returns updated parameters.

The model changes, but the type stays the same.

Category Theory Concept

Endomorphisms are important because they can be iterated:

A -> A -> A -> A

That is the categorical shape of repeated training.

`StepCount`

The problem this block solves is:

Repetition count should have a semantic name instead of being a random usize at the call site.

The block:

/// How many times to repeat an endomorphism.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StepCount(usize);

This wraps a raw usize.

It means:

number of repeated endomorphism applications

StepCount::new(80) reads better than a bare 80 because it names the role of the number.

Rust Syntax

StepCount is a newtype around usize.

It has a constructor and a value() accessor.

ML Concept

It controls how many optimizer steps are applied.

Category Theory Concept

It controls how many times an endomorphism is iterated.

`apply_endomorphism_n_times`

The problem this block solves is:

Given an endomorphism, repeatedly apply it in a type-safe loop.

The block:

pub fn apply_endomorphism_n_times<T, E>(
    endo: &E,
    mut value: T,
    count: StepCount,
) -> CtResult<T>
where
    E: Endomorphism<T>,
{
    for _ in 0..count.value() {
        value = endo.apply(value)?;
    }

    Ok(value)
}

Rust Syntax: Type Parameters

T is the object being updated.

E is the endomorphism type.

The bound:

E: Endomorphism<T>

means:

E must be a T -> T arrow

Rust Syntax: Mutable Value

mut value: T

The function owns the current value.

Each loop iteration replaces it with the next value:

value = endo.apply(value)?;

This is not mutation of shared global state.

It is ownership passing through a repeated transformation.

Rust Syntax: Failure Behavior

If any application fails, the whole repeated process fails immediately.

This is the correct behavior for training too: if a step discovers invalid parameters or an out-of-range token, the loop should not pretend everything is fine.

ML Concept

For training:

T = Parameters
E = TrainStep

The function becomes:

repeat TrainStep on Parameters

Category Theory Concept

This is iteration of an endomorphism:

value0
  -> value1
  -> value2
  -> ...
  -> valueN

Runnable Example

The composition example builds:

TokenId -> Vector -> Logits -> Distribution

Source snapshot: examples/02_morphism_composition.rs

use category_theory_transformer_rs::{
    Compose, CtResult, Embedding, LinearToLogits, Logits, ModelDimension, Morphism, Parameters,
    Softmax, TokenId, Vector, VocabSize,
};

fn main() -> CtResult<()> {
    let params = Parameters::init(VocabSize::new(5)?, ModelDimension::new(4)?);

    let token_to_logits = Compose::<_, _, Vector>::new(
        Embedding::from_parameters(&params),
        LinearToLogits::from_parameters(&params),
    );
    let token_to_distribution = Compose::<_, _, Logits>::new(token_to_logits, Softmax);

    let distribution = token_to_distribution.apply(TokenId::new(1))?;

    println!("next-token probabilities: {:?}", distribution.as_slice());

    Ok(())
}

Run:

cargo run --example 02_morphism_composition

Why This API Is Good Design

The code does not make composition a loose runtime convention.

It puts composition into the type system.

That means the compiler checks the bridge type:

F output == G input

This is the core practical value of the category-theory framing in this repo.

It turns:

remember to wire the stages correctly

into:

make invalid wiring fail to compile

Core Mental Model

In Rust terms:

Morphism<Input, Output> = fallible typed transformation
Compose<F, G, Middle> = legal connection of two transformations
Endomorphism<T> = repeatable T -> T transformation

In ML terms:

small prediction stages compose into a model path
training is a repeatable update step

In category-theory terms:

objects are connected by arrows, arrows compose when their endpoints match

Checkpoint

Why does this composition compile:

TokenId -> Vector -> Logits

but this one does not:

TokenId -> Vector -> Distribution

A strong answer should mention that Softmax expects Logits, not Vector.

Where This Leaves Us

This chapter turned ordinary transformations into named arrows. Identity<T> leaves a value unchanged, Compose<F, G, Middle> connects compatible arrows, and Endomorphism<T> names the special case where the input and output object are the same.

The next chapter fills those arrow shapes with concrete ML behavior: token windowing, embedding lookup, linear projection, softmax, and cross entropy.

Retrieval Practice

Recall

What does Morphism<Input, Output> require an implementation to provide?

Explain

Why does Compose<F, G, Middle> need the middle type to match?

Apply

Write a diagram for the legal path from TokenId to Distribution, naming the middle objects.

The Tiny ML Pipeline

The problem this chapter solves is:

The abstract Morphism trait needs concrete machine-learning arrows that turn token data into predictions and loss.

The whole prediction-and-loss path is:

TokenSequence -> TrainingSet
TokenId       -> Vector
Vector        -> Logits
Logits        -> Distribution
Distribution x TokenId -> Loss

In ordinary ML language, the path turns a token stream into adjacent training pairs, looks up an embedding vector for the current token, uses a linear layer to score every possible next token, normalizes those scores with softmax, and then measures surprise with cross entropy.

In category-theory language:

Each stage is a morphism, and the legal stages compose.

Reader orientation: This is the first chapter where all three subjects meet at once. When the code feels dense, follow the pipeline order: data preparation first, prediction second, loss third.

What You Already Know

If you know ML, you already know the rough path: prepare data, make a prediction, and measure the error. If you know Rust, you already know that each step can have a concrete input and output type. This chapter combines those two habits by making each ML step implement the same morphism interface.

Source Snapshot

This file owns the concrete ML arrows.

Source snapshot: src/ml.rs

use crate::category::{Compose, Morphism};
use crate::domain::{
    Distribution, Logits, Loss, Parameters, Product, TokenId, TokenSequence, TrainingSet, Vector,
    approx_eq,
};
use crate::error::{CtError, CtResult};

/// Turns adjacent tokens into next-token training examples.
#[derive(Debug, Clone)]
pub struct DatasetWindowing;

impl Morphism<TokenSequence, TrainingSet> for DatasetWindowing {
    fn name(&self) -> &'static str {
        "dataset_windowing"
    }

    fn apply(&self, tokens: TokenSequence) -> CtResult<TrainingSet> {
        if tokens.as_slice().len() < 2 {
            return Err(CtError::EmptyInput(
                "dataset windowing requires at least 2 tokens",
            ));
        }

        TrainingSet::new(
            tokens
                .as_slice()
                .windows(2)
                .map(|pair| Product::new(pair[0], pair[1])),
        )
    }
}

/// Morphism from token id to embedding vector.
#[derive(Debug, Clone)]
pub struct Embedding {
    table: Vec<Vec<f32>>,
}

impl Embedding {
    pub fn from_parameters(params: &Parameters) -> Self {
        Self {
            table: params.embedding_table().to_vec(),
        }
    }
}

impl Morphism<TokenId, Vector> for Embedding {
    fn name(&self) -> &'static str {
        "embedding"
    }

    fn apply(&self, token: TokenId) -> CtResult<Vector> {
        let Some(row) = self.table.get(token.index()) else {
            return Err(CtError::OutOfRange {
                kind: "token",
                index: token.index(),
                limit: self.table.len(),
            });
        };

        Ok(Vector::new(row.clone()))
    }
}

/// Linear projection from hidden vector to vocabulary logits.
#[derive(Debug, Clone)]
pub struct LinearToLogits {
    weight: Vec<Vec<f32>>,
    bias: Vec<f32>,
}

impl LinearToLogits {
    pub fn from_parameters(params: &Parameters) -> Self {
        Self {
            weight: params.lm_head().to_vec(),
            bias: params.bias().to_vec(),
        }
    }

    pub(crate) fn from_parts(weight: Vec<Vec<f32>>, bias: Vec<f32>) -> Self {
        Self { weight, bias }
    }
}

impl Morphism<Vector, Logits> for LinearToLogits {
    fn name(&self) -> &'static str {
        "linear_to_logits"
    }

    fn apply(&self, input: Vector) -> CtResult<Logits> {
        let d_model = input.as_slice().len();
        let vocab_size = self.bias.len();

        if self.weight.len() != d_model {
            return Err(CtError::ShapeMismatch {
                op: "linear layer",
                expected: format!("weight rows == input dim {d_model}"),
                got: format!("weight rows {}", self.weight.len()),
            });
        }

        let mut out = self.bias.clone();

        for (feature, input_value) in input.as_slice().iter().enumerate() {
            if self.weight[feature].len() != vocab_size {
                return Err(CtError::ShapeMismatch {
                    op: "linear layer",
                    expected: format!("weight cols == vocab size {vocab_size}"),
                    got: format!("weight cols {}", self.weight[feature].len()),
                });
            }

            for (vocab_id, output_value) in out.iter_mut().enumerate() {
                *output_value += input_value * self.weight[feature][vocab_id];
            }
        }

        Ok(Logits::new(out))
    }
}

/// Converts logits to a probability distribution.
#[derive(Debug, Clone)]
pub struct Softmax;

impl Morphism<Logits, Distribution> for Softmax {
    fn name(&self) -> &'static str {
        "softmax"
    }

    fn apply(&self, logits: Logits) -> CtResult<Distribution> {
        if logits.as_slice().is_empty() {
            return Err(CtError::EmptyInput("softmax"));
        }

        let max_value = logits
            .as_slice()
            .iter()
            .copied()
            .fold(f32::NEG_INFINITY, f32::max);
        let mut exps = Vec::with_capacity(logits.as_slice().len());
        let mut sum = 0.0;

        for value in logits.as_slice() {
            let exp = (*value - max_value).exp();
            exps.push(exp);
            sum += exp;
        }

        if sum <= 0.0 || !sum.is_finite() {
            return Err(CtError::InvalidProbability("softmax"));
        }

        Distribution::new(exps.into_iter().map(|value| value / sum).collect())
    }
}

/// Negative log likelihood for `(distribution, target_token)`.
#[derive(Debug, Clone)]
pub struct CrossEntropy;

impl Morphism<Product<Distribution, TokenId>, Loss> for CrossEntropy {
    fn name(&self) -> &'static str {
        "cross_entropy"
    }

    fn apply(&self, input: Product<Distribution, TokenId>) -> CtResult<Loss> {
        let (distribution, target) = input.into_parts();

        let Some(probability) = distribution.as_slice().get(target.index()).copied() else {
            return Err(CtError::OutOfRange {
                kind: "target",
                index: target.index(),
                limit: distribution.as_slice().len(),
            });
        };

        Loss::new(-probability.max(1e-9).ln())
    }
}

/// Direct path used for a commutative-diagram check.
#[derive(Debug, Clone)]
pub struct DirectPredict {
    params: Parameters,
}

impl DirectPredict {
    pub fn new(params: Parameters) -> Self {
        Self { params }
    }
}

impl Morphism<TokenId, Distribution> for DirectPredict {
    fn name(&self) -> &'static str {
        "direct_predict"
    }

    fn apply(&self, token: TokenId) -> CtResult<Distribution> {
        let embedding = Embedding::from_parameters(&self.params).apply(token)?;
        let logits = LinearToLogits::from_parameters(&self.params).apply(embedding)?;
        Softmax.apply(logits)
    }
}

/// Average cross-entropy over a training set.
pub fn average_loss(params: &Parameters, dataset: &TrainingSet) -> CtResult<Loss> {
    let embedding = Embedding::from_parameters(params);
    let linear = LinearToLogits::from_parameters(params);
    let predict = Compose::<_, _, Vector>::new(embedding, linear);
    let predict = Compose::<_, _, Logits>::new(predict, Softmax);
    let loss_fn = CrossEntropy;

    let mut total = 0.0;

    for example in dataset.examples() {
        let distribution = predict.apply(*example.first())?;
        let loss = loss_fn.apply(Product::new(distribution, *example.second()))?;
        total += loss.value();
    }

    Loss::new(total / dataset.len() as f32)
}

/// Verifies that the composed path and direct path produce the same result.
pub fn composed_prediction_matches_direct_prediction(params: &Parameters) -> CtResult<bool> {
    let token = TokenId::new(1);

    let composed = Compose::<_, _, Vector>::new(
        Embedding::from_parameters(params),
        LinearToLogits::from_parameters(params),
    );
    let composed = Compose::<_, _, Logits>::new(composed, Softmax);
    let direct = DirectPredict::new(params.clone());

    let left_path = composed.apply(token)?;
    let right_path = direct.apply(token)?;

    Ok(left_path
        .as_slice()
        .iter()
        .zip(right_path.as_slice().iter())
        .all(|(a, b)| approx_eq(*a, *b, 1e-6)))
}

#[cfg(test)]
mod tests {
    use super::*;
    use crate::domain::{ModelDimension, VocabSize};

    #[test]
    fn dataset_windowing_builds_adjacent_pairs() -> CtResult<()> {
        let tokens = TokenSequence::from_indices([1, 2, 3])?;
        let dataset = DatasetWindowing.apply(tokens)?;

        assert_eq!(dataset.len(), 2);
        assert_eq!(dataset.examples()[0].first().index(), 1);
        assert_eq!(dataset.examples()[0].second().index(), 2);
        Ok(())
    }

    #[test]
    fn composed_and_direct_prediction_match() -> CtResult<()> {
        let params = Parameters::init(VocabSize::new(5)?, ModelDimension::new(4)?);

        assert!(composed_prediction_matches_direct_prediction(&params)?);
        Ok(())
    }
}

The Whole File

src/ml.rs defines:

DatasetWindowing
Embedding
LinearToLogits
Softmax
CrossEntropy
DirectPredict
average_loss
composed_prediction_matches_direct_prediction

The chapter reads them in pipeline order.

Read each block through the same three lenses:

Rust syntax:
what struct, trait implementation, loop, or error branch does the code use?

ML concept:
which prediction, loss, or data-preparation step does the block implement?

Category theory concept:
which object, product, morphism, composition, or commutative check appears?

Worked Example: Normalizing Scores

The smallest first-principles version of “normalize scores into probabilities” does not need a model yet:

#![allow(unused)]
fn main() {
let scores = [1.0_f32, 2.0, 3.0];
let total: f32 = scores.iter().sum();
let probabilities: Vec<f32> = scores.iter().map(|score| score / total).collect();

let probability_sum: f32 = probabilities.iter().sum();
assert!((probability_sum - 1.0).abs() < 1e-6);
}

The real Softmax implementation is more careful than this toy normalization: it uses exponentials, subtracts the maximum score for numerical stability, and validates the result through Distribution::new.

Self-Check

Why is it useful for the probability-validation boundary to live in Distribution::new instead of in every caller that uses probabilities?

`DatasetWindowing`

The problem this block solves is:

A token sequence must become input-target pairs before supervised next-token training can happen.

The block:

/// Turns adjacent tokens into next-token training examples.
#[derive(Debug, Clone)]
pub struct DatasetWindowing;

impl Morphism<TokenSequence, TrainingSet> for DatasetWindowing {
    fn name(&self) -> &'static str {
        "dataset_windowing"
    }

    fn apply(&self, tokens: TokenSequence) -> CtResult<TrainingSet> {
        if tokens.as_slice().len() < 2 {
            return Err(CtError::EmptyInput(
                "dataset windowing requires at least 2 tokens",
            ));
        }

        TrainingSet::new(
            tokens
                .as_slice()
                .windows(2)
                .map(|pair| Product::new(pair[0], pair[1])),
        )
    }
}

Rust Syntax: Unit Struct

pub struct DatasetWindowing;

This is a unit struct.

It stores no fields because the operation has no configuration.

The value itself represents the transformation.

Rust Syntax: Morphism Shape

impl Morphism<TokenSequence, TrainingSet> for DatasetWindowing

This says:

DatasetWindowing : TokenSequence -> TrainingSet

So it consumes the raw sequence stage and produces the training-example stage.

Rust Syntax: Why It Requires At Least Two Tokens

if tokens.as_slice().len() < 2 {
    return Err(CtError::EmptyInput(
        "dataset windowing requires at least 2 tokens",
    ));
}

TokenSequence only guarantees at least one token.

But next-token training requires at least one adjacent pair.

One token:

[7]

produces zero pairs.

Two tokens:

[7, 8]

produce one pair:

7 -> 8

So this morphism owns the stronger validation rule.

Rust Syntax: `windows(2)`

tokens.as_slice().windows(2)

This walks adjacent pairs:

[1, 2, 3, 4]

becomes:

[1, 2]
[2, 3]
[3, 4]

Each pair becomes:

Product::new(pair[0], pair[1])

That is a TrainingExample.

ML Concept

This is the data-preparation step for next-token prediction.

Category Theory Concept

This is a morphism between two structured objects:

non-empty token list -> non-empty product list

The output examples are product objects:

TokenId x TokenId

`Embedding`

The problem this block solves is:

A discrete token ID needs to become a dense vector before the model can use linear algebra.

The core block:

#[derive(Debug, Clone)]
pub struct Embedding {
    table: Vec<Vec<f32>>,
}

impl Embedding {
    pub fn from_parameters(params: &Parameters) -> Self {
        Self {
            table: params.embedding_table().to_vec(),
        }
    }
}

impl Morphism<TokenId, Vector> for Embedding {
    fn name(&self) -> &'static str {
        "embedding"
    }

    fn apply(&self, token: TokenId) -> CtResult<Vector> {
        let Some(row) = self.table.get(token.index()) else {
            return Err(CtError::OutOfRange {
                kind: "token",
                index: token.index(),
                limit: self.table.len(),
            });
        };

        Ok(Vector::new(row.clone()))
    }
}

Rust Syntax: Stored Table

table: Vec<Vec<f32>>

The embedding table has shape:

vocab_size x model_dimension

Each row is the vector for one token.

Rust Syntax: Constructor From Parameters

pub fn from_parameters(params: &Parameters) -> Self

The embedding morphism is built from model parameters.

It copies the table out of Parameters:

params.embedding_table().to_vec()

This keeps the morphism simple and owned for the tiny tutorial.

Rust Syntax: Morphism Shape

impl Morphism<TokenId, Vector> for Embedding

This says:

Embedding : TokenId -> Vector

Rust Syntax: Bounds Check

let Some(row) = self.table.get(token.index()) else {
    return Err(CtError::OutOfRange { ... });
};

The code does not assume every TokenId is valid for every embedding table.

It checks the row lookup at the boundary where the table is used.

Rust Syntax: Why Clone The Row

Ok(Vector::new(row.clone()))

The morphism returns an owned Vector.

The row inside the table is borrowed, so the code clones it into the output object.

This is a deliberate ownership boundary.

ML Concept

An embedding converts a symbolic token into numerical features.

Category Theory Concept

It is an arrow:

TokenId -> Vector

`LinearToLogits`

The problem this block solves is:

A hidden vector must be projected into one raw score per vocabulary item.

The shape is:

Vector -> Logits

The core implementation stores:

pub struct LinearToLogits {
    weight: Vec<Vec<f32>>,
    bias: Vec<f32>,
}

The dimensions are:

weight: d_model x vocab_size
bias: vocab_size
input: d_model
output: vocab_size

Rust Syntax: Shape Validation

Inside apply, the code checks:

if self.weight.len() != d_model {
    return Err(CtError::ShapeMismatch { ... });
}

This catches a matrix whose row count does not match the input vector length.

Then each row checks:

if self.weight[feature].len() != vocab_size {
    return Err(CtError::ShapeMismatch { ... });
}

This catches rows whose column count does not match the output vocabulary size.

Rust Syntax: Linear Computation

The output begins as the bias:

let mut out = self.bias.clone();

Then each input feature contributes to every vocabulary score:

for (feature, input_value) in input.as_slice().iter().enumerate() {
    for (vocab_id, output_value) in out.iter_mut().enumerate() {
        *output_value += input_value * self.weight[feature][vocab_id];
    }
}

Mathematically:

logits = input x weight + bias

ML Concept

This is the language-model head.

It scores each possible next token.

Category Theory Concept

It is a morphism:

Vector -> Logits

It can compose after Embedding because Embedding returns Vector.

`Softmax`

The problem this block solves is:

Raw scores are not probabilities. They must be normalized into a valid distribution.

The block:

#[derive(Debug, Clone)]
pub struct Softmax;

impl Morphism<Logits, Distribution> for Softmax {
    fn name(&self) -> &'static str {
        "softmax"
    }

    fn apply(&self, logits: Logits) -> CtResult<Distribution> {
        if logits.as_slice().is_empty() {
            return Err(CtError::EmptyInput("softmax"));
        }

        let max_value = logits
            .as_slice()
            .iter()
            .copied()
            .fold(f32::NEG_INFINITY, f32::max);
        let mut exps = Vec::with_capacity(logits.as_slice().len());
        let mut sum = 0.0;

        for value in logits.as_slice() {
            let exp = (*value - max_value).exp();
            exps.push(exp);
            sum += exp;
        }

        if sum <= 0.0 || !sum.is_finite() {
            return Err(CtError::InvalidProbability("softmax"));
        }

        Distribution::new(exps.into_iter().map(|value| value / sum).collect())
    }
}

Rust Syntax: Unit Struct

Softmax stores no state.

It is the operation itself.

Rust Syntax: Morphism Shape

impl Morphism<Logits, Distribution> for Softmax

This says:

Softmax : Logits -> Distribution

Rust Syntax: Empty Check

Softmax over no scores is meaningless.

So the code rejects empty logits.

Rust Syntax: Numerical Stability

let max_value = ...
let exp = (*value - max_value).exp();

Subtracting the maximum value keeps exponentials smaller and more stable.

It does not change the final probabilities because softmax is invariant under adding or subtracting the same constant from every logit.

Rust Syntax: Normalization

Distribution::new(exps.into_iter().map(|value| value / sum).collect())

The raw exponentials are divided by their sum.

Then the Distribution constructor validates the probability invariant.

This is good boundary design: softmax computes, and Distribution::new enforces the distribution contract.

ML Concept

Softmax turns raw model scores into probabilities.

High logits become high probabilities.

Low logits become low probabilities.

The output can be interpreted as:

P(next token | current token)

Category Theory Concept

Softmax is a morphism-like transformation:

Logits -> Distribution

It changes the object from an unconstrained score vector into a probability simplex-like object.

`CrossEntropy`

The problem this block solves is:

A model prediction must be compared to the actual target token to produce a scalar loss.

The block:

#[derive(Debug, Clone)]
pub struct CrossEntropy;

impl Morphism<Product<Distribution, TokenId>, Loss> for CrossEntropy {
    fn name(&self) -> &'static str {
        "cross_entropy"
    }

    fn apply(&self, input: Product<Distribution, TokenId>) -> CtResult<Loss> {
        let (distribution, target) = input.into_parts();

        let Some(probability) = distribution.as_slice().get(target.index()).copied() else {
            return Err(CtError::OutOfRange {
                kind: "target",
                index: target.index(),
                limit: distribution.as_slice().len(),
            });
        };

        Loss::new(-probability.max(1e-9).ln())
    }
}

Rust Syntax: Input Type

Product<Distribution, TokenId>

Cross entropy needs both:

the predicted distribution
the correct target token

That pair is a product object.

Rust Syntax: Splitting The Product

let (distribution, target) = input.into_parts();

This consumes the product and extracts both values.

Rust Syntax: Target Bounds Check

distribution.as_slice().get(target.index())

The target token must be inside the probability vector.

If the distribution has 5 entries, target index 7 is invalid.

This error belongs here because this is the first place the target is used as an index into the predicted distribution.

Rust Syntax: Negative Log Likelihood

Loss::new(-probability.max(1e-9).ln())

The loss is:

-ln(probability assigned to the correct token)

The max(1e-9) avoids taking the log of zero.

Then Loss::new validates the loss scalar.

ML Concept

Cross entropy measures how surprised the model was by the true target.

If the model assigns high probability to the target, the loss is small.

If the model assigns low probability to the target, the loss is large.

Category Theory Concept

Cross entropy consumes a product object:

Distribution x TokenId

and maps it into:

Loss

So its shape is:

Product<Distribution, TokenId> -> Loss

`DirectPredict`

The problem this block solves is:

The course needs a direct implementation to compare against the composed prediction path.

DirectPredict stores parameters and implements:

TokenId -> Distribution

Internally, it still performs:

Embedding
LinearToLogits
Softmax

but it writes the steps directly.

This allows the code to test:

composed path == direct path

That is the program’s tiny commutative diagram check.

Rust Syntax

DirectPredict is a struct that owns Parameters.

Its apply method calls the prediction steps directly instead of using Compose.

ML Concept

This is the direct prediction implementation.

It exists so the composed path can be checked against a straightforward path.

Category Theory Concept

It provides the second path in a commutative diagram:

composed path
direct path

`average_loss`

The problem this function solves is:

Training needs one scalar loss over the whole training set.

The function builds the composed prediction path:

let embedding = Embedding::from_parameters(params);
let linear = LinearToLogits::from_parameters(params);
let predict = Compose::<_, _, Vector>::new(embedding, linear);
let predict = Compose::<_, _, Logits>::new(predict, Softmax);

The resulting shape is:

TokenId -> Distribution

Then each training example is evaluated:

let distribution = predict.apply(*example.first())?;
let loss = loss_fn.apply(Product::new(distribution, *example.second()))?;

Finally, the average is wrapped in Loss::new.

The function does not return a raw f32.

It returns a validated Loss.

Rust Syntax

The function takes borrowed parameters and a borrowed dataset:

pub fn average_loss(params: &Parameters, dataset: &TrainingSet) -> CtResult<Loss>

It does not consume either one.

The function loops through examples, accumulates scalar losses, and divides by the dataset length.

ML Concept

Average loss summarizes model performance over the full training set.

Category Theory Concept

It folds many example-level loss morphism results into one scalar object.

`composed_prediction_matches_direct_prediction`

The problem this function solves is:

The code should prove that the composed prediction pipeline and the direct implementation agree.

The composed path is:

TokenId -> Vector -> Logits -> Distribution

The direct path is:

TokenId -> Distribution

The function runs both on the same token and compares every probability with a small floating-point tolerance.

Category-theoretically, this is a commutative diagram test:

          composed
TokenId ------------> Distribution
   \                      ^
    \ direct              |
     ---------------------

The exact drawing is less important than the idea:

Two paths through the system should produce the same meaning.

Rust Syntax

The function builds one composed path with Compose and one direct path with DirectPredict.

It compares probabilities pairwise with approx_eq.

ML Concept

This verifies that refactoring the prediction path into smaller stages did not change the predicted probabilities.

Category Theory Concept

This is a commutative-diagram check in code.

Run The Demo

Run:

cargo run --bin category_ml

Look at sections 2 through 5 in the output.

You should see:

TokenSequence -> TrainingSet
prediction probabilities
loss for a target token

Why This Matters

This chapter is where the course stops being abstract.

The code implements a real, tiny version of the common language-model training path:

context token -> hidden vector -> next-token probabilities -> loss

The implementation is small, but the boundaries are real. Invalid token lookup returns OutOfRange, invalid matrix shape returns ShapeMismatch, empty logits return EmptyInput, invalid probabilities return InvalidProbability, and invalid loss returns InvalidLoss.

Errors are caught where the invalid data first becomes meaningful.

Core Mental Model

In Rust terms:

each ML operation implements Morphism<Input, Output>

In ML terms:

prediction is embedding + linear projection + softmax
loss is negative log probability of the target

In category-theory terms:

prediction is composition of arrows
loss consumes a product object
the direct and composed paths should commute

Checkpoint

Where should an out-of-range target token be caught?

Correct answer:

Inside CrossEntropy, because that is where the target is used to index the predicted distribution.

Where This Leaves Us

This chapter assembled the first complete tiny ML path. A token sequence becomes training examples, a token becomes a vector, a vector becomes logits, logits become probabilities, and a probability distribution plus a target token becomes loss.

The next chapter changes the question from “how do we evaluate one prediction?” to “how do repeated updates change the model state?” That is where training enters as an endomorphism.

Retrieval Practice

Recall

What are the concrete ML arrows in the prediction-and-loss path?

Explain

Why does CrossEntropy consume a product of Distribution and TokenId?

Apply

Given TokenId -> Vector -> Logits -> Distribution, write the Rust type that must appear between Embedding and Softmax.

Training as an Endomorphism

The problem this chapter solves is:

A model is not only used for prediction. It must also be updated by training, and one update should produce the same kind of object it consumed.

The key shape is:

Parameters -> Parameters

This is an endomorphism.

In ordinary ML terms:

old parameters
  -> compute predictions
  -> compute loss gradients
  -> subtract learning-rate-scaled gradients
  -> new parameters

In category-theory terms:

A -> A

Because the input and output type are the same, the step can be repeated.

Reader orientation: Do not read this chapter as a full backpropagation engine. It is a small, explicit training step whose purpose is to make the shape Parameters -> Parameters visible and runnable.

What You Already Know

If you have seen gradient descent, you already know the informal movement: parameters are adjusted and then used again. If you know Rust, you already know that a function can return the same type it receives. This chapter names that shape precisely: a training step is an endomorphism on Parameters.

Source Snapshot

This file implements one full-batch optimizer update.

Source snapshot: src/training.rs

use crate::category::Morphism;
use crate::domain::{LearningRate, Parameters, TrainingSet, Vector};
use crate::error::{CtError, CtResult};
use crate::ml::{LinearToLogits, Softmax};

/// One full-batch optimizer update.
///
/// Categorically, this is an endomorphism:
///
/// `Parameters -> Parameters`
#[derive(Debug, Clone)]
pub struct TrainStep {
    dataset: TrainingSet,
    learning_rate: LearningRate,
}

impl TrainStep {
    pub fn new(dataset: TrainingSet, learning_rate: LearningRate) -> Self {
        Self {
            dataset,
            learning_rate,
        }
    }
}

impl Morphism<Parameters, Parameters> for TrainStep {
    fn name(&self) -> &'static str {
        "train_step_endomorphism"
    }

    fn apply(&self, params: Parameters) -> CtResult<Parameters> {
        let vocab_size = params.vocab_size();
        let d_model = params.d_model();

        if vocab_size == 0 || d_model == 0 {
            return Err(CtError::EmptyInput("parameters"));
        }

        let mut grad_embedding = vec![vec![0.0; d_model]; params.embedding.len()];
        let mut grad_lm_head = vec![vec![0.0; vocab_size]; d_model];
        let mut grad_bias = vec![0.0; vocab_size];

        for example in self.dataset.examples() {
            let input_id = example.first().index();
            let target_id = example.second().index();

            if input_id >= params.embedding.len() {
                return Err(CtError::OutOfRange {
                    kind: "input token",
                    index: input_id,
                    limit: params.embedding.len(),
                });
            }

            if target_id >= vocab_size {
                return Err(CtError::OutOfRange {
                    kind: "target token",
                    index: target_id,
                    limit: vocab_size,
                });
            }

            let x = &params.embedding[input_id];
            let logits = LinearToLogits::from_parts(params.lm_head.clone(), params.bias.clone())
                .apply(Vector::new(x.clone()))?;
            let probs = Softmax.apply(logits)?;

            let mut dlogits = probs.as_slice().to_vec();
            dlogits[target_id] -= 1.0;

            for (vocab_id, dlogit) in dlogits.iter().copied().enumerate() {
                grad_bias[vocab_id] += dlogit;

                for (feature, x_feature) in x.iter().copied().enumerate() {
                    grad_lm_head[feature][vocab_id] += x_feature * dlogit;
                }
            }

            for (feature, grad_feature) in grad_embedding[input_id].iter_mut().enumerate() {
                let dx = params.lm_head[feature]
                    .iter()
                    .zip(dlogits.iter())
                    .map(|(weight, dlogit)| weight * dlogit)
                    .sum::<f32>();

                *grad_feature += dx;
            }
        }

        let batch_scale = 1.0 / self.dataset.len() as f32;
        let learning_rate = self.learning_rate.value();
        let mut updated = params.clone();

        for (row, grad_row) in updated.embedding.iter_mut().zip(grad_embedding.iter()) {
            for (value, grad) in row.iter_mut().zip(grad_row.iter()) {
                *value -= learning_rate * grad * batch_scale;
            }
        }

        for (row, grad_row) in updated.lm_head.iter_mut().zip(grad_lm_head.iter()) {
            for (value, grad) in row.iter_mut().zip(grad_row.iter()) {
                *value -= learning_rate * grad * batch_scale;
            }
        }

        for (bias, grad) in updated.bias.iter_mut().zip(grad_bias.iter()) {
            *bias -= learning_rate * grad * batch_scale;
        }

        Ok(updated)
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use crate::category::{StepCount, apply_endomorphism_n_times};
    use crate::domain::{ModelDimension, TokenSequence, VocabSize};
    use crate::ml::{DatasetWindowing, average_loss};

    #[test]
    fn repeated_training_step_reduces_loss() -> CtResult<()> {
        let tokens = TokenSequence::from_indices([1, 2, 3, 4, 1, 2, 3, 4])?;
        let dataset = DatasetWindowing.apply(tokens)?;
        let params = Parameters::init(VocabSize::new(5)?, ModelDimension::new(4)?);
        let before = average_loss(&params, &dataset)?;
        let train_step = TrainStep::new(dataset.clone(), LearningRate::new(1.0)?);
        let trained = apply_endomorphism_n_times(&train_step, params, StepCount::new(80))?;
        let after = average_loss(&trained, &dataset)?;

        assert!(after.value() < before.value());
        Ok(())
    }
}

The Whole File

src/training.rs defines:

TrainStep
TrainStep::new
impl Morphism<Parameters, Parameters> for TrainStep
unit test proving repeated training reduces loss

The whole file is about one idea:

training is a repeatable typed transformation of model state

Worked Example: Repeating One Update

The smallest first-principles version of a repeated update is a number being moved a little at a time:

#![allow(unused)]
fn main() {
fn step_toward_zero(value: f32, learning_rate: f32) -> f32 {
    value - learning_rate * value
}

let once = step_toward_zero(10.0, 0.1);
let twice = step_toward_zero(once, 0.1);

assert!(twice < once);
}

The real training code applies the same repeatable-update idea to Parameters, not to one scalar. The output stays the same kind of object as the input, so the update can be run again.

Self-Check

Before reading the full training step, explain why Parameters -> Parameters is repeatable but Parameters -> Loss is not.

`TrainStep`

The problem this block solves is:

A training update needs a dataset and a learning rate, and those values should travel together as one configured operation.

The block:

/// One full-batch optimizer update.
///
/// Categorically, this is an endomorphism:
///
/// `Parameters -> Parameters`
#[derive(Debug, Clone)]
pub struct TrainStep {
    dataset: TrainingSet,
    learning_rate: LearningRate,
}

Rust Syntax

This is a named-field struct.

It stores:

dataset: TrainingSet
learning_rate: LearningRate

Both fields are private.

That means callers cannot directly replace the dataset or learning rate after construction.

The derived traits mean:

Debug -> can be printed for debugging
Clone -> can be explicitly duplicated

TrainingSet is already non-empty.

LearningRate is already finite and positive.

So TrainStep stores validated inputs.

ML Concept

A training step needs:

examples to learn from
a step size for parameter updates

The dataset gives the input-target pairs.

The learning rate controls how far the update moves.

Category-Theory Concept

TrainStep is the value that will implement:

Parameters -> Parameters

That makes it an endomorphism on the object Parameters.

`TrainStep::new`

The problem this block solves is:

Construct a configured training step from already validated pieces.

The block:

impl TrainStep {
    pub fn new(dataset: TrainingSet, learning_rate: LearningRate) -> Self {
        Self {
            dataset,
            learning_rate,
        }
    }
}

Rust Syntax

impl TrainStep defines methods for TrainStep.

The constructor takes ownership of:

dataset
learning_rate

and stores them.

It returns Self, not CtResult<Self>, because the inputs are already validated domain objects.

No extra validation is needed here.

ML Concept

This is like configuring an optimizer step:

use this dataset
use this learning rate

The actual update happens later in apply.

Category-Theory Concept

The constructor chooses one specific endomorphism from a family:

TrainStep(dataset, learning_rate) : Parameters -> Parameters

Different datasets or learning rates create different update morphisms.

Morphism Implementation

The problem this block solves is:

Make TrainStep a real typed arrow from model parameters back to model parameters.

The header:

impl Morphism<Parameters, Parameters> for TrainStep {

Rust Syntax

This says:

TrainStep implements Morphism<Input = Parameters, Output = Parameters>

So the apply method must have this effective shape:

Parameters -> CtResult<Parameters>

The name method:

fn name(&self) -> &'static str {
    "train_step_endomorphism"
}

returns a static label for the transformation.

ML Concept

The input Parameters are the current model weights.

The output Parameters are the updated weights after one full-batch step.

Category-Theory Concept

Because the input and output object are the same, TrainStep is an endomorphism.

That is what lets this work:

Parameters0 -> Parameters1 -> Parameters2 -> ... -> ParametersN

`apply`: Shape Checks

The problem this block solves is:

Before computing gradients, verify that the parameter object has usable dimensions.

The block:

let vocab_size = params.vocab_size();
let d_model = params.d_model();

if vocab_size == 0 || d_model == 0 {
    return Err(CtError::EmptyInput("parameters"));
}

Rust Syntax

The code asks the parameter object for two dimensions.

Then it rejects zero-sized parameters.

This uses an explicit error instead of panicking.

ML Concept

Training cannot run if:

there are zero possible vocabulary outputs
hidden vectors have zero width

Those shapes would make the gradient arrays meaningless.

Category-Theory Concept

The endomorphism is only defined on valid Parameters.

Invalid parameter state is rejected before the morphism performs the update.

Gradient Buffers

The problem this block solves is:

Accumulate gradients for every trainable parameter before applying the update.

The block:

let mut grad_embedding = vec![vec![0.0; d_model]; params.embedding.len()];
let mut grad_lm_head = vec![vec![0.0; vocab_size]; d_model];
let mut grad_bias = vec![0.0; vocab_size];

Rust Syntax

These are mutable matrices and vectors initialized to zero.

Their shapes mirror the trainable parameters:

grad_embedding: same row count as embedding, d_model columns
grad_lm_head:   d_model x vocab_size
grad_bias:      vocab_size

ML Concept

Gradients accumulate how each parameter should change to reduce loss.

The code uses full-batch training: it processes every example, accumulates all gradients, averages them, then updates once.

Category-Theory Concept

The gradient buffers are not the endomorphism itself.

They are internal machinery used to construct the output object in:

Parameters -> Parameters

Example Loop

The problem this block solves is:

For each training example, compute the local contribution to the parameter gradients.

The loop begins:

for example in self.dataset.examples() {
    let input_id = example.first().index();
    let target_id = example.second().index();
    ...
}

Rust Syntax

self.dataset.examples() returns a slice of TrainingExample.

Each example is a Product<TokenId, TokenId>.

So:

example.first()

is the input token.

example.second()

is the target token.

The code extracts raw indices because matrix indexing needs usize.

ML Concept

Each example says:

given input token, predict target token

The training loop calculates how wrong the current model is for that example.

Category-Theory Concept

The example is an element of:

TokenId x TokenId

The training morphism consumes many such product values while building the parameter update.

Token Bounds Checks

The problem this block solves is:

Training examples must refer to tokens that exist in the current parameter shapes.

The checks:

if input_id >= params.embedding.len() {
    return Err(CtError::OutOfRange { ... });
}

if target_id >= vocab_size {
    return Err(CtError::OutOfRange { ... });
}

Rust Syntax

These are ordinary bounds checks with typed errors.

They prevent invalid indexing into:

the embedding table
the vocabulary-sized output vector

ML Concept

An input token must have an embedding row.

A target token must be one of the possible prediction classes.

If either token is outside the model vocabulary, training cannot continue.

Category-Theory Concept

The example must belong to the finite token object that the parameters are currently modeling.

This check keeps the training morphism inside the intended domain.

Forward Pass Inside Training

The problem this block solves is:

To compute gradients, the training step first needs the current prediction.

The block:

let x = &params.embedding[input_id];
let logits = LinearToLogits::from_parts(params.lm_head.clone(), params.bias.clone())
    .apply(Vector::new(x.clone()))?;
let probs = Softmax.apply(logits)?;

Rust Syntax

x borrows the embedding row for the input token.

LinearToLogits::from_parts(...) builds a linear projection from the current weights.

Vector::new(x.clone()) wraps the embedding row as a Vector.

Then:

Vector -> Logits -> Distribution

runs through the same morphism interface as prediction.

ML Concept

This computes the model’s current predicted distribution for one input token.

The gradient depends on the difference between that distribution and the true target.

Category-Theory Concept

Even inside training, prediction is still a composed path:

TokenId -> Vector -> Logits -> Distribution

Training uses that path as part of a larger endomorphism:

Parameters -> Parameters

Logit Gradient

The problem this block solves is:

For softmax plus cross entropy, the gradient with respect to logits is predicted probability minus one-hot target.

The block:

let mut dlogits = probs.as_slice().to_vec();
dlogits[target_id] -= 1.0;

Rust Syntax

The probabilities are copied into a mutable vector.

Then the target class is adjusted by subtracting 1.0.

If:

probs = [0.70, 0.20, 0.10]
target = 1

then:

dlogits = [0.70, -0.80, 0.10]

ML Concept

This is the standard simplified gradient for softmax cross entropy.

It says:

decrease the scores that are too high
increase the target score if it was too low

Category-Theory Concept

This is local derivative information for one part of the composed prediction path.

The next loops compose that local derivative back into parameter gradients.

Output-Head And Bias Gradients

The problem this block solves is:

Convert the logit gradient into gradients for the output matrix and bias.

The core loop:

for (vocab_id, dlogit) in dlogits.iter().copied().enumerate() {
    grad_bias[vocab_id] += dlogit;

    for (feature, x_feature) in x.iter().copied().enumerate() {
        grad_lm_head[feature][vocab_id] += x_feature * dlogit;
    }
}

Rust Syntax

The outer loop visits every vocabulary output.

The inner loop visits every feature of the input vector.

The bias gradient is just the logit gradient.

The weight gradient is:

input feature * output gradient

ML Concept

For a linear layer:

logits = xW + b

the gradient of a weight is:

input activation * output gradient

This is the same pattern used in larger neural networks.

Category-Theory Concept

This is the local backward map for the affine projection stage.

It translates changes needed at the output object Logits into changes in the parameter object.

Embedding Gradient

The problem this block solves is:

Move the output error backward through the language-model head to the input embedding row.

The block:

for (feature, grad_feature) in grad_embedding[input_id].iter_mut().enumerate() {
    let dx = params.lm_head[feature]
        .iter()
        .zip(dlogits.iter())
        .map(|(weight, dlogit)| weight * dlogit)
        .sum::<f32>();

    *grad_feature += dx;
}

Rust Syntax

The loop mutates the gradient row for the input token.

For each feature, it pairs:

weights from that feature to every vocab output
dlogits for every vocab output

Then it sums:

weight * dlogit

ML Concept

This is backpropagation through the linear head.

It tells the embedding row how it should change so the future logits improve.

Only the row for the current input token receives an embedding gradient.

Category-Theory Concept

This is another local backward map.

The training endomorphism is built by composing local derivative information from output back toward parameters.

Parameter Update

The problem this block solves is:

Turn accumulated gradients into new parameters.

The code computes:

let batch_scale = 1.0 / self.dataset.len() as f32;
let learning_rate = self.learning_rate.value();
let mut updated = params.clone();

Then it subtracts scaled gradients from every parameter.

Rust Syntax

batch_scale averages the accumulated gradients.

learning_rate extracts the raw scalar.

updated = params.clone() creates the output parameter object.

The following loops mutate updated, not the original params.

Finally:

Ok(updated)

returns the new model state.

ML Concept

The update rule is:

parameter_new = parameter_old - learning_rate * average_gradient

This is gradient descent.

Category-Theory Concept

The final result has the same object type as the input:

Parameters -> Parameters

That completes the endomorphism.

Regression Test

The problem this block solves is:

Prove the learner-visible promise that repeated training reduces loss on the tiny dataset.

The test:

#[test]
fn repeated_training_step_reduces_loss() -> CtResult<()> {
    let tokens = TokenSequence::from_indices([1, 2, 3, 4, 1, 2, 3, 4])?;
    let dataset = DatasetWindowing.apply(tokens)?;
    let params = Parameters::init(VocabSize::new(5)?, ModelDimension::new(4)?);
    let before = average_loss(&params, &dataset)?;
    let train_step = TrainStep::new(dataset.clone(), LearningRate::new(1.0)?);
    let trained = apply_endomorphism_n_times(&train_step, params, StepCount::new(80))?;
    let after = average_loss(&trained, &dataset)?;

    assert!(after.value() < before.value());
    Ok(())
}

Rust Syntax

The test returns CtResult<()>, so it can use ?.

It builds a token sequence, turns it into a training set, initializes parameters, and configures a training step.

Then it applies the endomorphism 80 times and checks the loss decreased.

ML Concept

This is not a benchmark.

It is a sanity check:

training should make the tiny model better on the tiny data

Category-Theory Concept

The test exercises repeated endomorphism application:

Parameters0 -> Parameters1 -> ... -> Parameters80

Run The Example

Source snapshot: examples/03_training_endomorphism.rs

use category_theory_transformer_rs::{
    CtResult, DatasetWindowing, LearningRate, ModelDimension, Morphism, Parameters, StepCount,
    TokenSequence, TrainStep, VocabSize, apply_endomorphism_n_times, average_loss,
};

fn main() -> CtResult<()> {
    let tokens = TokenSequence::from_indices([1, 2, 3, 4, 1, 2, 3, 4])?;
    let dataset = DatasetWindowing.apply(tokens)?;
    let params = Parameters::init(VocabSize::new(5)?, ModelDimension::new(4)?);

    let before = average_loss(&params, &dataset)?;
    let train_step = TrainStep::new(dataset.clone(), LearningRate::new(1.0)?);
    let trained = apply_endomorphism_n_times(&train_step, params, StepCount::new(80))?;
    let after = average_loss(&trained, &dataset)?;

    println!("loss before: {:.6}", before.value());
    println!("loss after:  {:.6}", after.value());

    Ok(())
}

Run:

cargo run --example 03_training_endomorphism

Expected pattern:

loss before: ...
loss after:  ...

The second number should be smaller.

Core Mental Model

In Rust terms:

TrainStep implements Morphism<Parameters, Parameters>

In ML terms:

one full-batch gradient descent update

In category-theory terms:

an endomorphism that can be iterated

Checkpoint

Why is it useful that training returns Parameters instead of a raw matrix?

A strong answer:

Because the output can immediately be used as the input to the next TrainStep, preserving the Parameters -> Parameters endomorphism shape.

Where This Leaves Us

This chapter turned training into a repeatable typed transformation. The model state enters as Parameters, the training step computes gradients from the tiny dataset, and the updated model state leaves as Parameters again.

The next chapter steps back from the training loop and names reusable structures that appear across the whole course: mapping inside wrappers, changing wrapper shapes consistently, combining traces, and composing local derivative rules.

Retrieval Practice

Recall

What makes TrainStep an endomorphism?

Explain

Why does the training code validate token bounds before accumulating gradients?

Apply

If you changed StepCount::new(80) to StepCount::new(1), what would you expect to happen to the loss, and why?

Functors, Naturality, Monoids, and Chain Rule

The problem this chapter solves is:

After seeing individual ML arrows, you need names for common structures that appear across many systems: mapping inside containers, converting containers consistently, combining traces, and composing local derivatives.

The previous chapters were mostly about one pipeline. This chapter zooms out from that pipeline and asks which shapes keep appearing even when the concrete data changes. Once you can see those repeated shapes, the category-theory vocabulary stops feeling like a separate subject. It becomes a set of names for ordinary engineering moves.

This chapter covers four patterns:

Functor
NaturalTransformation
Monoid
Chain rule

They are not separate from the ML pipeline.

They explain patterns you already saw:

mapping over many examples
converting one wrapper shape to another
combining pipeline traces
composing gradients through layers

Reader orientation: This chapter is more abstract than the previous ones. Read each section in this order: first the Rust mechanism, then the ML use, then the category-theory name. The names are not decoration; they are compression for patterns that appear repeatedly in real model code.

What You Already Know

If you have mapped over a Vec, handled an Option, appended logs, or applied the chain rule in calculus, you already know informal versions of this chapter. The new work is to name those repeated shapes and connect them to the same typed pipeline discipline used earlier.

Source Snapshots

src/structure.rs covers functors, natural transformations, and monoids.

Source snapshot: src/structure.rs

/// A minimal functor interface for this tutorial.
pub trait Functor<A, B> {
    type WrappedA;
    type WrappedB;

    fn fmap<F>(wrapped: Self::WrappedA, f: F) -> Self::WrappedB
    where
        F: Fn(A) -> B;
}

pub struct VecFunctor;

impl<A, B> Functor<A, B> for VecFunctor {
    type WrappedA = Vec<A>;
    type WrappedB = Vec<B>;

    fn fmap<F>(wrapped: Vec<A>, f: F) -> Vec<B>
    where
        F: Fn(A) -> B,
    {
        wrapped.into_iter().map(f).collect()
    }
}

pub struct OptionFunctor;

impl<A, B> Functor<A, B> for OptionFunctor {
    type WrappedA = Option<A>;
    type WrappedB = Option<B>;

    fn fmap<F>(wrapped: Option<A>, f: F) -> Option<B>
    where
        F: Fn(A) -> B,
    {
        wrapped.map(f)
    }
}

/// A structure-preserving conversion between wrappers.
pub trait NaturalTransformation<A> {
    type From;
    type To;

    fn transform(from: Self::From) -> Self::To;
}

/// Natural transformation from `Vec<A>` to `Option<A>` by taking the first item.
pub struct VecToFirstOption;

impl<A> NaturalTransformation<A> for VecToFirstOption {
    type From = Vec<A>;
    type To = Option<A>;

    fn transform(from: Vec<A>) -> Option<A> {
        from.into_iter().next()
    }
}

pub fn naturality_square_holds_for_first_option() -> bool {
    let xs = vec![1, 2, 3];
    let f = |x| x * 10;

    let path_top_then_right = VecToFirstOption::transform(VecFunctor::fmap(xs.clone(), f));
    let path_left_then_bottom = OptionFunctor::fmap(VecToFirstOption::transform(xs), f);

    path_top_then_right == path_left_then_bottom
}

pub trait Monoid: Sized {
    fn empty() -> Self;
    fn combine(&self, other: &Self) -> Self;
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TraceStep(&'static str);

impl TraceStep {
    pub fn new(name: &'static str) -> Self {
        Self(name)
    }

    pub fn name(&self) -> &'static str {
        self.0
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PipelineTrace(Vec<TraceStep>);

impl PipelineTrace {
    pub fn from_steps(steps: impl IntoIterator<Item = TraceStep>) -> Self {
        Self(steps.into_iter().collect())
    }

    pub fn names(&self) -> Vec<&'static str> {
        self.0.iter().map(TraceStep::name).collect()
    }
}

impl Monoid for PipelineTrace {
    fn empty() -> Self {
        PipelineTrace(vec![])
    }

    fn combine(&self, other: &Self) -> Self {
        let mut combined = self.0.clone();
        combined.extend_from_slice(&other.0);
        PipelineTrace(combined)
    }
}

pub fn monoid_laws_hold_for_pipeline_trace() -> bool {
    let a = PipelineTrace::from_steps(vec![TraceStep::new("embedding")]);
    let b = PipelineTrace::from_steps(vec![TraceStep::new("linear")]);
    let c = PipelineTrace::from_steps(vec![TraceStep::new("softmax")]);
    let identity = PipelineTrace::empty();

    let left_identity = identity.combine(&a) == a;
    let right_identity = a.combine(&identity) == a;
    let associativity = a.combine(&b).combine(&c) == a.combine(&b.combine(&c));

    left_identity && right_identity && associativity
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn naturality_square_commutes() {
        assert!(naturality_square_holds_for_first_option());
    }

    #[test]
    fn pipeline_trace_obeys_monoid_laws() {
        assert!(monoid_laws_hold_for_pipeline_trace());
    }
}

src/calculus.rs keeps the chain-rule example deliberately small.

Source snapshot: src/calculus.rs

use crate::error::{CtError, CtResult};

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Scalar(f32);

impl Scalar {
    pub fn new(value: f32) -> CtResult<Self> {
        if !value.is_finite() {
            return Err(CtError::InvalidLoss(value));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> f32 {
        self.0
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct LocalGradient(f32);

impl LocalGradient {
    pub fn new(value: f32) -> CtResult<Self> {
        if !value.is_finite() {
            return Err(CtError::InvalidLoss(value));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> f32 {
        self.0
    }
}

#[derive(Debug, Clone, Copy)]
pub struct MulOp;

impl MulOp {
    pub fn forward(&self, x: Scalar, y: Scalar) -> CtResult<Scalar> {
        Scalar::new(x.value() * y.value())
    }

    /// Given upstream gradient dL/dz, return `(dL/dx, dL/dy)` for `z = x * y`.
    pub fn backward(
        &self,
        x: Scalar,
        y: Scalar,
        upstream: LocalGradient,
    ) -> CtResult<(LocalGradient, LocalGradient)> {
        let dz_dx = y.value();
        let dz_dy = x.value();

        Ok((
            LocalGradient::new(upstream.value() * dz_dx)?,
            LocalGradient::new(upstream.value() * dz_dy)?,
        ))
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn multiply_backward_returns_local_chain_rule_gradients() -> CtResult<()> {
        let mul = MulOp;
        let x = Scalar::new(2.0)?;
        let y = Scalar::new(3.0)?;
        let upstream = LocalGradient::new(1.0)?;
        let (dl_dx, dl_dy) = mul.backward(x, y, upstream)?;

        assert_eq!(dl_dx.value(), 3.0);
        assert_eq!(dl_dy.value(), 2.0);
        Ok(())
    }
}

The Whole Structure File

src/structure.rs defines:

Functor<A, B>
VecFunctor
OptionFunctor
NaturalTransformation<A>
VecToFirstOption
Monoid
TraceStep
PipelineTrace

Each block gives a Rust handle to one abstract pattern.

Worked Example: Mapping Over A Vector

Before reading the traits, start with the plain Rust operation that motivates them. Mapping over a vector means taking each item out, applying a function, and collecting the new values into another vector:

#![allow(unused)]
fn main() {
let values = vec![1, 2, 3];
let doubled: Vec<i32> = values.into_iter().map(|x| x * 2).collect();

assert_eq!(doubled, vec![2, 4, 6]);
}

There is no category theory hidden in that snippet. It is just ordinary Rust. The category-theory word functor appears when we notice the reusable shape: the code changes the contents while preserving the surrounding container.

Self-Check

Before reading the Functor trait, explain what stayed the same and what changed in the Vec mapping example.

`Functor<A, B>`

The problem this block solves is:

The code needs a name for “apply a function inside a wrapper while keeping the wrapper shape.”

First principle: a trait is a contract. It says, “any type that implements this trait must provide these associated types and this method.” Here the contract is small on purpose. It does not try to model every possible functor in mathematics; it gives this tutorial one precise place to talk about fmap.

The block:

/// A minimal functor interface for this tutorial.
pub trait Functor<A, B> {
    type WrappedA;
    type WrappedB;

    fn fmap<F>(wrapped: Self::WrappedA, f: F) -> Self::WrappedB
    where
        F: Fn(A) -> B;
}

Rust Syntax

Functor<A, B> is a trait. A trait is Rust’s way to name behavior that many types can implement.

Here, the behavior is:

map a function through some wrapper shape

A and B are generic type parameters:

A = input item type
B = output item type

Generic means the trait is not tied to one concrete type like i32 or String. The same trait can describe Vec<i32> -> Vec<String>, Option<TokenId> -> Option<Embedding>, or any other pair of item types.

It has associated types:

type WrappedA;
type WrappedB;

Associated types are type names chosen by each implementation of the trait. They let the trait say:

every implementer must tell us what wrapped input and wrapped output mean

For VecFunctor, those associated types become Vec<A> and Vec<B>.

For OptionFunctor, they become Option<A> and Option<B>.

The method:

fn fmap<F>(wrapped: Self::WrappedA, f: F) -> Self::WrappedB
where
    F: Fn(A) -> B;

means:

Given a wrapped A and a function A -> B, produce a wrapped B.

The where clause is a readable place to put a bound. The bound:

F: Fn(A) -> B

means F must be callable like a function that consumes an A and returns a B.

Here is the real crate API in the smallest useful form:

use category_theory_transformer_rs::{Functor, VecFunctor};

let lengths = VecFunctor::fmap(vec!["cat", "rust"], |word| word.len());

assert_eq!(lengths, vec![3, 4]);

What to notice: The call names the structure once: VecFunctor::fmap. The closure only describes the item-level operation: &str -> usize. The vector shape is handled by the functor implementation.

ML Concept

In ML, you often apply the same transformation across a structure:

map preprocessing over a batch
map token conversion over a sequence
map loss computation over examples

The wrapper might be:

Vec
Option
Result
batch tensor

The idea is:

transform the contents, preserve the surrounding structure

Category-Theory Concept

A functor maps:

objects -> objects
morphisms -> morphisms

while preserving identity and composition.

This tutorial’s trait is deliberately small. It focuses on the practical fmap operation.

`VecFunctor`

The problem this block solves is:

Demonstrate fmap for lists.

The block:

pub struct VecFunctor;

impl<A, B> Functor<A, B> for VecFunctor {
    type WrappedA = Vec<A>;
    type WrappedB = Vec<B>;

    fn fmap<F>(wrapped: Vec<A>, f: F) -> Vec<B>
    where
        F: Fn(A) -> B,
    {
        wrapped.into_iter().map(f).collect()
    }
}

Rust Syntax

VecFunctor is a unit struct. It stores no state.

The impl block is a trait implementation:

impl<A, B> Functor<A, B> for VecFunctor

Read it as:

for every A and B, VecFunctor knows how to behave as Functor<A, B>

The implementation chooses the associated types:

WrappedA = Vec<A>
WrappedB = Vec<B>

The method consumes the vector:

wrapped.into_iter()

maps the function over every item:

.map(f)

and collects the result:

.collect()

The runnable companion example uses the same real crate API:

use category_theory_transformer_rs::{Functor, VecFunctor};

let token_ids = vec![1, 2, 3];
let shifted = VecFunctor::fmap(token_ids, |id| id + 100);

assert_eq!(shifted, vec![101, 102, 103]);

ML Concept

If you have a batch of token IDs:

[TokenId(1), TokenId(2), TokenId(3)]

and a function:

TokenId -> Vector

mapping over the batch gives:

[Vector, Vector, Vector]

That is the same shape as VecFunctor.

Category-Theory Concept

Vec is list-like structure.

Mapping preserves the list shape:

List A -> List B

The length and order remain structurally meaningful.

`OptionFunctor`

The problem this block solves is:

Demonstrate the same functor idea for optional values.

The block:

pub struct OptionFunctor;

impl<A, B> Functor<A, B> for OptionFunctor {
    type WrappedA = Option<A>;
    type WrappedB = Option<B>;

    fn fmap<F>(wrapped: Option<A>, f: F) -> Option<B>
    where
        F: Fn(A) -> B,
    {
        wrapped.map(f)
    }
}

Rust Syntax

The wrapper types are:

Option<A>
Option<B>

The implementation delegates to Rust’s built-in:

wrapped.map(f)

If the value is Some(a), it becomes Some(f(a)).

If the value is None, it stays None.

The important beginner point is that Option makes absence explicit in the type. You cannot accidentally treat a missing value as a real value without handling the None case.

use category_theory_transformer_rs::{Functor, OptionFunctor};

let present = OptionFunctor::fmap(Some(7), |value| value * 2);
let missing = OptionFunctor::fmap(None::<i32>, |value| value * 2);

assert_eq!(present, Some(14));
assert_eq!(missing, None);

ML Concept

Optional values appear when data may be absent:

maybe first token
maybe cached embedding
maybe resolved department

Mapping over Option lets you transform present values without inventing a fake value for missing ones.

Category-Theory Concept

Option is a context representing possible absence.

fmap lifts:

A -> B

to:

Option<A> -> Option<B>

Conceptual Extension: `Distribution<T>::map`

The problem this block solves is:

A probabilistic value may contain many possible outcomes. Sometimes you want to transform every possible outcome while keeping its probability attached.

The current crate’s Distribution in src/domain.rs is a concrete validated probability vector:

pub struct Distribution(Vec<f32>);

The block below is a conceptual generic version that explains the functor idea for probabilistic outcomes:

#![allow(unused)]
fn main() {
pub struct Probability(f32);

pub struct Distribution<T> {
    outcomes: Vec<(T, Probability)>,
}

impl<T> Distribution<T> {
    pub fn map<U>(
        self,
        f: impl Fn(T) -> U,
    ) -> Distribution<U> {
        let outcomes = self
            .outcomes
            .into_iter()
            .map(|(value, probability)| {
                (f(value), probability)
            })
            .collect();

        Distribution { outcomes }
    }
}
}

The core idea is:

map changes the values inside the distribution,
but keeps the probabilities attached to them.

Rust Syntax

Start with the generic struct:

#![allow(unused)]
fn main() {
pub struct Probability(f32);

pub struct Distribution<T> {
    outcomes: Vec<(T, Probability)>,
}
}

This means:

Distribution<T> = many possible T values, each paired with a probability

If T is TokenId, then the type is:

Distribution<TokenId>

If T is String, then the type is:

Distribution<String>

The method introduces a second generic type:

pub fn map<U>(...)

T is the old outcome type.

U is the new outcome type.

So the method has this shape:

Distribution<T> -> Distribution<U>

The first parameter is:

self

That means the method consumes the old distribution.

After calling:

let text_dist = token_dist.map(decode);

the old token_dist has been moved and cannot be used again.

That is why the implementation can call:

self.outcomes.into_iter()

into_iter() consumes the vector and yields owned pairs:

(T, Probability)

The function parameter is:

f: impl Fn(T) -> U

This means:

give this method a function or closure that takes T and returns U

For example, a decoder might have this shape:

TokenId -> String

Then:

Distribution<TokenId> -> Distribution<String>

The inner mapping line is:

.map(|(value, probability)| {
    (f(value), probability)
})

For every pair:

(value, probability)

the code applies f to the value and leaves the probability unchanged.

So:

(TokenId(2), Probability(0.70))

can become:

("Rust", Probability(0.70))

Finally:

.collect()

collects the transformed pairs back into a vector, and:

Distribution { outcomes }

wraps them in the new distribution.

ML Concept

Imagine a model returns possible next tokens:

TokenId(2) -> 0.70
TokenId(4) -> 0.20
TokenId(3) -> 0.10

Those token IDs are useful to the model, but a learner or UI might need text:

TokenId(2) -> "Rust"
TokenId(4) -> "Python"
TokenId(3) -> "."

map changes the representation:

Distribution<TokenId> -> Distribution<String>

The values change:

TokenId(2) becomes "Rust"
TokenId(4) becomes "Python"
TokenId(3) becomes "."

The probabilities do not change:

0.70 stays 0.70
0.20 stays 0.20
0.10 stays 0.10

So map is for changing the meaning or representation of each possible outcome, not for changing the probability mass.

Category Theory Concept

This is the functor pattern for probabilistic context.

Given a normal deterministic function:

f : T -> U

map lifts it into the distribution context:

Distribution<T> -> Distribution<U>

In functional-programming notation:

fmap : (T -> U) -> Distribution<T> -> Distribution<U>

The outer structure is preserved:

same number of possible outcomes
same probabilities
same probabilistic context

Only the inner values are transformed.

That is the same pattern as:

Option<T> -> Option<U>
Vec<T> -> Vec<U>
Distribution<T> -> Distribution<U>

Different context, same functor idea.

Concrete Example

Here is a complete conceptual example:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Copy)]
pub struct TokenId(pub usize);

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Probability(pub f32);

#[derive(Debug, Clone)]
pub struct Distribution<T> {
    outcomes: Vec<(T, Probability)>,
}

impl<T> Distribution<T> {
    pub fn new(outcomes: Vec<(T, Probability)>) -> Self {
        Self { outcomes }
    }

    pub fn map<U, F>(self, mut f: F) -> Distribution<U>
    where
        F: FnMut(T) -> U,
    {
        let outcomes = self
            .outcomes
            .into_iter()
            .map(|(value, probability)| {
                (f(value), probability)
            })
            .collect();

        Distribution { outcomes }
    }
}

let vocab = ["I", "love", "Rust", "."];

let token_dist = Distribution::new(vec![
    (TokenId(2), Probability(0.70)),
    (TokenId(3), Probability(0.30)),
]);

let text_dist = token_dist.map(|token| {
    vocab[token.0].to_string()
});

assert_eq!(
    text_dist.outcomes,
    vec![
        ("Rust".to_string(), Probability(0.70)),
        (".".to_string(), Probability(0.30)),
    ],
);
}

Conceptually, the result is:

"Rust" -> 0.70
"."    -> 0.30

Why `Fn(T) -> U`

The signature:

f: impl Fn(T) -> U

accepts functions and closures.

It is more flexible than:

fn(T) -> U

because closures can capture values from the surrounding scope:

let vocab = ["I", "love", "Rust", "."];

let text_dist = token_dist.map(|token| {
    vocab[token.0].to_string()
});

The closure uses vocab from outside the closure body.

Why A Library Might Use `FnMut`

The pedagogical signature:

f: impl Fn(T) -> U

is easy to read.

A more flexible library signature is often:

pub fn map<U, F>(self, mut f: F) -> Distribution<U>
where
    F: FnMut(T) -> U,

FnMut allows the closure to mutate captured state.

For example:

let mut counter = 0;

let numbered = token_dist.map(|token| {
    counter += 1;
    (counter, token)
});

The key ownership rule is unchanged:

the method consumes the old distribution and moves each T into f

`map` Versus `flat_map`

Use map when each possible value becomes one transformed value:

T -> U

So:

Distribution<T> -> Distribution<U>

Example:

TokenId -> String

Use flat_map when each possible value produces another distribution:

T -> Distribution<U>

Without flattening, the result would be:

Distribution<Distribution<U>>

The simple distinction is:

map:
  one possible value becomes one transformed value

flat_map:
  one possible value becomes many possible future values

In language modeling:

map decodes possible tokens into text
flat_map chains uncertainty across another prediction step

Algebra Version

If:

D<T> = [(t1, p1), (t2, p2), ..., (tn, pn)]

and:

f : T -> U

then:

map(f, D<T>) = [(f(t1), p1), (f(t2), p2), ..., (f(tn), pn)]

The probabilities are untouched.

Only the values move through f.

Core Mental Model

In Rust terms:

consume self, move each T into f, preserve each Probability, collect into
Distribution<U>

In ML terms:

decode or transform every possible outcome without changing model confidence

In category-theory terms:

lift a deterministic morphism T -> U into the probabilistic context
Distribution<T> -> Distribution<U>

`NaturalTransformation<A>`

The problem this block solves is:

Sometimes you need to convert one wrapper shape into another without caring about the specific item type.

The beginner trap is to think a natural transformation is just any conversion. It is more disciplined than that. The conversion must be compatible with mapping. If you transform the wrapper first and then map the item, you should get the same result as mapping first and then transforming the wrapper.

The block:

/// A structure-preserving conversion between wrappers.
pub trait NaturalTransformation<A> {
    type From;
    type To;

    fn transform(from: Self::From) -> Self::To;
}

Rust Syntax

The associated types are:

From
To

The method:

fn transform(from: Self::From) -> Self::To;

converts the wrapper.

The type parameter A represents the item type inside the wrapper.

This trait has the same shape as Functor: the abstract contract is public, but each implementation chooses the concrete wrapper types through associated types.

That is why the implementation can be generic over A without knowing whether A is a token, a sentence, a loss value, or a trace step.

ML Concept

Data pipelines often convert shapes:

list of candidates -> optional selected candidate
batch -> first example
many diagnostics -> maybe first failure

The conversion should not depend on whether the item is a token, vector, or loss.

Category-Theory Concept

A natural transformation converts one functor into another in a way that is uniform over the inner type.

The important word is uniform.

It should not inspect special details of A.

`VecToFirstOption`

The problem this block solves is:

Convert a list into an optional first item in a type-uniform way.

The block:

/// Natural transformation from `Vec<A>` to `Option<A>` by taking the first item.
pub struct VecToFirstOption;

impl<A> NaturalTransformation<A> for VecToFirstOption {
    type From = Vec<A>;
    type To = Option<A>;

    fn transform(from: Vec<A>) -> Option<A> {
        from.into_iter().next()
    }
}

Rust Syntax

The implementation works for every A.

That is the role of:

impl<A> NaturalTransformation<A> for VecToFirstOption

The implementation does not ask for any bound on A. There is no A: Clone, no A: Debug, and no A: PartialEq. It does not need those capabilities because it never looks inside the item. It only changes the outer shape.

It consumes a vector:

from.into_iter()

then takes the first item:

.next()

If the vector is empty, the result is None.

If it has at least one item, the result is Some(first_item).

use category_theory_transformer_rs::{NaturalTransformation, VecToFirstOption};

let first = VecToFirstOption::transform(vec!["embed", "linear", "softmax"]);
let empty = VecToFirstOption::transform(Vec::<&str>::new());

assert_eq!(first, Some("embed"));
assert_eq!(empty, None);

ML Concept

This is like selecting the first candidate from a batch while preserving the possibility that the batch was empty.

It does not care what the candidate type is.

Category-Theory Concept

This is the example transformation:

Vec<A> -> Option<A>

It is natural because it is uniform over A.

Naturality Check

The problem this block solves is:

Show that mapping first, then converting gives the same result as converting first, then mapping.

The function:

pub fn naturality_square_holds_for_first_option() -> bool {
    let xs = vec![1, 2, 3];
    let f = |x| x * 10;

    let path_top_then_right = VecToFirstOption::transform(VecFunctor::fmap(xs.clone(), f));
    let path_left_then_bottom = OptionFunctor::fmap(VecToFirstOption::transform(xs), f);

    path_top_then_right == path_left_then_bottom
}

Rust Syntax

There are two paths.

Path one:

Vec<i32> -> Vec<i32> -> Option<i32>

Path two:

Vec<i32> -> Option<i32> -> Option<i32>

Both should produce the same value.

The crate exposes this as a small law check:

use category_theory_transformer_rs::naturality_square_holds_for_first_option;

assert!(naturality_square_holds_for_first_option());

ML Concept

This is a consistency check for pipeline shape conversions.

It says:

transform values, then select

matches:

select, then transform the selected value

for this operation.

Category-Theory Concept

This is the naturality square:

Vec<A> ----fmap f----> Vec<B>
  |                     |
  | transform           | transform
  v                     v
Option<A> --fmap f--> Option<B>

The square commutes when both paths agree.

`Monoid`

The problem this block solves is:

Some values can be combined repeatedly, and there should be an empty value that changes nothing.

The first-principles version is string concatenation. The empty string changes nothing, and grouping does not change the final text:

#![allow(unused)]
fn main() {
let empty = String::new();
let a = String::from("embed");
let b = String::from(" -> linear");
let c = String::from(" -> softmax");

assert_eq!(format!("{empty}{a}"), "embed");
assert_eq!(format!("{}{}", format!("{a}{b}"), c), format!("{a}{}", format!("{b}{c}")));
}

PipelineTrace uses the same idea, but with named pipeline steps instead of raw text.

The trait:

pub trait Monoid: Sized {
    fn empty() -> Self;
    fn combine(&self, other: &Self) -> Self;
}

Rust Syntax

Sized means values of this type have a known size at compile time. In this trait, it keeps the return type Self straightforward:

fn empty() -> Self;

Self means “the type implementing this trait.”

The trait requires:

empty() -> Self
combine(&self, other: &Self) -> Self

So a monoid can produce an identity value and combine two values into one.

The combine method borrows both traces:

fn combine(&self, other: &Self) -> Self;

&self and &Self are references. They let the method read the two existing values without taking ownership of them. The method returns a new combined value.

ML Concept

Common monoid-like values in ML systems include logs, traces, metrics, batches, and accumulated gradients.

You often need to combine many small values into one larger value.

Category-Theory Concept

A monoid has:

identity element
associative binary operation

The laws are:

empty combine a = a
a combine empty = a
(a combine b) combine c = a combine (b combine c)

`TraceStep` and `PipelineTrace`

The problem this block solves is:

Pipeline execution steps should be combinable into a larger trace.

The key types:

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TraceStep(&'static str);

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PipelineTrace(Vec<TraceStep>);

Rust Syntax

TraceStep wraps one static string.

PipelineTrace wraps a vector of steps.

PipelineTrace::from_steps collects any iterable of steps.

names() returns the raw names for display:

self.0.iter().map(TraceStep::name).collect()

These wrappers matter because two values may both be strings but mean different things. A TraceStep is not a model name, a token, or a user-facing sentence. The type keeps that meaning attached to the value.

ML Concept

A trace can record:

embedding
linear
softmax
cross_entropy

This is useful for understanding which stages ran.

Category-Theory Concept

A pipeline trace is a sequence-like monoid.

The empty trace is the identity.

Combining traces is concatenation.

`PipelineTrace` as Monoid

The problem this block solves is:

Make traces obey the monoid interface.

The implementation:

impl Monoid for PipelineTrace {
    fn empty() -> Self {
        PipelineTrace(vec![])
    }

    fn combine(&self, other: &Self) -> Self {
        let mut combined = self.0.clone();
        combined.extend_from_slice(&other.0);
        PipelineTrace(combined)
    }
}

Rust Syntax

empty returns an empty vector.

combine clones the first trace, appends the second trace, and wraps the result again as PipelineTrace.

The clone is intentional here: combine borrows both inputs, so it cannot move steps out of either trace. Cloning the small TraceStep values lets the method produce a fresh trace while leaving the inputs usable.

use category_theory_transformer_rs::{Monoid, PipelineTrace, TraceStep};

let encoder = PipelineTrace::from_steps([TraceStep::new("embedding")]);
let head = PipelineTrace::from_steps([TraceStep::new("softmax")]);
let trace = encoder.combine(&head);

assert_eq!(trace.names(), vec!["embedding", "softmax"]);

ML Concept

This is how many execution logs work:

trace_a + trace_b = longer trace

Category-Theory Concept

This is the list monoid specialized to trace steps.

Monoid Law Check

The problem this block solves is:

Verify the identity and associativity laws for the trace type.

The function:

pub fn monoid_laws_hold_for_pipeline_trace() -> bool {
    let a = PipelineTrace::from_steps(vec![TraceStep::new("embedding")]);
    let b = PipelineTrace::from_steps(vec![TraceStep::new("linear")]);
    let c = PipelineTrace::from_steps(vec![TraceStep::new("softmax")]);
    let identity = PipelineTrace::empty();

    let left_identity = identity.combine(&a) == a;
    let right_identity = a.combine(&identity) == a;
    let associativity = a.combine(&b).combine(&c) == a.combine(&b.combine(&c));

    left_identity && right_identity && associativity
}

Rust Syntax

The function constructs three traces and checks three booleans.

It returns true only if all monoid laws hold for those examples.

use category_theory_transformer_rs::monoid_laws_hold_for_pipeline_trace;

assert!(monoid_laws_hold_for_pipeline_trace());

ML Concept

Grouping trace combination should not change the final trace.

This matters when systems combine logs from nested pipelines.

Category-Theory Concept

This directly checks the monoid laws:

identity
associativity

The Calculus File

The problem src/calculus.rs solves is:

Backpropagation is easier to understand if you first see one local derivative rule in isolation.

What to notice: Backpropagation does not require every operation to know the whole model. Each operation only needs to know how its own output changes when its own inputs change. Composition does the rest.

The file defines:

Scalar
LocalGradient
MulOp

`Scalar`

The block:

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Scalar(f32);

Rust Syntax

Scalar wraps one f32.

Scalar::new rejects non-finite values.

value() returns the raw float.

This repeats a pattern from the earlier domain-object chapter:

private field
validating constructor
accessor

The raw f32 is private, so callers cannot construct Scalar(f32::NAN) directly. They must use Scalar::new, which returns a Result.

use category_theory_transformer_rs::Scalar;

let scalar = Scalar::new(2.5)?;

assert_eq!(scalar.value(), 2.5);

Ok::<(), category_theory_transformer_rs::CtError>(())

ML Concept

This is a single numeric value in a computation graph.

Examples:

activation
loss component
weight

Category-Theory Concept

It is the simple numeric object used in the local derivative example.

`LocalGradient`

The block:

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct LocalGradient(f32);

Rust Syntax

This is another f32 wrapper.

It has the same finite-value validation as Scalar.

The semantic difference is important:

Scalar = forward value
LocalGradient = derivative signal

This is the same newtype move used throughout the crate. Both wrappers store an f32, but the types prevent accidental mixing at function boundaries.

ML Concept

A gradient tells how a loss changes when an intermediate value changes.

For example:

dL/dz

Category-Theory Concept

This is information flowing backward through a composed computation.

`MulOp`

The problem this block solves is:

Show a forward operation and its local backward rule.

The important methods:

pub fn forward(&self, x: Scalar, y: Scalar) -> CtResult<Scalar> {
    Scalar::new(x.value() * y.value())
}

pub fn backward(
    &self,
    x: Scalar,
    y: Scalar,
    upstream: LocalGradient,
) -> CtResult<(LocalGradient, LocalGradient)> {
    let dz_dx = y.value();
    let dz_dy = x.value();

    Ok((
        LocalGradient::new(upstream.value() * dz_dx)?,
        LocalGradient::new(upstream.value() * dz_dy)?,
    ))
}

Rust Syntax

forward multiplies two scalars and validates the result.

backward takes:

x
y
upstream gradient dL/dz

and returns:

(dL/dx, dL/dy)

The return type is a Rust tuple.

The ? operator appears twice in backward:

LocalGradient::new(upstream.value() * dz_dx)?

It means:

if construction succeeded, keep the value
if construction failed, return the error from backward immediately

That keeps invalid numeric states at the boundary where they are created.

use category_theory_transformer_rs::{LocalGradient, MulOp, Scalar};

let mul = MulOp;
let x = Scalar::new(2.0)?;
let y = Scalar::new(3.0)?;
let z = mul.forward(x, y)?;
let (dl_dx, dl_dy) = mul.backward(x, y, LocalGradient::new(1.0)?)?;

assert_eq!(z.value(), 6.0);
assert_eq!(dl_dx.value(), 3.0);
assert_eq!(dl_dy.value(), 2.0);

Ok::<(), category_theory_transformer_rs::CtError>(())

ML Concept

For:

z = x * y

the local derivatives are:

dz/dx = y
dz/dy = x

By the chain rule:

dL/dx = dL/dz * dz/dx = dL/dz * y
dL/dy = dL/dz * dz/dy = dL/dz * x

Category-Theory Concept

The chain rule is composition of local derivative maps.

A big neural network is many small maps composed forward, then many local gradient rules composed backward.

Run The Example

Source snapshot: examples/04_structure_and_calculus.rs

use category_theory_transformer_rs::{
    CtResult, Functor, LocalGradient, Monoid, MulOp, OptionFunctor, PipelineTrace, Scalar,
    TraceStep, VecFunctor, monoid_laws_hold_for_pipeline_trace,
    naturality_square_holds_for_first_option,
};

fn main() -> CtResult<()> {
    let squared = VecFunctor::fmap(vec![1, 2, 3], |x| x * x);
    let shifted = OptionFunctor::fmap(Some(7), |x| x + 1);

    println!("Vec fmap square: {squared:?}");
    println!("Option fmap +1: {shifted:?}");
    println!(
        "naturality square holds: {}",
        naturality_square_holds_for_first_option()
    );

    let trace = PipelineTrace::from_steps(vec![TraceStep::new("embedding")])
        .combine(&PipelineTrace::from_steps(vec![TraceStep::new("linear")]))
        .combine(&PipelineTrace::from_steps(vec![TraceStep::new("softmax")]));

    println!("trace: {:?}", trace.names());
    println!(
        "monoid laws hold: {}",
        monoid_laws_hold_for_pipeline_trace()
    );

    let mul = MulOp;
    let x = Scalar::new(2.0)?;
    let y = Scalar::new(3.0)?;
    let upstream = LocalGradient::new(1.0)?;
    let (dl_dx, dl_dy) = mul.backward(x, y, upstream)?;

    println!("dL/dx: {}", dl_dx.value());
    println!("dL/dy: {}", dl_dy.value());

    Ok(())
}

Run:

cargo run --example 04_structure_and_calculus

You should see mapping over Vec, mapping over Option, a naturality check, a combined trace, a monoid law check, and local gradients for multiplication.

Core Mental Model

In Rust terms:

traits name reusable operation shapes
unit structs demonstrate stateless operations
tests check laws

In ML terms:

map over structures, combine traces, compose local gradients

In category-theory terms:

functors preserve structure
natural transformations commute with mapping
monoids combine associatively with identity
chain rule composes derivative information

Checkpoint

Why is “local rule plus composition” the core idea behind backpropagation?

A strong answer:

Each operation only needs its local derivative; the chain rule composes those local derivatives into the gradient for the whole computation.

Where This Leaves Us

This chapter named the repeated structures that sit underneath the small ML pipeline. A functor explains mapping inside a wrapper. A natural transformation explains changing wrappers consistently. A monoid explains safe accumulation. A local gradient explains why a large training computation can be assembled from small derivative rules.

The next chapter uses the same engineering habit on a wider set of ideas from applied category theory. Instead of adding a larger ML model, it shows how the same typed-Rust style can model orders, resources, database instances, design relations, signal flow, circuits, and local-to-global behavior.

Retrieval Practice

Recall

What does a functor do to values inside a wrapper?

Explain

Why is a pipeline trace a good example of a monoid?

Apply

Write a small example of a value in your own codebase that has an empty value and a combine operation. State the identity law it should satisfy.

Seven Sketches Through Rust

The problem this chapter solves is:

Applied category theory can feel too large to connect to code. This chapter turns the seven major themes of Seven Sketches in Compositionality into small Rust blocks with tests.

The previous chapter introduced reusable structures such as functors, natural transformations, monoids, and local derivative rules. This chapter keeps the same reading style but changes scale. Instead of following the tiny ML pipeline, it takes seven broader ideas and asks how each one can be made concrete enough to inspect in Rust.

This chapter does not reproduce the paper. It gives executable handles for the ideas.

The repeated pattern is:

mathematical structure
  -> Rust type
  -> constructor or method
  -> law check

The Rust lesson is:

newtypes + private fields + validation + explicit composition

The category-theory lesson is:

objects + relationships + composition + laws

Reader orientation: Treat each sketch as a small modeling exercise. You are not expected to know the full mathematical theory before reading the Rust. Start from the type, then read the constructor, then read the law check.

What You Already Know

If you have modeled business rules, resource limits, database relationships, or state machines in Rust, you already know the software version of this chapter. The new idea is that applied category theory gives names to those compositional structures and asks which laws make them trustworthy.

Source Snapshots

The main module:

Source snapshot: src/sketches.rs

//! Rust models for the seven applied-category-theory sketches.
//!
//! This module is a study companion for *Seven Sketches in Compositionality*.
//! It does not try to encode all of category theory. Instead, each section
//! gives one small Rust model for the main structure of a sketch: orders,
//! resources, databases, co-design, signal flow, circuits, and logic of
//! behavior.

use crate::error::{CtError, CtResult};

/// A finite order used for the first sketch: observations can be refined into
/// features, scores, and decisions.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum InformationLevel {
    Observation,
    Feature,
    Score,
    Decision,
}

impl InformationLevel {
    /// Returns true when information at this level can flow into `target`.
    pub fn can_flow_to(self, target: Self) -> bool {
        self <= target
    }

    /// Least upper bound in this small total order.
    pub fn join(self, other: Self) -> Self {
        self.max(other)
    }
}

const INFORMATION_LEVELS: [InformationLevel; 4] = [
    InformationLevel::Observation,
    InformationLevel::Feature,
    InformationLevel::Score,
    InformationLevel::Decision,
];

/// Checks reflexivity and transitivity for the finite information order.
pub fn information_order_obeys_preorder_laws() -> bool {
    for level in INFORMATION_LEVELS {
        if !level.can_flow_to(level) {
            return false;
        }
    }

    for first in INFORMATION_LEVELS {
        for second in INFORMATION_LEVELS {
            for third in INFORMATION_LEVELS {
                let premise = first.can_flow_to(second) && second.can_flow_to(third);
                if premise && !first.can_flow_to(third) {
                    return false;
                }
            }
        }
    }

    true
}

/// Number of concrete features in a tiny model.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct FeatureCount(usize);

impl FeatureCount {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::EmptyInput("feature count"));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Number of abstract layers used to summarize feature capacity.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct LayerBudget(usize);

impl LayerBudget {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::EmptyInput("layer budget"));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

const FEATURES_PER_LAYER: usize = 4;

/// Abstracts concrete features to the minimum layer budget that can hold them.
pub fn abstract_to_layer_budget(features: FeatureCount) -> CtResult<LayerBudget> {
    LayerBudget::new(features.value().div_ceil(FEATURES_PER_LAYER))
}

/// Concretizes an abstract layer budget back to feature capacity.
pub fn concretize_layer_budget(layers: LayerBudget) -> FeatureCount {
    FeatureCount(layers.value() * FEATURES_PER_LAYER)
}

/// Checks the Galois-connection law for the feature/layer abstraction.
pub fn feature_layer_galois_law_holds(
    features: FeatureCount,
    layers: LayerBudget,
) -> CtResult<bool> {
    let abstracted_fits = abstract_to_layer_budget(features)?.value() <= layers.value();
    let concrete_fits = features.value() <= concretize_layer_budget(layers).value();

    Ok(abstracted_fits == concrete_fits)
}

/// A non-negative amount in a resource bundle.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct ResourceAmount(usize);

impl ResourceAmount {
    pub fn new(value: usize) -> Self {
        Self(value)
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Compute and memory resources composed by component-wise addition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ResourceBundle {
    compute: ResourceAmount,
    memory: ResourceAmount,
}

impl ResourceBundle {
    pub fn new(compute: ResourceAmount, memory: ResourceAmount) -> Self {
        Self { compute, memory }
    }

    pub fn compute(&self) -> ResourceAmount {
        self.compute
    }

    pub fn memory(&self) -> ResourceAmount {
        self.memory
    }

    /// Monoidal product for independent resources.
    pub fn tensor(&self, other: &Self) -> Self {
        Self {
            compute: ResourceAmount::new(self.compute.value() + other.compute.value()),
            memory: ResourceAmount::new(self.memory.value() + other.memory.value()),
        }
    }

    /// Resource preorder: supply can cover demand when every component is large
    /// enough.
    pub fn can_supply(&self, demand: &Self) -> bool {
        self.compute >= demand.compute && self.memory >= demand.memory
    }
}

/// Demonstrates monotonicity of the monoidal resource operation.
pub fn resource_tensor_is_monotone() -> bool {
    let small_supply = ResourceBundle::new(ResourceAmount::new(2), ResourceAmount::new(8));
    let large_supply = ResourceBundle::new(ResourceAmount::new(4), ResourceAmount::new(16));
    let fixed = ResourceBundle::new(ResourceAmount::new(1), ResourceAmount::new(2));

    large_supply.can_supply(&small_supply)
        && large_supply
            .tensor(&fixed)
            .can_supply(&small_supply.tensor(&fixed))
}

/// Identifier for a department row.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct DepartmentId(usize);

impl DepartmentId {
    pub fn new(value: usize) -> Self {
        Self(value)
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Identifier for an employee row.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct EmployeeId(usize);

impl EmployeeId {
    pub fn new(value: usize) -> Self {
        Self(value)
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// A row in the employee table. The department field is the schema arrow
/// `Employee -> Department`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct EmployeeRecord {
    id: EmployeeId,
    department: DepartmentId,
}

impl EmployeeRecord {
    pub fn new(id: EmployeeId, department: DepartmentId) -> Self {
        Self { id, department }
    }

    pub fn id(&self) -> EmployeeId {
        self.id
    }

    pub fn department(&self) -> DepartmentId {
        self.department
    }
}

/// A tiny database instance: schema objects become sets of ids, and schema
/// arrows become functions between those sets.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CompanyInstance {
    departments: Vec<DepartmentId>,
    employees: Vec<EmployeeRecord>,
}

impl CompanyInstance {
    pub fn new(
        departments: impl IntoIterator<Item = DepartmentId>,
        employees: impl IntoIterator<Item = EmployeeRecord>,
    ) -> CtResult<Self> {
        let departments = departments.into_iter().collect::<Vec<_>>();
        let employees = employees.into_iter().collect::<Vec<_>>();

        for employee in &employees {
            if !departments.contains(&employee.department()) {
                return Err(CtError::ShapeMismatch {
                    op: "database instance",
                    expected: String::from("employee departments exist in Department"),
                    got: format!("missing department {}", employee.department().value()),
                });
            }
        }

        Ok(Self {
            departments,
            employees,
        })
    }

    pub fn departments(&self) -> &[DepartmentId] {
        &self.departments
    }

    pub fn employees(&self) -> &[EmployeeRecord] {
        &self.employees
    }

    pub fn department_of(&self, employee_id: EmployeeId) -> Option<DepartmentId> {
        self.employees
            .iter()
            .find(|employee| employee.id() == employee_id)
            .map(EmployeeRecord::department)
    }

    pub fn foreign_keys_resolve(&self) -> bool {
        self.employees
            .iter()
            .all(|employee| self.departments.contains(&employee.department()))
    }
}

/// Requests per second needed by a design.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Throughput(usize);

impl Throughput {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::EmptyInput("throughput"));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Millisecond latency boundary.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct LatencyMs(usize);

impl LatencyMs {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::EmptyInput("latency"));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Functional need in a co-design problem.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DesignRequirement {
    minimum_throughput: Throughput,
    maximum_latency: LatencyMs,
}

impl DesignRequirement {
    pub fn new(minimum_throughput: Throughput, maximum_latency: LatencyMs) -> Self {
        Self {
            minimum_throughput,
            maximum_latency,
        }
    }
}

/// Implementation offered by a candidate component.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ImplementationOffer {
    throughput: Throughput,
    latency: LatencyMs,
}

impl ImplementationOffer {
    pub fn new(throughput: Throughput, latency: LatencyMs) -> Self {
        Self {
            throughput,
            latency,
        }
    }
}

/// A Bool-valued feasibility relation between requirements and offers.
#[derive(Debug, Clone, Copy)]
pub struct FeasibilityRelation;

impl FeasibilityRelation {
    pub fn relates(requirement: DesignRequirement, offer: ImplementationOffer) -> bool {
        offer.throughput >= requirement.minimum_throughput
            && offer.latency <= requirement.maximum_latency
    }
}

/// A scalar coefficient in a signal-flow matrix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SignalCoefficient(i32);

impl SignalCoefficient {
    pub fn new(value: i32) -> Self {
        Self(value)
    }

    pub fn zero() -> Self {
        Self(0)
    }

    pub fn value(&self) -> i32 {
        self.0
    }

    fn add(self, other: Self) -> Self {
        Self(self.value() + other.value())
    }

    fn multiply(self, other: Self) -> Self {
        Self(self.value() * other.value())
    }
}

/// Number of rows in a signal-flow matrix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MatrixRows(usize);

impl MatrixRows {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::EmptyInput("matrix rows"));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Number of columns in a signal-flow matrix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MatrixCols(usize);

impl MatrixCols {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::EmptyInput("matrix columns"));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Matrix semantics for a signal-flow graph.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SignalMatrix {
    rows: MatrixRows,
    cols: MatrixCols,
    coefficients: Vec<Vec<SignalCoefficient>>,
}

impl SignalMatrix {
    pub fn new(
        rows: MatrixRows,
        cols: MatrixCols,
        coefficients: Vec<Vec<SignalCoefficient>>,
    ) -> CtResult<Self> {
        if coefficients.len() != rows.value() {
            return Err(CtError::ShapeMismatch {
                op: "signal matrix",
                expected: format!("{} rows", rows.value()),
                got: format!("{} rows", coefficients.len()),
            });
        }

        for row in &coefficients {
            if row.len() != cols.value() {
                return Err(CtError::ShapeMismatch {
                    op: "signal matrix",
                    expected: format!("{} columns", cols.value()),
                    got: format!("{} columns", row.len()),
                });
            }
        }

        Ok(Self {
            rows,
            cols,
            coefficients,
        })
    }

    pub fn rows(&self) -> MatrixRows {
        self.rows
    }

    pub fn cols(&self) -> MatrixCols {
        self.cols
    }

    pub fn coefficients(&self) -> &[Vec<SignalCoefficient>] {
        &self.coefficients
    }

    /// Matrix composition. If `previous` is `A -> B` and `self` is `B -> C`,
    /// the result is `A -> C`.
    pub fn compose_after(&self, previous: &Self) -> CtResult<Self> {
        if previous.rows.value() != self.cols.value() {
            return Err(CtError::ShapeMismatch {
                op: "signal matrix composition",
                expected: format!("middle dimension {}", self.cols.value()),
                got: format!("middle dimension {}", previous.rows.value()),
            });
        }

        let mut coefficients =
            vec![vec![SignalCoefficient::zero(); previous.cols.value()]; self.rows.value()];

        for (output_row, output_coefficients) in coefficients.iter_mut().enumerate() {
            for (input_col, coefficient) in output_coefficients.iter_mut().enumerate() {
                let mut total = SignalCoefficient::zero();

                for middle in 0..self.cols.value() {
                    total = total.add(
                        self.coefficients[output_row][middle]
                            .multiply(previous.coefficients[middle][input_col]),
                    );
                }

                *coefficient = total;
            }
        }

        Self::new(self.rows, previous.cols, coefficients)
    }
}

/// A named boundary port in an open circuit.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct PortName(&'static str);

impl PortName {
    pub fn new(value: &'static str) -> CtResult<Self> {
        if value.trim().is_empty() {
            return Err(CtError::EmptyInput("port name"));
        }

        Ok(Self(value))
    }

    pub fn as_str(&self) -> &'static str {
        self.0
    }
}

/// Positive resistance value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ResistanceOhms(usize);

impl ResistanceOhms {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::InvalidQuantity {
                kind: "resistance",
                value: value as i64,
            });
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// A simple resistor component between two ports.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CircuitComponent {
    from: PortName,
    to: PortName,
    resistance: ResistanceOhms,
}

impl CircuitComponent {
    pub fn resistor(from: PortName, to: PortName, resistance: ResistanceOhms) -> Self {
        Self {
            from,
            to,
            resistance,
        }
    }

    pub fn from(&self) -> PortName {
        self.from
    }

    pub fn to(&self) -> PortName {
        self.to
    }

    pub fn resistance(&self) -> ResistanceOhms {
        self.resistance
    }
}

/// An open circuit with input ports, output ports, and internal components.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OpenCircuit {
    inputs: Vec<PortName>,
    outputs: Vec<PortName>,
    components: Vec<CircuitComponent>,
}

impl OpenCircuit {
    pub fn new(
        inputs: impl IntoIterator<Item = PortName>,
        outputs: impl IntoIterator<Item = PortName>,
        components: impl IntoIterator<Item = CircuitComponent>,
    ) -> CtResult<Self> {
        let inputs = inputs.into_iter().collect::<Vec<_>>();
        let outputs = outputs.into_iter().collect::<Vec<_>>();
        let components = components.into_iter().collect::<Vec<_>>();

        if inputs.is_empty() {
            return Err(CtError::EmptyInput("circuit inputs"));
        }

        if outputs.is_empty() {
            return Err(CtError::EmptyInput("circuit outputs"));
        }

        Ok(Self {
            inputs,
            outputs,
            components,
        })
    }

    pub fn input_count(&self) -> usize {
        self.inputs.len()
    }

    pub fn output_count(&self) -> usize {
        self.outputs.len()
    }

    pub fn component_count(&self) -> usize {
        self.components.len()
    }

    /// Serial composition wires this circuit's outputs into the next circuit's
    /// inputs.
    pub fn then(&self, next: &Self) -> CtResult<Self> {
        if self.output_count() != next.input_count() {
            return Err(CtError::ShapeMismatch {
                op: "open circuit serial composition",
                expected: format!("{} next inputs", self.output_count()),
                got: format!("{} next inputs", next.input_count()),
            });
        }

        let mut components = self.components.clone();
        components.extend_from_slice(&next.components);

        Self::new(self.inputs.clone(), next.outputs.clone(), components)
    }

    /// Parallel composition keeps the two open interfaces side by side.
    pub fn parallel(&self, other: &Self) -> CtResult<Self> {
        let mut inputs = self.inputs.clone();
        inputs.extend_from_slice(&other.inputs);

        let mut outputs = self.outputs.clone();
        outputs.extend_from_slice(&other.outputs);

        let mut components = self.components.clone();
        components.extend_from_slice(&other.components);

        Self::new(inputs, outputs, components)
    }
}

/// Truth values used for behavior classification.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TruthValue {
    False,
    True,
}

impl TruthValue {
    pub fn and(self, other: Self) -> Self {
        match (self, other) {
            (TruthValue::True, TruthValue::True) => TruthValue::True,
            _ => TruthValue::False,
        }
    }

    pub fn implies(self, other: Self) -> Self {
        match (self, other) {
            (TruthValue::True, TruthValue::False) => TruthValue::False,
            _ => TruthValue::True,
        }
    }
}

/// Discrete time point in a behavior trace.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct TimeTick(usize);

impl TimeTick {
    pub fn new(value: usize) -> Self {
        Self(value)
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Closed interval of time ticks.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TimeInterval {
    start: TimeTick,
    end: TimeTick,
}

impl TimeInterval {
    pub fn new(start: TimeTick, end: TimeTick) -> CtResult<Self> {
        if start > end {
            return Err(CtError::InvalidInterval {
                start: start.value(),
                end: end.value(),
            });
        }

        Ok(Self { start, end })
    }

    pub fn start(&self) -> TimeTick {
        self.start
    }

    pub fn end(&self) -> TimeTick {
        self.end
    }
}

/// Local safety result on one interval.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LocalSafetyCheck {
    interval: TimeInterval,
    truth: TruthValue,
}

impl LocalSafetyCheck {
    pub fn new(interval: TimeInterval, truth: TruthValue) -> Self {
        Self { interval, truth }
    }

    pub fn interval(&self) -> TimeInterval {
        self.interval
    }

    pub fn truth(&self) -> TruthValue {
        self.truth
    }
}

/// A cover of local behavior checks. The global result is true only when all
/// local checks are true.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SafetyCover(Vec<LocalSafetyCheck>);

impl SafetyCover {
    pub fn new(checks: impl IntoIterator<Item = LocalSafetyCheck>) -> CtResult<Self> {
        let checks = checks.into_iter().collect::<Vec<_>>();

        if checks.is_empty() {
            return Err(CtError::EmptyInput("safety cover"));
        }

        Ok(Self(checks))
    }

    pub fn global_truth(&self) -> TruthValue {
        self.0
            .iter()
            .fold(TruthValue::True, |truth, check| truth.and(check.truth()))
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn information_order_has_preorder_laws() {
        assert!(information_order_obeys_preorder_laws());
        assert_eq!(
            InformationLevel::Feature.join(InformationLevel::Decision),
            InformationLevel::Decision
        );
    }

    #[test]
    fn feature_layer_abstraction_obeys_galois_law() -> CtResult<()> {
        let features = FeatureCount::new(9)?;
        let layers = LayerBudget::new(3)?;

        assert!(feature_layer_galois_law_holds(features, layers)?);
        assert_eq!(abstract_to_layer_budget(features)?.value(), 3);
        assert_eq!(concretize_layer_budget(layers).value(), 12);
        Ok(())
    }

    #[test]
    fn resource_tensor_is_order_preserving() {
        assert!(resource_tensor_is_monotone());
    }

    #[test]
    fn database_instance_resolves_schema_arrow() -> CtResult<()> {
        let research = DepartmentId::new(1);
        let platform = DepartmentId::new(2);
        let ada = EmployeeId::new(7);
        let instance =
            CompanyInstance::new([research, platform], [EmployeeRecord::new(ada, research)])?;

        assert!(instance.foreign_keys_resolve());
        assert_eq!(instance.department_of(ada), Some(research));
        Ok(())
    }

    #[test]
    fn feasibility_relation_matches_requirement_to_offer() -> CtResult<()> {
        let requirement = DesignRequirement::new(Throughput::new(100)?, LatencyMs::new(80)?);
        let offer = ImplementationOffer::new(Throughput::new(120)?, LatencyMs::new(50)?);

        assert!(FeasibilityRelation::relates(requirement, offer));
        Ok(())
    }

    #[test]
    fn signal_matrices_compose_like_flow_graph_semantics() -> CtResult<()> {
        let duplicate = SignalMatrix::new(
            MatrixRows::new(2)?,
            MatrixCols::new(1)?,
            vec![
                vec![SignalCoefficient::new(1)],
                vec![SignalCoefficient::new(1)],
            ],
        )?;
        let add_weighted = SignalMatrix::new(
            MatrixRows::new(1)?,
            MatrixCols::new(2)?,
            vec![vec![SignalCoefficient::new(2), SignalCoefficient::new(3)]],
        )?;

        let composed = add_weighted.compose_after(&duplicate)?;

        assert_eq!(composed.coefficients(), &[vec![SignalCoefficient::new(5)]]);
        Ok(())
    }

    #[test]
    fn open_circuits_compose_in_series_and_parallel() -> CtResult<()> {
        let input = PortName::new("input")?;
        let middle = PortName::new("middle")?;
        let output = PortName::new("output")?;
        let first = OpenCircuit::new(
            [input],
            [middle],
            [CircuitComponent::resistor(
                input,
                middle,
                ResistanceOhms::new(10)?,
            )],
        )?;
        let second = OpenCircuit::new(
            [middle],
            [output],
            [CircuitComponent::resistor(
                middle,
                output,
                ResistanceOhms::new(20)?,
            )],
        )?;

        let serial = first.then(&second)?;
        let parallel = first.parallel(&second)?;

        assert_eq!(serial.input_count(), 1);
        assert_eq!(serial.output_count(), 1);
        assert_eq!(serial.component_count(), 2);
        assert_eq!(parallel.input_count(), 2);
        assert_eq!(parallel.output_count(), 2);
        Ok(())
    }

    #[test]
    fn local_behavior_truth_glues_to_global_truth() -> CtResult<()> {
        let first = LocalSafetyCheck::new(
            TimeInterval::new(TimeTick::new(0), TimeTick::new(5))?,
            TruthValue::True,
        );
        let second = LocalSafetyCheck::new(
            TimeInterval::new(TimeTick::new(5), TimeTick::new(10))?,
            TruthValue::True,
        );
        let cover = SafetyCover::new([first, second])?;

        assert_eq!(cover.global_truth(), TruthValue::True);
        assert_eq!(
            TruthValue::True.implies(TruthValue::False),
            TruthValue::False
        );
        Ok(())
    }
}

The runnable companion:

Source snapshot: examples/05_seven_sketches.rs

use category_theory_transformer_rs::{
    CircuitComponent, CompanyInstance, CtResult, DepartmentId, DesignRequirement, EmployeeId,
    EmployeeRecord, FeasibilityRelation, FeatureCount, ImplementationOffer, InformationLevel,
    LatencyMs, LayerBudget, LocalSafetyCheck, MatrixCols, MatrixRows, OpenCircuit, PortName,
    ResistanceOhms, ResourceAmount, ResourceBundle, SafetyCover, SignalCoefficient, SignalMatrix,
    Throughput, TimeInterval, TimeTick, TruthValue, abstract_to_layer_budget,
    concretize_layer_budget, feature_layer_galois_law_holds, information_order_obeys_preorder_laws,
    resource_tensor_is_monotone,
};

fn main() -> CtResult<()> {
    println!(
        "orders obey preorder laws: {}",
        information_order_obeys_preorder_laws()
    );
    println!(
        "join(feature, decision): {:?}",
        InformationLevel::Feature.join(InformationLevel::Decision)
    );

    let features = FeatureCount::new(9)?;
    let layers = LayerBudget::new(3)?;
    println!(
        "feature/layer Galois law: {}",
        feature_layer_galois_law_holds(features, layers)?
    );
    println!(
        "abstract 9 features to {} layers; concretize 3 layers to {} features",
        abstract_to_layer_budget(features)?.value(),
        concretize_layer_budget(layers).value()
    );

    let encoder = ResourceBundle::new(ResourceAmount::new(2), ResourceAmount::new(8));
    let decoder = ResourceBundle::new(ResourceAmount::new(3), ResourceAmount::new(10));
    println!("combined resource bundle: {:?}", encoder.tensor(&decoder));
    println!(
        "resource tensor monotone: {}",
        resource_tensor_is_monotone()
    );

    let research = DepartmentId::new(1);
    let platform = DepartmentId::new(2);
    let ada = EmployeeId::new(7);
    let instance =
        CompanyInstance::new([research, platform], [EmployeeRecord::new(ada, research)])?;
    println!(
        "employee {:?} belongs to department {:?}",
        ada,
        instance.department_of(ada)
    );

    let requirement = DesignRequirement::new(Throughput::new(100)?, LatencyMs::new(80)?);
    let offer = ImplementationOffer::new(Throughput::new(120)?, LatencyMs::new(50)?);
    println!(
        "co-design offer feasible: {}",
        FeasibilityRelation::relates(requirement, offer)
    );

    let duplicate = SignalMatrix::new(
        MatrixRows::new(2)?,
        MatrixCols::new(1)?,
        vec![
            vec![SignalCoefficient::new(1)],
            vec![SignalCoefficient::new(1)],
        ],
    )?;
    let add_weighted = SignalMatrix::new(
        MatrixRows::new(1)?,
        MatrixCols::new(2)?,
        vec![vec![SignalCoefficient::new(2), SignalCoefficient::new(3)]],
    )?;
    let composed = add_weighted.compose_after(&duplicate)?;
    println!(
        "signal-flow matrix semantics: {:?}",
        composed.coefficients()
    );

    let input = PortName::new("input")?;
    let middle = PortName::new("middle")?;
    let output = PortName::new("output")?;
    let first_circuit = OpenCircuit::new(
        [input],
        [middle],
        [CircuitComponent::resistor(
            input,
            middle,
            ResistanceOhms::new(10)?,
        )],
    )?;
    let second_circuit = OpenCircuit::new(
        [middle],
        [output],
        [CircuitComponent::resistor(
            middle,
            output,
            ResistanceOhms::new(20)?,
        )],
    )?;
    println!(
        "serial circuit component count: {}",
        first_circuit.then(&second_circuit)?.component_count()
    );

    let safety = SafetyCover::new([
        LocalSafetyCheck::new(
            TimeInterval::new(TimeTick::new(0), TimeTick::new(5))?,
            TruthValue::True,
        ),
        LocalSafetyCheck::new(
            TimeInterval::new(TimeTick::new(5), TimeTick::new(10))?,
            TruthValue::True,
        ),
    ])?;
    println!("global behavior truth: {:?}", safety.global_truth());

    Ok(())
}

Paper Map To Rust

Use this table as the navigation layer.

Paper area	Main content	Rust companion
Generative effects	preorders, monotone maps, Galois connections	`InformationLevel`, `FeatureCount`, `LayerBudget`
Resources	monoidal preorders, resource composition, enrichment	`ResourceBundle`, `ResourceAmount`
Databases	schemas as categories, instances as functors	`CompanyInstance`, `EmployeeRecord`, `DepartmentId`
Co-design	feasibility relations and profunctor-like reasoning	`DesignRequirement`, `ImplementationOffer`, `FeasibilityRelation`
Signal flow	syntax, semantics, matrices, composition	`SignalMatrix`, `SignalCoefficient`
Circuits	open systems, ports, serial and parallel composition	`OpenCircuit`, `CircuitComponent`, `PortName`
Logic of behavior	truth values, intervals, local-to-global checks	`TruthValue`, `TimeInterval`, `SafetyCover`

Each section below uses the same three-part lens:

Rust syntax
ML or software concept
Category theory concept

Worked Example: Ordering Information Levels

The smallest first-principles version of this chapter is an ordered enum. Rust can derive an order for enum variants, and that gives the code a concrete way to ask whether one information level can safely flow into another:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Observation,
    Feature,
    Decision,
}

assert!(Level::Observation <= Level::Decision);
assert!(Level::Feature <= Level::Feature);
}

The real src/sketches.rs module uses the same idea with named domain types, validation, and law checks.

Self-Check

Before reading the first sketch, explain why Observation <= Decision is a modeling rule, not just a comparison between enum variants.

Sketch 1: Information Order

The problem this block solves is:

Some concepts are ordered by refinement. An observation can be refined into a feature, a feature into a score, and a score into a decision.

The block begins:

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum InformationLevel {
    Observation,
    Feature,
    Score,
    Decision,
}

Rust Syntax

This is an enum.

The variants are ordered because the enum derives:

PartialOrd, Ord

That means Rust can compare:

InformationLevel::Observation <= InformationLevel::Decision

The methods:

pub fn can_flow_to(self, target: Self) -> bool {
    self <= target
}

pub fn join(self, other: Self) -> Self {
    self.max(other)
}

reuse that ordering.

can_flow_to checks whether information can move upward.

join returns the more informative of two levels in this total order.

ML Or Software Concept

ML systems often move through levels of processed information:

raw observation
  -> extracted feature
  -> model score
  -> final decision

The order prevents treating a low-level observation as if it were already a decision.

Category Theory Concept

This is a preorder-shaped example.

A preorder needs:

reflexivity: a <= a
transitivity: if a <= b and b <= c, then a <= c

The law check:

information_order_obeys_preorder_laws()

iterates over the finite set and verifies those rules.

Sketch 1 Continued: Feature And Layer Galois Law

The problem this block solves is:

A concrete feature count and an abstract layer budget are different worlds, but they can be coordinated by a law.

The key types:

pub struct FeatureCount(usize);
pub struct LayerBudget(usize);

The conversion functions:

pub fn abstract_to_layer_budget(features: FeatureCount) -> CtResult<LayerBudget>

pub fn concretize_layer_budget(layers: LayerBudget) -> FeatureCount

Rust Syntax

Both FeatureCount and LayerBudget are newtypes around usize.

Their constructors reject zero because neither a zero feature count nor a zero layer budget is useful in this model.

abstract_to_layer_budget divides feature count by FEATURES_PER_LAYER and rounds up:

features.value().div_ceil(FEATURES_PER_LAYER)

concretize_layer_budget multiplies layers by features per layer.

ML Or Software Concept

This models capacity planning.

Concrete features might be:

9 measured feature channels

The abstract layer budget might be:

3 layers

The law says the two planning views agree about what fits.

Category Theory Concept

The checked law is:

abstract(features) <= layers
if and only if
features <= concretize(layers)

That is the shape of a Galois connection.

The two directions are not inverses.

They are coordinated by an order law.

Sketch 2: Resources

The problem this block solves is:

Independent resources need a way to combine, and resource supply must respect demand ordering.

The key block:

pub struct ResourceBundle {
    compute: ResourceAmount,
    memory: ResourceAmount,
}

Rust Syntax

ResourceAmount wraps a usize.

ResourceBundle stores two resource dimensions:

compute
memory

The monoidal operation is:

pub fn tensor(&self, other: &Self) -> Self

It adds compute to compute and memory to memory.

The preorder check is:

pub fn can_supply(&self, demand: &Self) -> bool

It returns true only when every resource component is large enough.

ML Or Software Concept

This is the shape of deployment capacity:

encoder resources + decoder resources = combined resources

A machine can satisfy a demand only when it has enough compute and memory.

Category Theory Concept

This is a monoidal preorder.

The preorder is can_supply.

The monoidal product is tensor.

The law check:

resource_tensor_is_monotone()

shows that adding the same fixed resource bundle to both sides preserves the order.

Sketch 3: Database Instance

The problem this block solves is:

A database row with a foreign key should not point at a missing row.

The key types:

pub struct DepartmentId(usize);
pub struct EmployeeId(usize);

pub struct EmployeeRecord {
    id: EmployeeId,
    department: DepartmentId,
}

pub struct CompanyInstance {
    departments: Vec<DepartmentId>,
    employees: Vec<EmployeeRecord>,
}

Rust Syntax

DepartmentId and EmployeeId are distinct newtypes.

That prevents mixing department IDs and employee IDs.

EmployeeRecord contains:

employee id
department id

CompanyInstance::new collects departments and employees, then checks every employee department exists:

if !departments.contains(&employee.department()) {
    return Err(CtError::ShapeMismatch { ... });
}

ML Or Software Concept

Many ML systems depend on structured data.

If a row references missing data, downstream feature extraction or training can fail later in a confusing place.

This code rejects invalid relational structure at construction time.

Category Theory Concept

The schema can be read as:

Employee -> Department

An instance assigns sets of rows to schema objects.

The foreign key is a function from employees to departments.

CompanyInstance::new checks that the function is defined for every employee.

Sketch 4: Co-Design Feasibility

The problem this block solves is:

Some relationships are not functions. A requirement and an implementation offer are related only when constraints are satisfied.

The key types:

pub struct DesignRequirement {
    minimum_throughput: Throughput,
    maximum_latency: LatencyMs,
}

pub struct ImplementationOffer {
    throughput: Throughput,
    latency: LatencyMs,
}

pub struct FeasibilityRelation;

Rust Syntax

Throughput and LatencyMs are validated newtypes.

DesignRequirement stores the minimum acceptable throughput and maximum acceptable latency.

ImplementationOffer stores what an implementation actually provides.

The relation is:

pub fn relates(requirement: DesignRequirement, offer: ImplementationOffer) -> bool {
    offer.throughput >= requirement.minimum_throughput
        && offer.latency <= requirement.maximum_latency
}

ML Or Software Concept

This models feasibility:

Can this model/service/deployment satisfy this requirement?

For example:

required throughput: at least 100 requests/sec
required latency: at most 80 ms
offer: 120 requests/sec and 50 ms

The offer is feasible.

Category Theory Concept

This is relation-shaped rather than function-shaped.

It is the small Bool-valued version of profunctor-like reasoning:

Requirement x Offer -> Bool

Not every design problem should be forced into:

A -> B

Some should be modeled as constraints or relations.

Sketch 5: Signal Matrices

The problem this block solves is:

Signal-flow diagrams need executable semantics. In this companion, matrices provide that meaning.

The key types:

pub struct SignalCoefficient(i32);
pub struct MatrixRows(usize);
pub struct MatrixCols(usize);

pub struct SignalMatrix {
    rows: MatrixRows,
    cols: MatrixCols,
    coefficients: Vec<Vec<SignalCoefficient>>,
}

Rust Syntax

MatrixRows and MatrixCols reject zero.

SignalMatrix::new validates that the coefficient matrix has the promised shape:

number of rows matches MatrixRows
number of columns in every row matches MatrixCols

The composition method:

pub fn compose_after(&self, previous: &Self) -> CtResult<Self>

requires compatible middle dimensions.

Then it performs matrix multiplication using:

add
multiply
sum over the middle dimension

ML Or Software Concept

This is the same shape as composing linear layers or signal-processing stages.

If one stage maps:

A -> B

and another maps:

B -> C

then the composite maps:

A -> C

The dimensions must line up.

Category Theory Concept

Signal-flow syntax gets matrix semantics.

The important principle is functorial semantics:

meaning(composed diagram)
=
composition of meanings

The code enforces the same middle-dimension law that ordinary morphism composition enforces.

Sketch 6: Open Circuits

The problem this block solves is:

A circuit is not only internal components. It also has a boundary where it can connect to other circuits.

The key types:

pub struct PortName(&'static str);
pub struct ResistanceOhms(usize);

pub struct CircuitComponent {
    from: PortName,
    to: PortName,
    resistance: ResistanceOhms,
}

pub struct OpenCircuit {
    inputs: Vec<PortName>,
    outputs: Vec<PortName>,
    components: Vec<CircuitComponent>,
}

Rust Syntax

PortName::new rejects empty names.

ResistanceOhms::new rejects zero resistance.

OpenCircuit::new rejects circuits with no inputs or no outputs.

Serial composition:

pub fn then(&self, next: &Self) -> CtResult<Self>

checks:

self output count == next input count

Parallel composition:

pub fn parallel(&self, other: &Self) -> CtResult<Self>

puts the two boundaries side by side.

ML Or Software Concept

This looks like component architecture:

input interface
internal implementation
output interface

Composition should fail when interfaces do not match.

That rule applies to services, data pipelines, neural layers, and circuit-like systems.

Category Theory Concept

This is the open-system idea.

The boundary is part of the object.

Composition is controlled by boundary compatibility.

The paper develops this with cospans, hypergraph categories, decorated cospans, and operads. The Rust code gives a small typed analogue.

Sketch 7: Logic Of Behavior

The problem this block solves is:

A system may be safe on local time intervals. The code needs a way to combine local safety checks into one global result.

The key types:

pub enum TruthValue {
    False,
    True,
}

pub struct TimeTick(usize);

pub struct TimeInterval {
    start: TimeTick,
    end: TimeTick,
}

pub struct LocalSafetyCheck {
    interval: TimeInterval,
    truth: TruthValue,
}

pub struct SafetyCover(Vec<LocalSafetyCheck>);

Rust Syntax

TruthValue implements Boolean-style operations:

and
implies

TimeInterval::new rejects intervals where start is after end.

SafetyCover::new rejects an empty list of checks.

global_truth folds all local truths with and:

self.0
    .iter()
    .fold(TruthValue::True, |truth, check| truth.and(check.truth()))

ML Or Software Concept

This models safety or behavior validation over time:

check interval 0..5
check interval 5..10
combine into global result

If every local check is true, global truth is true.

If any local check is false, global truth is false.

Category Theory Concept

This is a small analogue of local-to-global reasoning.

The sheaf-like idea is:

local facts can determine a global fact when they glue coherently

The code uses a simple conjunction model, not full sheaf theory.

The important lesson is that proof-like information becomes explicit data, not an informal comment.

Tests As Exercise Solutions

The problem this block solves is:

The laws should be runnable, not only described in prose.

The test module checks preorder laws, the feature/layer Galois law, resource tensor monotonicity, database foreign-key resolution, feasibility relation behavior, signal matrix composition, open circuit serial and parallel composition, and local-to-global truth.

Rust Syntax

Every law is a normal Rust test marked with:

#[test]

Tests that may fail through constructors return:

CtResult<()>

so they can use ?.

ML Or Software Concept

The tests act as executable learning checks.

If a future change breaks a law, the project should fail quickly.

Category Theory Concept

The tests are small law checks.

They are not formal proofs, but they keep the implementation aligned with the claimed structure.

Run The Companion

Run:

cargo run --example 05_seven_sketches

For the full validation gate:

bash scripts/check.sh

Core Mental Model

In Rust terms:

each sketch becomes concrete types, constructors, methods, and tests

In ML or software terms:

orders, resources, schemas, feasibility, signal flow, interfaces, and safety
are all engineering structures

In category-theory terms:

the useful part is compositionality: make structure visible, then make
composition obey laws

What To Remember

The seven sketches are not seven disconnected topics.

They repeat one engineering discipline:

name the objects
name the relationships
control construction
define composition
check the law

That is also the discipline used by the tiny ML pipeline in the rest of the course.

Where This Leaves Us

This chapter showed that the book’s main discipline is not limited to language modeling. Orders, resources, database instances, feasibility relations, signal matrices, open circuits, and local behavior checks can all be read in the same way: name the values, constrain construction, define composition, and test the law that makes the composition trustworthy.

The remaining practice material asks you to use that reading method yourself. The exercises are not meant to test memorized definitions. They are meant to train the habit of translating one Rust block into its software role and its categorical shape.

Retrieval Practice

Recall

Name three applied structures modeled in this chapter.

Explain

Why does each sketch include constructors or law checks instead of only type definitions?

Apply

Pick one engineering concept from your own work, such as permissions, queues, or resource budgets. Describe the objects, relationships, and one law you would want tests to check.

Exercises

The problem this chapter solves is:

Reading detailed explanations is not enough. You need to practice explaining the code through Rust syntax, ML concept, and category-theory concept.

The exercises are deliberately small. A strong answer is not a long essay; it is a precise explanation that connects a line of Rust to the value it protects, the ML step it supports, and the categorical shape it names. When an exercise asks you to edit code, make the smallest change, run the command, and then explain what changed.

For every exercise, use this answer shape:

Rust syntax:
...

ML concept:
...

Category theory concept:
...

The point is not to write long answers.

The point is to connect the same block of code across all three meanings.

Before starting, make sure the basic Rust feedback loop works:

cargo test --all-targets --all-features

That command is part of the learning method. It proves that the examples in the book are not only explanatory text; they are tied to code that the compiler can check.

Worked Example

First study a complete answer. The exercise is:

Explain why TokenId is not a raw usize.

A strong answer:

Rust syntax:
TokenId is a tuple struct around usize. The field is private, so callers use
TokenId::new and index() instead of reaching into the raw value directly.

ML concept:
The number represents a vocabulary position, not an arbitrary count or shape.

Category theory concept:
TokenId is one object in the small category of typed pipeline values. Morphisms
such as Embedding can start from it.

Notice the order: name the syntax, connect it to the ML role, then name only the categorical shape the code supports.

Partially Completed Example

Complete the missing lines for Distribution:

Rust syntax:
Distribution wraps ________ and construction can return ________.

ML concept:
It represents probabilities over possible next tokens, so the values must be
non-negative and sum to ________.

Category theory concept:
It is an object produced by ________ and consumed with a target token by
________.

Expected completion:

Vec<f32>
CtResult<Self>
one
Softmax
CrossEntropy

Your Turn

Now solve the same kind of exercise without the filled answer. Pick Loss, TrainingSet, or LearningRate and explain it through the same three lenses.

Transfer Exercise

Design a new wrapper type for a future Transformer chapter, such as SequenceLength, HeadCount, or AttentionScore. State the raw representation, the invariant, and one function that should consume or produce it.

Exercise 1: Explain One Domain Type

Use Domain Objects.

Pick one type:

Vector
Logits
Distribution
Loss
TrainingSet
Parameters

Write:

The problem this solves:

Rust syntax:

ML concept:

Category theory concept:

Pass condition:

You name the raw representation.
You name the invariant or semantic distinction.
You name the pipeline stage where the type appears.

First-principles hint:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LocalTokenId(usize);

impl LocalTokenId {
    fn new(index: usize) -> Self {
        Self(index)
    }

    fn index(self) -> usize {
        self.0
    }
}

assert_eq!(LocalTokenId::new(7).index(), 7);
}

That snippet is intentionally smaller than the real crate. It shows the raw idea: a named wrapper can make one usize mean “token id” instead of “any number.”

Exercise 2: Add A Token

Use the src/demo.rs snapshot in Course Map.

Add one new vocabulary item and extend the token sequence.

Run:

cargo run --bin category_ml

Pass condition:

the demo still runs
the dataset windowing output includes your new transition
you can explain why a longer TokenSequence creates more training examples

Exercise 3: Trace `DatasetWindowing`

Use The Tiny ML Pipeline.

For this input:

[TokenId(4), TokenId(8), TokenId(15), TokenId(16)]

write the training examples produced by windows(2).

Then explain:

Rust syntax:
what does `.windows(2)` do?

ML concept:
why does next-token training need adjacent pairs?

Category theory concept:
why is each example a product object?

Exercise 4: Break A Composition

Use the examples/02_morphism_composition.rs snapshot in Morphism and Composition.

Try to compose Embedding directly with Softmax.

Expected failure shape:

the trait bound ... is not satisfied

Then restore the working version.

Explain:

Rust syntax:
which type did the compiler reject?

ML concept:
which prediction stage was skipped?

Category theory concept:
which middle object failed to match?

Exercise 5: Change The Training Repetition Count

Use the examples/03_training_endomorphism.rs snapshot in Training as an Endomorphism.

Change:

StepCount::new(80)

to:

StepCount::new(1)
StepCount::new(10)
StepCount::new(200)

Run:

cargo run --example 03_training_endomorphism

Explain the result:

Rust syntax:
where is the count used?

ML concept:
what happens when training repeats more times?

Category theory concept:
why can the update be repeated?

Exercise 6: Explain `Distribution<T>::map`

Use Functors, Naturality, Monoids, and Chain Rule.

Explain the conceptual Distribution<T>::map example.

Use this input distribution:

TokenId(2) -> 0.70
TokenId(3) -> 0.30

and this function:

TokenId -> String

where:

TokenId(2) -> "Rust"
TokenId(3) -> "."

Write the output distribution.

Then explain:

Rust syntax:
why does `self` plus `into_iter()` move the old outcomes?

ML concept:
why do the probabilities stay the same?

Category theory concept:
what does it mean to lift `T -> U` into `Distribution<T> -> Distribution<U>`?

Exercise 7: Explain One Validation Boundary

Pick one constructor:

Distribution::new
Loss::new
LearningRate::new
TrainingSet::new
SignalMatrix::new
OpenCircuit::new

Write:

The problem this solves:

Rust syntax:
which condition returns `Err(...)`?

ML or software concept:
what bad runtime behavior does this prevent?

Category theory concept:
what intended object or relationship is being protected?

Exercise 8: Trace A Full Source File

Use Repository Source Snapshots.

Pick one complete source file and write a five-sentence summary:

What problem does the file solve?
What are the main Rust types or traits?
What ML or software concept does it model?
What category-theory concept does it teach?
Which command proves the file still works?

Exercise 9: Connect One External Reference

Use References.

Pick one external resource and connect it to one source file in this course.

Answer:

External resource:
Source file:
Rust syntax connection:
ML or software concept connection:
Category theory concept connection:
One difference between the full treatment and this tiny implementation:

Exercise 10: Test One Sketch Law

Use Seven Sketches Through Rust.

Pick one law from src/sketches.rs:

preorder laws
feature/layer Galois law
resource monotonicity
foreign-key resolution
signal-flow matrix composition
local-to-global safety truth

Change one input in examples/05_seven_sketches.rs, then run:

cargo run --example 05_seven_sketches

Pass condition:

you can explain which law still holds
you can explain which constructor or method prevents invalid structure
your explanation uses Rust syntax, ML or software concept, and category theory concept

Exercise 11: Write A New Block Explanation

Choose any block from the source snapshots that the chapter did not explain in enough detail for you.

Write a block explanation using this structure:

The problem this block solves:

The whole block:

Rust syntax:

ML or software concept:

Category theory concept:

Core mental model:

Pass condition:

A beginner can understand the Rust syntax.
An ML learner can understand why the block exists.
A category-theory learner can name the shape.

Where This Leaves Us

If you can complete these exercises, you can read the project without treating category theory, Rust, and ML as three disconnected subjects. You can start from a line of code, name the syntax, identify the software or ML role, and then describe the categorical shape only as far as the code justifies it.

Glossary

The problem this chapter solves is:

Abstract terms are easier to remember when each term is tied to a Rust type, an ML role, and a category-theory shape.

Use this glossary as a lookup table while reading the source snapshots.

Do not read it as a separate dictionary. Each entry is deliberately anchored to the codebase. If a definition sounds abstract, jump from the term to the Rust syntax and then back to the chapter where the type or trait appears.

Reader orientation: The glossary uses compact entries, but the entries still follow the book’s main discipline: first the Rust handle, then the ML or software role, then the categorical shape.

Category-Theory Terms

Object

Rust syntax:

TokenId
Vector
Logits
Distribution
Loss
Parameters

ML concept:

An object is one kind of value in the pipeline, such as a token, vector, probability distribution, loss, or model state.

Category theory concept:

An object is something a morphism can start from or end at.

First-principles reading:

An object is the kind of thing an arrow is allowed to receive or return. In this book, TokenId and Vector are different objects because the pipeline should not confuse a vocabulary index with a dense numeric representation.

Morphism

Rust syntax:

pub trait Morphism<Input, Output>

ML concept:

A morphism is one transformation stage, such as embedding lookup or softmax.

Category theory concept:

A morphism is a typed arrow:

Input -> Output

Identity Morphism

Rust syntax:

Identity<T>

ML concept:

Identity is a stage that leaves a value unchanged. It is useful for testing the idea of neutral transformations.

Category theory concept:

Every object has an identity arrow:

id_A : A -> A

Composition

Rust syntax:

Compose<F, G, Middle>

ML concept:

Composition connects stages:

Embedding then LinearToLogits then Softmax

Category theory concept:

If:

f : A -> B
g : B -> C

then:

g after f : A -> C

Product Object

Rust syntax:

Product<A, B>

ML concept:

A product stores paired values, such as:

input token x target token
prediction distribution x target token

Category theory concept:

The product object is written:

A x B

Its projections correspond to first() and second().

Endomorphism

Rust syntax:

Endomorphism<T>
TrainStep : Parameters -> Parameters

ML concept:

A training step updates parameters and returns parameters again.

Category theory concept:

An endomorphism is an arrow from an object back to itself:

A -> A

Functor

Rust syntax:

Functor<A, B>
VecFunctor
OptionFunctor

ML concept:

Apply a transformation inside a wrapper such as a batch or optional value.

Category theory concept:

A functor maps objects and arrows while preserving structure.

Functor Map

Rust syntax:

fn map<U>(self, f: impl Fn(T) -> U) -> Distribution<U>

ML concept:

For a probabilistic output, map transforms every possible outcome while leaving the attached probabilities unchanged.

Category theory concept:

map lifts a deterministic function:

T -> U

into a context-aware transformation:

Distribution<T> -> Distribution<U>

Natural Transformation

Rust syntax:

VecToFirstOption : Vec<A> -> Option<A>

ML concept:

Convert one container shape into another consistently, such as many candidates to maybe one selected candidate.

Category theory concept:

A natural transformation converts one functor shape into another and commutes with mapping.

Monoid

Rust syntax:

PipelineTrace
Monoid::empty()
Monoid::combine()

ML concept:

Traces, logs, batches, and metric accumulators often need an empty value and a combine operation.

Category theory concept:

A monoid has an identity element and an associative binary operation.

Preorder

Rust syntax:

InformationLevel::can_flow_to

ML or software concept:

Information can flow from observation to feature to score to decision.

Category theory concept:

A preorder is reflexive and transitive.

Galois Connection

Rust syntax:

abstract_to_layer_budget
concretize_layer_budget

ML or software concept:

Concrete feature counts and abstract layer budgets can be coordinated.

Category theory concept:

Two order-preserving views are connected by a law:

abstract(x) <= y iff x <= concretize(y)

Monoidal Preorder

Rust syntax:

ResourceBundle::tensor
ResourceBundle::can_supply

ML or software concept:

Independent compute and memory resources can be combined.

Category theory concept:

A preorder with a product-like composition operation that preserves order.

Profunctor

Rust syntax:

FeasibilityRelation::relates(requirement, offer)

ML or software concept:

A requirement and implementation offer are related if constraints are satisfied.

Category theory concept:

A profunctor generalizes a relationship between categories. This course uses a small Bool-valued relation as the practical handle.

Functorial Semantics

Rust syntax:

SignalMatrix::compose_after

ML or software concept:

Composed signal-flow stages should have the same meaning as composing their matrix interpretations.

Category theory concept:

Interpretation preserves composition.

Open System

Rust syntax:

OpenCircuit
OpenCircuit::then
OpenCircuit::parallel

ML or software concept:

A component has an external interface plus internal implementation details.

Category theory concept:

An open system composes through typed boundaries.

Sheaf-Style Locality

Rust syntax:

SafetyCover::global_truth

ML or software concept:

Local safety checks over time intervals combine into a global safety result.

Category theory concept:

Local facts can determine a global fact when they glue coherently.

Rust Terms

Newtype

Rust syntax:

pub struct TokenId(usize);

ML concept:

The same raw number type can represent different concepts. Newtypes prevent accidental mixing.

Category theory concept:

A newtype names a specific object instead of treating all raw representations as the same object.

First-principles reading:

A newtype is the smallest move from “just data” to “data with a role.” The runtime representation can stay cheap, but the type checker now knows that a token id, vocabulary size, and model dimension are not the same concept.

Smart Constructor

Rust syntax:

pub fn new(value: Raw) -> CtResult<Self>

ML concept:

Invalid training inputs, probabilities, dimensions, or hyperparameters should be rejected early.

Category theory concept:

A smart constructor maps raw data into a validated subobject, using Result when the mapping can fail.

Invariant

Rust syntax:

Distribution must be non-empty, finite, non-negative, and sum to one.

ML concept:

The model can trust a value only if the type protects the rule that makes it meaningful.

Category theory concept:

An invariant describes the subset or structure the object is meant to inhabit.

Typed Error

Rust syntax:

CtError
CtResult<T>

ML concept:

Bad data should fail with a meaningful cause, not with a vague panic later.

Category theory concept:

Result turns a partial construction or morphism into a total error-aware mapping.

Machine-Learning Terms

Token

Rust syntax:

TokenId

ML concept:

A token is a discrete symbol from a vocabulary.

Category theory concept:

The vocabulary is a finite discrete set of possible token objects.

Embedding

Rust syntax:

Embedding : TokenId -> Vector

ML concept:

An embedding maps a discrete token to a dense numerical representation.

Category theory concept:

It is a morphism from a finite token object into a vector-space-like object.

Logits

Rust syntax:

Logits(Vec<f32>)

ML concept:

Logits are raw scores before softmax.

Category theory concept:

They live in a vector-space-like object:

R^vocab_size

Softmax

Rust syntax:

Softmax : Logits -> Distribution

ML concept:

Softmax turns raw scores into probabilities.

Category theory concept:

It maps from a score vector into the probability simplex.

Cross Entropy

Rust syntax:

CrossEntropy : Product<Distribution, TokenId> -> Loss

ML concept:

Cross entropy measures how much probability the model assigned to the correct target.

Category theory concept:

It is a morphism from prediction-target product into non-negative scalar loss.

Parameters

Rust syntax:

Parameters

ML concept:

The trainable state of the model: embedding table, output head, and bias.

Category theory concept:

The object transformed by the training endomorphism.

Gradient

Rust syntax:

LocalGradient
grad_embedding
grad_lm_head
grad_bias

ML concept:

A gradient tells how parameters should change to reduce loss.

Category theory concept:

Gradient flow is local derivative information composed backward through a composed computation.

Learning Rate

Rust syntax:

LearningRate

ML concept:

The scalar step size in gradient descent.

Category theory concept:

It chooses a specific update morphism from a family of parameter endomorphisms.

Chain Rule

Rust syntax:

MulOp::backward

ML concept:

The chain rule lets local derivatives combine into gradients for a larger computation.

Category theory concept:

It is composition of local derivative maps.

Where This Leaves Us

The glossary is not a substitute for the chapters. It is the index of the book’s repeated translation habit. When a term feels unfamiliar, connect it back to one of three things: the Rust syntax that names it, the ML or software role that motivates it, and the categorical shape that explains how it composes.

References

The problem this chapter solves is:

The course uses small Rust examples. These references point to the larger Rust, ML, and category-theory treatments behind those examples.

Use each reference with the same three questions:

Rust syntax:
which source file in this course uses the idea?

ML concept:
which model or training behavior does the reference explain?

Category theory concept:
which object, morphism, composition, or law does it deepen?

Rust

The Rust Programming Language: Packages, Crates, and Modules explains how Rust packages are organized into library and binary crates. Use it with src/lib.rs, src/bin/category_ml.rs, and the examples/ files.
The Rust Programming Language: Recoverable Errors with Result explains the error pattern behind CtResult<T> and constructors such as Distribution::new.
The rustdoc book: How to write documentation explains the documentation comments used above public types and methods.
Rust API Guidelines Checklist is a practical review checklist for naming, documentation, type conversions, and error design.

Category Theory

Seven Sketches in Compositionality: An Invitation to Applied Category Theory is the larger applied-category-theory text behind the companion chapter. Use it with src/sketches.rs.
Seven Sketches in Compositionality PDF is the direct paper file for offline reading and page-by-page study.

Machine Learning

Dive into Deep Learning: Softmax Regression explains multiclass classification, logits, softmax, and cross entropy. Use it with src/ml.rs.
Dive into Deep Learning: Softmax Regression Implementation from Scratch shows the implementation path behind this course’s smaller Rust version.
Stanford CS231n: Linear Classification explains linear classifiers, scores, losses, and the softmax classifier from a widely used university course.

Transformers

Attention Is All You Need on arXiv is the original Transformer paper.
Attention Is All You Need on the NeurIPS proceedings site is the archival conference listing.
Dive into Deep Learning: Attention Mechanisms and Transformers is a practical bridge from softmax and vector operations to attention and Transformer blocks.

Transformer Roadmap

The problem this chapter solves is:

The repository name points toward Transformers, but the current code is a foundation course. This chapter explains exactly how the current objects and morphisms point toward a future attention-based model.

The current code is not a full Transformer.

It teaches the typed pieces you need first:

tokens
vectors
logits
probabilities
loss
training updates
composition

This distinction matters. A roadmap should not pretend the current crate already implements attention, residual blocks, or a full sequence model. It should show how the current typed skeleton can grow without losing the discipline that made the small examples understandable.

Reader orientation: Read this chapter as an engineering migration plan, not as a promise that the current code already contains every Transformer component.

What You Already Know

If you understand the current prediction path, you already know the skeleton a Transformer will extend. Tokens become vectors, vectors move through typed transformations, and probabilities feed a loss. The future work is to replace the one-token middle with sequence-aware structure.

What Exists Now

The current model has this prediction path:

TokenId -> Vector -> Logits -> Distribution

Rust Syntax

The path is implemented with:

Embedding
LinearToLogits
Softmax
Compose

The main domain objects are:

TokenId
Vector
Logits
Distribution
Parameters

The training update is:

TrainStep : Parameters -> Parameters

ML Concept

This is a tiny next-token model.

It predicts from one token at a time.

It does not yet model attention across a sequence.

Still, it already teaches the core path:

discrete token
  -> dense representation
  -> vocabulary scores
  -> next-token probabilities

Category Theory Concept

The current system teaches composition:

TokenId -> Vector -> Logits -> Distribution

and endomorphism:

Parameters -> Parameters

Those two shapes remain central in Transformers.

Step 1: Sequences As First-Class Objects

The future problem:

Attention does not operate on one token alone. It operates on a sequence of hidden states.

Worked Example: Validating Sequence Length

The first-principles Rust move is the same one used throughout the book: do not let a meaningful value travel as a raw primitive once it crosses a conceptual boundary. A future sequence length can start as a small validating type:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SequenceLength(usize);

impl SequenceLength {
    fn new(value: usize) -> Result<Self, &'static str> {
        if value == 0 {
            return Err("sequence length must be positive");
        }

        Ok(Self(value))
    }

    fn value(self) -> usize {
        self.0
    }
}

assert_eq!(SequenceLength::new(3)?.value(), 3);
assert!(SequenceLength::new(0).is_err());
Ok::<(), &'static str>(())
}

Self-Check

Before reading the roadmap steps, explain why a future SequenceLength should not be passed around as a bare usize.

Rust Syntax

A future extension should introduce types such as:

pub struct Position(usize);
pub struct SequenceLength(usize);
pub struct HiddenSequence(Vec<Vector>);
pub struct AttentionMask(/* validated mask representation */);

The important rule is the same as this course:

do not pass raw vectors across architectural boundaries

ML Concept

Attention needs a representation like:

[hidden_0, hidden_1, hidden_2, ...]

plus position and mask information.

Category Theory Concept

The object changes from:

Vector

to:

Sequence(Vector)

The next morphisms operate on structured sequences.

Step 2: Query, Key, And Value Projections

The future problem:

Attention compares tokens by projecting hidden states into query, key, and value spaces.

Rust Syntax

Future morphisms might have shapes:

HiddenSequence -> QuerySequence
HiddenSequence -> KeySequence
HiddenSequence -> ValueSequence

Each output type should be distinct.

Queries, keys, and values are all vectors underneath, but they have different roles.

ML Concept

Queries ask:

what am I looking for?

Keys answer:

what do I contain?

Values provide:

what information should be mixed?

Category Theory Concept

These are parallel morphisms out of the same object:

HiddenSequence -> QuerySequence
HiddenSequence -> KeySequence
HiddenSequence -> ValueSequence

The future attention block combines their results.

Step 3: Scaled Dot-Product Attention

The future problem:

Convert query-key similarity into a probability distribution over positions, then use it to mix values.

Rust Syntax

A typed shape could be:

QuerySequence x KeySequence -> AttentionScores
AttentionScores -> AttentionWeights
AttentionWeights x ValueSequence -> AttentionOutput

AttentionWeights should be validated like Distribution, but over sequence positions instead of vocabulary tokens.

ML Concept

Attention computes:

scores = QK^T / sqrt(d)
weights = softmax(scores)
output = weights V

This is softmax again, but applied to token-to-token interaction scores.

Category Theory Concept

The attention block is a composition of typed maps with a product input:

(Q, K, V) -> scores -> weights -> mixed values

Step 4: Multi-Head Attention

The future problem:

One attention head sees one interaction pattern. Multiple heads let the model learn several patterns in parallel.

Rust Syntax

Future types might include:

pub struct AttentionHead(/* head parameters */);
pub struct HeadCount(usize);
pub struct MultiHeadOutput(/* concatenated or projected heads */);

HeadCount should reject zero.

ML Concept

Each head performs attention separately.

The outputs are combined and projected back into the model dimension.

Category Theory Concept

This is parallel composition followed by recombination:

head_1 x head_2 x ... x head_n -> combined output

Step 5: Residual Blocks And Normalization

The future problem:

Transformer blocks repeatedly map a hidden sequence back to a hidden sequence.

Rust Syntax

A future block should have shape:

HiddenSequence -> HiddenSequence

That means it can implement an endomorphism-like trait or reuse the existing Morphism<HiddenSequence, HiddenSequence> shape.

ML Concept

Transformer blocks use:

attention
residual connection
normalization
feed-forward network

The block output has the same shape as the input.

Category Theory Concept

This is another endomorphism:

HiddenSequence -> HiddenSequence

Stacking layers is repeated endomorphism application.

Step 6: Training And Evaluation

The future problem:

Once the model has attention parameters, training must update a larger parameter object without losing type structure.

Rust Syntax

The current:

Parameters

would need to grow into a structured parameter type:

pub struct TransformerParameters {
    token_embedding: ...,
    attention_blocks: ...,
    lm_head: ...,
}

The update should still have the shape:

TransformerParameters -> TransformerParameters

ML Concept

The same high-level training loop remains:

predict
compute loss
backpropagate
update parameters

The internal model becomes richer.

Category Theory Concept

The training endomorphism generalizes:

Parameters -> Parameters

to:

TransformerParameters -> TransformerParameters

Core Mental Model

The current course teaches the typed skeleton:

TokenId -> Vector -> Logits -> Distribution
Distribution x TokenId -> Loss
Parameters -> Parameters

A Transformer extension grows the middle:

TokenSequence
  -> HiddenSequence
  -> AttentionOutput
  -> HiddenSequence
  -> Logits
  -> Distribution

The practical rule stays the same:

Make every intermediate object explicit, then compose only arrows whose types actually match.

Where This Leaves Us

The roadmap keeps the book honest. The current implementation is a tiny next-token system, not a production Transformer. Its value is that it gives the future system a typed foundation: tokens become vectors, vectors become logits, logits become probabilities, probabilities become loss, and training updates parameters through a repeatable endomorphism.

A future Transformer should extend that foundation by adding sequence objects, position information, query/key/value projections, attention weights, multi-head structure, residual blocks, normalization, and richer training state. Each new concept should enter the codebase the same way the current concepts did: as a named type, a validated boundary, a typed morphism, a compiled example, and a law or regression test where the concept has a law worth checking.

Retrieval Practice

Recall

Which current pipeline objects would still exist in a future Transformer?

Explain

Why should attention weights be validated like probability distributions?

Apply

Sketch one future morphism for query/key/value attention and name its input and output types.

Repository Source Snapshots

This appendix collects the learner-facing source files used throughout the course. The Rust snapshots are complete files, so they are marked ignore as standalone snippets. The real source files are still validated by the repository checks.

The problem this appendix solves is:

The explanatory chapters should never drift away from the real code.

Use this appendix as the exact-code lookup layer.

When you inspect any block here, use the same reading pattern as the chapters:

Rust syntax:
what does the code literally declare or execute?

ML or software concept:
why does this block exist in the pipeline or system model?

Category theory concept:
what object, morphism, product, composition, endomorphism, or law does it
represent?

The appendix is intentionally less narrative than the chapters. It keeps the full source available in one place so you can verify every explanation against the code itself.

Rust Library Surface

`src/lib.rs`

//! A small, modular Rust tutorial for category-theory ideas in tiny ML.
//!
//! The crate is intentionally split into small learning modules:
//! - [`domain`] defines the nouns: tokens, vectors, probabilities, losses, and parameters.
//! - [`category`] defines the arrows: morphisms, identity, composition, and endomorphisms.
//! - [`ml`] implements concrete ML morphisms.
//! - [`training`] turns one optimizer step into an endomorphism on parameters.
//! - [`structure`] covers functors, natural transformations, and monoids.
//! - [`calculus`] shows the chain rule as a local backward pass.
//! - [`sketches`] gives Rust models for the seven applied-category-theory sketches.
//! - [`demo`] connects the pieces into the terminal walkthrough.

pub mod calculus;
pub mod category;
pub mod demo;
pub mod domain;
pub mod error;
pub mod ml;
pub mod sketches;
pub mod structure;
pub mod training;

pub use calculus::{LocalGradient, MulOp, Scalar};
pub use category::{
    Compose, Endomorphism, Identity, Morphism, StepCount, apply_endomorphism_n_times,
};
pub use demo::run_demo;
pub use domain::{
    Distribution, LearningRate, Logits, Loss, ModelDimension, Parameters, Product, TokenId,
    TokenSequence, TrainingExample, TrainingSet, Vector, VocabSize,
};
pub use error::{CtError, CtResult};
pub use ml::{
    CrossEntropy, DatasetWindowing, DirectPredict, Embedding, LinearToLogits, Softmax,
    average_loss, composed_prediction_matches_direct_prediction,
};
pub use sketches::{
    CircuitComponent, CompanyInstance, DepartmentId, DesignRequirement, EmployeeId, EmployeeRecord,
    FeasibilityRelation, FeatureCount, ImplementationOffer, InformationLevel, LatencyMs,
    LayerBudget, LocalSafetyCheck, MatrixCols, MatrixRows, OpenCircuit, PortName, ResistanceOhms,
    ResourceAmount, ResourceBundle, SafetyCover, SignalCoefficient, SignalMatrix, Throughput,
    TimeInterval, TimeTick, TruthValue, abstract_to_layer_budget, concretize_layer_budget,
    feature_layer_galois_law_holds, information_order_obeys_preorder_laws,
    resource_tensor_is_monotone,
};
pub use structure::{
    Functor, Monoid, NaturalTransformation, OptionFunctor, PipelineTrace, TraceStep, VecFunctor,
    VecToFirstOption, monoid_laws_hold_for_pipeline_trace,
    naturality_square_holds_for_first_option,
};
pub use training::TrainStep;

`src/error.rs`

use std::fmt::{self, Display};

/// Shared error type for the tutorial crate.
#[derive(Debug, Clone, PartialEq)]
pub enum CtError {
    EmptyInput(&'static str),
    OutOfRange {
        kind: &'static str,
        index: usize,
        limit: usize,
    },
    ShapeMismatch {
        op: &'static str,
        expected: String,
        got: String,
    },
    InvalidProbability(&'static str),
    InvalidLoss(f32),
    InvalidLearningRate(f32),
    InvalidQuantity {
        kind: &'static str,
        value: i64,
    },
    InvalidInterval {
        start: usize,
        end: usize,
    },
}

impl Display for CtError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CtError::EmptyInput(context) => write!(f, "empty input in {context}"),
            CtError::OutOfRange { kind, index, limit } => {
                write!(f, "{kind} index {index} out of range; limit is {limit}")
            }
            CtError::ShapeMismatch { op, expected, got } => {
                write!(f, "shape mismatch in {op}: expected {expected}, got {got}")
            }
            CtError::InvalidProbability(context) => {
                write!(f, "invalid probability distribution in {context}")
            }
            CtError::InvalidLoss(value) => write!(f, "invalid loss value {value}"),
            CtError::InvalidLearningRate(value) => write!(f, "invalid learning rate {value}"),
            CtError::InvalidQuantity { kind, value } => {
                write!(f, "invalid {kind} quantity {value}")
            }
            CtError::InvalidInterval { start, end } => {
                write!(f, "invalid interval: start {start} is after end {end}")
            }
        }
    }
}

impl std::error::Error for CtError {}

pub type CtResult<T> = Result<T, CtError>;

`src/domain.rs`

use crate::error::{CtError, CtResult};

/// A vocabulary index. It is intentionally not a raw `usize` in public APIs.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct TokenId(usize);

impl TokenId {
    pub fn new(index: usize) -> Self {
        Self(index)
    }

    pub fn index(&self) -> usize {
        self.0
    }
}

impl From<usize> for TokenId {
    fn from(value: usize) -> Self {
        Self::new(value)
    }
}

/// A sequence of tokens before it has been converted into training pairs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TokenSequence(Vec<TokenId>);

impl TokenSequence {
    pub fn new(tokens: impl IntoIterator<Item = TokenId>) -> CtResult<Self> {
        let tokens = tokens.into_iter().collect::<Vec<_>>();

        if tokens.is_empty() {
            return Err(CtError::EmptyInput("token sequence"));
        }

        Ok(Self(tokens))
    }

    pub fn from_indices(indices: impl IntoIterator<Item = usize>) -> CtResult<Self> {
        Self::new(indices.into_iter().map(TokenId::new))
    }

    pub fn as_slice(&self) -> &[TokenId] {
        &self.0
    }
}

/// A dense feature vector.
#[derive(Debug, Clone, PartialEq)]
pub struct Vector(Vec<f32>);

impl Vector {
    pub fn new(values: Vec<f32>) -> Self {
        Self(values)
    }

    pub fn as_slice(&self) -> &[f32] {
        &self.0
    }
}

/// Unnormalized model scores.
#[derive(Debug, Clone, PartialEq)]
pub struct Logits(Vec<f32>);

impl Logits {
    pub fn new(values: Vec<f32>) -> Self {
        Self(values)
    }

    pub fn as_slice(&self) -> &[f32] {
        &self.0
    }
}

/// A validated probability distribution.
#[derive(Debug, Clone, PartialEq)]
pub struct Distribution(Vec<f32>);

impl Distribution {
    pub fn new(probabilities: Vec<f32>) -> CtResult<Self> {
        if probabilities.is_empty() {
            return Err(CtError::EmptyInput("distribution"));
        }

        let sum: f32 = probabilities.iter().sum();
        let all_valid = probabilities
            .iter()
            .all(|probability| probability.is_finite() && *probability >= 0.0);

        if !all_valid || !approx_eq(sum, 1.0, 1e-4) {
            return Err(CtError::InvalidProbability("distribution constructor"));
        }

        Ok(Self(probabilities))
    }

    pub fn as_slice(&self) -> &[f32] {
        &self.0
    }
}

/// A non-negative scalar objective value.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Loss(f32);

impl Loss {
    pub fn new(value: f32) -> CtResult<Self> {
        if !value.is_finite() || value < 0.0 {
            return Err(CtError::InvalidLoss(value));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> f32 {
        self.0
    }
}

/// Number of vocabulary entries.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VocabSize(usize);

impl VocabSize {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::EmptyInput("vocabulary"));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Width of each embedding vector.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ModelDimension(usize);

impl ModelDimension {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::EmptyInput("model dimension"));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Positive optimizer step size.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct LearningRate(f32);

impl LearningRate {
    pub fn new(value: f32) -> CtResult<Self> {
        if !value.is_finite() || value <= 0.0 {
            return Err(CtError::InvalidLearningRate(value));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> f32 {
        self.0
    }
}

/// Categorical product object: `A x B`.
#[derive(Debug, Clone, PartialEq)]
pub struct Product<A, B> {
    first: A,
    second: B,
}

impl<A, B> Product<A, B> {
    pub fn new(first: A, second: B) -> Self {
        Self { first, second }
    }

    pub fn first(&self) -> &A {
        &self.first
    }

    pub fn second(&self) -> &B {
        &self.second
    }

    pub fn into_parts(self) -> (A, B) {
        (self.first, self.second)
    }
}

pub type TrainingExample = Product<TokenId, TokenId>;

/// Non-empty next-token training pairs.
#[derive(Debug, Clone, PartialEq)]
pub struct TrainingSet(Vec<TrainingExample>);

impl TrainingSet {
    pub fn new(examples: impl IntoIterator<Item = TrainingExample>) -> CtResult<Self> {
        let examples = examples.into_iter().collect::<Vec<_>>();

        if examples.is_empty() {
            return Err(CtError::EmptyInput("training set"));
        }

        Ok(Self(examples))
    }

    pub fn examples(&self) -> &[TrainingExample] {
        &self.0
    }

    pub fn len(&self) -> usize {
        self.0.len()
    }

    pub fn is_empty(&self) -> bool {
        self.0.is_empty()
    }
}

/// Tiny model parameters for an embedding plus language-model head.
#[derive(Debug, Clone, PartialEq)]
pub struct Parameters {
    pub(crate) embedding: Vec<Vec<f32>>,
    pub(crate) lm_head: Vec<Vec<f32>>,
    pub(crate) bias: Vec<f32>,
}

impl Parameters {
    pub fn init(vocab_size: VocabSize, d_model: ModelDimension) -> Self {
        let vocab_size = vocab_size.value();
        let d_model = d_model.value();

        Self {
            embedding: init_matrix(vocab_size, d_model, 0.2, 1),
            lm_head: init_matrix(d_model, vocab_size, 0.2, 2),
            bias: vec![0.0; vocab_size],
        }
    }

    pub fn vocab_size(&self) -> usize {
        self.bias.len()
    }

    pub fn d_model(&self) -> usize {
        self.embedding.first().map_or(0, Vec::len)
    }

    pub fn embedding_table(&self) -> &[Vec<f32>] {
        &self.embedding
    }

    pub fn lm_head(&self) -> &[Vec<f32>] {
        &self.lm_head
    }

    pub fn bias(&self) -> &[f32] {
        &self.bias
    }
}

pub(crate) fn init_matrix(rows: usize, cols: usize, scale: f32, seed: usize) -> Vec<Vec<f32>> {
    let mut out = vec![vec![0.0; cols]; rows];

    for (row_index, row) in out.iter_mut().enumerate() {
        for (col_index, value) in row.iter_mut().enumerate() {
            let raw = ((row_index * cols + col_index) * 37 + seed * 101) % 1000;
            let unit = raw as f32 / 1000.0;
            *value = (unit - 0.5) * scale;
        }
    }

    out
}

pub(crate) fn approx_eq(a: f32, b: f32, eps: f32) -> bool {
    (a - b).abs() <= eps
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn distribution_rejects_non_normalized_values() {
        let result = Distribution::new(vec![0.4, 0.4]);

        assert!(matches!(result, Err(CtError::InvalidProbability(_))));
    }

    #[test]
    fn token_sequence_rejects_empty_input() {
        let result = TokenSequence::new(vec![]);

        assert!(matches!(result, Err(CtError::EmptyInput("token sequence"))));
    }
}

`src/category.rs`

use std::marker::PhantomData;

use crate::error::CtResult;

/// A typed category-theory arrow: `Input -> Output`.
pub trait Morphism<Input, Output> {
    fn name(&self) -> &'static str;
    fn apply(&self, input: Input) -> CtResult<Output>;
}

/// Identity morphism: `id_A : A -> A`.
#[derive(Debug, Clone, Copy)]
pub struct Identity<T> {
    _marker: PhantomData<T>,
}

impl<T> Identity<T> {
    pub fn new() -> Self {
        Self {
            _marker: PhantomData,
        }
    }
}

impl<T> Default for Identity<T> {
    fn default() -> Self {
        Self::new()
    }
}

impl<T> Morphism<T, T> for Identity<T> {
    fn name(&self) -> &'static str {
        "identity"
    }

    fn apply(&self, input: T) -> CtResult<T> {
        Ok(input)
    }
}

/// Composition of two morphisms: if `f : A -> B` and `g : B -> C`, this is
/// `g after f : A -> C`.
#[derive(Debug, Clone)]
pub struct Compose<F, G, Middle> {
    first: F,
    second: G,
    _middle: PhantomData<Middle>,
}

impl<F, G, Middle> Compose<F, G, Middle> {
    pub fn new(first: F, second: G) -> Self {
        Self {
            first,
            second,
            _middle: PhantomData,
        }
    }
}

impl<Input, Middle, Output, F, G> Morphism<Input, Output> for Compose<F, G, Middle>
where
    F: Morphism<Input, Middle>,
    G: Morphism<Middle, Output>,
{
    fn name(&self) -> &'static str {
        "composition"
    }

    fn apply(&self, input: Input) -> CtResult<Output> {
        let middle = self.first.apply(input)?;
        self.second.apply(middle)
    }
}

/// Endomorphism: a morphism from a type back to itself.
pub trait Endomorphism<T>: Morphism<T, T> {}

impl<T, M> Endomorphism<T> for M where M: Morphism<T, T> {}

/// How many times to repeat an endomorphism.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StepCount(usize);

impl StepCount {
    pub fn new(value: usize) -> Self {
        Self(value)
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Repeatedly apply an endomorphism: `A0 -> A1 -> ... -> An`.
pub fn apply_endomorphism_n_times<T, E>(endo: &E, mut value: T, count: StepCount) -> CtResult<T>
where
    E: Endomorphism<T>,
{
    for _ in 0..count.value() {
        value = endo.apply(value)?;
    }

    Ok(value)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn identity_returns_the_same_value() -> CtResult<()> {
        let value = String::from("same");

        assert_eq!(Identity::<String>::new().apply(value.clone())?, value);
        Ok(())
    }
}

`src/ml.rs`

use crate::category::{Compose, Morphism};
use crate::domain::{
    Distribution, Logits, Loss, Parameters, Product, TokenId, TokenSequence, TrainingSet, Vector,
    approx_eq,
};
use crate::error::{CtError, CtResult};

/// Turns adjacent tokens into next-token training examples.
#[derive(Debug, Clone)]
pub struct DatasetWindowing;

impl Morphism<TokenSequence, TrainingSet> for DatasetWindowing {
    fn name(&self) -> &'static str {
        "dataset_windowing"
    }

    fn apply(&self, tokens: TokenSequence) -> CtResult<TrainingSet> {
        if tokens.as_slice().len() < 2 {
            return Err(CtError::EmptyInput(
                "dataset windowing requires at least 2 tokens",
            ));
        }

        TrainingSet::new(
            tokens
                .as_slice()
                .windows(2)
                .map(|pair| Product::new(pair[0], pair[1])),
        )
    }
}

/// Morphism from token id to embedding vector.
#[derive(Debug, Clone)]
pub struct Embedding {
    table: Vec<Vec<f32>>,
}

impl Embedding {
    pub fn from_parameters(params: &Parameters) -> Self {
        Self {
            table: params.embedding_table().to_vec(),
        }
    }
}

impl Morphism<TokenId, Vector> for Embedding {
    fn name(&self) -> &'static str {
        "embedding"
    }

    fn apply(&self, token: TokenId) -> CtResult<Vector> {
        let Some(row) = self.table.get(token.index()) else {
            return Err(CtError::OutOfRange {
                kind: "token",
                index: token.index(),
                limit: self.table.len(),
            });
        };

        Ok(Vector::new(row.clone()))
    }
}

/// Linear projection from hidden vector to vocabulary logits.
#[derive(Debug, Clone)]
pub struct LinearToLogits {
    weight: Vec<Vec<f32>>,
    bias: Vec<f32>,
}

impl LinearToLogits {
    pub fn from_parameters(params: &Parameters) -> Self {
        Self {
            weight: params.lm_head().to_vec(),
            bias: params.bias().to_vec(),
        }
    }

    pub(crate) fn from_parts(weight: Vec<Vec<f32>>, bias: Vec<f32>) -> Self {
        Self { weight, bias }
    }
}

impl Morphism<Vector, Logits> for LinearToLogits {
    fn name(&self) -> &'static str {
        "linear_to_logits"
    }

    fn apply(&self, input: Vector) -> CtResult<Logits> {
        let d_model = input.as_slice().len();
        let vocab_size = self.bias.len();

        if self.weight.len() != d_model {
            return Err(CtError::ShapeMismatch {
                op: "linear layer",
                expected: format!("weight rows == input dim {d_model}"),
                got: format!("weight rows {}", self.weight.len()),
            });
        }

        let mut out = self.bias.clone();

        for (feature, input_value) in input.as_slice().iter().enumerate() {
            if self.weight[feature].len() != vocab_size {
                return Err(CtError::ShapeMismatch {
                    op: "linear layer",
                    expected: format!("weight cols == vocab size {vocab_size}"),
                    got: format!("weight cols {}", self.weight[feature].len()),
                });
            }

            for (vocab_id, output_value) in out.iter_mut().enumerate() {
                *output_value += input_value * self.weight[feature][vocab_id];
            }
        }

        Ok(Logits::new(out))
    }
}

/// Converts logits to a probability distribution.
#[derive(Debug, Clone)]
pub struct Softmax;

impl Morphism<Logits, Distribution> for Softmax {
    fn name(&self) -> &'static str {
        "softmax"
    }

    fn apply(&self, logits: Logits) -> CtResult<Distribution> {
        if logits.as_slice().is_empty() {
            return Err(CtError::EmptyInput("softmax"));
        }

        let max_value = logits
            .as_slice()
            .iter()
            .copied()
            .fold(f32::NEG_INFINITY, f32::max);
        let mut exps = Vec::with_capacity(logits.as_slice().len());
        let mut sum = 0.0;

        for value in logits.as_slice() {
            let exp = (*value - max_value).exp();
            exps.push(exp);
            sum += exp;
        }

        if sum <= 0.0 || !sum.is_finite() {
            return Err(CtError::InvalidProbability("softmax"));
        }

        Distribution::new(exps.into_iter().map(|value| value / sum).collect())
    }
}

/// Negative log likelihood for `(distribution, target_token)`.
#[derive(Debug, Clone)]
pub struct CrossEntropy;

impl Morphism<Product<Distribution, TokenId>, Loss> for CrossEntropy {
    fn name(&self) -> &'static str {
        "cross_entropy"
    }

    fn apply(&self, input: Product<Distribution, TokenId>) -> CtResult<Loss> {
        let (distribution, target) = input.into_parts();

        let Some(probability) = distribution.as_slice().get(target.index()).copied() else {
            return Err(CtError::OutOfRange {
                kind: "target",
                index: target.index(),
                limit: distribution.as_slice().len(),
            });
        };

        Loss::new(-probability.max(1e-9).ln())
    }
}

/// Direct path used for a commutative-diagram check.
#[derive(Debug, Clone)]
pub struct DirectPredict {
    params: Parameters,
}

impl DirectPredict {
    pub fn new(params: Parameters) -> Self {
        Self { params }
    }
}

impl Morphism<TokenId, Distribution> for DirectPredict {
    fn name(&self) -> &'static str {
        "direct_predict"
    }

    fn apply(&self, token: TokenId) -> CtResult<Distribution> {
        let embedding = Embedding::from_parameters(&self.params).apply(token)?;
        let logits = LinearToLogits::from_parameters(&self.params).apply(embedding)?;
        Softmax.apply(logits)
    }
}

/// Average cross-entropy over a training set.
pub fn average_loss(params: &Parameters, dataset: &TrainingSet) -> CtResult<Loss> {
    let embedding = Embedding::from_parameters(params);
    let linear = LinearToLogits::from_parameters(params);
    let predict = Compose::<_, _, Vector>::new(embedding, linear);
    let predict = Compose::<_, _, Logits>::new(predict, Softmax);
    let loss_fn = CrossEntropy;

    let mut total = 0.0;

    for example in dataset.examples() {
        let distribution = predict.apply(*example.first())?;
        let loss = loss_fn.apply(Product::new(distribution, *example.second()))?;
        total += loss.value();
    }

    Loss::new(total / dataset.len() as f32)
}

/// Verifies that the composed path and direct path produce the same result.
pub fn composed_prediction_matches_direct_prediction(params: &Parameters) -> CtResult<bool> {
    let token = TokenId::new(1);

    let composed = Compose::<_, _, Vector>::new(
        Embedding::from_parameters(params),
        LinearToLogits::from_parameters(params),
    );
    let composed = Compose::<_, _, Logits>::new(composed, Softmax);
    let direct = DirectPredict::new(params.clone());

    let left_path = composed.apply(token)?;
    let right_path = direct.apply(token)?;

    Ok(left_path
        .as_slice()
        .iter()
        .zip(right_path.as_slice().iter())
        .all(|(a, b)| approx_eq(*a, *b, 1e-6)))
}

#[cfg(test)]
mod tests {
    use super::*;
    use crate::domain::{ModelDimension, VocabSize};

    #[test]
    fn dataset_windowing_builds_adjacent_pairs() -> CtResult<()> {
        let tokens = TokenSequence::from_indices([1, 2, 3])?;
        let dataset = DatasetWindowing.apply(tokens)?;

        assert_eq!(dataset.len(), 2);
        assert_eq!(dataset.examples()[0].first().index(), 1);
        assert_eq!(dataset.examples()[0].second().index(), 2);
        Ok(())
    }

    #[test]
    fn composed_and_direct_prediction_match() -> CtResult<()> {
        let params = Parameters::init(VocabSize::new(5)?, ModelDimension::new(4)?);

        assert!(composed_prediction_matches_direct_prediction(&params)?);
        Ok(())
    }
}

`src/training.rs`

use crate::category::Morphism;
use crate::domain::{LearningRate, Parameters, TrainingSet, Vector};
use crate::error::{CtError, CtResult};
use crate::ml::{LinearToLogits, Softmax};

/// One full-batch optimizer update.
///
/// Categorically, this is an endomorphism:
///
/// `Parameters -> Parameters`
#[derive(Debug, Clone)]
pub struct TrainStep {
    dataset: TrainingSet,
    learning_rate: LearningRate,
}

impl TrainStep {
    pub fn new(dataset: TrainingSet, learning_rate: LearningRate) -> Self {
        Self {
            dataset,
            learning_rate,
        }
    }
}

impl Morphism<Parameters, Parameters> for TrainStep {
    fn name(&self) -> &'static str {
        "train_step_endomorphism"
    }

    fn apply(&self, params: Parameters) -> CtResult<Parameters> {
        let vocab_size = params.vocab_size();
        let d_model = params.d_model();

        if vocab_size == 0 || d_model == 0 {
            return Err(CtError::EmptyInput("parameters"));
        }

        let mut grad_embedding = vec![vec![0.0; d_model]; params.embedding.len()];
        let mut grad_lm_head = vec![vec![0.0; vocab_size]; d_model];
        let mut grad_bias = vec![0.0; vocab_size];

        for example in self.dataset.examples() {
            let input_id = example.first().index();
            let target_id = example.second().index();

            if input_id >= params.embedding.len() {
                return Err(CtError::OutOfRange {
                    kind: "input token",
                    index: input_id,
                    limit: params.embedding.len(),
                });
            }

            if target_id >= vocab_size {
                return Err(CtError::OutOfRange {
                    kind: "target token",
                    index: target_id,
                    limit: vocab_size,
                });
            }

            let x = &params.embedding[input_id];
            let logits = LinearToLogits::from_parts(params.lm_head.clone(), params.bias.clone())
                .apply(Vector::new(x.clone()))?;
            let probs = Softmax.apply(logits)?;

            let mut dlogits = probs.as_slice().to_vec();
            dlogits[target_id] -= 1.0;

            for (vocab_id, dlogit) in dlogits.iter().copied().enumerate() {
                grad_bias[vocab_id] += dlogit;

                for (feature, x_feature) in x.iter().copied().enumerate() {
                    grad_lm_head[feature][vocab_id] += x_feature * dlogit;
                }
            }

            for (feature, grad_feature) in grad_embedding[input_id].iter_mut().enumerate() {
                let dx = params.lm_head[feature]
                    .iter()
                    .zip(dlogits.iter())
                    .map(|(weight, dlogit)| weight * dlogit)
                    .sum::<f32>();

                *grad_feature += dx;
            }
        }

        let batch_scale = 1.0 / self.dataset.len() as f32;
        let learning_rate = self.learning_rate.value();
        let mut updated = params.clone();

        for (row, grad_row) in updated.embedding.iter_mut().zip(grad_embedding.iter()) {
            for (value, grad) in row.iter_mut().zip(grad_row.iter()) {
                *value -= learning_rate * grad * batch_scale;
            }
        }

        for (row, grad_row) in updated.lm_head.iter_mut().zip(grad_lm_head.iter()) {
            for (value, grad) in row.iter_mut().zip(grad_row.iter()) {
                *value -= learning_rate * grad * batch_scale;
            }
        }

        for (bias, grad) in updated.bias.iter_mut().zip(grad_bias.iter()) {
            *bias -= learning_rate * grad * batch_scale;
        }

        Ok(updated)
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use crate::category::{StepCount, apply_endomorphism_n_times};
    use crate::domain::{ModelDimension, TokenSequence, VocabSize};
    use crate::ml::{DatasetWindowing, average_loss};

    #[test]
    fn repeated_training_step_reduces_loss() -> CtResult<()> {
        let tokens = TokenSequence::from_indices([1, 2, 3, 4, 1, 2, 3, 4])?;
        let dataset = DatasetWindowing.apply(tokens)?;
        let params = Parameters::init(VocabSize::new(5)?, ModelDimension::new(4)?);
        let before = average_loss(&params, &dataset)?;
        let train_step = TrainStep::new(dataset.clone(), LearningRate::new(1.0)?);
        let trained = apply_endomorphism_n_times(&train_step, params, StepCount::new(80))?;
        let after = average_loss(&trained, &dataset)?;

        assert!(after.value() < before.value());
        Ok(())
    }
}

`src/structure.rs`

/// A minimal functor interface for this tutorial.
pub trait Functor<A, B> {
    type WrappedA;
    type WrappedB;

    fn fmap<F>(wrapped: Self::WrappedA, f: F) -> Self::WrappedB
    where
        F: Fn(A) -> B;
}

pub struct VecFunctor;

impl<A, B> Functor<A, B> for VecFunctor {
    type WrappedA = Vec<A>;
    type WrappedB = Vec<B>;

    fn fmap<F>(wrapped: Vec<A>, f: F) -> Vec<B>
    where
        F: Fn(A) -> B,
    {
        wrapped.into_iter().map(f).collect()
    }
}

pub struct OptionFunctor;

impl<A, B> Functor<A, B> for OptionFunctor {
    type WrappedA = Option<A>;
    type WrappedB = Option<B>;

    fn fmap<F>(wrapped: Option<A>, f: F) -> Option<B>
    where
        F: Fn(A) -> B,
    {
        wrapped.map(f)
    }
}

/// A structure-preserving conversion between wrappers.
pub trait NaturalTransformation<A> {
    type From;
    type To;

    fn transform(from: Self::From) -> Self::To;
}

/// Natural transformation from `Vec<A>` to `Option<A>` by taking the first item.
pub struct VecToFirstOption;

impl<A> NaturalTransformation<A> for VecToFirstOption {
    type From = Vec<A>;
    type To = Option<A>;

    fn transform(from: Vec<A>) -> Option<A> {
        from.into_iter().next()
    }
}

pub fn naturality_square_holds_for_first_option() -> bool {
    let xs = vec![1, 2, 3];
    let f = |x| x * 10;

    let path_top_then_right = VecToFirstOption::transform(VecFunctor::fmap(xs.clone(), f));
    let path_left_then_bottom = OptionFunctor::fmap(VecToFirstOption::transform(xs), f);

    path_top_then_right == path_left_then_bottom
}

pub trait Monoid: Sized {
    fn empty() -> Self;
    fn combine(&self, other: &Self) -> Self;
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TraceStep(&'static str);

impl TraceStep {
    pub fn new(name: &'static str) -> Self {
        Self(name)
    }

    pub fn name(&self) -> &'static str {
        self.0
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PipelineTrace(Vec<TraceStep>);

impl PipelineTrace {
    pub fn from_steps(steps: impl IntoIterator<Item = TraceStep>) -> Self {
        Self(steps.into_iter().collect())
    }

    pub fn names(&self) -> Vec<&'static str> {
        self.0.iter().map(TraceStep::name).collect()
    }
}

impl Monoid for PipelineTrace {
    fn empty() -> Self {
        PipelineTrace(vec![])
    }

    fn combine(&self, other: &Self) -> Self {
        let mut combined = self.0.clone();
        combined.extend_from_slice(&other.0);
        PipelineTrace(combined)
    }
}

pub fn monoid_laws_hold_for_pipeline_trace() -> bool {
    let a = PipelineTrace::from_steps(vec![TraceStep::new("embedding")]);
    let b = PipelineTrace::from_steps(vec![TraceStep::new("linear")]);
    let c = PipelineTrace::from_steps(vec![TraceStep::new("softmax")]);
    let identity = PipelineTrace::empty();

    let left_identity = identity.combine(&a) == a;
    let right_identity = a.combine(&identity) == a;
    let associativity = a.combine(&b).combine(&c) == a.combine(&b.combine(&c));

    left_identity && right_identity && associativity
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn naturality_square_commutes() {
        assert!(naturality_square_holds_for_first_option());
    }

    #[test]
    fn pipeline_trace_obeys_monoid_laws() {
        assert!(monoid_laws_hold_for_pipeline_trace());
    }
}

`src/calculus.rs`

use crate::error::{CtError, CtResult};

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Scalar(f32);

impl Scalar {
    pub fn new(value: f32) -> CtResult<Self> {
        if !value.is_finite() {
            return Err(CtError::InvalidLoss(value));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> f32 {
        self.0
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct LocalGradient(f32);

impl LocalGradient {
    pub fn new(value: f32) -> CtResult<Self> {
        if !value.is_finite() {
            return Err(CtError::InvalidLoss(value));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> f32 {
        self.0
    }
}

#[derive(Debug, Clone, Copy)]
pub struct MulOp;

impl MulOp {
    pub fn forward(&self, x: Scalar, y: Scalar) -> CtResult<Scalar> {
        Scalar::new(x.value() * y.value())
    }

    /// Given upstream gradient dL/dz, return `(dL/dx, dL/dy)` for `z = x * y`.
    pub fn backward(
        &self,
        x: Scalar,
        y: Scalar,
        upstream: LocalGradient,
    ) -> CtResult<(LocalGradient, LocalGradient)> {
        let dz_dx = y.value();
        let dz_dy = x.value();

        Ok((
            LocalGradient::new(upstream.value() * dz_dx)?,
            LocalGradient::new(upstream.value() * dz_dy)?,
        ))
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn multiply_backward_returns_local_chain_rule_gradients() -> CtResult<()> {
        let mul = MulOp;
        let x = Scalar::new(2.0)?;
        let y = Scalar::new(3.0)?;
        let upstream = LocalGradient::new(1.0)?;
        let (dl_dx, dl_dy) = mul.backward(x, y, upstream)?;

        assert_eq!(dl_dx.value(), 3.0);
        assert_eq!(dl_dy.value(), 2.0);
        Ok(())
    }
}

`src/sketches.rs`

//! Rust models for the seven applied-category-theory sketches.
//!
//! This module is a study companion for *Seven Sketches in Compositionality*.
//! It does not try to encode all of category theory. Instead, each section
//! gives one small Rust model for the main structure of a sketch: orders,
//! resources, databases, co-design, signal flow, circuits, and logic of
//! behavior.

use crate::error::{CtError, CtResult};

/// A finite order used for the first sketch: observations can be refined into
/// features, scores, and decisions.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum InformationLevel {
    Observation,
    Feature,
    Score,
    Decision,
}

impl InformationLevel {
    /// Returns true when information at this level can flow into `target`.
    pub fn can_flow_to(self, target: Self) -> bool {
        self <= target
    }

    /// Least upper bound in this small total order.
    pub fn join(self, other: Self) -> Self {
        self.max(other)
    }
}

const INFORMATION_LEVELS: [InformationLevel; 4] = [
    InformationLevel::Observation,
    InformationLevel::Feature,
    InformationLevel::Score,
    InformationLevel::Decision,
];

/// Checks reflexivity and transitivity for the finite information order.
pub fn information_order_obeys_preorder_laws() -> bool {
    for level in INFORMATION_LEVELS {
        if !level.can_flow_to(level) {
            return false;
        }
    }

    for first in INFORMATION_LEVELS {
        for second in INFORMATION_LEVELS {
            for third in INFORMATION_LEVELS {
                let premise = first.can_flow_to(second) && second.can_flow_to(third);
                if premise && !first.can_flow_to(third) {
                    return false;
                }
            }
        }
    }

    true
}

/// Number of concrete features in a tiny model.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct FeatureCount(usize);

impl FeatureCount {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::EmptyInput("feature count"));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Number of abstract layers used to summarize feature capacity.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct LayerBudget(usize);

impl LayerBudget {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::EmptyInput("layer budget"));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

const FEATURES_PER_LAYER: usize = 4;

/// Abstracts concrete features to the minimum layer budget that can hold them.
pub fn abstract_to_layer_budget(features: FeatureCount) -> CtResult<LayerBudget> {
    LayerBudget::new(features.value().div_ceil(FEATURES_PER_LAYER))
}

/// Concretizes an abstract layer budget back to feature capacity.
pub fn concretize_layer_budget(layers: LayerBudget) -> FeatureCount {
    FeatureCount(layers.value() * FEATURES_PER_LAYER)
}

/// Checks the Galois-connection law for the feature/layer abstraction.
pub fn feature_layer_galois_law_holds(
    features: FeatureCount,
    layers: LayerBudget,
) -> CtResult<bool> {
    let abstracted_fits = abstract_to_layer_budget(features)?.value() <= layers.value();
    let concrete_fits = features.value() <= concretize_layer_budget(layers).value();

    Ok(abstracted_fits == concrete_fits)
}

/// A non-negative amount in a resource bundle.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct ResourceAmount(usize);

impl ResourceAmount {
    pub fn new(value: usize) -> Self {
        Self(value)
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Compute and memory resources composed by component-wise addition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ResourceBundle {
    compute: ResourceAmount,
    memory: ResourceAmount,
}

impl ResourceBundle {
    pub fn new(compute: ResourceAmount, memory: ResourceAmount) -> Self {
        Self { compute, memory }
    }

    pub fn compute(&self) -> ResourceAmount {
        self.compute
    }

    pub fn memory(&self) -> ResourceAmount {
        self.memory
    }

    /// Monoidal product for independent resources.
    pub fn tensor(&self, other: &Self) -> Self {
        Self {
            compute: ResourceAmount::new(self.compute.value() + other.compute.value()),
            memory: ResourceAmount::new(self.memory.value() + other.memory.value()),
        }
    }

    /// Resource preorder: supply can cover demand when every component is large
    /// enough.
    pub fn can_supply(&self, demand: &Self) -> bool {
        self.compute >= demand.compute && self.memory >= demand.memory
    }
}

/// Demonstrates monotonicity of the monoidal resource operation.
pub fn resource_tensor_is_monotone() -> bool {
    let small_supply = ResourceBundle::new(ResourceAmount::new(2), ResourceAmount::new(8));
    let large_supply = ResourceBundle::new(ResourceAmount::new(4), ResourceAmount::new(16));
    let fixed = ResourceBundle::new(ResourceAmount::new(1), ResourceAmount::new(2));

    large_supply.can_supply(&small_supply)
        && large_supply
            .tensor(&fixed)
            .can_supply(&small_supply.tensor(&fixed))
}

/// Identifier for a department row.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct DepartmentId(usize);

impl DepartmentId {
    pub fn new(value: usize) -> Self {
        Self(value)
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Identifier for an employee row.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct EmployeeId(usize);

impl EmployeeId {
    pub fn new(value: usize) -> Self {
        Self(value)
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// A row in the employee table. The department field is the schema arrow
/// `Employee -> Department`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct EmployeeRecord {
    id: EmployeeId,
    department: DepartmentId,
}

impl EmployeeRecord {
    pub fn new(id: EmployeeId, department: DepartmentId) -> Self {
        Self { id, department }
    }

    pub fn id(&self) -> EmployeeId {
        self.id
    }

    pub fn department(&self) -> DepartmentId {
        self.department
    }
}

/// A tiny database instance: schema objects become sets of ids, and schema
/// arrows become functions between those sets.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CompanyInstance {
    departments: Vec<DepartmentId>,
    employees: Vec<EmployeeRecord>,
}

impl CompanyInstance {
    pub fn new(
        departments: impl IntoIterator<Item = DepartmentId>,
        employees: impl IntoIterator<Item = EmployeeRecord>,
    ) -> CtResult<Self> {
        let departments = departments.into_iter().collect::<Vec<_>>();
        let employees = employees.into_iter().collect::<Vec<_>>();

        for employee in &employees {
            if !departments.contains(&employee.department()) {
                return Err(CtError::ShapeMismatch {
                    op: "database instance",
                    expected: String::from("employee departments exist in Department"),
                    got: format!("missing department {}", employee.department().value()),
                });
            }
        }

        Ok(Self {
            departments,
            employees,
        })
    }

    pub fn departments(&self) -> &[DepartmentId] {
        &self.departments
    }

    pub fn employees(&self) -> &[EmployeeRecord] {
        &self.employees
    }

    pub fn department_of(&self, employee_id: EmployeeId) -> Option<DepartmentId> {
        self.employees
            .iter()
            .find(|employee| employee.id() == employee_id)
            .map(EmployeeRecord::department)
    }

    pub fn foreign_keys_resolve(&self) -> bool {
        self.employees
            .iter()
            .all(|employee| self.departments.contains(&employee.department()))
    }
}

/// Requests per second needed by a design.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Throughput(usize);

impl Throughput {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::EmptyInput("throughput"));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Millisecond latency boundary.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct LatencyMs(usize);

impl LatencyMs {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::EmptyInput("latency"));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Functional need in a co-design problem.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DesignRequirement {
    minimum_throughput: Throughput,
    maximum_latency: LatencyMs,
}

impl DesignRequirement {
    pub fn new(minimum_throughput: Throughput, maximum_latency: LatencyMs) -> Self {
        Self {
            minimum_throughput,
            maximum_latency,
        }
    }
}

/// Implementation offered by a candidate component.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ImplementationOffer {
    throughput: Throughput,
    latency: LatencyMs,
}

impl ImplementationOffer {
    pub fn new(throughput: Throughput, latency: LatencyMs) -> Self {
        Self {
            throughput,
            latency,
        }
    }
}

/// A Bool-valued feasibility relation between requirements and offers.
#[derive(Debug, Clone, Copy)]
pub struct FeasibilityRelation;

impl FeasibilityRelation {
    pub fn relates(requirement: DesignRequirement, offer: ImplementationOffer) -> bool {
        offer.throughput >= requirement.minimum_throughput
            && offer.latency <= requirement.maximum_latency
    }
}

/// A scalar coefficient in a signal-flow matrix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SignalCoefficient(i32);

impl SignalCoefficient {
    pub fn new(value: i32) -> Self {
        Self(value)
    }

    pub fn zero() -> Self {
        Self(0)
    }

    pub fn value(&self) -> i32 {
        self.0
    }

    fn add(self, other: Self) -> Self {
        Self(self.value() + other.value())
    }

    fn multiply(self, other: Self) -> Self {
        Self(self.value() * other.value())
    }
}

/// Number of rows in a signal-flow matrix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MatrixRows(usize);

impl MatrixRows {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::EmptyInput("matrix rows"));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Number of columns in a signal-flow matrix.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MatrixCols(usize);

impl MatrixCols {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::EmptyInput("matrix columns"));
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Matrix semantics for a signal-flow graph.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SignalMatrix {
    rows: MatrixRows,
    cols: MatrixCols,
    coefficients: Vec<Vec<SignalCoefficient>>,
}

impl SignalMatrix {
    pub fn new(
        rows: MatrixRows,
        cols: MatrixCols,
        coefficients: Vec<Vec<SignalCoefficient>>,
    ) -> CtResult<Self> {
        if coefficients.len() != rows.value() {
            return Err(CtError::ShapeMismatch {
                op: "signal matrix",
                expected: format!("{} rows", rows.value()),
                got: format!("{} rows", coefficients.len()),
            });
        }

        for row in &coefficients {
            if row.len() != cols.value() {
                return Err(CtError::ShapeMismatch {
                    op: "signal matrix",
                    expected: format!("{} columns", cols.value()),
                    got: format!("{} columns", row.len()),
                });
            }
        }

        Ok(Self {
            rows,
            cols,
            coefficients,
        })
    }

    pub fn rows(&self) -> MatrixRows {
        self.rows
    }

    pub fn cols(&self) -> MatrixCols {
        self.cols
    }

    pub fn coefficients(&self) -> &[Vec<SignalCoefficient>] {
        &self.coefficients
    }

    /// Matrix composition. If `previous` is `A -> B` and `self` is `B -> C`,
    /// the result is `A -> C`.
    pub fn compose_after(&self, previous: &Self) -> CtResult<Self> {
        if previous.rows.value() != self.cols.value() {
            return Err(CtError::ShapeMismatch {
                op: "signal matrix composition",
                expected: format!("middle dimension {}", self.cols.value()),
                got: format!("middle dimension {}", previous.rows.value()),
            });
        }

        let mut coefficients =
            vec![vec![SignalCoefficient::zero(); previous.cols.value()]; self.rows.value()];

        for (output_row, output_coefficients) in coefficients.iter_mut().enumerate() {
            for (input_col, coefficient) in output_coefficients.iter_mut().enumerate() {
                let mut total = SignalCoefficient::zero();

                for middle in 0..self.cols.value() {
                    total = total.add(
                        self.coefficients[output_row][middle]
                            .multiply(previous.coefficients[middle][input_col]),
                    );
                }

                *coefficient = total;
            }
        }

        Self::new(self.rows, previous.cols, coefficients)
    }
}

/// A named boundary port in an open circuit.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct PortName(&'static str);

impl PortName {
    pub fn new(value: &'static str) -> CtResult<Self> {
        if value.trim().is_empty() {
            return Err(CtError::EmptyInput("port name"));
        }

        Ok(Self(value))
    }

    pub fn as_str(&self) -> &'static str {
        self.0
    }
}

/// Positive resistance value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ResistanceOhms(usize);

impl ResistanceOhms {
    pub fn new(value: usize) -> CtResult<Self> {
        if value == 0 {
            return Err(CtError::InvalidQuantity {
                kind: "resistance",
                value: value as i64,
            });
        }

        Ok(Self(value))
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// A simple resistor component between two ports.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CircuitComponent {
    from: PortName,
    to: PortName,
    resistance: ResistanceOhms,
}

impl CircuitComponent {
    pub fn resistor(from: PortName, to: PortName, resistance: ResistanceOhms) -> Self {
        Self {
            from,
            to,
            resistance,
        }
    }

    pub fn from(&self) -> PortName {
        self.from
    }

    pub fn to(&self) -> PortName {
        self.to
    }

    pub fn resistance(&self) -> ResistanceOhms {
        self.resistance
    }
}

/// An open circuit with input ports, output ports, and internal components.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OpenCircuit {
    inputs: Vec<PortName>,
    outputs: Vec<PortName>,
    components: Vec<CircuitComponent>,
}

impl OpenCircuit {
    pub fn new(
        inputs: impl IntoIterator<Item = PortName>,
        outputs: impl IntoIterator<Item = PortName>,
        components: impl IntoIterator<Item = CircuitComponent>,
    ) -> CtResult<Self> {
        let inputs = inputs.into_iter().collect::<Vec<_>>();
        let outputs = outputs.into_iter().collect::<Vec<_>>();
        let components = components.into_iter().collect::<Vec<_>>();

        if inputs.is_empty() {
            return Err(CtError::EmptyInput("circuit inputs"));
        }

        if outputs.is_empty() {
            return Err(CtError::EmptyInput("circuit outputs"));
        }

        Ok(Self {
            inputs,
            outputs,
            components,
        })
    }

    pub fn input_count(&self) -> usize {
        self.inputs.len()
    }

    pub fn output_count(&self) -> usize {
        self.outputs.len()
    }

    pub fn component_count(&self) -> usize {
        self.components.len()
    }

    /// Serial composition wires this circuit's outputs into the next circuit's
    /// inputs.
    pub fn then(&self, next: &Self) -> CtResult<Self> {
        if self.output_count() != next.input_count() {
            return Err(CtError::ShapeMismatch {
                op: "open circuit serial composition",
                expected: format!("{} next inputs", self.output_count()),
                got: format!("{} next inputs", next.input_count()),
            });
        }

        let mut components = self.components.clone();
        components.extend_from_slice(&next.components);

        Self::new(self.inputs.clone(), next.outputs.clone(), components)
    }

    /// Parallel composition keeps the two open interfaces side by side.
    pub fn parallel(&self, other: &Self) -> CtResult<Self> {
        let mut inputs = self.inputs.clone();
        inputs.extend_from_slice(&other.inputs);

        let mut outputs = self.outputs.clone();
        outputs.extend_from_slice(&other.outputs);

        let mut components = self.components.clone();
        components.extend_from_slice(&other.components);

        Self::new(inputs, outputs, components)
    }
}

/// Truth values used for behavior classification.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TruthValue {
    False,
    True,
}

impl TruthValue {
    pub fn and(self, other: Self) -> Self {
        match (self, other) {
            (TruthValue::True, TruthValue::True) => TruthValue::True,
            _ => TruthValue::False,
        }
    }

    pub fn implies(self, other: Self) -> Self {
        match (self, other) {
            (TruthValue::True, TruthValue::False) => TruthValue::False,
            _ => TruthValue::True,
        }
    }
}

/// Discrete time point in a behavior trace.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct TimeTick(usize);

impl TimeTick {
    pub fn new(value: usize) -> Self {
        Self(value)
    }

    pub fn value(&self) -> usize {
        self.0
    }
}

/// Closed interval of time ticks.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TimeInterval {
    start: TimeTick,
    end: TimeTick,
}

impl TimeInterval {
    pub fn new(start: TimeTick, end: TimeTick) -> CtResult<Self> {
        if start > end {
            return Err(CtError::InvalidInterval {
                start: start.value(),
                end: end.value(),
            });
        }

        Ok(Self { start, end })
    }

    pub fn start(&self) -> TimeTick {
        self.start
    }

    pub fn end(&self) -> TimeTick {
        self.end
    }
}

/// Local safety result on one interval.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LocalSafetyCheck {
    interval: TimeInterval,
    truth: TruthValue,
}

impl LocalSafetyCheck {
    pub fn new(interval: TimeInterval, truth: TruthValue) -> Self {
        Self { interval, truth }
    }

    pub fn interval(&self) -> TimeInterval {
        self.interval
    }

    pub fn truth(&self) -> TruthValue {
        self.truth
    }
}

/// A cover of local behavior checks. The global result is true only when all
/// local checks are true.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SafetyCover(Vec<LocalSafetyCheck>);

impl SafetyCover {
    pub fn new(checks: impl IntoIterator<Item = LocalSafetyCheck>) -> CtResult<Self> {
        let checks = checks.into_iter().collect::<Vec<_>>();

        if checks.is_empty() {
            return Err(CtError::EmptyInput("safety cover"));
        }

        Ok(Self(checks))
    }

    pub fn global_truth(&self) -> TruthValue {
        self.0
            .iter()
            .fold(TruthValue::True, |truth, check| truth.and(check.truth()))
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn information_order_has_preorder_laws() {
        assert!(information_order_obeys_preorder_laws());
        assert_eq!(
            InformationLevel::Feature.join(InformationLevel::Decision),
            InformationLevel::Decision
        );
    }

    #[test]
    fn feature_layer_abstraction_obeys_galois_law() -> CtResult<()> {
        let features = FeatureCount::new(9)?;
        let layers = LayerBudget::new(3)?;

        assert!(feature_layer_galois_law_holds(features, layers)?);
        assert_eq!(abstract_to_layer_budget(features)?.value(), 3);
        assert_eq!(concretize_layer_budget(layers).value(), 12);
        Ok(())
    }

    #[test]
    fn resource_tensor_is_order_preserving() {
        assert!(resource_tensor_is_monotone());
    }

    #[test]
    fn database_instance_resolves_schema_arrow() -> CtResult<()> {
        let research = DepartmentId::new(1);
        let platform = DepartmentId::new(2);
        let ada = EmployeeId::new(7);
        let instance =
            CompanyInstance::new([research, platform], [EmployeeRecord::new(ada, research)])?;

        assert!(instance.foreign_keys_resolve());
        assert_eq!(instance.department_of(ada), Some(research));
        Ok(())
    }

    #[test]
    fn feasibility_relation_matches_requirement_to_offer() -> CtResult<()> {
        let requirement = DesignRequirement::new(Throughput::new(100)?, LatencyMs::new(80)?);
        let offer = ImplementationOffer::new(Throughput::new(120)?, LatencyMs::new(50)?);

        assert!(FeasibilityRelation::relates(requirement, offer));
        Ok(())
    }

    #[test]
    fn signal_matrices_compose_like_flow_graph_semantics() -> CtResult<()> {
        let duplicate = SignalMatrix::new(
            MatrixRows::new(2)?,
            MatrixCols::new(1)?,
            vec![
                vec![SignalCoefficient::new(1)],
                vec![SignalCoefficient::new(1)],
            ],
        )?;
        let add_weighted = SignalMatrix::new(
            MatrixRows::new(1)?,
            MatrixCols::new(2)?,
            vec![vec![SignalCoefficient::new(2), SignalCoefficient::new(3)]],
        )?;

        let composed = add_weighted.compose_after(&duplicate)?;

        assert_eq!(composed.coefficients(), &[vec![SignalCoefficient::new(5)]]);
        Ok(())
    }

    #[test]
    fn open_circuits_compose_in_series_and_parallel() -> CtResult<()> {
        let input = PortName::new("input")?;
        let middle = PortName::new("middle")?;
        let output = PortName::new("output")?;
        let first = OpenCircuit::new(
            [input],
            [middle],
            [CircuitComponent::resistor(
                input,
                middle,
                ResistanceOhms::new(10)?,
            )],
        )?;
        let second = OpenCircuit::new(
            [middle],
            [output],
            [CircuitComponent::resistor(
                middle,
                output,
                ResistanceOhms::new(20)?,
            )],
        )?;

        let serial = first.then(&second)?;
        let parallel = first.parallel(&second)?;

        assert_eq!(serial.input_count(), 1);
        assert_eq!(serial.output_count(), 1);
        assert_eq!(serial.component_count(), 2);
        assert_eq!(parallel.input_count(), 2);
        assert_eq!(parallel.output_count(), 2);
        Ok(())
    }

    #[test]
    fn local_behavior_truth_glues_to_global_truth() -> CtResult<()> {
        let first = LocalSafetyCheck::new(
            TimeInterval::new(TimeTick::new(0), TimeTick::new(5))?,
            TruthValue::True,
        );
        let second = LocalSafetyCheck::new(
            TimeInterval::new(TimeTick::new(5), TimeTick::new(10))?,
            TruthValue::True,
        );
        let cover = SafetyCover::new([first, second])?;

        assert_eq!(cover.global_truth(), TruthValue::True);
        assert_eq!(
            TruthValue::True.implies(TruthValue::False),
            TruthValue::False
        );
        Ok(())
    }
}

`src/demo.rs`

use crate::calculus::{LocalGradient, MulOp, Scalar};
use crate::category::{Compose, StepCount, apply_endomorphism_n_times};
use crate::domain::{
    LearningRate, Logits, ModelDimension, Parameters, Product, TokenId, TokenSequence, Vector,
    VocabSize,
};
use crate::error::CtResult;
use crate::ml::{
    CrossEntropy, DatasetWindowing, Embedding, LinearToLogits, Softmax, average_loss,
    composed_prediction_matches_direct_prediction,
};
use crate::structure::{
    Functor, Monoid, OptionFunctor, PipelineTrace, TraceStep, VecFunctor,
    monoid_laws_hold_for_pipeline_trace, naturality_square_holds_for_first_option,
};
use crate::training::TrainStep;
use crate::{Identity, Morphism};

/// Run the full terminal walkthrough used by `cargo run --bin category_ml`.
pub fn run_demo() -> CtResult<()> {
    println!("Category theory concepts implemented in Rust 2024");
    println!("=================================================\n");

    let vocab = ["<pad>", "I", "love", "Rust", "."];
    let raw_text = TokenSequence::from_indices([1, 2, 3, 4, 1, 2, 3, 4])?;

    println!("1. Object examples");
    println!("   TokenId(1) means {:?}\n", vocab[1]);

    println!("2. Dataset morphism: TokenSequence -> TrainingSet");
    let dataset = DatasetWindowing.apply(raw_text)?;
    for example in dataset.examples() {
        println!(
            "   {:?} -> {:?}",
            vocab[example.first().index()],
            vocab[example.second().index()]
        );
    }
    println!();

    println!("3. Identity morphism: id_Vector : Vector -> Vector");
    let v = Vector::new(vec![1.0, 2.0, 3.0]);
    let same_v = Identity::<Vector>::new().apply(v.clone())?;
    println!("   input  = {:?}", v);
    println!("   output = {:?}\n", same_v);

    println!("4. Composition: Softmax after Linear after Embedding");
    let params = Parameters::init(VocabSize::new(vocab.len())?, ModelDimension::new(4)?);
    let embedding = Embedding::from_parameters(&params);
    let linear = LinearToLogits::from_parameters(&params);
    let token_to_logits = Compose::<_, _, Vector>::new(embedding, linear);
    let token_to_distribution = Compose::<_, _, Logits>::new(token_to_logits, Softmax);
    let distribution = token_to_distribution.apply(TokenId::new(1))?;
    println!("   P(next token | 'I') = {:?}\n", distribution.as_slice());

    println!("5. Product object: Prediction x Target -> Loss");
    let loss = CrossEntropy.apply(Product::new(distribution, TokenId::new(2)))?;
    println!("   loss for target 'love' = {:.6}\n", loss.value());

    println!("6. Endomorphism: TrainStep : Parameters -> Parameters");
    let before = average_loss(&params, &dataset)?;
    let train_step = TrainStep::new(dataset.clone(), LearningRate::new(1.0)?);
    let trained_params =
        apply_endomorphism_n_times(&train_step, params.clone(), StepCount::new(80))?;
    let after = average_loss(&trained_params, &dataset)?;
    println!("   average loss before training = {:.6}", before.value());
    println!("   average loss after  training = {:.6}\n", after.value());

    println!("7. Functor: fmap over Vec and Option");
    let xs = vec![1, 2, 3];
    let ys = VecFunctor::fmap(xs, |x| x * x);
    let maybe = OptionFunctor::fmap(Some(7), |x| x + 1);
    println!("   VecFunctor fmap square: {:?}", ys);
    println!("   OptionFunctor fmap +1: {:?}\n", maybe);

    println!("8. Natural transformation: Vec<A> -> Option<A>");
    println!(
        "   naturality square holds: {}\n",
        naturality_square_holds_for_first_option()
    );

    println!("9. Monoid: pipeline traces compose associatively with identity");
    let trace = PipelineTrace::from_steps(vec![TraceStep::new("embedding")])
        .combine(&PipelineTrace::from_steps(vec![TraceStep::new("linear")]))
        .combine(&PipelineTrace::from_steps(vec![TraceStep::new("softmax")]));
    println!("   trace = {:?}", trace.names());
    println!(
        "   monoid laws hold for this trace type: {}\n",
        monoid_laws_hold_for_pipeline_trace()
    );

    println!("10. Commutative diagram check");
    println!(
        "   composed prediction == direct prediction: {}\n",
        composed_prediction_matches_direct_prediction(&params)?
    );

    println!("11. Chain rule / local derivative morphism");
    let mul = MulOp;
    let x = Scalar::new(2.0)?;
    let y = Scalar::new(3.0)?;
    let z = mul.forward(x, y)?;
    let upstream = LocalGradient::new(1.0)?;
    let (dl_dx, dl_dy) = mul.backward(x, y, upstream)?;
    println!("   z = x * y = {}", z.value());
    println!(
        "   if dL/dz = {}, then dL/dx = {}, dL/dy = {}\n",
        upstream.value(),
        dl_dx.value(),
        dl_dy.value()
    );

    println!("Compressed categorical training view:");
    println!("   Dataset x Parameters -> Prediction -> Loss -> Gradients -> Updated Parameters");
    println!("   TrainStep is repeated as Parameters0 -> Parameters1 -> ... -> ParametersN");

    Ok(())
}

`src/bin/category_ml.rs`

fn main() -> category_theory_transformer_rs::CtResult<()> {
    category_theory_transformer_rs::run_demo()
}

Runnable Examples

`examples/01_domain_objects.rs`

use category_theory_transformer_rs::{CtResult, DatasetWindowing, Morphism, TokenSequence};

fn main() -> CtResult<()> {
    let tokens = TokenSequence::from_indices([1, 2, 3, 4])?;
    let dataset = DatasetWindowing.apply(tokens)?;

    println!("training pairs:");
    for example in dataset.examples() {
        println!(
            "{} -> {}",
            example.first().index(),
            example.second().index()
        );
    }

    Ok(())
}

`examples/02_morphism_composition.rs`

use category_theory_transformer_rs::{
    Compose, CtResult, Embedding, LinearToLogits, Logits, ModelDimension, Morphism, Parameters,
    Softmax, TokenId, Vector, VocabSize,
};

fn main() -> CtResult<()> {
    let params = Parameters::init(VocabSize::new(5)?, ModelDimension::new(4)?);

    let token_to_logits = Compose::<_, _, Vector>::new(
        Embedding::from_parameters(&params),
        LinearToLogits::from_parameters(&params),
    );
    let token_to_distribution = Compose::<_, _, Logits>::new(token_to_logits, Softmax);

    let distribution = token_to_distribution.apply(TokenId::new(1))?;

    println!("next-token probabilities: {:?}", distribution.as_slice());

    Ok(())
}

`examples/03_training_endomorphism.rs`

use category_theory_transformer_rs::{
    CtResult, DatasetWindowing, LearningRate, ModelDimension, Morphism, Parameters, StepCount,
    TokenSequence, TrainStep, VocabSize, apply_endomorphism_n_times, average_loss,
};

fn main() -> CtResult<()> {
    let tokens = TokenSequence::from_indices([1, 2, 3, 4, 1, 2, 3, 4])?;
    let dataset = DatasetWindowing.apply(tokens)?;
    let params = Parameters::init(VocabSize::new(5)?, ModelDimension::new(4)?);

    let before = average_loss(&params, &dataset)?;
    let train_step = TrainStep::new(dataset.clone(), LearningRate::new(1.0)?);
    let trained = apply_endomorphism_n_times(&train_step, params, StepCount::new(80))?;
    let after = average_loss(&trained, &dataset)?;

    println!("loss before: {:.6}", before.value());
    println!("loss after:  {:.6}", after.value());

    Ok(())
}

`examples/04_structure_and_calculus.rs`

use category_theory_transformer_rs::{
    CtResult, Functor, LocalGradient, Monoid, MulOp, OptionFunctor, PipelineTrace, Scalar,
    TraceStep, VecFunctor, monoid_laws_hold_for_pipeline_trace,
    naturality_square_holds_for_first_option,
};

fn main() -> CtResult<()> {
    let squared = VecFunctor::fmap(vec![1, 2, 3], |x| x * x);
    let shifted = OptionFunctor::fmap(Some(7), |x| x + 1);

    println!("Vec fmap square: {squared:?}");
    println!("Option fmap +1: {shifted:?}");
    println!(
        "naturality square holds: {}",
        naturality_square_holds_for_first_option()
    );

    let trace = PipelineTrace::from_steps(vec![TraceStep::new("embedding")])
        .combine(&PipelineTrace::from_steps(vec![TraceStep::new("linear")]))
        .combine(&PipelineTrace::from_steps(vec![TraceStep::new("softmax")]));

    println!("trace: {:?}", trace.names());
    println!(
        "monoid laws hold: {}",
        monoid_laws_hold_for_pipeline_trace()
    );

    let mul = MulOp;
    let x = Scalar::new(2.0)?;
    let y = Scalar::new(3.0)?;
    let upstream = LocalGradient::new(1.0)?;
    let (dl_dx, dl_dy) = mul.backward(x, y, upstream)?;

    println!("dL/dx: {}", dl_dx.value());
    println!("dL/dy: {}", dl_dy.value());

    Ok(())
}

`examples/05_seven_sketches.rs`

use category_theory_transformer_rs::{
    CircuitComponent, CompanyInstance, CtResult, DepartmentId, DesignRequirement, EmployeeId,
    EmployeeRecord, FeasibilityRelation, FeatureCount, ImplementationOffer, InformationLevel,
    LatencyMs, LayerBudget, LocalSafetyCheck, MatrixCols, MatrixRows, OpenCircuit, PortName,
    ResistanceOhms, ResourceAmount, ResourceBundle, SafetyCover, SignalCoefficient, SignalMatrix,
    Throughput, TimeInterval, TimeTick, TruthValue, abstract_to_layer_budget,
    concretize_layer_budget, feature_layer_galois_law_holds, information_order_obeys_preorder_laws,
    resource_tensor_is_monotone,
};

fn main() -> CtResult<()> {
    println!(
        "orders obey preorder laws: {}",
        information_order_obeys_preorder_laws()
    );
    println!(
        "join(feature, decision): {:?}",
        InformationLevel::Feature.join(InformationLevel::Decision)
    );

    let features = FeatureCount::new(9)?;
    let layers = LayerBudget::new(3)?;
    println!(
        "feature/layer Galois law: {}",
        feature_layer_galois_law_holds(features, layers)?
    );
    println!(
        "abstract 9 features to {} layers; concretize 3 layers to {} features",
        abstract_to_layer_budget(features)?.value(),
        concretize_layer_budget(layers).value()
    );

    let encoder = ResourceBundle::new(ResourceAmount::new(2), ResourceAmount::new(8));
    let decoder = ResourceBundle::new(ResourceAmount::new(3), ResourceAmount::new(10));
    println!("combined resource bundle: {:?}", encoder.tensor(&decoder));
    println!(
        "resource tensor monotone: {}",
        resource_tensor_is_monotone()
    );

    let research = DepartmentId::new(1);
    let platform = DepartmentId::new(2);
    let ada = EmployeeId::new(7);
    let instance =
        CompanyInstance::new([research, platform], [EmployeeRecord::new(ada, research)])?;
    println!(
        "employee {:?} belongs to department {:?}",
        ada,
        instance.department_of(ada)
    );

    let requirement = DesignRequirement::new(Throughput::new(100)?, LatencyMs::new(80)?);
    let offer = ImplementationOffer::new(Throughput::new(120)?, LatencyMs::new(50)?);
    println!(
        "co-design offer feasible: {}",
        FeasibilityRelation::relates(requirement, offer)
    );

    let duplicate = SignalMatrix::new(
        MatrixRows::new(2)?,
        MatrixCols::new(1)?,
        vec![
            vec![SignalCoefficient::new(1)],
            vec![SignalCoefficient::new(1)],
        ],
    )?;
    let add_weighted = SignalMatrix::new(
        MatrixRows::new(1)?,
        MatrixCols::new(2)?,
        vec![vec![SignalCoefficient::new(2), SignalCoefficient::new(3)]],
    )?;
    let composed = add_weighted.compose_after(&duplicate)?;
    println!(
        "signal-flow matrix semantics: {:?}",
        composed.coefficients()
    );

    let input = PortName::new("input")?;
    let middle = PortName::new("middle")?;
    let output = PortName::new("output")?;
    let first_circuit = OpenCircuit::new(
        [input],
        [middle],
        [CircuitComponent::resistor(
            input,
            middle,
            ResistanceOhms::new(10)?,
        )],
    )?;
    let second_circuit = OpenCircuit::new(
        [middle],
        [output],
        [CircuitComponent::resistor(
            middle,
            output,
            ResistanceOhms::new(20)?,
        )],
    )?;
    println!(
        "serial circuit component count: {}",
        first_circuit.then(&second_circuit)?.component_count()
    );

    let safety = SafetyCover::new([
        LocalSafetyCheck::new(
            TimeInterval::new(TimeTick::new(0), TimeTick::new(5))?,
            TruthValue::True,
        ),
        LocalSafetyCheck::new(
            TimeInterval::new(TimeTick::new(5), TimeTick::new(10))?,
            TruthValue::True,
        ),
    ])?;
    println!("global behavior truth: {:?}", safety.global_truth());

    Ok(())
}

Project Configuration

`Cargo.toml`

[package]
name = "category-theory-transformer-rs"
version = "0.1.0"
edition = "2024"

[dependencies]

Companion Lesson Notes

These are the shorter markdown notes kept under lessons/.

`lessons/README.md`

# Category Theory for Tiny ML: Compact Lesson Path

This folder is the compact reading path through the codebase.

For the complete self-contained course, use the chapters in `book/src/`. They
include source snapshots, runnable examples, exercises, a glossary, references,
and the full source appendix.

These compact lessons are kept for quick review. Each lesson is intentionally
short and points to a real Rust file that `cargo` checks.

## The Learning Loop

For each lesson:

1. Read the mental model.
2. Open the named Rust module.
3. Run the named example.
4. Answer the checkpoint before moving on.

## Lessons

1. [Map of the Course](00-map.md)
2. [Domain Objects](01-domain-objects.md)
3. [Morphism and Composition](02-morphisms-composition.md)
4. [The Tiny ML Pipeline](03-ml-pipeline.md)
5. [Training as an Endomorphism](04-training-endomorphism.md)
6. [Functors, Naturality, Monoids, and Chain Rule](05-structure-and-calculus.md)

## Validation

Run the full check:

```bash
bash scripts/check.sh
```

Run one lesson example:

```bash
cargo run --example 03_training_endomorphism
```

`lessons/00-map.md`

# 00 - Map of the Course

## Goal

Learn one idea:

> Category theory is a language for typed transformations.

In this repo, the transformations are tiny ML operations:

- token to vector
- vector to logits
- logits to probabilities
- prediction plus target to loss
- parameters to better parameters

## The Code Map

- `src/domain.rs`: nouns, also called objects
- `src/category.rs`: arrows, identity, composition, endomorphisms
- `src/ml.rs`: ML arrows
- `src/training.rs`: one training step
- `src/structure.rs`: functor, natural transformation, monoid
- `src/calculus.rs`: local derivative example
- `src/demo.rs`: one guided terminal walkthrough

## First Run

```bash
cargo run --bin category_ml
```

You should see a tiny language-model pipeline and the loss decreasing after
training.

## Checkpoint

Before moving on, say this in your own words:

> A morphism is a typed function. Composition lets small typed functions become
> one larger typed pipeline.

`lessons/01-domain-objects.md`

# 01 - Domain Objects

## Mental Model

Objects are the nouns of the system.

In normal Rust code, it is tempting to pass `usize`, `Vec<f32>`, and `f32`
everywhere. That is easy at first, then confusing later. This repo wraps those
values in small domain types.

## Read This File

Open `src/domain.rs`.

Focus only on these types first:

- `TokenId`
- `TokenSequence`
- `Vector`
- `Logits`
- `Distribution`
- `Loss`
- `TrainingSet`
- `Parameters`

## Run the Example

```bash
cargo run --example 01_domain_objects
```

Expected shape:

```text
training pairs:
1 -> 2
2 -> 3
3 -> 4
```

## Why This Matters

`TokenId` and `Loss` are not interchangeable, even if both could be represented
by numbers. The Rust type system keeps those ideas separate.

## Checkpoint

What bug becomes harder when `TokenId` is not just a raw `usize`?

`lessons/02-morphisms-composition.md`

# 02 - Morphism and Composition

## Mental Model

A morphism is a typed arrow:

```text
Input -> Output
```

In Rust, this repo models that with `Morphism<Input, Output>`.

## Read This File

Open `src/category.rs`.

Read in this order:

1. `Morphism`
2. `Identity`
3. `Compose`
4. `Endomorphism`

## Run the Example

```bash
cargo run --example 02_morphism_composition
```

## What to Notice

The pipeline is built from small arrows:

```text
TokenId -> Vector -> Logits -> Distribution
```

`Compose` is the glue. If the middle types do not match, Rust rejects the
program before it runs.

## Checkpoint

Why is `TokenId -> Vector -> Logits` easier to debug than one giant
`predict(...)` function?

`lessons/03-ml-pipeline.md`

# 03 - The Tiny ML Pipeline

## Mental Model

The ML pipeline is a sequence of typed transformations:

```text
TokenSequence -> TrainingSet
TokenId -> Vector
Vector -> Logits
Logits -> Distribution
Distribution x TokenId -> Loss
```

## Read This File

Open `src/ml.rs`.

Read only these structs first:

- `DatasetWindowing`
- `Embedding`
- `LinearToLogits`
- `Softmax`
- `CrossEntropy`

## Run the Full Demo

```bash
cargo run --bin category_ml
```

Look at sections 2 through 5 in the output.

## The Key Detail

`Softmax` validates that its output is a real probability distribution.
`CrossEntropy` validates that the target token is inside the distribution.

Errors happen at the boundary where the bad data is first understood.

## Checkpoint

Where should an out-of-range target token be caught: inside `CrossEntropy`, or
later after loss calculation?

`lessons/04-training-endomorphism.md`

# 04 - Training as an Endomorphism

## Mental Model

An endomorphism maps a thing back to the same kind of thing:

```text
A -> A
```

Training does exactly that:

```text
Parameters -> Parameters
```

The model changes, but it is still a model.

## Read This File

Open `src/training.rs`.

Read in this order:

1. `TrainStep`
2. `impl Morphism<Parameters, Parameters> for TrainStep`
3. The unit test at the bottom

## Run the Example

```bash
cargo run --example 03_training_endomorphism
```

Expected pattern:

```text
loss before: ...
loss after:  ...
```

The second number should be smaller.

## Why This Is Category-Theoretic

`TrainStep` can be repeated because its input type and output type are both
`Parameters`.

That makes this loop type-correct:

```text
Parameters0 -> Parameters1 -> Parameters2 -> ... -> ParametersN
```

## Checkpoint

Why would training be harder to compose if it returned raw vectors instead of
`Parameters`?

`lessons/05-structure-and-calculus.md`

# 05 - Functors, Naturality, Monoids, and Chain Rule

## Mental Model

This lesson gives names to patterns you already use:

- `Functor`: map inside a wrapper without changing the wrapper shape
- `NaturalTransformation`: convert one wrapper shape to another in a consistent way
- `Monoid`: an empty value plus an associative combine operation
- Chain rule: local gradients compose into larger gradients

## Read These Files

Open:

- `src/structure.rs`
- `src/calculus.rs`

## Run the Example

```bash
cargo run --example 04_structure_and_calculus
```

## What to Notice

The examples are tiny on purpose.

`VecFunctor` and `OptionFunctor` are not trying to replace real Rust APIs. They
show the shape of the idea:

```text
keep the container, transform the inside
```

`PipelineTrace` is a monoid because:

- there is an empty trace
- traces can be combined
- grouping does not change the final trace

`MulOp` shows the smallest useful backward pass:

```text
z = x * y
dL/dx = dL/dz * y
dL/dy = dL/dz * x
```

## Checkpoint

Why is "local rule plus composition" the core idea behind backpropagation?

Keyboard shortcuts

Category Theory for Tiny ML in Rust

Category Theory for Tiny ML in Rust