Training as an Endomorphism
The problem this chapter solves is:
A model is not only used for prediction. It must also be updated by training, and one update should produce the same kind of object it consumed.
The key shape is:
Parameters -> Parameters
This is an endomorphism.
In ordinary ML terms:
old parameters
-> compute predictions
-> compute loss gradients
-> subtract learning-rate-scaled gradients
-> new parameters
In category-theory terms:
A -> A
Because the input and output type are the same, the step can be repeated.
Reader orientation: do not expect a full backpropagation engine in this chapter. It presents a small, explicit training step whose purpose is to make the shape
Parameters -> Parameters visible and runnable.
What You Already Know
If you have seen gradient descent, you already know the informal movement:
parameters are adjusted and then used again. If you know Rust, you already know
that a function can return the same type it receives. This chapter names that
shape precisely: a training step is an endomorphism on Parameters.
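Here is that shape in a few lines of plain Rust, as a tiny illustration (not code from this crate): a String -> String function chains with itself, while a String -> usize function ends the chain.

fn shout(s: String) -> String {
    s + "!"
}

fn length(s: String) -> usize {
    s.len()
}

fn main() {
    let once = shout("go".to_string()); // "go!"
    let twice = shout(once); // "go!!" -- the output feeds straight back in
    let n = length(twice); // 4 -- a usize cannot feed back into shout
    assert_eq!(n, 4);
}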
Source Snapshot
This file implements one full-batch optimizer update.
Source snapshot: src/training.rs
use crate::category::Morphism;
use crate::domain::{LearningRate, Parameters, TrainingSet, Vector};
use crate::error::{CtError, CtResult};
use crate::ml::{LinearToLogits, Softmax};
/// One full-batch optimizer update.
///
/// Categorically, this is an endomorphism:
///
/// `Parameters -> Parameters`
#[derive(Debug, Clone)]
pub struct TrainStep {
dataset: TrainingSet,
learning_rate: LearningRate,
}
impl TrainStep {
pub fn new(dataset: TrainingSet, learning_rate: LearningRate) -> Self {
Self {
dataset,
learning_rate,
}
}
}
impl Morphism<Parameters, Parameters> for TrainStep {
fn name(&self) -> &'static str {
"train_step_endomorphism"
}
fn apply(&self, params: Parameters) -> CtResult<Parameters> {
let vocab_size = params.vocab_size();
let d_model = params.d_model();
if vocab_size == 0 || d_model == 0 {
return Err(CtError::EmptyInput("parameters"));
}
let mut grad_embedding = vec![vec![0.0; d_model]; params.embedding.len()];
let mut grad_lm_head = vec![vec![0.0; vocab_size]; d_model];
let mut grad_bias = vec![0.0; vocab_size];
for example in self.dataset.examples() {
let input_id = example.first().index();
let target_id = example.second().index();
if input_id >= params.embedding.len() {
return Err(CtError::OutOfRange {
kind: "input token",
index: input_id,
limit: params.embedding.len(),
});
}
if target_id >= vocab_size {
return Err(CtError::OutOfRange {
kind: "target token",
index: target_id,
limit: vocab_size,
});
}
let x = &params.embedding[input_id];
let logits = LinearToLogits::from_parts(params.lm_head.clone(), params.bias.clone())
.apply(Vector::new(x.clone()))?;
let probs = Softmax.apply(logits)?;
let mut dlogits = probs.as_slice().to_vec();
dlogits[target_id] -= 1.0;
for (vocab_id, dlogit) in dlogits.iter().copied().enumerate() {
grad_bias[vocab_id] += dlogit;
for (feature, x_feature) in x.iter().copied().enumerate() {
grad_lm_head[feature][vocab_id] += x_feature * dlogit;
}
}
for (feature, grad_feature) in grad_embedding[input_id].iter_mut().enumerate() {
let dx = params.lm_head[feature]
.iter()
.zip(dlogits.iter())
.map(|(weight, dlogit)| weight * dlogit)
.sum::<f32>();
*grad_feature += dx;
}
}
let batch_scale = 1.0 / self.dataset.len() as f32;
let learning_rate = self.learning_rate.value();
let mut updated = params.clone();
for (row, grad_row) in updated.embedding.iter_mut().zip(grad_embedding.iter()) {
for (value, grad) in row.iter_mut().zip(grad_row.iter()) {
*value -= learning_rate * grad * batch_scale;
}
}
for (row, grad_row) in updated.lm_head.iter_mut().zip(grad_lm_head.iter()) {
for (value, grad) in row.iter_mut().zip(grad_row.iter()) {
*value -= learning_rate * grad * batch_scale;
}
}
for (bias, grad) in updated.bias.iter_mut().zip(grad_bias.iter()) {
*bias -= learning_rate * grad * batch_scale;
}
Ok(updated)
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::category::{StepCount, apply_endomorphism_n_times};
use crate::domain::{ModelDimension, TokenSequence, VocabSize};
use crate::ml::{DatasetWindowing, average_loss};
#[test]
fn repeated_training_step_reduces_loss() -> CtResult<()> {
let tokens = TokenSequence::from_indices([1, 2, 3, 4, 1, 2, 3, 4])?;
let dataset = DatasetWindowing.apply(tokens)?;
let params = Parameters::init(VocabSize::new(5)?, ModelDimension::new(4)?);
let before = average_loss(&params, &dataset)?;
let train_step = TrainStep::new(dataset.clone(), LearningRate::new(1.0)?);
let trained = apply_endomorphism_n_times(&train_step, params, StepCount::new(80))?;
let after = average_loss(&trained, &dataset)?;
assert!(after.value() < before.value());
Ok(())
}
}
The Whole File
src/training.rs defines:
TrainStep
TrainStep::new
impl Morphism<Parameters, Parameters> for TrainStep
unit test proving repeated training reduces loss
The whole file is about one idea:
training is a repeatable typed transformation of model state
Worked Example: Repeating One Update
The smallest first-principles version of a repeated update is a number being moved a little at a time:
#![allow(unused)]
fn main() {
fn step_toward_zero(value: f32, learning_rate: f32) -> f32 {
value - learning_rate * value
}
let once = step_toward_zero(10.0, 0.1);
let twice = step_toward_zero(once, 0.1);
assert!(twice < once);
}
The real training code applies the same repeatable-update idea to Parameters,
not to one scalar. The output stays the same kind of object as the input, so the
update can be run again.
Self-Check
Before reading the full training step, explain why Parameters -> Parameters
is repeatable but Parameters -> Loss is not.
TrainStep
The problem this block solves is:
A training update needs a dataset and a learning rate, and those values should travel together as one configured operation.
The block:
/// One full-batch optimizer update.
///
/// Categorically, this is an endomorphism:
///
/// `Parameters -> Parameters`
#[derive(Debug, Clone)]
pub struct TrainStep {
dataset: TrainingSet,
learning_rate: LearningRate,
}
Rust Syntax
This is a named-field struct.
It stores:
dataset: TrainingSet
learning_rate: LearningRate
Both fields are private.
That means callers cannot directly replace the dataset or learning rate after construction.
The derived traits mean:
Debug -> can be printed for debugging
Clone -> can be explicitly duplicated
TrainingSet is already non-empty.
LearningRate is already finite and positive.
So TrainStep stores validated inputs.
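The validate-at-construction pattern behind those guarantees might look roughly like this for the learning rate. This is a minimal sketch with a simplified String error; the real LearningRate and CtError live in src/domain.rs and src/error.rs and differ in detail.

/// Sketch of a validated newtype: the invariant is checked once at
/// construction, so every later use can trust the stored value.
#[derive(Debug, Clone, Copy)]
pub struct SketchLearningRate(f32);

impl SketchLearningRate {
    pub fn new(value: f32) -> Result<Self, String> {
        if value.is_finite() && value > 0.0 {
            Ok(Self(value))
        } else {
            Err(format!("learning rate must be finite and positive, got {value}"))
        }
    }

    pub fn value(&self) -> f32 {
        self.0
    }
}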
ML Concept
A training step needs:
- examples to learn from
- a step size for parameter updates
The dataset gives the input-target pairs.
The learning rate controls how far the update moves.
Category-Theory Concept
TrainStep is the value that will implement:
Parameters -> Parameters
That makes it an endomorphism on the object Parameters.
TrainStep::new
The problem this block solves is:
Construct a configured training step from already validated pieces.
The block:
impl TrainStep {
pub fn new(dataset: TrainingSet, learning_rate: LearningRate) -> Self {
Self {
dataset,
learning_rate,
}
}
}
Rust Syntax
impl TrainStep defines methods for TrainStep.
The constructor takes ownership of:
dataset
learning_rate
and stores them.
It returns Self, not CtResult<Self>, because the inputs are already
validated domain objects.
No extra validation is needed here.
ML Concept
This is like configuring an optimizer step:
use this dataset
use this learning rate
The actual update happens later in apply.
Category-Theory Concept
The constructor chooses one specific endomorphism from a family:
TrainStep(dataset, learning_rate) : Parameters -> Parameters
Different datasets or learning rates create different update morphisms.
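For example, two configurations built from the same dataset are two distinct update morphisms with the same type signature. This sketch reuses only constructors shown elsewhere in this chapter:

use category_theory_transformer_rs::{
    CtResult, DatasetWindowing, LearningRate, Morphism, TokenSequence, TrainStep,
};

fn main() -> CtResult<()> {
    let tokens = TokenSequence::from_indices([1, 2, 3, 4, 1, 2, 3, 4])?;
    let dataset = DatasetWindowing.apply(tokens)?;

    // Same dataset, different step sizes: two different endomorphisms
    // on Parameters.
    let _cautious = TrainStep::new(dataset.clone(), LearningRate::new(0.01)?);
    let _bold = TrainStep::new(dataset, LearningRate::new(1.0)?);
    Ok(())
}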
Morphism Implementation
The problem this block solves is:
Make TrainStep a real typed arrow from model parameters back to model parameters.
The header:
impl Morphism<Parameters, Parameters> for TrainStep {
Rust Syntax
This says:
TrainStep implements Morphism<Input = Parameters, Output = Parameters>
So the apply method must have this effective shape:
Parameters -> CtResult<Parameters>
The name method:
fn name(&self) -> &'static str {
"train_step_endomorphism"
}
returns a static label for the transformation.
ML Concept
The input Parameters are the current model weights.
The output Parameters are the updated weights after one full-batch step.
Category-Theory Concept
Because the input and output object are the same, TrainStep is an
endomorphism.
That is what lets this work:
Parameters0 -> Parameters1 -> Parameters2 -> ... -> ParametersN
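In code, that chain is nothing more than feeding each output back in. A short sketch using the same constructors as the test at the end of this chapter:

use category_theory_transformer_rs::{
    CtResult, DatasetWindowing, LearningRate, ModelDimension, Morphism, Parameters,
    TokenSequence, TrainStep, VocabSize,
};

fn main() -> CtResult<()> {
    let tokens = TokenSequence::from_indices([1, 2, 3, 4, 1, 2, 3, 4])?;
    let dataset = DatasetWindowing.apply(tokens)?;
    let step = TrainStep::new(dataset, LearningRate::new(1.0)?);

    // Each apply consumes Parameters and returns Parameters, so the output
    // of one call is a valid input to the next.
    let params0 = Parameters::init(VocabSize::new(5)?, ModelDimension::new(4)?);
    let params1 = step.apply(params0)?;
    let params2 = step.apply(params1)?;
    let _params3 = step.apply(params2)?;
    Ok(())
}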
apply: Shape Checks
The problem this block solves is:
Before computing gradients, verify that the parameter object has usable dimensions.
The block:
let vocab_size = params.vocab_size();
let d_model = params.d_model();
if vocab_size == 0 || d_model == 0 {
return Err(CtError::EmptyInput("parameters"));
}
Rust Syntax
The code asks the parameter object for two dimensions.
Then it rejects zero-sized parameters.
This uses an explicit error instead of panicking.
ML Concept
Training cannot run if:
- there are zero possible vocabulary outputs
- hidden vectors have zero width
Those shapes would make the gradient arrays meaningless.
Category-Theory Concept
The endomorphism is only defined on valid Parameters.
Invalid parameter state is rejected before the morphism performs the update.
Gradient Buffers
The problem this block solves is:
Accumulate gradients for every trainable parameter before applying the update.
The block:
let mut grad_embedding = vec![vec![0.0; d_model]; params.embedding.len()];
let mut grad_lm_head = vec![vec![0.0; vocab_size]; d_model];
let mut grad_bias = vec![0.0; vocab_size];
Rust Syntax
These are mutable matrices and vectors initialized to zero.
Their shapes mirror the trainable parameters:
grad_embedding: same row count as embedding, d_model columns
grad_lm_head: d_model x vocab_size
grad_bias: vocab_size
ML Concept
Gradients accumulate how each parameter should change to reduce loss.
The code uses full-batch training: it processes every example, accumulates all gradients, averages them, then updates once.
Category-Theory Concept
The gradient buffers are not the endomorphism itself.
They are internal machinery used to construct the output object in:
Parameters -> Parameters
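Stripped down to a single scalar parameter, the accumulate-then-average pattern looks like this (an illustration, not crate code):

fn main() {
    // Pretend gradient contributions from three training examples.
    let per_example_grads = [0.9_f32, 1.2, 0.3];

    // Full batch: accumulate into a zeroed buffer...
    let mut grad = 0.0_f32;
    for g in per_example_grads {
        grad += g;
    }

    // ...then average and apply a single update.
    let batch_scale = 1.0 / per_example_grads.len() as f32;
    let learning_rate = 0.1_f32;
    let mut parameter = 2.0_f32;
    parameter -= learning_rate * grad * batch_scale;

    assert!((parameter - 1.92).abs() < 1e-6);
}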
Example Loop
The problem this block solves is:
For each training example, compute the local contribution to the parameter gradients.
The loop begins:
for example in self.dataset.examples() {
let input_id = example.first().index();
let target_id = example.second().index();
...
}
Rust Syntax
self.dataset.examples() returns a slice of TrainingExample.
Each example is a Product<TokenId, TokenId>.
So:
example.first()
is the input token.
example.second()
is the target token.
The code extracts raw indices because matrix indexing needs usize.
ML Concept
Each example says:
given input token, predict target token
The training loop calculates how wrong the current model is for that example.
Category-Theory Concept
The example is an element of:
TokenId x TokenId
The training morphism consumes many such product values while building the parameter update.
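A rough sketch of what a product type with two projections looks like (illustrative only; the crate's Product definition may differ):

/// Sketch of a categorical product: it stores both components and
/// exposes the two projections.
pub struct Pair<A, B> {
    first: A,
    second: B,
}

impl<A, B> Pair<A, B> {
    pub fn new(first: A, second: B) -> Self {
        Self { first, second }
    }

    pub fn first(&self) -> &A {
        &self.first
    }

    pub fn second(&self) -> &B {
        &self.second
    }
}

fn main() {
    let example = Pair::new(3_usize, 4_usize);
    // Projections recover each component, like example.first().index()
    // in the training loop.
    assert_eq!(*example.first(), 3);
    assert_eq!(*example.second(), 4);
}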
Token Bounds Checks
The problem this block solves is:
Training examples must refer to tokens that exist in the current parameter shapes.
The checks:
if input_id >= params.embedding.len() {
return Err(CtError::OutOfRange { ... });
}
if target_id >= vocab_size {
return Err(CtError::OutOfRange { ... });
}
Rust Syntax
These are ordinary bounds checks with typed errors.
They prevent invalid indexing into:
- the embedding table
- the vocabulary-sized output vector
ML Concept
An input token must have an embedding row.
A target token must be one of the possible prediction classes.
If either token is outside the model vocabulary, training cannot continue.
Category-Theory Concept
The example must belong to the finite token object that the parameters are currently modeling.
This check keeps the training morphism inside the intended domain.
Forward Pass Inside Training
The problem this block solves is:
To compute gradients, the training step first needs the current prediction.
The block:
let x = &params.embedding[input_id];
let logits = LinearToLogits::from_parts(params.lm_head.clone(), params.bias.clone())
.apply(Vector::new(x.clone()))?;
let probs = Softmax.apply(logits)?;
Rust Syntax
x borrows the embedding row for the input token.
LinearToLogits::from_parts(...) builds a linear projection from the current
weights.
Vector::new(x.clone()) wraps the embedding row as a Vector.
Then:
Vector -> Logits -> Distribution
runs through the same morphism interface as prediction.
ML Concept
This computes the model’s current predicted distribution for one input token.
The gradient depends on the difference between that distribution and the true target.
Category-Theory Concept
Even inside training, prediction is still a composed path:
TokenId -> Vector -> Logits -> Distribution
Training uses that path as part of a larger endomorphism:
Parameters -> Parameters
Logit Gradient
The problem this block solves is:
For softmax plus cross entropy, the gradient with respect to logits is predicted probability minus one-hot target.
The block:
let mut dlogits = probs.as_slice().to_vec();
dlogits[target_id] -= 1.0;
Rust Syntax
The probabilities are copied into a mutable vector.
Then the target class is adjusted by subtracting 1.0.
If:
probs = [0.70, 0.20, 0.10]
target = 1
then:
dlogits = [0.70, -0.80, 0.10]
ML Concept
This is the standard simplified gradient for softmax cross entropy.
It says:
- decrease the scores that are too high
- increase the target score if it was too low
Category-Theory Concept
This is local derivative information for one part of the composed prediction path.
The next loops compose that local derivative back into parameter gradients.
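The two lines are small enough to check by hand. Here is the same computation as a standalone function over plain slices, verified against the worked example above:

/// Softmax-plus-cross-entropy gradient with respect to the logits:
/// predicted probabilities minus the one-hot target.
fn logit_gradient(probs: &[f32], target_id: usize) -> Vec<f32> {
    let mut dlogits = probs.to_vec();
    dlogits[target_id] -= 1.0;
    dlogits
}

fn main() {
    // Target class 1 gets a negative gradient (its score should rise);
    // the other classes get positive gradients (their scores should fall).
    let dlogits = logit_gradient(&[0.70, 0.20, 0.10], 1);
    assert_eq!(dlogits, vec![0.70, -0.80, 0.10]);
}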
Output-Head And Bias Gradients
The problem this block solves is:
Convert the logit gradient into gradients for the output matrix and bias.
The core loop:
for (vocab_id, dlogit) in dlogits.iter().copied().enumerate() {
grad_bias[vocab_id] += dlogit;
for (feature, x_feature) in x.iter().copied().enumerate() {
grad_lm_head[feature][vocab_id] += x_feature * dlogit;
}
}
Rust Syntax
The outer loop visits every vocabulary output.
The inner loop visits every feature of the input vector.
The bias gradient is just the logit gradient.
The weight gradient is:
input feature * output gradient
ML Concept
For a linear layer:
logits = xW + b
the gradient of a weight is:
input activation * output gradient
This is the same pattern used in larger neural networks.
Category-Theory Concept
This is the local backward map for the affine projection stage.
It translates changes needed at the output object Logits into changes in the
parameter object.
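Concretely, the accumulation is an outer product of the input vector with the logit gradient. A standalone sketch with plain Vecs:

/// Gradient of logits = xW + b with respect to W: each (feature, vocab)
/// cell receives input activation times output gradient.
fn weight_gradient(x: &[f32], dlogits: &[f32]) -> Vec<Vec<f32>> {
    x.iter()
        .map(|&x_feature| dlogits.iter().map(|&dlogit| x_feature * dlogit).collect())
        .collect()
}

fn main() {
    let x = [2.0_f32, -1.0];
    let dlogits = [0.5_f32, -0.5, 0.0];
    // Shape d_model x vocab_size, matching grad_lm_head in the training code.
    let grad_w = weight_gradient(&x, &dlogits);
    assert_eq!(grad_w, vec![vec![1.0, -1.0, 0.0], vec![-0.5, 0.5, 0.0]]);
}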
Embedding Gradient
The problem this block solves is:
Move the output error backward through the language-model head to the input embedding row.
The block:
for (feature, grad_feature) in grad_embedding[input_id].iter_mut().enumerate() {
let dx = params.lm_head[feature]
.iter()
.zip(dlogits.iter())
.map(|(weight, dlogit)| weight * dlogit)
.sum::<f32>();
*grad_feature += dx;
}
Rust Syntax
The loop mutates the gradient row for the input token.
For each feature, it pairs:
weights from that feature to every vocab output
dlogits for every vocab output
Then it sums:
weight * dlogit
ML Concept
This is backpropagation through the linear head.
It tells the embedding row how it should change so the future logits improve.
Only the row for the current input token receives an embedding gradient.
Category-Theory Concept
This is another local backward map.
The training endomorphism is built by composing local derivative information from output back toward parameters.
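The per-feature sum is a matrix-vector product between the head weights and the logit gradient. A sketch:

/// Gradient flowing back to the embedding row: for each feature, the dot
/// product of that feature's outgoing weights with the logit gradient.
fn embedding_gradient(lm_head: &[Vec<f32>], dlogits: &[f32]) -> Vec<f32> {
    lm_head
        .iter()
        .map(|row| row.iter().zip(dlogits.iter()).map(|(w, d)| w * d).sum())
        .collect()
}

fn main() {
    // lm_head is d_model x vocab_size; dlogits has one entry per vocab output.
    let lm_head = vec![vec![1.0_f32, 0.0, -1.0], vec![0.5, 0.5, 0.0]];
    let dlogits = [0.5_f32, -0.5, 0.0];
    let dx = embedding_gradient(&lm_head, &dlogits);
    assert_eq!(dx, vec![0.5, 0.0]);
}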
Parameter Update
The problem this block solves is:
Turn accumulated gradients into new parameters.
The code computes:
let batch_scale = 1.0 / self.dataset.len() as f32;
let learning_rate = self.learning_rate.value();
let mut updated = params.clone();
Then it subtracts scaled gradients from every parameter.
Rust Syntax
batch_scale averages the accumulated gradients.
learning_rate extracts the raw scalar.
updated = params.clone() creates the output parameter object.
The following loops mutate updated, not the original params.
Finally:
Ok(updated)
returns the new model state.
ML Concept
The update rule is:
parameter_new = parameter_old - learning_rate * average_gradient
This is gradient descent.
Category-Theory Concept
The final result has the same object type as the input:
Parameters -> Parameters
That completes the endomorphism.
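The same clone-then-mutate pattern on a plain vector, as a minimal illustration of why the input value survives and the output keeps its type:

fn main() {
    let params = vec![1.0_f32, 2.0, 3.0];
    let grads = vec![1.0_f32, -2.0, 0.0];
    let learning_rate = 0.5_f32;

    // Clone first, then mutate the clone: params itself is never changed,
    // and updated has exactly the same type as params.
    let mut updated = params.clone();
    for (value, grad) in updated.iter_mut().zip(grads.iter()) {
        *value -= learning_rate * grad;
    }

    assert_eq!(params, vec![1.0, 2.0, 3.0]);
    assert_eq!(updated, vec![0.5, 3.0, 3.0]);
}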
Regression Test
The problem this block solves is:
Prove the learner-visible promise that repeated training reduces loss on the tiny dataset.
The test:
#[test]
fn repeated_training_step_reduces_loss() -> CtResult<()> {
let tokens = TokenSequence::from_indices([1, 2, 3, 4, 1, 2, 3, 4])?;
let dataset = DatasetWindowing.apply(tokens)?;
let params = Parameters::init(VocabSize::new(5)?, ModelDimension::new(4)?);
let before = average_loss(&params, &dataset)?;
let train_step = TrainStep::new(dataset.clone(), LearningRate::new(1.0)?);
let trained = apply_endomorphism_n_times(&train_step, params, StepCount::new(80))?;
let after = average_loss(&trained, &dataset)?;
assert!(after.value() < before.value());
Ok(())
}
Rust Syntax
The test returns CtResult<()>, so it can use ?.
It builds a token sequence, turns it into a training set, initializes parameters, and configures a training step.
Then it applies the endomorphism 80 times and checks the loss decreased.
ML Concept
This is not a benchmark.
It is a sanity check:
training should make the tiny model better on the tiny data
Category-Theory Concept
The test exercises repeated endomorphism application:
Parameters0 -> Parameters1 -> ... -> Parameters80
Run The Example
Source snapshot: examples/03_training_endomorphism.rs
use category_theory_transformer_rs::{
CtResult, DatasetWindowing, LearningRate, ModelDimension, Morphism, Parameters, StepCount,
TokenSequence, TrainStep, VocabSize, apply_endomorphism_n_times, average_loss,
};
fn main() -> CtResult<()> {
let tokens = TokenSequence::from_indices([1, 2, 3, 4, 1, 2, 3, 4])?;
let dataset = DatasetWindowing.apply(tokens)?;
let params = Parameters::init(VocabSize::new(5)?, ModelDimension::new(4)?);
let before = average_loss(&params, &dataset)?;
let train_step = TrainStep::new(dataset.clone(), LearningRate::new(1.0)?);
let trained = apply_endomorphism_n_times(&train_step, params, StepCount::new(80))?;
let after = average_loss(&trained, &dataset)?;
println!("loss before: {:.6}", before.value());
println!("loss after: {:.6}", after.value());
Ok(())
}
Run:
cargo run --example 03_training_endomorphism
Expected pattern:
loss before: ...
loss after: ...
The second number should be smaller.
Core Mental Model
In Rust terms:
TrainStep implements Morphism<Parameters, Parameters>
In ML terms:
one full-batch gradient descent update
In category-theory terms:
an endomorphism that can be iterated
Checkpoint
Why is it useful that training returns Parameters instead of a raw matrix?
A strong answer:
Because the output can immediately be used as the input to the next
TrainStep, preserving the Parameters -> Parameters endomorphism shape.
Where This Leaves Us
This chapter turned training into a repeatable typed transformation. The model
state enters as Parameters, the training step computes gradients from the tiny
dataset, and the updated model state leaves as Parameters again.
The next chapter steps back from the training loop and names reusable structures that appear across the whole course: mapping inside wrappers, changing wrapper shapes consistently, combining traces, and composing local derivative rules.
Further Reading
These pages give the terms behind the training update:
- Glossary: endomorphism, parameters, learning rate, gradient
- References: gradient descent, softmax regression, and Rust error handling
Retrieval Practice
Recall
What makes TrainStep an endomorphism?
Explain
Why does the training code validate token bounds before accumulating gradients?
Apply
If you changed StepCount::new(80) to StepCount::new(1), what would you
expect to happen to the loss, and why?