11-17-2025, 01:11 PM
Thread 5 — Training Large Models: Optimisers, Learning Rates & Loss Landscapes
The Hidden Mechanics Behind Model Training
Modern AI models aren’t just built — they’re grown, shaped through millions of tiny adjustments.
This thread explains the advanced machinery behind training deep models.
1. The Loss Landscape
A model’s training loss can be pictured as a giant multidimensional surface over its weights.
Each point on the surface corresponds to one particular set of weights, and its height is the loss at those weights.
The goal of training:
find low valleys (good solutions) on this landscape.
This landscape is:
• huge
• chaotic
• full of ridges, basins, and flat regions
Understanding it is key to training powerful models.
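As a concrete (if tiny) illustration, here is a minimal PyTorch sketch that probes a 1-D slice of that surface: it moves the weights along a random direction and records the loss at each point. `model`, `loss_fn`, `x`, and `y` are assumed placeholders for your own network, loss, and a batch of data.

```python
import torch

def loss_along_direction(model, loss_fn, x, y, steps=21, radius=1.0):
    """Evaluate the loss along one random straight line through the
    current weights: a 1-D slice of the loss landscape."""
    params = list(model.parameters())
    direction = [torch.randn_like(p) for p in params]      # random direction in weight space
    originals = [p.detach().clone() for p in params]       # remember where we started

    losses = []
    for alpha in torch.linspace(-radius, radius, steps):
        with torch.no_grad():
            for p, d, o in zip(params, direction, originals):
                p.copy_(o + alpha * d)                      # move to a nearby point
            losses.append(loss_fn(model(x), y).item())      # height of the surface there
    with torch.no_grad():                                   # restore the original weights
        for p, o in zip(params, originals):
            p.copy_(o)
    return losses
```

Plotting the returned losses against the step values gives a rough cross-section of the landscape around the current weights.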
2. Gradient Descent — The Core Idea
At each step:
• compute the gradient of the loss (it points in the direction of steepest ascent)
• move the weights a small step the other way, i.e. slightly downhill
Basic form:
SGD — Stochastic Gradient Descent: w ← w − η · ∇L(w), where η is the learning rate and the gradient is estimated on a random mini-batch (that is the “stochastic” part).
Simple but powerful.
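To make the update rule concrete, here is a hand-rolled SGD step in PyTorch, roughly what torch.optim.SGD does internally without momentum. `model`, `loss_fn`, and the batch tensors are placeholders.

```python
import torch

def sgd_step(model, loss_fn, x_batch, y_batch, lr=0.01):
    """One stochastic gradient descent step: gradient, then a small move downhill."""
    loss = loss_fn(model(x_batch), y_batch)   # loss on this mini-batch
    model.zero_grad()
    loss.backward()                           # compute gradients of the loss
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad              # step downhill, scaled by the learning rate
    return loss.item()
```

In practice you would use torch.optim.SGD rather than writing this yourself; the point is that one step is just “gradient, then a small move against it”.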
3. Advanced Optimisers
Modern models use smarter algorithms:
• Adam — adapts each parameter’s step size using running estimates of the gradient’s first and second moments
• AdamW — Adam with decoupled weight decay (the decay is applied directly to the weights rather than folded into the gradient)
• RMSProp — divides updates by a running average of squared gradients, which stabilises learning
• LAMB / Lion — designed for very large-batch and extremely large-model training
Optimisers improve training speed and stability.
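A typical setup, sketched with PyTorch’s built-in AdamW. The hyperparameter values are illustrative only, and `model`, `loader`, and `loss_fn` are assumed to exist elsewhere.

```python
import torch

# AdamW with decoupled weight decay; values here are illustrative, not prescriptive.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),     # running-average coefficients for the two moment estimates
    weight_decay=0.1,      # applied directly to the weights, not via the gradient
)

for x, y in loader:        # `loader` is a placeholder DataLoader
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```

The weight_decay argument here shrinks the weights directly at each step, which is exactly the “decoupled” behaviour that distinguishes AdamW from plain Adam.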
4. Learning Rate Scheduling
The learning rate controls the “step size” during training.
Too high → unstable
Too low → painfully slow
Schedulers include:
• warmup
• cosine decay
• exponential decay
• cyclical schedules
These dramatically improve performance.
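One common combination is linear warmup followed by cosine decay, sketched below as a LambdaLR multiplier in PyTorch. `optimizer` and the step counts are placeholders for whatever the real run uses.

```python
import math
import torch

# Linear warmup then cosine decay, expressed as a multiplier on the base learning rate.
def warmup_cosine(step, warmup_steps=1_000, total_steps=100_000):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                  # ramp 0 -> 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))       # 1 -> 0 along a cosine curve

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

# Inside the training loop, call scheduler.step() after each optimizer.step().
```

Warmup keeps the earliest updates small while the optimiser’s statistics settle; the cosine tail then gently anneals the rate towards zero.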
5. Batch Size Effects
Small batches:
• noisier gradient estimates
• often better generalisation
Large batches:
• smoother, more stable gradients
• better hardware utilisation, so faster training
• standard for huge models (usually paired with learning-rate warmup and scaling)
Choosing the batch size, and matching the learning rate to it, is an empirical science of its own.
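A related trick worth knowing is gradient accumulation, which emulates a large batch when memory only allows small ones. A minimal sketch, again with placeholder `model`, `loss_fn`, `optimizer`, and `loader`:

```python
# Gradient accumulation: sum gradients over several small batches before
# each optimiser step, so the effective batch is accum_steps times larger.
accum_steps = 8

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps   # scale so the accumulated sum is an average
    loss.backward()                             # gradients accumulate in p.grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```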
6. Regularisation Techniques
Used to prevent overfitting:
• dropout (randomly zeroes activations during training)
• weight decay (penalises large weights)
• label smoothing (softens the one-hot targets)
• data augmentation (trains on perturbed copies of the inputs)
Essential for robust models.
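Here is a small sketch of how the first three plug into a PyTorch model, with a comment showing where data augmentation would go. The architecture and numbers are illustrative only.

```python
import torch
import torch.nn as nn

# Dropout, label smoothing, and weight decay wired into a toy classifier.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.1),                                  # dropout: randomly zero activations
    nn.Linear(256, 10),
)

loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)      # label smoothing on the targets

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              weight_decay=0.01)        # weight decay on the parameters

# Data augmentation lives on the input side, e.g. with torchvision:
# transforms.Compose([transforms.RandomCrop(28, padding=2),
#                     transforms.RandomHorizontalFlip(),
#                     transforms.ToTensor()])
```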
7. Training Large Language Models
LLMs require:
• distributed training across many GPUs
• data, tensor and pipeline parallelism
• mixed precision (FP16/BF16)
• gradient checkpointing (recomputing activations to save memory)
And enormous compute power.
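A rough sketch of the single-GPU pieces of that list: BF16 mixed precision via autocast plus gradient checkpointing on one block. Everything named here (`block`, `head`, `loss_fn`, `optimizer`, `loader`) is a placeholder, and the distributed/parallel machinery is deliberately left out.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Mixed precision forward pass plus activation recomputation for one block.
for x, y in loader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        # Recompute the block's activations during the backward pass instead
        # of storing them, trading extra compute for lower memory use.
        hidden = checkpoint(block, x, use_reentrant=False)
        loss = loss_fn(head(hidden), y)
    loss.backward()
    optimizer.step()
```

With BF16 a loss scaler is usually unnecessary; FP16 training typically adds torch.cuda.amp.GradScaler on top of this loop.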
Final Thoughts
Behind every modern AI is a complex training system.
Understanding these tools gives insight into the engineering that powers today’s intelligent models.