Course Objectives
This course introduces the foundations and practices of training modern Large Language Models (LLMs) at scale. Students will learn how deep learning models are trained across multiple GPUs, nodes, and clusters, and why distributed training is essential for today’s largest AI systems.
We will cover:
- Core techniques for distributed training
- Modern frameworks and scaling strategies
- Practical implementations with real-world toolchains
- Theoretical underpinnings of large-scale learning
- Inference and applications
As LLMs grow in complexity and impact, understanding how they are built and deployed has become essential for researchers and engineers. The course bridges engineering and theory, offering students both the practical skills and the deeper insights needed to work with frontier AI systems.
To learn more: https://training-large-models-course.github.io/
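To give a flavor of the hands-on work, below is a minimal sketch of single-node data-parallel training with PyTorch's DistributedDataParallel. PyTorch, the toy model, and the launch command are illustrative assumptions; the actual toolchain and lab instructions are on the class webpage.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE and the rendezvous
    # variables that init_process_group reads from the environment.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    # Toy model standing in for an LLM (assumption, for illustration only).
    model = torch.nn.Linear(128, 128).to(device)
    model = DDP(model, device_ids=[local_rank] if device.type == "cuda" else None)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for _ in range(10):
        x = torch.randn(32, 128, device=device)
        loss = model(x).pow(2).mean()  # dummy objective on random data
        optimizer.zero_grad()
        loss.backward()                # DDP all-reduces gradients across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A script like this could be launched with, e.g., `torchrun --nproc_per_node=2 ddp_sketch.py`, which spawns one process per GPU and sets the environment variables the script reads.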
Session Organization
The course will consist of 8 sessions. The first 7 sessions will each include 2 hours of lectures followed by 2 hours of hands-on lab work. The final session will be dedicated to grading.
- Getting Started on Distributed LLM Training
- Systems for ML
- Multi-GPU Parallelization Techniques
- Communication-Efficient Distributed Optimization
- Post-Training
- Serving and Deployment
- Agentic AI (tentative)
- Grading
Bring your laptop and follow the class webpage to install the required libraries (a GPU-enabled setup is preferred when available) and to find all project and homework instructions.
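As a quick way to verify your installation before the first lab, a snippet along these lines can confirm that a GPU is visible. PyTorch is assumed here purely for illustration; the authoritative library list is on the class webpage.

```python
import torch

# Report the installed PyTorch version and whether a CUDA GPU is visible.
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device:          {torch.cuda.get_device_name(0)}")
```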
Assessment
Grades will be based on:
- Homework 1 (25%)
- Homework 2 (25%)
- Project (50%)
Edouard OYALLON
CNRS, Sorbonne Université