Course objectives
This course introduces the foundations and practices of training modern Large Language Models (LLMs) at scale. Students will learn how deep learning models are trained across multiple GPUs, nodes, and clusters, and why distributed training is the key to enabling today’s largest AI systems.
We will cover:
- Core techniques for distributed training
- Modern frameworks and scaling strategies
- Practical implementations with real-world toolchains
- Theoretical underpinnings of large-scale learning
- Inference and applications
As LLMs grow in complexity and impact, understanding how they are built and deployed has become essential for future researchers and engineers. This series bridges engineering and theory, offering students both the practical skills and deeper insights needed to work with frontier AI systems.
Course organization
Enrollment will be limited to 60 students (external auditors cannot be admitted). Further details about enrollment will be available on the (future) course website.
The course will consist of 8 sessions. The first 7 sessions will each include 2 hours of lectures followed by 2 hours of hands-on lab work. The final session will be dedicated to grading.
- Foundations of Distributed LLM Training
- Hardware and Software Ecosystem
- Parallelism I: Fundamental Techniques
- Parallelism II: Advanced Use Cases
- Synchronization and Communication Strategies
- Inference at Scale
- Data, Evaluation, Metrics, Alignment, Ethics, and RL(HF)
- Poster Session and Final Evaluation
Assessment
Grades will be based on:
- Lab work (25%)
- Homework (25%)
- Poster session (50%), presented during session 8
Edouard OYALLON
CNRS, Sorbonne Université
