Course objectives
This course introduces the foundations and practices of training modern Large Language Models (LLMs) at scale. Students will learn how deep learning models are trained across multiple GPUs, nodes, and clusters, and why distributed training is the key to enabling today’s largest AI systems.
We will cover:
- Core techniques for distributed training
- Modern frameworks and scaling strategies
- Practical implementations with real-world toolchains
- Theoretical underpinnings of large-scale learning
- Inference and applications
As LLMs grow in complexity and impact, understanding how they are built and deployed has become essential for future researchers and engineers. This series bridges engineering and theory, offering students both the practical skills and deeper insights needed to work with frontier AI systems.
Course organization
Enrollment will be limited to 60 students (external auditors cannot be admitted). Further details about enrollment will be available on the (future) course website.
The course will consist of 8 sessions. The first 7 sessions will each include 2 hours of lectures followed by 2 hours of hands-on lab work. The final session will be dedicated to grading.
- Foundations of Distributed LLM Training
- Hardware and Software Ecosystem
- Parallelism I: Fundamental Techniques
- Parallelism II: Advanced Use Cases
- Synchronization and Communication Strategies
- Inference at Scale
- Data, Evaluation, Metrics, Alignment, Ethics, and RL(HF)
- Poster Session and Final Evaluation
Assessment
Grades will be based on:
- Lab work (25%)
- Homework (25%)
- Poster session (50%), presented during session 8
Edouard OYALLON
CNRS, Sorbonne Université
