Course Objectives
This course introduces the foundations and practices of training modern Large Language Models (LLMs) at scale. Students will learn how deep learning models are trained across multiple GPUs, nodes, and clusters, and why distributed training is essential for today’s largest AI systems.
We will cover:
- Core techniques for distributed training
- Modern frameworks and scaling strategies
- Practical implementations with real-world toolchains
- Theoretical underpinnings of large-scale learning
- Inference and applications
As LLMs grow in complexity and impact, understanding how they are built and deployed has become essential for researchers and engineers. The course bridges engineering and theory, offering students both the practical skills and the deeper insights needed to work with frontier AI systems.
To learn more: https://training-large-models-course.github.io/
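To give a flavor of the hands-on work, below is a minimal sketch of single-node data-parallel training with PyTorch's DistributedDataParallel. PyTorch, the toy model, and the launch command are illustrative assumptions; the actual toolchain and lab instructions are on the class webpage.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE and the rendezvous
    # variables that init_process_group reads from the environment.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    # Toy model standing in for an LLM (assumption, for illustration only).
    model = torch.nn.Linear(128, 128).to(device)
    model = DDP(model, device_ids=[local_rank] if device.type == "cuda" else None)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for _ in range(10):
        x = torch.randn(32, 128, device=device)
        loss = model(x).pow(2).mean()  # dummy objective on random data
        optimizer.zero_grad()
        loss.backward()                # DDP all-reduces gradients across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A script like this could be launched with, e.g., `torchrun --nproc_per_node=2 ddp_sketch.py`, which spawns one process per GPU and sets the environment variables the script reads.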
Session Organization
The course will consist of 8 sessions. The first 7 sessions will each include 2 hours of lectures followed by 2 hours of hands-on lab work. The final session will be dedicated to grading.
- Getting Started on Distributed LLM Training
- Systems for ML
- Multi-GPU Parallelization Techniques
- Communication-Efficient Distributed Optimization
- Post-Training
- Serving and Deployment
- Agentic AI (tentative)
- Grading
Bring your laptop and follow the class webpage to install the required libraries (a GPU-enabled setup is preferred when available) and to find all project and homework instructions.
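As a quick way to verify your installation before the first lab, a snippet along these lines can confirm that a GPU is visible. PyTorch is assumed here purely for illustration; the authoritative library list is on the class webpage.

```python
import torch

# Report the installed PyTorch version and whether a CUDA GPU is visible.
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device:          {torch.cuda.get_device_name(0)}")
```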
Assessment
Grades will be based on:
- Homework 1 (25%)
- Homework 2 (25%)
- Project (50%)
Edouard OYALLON
CNRS, Sorbonne Université