Turing Seminar – An introduction to AGI Safety
L. DANA, C-R. SEGERIE
Opening

Course objective

The rapid advances in artificial intelligence show no signs of slowing down. From GPT-2 to GPT-5, we remain uncertain where this race for performance is leading us, though many experts warn of potentially catastrophic risks. The highly publicized launches of powerful chatbots are only the tip of the iceberg. In the first part of the course, we will review these advances and explore where they might lead. Informed by scaling laws and expert forecasts, we will see that Artificial General Intelligence (AGI) could emerge within this decade.
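For readers unfamiliar with the term, "scaling laws" are empirical regularities relating model performance to training resources. As a rough illustration (constants indicative rather than definitive, following Kaplan et al., 2020), language-model loss has fallen as a power law in training compute C:

    L(C) \approx (C_c / C)^{\alpha_C}, with \alpha_C on the order of 0.05,

so each order-of-magnitude increase in compute has, so far, bought a predictable reduction in loss; extrapolating such curves is one input to the forecasts discussed in this course.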

While these technologies are enabling breakthroughs in biomedical research, breaking down language barriers, and reducing administrative burdens, significant technical challenges remain in building general-purpose models that are both reliable and safe. The second part of the course will discuss the benefits and risks of future AGIs.

These risks stem from both our limited understanding of the models we create and the lack of international coordination to avoid competitive races and ensure the safe deployment of powerful systems. Assessing which dangerous capabilities a model has—and how they can be elicited—requires careful evaluation, often of closed-source systems. The final part of the course will examine solutions for controlling AIs in a robust way: understanding, evaluating, and regulating them. We will explicitly discuss how engineers can contribute through research and machine learning expertise.

Session organization

Seven two-hour sessions plus a project. Before each session, students must read the pedagogical resources listed in the syllabus (https://docs.google.com/document/d/19K2T62EHiZPZeBpa8K-5a64a84YhL8Ly8C_Ck9MfLm4/edit?tab=t.0).

Activities during sessions: paper presentations, guest talks from experts, research projects, and occasionally debates and discussions.

Indicative program:

    • Capabilities and Forecasts: State-of-the-art AI architectures, benchmark saturation, development forecasts, and a survey of current capabilities.

    • Emerging Risks from AI Systems: Misuse, misalignment, and systemic risks.

    • Developing Safe AIs: Modern proposals for making AI safer and better aligned with human values.

    • Evaluating AIs: How to assess models for forecasting progress and monitoring dangerous capabilities.

    • Interpretability in Transformers: Techniques such as feature visualization, pixel attribution, circuits in LLMs, and the study of factual knowledge memorization.

    • AI Governance: Past and current regulatory efforts, and the intersection between technical research and governance.

Assessment

Grading: 100% project (see the syllabus).

References

Last year's course website is available here. During the seminar, we will cover more than a hundred recent papers. Here is a selection:

Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).

Ngo, Richard. "The alignment problem from a deep learning perspective." arXiv preprint arXiv:2209.00626 (2022).

Silver, David, et al. "Mastering chess and shogi by self-play with a general reinforcement learning algorithm." arXiv preprint arXiv:1712.01815 (2017).

Ye, Weirui, et al. "Mastering Atari games with limited data." Advances in Neural Information Processing Systems 34 (2021): 25476-25488.

Hubinger, Evan, et al. "Risks from learned optimization in advanced machine learning systems." arXiv preprint arXiv:1906.01820 (2019).

Di Langosco, Lauro Langosco, et al. "Goal misgeneralization in deep reinforcement learning." International Conference on Machine Learning. PMLR, 2022.

Hendrycks, Dan, et al. "Unsolved problems in ML safety." arXiv preprint arXiv:2109.13916 (2021).

Instructors

Charbel-Raphaël SEGERIE

Léo DANA
