Turing Seminar – An introduction to AGI Safety
Charbel-Raphaël Ségerie, Léo Dana
Opening

Course objective

The rapid advancements in artificial intelligence show no signs of slowing down. From GPT-1 to GPT-4, it is unclear where this race for performance is leading, but experts are warning of potentially catastrophic risks (Statement on AI Risk | CAIS). The highly publicized launches of chatbots with spectacular performance are just the tip of the iceberg of recent progress. Some models rival humans on specific tasks, from DALL·E in image generation to Whisper in speech transcription. Other, more general models simultaneously master dialogue, video games, and real-world robotics, or can even carry out tasks autonomously on the internet.

While these technologies are awe-inspiring, allowing us to accelerate biomedical research, break down language barriers, and lighten administrative work, significant technical obstacles remain to developing these general-purpose models to high safety standards. Models like ChatGPT or Bing Chat, although specifically trained to be polite and helpful to users, can easily be manipulated.

In this course, we will address these major technical flaws. These models remain large black boxes whose behavior is unpredictable even for the companies deploying them. Given this opacity, we cannot guarantee that their actions will conform to our expectations. A second flaw is the lack of robustness: models are trained on a particular dataset and must therefore generalize to new situations once deployed. The fact that Bing Chat threatened users despite being trained to help them illustrates this failure of generalization. The third flaw lies in the difficulty of precisely specifying the desired objective to a model, given the complexity and diversity of human values.

We will begin the seminar by studying how the largest models (GPT-3, text-to-image models, etc.) work and why they perform so well. You will be able to understand and re-code a transformer yourself (a minimal sketch of the core operation follows below).
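As a taste of what re-coding a transformer involves, here is a minimal sketch of scaled dot-product attention, the core operation of the architecture introduced in Vaswani et al. (2017). This is an illustrative NumPy version written for this page, not the course's reference implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices of queries, keys, and values.
    d_k = Q.shape[-1]
    # Similarity of each query to each key, scaled to keep the softmax stable.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax: each query gets a distribution over positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted average of the value vectors.
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, embedding dimension 8
print(scaled_dot_product_attention(x, x, x).shape)  # self-attention -> (4, 8)
```

A full transformer stacks this operation with learned query/key/value projections, multiple heads, and feed-forward layers; the first sessions build up to that picture.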

Then, we will address different solution paradigms:

  • Specification techniques based on reinforcement learning, such as RLHF and its variations (a minimal sketch of the underlying preference loss follows this list).
  • Mechanistic interpretability of transformers: how information is represented in neural networks, understanding their internal functioning, and robustly editing a language model's knowledge by modifying its memory.
  • Scalable oversight: training and alignment techniques that are likely to keep working even for superhuman-level AIs.
  • Agent foundations: the corrigibility problem.
  • Higher-level aspects, such as the governance of general AIs.
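To make the first paradigm concrete: RLHF typically starts by fitting a reward model to human preference comparisons. Here is a minimal, illustrative sketch of the Bradley-Terry preference loss used in that step; the scores are hypothetical outputs of a reward model, and none of this is taken from the course materials:

```python
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    # Bradley-Terry model: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
    # Minimizing the negative log-likelihood pushes the reward model to
    # score human-preferred responses higher than rejected ones.
    prob_chosen = 1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected)))
    return -np.log(prob_chosen)

# Hypothetical reward-model scores for one comparison pair:
print(preference_loss(2.0, -1.0))  # ~0.05: model agrees with the human label
print(preference_loss(-1.0, 2.0))  # ~3.05: model disagrees, large loss
```

In full RLHF, the fitted reward model then serves as the optimization target for a reinforcement-learning step (e.g. with PPO), which is where its variations and flaws come into play.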

Presentation: here

Session organization

7 sessions of 2 hours each, plus a project. Before each session, students must read the assigned pedagogical resources.

Activities during sessions: paper presentations, guest talks by experts, research projects, and occasionally debates and discussions.

Indicative program:

  • Capability – transformer architecture – text-to-image models – hypotheses of the seminar.
  • Panorama of RL: DQN, policy gradient, MuZero, and EfficientZero, the algorithm that learns Atari games faster than humans (a minimal policy-gradient sketch follows this list).
  • Introduction to risks (inspired by the framework proposed by Rohin Shah, head of alignment at Google DeepMind) – inner and outer alignment – risk framework.
  • Scalable oversight, with an explanation of OpenAI's plan and its flaws.
  • Interpretability in vision – feature visualization and pixel attribution techniques.
  • Interpretability of transformers – understanding the internal functioning of neural networks, robustly editing a language model's knowledge by modifying its memory, or forcing the specialization of an activation layer.
  • AI governance: just as the nuclear industry is supported by a dedicated safety industry, artificial intelligence needs to foster an entire ecosystem focused on AI safety. We will delve into the technical research applicable to AI governance (benchmarks, model evaluations, compute governance, etc.).
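As a preview of the RL panorama above, here is a minimal, self-contained sketch of the policy-gradient idea (REINFORCE) on a toy two-armed bandit. The environment and hyperparameters are invented for illustration, not taken from the course:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])  # toy environment: expected reward per arm
logits = np.zeros(2)               # policy parameters (softmax over two arms)
lr = 0.1

for _ in range(2000):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    action = rng.choice(2, p=probs)
    reward = rng.normal(true_means[action], 0.1)
    # REINFORCE update: theta += lr * reward * grad log pi(action).
    # For a softmax policy, grad log pi(a) = one_hot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    logits += lr * reward * grad_log_pi

probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs)  # concentrates on arm 1, the higher-reward arm
```

DQN, MuZero, and EfficientZero refine this basic idea with value learning, model-based search, and sample-efficiency improvements, respectively.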

A recording of the training day summarizing last year's seminar is available here, but the curriculum will be updated significantly, as research is progressing rapidly.

Assessment

Grading: 100% project, plus a bonus for practical exercises (TPs).

References

The website of last year's course is available here. During the seminar, we will cover more than a hundred recent papers. Here is a selection:

Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).

Ngo, Richard. "The alignment problem from a deep learning perspective." arXiv preprint arXiv:2209.00626 (2022).

Silver, David, et al. "Mastering chess and shogi by self-play with a general reinforcement learning algorithm." arXiv preprint arXiv:1712.01815 (2017).

Ye, Weirui, et al. "Mastering Atari games with limited data." Advances in Neural Information Processing Systems 34 (2021): 25476-25488.

Hubinger, Evan, et al. "Risks from learned optimization in advanced machine learning systems." arXiv preprint arXiv:1906.01820 (2019).

Di Langosco, Lauro Langosco, et al. "Goal misgeneralization in deep reinforcement learning." International Conference on Machine Learning. PMLR, 2022.

Hendrycks, Dan, et al. "Unsolved problems in ML safety." arXiv preprint arXiv:2109.13916 (2021).

Instructors

Charbel-Raphaël SEGERIE

Léo DANA

see the other courses of the first semester