Algorithms and learning for protein science
F. CAZALS
Learning, Machine Learning, Theory, Health track

Prerequisites

Training in algorithms / machine learning.

Interest in biophysics / biology / medicine.

Course objectives

Proteins underlie all biological functions, yet understanding their mechanisms at the atomic scale remains a fundamental open problem. The difficulties are inherent to complex dynamics in very high-dimensional spaces. Indeed, with circa 5,000 atoms and three (x, y, z) coordinates per atom, a polypeptide chain of median size lives in a configuration space of dimension 15,000. While AlphaFold2 has been a game changer by providing plausible structures of selected (well-folded) regions of proteins, it by no means provides insights into dynamics.
In this context, the goal of this class is twofold. First, to cast the main problems related to protein dynamics into a rigorous mathematical / algorithmic framework. Second, to present some of the major ongoing developments, which feature a stimulating interplay between theoretical biophysics, geometry, topology, and machine learning. To get acquainted with real data, one session will be devoted to a computer practical providing background on standard molecular manipulations and illustrating selected methods studied in class.

Course organization

The course consists of 6 lectures of 3 hours each, plus one session devoted to a computer practical.
The class is taught in English.

(1) Introduction: understanding protein functions at the atomic level
• Proteins: structures; [1], [2]
• Three examples of molecular mechanisms: viral attachment, membrane transport, antibody-antigen interactions; [3], [4], [5]
• Time-scales and local ergodicity in protein dynamics; [6], [7]
• Energies versus free energies; [8], [9]
• Open (mathematical, algorithmic) challenges in protein science

(2) Protein structure prediction with AlphaFold2 and transformers
• Protein sequences and multiple sequence alignments; [10]
• AlphaFold2: architecture; [11]
• AlphaFold2: performance assessment; [12]

(3) Molecular kinematics, inverse problems, loop sampling
• Modeling proteins using internal coordinates; [13]
• Kinematics and loop closure problems; [14], [15]
• Inverse problems and protein loop sampling; [16], [17]
• Modeling mixtures of multivariate densities on flat tori; [18]
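
As an illustration of the internal coordinates mentioned in this lecture, here is a minimal sketch of how a dihedral (torsion) angle can be computed from four consecutive atom positions, using the standard atan2 formula. The function name `dihedral` and the toy coordinates are illustrative assumptions, not material taken from the course.

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) defined by four 3D points.

    Illustrative helper (hypothetical name): the angle between the planes
    (p0, p1, p2) and (p1, p2, p3), measured around the axis p1 -> p2.
    """
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    # Bond vectors along the chain.
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    # Normals to the two planes.
    n1, n2 = cross(b1, b2), cross(b2, b3)
    # atan2 form: well behaved near 0 and pi, unlike acos.
    y = math.sqrt(dot(b2, b2)) * dot(b1, n2)
    x = dot(n1, n2)
    return math.atan2(y, x)

# Toy example: four points with the central bond along the z axis.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(math.degrees(angle))
```

Since dihedral angles are periodic, a tuple of such angles naturally lives on a flat torus, which is the setting of the last item above.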

(4) Boltzmann samplers for sequences and structures
• Generative models for protein sequences; [19], [20]
• Potts models versus transformers; [21]
• Application in protein design; [22]
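
To make the notion of a Boltzmann sampler concrete, the following is a minimal sketch of a Gibbs sampler for a Potts model over sequences, assuming the convention P(s) ∝ exp(Σ_i h_i(s_i) + Σ_{i<j} J_ij(s_i, s_j)). The function name `gibbs_sweep`, the nested-list layout of `h` and `J`, and the toy parameters are illustrative assumptions and do not come from the cited papers.

```python
import math
import random

def gibbs_sweep(seq, h, J, q, rng):
    """One Gibbs sweep over a Potts sequence (illustrative sketch).

    seq : list of states in {0, ..., q-1}, modified in place
    h   : h[i][a], local field for state a at site i
    J   : J[i][j][a][b], coupling between state a at site i and state b
          at site j (assumed symmetric: J[i][j][a][b] == J[j][i][b][a])
    """
    L = len(seq)
    for i in range(L):
        # Log-weight of each candidate state at site i, other sites fixed.
        logw = []
        for a in range(q):
            e = h[i][a]
            for j in range(L):
                if j != i:
                    e += J[i][j][a][seq[j]]
            logw.append(e)
        # Subtract the maximum before exponentiating, for numerical stability.
        m = max(logw)
        weights = [math.exp(v - m) for v in logw]
        # Resample site i from its conditional Boltzmann distribution.
        seq[i] = rng.choices(range(q), weights=weights)[0]
    return seq

# Toy usage: 5 sites, 3 states, zero fields, couplings favoring equal neighbors.
rng = random.Random(0)
L, q = 5, 3
h = [[0.0] * q for _ in range(L)]
J = [[[[1.0 if (abs(i - j) == 1 and a == b) else 0.0 for b in range(q)]
       for a in range(q)] for j in range(L)] for i in range(L)]
seq = [rng.randrange(q) for _ in range(L)]
for _ in range(100):
    gibbs_sweep(seq, h, J, q, rng)
print(seq)
```

In practice the fields and couplings would be inferred from a multiple sequence alignment; here they are chosen by hand only to exercise the sampler.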

(5) Computer practical
• Molecular visualization with PyMOL and/or VMD (Visual Molecular Dynamics)
• Structure predictions with AlphaFold2
• Molecular distances and structural alignments
• Conformational (loop) sampling

(6) High-dimensional sampling: from the volume of polytopes to densities of states in statistical physics
• Polytopes and their volumes: hardness and approximability; [23], [24]
• Linear programs and Hit-and-run (HAR); [25], [26]
• Estimating the volume of polytopes with HAR and PDMPs; [27], [28]
• Computing densities of states in statistical physics: the Wang-Landau algorithm; [29], [30]
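
For readers unfamiliar with Hit-and-run, here is a minimal sketch of the basic random walk in a bounded polytope {x : Ax ≤ b}: pick a uniformly random direction, compute the chord of the polytope through the current point along that direction, and jump to a uniform point on that chord. The function name `hit_and_run`, its signature, and the unit-square example are illustrative assumptions; this is the textbook walk only, not the volume algorithms of [27] or [28].

```python
import random

def hit_and_run(A, b, x0, n_steps, rng=None):
    """Hit-and-run walk in {x : A x <= b} (illustrative sketch).

    Assumes the polytope is bounded and x0 is a strictly interior point.
    Returns the list of visited points.
    """
    rng = rng or random.Random(0)
    d = len(x0)
    x = list(x0)
    samples = []
    for _ in range(n_steps):
        # Uniform random direction: a normalized Gaussian vector.
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = sum(c * c for c in u) ** 0.5
        u = [c / norm for c in u]
        # Chord through x along u: all t with A (x + t u) <= b.
        t_min, t_max = float("-inf"), float("inf")
        for row, bi in zip(A, b):
            au = sum(r * c for r, c in zip(row, u))
            slack = bi - sum(r * c for r, c in zip(row, x))
            if au > 1e-12:
                t_max = min(t_max, slack / au)
            elif au < -1e-12:
                t_min = max(t_min, slack / au)
        # Move to a uniform random point on the chord.
        t = rng.uniform(t_min, t_max)
        x = [xi + t * ui for xi, ui in zip(x, u)]
        samples.append(list(x))
    return samples

# Toy usage: the unit square 0 <= x, y <= 1 written as A x <= b.
A = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b = [1, 0, 1, 0]
points = hit_and_run(A, b, x0=[0.5, 0.5], n_steps=1000)
print(points[-1])
```

The stationary distribution of this walk is uniform on the polytope, which is what makes it a building block for volume estimation.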

(7) Spatial partitions and applications
• Random projection trees; [31]
• Applications to nearest neighbor finding, regression, dimension estimation; [32]
• Shapley-Shubik indices for tree-like structures; [33], [34]
• Applications to feature selection and mixing assessment

Assessment

Projects carried out by students working in pairs (15 points) + individual quiz (5 points).

A project consists of reproducing or extending recently published results. Students are asked to submit a report, plus a git repo / notebook / code archive. Students will be given one month to complete the project.

Catch-up: oral exam on the lectures, typically a quiz with one or two questions per lecture.

References

General references
• Bioinformatics: [10]
• Biophysics and theoretical biophysics: [2], [8], [35]
• Algorithms, machine learning: [9], [24], [36]

References
[1] Carl Ivar Branden and John Tooze. Introduction to protein structure. Garland Science, 2012.
[2] John Kuriyan, Boyana Konforti, and David Wemmer. The molecules of life: Physical and chemical principles. Garland Science, 2012.
[3] Ruchao Peng, Lian-Ao Wu, Qingling Wang, Jianxun Qi, and George Fu Gao. Cell entry by SARS-CoV-2. Trends in biochemical sciences, 46(10):848–860, 2021.
[4] Satoshi Murakami. Multidrug efflux transporter, AcrB–the pumping mechanism. Current opinion in structural biology, 18(4):459–465, 2008.
[5] A. Schmidt, H. Xu, A. Khan, T. O’Donnell, S. Khurana, L. King, J. Manischewitz, H. Golding, P. Suphaphiphat, A. Carfi, E. Settembre, P. Dormitzer, T. Kepler, R. Zhang, A. Moody, B. Haynes, H-X. Liao, D. Shaw, and S. Harrison. Preconfiguration of the antigen-binding site during affinity maturation of a broadly neutralizing influenza virus antibody. PNAS, 110(1):264–269, 2013.
[6] J.C. Schön and M. Jansen. Prediction, determination and validation of phase diagrams via the global study of energy landscapes. Int. J. of Materials Research, 100(2):135, 2009.
[7] S.A. Adcock and J.A. McCammon. Molecular dynamics: survey of methods for simulating the activity of proteins. Chemical reviews, 106(5):1589–1615, 2006.
[8] K. Dill and S. Bromberg. Molecular driving forces: statistical thermodynamics in biology, chemistry, physics, and nanoscience. Garland Science, 2010.
[9] T. Lelièvre, G. Stoltz, and M. Rousset. Free energy computations: A mathematical perspective. World Scientific, 2010.
[10] J. Pevsner. Bioinformatics and functional genomics. John Wiley & Sons, 2015.
[11] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
[12] Bernard Moussad, Rahmatullah Roche, and Debswapna Bhattacharya. The transformative power of transformers in protein structure prediction. Proceedings of the National Academy of Sciences, 120(32):e2303499120, 2023.
[13] T. O’Donnell and F. Cazals. Modeling the dynamics of proteins: techniques from geometry and kinematics. In preparation, 2024.
[14] E. Coutsias, C. Seok, M. Jacobson, and K. Dill. A kinematic view of loop closure. Journal of computational chemistry, 25(4):510–528, 2004.
[15] Kimberly Noonan, David O’Brien, and Jack Snoeyink. Probik: Protein backbone motion by inverse kinematics. The International Journal of Robotics Research, 24(11):971–982, 2005.
[16] T. O’Donnell, V. Agashe, and F. Cazals. Geometric constraints within tripeptides and the existence of tripeptide reconstructions. J. Comp. Chem., 44(13):1236–1249, 2023.
[17] T. O’Donnell and F. Cazals. Enhanced conformational exploration of protein loops using a global parameterization of the backbone geometry. J. Comp. Chem., 44(11):1094–1104, 2023.
[18] Piyumi R Amarasinghe, Lloyd Allison, Peter J Stuckey, Maria Garcia de la Banda, Arthur M Lesk, and Arun S Konagurthu. Getting ‘ϕψχal’ with proteins: minimum message length inference of joint distributions of backbone and sidechain dihedral angles. Bioinformatics, 39(Supplement_1):i357–i367, 2023.
[19] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. Marks, C. Sander, R. Zecchina, J. Onuchic, T. Hwa, and M. Weigt. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. PNAS, 108(49):E1293–E1301, 2011.
[20] Jeanne Trinquier, Guido Uguzzoni, Andrea Pagnani, Francesco Zamponi, and Martin Weigt. Efficient generative modeling of protein sequences using simple autoregressive models. Nature communications, 12(1):5800, 2021.
[21] Riccardo Rende, Federica Gerace, Alessandro Laio, and Sebastian Goldt. Mapping of attention mechanisms to a generalized Potts model. Physical Review Research, 6(2):023057, 2024.
[22] William P Russ, Matteo Figliuzzi, Christian Stocker, Pierre Barrat-Charlaix, Michael Socolich, Peter Kast, Donald Hilvert, Remi Monasson, Simona Cocco, Martin Weigt, et al. An evolution-based model for designing chorismate mutase enzymes. Science, 369(6502):440–445, 2020.
[23] S. Levy. Flavors of Geometry. Cambridge University Press, 1997.
[24] A. Blum, J. Hopcroft, and R. Kannan. Foundations of data science. Cambridge University Press, 2020.
[25] H. Berbee, C. Boender, A. Ran, C. Scheffer, R. Smith, and J. Telgen. Hit-and-run algorithms for the identification of nonredundant linear inequalities. Mathematical Programming, 37(2):184–207, 1987.
[26] L. Lovász. Hit-and-run mixes fast. Mathematical Programming, Series B, 86:443–461, 1999.
[27] B. Cousins and S. Vempala. A practical volume algorithm. Mathematical Programming Computation, 8(2):133–160, 2016.
[28] A. Chevallier, F. Cazals, and P. Fearnhead. Efficient computation of the volume of a polytope in high-dimensions using Piecewise Deterministic Markov Processes. In AISTATS, 2022.
[29] G. Fort, B. Jourdain, E. Kuhn, T. Lelièvre, and G. Stoltz. Convergence of the Wang-Landau algorithm. Mathematics of Computation, 84(295):2297–2327, 2015.
[30] A. Chevallier and F. Cazals. Wang-Landau algorithm: an adapted random walk to boost convergence. J. of Computational Physics, 410(1):1–19, 2020.
[31] S. Dasgupta and K. Sinha. Randomized partition trees for exact nearest neighbor search. JMLR: Workshop and Conference Proceedings, 30:1–21, 2013.
[32] S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. In Proceedings of the 40th annual ACM symposium on Theory of computing, pages 537–546. ACM, 2008.
[33] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30, 2017.
[34] Scott M Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. From local explanations to global understanding with explainable ai for trees. Nature machine intelligence, 2(1):56–67, 2020.
[35] D.M. Zuckerman. Statistical Physics of Biomolecules: An Introduction. CRC Press, 2010.
[36] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining, inference and prediction. Springer, 2001.

Topics covered

Proteins

Molecular conformations

Thermodynamics

Dynamics

High dimensional spaces

Kinematics

Sampling

(Free) energies

Instructors

Frédéric CAZALS

INRIA
