Introduction
Integrative structural modeling is an approach for determining macromolecular structures that are challenging to determine experimentally (Alber et al., Reference Alber, Dokudovskaya, Veenhoff, Zhang, Kipper, Devos, Suprapto, Karni-Schmidt, Williams, Chait, Rout and Sali2007; Sali, Glaeser, Earnest, & Baumeister, Reference Sali, Glaeser, Earnest and Baumeister2003). Data from multiple experiments is combined with physical principles, statistics of previous structures, and prior models for structure determination. This approach overcomes the limitations of individual techniques for structure determination and maximizes the accuracy, precision, completeness, and efficiency of structure determination (Rout & Sali, Reference Rout and Sali2019; Sali, Reference Sali2021).
Recent advancements in both computational and experimental domains have prompted a resurgence of interest in integrative modeling (Beck, Covino, Hänelt, & Müller-McNicoll, Reference Beck, Covino, Hänelt and Müller-McNicoll2024; McCafferty et al., Reference McCafferty, Klumpe, Amaro, Kukulski, Collinson and Engel2024). On the one hand, AI-based predictions of structures of proteins and their complexes with other proteins and nucleic acids have significantly advanced structural biology of late (Abramson et al., Reference Abramson, Adler, Dunger, Evans, Green, Pritzel, Ronneberger, Willmore, Ballard, Bambrick, Bodenstein, Evans, Hung, O’Neill, Reiman, Tunyasuvunakool, Wu, Žemgulytė, Arvaniti and Jumper2024; Akdel et al., Reference Akdel, Pires, Pardo, Jänes, Zalevsky, Mészáros, Bryant, Good, Laskowski, Pozzati, Shenoy, Zhu, Kundrotas, Serra, Rodrigues, Dunham, Burke, Borkakoti, Velankar and Beltrao2022; Jumper et al., Reference Jumper, Evans, Pritzel, Green, Figurnov, Ronneberger, Tunyasuvunakool, Bates, Žídek, Potapenko, Bridgland, Meyer, Kohl, Ballard, Cowie, Romera-Paredes, Nikolov, Jain, Adler and Hassabis2021). 
This has spurred the development of numerous methods that aim to integrate AI-based structures with diverse types of experimental data, including diffraction data from X-ray crystallography, electron density maps from electron cryo-microscopy, and chemical crosslinks from mass spectrometry (Chang et al., Reference Chang, Wang, Connolly, Meng, Su, Cvirkaite-Krupovic, Krupovic, Egelman and Si2022; Stahl et al., Reference Stahl, Warneke, Demann, Bremenkamp, Hormes, Brock, Stülke and Rappsilber2024; Stahl, Graziadei, Dau, Brock, & Rappsilber, Reference Stahl, Graziadei, Dau, Brock and Rappsilber2023; Terwilliger et al., Reference Terwilliger, Poon, Afonine, Schlicksup, Croll, Millán, Richardson, Read and Adams2022; Terwilliger et al., Reference Terwilliger, Afonine, Liebschner, Croll, McCoy, Oeffner, Williams, Poon, Richardson, Read and Adams2023; Zhang et al., Reference Zhang, Zhang, Kagaya, Terashi, Zhao, Xiong and Kihara2023). These methods integrate the data in various ways, ranging from using the data to validate AI-based predictions, to using the data as additional inputs in the deep learning method, to encoding the data in the loss functions, resulting in structure predictions that are consistent with the data (O’Reilly et al., Reference O’Reilly, Graziadei, Forbrig, Bremenkamp, Charles, Lenz, Elfmann, Fischer, Stülke and Rappsilber2023; Stahl et al., Reference Stahl, Graziadei, Dau, Brock and Rappsilber2023, Reference Stahl, Warneke, Demann, Bremenkamp, Hormes, Brock, Stülke and Rappsilber2024; Terwilliger et al., Reference Terwilliger, Poon, Afonine, Schlicksup, Croll, Millán, Richardson, Read and Adams2022, Reference Terwilliger, Afonine, Liebschner, Croll, McCoy, Oeffner, Williams, Poon, Richardson, Read and Adams2023; Zhang, Haghighatlari, et al., Reference Zhang, Zhang, Kagaya, Terashi, Zhao, Xiong and Kihara2023).
On the other hand, experimental techniques for in situ structure determination of assemblies are also rapidly advancing, with improvements in both hardware and software for imaging cells using cryo-electron tomography (Beck et al., Reference Beck, Covino, Hänelt and Müller-McNicoll2024; McCafferty et al., Reference McCafferty, Klumpe, Amaro, Kukulski, Collinson and Engel2024). This has led to an increase in tomography data, concurrent with an increase in the number and resolution of structures solved using tomography. Together, integrative methods using cryo-electron tomography maps along with AI-based structure predictions have resulted in significant advancements in structure determination, for example for nuclear pore complexes and ciliary complexes (Chen et al., Reference Chen, Shiozaki, Haas, Skinner, Zhao, Guo, Polacco, Yu, Krogan, Lishko, Kaake, Vale and Agard2023; Fontana et al., Reference Fontana, Dong, Pi, Tong, Hecksel, Wang, Fu, Bustamante and Wu2022; Hesketh, Mukhopadhyay, Nakamura, Toropova, & Roberts, Reference Hesketh, Mukhopadhyay, Nakamura, Toropova and Roberts2022; McCafferty et al., Reference McCafferty, Klumpe, Amaro, Kukulski, Collinson and Engel2024; Mosalaganti et al., Reference Mosalaganti, Obarska-Kosinska, Siggel, Taniguchi, Turoňová, Zimmerli, Buczak, Schmidt, Margiotta, Mackmull, Hagen, Hummer, Kosinski and Beck2022; Zhu et al., Reference Zhu, Huang, Zeng, Zhan, Liang, Xu, Zhao, Wang, Wang, Zhou, Tao, Liu, Lei, Yan and Shi2022).
Nonetheless, there is immense potential for advancing integrative modeling methods for macromolecular assemblies. Here, we provide our perspective on two areas warranting immediate method development in the context of integrative modeling: methods for modeling intrinsically disordered regions (IDRs) of proteins and approaches for leveraging in situ data. First, unlike ordered proteins, intrinsically disordered proteins (IDPs) comprise a dynamic ensemble of conformations that are best characterized in statistical terms rather than as static structures (Baul, Chakraborty, Mugnai, Straub, & Thirumalai, Reference Baul, Chakraborty, Mugnai, Straub and Thirumalai2019). They comprise a significant fraction of the eukaryotic proteome and are involved in critical cellular processes (Oldfield & Dunker, Reference Oldfield and Dunker2014). They are found in several macromolecular assemblies, for example, the FG-Nups in the nuclear pore complex (Fontana et al., Reference Fontana, Dong, Pi, Tong, Hecksel, Wang, Fu, Bustamante and Wu2022; Zhu et al., Reference Zhu, Huang, Zeng, Zhan, Liang, Xu, Zhao, Wang, Wang, Zhou, Tao, Liu, Lei, Yan and Shi2022). However, their intrinsic disorder makes their characterization in these assemblies challenging. Improved representations for IDPs and methods for generating realistic IDP ensembles are crucial for understanding their functions. Second, the structural characterization of macromolecules using in situ data relies on accurate particle annotations on the tomograms (de Teresa-Trueba et al., Reference de Teresa-Trueba, Goetz, Mattausch, Stojanovska, Zimmerli, Toro-Nahuelpan, Cheng, Tollervey, Pape, Beck, Diz-Muñoz, Kreshuk, Mahamid and Zaugg2023; Rice et al., Reference Rice, Wagner, Stabrin, Sitsel, Prumbaum and Raunser2023). 
However, owing to the low signal-to-noise ratio of the acquired tilt images, the missing wedge effect, and the inherent heterogeneity in the sample, the localization and identification of macromolecules in tomograms is time-consuming, laborious, and often challenging (de Teresa-Trueba et al., Reference de Teresa-Trueba, Goetz, Mattausch, Stojanovska, Zimmerli, Toro-Nahuelpan, Cheng, Tollervey, Pape, Beck, Diz-Muñoz, Kreshuk, Mahamid and Zaugg2023; Moebel et al., Reference Moebel, Martinez-Sanchez, Lamm, Righetto, Wietrzynski, Albert, Larivière, Fourmentin, Pfeffer, Ortiz, Baumeister, Peng, Engel and Kervrann2021). Advances in deep learning methods and integrative approaches for combining data from other experimental and computational methods with cryo-electron tomograms can facilitate high throughput in situ structural characterization of macromolecular species.
In this Perspective, we first briefly review the existing integrative modeling methods and recent examples of macromolecular assemblies characterized using integrative modeling. Then, we discuss methods developed and opportunities for modeling disordered regions and leveraging in situ data. Finally, we end with an outlook summarizing other open problems in integrative modeling.
Integrative modeling methods
Several methods have been developed for integrative structure determination (Table 1). A subset of these, including the Integrative Modeling Platform (IMP), High Ambiguity Driven DOCKing (HADDOCK), and Assembline (Alber et al., Reference Alber, Dokudovskaya, Veenhoff, Zhang, Kipper, Devos, Suprapto, Karni-Schmidt, Williams, Chait, Rout and Sali2007; Dominguez, Boelens, & Bonvin, Reference Dominguez, Boelens and Bonvin2003; Honorato et al., Reference Honorato, Trellet, Jiménez-García, Schaarschmidt, Giulini, Reys, Koukos, Rodrigues, Karaca, Van Zundert, Roel-Touris, Van Noort, Jandová, Melquiond and Bonvin2024; Rantos, Karius, & Kosinski, Reference Rantos, Karius and Kosinski2022; Russel et al., Reference Russel, Lasker, Webb, Velázquez-Muriel, Tjioe, Schneidman-Duhovny, Peterson and Sali2012), is discussed here. IMP is a framework for Bayesian integrative modeling that facilitates structure determination of macromolecular ensembles at multiple resolutions (multi-scale) and multiple states (multi-state) (Alber et al., Reference Alber, Dokudovskaya, Veenhoff, Zhang, Kipper, Devos, Suprapto, Karni-Schmidt, Williams, Chait, Rout and Sali2007; Russel et al., Reference Russel, Lasker, Webb, Velázquez-Muriel, Tjioe, Schneidman-Duhovny, Peterson and Sali2012). A wide array of experimental data can be combined using IMP, for example, in vivo genetic interactions, co-immunoprecipitation, FRET (Förster Resonance Energy Transfer), SAXS (small angle X-ray scattering), XLMS (chemical crosslinks from mass spectrometry), density maps from cryo-electron microscopy, and atomic structures from X-ray crystallography, NMR (Nuclear Magnetic Resonance), and AI-based predictions (Rout & Sali, Reference Rout and Sali2019; Sali, Reference Sali2021). The Bayesian inference framework allows for data from multiple sources to be integrated while considering the uncertainty in the data (Schneidman-Duhovny, Pellarin, & Sali, Reference Schneidman-Duhovny, Pellarin and Sali2014).
The modular design of IMP facilitates the mixing and matching of scoring functions and sampling algorithms. It has been used in the modeling of several large assemblies, most notably the nuclear pore complex (Akey et al., Reference Akey, Singh, Ouch, Echeverria, Nudelman, Varberg, Yu, Fang, Shi, Wang, Salzberg, Song, Xu, Gumbart, Suslov, Unruh, Jaspersen, Chait, Sali and Rout2022; Alber et al., Reference Alber, Dokudovskaya, Veenhoff, Zhang, Kipper, Devos, Suprapto, Karni-Schmidt, Williams, Chait, Rout and Sali2007; Rout & Sali, Reference Rout and Sali2019; Sali, Reference Sali2021; Singh et al., Reference Singh, Soni, Hutchings, Echeverria, Shaikh, Duquette, Suslov, Li, Van Eeuwen, Molloy, Shi, Wang, Guo, Chait, Fernandez-Martinez, Rout, Sali and Villa2024). Recent advancements in IMP include Bayesian scoring functions for in vivo genetic interactions (Braberg et al., Reference Braberg, Echeverria, Bohn, Cimermancic, Shiver, Alexander, Xu, Shales, Dronamraju, Jiang, Dwivedi, Bogdanoff, Chaung, Hüttenhain, Wang, Mavor, Pellarin, Schneidman, Bader and Krogan2020), Bayesian model selection for optimizing model representation (Arvindekar, Pathak, Majila, & Viswanath, Reference Arvindekar, Pathak, Majila and Viswanath2024), automated choice of sampling parameters (Pasani & Viswanath, Reference Pasani and Viswanath2021), and annotation of precision for model regions (Ullanat, Kasukurthi, & Viswanath, Reference Ullanat, Kasukurthi and Viswanath2022).
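To make the Bayesian formulation above concrete: for a model M, independent data sets D_i, and prior information I, the posterior is p(M | D, I) ∝ p(M | I) Π_i p(D_i | M, I), and a sampler typically minimizes its negative logarithm as a score. The following is a minimal Python sketch under stated assumptions, not IMP's API: the function names are hypothetical, every data source is reduced to a pairwise distance observation, and the noise model is assumed Gaussian with one uncertainty (sigma) per data source.

```python
import math


def gaussian_neg_log_likelihood(observed, predicted, sigma):
    """Return -log N(observed | predicted, sigma^2) for one data point."""
    z = (observed - predicted) / sigma
    return 0.5 * z * z + math.log(sigma * math.sqrt(2.0 * math.pi))


def bayesian_score(coords, distance_data, sigmas, prior_neg_log=0.0):
    """Negative log posterior (up to an additive constant) of a candidate model.

    coords:        dict mapping bead name -> (x, y, z)
    distance_data: dict mapping data source -> list of (bead_i, bead_j, observed_distance)
    sigmas:        dict mapping data source -> assumed noise level of that source
    prior_neg_log: -log prior of the model (assumed precomputed, e.g. excluded volume)
    """
    score = prior_neg_log
    for source, restraints in distance_data.items():
        for bead_i, bead_j, observed in restraints:
            predicted = math.dist(coords[bead_i], coords[bead_j])
            score += gaussian_neg_log_likelihood(observed, predicted, sigmas[source])
    return score


# Illustrative (made-up) data: two sources reporting on the same bead pair.
data = {"xlms": [("A", "B", 12.0)], "fret": [("A", "B", 11.0)]}
sigmas = {"xlms": 5.0, "fret": 2.0}  # the noisier source contributes a flatter likelihood
model_near = {"A": (0.0, 0.0, 0.0), "B": (10.0, 0.0, 0.0)}
model_far = {"A": (0.0, 0.0, 0.0), "B": (30.0, 0.0, 0.0)}
print(bayesian_score(model_near, data, sigmas) < bayesian_score(model_far, data, sigmas))
```

Because each source carries its own sigma, a noisy data type penalizes deviations less steeply than a precise one, which is one simple way of "considering the uncertainty in the data"; in a fuller treatment the sigmas would themselves be sampled as nuisance parameters.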
Table 1. A list of commonly used integrative modeling software for large protein complexes. Each of these combines information from three or more experimental and/or computational sources. For a comprehensive overview, see (Bonomi et al., Reference Bonomi, Heller, Camilloni and Vendruscolo2017; Habeck, Reference Habeck2023; Rout & Sali, Reference Rout and Sali2019).
Assembline is a protocol for integrative modeling that builds upon IMP, combining Xlink Analyzer, UCSF Chimera, and IMP to model large assemblies (Rantos et al., Reference Rantos, Karius and Kosinski2022). It is applicable for systems for which medium-resolution EM maps and a large number of atomic structures of subunits are available. It improves upon IMP by using pre-computed rigid body fits to EM maps to make the sampling more efficient. HADDOCK is a method for atomistic integrative modeling of protein complexes (Dominguez et al., Reference Dominguez, Boelens and Bonvin2003; Honorato et al., Reference Honorato, Trellet, Jiménez-García, Schaarschmidt, Giulini, Reys, Koukos, Rodrigues, Karaca, Van Zundert, Roel-Touris, Van Noort, Jandová, Melquiond and Bonvin2024). Experimental data from NMR, SAXS, XLMS, and mutagenesis studies are encoded as Ambiguous Interaction Restraints (AIR). Recent improvements to HADDOCK include the ability to model complexes of up to 20 macromolecules, new restraints based on cryo-EM maps, coarse-grained representations for efficient sampling, customizable pre- and post-processing steps, and a user-friendly web server for integrative modeling (Honorato et al., Reference Honorato, Trellet, Jiménez-García, Schaarschmidt, Giulini, Reys, Koukos, Rodrigues, Karaca, Van Zundert, Roel-Touris, Van Noort, Jandová, Melquiond and Bonvin2024).
Other than these, several methods allow fitting known protein structures into medium to low-resolution density maps, including MDFF and TEMPy-REFF (Beton, Mulvaney, Cragnolini, & Topf, Reference Beton, Mulvaney, Cragnolini and Topf2024; Trabuco, Villa, Mitra, Frank, & Schulten, Reference Trabuco, Villa, Mitra, Frank and Schulten2008). MDFF (Molecular dynamics flexible fitting) utilizes MD simulations for fitting structures into density maps by biasing the simulation using an additional potential derived from the density map (Trabuco et al., Reference Trabuco, Villa, Mitra, Frank and Schulten2008). TEMPy-REFF (Responsibility-based Flexible-Fitting) refines an initial structure within a density map iteratively using the Expectation-Maximization algorithm (Beton et al., Reference Beton, Mulvaney, Cragnolini and Topf2024).
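The density-derived biasing potential used in MDFF can be illustrated with a short sketch. In the formulation of Trabuco et al., each atom j feels a potential V_EM(r_j) = ξ [1 − (Φ(r_j) − Φ_thr) / (Φ_max − Φ_thr)] where the map value Φ is at or above a threshold Φ_thr, and V_EM = ξ elsewhere, so that high-density regions are energetically favorable. The Python below is a simplified illustration under stated assumptions: the function names and the sparse-dictionary map format are hypothetical, the map value is looked up at the nearest voxel rather than interpolated, and only the energy (not the force) is computed.

```python
def mdff_voxel_potential(phi, phi_thr, phi_max, xi=1.0):
    """Per-atom potential: xi outside the thresholded density, falling to 0 at the map maximum."""
    if phi < phi_thr:
        return xi
    return xi * (1.0 - (phi - phi_thr) / (phi_max - phi_thr))


def mdff_energy(atoms, density, voxel_size, phi_thr, xi=1.0):
    """Sum of weighted per-atom potentials, U_EM = sum_j w_j * V_EM(r_j).

    atoms:      list of ((x, y, z), weight) tuples
    density:    dict mapping integer voxel index (i, j, k) -> map value
                (assumed sparse format; absent voxels count as zero density)
    voxel_size: edge length of a cubic voxel, in the same units as the coordinates
    """
    phi_max = max(density.values())
    if phi_max <= phi_thr:
        raise ValueError("threshold must lie below the map maximum")
    energy = 0.0
    for (x, y, z), weight in atoms:
        index = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        phi = density.get(index, 0.0)  # nearest-voxel lookup, no interpolation
        energy += weight * mdff_voxel_potential(phi, phi_thr, phi_max, xi)
    return energy


# Toy map with two occupied voxels of edge 2.0.
toy_map = {(0, 0, 0): 1.0, (1, 0, 0): 0.2}
inside = [((1.0, 1.0, 1.0), 1.0)]   # sits in the densest voxel
outside = [((9.0, 9.0, 9.0), 1.0)]  # sits in empty space
print(mdff_energy(inside, toy_map, 2.0, 0.1) < mdff_energy(outside, toy_map, 2.0, 0.1))
```

In an actual flexible-fitting run the force on each atom is the negative gradient of this potential, which requires a smooth interpolation of the map; the sketch only shows how the energy landscape is derived from the density.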
Recent examples in integrative modeling: focus on nuclear and cell adhesion complexes
Integrative modeling has shed light on diverse cellular processes by determining the structures of assemblies associated with them. A list of representative recently characterized integrative structures is presented (Table 2). Here, we discuss examples of recent integrative structural biology studies in nuclear trafficking, gene expression regulation, and cell–cell adhesion. These studies not only provide novel insights into the structure and function of these assemblies but also highlight areas for future applications and method development.
Abbreviations: DIA-MS, Data independent acquisition mass spectrometry; EM, Electron microscopy; ET, Electron tomography; NMR, Nuclear magnetic resonance; NS, Negative staining; SEC-MALLS, Size exclusion chromatography—multi-angle laser light scattering; XLMS, Crosslinking coupled with mass spectrometry.
The nuclear pore complex (NPC) is a large macromolecular assembly in the nuclear envelope that connects the nucleus and cytoplasm and plays an important role in nuclear trafficking (Akey et al., Reference Akey, Singh, Ouch, Echeverria, Nudelman, Varberg, Yu, Fang, Shi, Wang, Salzberg, Song, Xu, Gumbart, Suslov, Unruh, Jaspersen, Chait, Sali and Rout2022; Alber et al., Reference Alber, Dokudovskaya, Veenhoff, Zhang, Kipper, Devos, Suprapto, Karni-Schmidt, Williams, Chait, Rout and Sali2007). Several recent studies have improved our understanding of the components of the NPC (Bley et al., Reference Bley, Nie, Mobbs, Petrovic, Gres, Liu, Mukherjee, Harvey, Huber, Lin, Brown, Tang, Rundlet, Correia, Chen, Regmi, Stevens, Jette, Dasso and Hoelz2022; Fontana et al., Reference Fontana, Dong, Pi, Tong, Hecksel, Wang, Fu, Bustamante and Wu2022; Singh et al., Reference Singh, Soni, Hutchings, Echeverria, Shaikh, Duquette, Suslov, Li, Van Eeuwen, Molloy, Shi, Wang, Guo, Chait, Fernandez-Martinez, Rout, Sali and Villa2024; Yu et al., Reference Yu, Heidari, Mikhaleva, Tan, Mingu, Ruan, Reinkemeier, Obarska-Kosinska, Siggel, Beck, Hummer and Lemke2023; Zhu et al., Reference Zhu, Huang, Zeng, Zhan, Liang, Xu, Zhao, Wang, Wang, Zhou, Tao, Liu, Lei, Yan and Shi2022). Some of these studies involve the fitting of AlphaFold and experimentally determined structures into medium-resolution cryo-EM maps and cryo-electron tomograms (Bley et al., Reference Bley, Nie, Mobbs, Petrovic, Gres, Liu, Mukherjee, Harvey, Huber, Lin, Brown, Tang, Rundlet, Correia, Chen, Regmi, Stevens, Jette, Dasso and Hoelz2022; Fontana et al., Reference Fontana, Dong, Pi, Tong, Hecksel, Wang, Fu, Bustamante and Wu2022; Petrovic et al., Reference Petrovic, Samanta, Perriches, Bley, Thierbach, Brown, Nie, Mobbs, Stevens, Liu, Tomaleri, Schaus and Hoelz2022; Zhu et al., Reference Zhu, Huang, Zeng, Zhan, Liang, Xu, Zhao, Wang, Wang, Zhou, Tao, Liu, Lei, Yan and Shi2022). 
Other studies additionally incorporate biochemical data including chemical crosslinks (Singh et al., Reference Singh, Soni, Hutchings, Echeverria, Shaikh, Duquette, Suslov, Li, Van Eeuwen, Molloy, Shi, Wang, Guo, Chait, Fernandez-Martinez, Rout, Sali and Villa2024). Together, these studies characterized the structures of the cytoplasmic face, the cytoplasmic ring, the linker-scaffold network, and the nuclear basket of the NPC. The resulting structures enabled the identification of novel interfaces between disordered nucleoporins (Nups) (Fontana et al., Reference Fontana, Dong, Pi, Tong, Hecksel, Wang, Fu, Bustamante and Wu2022; Zhu et al., Reference Zhu, Huang, Zeng, Zhan, Liang, Xu, Zhao, Wang, Wang, Zhou, Tao, Liu, Lei, Yan and Shi2022), elucidated the function of the nucleoporin Nup358 and the cytoplasmic filament nucleoporin complex (CFNC) (Bley et al., Reference Bley, Nie, Mobbs, Petrovic, Gres, Liu, Mukherjee, Harvey, Huber, Lin, Brown, Tang, Rundlet, Correia, Chen, Regmi, Stevens, Jette, Dasso and Hoelz2022), delineated the role of Mlp/Tpr in assisting mRNP transport (Bley et al., Reference Bley, Nie, Mobbs, Petrovic, Gres, Liu, Mukherjee, Harvey, Huber, Lin, Brown, Tang, Rundlet, Correia, Chen, Regmi, Stevens, Jette, Dasso and Hoelz2022; Fontana et al., Reference Fontana, Dong, Pi, Tong, Hecksel, Wang, Fu, Bustamante and Wu2022; Singh et al., Reference Singh, Soni, Hutchings, Echeverria, Shaikh, Duquette, Suslov, Li, Van Eeuwen, Molloy, Shi, Wang, Guo, Chait, Fernandez-Martinez, Rout, Sali and Villa2024; Yu et al., Reference Yu, Heidari, Mikhaleva, Tan, Mingu, Ruan, Reinkemeier, Obarska-Kosinska, Siggel, Beck, Hummer and Lemke2023; Zhu et al., Reference Zhu, Huang, Zeng, Zhan, Liang, Xu, Zhao, Wang, Wang, Zhou, Tao, Liu, Lei, Yan and Shi2022), and revealed the plasticity and robustness of the inner ring (Petrovic et al., Reference Petrovic, Samanta, Perriches, Bley, Thierbach, Brown, Nie, Mobbs, Stevens, Liu, Tomaleri, Schaus and Hoelz2022).
Finally, another study determined the distribution of intrinsically disordered nucleoporins in the NPC and their motion in the central channel using fluorescence lifetime imaging of fluorescence resonance energy transfer (FLIM-FRET) and coarse-grained molecular dynamics (MD) simulations (Yu et al., Reference Yu, Heidari, Mikhaleva, Tan, Mingu, Ruan, Reinkemeier, Obarska-Kosinska, Siggel, Beck, Hummer and Lemke2023).
Whereas the above studies address components of the NPC, other studies determined comprehensive integrative structures of the entire NPC (Akey et al., Reference Akey, Singh, Ouch, Echeverria, Nudelman, Varberg, Yu, Fang, Shi, Wang, Salzberg, Song, Xu, Gumbart, Suslov, Unruh, Jaspersen, Chait, Sali and Rout2022, Reference Akey, Echeverria, Ouch, Nudelman, Shi, Wang, Chait, Sali, Fernandez-Martinez and Rout2023; Mosalaganti et al., Reference Mosalaganti, Obarska-Kosinska, Siggel, Taniguchi, Turoňová, Zimmerli, Buczak, Schmidt, Margiotta, Mackmull, Hagen, Hummer, Kosinski and Beck2022). These studies integrate in situ cryo-electron tomography data with AlphaFold or experimentally determined structures (Mosalaganti et al., Reference Mosalaganti, Obarska-Kosinska, Siggel, Taniguchi, Turoňová, Zimmerli, Buczak, Schmidt, Margiotta, Mackmull, Hagen, Hummer, Kosinski and Beck2022), and additionally cryo-EM maps, chemical crosslinks, and data from quantitative fluorescence imaging and biochemical studies to determine comprehensive structures of NPCs (Akey et al., Reference Akey, Singh, Ouch, Echeverria, Nudelman, Varberg, Yu, Fang, Shi, Wang, Salzberg, Song, Xu, Gumbart, Suslov, Unruh, Jaspersen, Chait, Sali and Rout2022, Reference Akey, Echeverria, Ouch, Nudelman, Shi, Wang, Chait, Sali, Fernandez-Martinez and Rout2023). The structures revealed distinct dilated and constricted states of the complex and characterized the plasticity of the pore (Akey et al., Reference Akey, Singh, Ouch, Echeverria, Nudelman, Varberg, Yu, Fang, Shi, Wang, Salzberg, Song, Xu, Gumbart, Suslov, Unruh, Jaspersen, Chait, Sali and Rout2022, Reference Akey, Echeverria, Ouch, Nudelman, Shi, Wang, Chait, Sali, Fernandez-Martinez and Rout2023; Mosalaganti et al., Reference Mosalaganti, Obarska-Kosinska, Siggel, Taniguchi, Turoňová, Zimmerli, Buczak, Schmidt, Margiotta, Mackmull, Hagen, Hummer, Kosinski and Beck2022).
Additionally, they localized precise anchoring sites for the intrinsically disordered Nups (Mosalaganti et al., Reference Mosalaganti, Obarska-Kosinska, Siggel, Taniguchi, Turoňová, Zimmerli, Buczak, Schmidt, Margiotta, Mackmull, Hagen, Hummer, Kosinski and Beck2022) and delineated the function of Pom152 in ring dilation (Akey et al., Reference Akey, Echeverria, Ouch, Nudelman, Shi, Wang, Chait, Sali, Fernandez-Martinez and Rout2023).
The Nucleosome Remodeling and Deacetylase (NuRD) complex is a chromatin-modifying assembly that plays an important role in several cellular processes including transcriptional regulation, cell cycle progression, and cellular differentiation (Arvindekar et al., Reference Arvindekar, Jackman, Low, Landsberg, Mackay and Viswanath2022). It consists of chromatin remodeling and deacetylase modules, connected by MBD and GATAD2 proteins. The structures of three subcomplexes of NuRD were determined by integrating data from negative-stain and low-resolution cryo-EM maps, X-ray crystallography, XLMS, SEC-MALLS, DIA-MS, NMR spectroscopy, homology modeling, secondary structure predictions, and physical principles (Arvindekar et al., Reference Arvindekar, Jackman, Low, Landsberg, Mackay and Viswanath2022). The integrative structures depict MBD in two states in NuRD and elucidate the role of the intrinsically disordered region of MBD in bridging the chromatin remodeling and deacetylase modules of NuRD.
Desmosomes are intercellular junctions that tether the intermediate filaments of adjacent cells in tissues under mechanical stress (Pasani, Menon, & Viswanath, Reference Pasani, Menon and Viswanath2024). The integrative structure of the desmosomal outer dense plaque (ODP) was determined by combining data from cryo-electron tomography, X-ray crystallography, immuno-electron microscopy, in vitro overlay, in vivo co-localization assays, Yeast Two-Hybrid (Y2H), co-immunoprecipitation, in silico sequence-based predictions of transmembrane and disordered regions, homology modeling, and stereochemistry (Pasani et al., Reference Pasani, Menon and Viswanath2024). The structure enabled the localization of disordered regions of Plakophilin (PKP) and Plakoglobin (PG) and the identification of novel protein–protein interfaces associated with them, leading to hypotheses about the functions of these disordered regions.
Two elements emerge as common across the aforementioned studies: they leverage in situ cryo-electron tomography data, and the characterized systems contain significant fractions of disordered regions (Figure 1). This highlights two areas of immediate interest for method development: modeling with intrinsically disordered proteins (IDPs) and utilizing data from cryo-electron tomography (cryo-ET), discussed in the following sections.
Integrative modeling of intrinsically disordered proteins
Intrinsically disordered proteins (IDPs) are a class of proteins that lack a well-defined ordered structure in their monomeric state. Rather, they exist as an ensemble of interconverting conformers in equilibrium and hence are structurally heterogeneous (Baul et al., Reference Baul, Chakraborty, Mugnai, Straub and Thirumalai2019; Lindorff-Larsen & Kragelund, Reference Lindorff-Larsen and Kragelund2021). This heterogeneity of IDPs also makes it challenging to characterize them both experimentally and computationally (Beck et al., Reference Beck, Covino, Hänelt and Müller-McNicoll2024).
Learning representations for IDPs
Recently, protein language models (pLMs) have emerged as powerful tools for learning context-aware representations, providing a compact and informative approach to characterize the structural and functional properties of proteins (Bepler & Berger, Reference Bepler and Berger2021; Rives et al., Reference Rives, Meier, Sercu, Goyal, Lin, Liu, Guo, Ott, Zitnick, Ma and Fergus2021). pLMs enhance the performance of models on downstream tasks via transfer learning, eliminating the need to train a neural network from end to end. This approach is particularly beneficial while training models with small datasets.
Using pLMs for IDPs presents several challenges. First, pLMs trained only on sequences may not be able to capture the conformational heterogeneity of IDPs. Second, the databases used to train pLMs are dominated by ordered protein sequences, leading to a bias in the learned representations. Third, IDPs often function through transient interactions and context-dependent conformations, i.e., the same IDP may adopt different conformations with different binding partners. The state-of-the-art pLMs do not account for the environmental context and interacting partners and thus may not capture these transient interactions. Finally, the lack of structural data representative of IDP conformations poses a significant challenge in training models.
Advances in representation learning techniques are required for accurately characterizing the behavior of IDPs. Representations for IDPs could be improved by fine-tuning existing pLMs on IDP-specific tasks and/or by incorporating additional data on IDPs. Sequence alone might not be sufficient to capture the properties of IDPs; incorporating structural information or physics-based priors might allow pLMs to capture the complex dynamics of IDPs (Wang, Wang, Evans, & Tiwary, Reference Wang, Wang, Evans and Tiwary2024). Structure-aware pLMs have been recently developed (Peñaherrera & Koes, Reference Peñaherrera and Koes2024; Sun & Shen, Reference Sun and Shen2023; Wang et al., Reference Wang, Wang, Evans and Tiwary2024), and the same approach can be extended to IDPs. There is a need to obtain more structural data for IDPs (Jahn, Marquet, Heinzinger, & Rost, Reference Jahn, Marquet, Heinzinger and Rost2024). Whereas experimental structural data remain important, acquiring them can be tedious and time-consuming. Computational approaches for generating realistic IDP conformational ensembles, such as MD simulations and generative models, would provide valuable experimental-like structural data. In the next section, we discuss methods for generating IDP ensembles.
Generating IDP ensembles
Determining the conformational ensembles of IDPs is essential for understanding their functions. MD simulations are widely used for generating conformational ensembles. However, their reliability depends on the accuracy of force fields and the ergodicity of sampling (Bonomi, Heller, Camilloni, & Vendruscolo, Reference Bonomi, Heller, Camilloni and Vendruscolo2017; Robustelli, Piana, & Shaw, Reference Robustelli, Piana and Shaw2018). Force fields typically used for folded proteins often fail to accurately capture the conformations of IDPs when compared with experimental data. Efforts for improving the force fields for IDPs focus on either refining the protein force field (Baul et al., Reference Baul, Chakraborty, Mugnai, Straub and Thirumalai2019; Huang et al., Reference Huang, Rauscher, Nawrocki, Ran, Feig, de Groot, Grubmüller and MacKerell2017; Joseph et al., Reference Joseph, Reinhardt, Aguirre, Chew, Russell, Espinosa, Garaizar and Collepardo-Guevara2021), or accurately accounting for protein-water interactions (Best, Zheng, & Mittal, Reference Best, Zheng and Mittal2014; Nerenberg, Jo, So, Tripathy, & Head-Gordon, Reference Nerenberg, Jo, So, Tripathy and Head-Gordon2012; Robustelli et al., Reference Robustelli, Piana and Shaw2018; Vitalis & Pappu, Reference Vitalis and Pappu2009). Coarse-grained models that improve sampling by reducing the degrees of freedom have also been developed (Baratam & Srivastava, Reference Baratam and Srivastava2024; Baul et al., Reference Baul, Chakraborty, Mugnai, Straub and Thirumalai2019; Joseph et al., Reference Joseph, Reinhardt, Aguirre, Chew, Russell, Espinosa, Garaizar and Collepardo-Guevara2021; Marrink, Risselada, Yefimov, Tieleman, & de Vries, Reference Marrink, Risselada, Yefimov, Tieleman and de Vries2007; Thomasen, Pesce, Roesgaard, Tesei, & Lindorff-Larsen, Reference Thomasen, Pesce, Roesgaard, Tesei and Lindorff-Larsen2022).
Deep generative models offer a computationally efficient means for sampling conformations from a learned data distribution. Latent space embeddings from a variational autoencoder (VAE) trained on IDP sequences (Mansoor, Baek, Park, Lee, & Baker, Reference Mansoor, Baek, Park, Lee and Baker2024), conditional generative adversarial networks (GANs) (Janson, Valdes-Garcia, Heo, & Feig, Reference Janson, Valdes-Garcia, Heo and Feig2023), and denoising diffusion probabilistic models (DDPMs) (Janson & Feig, Reference Janson and Feig2024; Zhu et al., Reference Zhu, Li, Zhang, Zheng, Zhong, Bai, Wang, Wei, Yang and Chen2024) have been used for generating all-atom and Cα coarse-grained ensembles of IDPs. More sophisticated approaches such as flow matching may also be employed for generating ensembles of IDPs. Notably, these generative models leverage MD-generated ensembles for training.
Recent studies demonstrate the combined use of MD simulations and machine learning approaches to generate IDP conformers with the aim of predicting the biophysical properties of IDPs and designing IDP sequences (Lotthammer, Ginell, Griffith, Emenecker, & Holehouse, Reference Lotthammer, Ginell, Griffith, Emenecker and Holehouse2024; Pesce et al., Reference Pesce, Bremer, Tesei, Hopkins, Grace, Mittag and Lindorff-Larsen2024; Tesei et al., Reference Tesei, Trolle, Jonsson, Betz, Knudsen, Pesce, Johansson and Lindorff-Larsen2024). For example, the ALBATROSS deep learning model was developed for predicting the biophysical properties of IDPs, such as the radius of gyration, by training on IDP ensembles generated via the MPIPI-GG model (Lotthammer et al., Reference Lotthammer, Ginell, Griffith, Emenecker and Holehouse2024). Similarly, support vector regression models were trained to predict chain compaction for IDP sequences using IDP ensembles generated by the CALVADOS model (Tesei et al., Reference Tesei, Trolle, Jonsson, Betz, Knudsen, Pesce, Johansson and Lindorff-Larsen2024). Lastly, a method for designing IDP sequences with pre-defined conformational properties was developed by combining ensemble generation using CALVADOS with alchemical free-energy calculations within a Markov Chain Monte Carlo (MCMC) optimization framework (Pesce et al., Reference Pesce, Bremer, Tesei, Hopkins, Grace, Mittag and Lindorff-Larsen2024).
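As an illustration of the kind of ensemble-level property these models are trained to predict, the radius of gyration of a conformer can be computed directly from its coordinates as the root-mean-square distance of the beads from their centroid. The sketch below assumes a Cα-only representation with equal bead masses and reports the (optionally weighted) mean over conformers; the function names are hypothetical, and some studies report the root-mean-square value over the ensemble instead of the mean.

```python
import math


def radius_of_gyration(coords):
    """Rg of one conformer: RMS distance of beads from their centroid (equal masses assumed)."""
    n = len(coords)
    if n == 0:
        raise ValueError("conformer has no coordinates")
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    mean_sq = sum(
        sum((c[k] - center[k]) ** 2 for k in range(3)) for c in coords
    ) / n
    return math.sqrt(mean_sq)


def ensemble_mean_rg(ensemble, weights=None):
    """Weighted mean Rg over conformers; uniform weights if none are given."""
    n = len(ensemble)
    weights = weights or [1.0 / n] * n
    total = sum(weights)
    return sum(w * radius_of_gyration(conf) for w, conf in zip(weights, ensemble)) / total


# Two toy two-bead "conformers": one compact, one extended.
compact = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
extended = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(ensemble_mean_rg([compact, extended]))
```

The optional weights make the same function usable for ensembles whose conformers have been reweighted against experimental data.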
Integrating experimental data for generating IDP ensembles
Broadly, experimental data can be utilized for modeling IDPs in several ways: validation of generated ensembles, reweighting generated ensembles using experimental data, incorporating experimental data as restraints for sampling conformations, or using experimental data to improve existing force fields (Bernetti & Bussi, Reference Bernetti and Bussi2023; Chan-Yao-Chong, Durand, & Ha-Duong, Reference Chan-Yao-Chong, Durand and Ha-Duong2019; Fisher & Stultz, Reference Fisher and Stultz2011). A comprehensive list of methods can be found in reviews on this topic (Bonomi et al., Reference Bonomi, Heller, Camilloni and Vendruscolo2017; Habeck, Reference Habeck2023).
First, ensemble validation involves generating realistic ensembles of IDPs and validating the results with experimental data (Chan-Yao-Chong et al., Reference Chan-Yao-Chong, Durand and Ha-Duong2019). Due to their ability to capture the dynamics of IDPs, NMR and SAS data are most commonly used for validating the generated ensembles (Baratam & Srivastava, Reference Baratam and Srivastava2024; Shrestha, Smith, & Petridis, Reference Shrestha, Smith and Petridis2021). Second, ensemble reweighting involves using experimental data to refine an existing ensemble so as to minimize the deviation of the ensemble from the observed data (Chan-Yao-Chong et al., Reference Chan-Yao-Chong, Durand and Ha-Duong2019). This can be achieved by maximum parsimony approaches, such as SES (Berlin et al., Reference Berlin, Castañeda, Schneidman-Duhovny, Sali, Nava-Tudela and Fushman2013), or by maximum entropy approaches (Pitera & Chodera, Reference Pitera and Chodera2012; Roux & Weare, Reference Roux and Weare2013; Cavalli, Camilloni, & Vendruscolo, Reference Cavalli, Camilloni and Vendruscolo2013), such as EROS (Różycki, Kim, & Hummer, Reference Różycki, Kim and Hummer2011), BioEn (Hummer & Köfinger, Reference Hummer and Köfinger2015), and ABSURD (Salvi, Abyzov, & Blackledge, Reference Salvi, Abyzov and Blackledge2016). Bayesian inference methods allow consideration of uncertainty in data (Fisher, Ullman, & Stultz, Reference Fisher, Ullman and Stultz2013; Lincoff et al., Reference Lincoff, Haghighatlari, Krzeminski, Teixeira, Gomes, Gradinaru, Forman-Kay and Head-Gordon2020). Combining Bayesian inference and maximum entropy methods helps overcome the limitations of each (Crehuet, Buigues, Salvatella, & Lindorff-Larsen, Reference Crehuet, Buigues, Salvatella and Lindorff-Larsen2019; Fröhlking, Bernetti, & Bussi, Reference Fröhlking, Bernetti and Bussi2023). 
Deep learning models in combination with Bayesian and maximum entropy methods can also be used for refining an initial pool of conformations (DynamICE: Zhang, Haghighatlari, et al., Reference Zhang, Haghighatlari, Li, Liu, Namini, Teixeira, Forman-Kay and Head-Gordon2023). Third, experimental data can also be used as restraints to guide simulations (Chan-Yao-Chong et al., Reference Chan-Yao-Chong, Durand and Ha-Duong2019). Metainference uses Bayesian inference for incorporating noisy, ensemble-averaged experimental data using replica-averaged modeling (Bonomi, Camilloni, Cavalli, & Vendruscolo, Reference Bonomi, Camilloni, Cavalli and Vendruscolo2016; Bonomi, Camilloni, & Vendruscolo, Reference Bonomi, Camilloni and Vendruscolo2016). Similarly, parallel replica ensemble restraints based on SAXS data were used in MD simulations of IDPs (Hermann & Hub, Reference Hermann and Hub2019). Finally, experimental data can also be used for improving existing force fields on the fly using a Maximum Entropy approach (Cesari, Gil-Ley, & Bussi, Reference Cesari, Gil-Ley and Bussi2016).
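To make the ensemble reweighting strategy described above concrete, the following is a minimal sketch of maximum entropy reweighting against a single ensemble-averaged observable. It is an illustration under stated assumptions, not the implementation of EROS, BioEn, or any other cited method: the function name maxent_reweight, the use of the radius of gyration as the observable, the synthetic values, and the bisection solver are all hypothetical choices, and experimental error is ignored. The weights take the exponential form w_i ∝ w0_i exp(-λ f_i), with the Lagrange multiplier λ chosen so that the reweighted average matches the target value.

```python
import numpy as np


def maxent_reweight(observable, target, prior=None, lam_bounds=(-50.0, 50.0), tol=1e-10):
    """Maximum-entropy weights that reproduce one ensemble-averaged observable.

    observable : per-conformer values f_i computed from the ensemble
    target     : experimental ensemble average to be matched
    prior      : strictly positive initial weights (uniform if None)
    """
    obs = np.asarray(observable, dtype=float)
    n = obs.size
    if prior is None:
        prior = np.full(n, 1.0 / n)
    else:
        prior = np.asarray(prior, dtype=float) / np.sum(prior)
    if not (obs.min() < target < obs.max()):
        raise ValueError("target must lie strictly inside the range of the observable")

    # Standardise the observable so that the multiplier is of order one.
    z = (obs - obs.mean()) / obs.std()

    def weights(lam):
        log_w = np.log(prior) - lam * z
        log_w -= log_w.max()          # guard against overflow in exp
        w = np.exp(log_w)
        return w / w.sum()

    # The reweighted average decreases monotonically with lam, so bisection suffices.
    lo, hi = lam_bounds
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if float(weights(mid) @ obs) > target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return weights(0.5 * (lo + hi))


# Hypothetical example: synthetic per-conformer radii of gyration (in Å).
rng = np.random.default_rng(0)
rg = rng.normal(25.0, 3.0, size=1000)
w = maxent_reweight(rg, target=23.0)
print("reweighted <Rg>:", w @ rg)
print("effective sample size:", 1.0 / np.sum(w ** 2))
```

The effective sample size printed at the end is a common diagnostic: a value far below the number of conformers indicates that the initial ensemble samples the experimentally relevant region poorly, in which case restraining the simulation itself may be preferable to reweighting.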
A holistic understanding of the dynamic behavior of IDPs requires realistic conformational ensembles that can be generated using MD simulations and deep generative models. MD simulations can provide experimental-like ensembles for training deep generative models; the latter may aid in improving force fields, enhancing sampling of IDP conformations, and analyzing the ensemble generated via MD. Thus, an integrated approach would enable overcoming the limitations of each and improving our understanding of the dynamic nature of IDPs.
Integrative structure determination using in situ data
Cryo-electron tomography (cryo-ET) is a cryo-EM imaging technique that enables structural characterization of macromolecular species (macromolecules, their complexes, and assemblies) in their native cellular environment at nanometer resolution (Gubins et al., Reference Gubins, Chaillet, van der Schot, Veltkamp, Förster, Hao, Wan, Cui, Zhang, Moebel, Wang, Kihara, Zeng, Xu, Nguyen, White and Bunyak2020; Lamm et al., Reference Lamm, Righetto, Wietrzynski, Pöge, Martinez-Sanchez, Peng and Engel2022). High-throughput localization and identification of macromolecular species within a tomogram can provide insights into their conformational heterogeneity, potential interactors, counts, and distributions within the cell (Arvindekar, Majila, & Viswanath, Reference Arvindekar, Majila and Viswanath2024; Beck et al., Reference Beck, Covino, Hänelt and Müller-McNicoll2024; Förster, Han, & Beck, Reference Förster, Han, Beck and Jensen2010; McCafferty et al., Reference McCafferty, Klumpe, Amaro, Kukulski, Collinson and Engel2024). Integrating cryo-ET data with complementary data from experiments such as XLMS, Y2H, cryo-EM Single Particle Analysis (SPA), and FRET, together with AI-based structure predictions and prior structural models, can help build a comprehensive structural atlas of the cell (Beck et al., Reference Beck, Covino, Hänelt and Müller-McNicoll2024; Förster et al., Reference Förster, Han, Beck and Jensen2010; McCafferty et al., Reference McCafferty, Klumpe, Amaro, Kukulski, Collinson and Engel2024). 
However, the intracellular crowding, compositional heterogeneity and low copy numbers of macromolecular species, the low signal-to-noise ratio, and the missing wedge in the tomography data pose significant challenges for localizing and identifying macromolecules in the tomograms (Moebel et al., Reference Moebel, Martinez-Sanchez, Lamm, Righetto, Wietrzynski, Albert, Larivière, Fourmentin, Pfeffer, Ortiz, Baumeister, Peng, Engel and Kervrann2021; Pyle & Zanetti, Reference Pyle and Zanetti2021).
Localization and identification of macromolecular species with known structures
Macromolecular species with known structures are often annotated in tomograms either manually or by template matching. Manual particle annotation, however, is time-consuming, laborious, error-prone, and not suitable for high-throughput workflows (Lamm et al., Reference Lamm, Righetto, Wietrzynski, Pöge, Martinez-Sanchez, Peng and Engel2022). Template matching involves using a low-pass filtered template of the known structure of a target macromolecule to localize similar densities in the tomogram (Frangakis et al., Reference Frangakis, Böhm, Förster, Nickell, Nicastro, Typke, Hegerl and Baumeister2002). Methods for template matching are under active development (Cruz-León et al., Reference Cruz-León, Majtner, Hoffmann, Kreysing, Kehl, Tuijtel, Schaefer, Geißler, Beck, Turoňová and Hummer2024; Maurer, Siggel, & Kosinski, Reference Maurer, Siggel and Kosinski2024). For example, the use of high-resolution information and template-specific search parameter optimization for objective, comprehensive, and high-confidence localization and identification of macromolecular species in tomograms was recently proposed (Cruz-León et al., Reference Cruz-León, Majtner, Hoffmann, Kreysing, Kehl, Tuijtel, Schaefer, Geißler, Beck, Turoňová and Hummer2024).
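The core operation of template matching can be illustrated with a short sketch: a low-pass filtered template is cross-correlated with the tomogram, and the highest-scoring position is taken as the candidate location. The code below is a deliberately simplified illustration and not the procedure of any cited method. It assumes a single template orientation, and it omits the rotational search, missing-wedge and CTF handling, and local normalization that practical tools require. The function names, the Gaussian filter width, and the synthetic volumes are hypothetical.

```python
import numpy as np


def gaussian_lowpass(volume, sigma_vox):
    """Low-pass filter a 3D volume with a Gaussian of width sigma_vox (in voxels)."""
    freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in volume.shape], indexing="ij")
    k2 = sum(f ** 2 for f in freqs)
    # Fourier transform of a real-space Gaussian with standard deviation sigma_vox.
    kernel = np.exp(-2.0 * (np.pi * sigma_vox) ** 2 * k2)
    return np.real(np.fft.ifftn(np.fft.fftn(volume) * kernel))


def match_template(tomogram, template):
    """Cross-correlate a zero-mean, unit-norm template with a tomogram via FFT.

    The score at index p corresponds to placing the template box with its
    first voxel at p (with periodic wrap-around at the tomogram edges).
    """
    t = template - template.mean()
    t /= np.linalg.norm(t)
    padded = np.zeros(tomogram.shape, dtype=float)
    padded[tuple(slice(0, s) for s in t.shape)] = t
    spectrum = np.fft.fftn(tomogram) * np.conj(np.fft.fftn(padded))
    return np.real(np.fft.ifftn(spectrum))


# Hypothetical example: an L-shaped object hidden in a noisy synthetic tomogram.
rng = np.random.default_rng(1)
template = np.zeros((8, 8, 8))
template[2:6, 2:4, 2:6] = 1.0
template[2:4, 4:6, 2:6] = 1.0

tomogram = rng.normal(0.0, 0.1, size=(32, 32, 32))
tomogram[10:18, 5:13, 17:25] += template          # true position of the box corner

scores = match_template(tomogram, gaussian_lowpass(template, sigma_vox=1.0))
peak = np.unravel_index(np.argmax(scores), scores.shape)
print("best-scoring position:", peak)
```

In practice the correlation is repeated over a sampled set of template orientations and the per-voxel maximum is retained, which is where most of the computational cost of template matching arises.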
In addition to template matching, several supervised learning methods have also been recently developed. Two such deep learning-based methods, DeepFinder and DeePiCt, utilize convolutional neural networks (CNNs) for simultaneous localization and identification of macromolecular species (de Teresa-Trueba et al., Reference de Teresa-Trueba, Goetz, Mattausch, Stojanovska, Zimmerli, Toro-Nahuelpan, Cheng, Tollervey, Pape, Beck, Diz-Muñoz, Kreshuk, Mahamid and Zaugg2023; Moebel et al., Reference Moebel, Martinez-Sanchez, Lamm, Righetto, Wietrzynski, Albert, Larivière, Fourmentin, Pfeffer, Ortiz, Baumeister, Peng, Engel and Kervrann2021). Another deep learning-based object detection method, MemBrain, was developed for estimating the localizations and orientations of membrane-embedded macromolecules (Lamm et al., Reference Lamm, Righetto, Wietrzynski, Pöge, Martinez-Sanchez, Peng and Engel2022, Reference Lamm, Zufferey, Righetto, Wietrzynski, Yamauchi, Burt, Liu, Zhang, Martinez-Sanchez, Ziegler, Isensee, Schnabel, Engel and Peng2024). These approaches have been shown to outperform template matching for localizing macromolecules (de Teresa-Trueba et al., Reference de Teresa-Trueba, Goetz, Mattausch, Stojanovska, Zimmerli, Toro-Nahuelpan, Cheng, Tollervey, Pape, Beck, Diz-Muñoz, Kreshuk, Mahamid and Zaugg2023; Gubins et al., Reference Gubins, Chaillet, van der Schot, Veltkamp, Förster, Hao, Wan, Cui, Zhang, Moebel, Wang, Kihara, Zeng, Xu, Nguyen, White and Bunyak2020; Lamm et al., Reference Lamm, Righetto, Wietrzynski, Pöge, Martinez-Sanchez, Peng and Engel2022; Moebel et al., Reference Moebel, Martinez-Sanchez, Lamm, Righetto, Wietrzynski, Albert, Larivière, Fourmentin, Pfeffer, Ortiz, Baumeister, Peng, Engel and Kervrann2021). However, similar to manual annotation and template matching, these supervised learning approaches are limited to macromolecules with known structures. 
They are not suitable for high-throughput workflows and de novo structural characterization of macromolecular species (de Teresa-Trueba et al., Reference de Teresa-Trueba, Goetz, Mattausch, Stojanovska, Zimmerli, Toro-Nahuelpan, Cheng, Tollervey, Pape, Beck, Diz-Muñoz, Kreshuk, Mahamid and Zaugg2023; Gubins et al., Reference Gubins, Chaillet, van der Schot, Veltkamp, Förster, Hao, Wan, Cui, Zhang, Moebel, Wang, Kihara, Zeng, Xu, Nguyen, White and Bunyak2020; Lamm et al., Reference Lamm, Righetto, Wietrzynski, Pöge, Martinez-Sanchez, Peng and Engel2022; Moebel et al., Reference Moebel, Martinez-Sanchez, Lamm, Righetto, Wietrzynski, Albert, Larivière, Fourmentin, Pfeffer, Ortiz, Baumeister, Peng, Engel and Kervrann2021).
De novo localization and identification of species
For de novo structural characterization of macromolecular species with unknown structures, deep metric learning-based approaches, such as TomoTwin, and unsupervised learning approaches, such as Multi-Pattern Pursuit (MPP) and Deep Iterative Subtomogram Clustering Approach (DISCA), were recently developed (Rice et al., Reference Rice, Wagner, Stabrin, Sitsel, Prumbaum and Raunser2023; Xu et al., Reference Xu, Singla, Tocheva, Chang, Stevens, Jensen and Alber2019; Zeng et al., Reference Zeng, Kahng, Xue, Mahamid, Chang and Xu2023). These approaches aim to cluster subtomograms based on their structural similarity. Subtomogram averaging on the clustered subtomograms can aid in the structural characterization of macromolecular species at 10–20 Å resolutions (Rice et al., Reference Rice, Wagner, Stabrin, Sitsel, Prumbaum and Raunser2023; Zeng et al., Reference Zeng, Kahng, Xue, Mahamid, Chang and Xu2023). These approaches are currently sensitive to noise in the tomograms and to the size and abundance of the macromolecular species. However, they hold great promise for de novo high-throughput structural characterization of macromolecular species using tomographic data.
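The two steps mentioned above, clustering subtomograms by similarity and averaging within each cluster, can be sketched as follows. This is a toy illustration only and does not reproduce TomoTwin, MPP, or DISCA: raw voxel intensities stand in for the learned embeddings those methods use, plain k-means stands in for their clustering schemes, and the subtomograms are assumed to be already aligned. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def kmeans(features, k, n_iter=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [features[0]]
    for _ in range(1, k):
        d = np.min([np.sum((features - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(features[np.argmax(d)])
    centers = np.array(centers, dtype=float)

    for _ in range(n_iter):
        dist = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        updated = np.array([
            features[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels, centers


def cluster_and_average(subtomograms, k):
    """Cluster pre-aligned subtomograms and return one average volume per cluster."""
    features = subtomograms.reshape(len(subtomograms), -1)
    labels, _ = kmeans(features, k)
    # Assumes every cluster is non-empty, which holds for the example below.
    averages = np.array([subtomograms[labels == j].mean(axis=0) for j in range(k)])
    return labels, averages


# Hypothetical example: noisy copies of two different shapes.
rng = np.random.default_rng(2)
cube = np.zeros((8, 8, 8))
cube[2:6, 2:6, 2:6] = 1.0
bar = np.zeros((8, 8, 8))
bar[3:5, 3:5, :] = 1.0
stack = np.array([cube] * 20 + [bar] * 20) + rng.normal(0.0, 0.2, size=(40, 8, 8, 8))

labels, averages = cluster_and_average(stack, k=2)
print("cluster sizes:", np.bincount(labels))
print("noise left in first average:", np.std(averages[0] - averages[0].round()))
```

Averaging N noisy copies reduces the noise standard deviation by roughly a factor of the square root of N, which is why the abundance of a species in the tomogram strongly affects the attainable resolution.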
Visual proteomics
Visual proteomics is an approach that aims to build molecular atlases that encapsulate structural descriptions of macromolecules within the cell using methods such as cryo-ET (Beck et al., Reference Beck, Covino, Hänelt and Müller-McNicoll2024; Förster et al., Reference Förster, Han, Beck and Jensen2010; McCafferty et al., Reference McCafferty, Klumpe, Amaro, Kukulski, Collinson and Engel2024). This approach is inherently integrative. Given a tomogram, large macromolecular species with known atomic structures can be localized and identified within it using methods like template matching. Densities with unknown macromolecular identities can be obtained using the de novo approaches described above. The in situ structures of these uncharacterized macromolecular species can then be determined using an integrative approach by rigid fitting of structures obtained using cryo-EM SPA, X-ray crystallography, and AI-based structure predictions along with data from orthogonal experiments such as fluorescence microscopy and XLMS (Beck et al., Reference Beck, Covino, Hänelt and Müller-McNicoll2024; Förster et al., Reference Förster, Han, Beck and Jensen2010; McCafferty et al., Reference McCafferty, Klumpe, Amaro, Kukulski, Collinson and Engel2024). For example, recent studies used integrative approaches to combine data from cryo-ET, SPA with cryo-EM, mass spectrometry, and predictions from AlphaFold to understand the molecular architecture of the human IFT-A and IFT-B complexes (Hesketh et al., Reference Hesketh, Mukhopadhyay, Nakamura, Toropova and Roberts2022) and microtubule doublets in mouse sperm cells (Chen et al., Reference Chen, Shiozaki, Haas, Skinner, Zhao, Guo, Polacco, Yu, Krogan, Lishko, Kaake, Vale and Agard2023). 
In summary, utilizing cryo-ET data in an integrative approach can provide insights into interactors of a macromolecular species, associated protein communities, and larger cellular neighborhoods (Beck et al., Reference Beck, Covino, Hänelt and Müller-McNicoll2024; Förster et al., Reference Förster, Han, Beck and Jensen2010; McCafferty et al., Reference McCafferty, Klumpe, Amaro, Kukulski, Collinson and Engel2024).
Outlook
Integrative modeling has progressed significantly in the past decade, as evidenced by the increasing number, size, and precision of structures deposited to the PDB-Dev and integrated into the PDB (https://pdb-dev.wwpdb.org) (Saltzberg et al., Reference Saltzberg, Viswanath, Echeverria, Chemmama, Webb and Sali2021; Vallat et al., Reference Vallat, Webb, Fayazi, Voinea, Tangmunarunkit, Ganesan, Lawson, Westbrook, Kesselman, Sali and Berman2021). Integrative structural biology plays a crucial role in the era of AI-based structure predictions. Experimental data from rapidly advancing techniques such as cryo-electron tomography, and AI-based predictions can complement each other within an integrative framework (Arvindekar, Majila, & Viswanath, Reference Arvindekar, Majila and Viswanath2024; Beck et al., Reference Beck, Covino, Hänelt and Müller-McNicoll2024; McCafferty et al., Reference McCafferty, Klumpe, Amaro, Kukulski, Collinson and Engel2024; Shor & Schneidman-Duhovny, Reference Shor and Schneidman-Duhovny2024b). This approach has proved powerful for several systems such as ciliary complexes and nuclear pore complexes (Chen et al., Reference Chen, Shiozaki, Haas, Skinner, Zhao, Guo, Polacco, Yu, Krogan, Lishko, Kaake, Vale and Agard2023; Fontana et al., Reference Fontana, Dong, Pi, Tong, Hecksel, Wang, Fu, Bustamante and Wu2022; Hesketh et al., Reference Hesketh, Mukhopadhyay, Nakamura, Toropova and Roberts2022; McCafferty et al., Reference McCafferty, Klumpe, Amaro, Kukulski, Collinson and Engel2024; Mosalaganti et al., Reference Mosalaganti, Obarska-Kosinska, Siggel, Taniguchi, Turoňová, Zimmerli, Buczak, Schmidt, Margiotta, Mackmull, Hagen, Hummer, Kosinski and Beck2022; Zhu et al., Reference Zhu, Huang, Zeng, Zhan, Liang, Xu, Zhao, Wang, Wang, Zhou, Tao, Liu, Lei, Yan and Shi2022). 
AlphaFold and similar AI-based prediction methods can increasingly solve structures for larger and more complex systems (Abramson et al., Reference Abramson, Adler, Dunger, Evans, Green, Pritzel, Ronneberger, Willmore, Ballard, Bambrick, Bodenstein, Evans, Hung, O’Neill, Reiman, Tunyasuvunakool, Wu, Žemgulytė, Arvaniti and Jumper2024). However, their applicability to solving entire structures of large assemblies remains an open question as they are limited by GPU memory as well as the availability of training data. For example, membrane proteins and IDPs are under-represented in the training data (Carugo & Djinović-Carugo, Reference Carugo and Djinović-Carugo2023; Dobson et al., Reference Dobson, Szekeres, Gerdán, Langó, Zeke and Tusnády2023). The low-pLDDT regions in AlphaFold structures often coincide with IDRs, suggesting that AlphaFold may be used to identify disordered regions (Wilson, Choy, & Karttunen, Reference Wilson, Choy and Karttunen2022). In contrast, in cases where AlphaFold predicts structures of IDPs with high confidence, these regions typically represent the folded conformations of the IDPs, indicating a disorder-to-order transition in the presence of a partner (Alderson, Pritišanac, Kolarić, Moses, & Forman-Kay, Reference Alderson, Pritišanac, Kolarić, Moses and Forman-Kay2023; Wilson et al., Reference Wilson, Choy and Karttunen2022). Nonetheless, the static structures from AlphaFold are not an accurate representation of the dynamic behavior of IDPs, which is characterized by an ensemble of conformations (Ruff & Pappu, Reference Ruff and Pappu2021).
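As a simple illustration of how low-pLDDT regions might be flagged as candidate disordered segments, the sketch below scans a list of per-residue pLDDT values for contiguous low-confidence runs. The function name, the threshold of 50, and the minimum segment length of five residues are assumptions made for this example rather than values taken from the cited studies, and the scores are invented. As noted above, such a flag is only a heuristic: it says nothing about the conformational ensemble of the region, and high-confidence regions may still be conditionally folded.

```python
def low_confidence_segments(plddt, threshold=50.0, min_length=5):
    """Return (start, end) residue ranges, 1-based and inclusive, in which
    pLDDT stays below `threshold` for at least `min_length` residues."""
    segments = []
    start = None
    for i, score in enumerate(plddt):
        if score < threshold:
            if start is None:
                start = i                      # a low-confidence run begins
        else:
            if start is not None and i - start >= min_length:
                segments.append((start + 1, i))
            start = None
    # Close a run that extends to the C-terminus.
    if start is not None and len(plddt) - start >= min_length:
        segments.append((start + 1, len(plddt)))
    return segments


# Hypothetical per-residue scores: a confident domain, a long low-confidence
# stretch, a second domain, and a low-confidence run too short to be reported.
scores = [90.0] * 10 + [30.0] * 8 + [85.0] * 5 + [40.0] * 3 + [95.0] * 4
print(low_confidence_segments(scores))
```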
In this Perspective, we highlighted two emerging frontiers for method development in integrative modeling: modeling disordered regions and modeling with data from cryo-electron tomography. Here, we briefly point to other open areas in integrative modeling that are the subject of current studies and/or may benefit from timely method development. First, a lack of knowledge about the system stoichiometry is one of the challenges for starting integrative modeling. Methods to estimate the stoichiometry based on the confidence of AI-based predictions are only beginning to be developed and are not yet generalizable (Chim & Elofsson, Reference Chim and Elofsson2024; Shor & Schneidman-Duhovny, Reference Shor and Schneidman-Duhovny2024b, Reference Shor and Schneidman-Duhovny2024a). Second, methods for incorporating in vivo data in modeling are required. Recently, in vivo genetic interaction measurements were encoded as Bayesian distance restraints for integrative modeling of assemblies (Braberg et al., Reference Braberg, Echeverria, Bohn, Cimermancic, Shiver, Alexander, Xu, Shales, Dronamraju, Jiang, Dwivedi, Bogdanoff, Chaung, Hüttenhain, Wang, Mavor, Pellarin, Schneidman, Bader and Krogan2020). Similarly, methods for integrating other in vivo data such as data from super-resolution microscopy may also be developed to model larger cellular neighborhoods. Third, on the model representation front, it would be beneficial to determine system representation using objective measures instead of fixing them ad hoc (Arvindekar, Pathak, et al., Reference Arvindekar, Pathak, Majila and Viswanath2024; Viswanath & Sali, Reference Viswanath and Sali2019). Current methods for optimizing representations are limited to assessing a small number of candidate representations (Arvindekar, Pathak, et al., Reference Arvindekar, Pathak, Majila and Viswanath2024; Viswanath & Sali, Reference Viswanath and Sali2019). 
Methods that enable sampling and assessing a large number of representations, for example by dynamically varying the model representations during sampling, would benefit integrative modeling (Viswanath & Sali, Reference Viswanath and Sali2019). Fourth, methods for integrative modeling of dynamic systems with multiple discrete states and/or a continuum of states are also continually advancing (Habeck, Reference Habeck2023; Hoff, Thomasen, Lindorff-Larsen, & Bonomi, Reference Hoff, Thomasen, Lindorff-Larsen and Bonomi2024; Hoff, Zinke, Izadi-Pruneyre, & Bonomi, Reference Hoff, Zinke, Izadi-Pruneyre and Bonomi2024; Lincoff et al., Reference Lincoff, Haghighatlari, Krzeminski, Teixeira, Gomes, Gradinaru, Forman-Kay and Head-Gordon2020; Potrzebowski, Trewhella, & Andre, Reference Potrzebowski, Trewhella and Andre2018). Fifth, sampling procedures in integrative modeling may be improved by leveraging the recent advances in deep learning, particularly in generative modeling. Specifically, recent generative modeling methods for protein structure prediction may be extended to incorporate experimental data, potentially leading to more efficient sampling procedures than the current stochastic sampling methods (Jing, Berger, & Jaakkola, Reference Jing, Berger and Jaakkola2024; Watson et al., Reference Watson, Juergens, Bennett, Trippe, Yim, Eisenach, Ahern, Borst, Ragotte, Milles, Wicky, Hanikel, Pellock, Courbet, Sheffler, Wang, Venkatesh, Sappington, Torres and Baker2023; Wu et al., Reference Wu, Yang, van den Berg, Alamdari, Zou, Lu and Amini2024; Zheng et al., Reference Zheng, He, Liu, Shi, Lu, Feng, Ju, Wang, Zhu, Min, Zhang, Tang, Hao, Jin, Chen, Noé, Liu and Liu2024). 
Finally, methods for comprehensive validation of integrative models, including assessment of model uncertainty and Bayesian assessment of fit to different kinds of input data are also necessary and are under development (Sali et al., Reference Sali, Berman, Schwede, Trewhella, Kleywegt, Burley, Markley, Nakamura, Adams, Bonvin, Chiu, Peraro, Di Maio, Ferrin, Grünewald, Gutmanas, Henderson, Hummer, Iwasaki and Westbrook2015; Vallat et al., Reference Vallat, Webb, Fayazi, Voinea, Tangmunarunkit, Ganesan, Lawson, Westbrook, Kesselman, Sali and Berman2021). In all, these efforts will facilitate faster, more accurate, and more precise characterization of larger assemblies (Sali, Reference Sali2021). The grand challenge in the field is to construct spatiotemporal models of entire cells. Integrative models of assemblies can contribute directly to this effort via metamodeling efforts that involve the integration of models at different scales to address the grand challenge (Raveh et al., Reference Raveh, Sun, White, Sanyal, Tempkin, Zheng, Bharath, Singla, Wang, Zhao, Li, Graham, Kesselman, Stevens and Sali2021).
Open peer review
To view the open peer review materials for this article, please visit http://doi.org/10.1017/qrd.2024.15.
Acknowledgments
Molecular graphics images were produced using the UCSF Chimera and UCSF ChimeraX packages from the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco (supported by NIH P41 RR001081, NIH R01-GM129325, and National Institute of Allergy and Infectious Diseases).
Author contribution
K.M., S.A., and M.J.: reading and synthesis. K.M., S.A., M.J., and S.V.: writing: original draft, writing: revision. K.M.: visualization. S.V.: supervision, funding.
Funding
This work has been supported by the following grants: Department of Atomic Energy (DAE) TIFR grant RTI 4006, Department of Science and Technology (DST) SERB grant SPG/2020/000475, and Department of Biotechnology (DBT) BT/PR40323/BTIS/137/78/2023 from the Government of India to S.V.
Competing interest
None declared.
Comments
Editor, Perspectives in Integrated Biophysics, QRB Discovery
29th June 2024
Dear Editor,
We are pleased to submit an invited perspective entitled “Frontiers in integrative structural biology: modeling disordered proteins and utilizing in situ data” by Majila et al. for your consideration for publication in QRB Discovery.
Integrative structural modeling combines data from experiments, physical principles, statistics of previous structures, and prior models to obtain structures of macromolecular assemblies that are challenging to characterize experimentally. Drawing upon our integrative modeling studies for characterizing a diverse range of assemblies, we highlight two challenges for current modeling methods: modeling disordered regions in assemblies and incorporating in situ data. We discuss the state of the art and several interesting open questions in these two areas.
We very much hope you will find the manuscript worthy of review. We have suggested potential reviewers on the journal website.
Sincerely yours,
Shruthi Viswanath