INTRODUCTION
“I should like you to consider that these functions…follow from the mere arrangement of the machine's organs every bit as naturally as the movements of a clock or other automaton follow from the arrangement of its counter-weights and wheels.” René Descartes (1596–1650); Treatise of Man (Descartes and Hall, 1972)
Descartes' visionary approach to understanding the functioning of the human body almost 400 years ago brought him into conflict with the scientific establishment of the time, but marked the beginning of a new way of thinking in the natural sciences: one which saw the inexorable rise of a reductionist approach to resolving scientific problems. His appeal to “Divide each difficulty into as many parts as is feasible and necessary to resolve it” now underpins the way we think about science and has facilitated great advances in biomedicine, including an understanding of the mechanisms which control the replication, transfer and evolution of genetic information from one generation to another. Indeed, it is since the discovery of the structure of DNA and the development in the later part of the 20th Century of tools to identify and manipulate individual genes and their products, that reductionist biology reached its high point through the paradigm of gene-centred, hypothesis-driven, experimentation. With gene identification becoming increasingly easy (including for most pathogens and hosts) and the availability of cheap commercial systems to manipulate gene expression both in vitro and in vivo, such experiments were irresistible. In infectious diseases research, this approach was successful in revealing many important biological functions that have transformed our understanding of the mechanisms which underpin host-pathogen interactions. Yet, within this vast amount of genomics data resides the challenge to reductionist thinking itself. The birth (or re-birth) of “Systems Biology” marks the beginning of a retreat from reductionism – a process encouraged by the fact that we are all being hopelessly outpaced by the scale of the biological data available. 
Some now argue that although reductionism has successfully identified most of the components and many of the interactions of biological systems, it offers no convincing concepts or methods to understand how system properties emerge (Sauer et al. 2007). By contrast, systems biology attempts to link high-throughput molecular sciences such as genomics, proteomics and metabolomics by integrating data across different levels of structure and scale, with the aim of understanding pathways, functional modules and large-scale organisation (Oltvai and Barabasi, 2002). A systems approach is characterized by multi-disciplinary collaboration across the natural sciences, mathematics, computer science, engineering and medicine; in its most tractable form it enables experiment and modelling to be linked. Above all, systems biology is about assembling, rather than disassembling, structure; integration rather than reduction. It “requires that we develop ways of thinking about integration that are as rigorous as our reductionist programmes, but different” (Noble, 2006).
Whether systems biology evolves to become the new paradigm for biological science experimentation in the 21st Century remains to be seen, but it is certainly here to stay. If it is to be a success then the technological tools required to implement its possibilities need to be fit for purpose. The genomes of all known parasites either are already available at the click of a mouse or will be within the next few years, and the same will be true for most host species. The recording, storage and querying of genome sequence data is challenging because of its size, but is comparatively straightforward since genomes are basically stable entities which can be annotated and applied as a road map to the organism across multiple experimental scenarios. By contrast, the phenome, defined as the totality of all traits of an organism or of one of its sub-systems (Mahner and Kary, 1997), is decidedly not stable and presents huge challenges in terms of analysis, representation and interpretation. Expression changes measured as mRNA abundance by high-throughput mRNA sequencing offer great possibilities for providing signatures of, for example, how host cells or tissues respond to parasite challenge. These data are of course highly dynamic and more complex than sequence data; they require careful attention to statistics during experimental design, and to the value of steady-state observations versus turnover. Although they form an integral part of the message signature in a biological system, mRNAs are of course not the functional molecules in a cell. Ultimately it is the dynamics of protein expression that is the most relevant functional measurement. Proteins, which more than any other component in the cell are the ‘counter-weights and wheels’ of ‘the clock’, are in a constant state of flux and activation and as such provide an even greater analytical challenge than the measurement of mRNA.
In the past 12 years significant advances have been made in analysing the proteomes of parasites and their hosts. In 2002 only a handful of research publications existed on parasites and proteomics, whereas in 2011 over 70 articles were published in that year alone, with the field totalling over 2000 citations (ISI Web of Knowledge). The rapid growth of the field has been facilitated by major technical advances allowing access to a wide spectrum of researchers. However, significant limitations still exist in the technologies constituting the science of proteomics, some of which present a real challenge to the emergence of systems biology as a tool to study host-parasite interactions. In this review we examine the current status of parasite proteomics and the scale of the tasks ahead. We review the current status of large-scale identification proteomics and discuss the need to apply more sophisticated quantitative proteomics approaches as we move from the era of descriptive proteomics to one in which we are concerned with understanding the dynamics of protein expression. We review recent developments in the availability of user-friendly, publicly accessible interfaces for parasite proteomics data and their potential for integrating transcriptomics and other data into a wider systems biology analysis. Finally we discuss the relationship between the proteome and transcriptome and ask ourselves: to what extent have we even begun to acquire the baseline data required to realise a systems approach to host-pathogen interactions?
HIGH-THROUGHPUT IDENTIFICATION OF PARASITE PROTEOMES
Before the availability of annotated parasite genomes, ‘top-down’ approaches in which intact proteins were analysed directly by mass spectrometry (MS), along with non-MS based approaches such as amino acid sequencing by Edman degradation, were the only practical way to obtain protein sequence information. These approaches were challenging and laborious, and were suited only to very low-throughput experiments. Advances in MS instrumentation, together with better annotated genomes, made so-called ‘bottom-up’ proteomics possible, in which MS is used to analyse enzymatically digested or chemically produced peptides from protein samples. The resulting MS fragmentation spectra (or fingerprint) can then be used to infer sequence by matching to a database derived from annotated genome sequence data. With genome sequence data now abundant, this approach is what most people today recognise as ‘proteomics’ and is especially suited to high-throughput experiments in which thousands of proteins can be identified from a sample in a single run on a suitably configured mass spectrometer. In high-throughput identification experiments, protein samples can be analysed whole, or as sub-proteomes produced by a variety of approaches such as fractionation, organelle separation or affinity purification. Before MS takes place, separation at either the protein or the peptide level is almost always performed to reduce sample complexity. Protein separation is typically achieved by gel electrophoresis followed by in-gel digestion, or by chromatography, while peptide separation is achieved by liquid chromatography before analysis by tandem mass spectrometry (MS/MS). A typical workflow for high-throughput identification proteomics in parasite systems is shown in Fig. 1.
Once the MS spectra have been produced, identifications are made by matching the experimental peptide MS data, usually mass/charge (m/z) data, to a theoretically calculated peptide m/z database using search engines such as Mascot (Perkins et al. 1999), SEQUEST (Yates et al. 1995) and X!Tandem (Craig and Beavis, 2004).
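The matching principle described above can be illustrated with a minimal Python sketch. It is not how Mascot, SEQUEST or X!Tandem score spectra; it only shows an in-silico tryptic digest, the calculation of theoretical peptide m/z values from monoisotopic residue masses, and a tolerance-based comparison with an observed precursor m/z. The function names, the example sequence and the 10 ppm tolerance are assumptions made for illustration, and missed cleavages and modifications are ignored.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (free N- and C-termini)
PROTON = 1.00728   # mass of each charge-carrying proton


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mz(peptide, charge=1):
    """Theoretical m/z of an unmodified peptide at the given charge state."""
    neutral = sum(MONO[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


def match_precursor(observed_mz, peptides, charge=1, tol_ppm=10.0):
    """Return the peptides whose theoretical m/z lies within tol_ppm of the observation."""
    hits = []
    for pep in peptides:
        theo = peptide_mz(pep, charge)
        if abs(observed_mz - theo) / theo * 1e6 <= tol_ppm:
            hits.append(pep)
    return hits
```

For example, `match_precursor(133.0608, tryptic_digest("MKRPLAKGG"))` compares one observed value against the three peptides of a hypothetical sequence. Real search engines additionally score the fragment-ion (MS/MS) spectrum of each candidate and estimate the statistical significance of the match.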
The field of parasitology has been quick to exploit emerging proteomics technologies. Early studies typically used in-gel digestion of protein spots separated by two-dimensional electrophoresis (2-DE), with identification by peptide mass fingerprinting (PMF) on a MALDI-ToF instrument, initially on excretory–secretory (ES) products from parasitic worms (Jefferies et al. 2001; Yatsuda et al. 2003) and then on more complex life stage proteomes (Cohen et al. 2002; Curwen et al. 2004). The more widespread use of LC-MS/MS analysis of peptides based on collision-induced dissociation (CID) allowed the high-throughput identification of multiple proteins present in 1-DE bands or 2-DE spots, or in whole digests, whether fractionated or unfractionated. These advances in MS instrumentation enabled a more comprehensive analysis of parasite proteomes. Two pioneering examples took advantage of high-throughput proteomics for the analysis of the life cycle of Plasmodium falciparum (Florens et al. 2002; Lasonder et al. 2002). These global strategies were used to analyse the changes that the parasite undergoes as it traverses its life cycle in multiple hosts. Both studies used pre-fractionation strategies, either on-line multidimensional protein identification technology (MudPIT), which exploits the complementary separation power of strong cation exchange and reverse phase chromatography, or 1-D gel electrophoresis, combined with MS/MS analysis to identify both parasite and human proteins.
Evidence for the translation of more than 1000 predicted ‘hypothetical’ proteins was obtained using these high-throughput proteomic techniques. Similar multiple separation technologies (MudPIT, 2-DE, and gel-LC-MS) were used on the proteomes of other Apicomplexa including Cryptosporidium parvum (Sanderson et al. 2008) and Toxoplasma gondii (Xia et al. 2008). In the case of T. gondii in particular, peptide evidence was also used to help correctly assign exon-intron boundaries and make important refinements to the annotation of the genome (Xia et al. 2008). At present, much of the parasite high-throughput identification proteomics data is focused on protozoan parasites, probably largely because more advanced annotated genomes are available for them than for helminths. There are now over 15 protozoan species with on-going proteomics projects, half of which have greater than 30% coverage and are headed by Plasmodium and Toxoplasma, each with around 70% coverage (Table 1). The next few years are likely to see a similar advance in proteomics identification data for helminths as many of the genomes come on-line. However, some helminth parasites such as Schistosoma mansoni already have advanced proteomics projects (Curwen et al. 2004; van Balkom et al. 2005; Braschi and Wilson, 2006; Guillou et al. 2007).
Some helminth studies have concentrated on sub-proteomes where the focus has been on the host-parasite interface and the possible effects of excretory/secretory (ES) material on the host immune system. The ES of Brugia malayi has been extensively characterized using proteomic techniques. Several proteins with immune regulation properties were identified in the ES of individual life stages (Hewitson et al. 2008; Moreno and Geary, 2008) as well as in other nematodes such as Teladorsagia circumcincta (Craig et al. 2006). Further comprehensive analysis of the B. malayi secretome was achieved by Bennuru et al. (2009), who identified over 800 proteins in total across several life stages of the parasite.
Advances in MS instrumentation and protein separation technology will continue to increase the number of identifications that can be obtained from a single sample. Increasing resolution and accuracy have improved the reliability of these identifications, as has the use of more sophisticated bioinformatic tools to improve processing of the MS/MS data and to ensure identifications are supported statistically. It is surely only a matter of time until relatively extensive proteomics coverage has been reported for most parasites of relevance to human and animal health. However, it is worth considering the extent to which a fully comprehensive proteome for some organisms is really achievable. Issues with sample preparation, incomplete tryptic digestion (Brownridge and Beynon, 2011), dynamic range, bias against certain classes of proteins, imperfect genome annotation, and concerns over peptide coverage, detectability and specificity (‘proteotypic’ peptides) (Beck et al. 2011) make full coverage of a proteome challenging. Extensive pre-fractionation of the protein or peptide samples can go some way to overcoming these problems, but in doing so adds a great deal of experimental redundancy and greatly increases instrument analysis time and cost. Finally, parasites often have complex life cycles, sometimes involving multiple hosts or survival in the external environment. A truly comprehensive proteome for any parasite is therefore never expressed at any one moment; rather it is a dynamic and responsive facet of the host-parasite system for which a great range of temporal data is required before a full picture can be achieved.
ADVANCES IN QUANTITATIVE PROTEOMICS IN PARASITES AND HOSTS
Measurement of the changes in the abundance of proteins from one condition to the next, or determining changes in the protein composition of protein complexes and organelles under different conditions, are key data which can help us understand how the proteome responds in a dynamic host-parasite interaction. Quantification is now one of the foremost topics in proteomics and the most recent proteomic platforms are now geared not only to provide identification, but also some form of quantitative data. There are two main approaches used in quantitative proteomics: label-based methods and label-free methods.
Common labelling techniques involve stable isotope labelling through in vivo metabolic labelling, chemical modification, or labelling with fluorescent dyes. Popular chemical labelling methods include isotope-coded affinity tags (ICAT) (Gygi et al. 1999) and iTRAQ, which uses a multiplexed set of isobaric reagents that yield amine-derivatised peptides for relative and absolute quantitation (Ross et al. 2004). In vivo metabolic labelling incorporates stable isotopes into proteins through labelled amino acids in cell culture (SILAC) (Ong et al. 2002). Fluorescent dye labelling is used in 2-DE fluorescence difference gel electrophoresis (2-DE DIGE) (Unlu et al. 1997). More recently, label-free quantification has gained popularity because it is cheap and easy to implement experimentally. Label-free techniques directly use raw data from parallel MS runs to compare relative protein abundance between runs. Spectral counting and intensity-based methods are among the most commonly used approaches. Spectral counting infers protein abundance from the number of peptide-spectrum matches (PSMs) in a given run. However, because ionization efficiency differs between peptides according to their biophysical properties, raw spectral counts have proved to be a less reliable indicator of quantity. Several software packages have been developed to normalize spectral counts, such as APEX (Lu et al. 2007) and emPAI (Ishihama et al. 2005).
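As an illustration of how a normalised count-based index can be computed, the sketch below follows the published definition of emPAI (Ishihama et al. 2005): the number of peptides observed for a protein is divided by the number of peptides theoretically observable for it, and the ratio is exponentiated. The function names and the example values are assumptions for illustration only; deciding which peptides count as observable (mass range, retention behaviour) is a separate step that is not shown here.

```python
def empai(n_observed, n_observable):
    """emPAI = 10 ** (observed peptides / observable peptides) - 1."""
    if n_observable <= 0:
        raise ValueError("a protein needs at least one observable peptide")
    return 10 ** (n_observed / n_observable) - 1


def molar_percent(empai_by_protein):
    """Express each protein's emPAI as a percentage of the summed emPAI values."""
    total = sum(empai_by_protein.values())
    return {name: 100.0 * value / total for name, value in empai_by_protein.items()}


# Hypothetical proteins: (observed peptides, observable peptides)
counts = {"protein_A": (4, 4), "protein_B": (3, 6), "protein_C": (0, 5)}
indices = {name: empai(obs, possible) for name, (obs, possible) in counts.items()}
content = molar_percent(indices)
```

A protein with no observed peptides receives an index of zero, and the exponential form reflects the roughly logarithmic relationship between observed peptide number and protein amount on which the index is based.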
Intensity-based methods align precursor ion spectra of the same peptide from parallel runs according to their retention times (RT), and protein quantification is acquired by summing ion intensities that have been matched to peptides for a given protein. This approach has been implemented by several commercial software packages such as Progenesis LC-MS (NonLinear Dynamics) and Protein Lynx Global Server (Waters), as well as open-source packages such as MaxQuant (Cox and Mann, 2008), OpenMS (Sturm et al. 2008) and MSight (Palagi et al. 2005).
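The matching and summing steps can be sketched in Python as follows. This is a deliberately simplified illustration and not the algorithm of any of the packages named above: it assumes the two runs have already been retention-time aligned, that features in the reference run carry a protein assignment, and that fixed m/z and RT tolerances (values chosen arbitrarily here) are adequate. All names and data are hypothetical.

```python
from collections import defaultdict


def match_features(reference, other, mz_tol=0.01, rt_tol=0.5):
    """Pair each reference feature with the closest-RT feature of the other run
    that lies within both the m/z and the retention-time tolerance."""
    pairs = []
    for ref in reference:
        candidates = [
            f for f in other
            if abs(f["mz"] - ref["mz"]) <= mz_tol and abs(f["rt"] - ref["rt"]) <= rt_tol
        ]
        if candidates:
            best = min(candidates, key=lambda f: abs(f["rt"] - ref["rt"]))
            pairs.append((ref, best))
    return pairs


def protein_intensities(pairs):
    """Sum matched peptide ion intensities per protein for both runs."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for ref, other in pairs:
        totals[ref["protein"]][0] += ref["intensity"]
        totals[ref["protein"]][1] += other["intensity"]
    return dict(totals)


# Hypothetical features: m/z, retention time (min), intensity
run_a = [
    {"mz": 500.250, "rt": 20.1, "intensity": 1000.0, "protein": "P1"},
    {"mz": 612.300, "rt": 35.4, "intensity": 400.0, "protein": "P1"},
    {"mz": 455.200, "rt": 12.0, "intensity": 250.0, "protein": "P2"},
]
run_b = [
    {"mz": 500.251, "rt": 20.3, "intensity": 2000.0},
    {"mz": 612.302, "rt": 35.2, "intensity": 800.0},
    {"mz": 700.000, "rt": 50.0, "intensity": 90.0},
]
totals = protein_intensities(match_features(run_a, run_b))
fold_change = {p: b / a for p, (a, b) in totals.items()}
```

Peptides with no partner in the second run (here the single peptide of P2) drop out of the comparison, which is one reason why run-to-run reproducibility matters so much for this approach.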
Label-based quantitative proteomics in host-parasite systems
Which quantitative approach to adopt depends on the nature of the host-parasite system under investigation. If the parasite can be cultured then in vivo metabolic labelling with a stable isotope can be achieved. SILAC (stable isotope labelling with amino acids in cell culture) has been extensively performed in cell culture, with virtually the entire proteomes of diploid and haploid yeast being compared using SILAC (de Godoy et al. 2008). There are only a few instances of in vivo metabolic labelling being used with parasites. Abundance changes in the proteome of the trophozoite stages of the malarial parasite Plasmodium falciparum following chloroquine and artemisinin treatment were examined using a stable isotope approach that used 14N-isoleucine and 13C6,15N1-isoleucine combined with a MudPIT peptide separation method (Prieto et al. 2008). The effect of the antibiotic paromomycin on the global proteomes of susceptible and resistant strains of the protozoan parasite Leishmania donovani was examined with SILAC, using the conventional 13C6 L-lysine-HCl and 13C6,15N4 L-arginine-HCl heavy isotopes (Chawla et al. 2011). Changes in the relative abundance and phosphorylation of protein components of the invasion motor complex during host cell invasion by the apicomplexan parasite Toxoplasma were also monitored by SILAC-based quantitative proteomics (Nebl et al. 2011). While some progress has been made with single-cell parasites in cell culture, larger multicellular parasites are less amenable to SILAC labelling. Recent work by two groups has developed methodologies to perform SILAC-based proteomics on the nematode C. elegans, using metabolically labelled E. coli as a food source in order to label the worms. Larance et al. (2011) characterized C. elegans protein abundance changes after heat shock treatment and Fredens et al. (2011) followed the C. elegans proteome response to the knockdown of the transcription factor nuclear hormone receptor 49 (NHR-49) with RNAi. Nearly 4700 proteins were identified (approx. 20% of predicted proteome) and 3470 of these quantified, with 330 significantly up- or down-regulated. SILAC-based quantification benefits from high accuracy and from the fact that the labelled ‘heavy’ proteome is essentially indistinguishable from the ‘light’ or normal proteome, so that the two can be combined early in the procedure, at the cell level or just after lysis. Less variation and fewer inaccuracies are therefore introduced during sample preparation and pre-fractionation before MS analysis. The success of these experiments opens up the possibility of using SILAC with parasitic nematodes.
Many parasites cannot easily be labelled in vivo, making SILAC an impractical approach. An alternative is to post-label by chemically modifying peptide or protein preparations using stable isotope tags or fluorescent dyes. Difference gel electrophoresis (DIGE) can be used to label up to three different samples, each with a different fluorescent dye. The samples are then mixed and analysed by 2D electrophoresis. Differences in protein abundance between the samples can be measured by excitation at different wavelengths, and gel images are matched and analysed by image analysis software such as DeCyder™ (GE Healthcare). This technique helps reduce the variability between samples that arises when they are run on separate 2DE gels. DIGE has been used to monitor the changes in the host cell proteome in response to invasion by Toxoplasma gondii, highlighting significant changes in key metabolic pathways and in post-translational protein modification (Nelson et al. 2008), to measure the key protein changes during Neospora differentiation (Marugán-Hernández et al. 2010) and to identify changes in the abundance of proteins involved in energy metabolism in the head proteome of the Anopheles mosquito after infection with Plasmodium (Lefevre et al. 2007). The plasma proteomes of several individuals infected with Leishmania donovani were compared to those of control individuals using DIGE (Rukmangadachar et al. 2011), identifying several putative biomarkers. Schistosoma japonicum schistosomula from hosts with differing susceptibility to the parasite were also examined using DIGE (Hong et al. 2011) and several proteins were shown to be differentially expressed between schistosomula, highlighting the adaptation of S. japonicum to different host environments.
The most widespread of the stable isotope tag techniques is iTRAQ. Up to eight different samples can be labelled with isobaric tags that react with the primary amino groups of peptides. The samples can then be mixed and analysed in the same MS run. During MS/MS the tags fragment to release reporter ions with a different mass for each tag, the intensities of which can be used to derive the relative abundance of the corresponding peptides in the starting samples. Protein abundance changes in the malaria parasite P. falciparum following doxycycline treatment were measured using iTRAQ (Briolant et al. 2010), as was the differential protein expression over the life stages of Trypanosoma congolense (Eyford et al. 2011). The protein abundance differential measured using iTRAQ was used to distinguish putative mitosomal proteins from co-purified contamination in Giardia intestinalis extracts (Jedelský et al. 2011). Combining quantitative iTRAQ proteomic profiling with transcriptomics showed that the expression of merozoite proteins in Plasmodium falciparum was regulated post-translationally during invasion pathway switching as an adaptation to variations of the host cell (Kuss et al. 2011).
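The way reporter ion intensities translate into relative abundance can be shown with a short Python sketch. It assumes that reporter intensities have already been extracted for each peptide-spectrum match, that one channel (labelled "114" here) serves as the reference, and that the median of the per-spectrum ratios is taken as the protein-level ratio. The median is an illustrative choice, not the method of any study cited above, and isotope impurity correction and between-channel normalisation are omitted.

```python
from statistics import median


def protein_ratios(psms, reference="114"):
    """Median reporter-ion ratio of each channel against the reference channel.

    psms: list of dicts mapping channel label -> reporter ion intensity,
          one dict per peptide-spectrum match assigned to the protein.
    """
    usable = [p for p in psms if p[reference] > 0]  # avoid division by zero
    if not usable:
        raise ValueError("no spectrum has a non-zero reference intensity")
    channels = [c for c in usable[0] if c != reference]
    return {c: median(p[c] / p[reference] for p in usable) for c in channels}


# Hypothetical reporter intensities for three spectra of one protein
spectra = [
    {"114": 100.0, "115": 200.0, "116": 50.0},
    {"114": 80.0, "115": 160.0, "116": 44.0},
    {"114": 120.0, "115": 252.0, "116": 60.0},
]
ratios = protein_ratios(spectra)
```

With these hypothetical values the protein appears about two-fold more abundant in the sample labelled 115, and about half as abundant in the sample labelled 116, relative to the reference sample.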
Label-free quantitative proteomics in parasite systems
Label-free methodologies are becoming an increasingly popular and widely used first approach to quantitative proteomics, although they do present statistical and bioinformatic challenges. The principle behind this technique is that two samples can be compared using the mass spectra alone, without the need to modify or label protein preparations. Spectral counting is a straightforward way to obtain semi-quantitative data on protein abundances within a sample and is often automatically performed on peptide identification data sets (e.g. emPAI in the Mascot search engine). In this way quantitative data can be obtained from intensive, highly fractionated shotgun identification proteomics. Schrimpf et al. (2009) identified more than half of the predicted C. elegans proteins and, using a modified spectral counting algorithm, estimated the abundances of over 1000 proteins. This information was used to validate gene models and to compare the abundance of orthologous proteins in another organism. Bennuru et al. (2011) identified approximately 60% of the predicted gene products from adults, microfilariae, L3 larvae and ES products of the lymphatic filarial worm Brugia malayi. Abundance was estimated using simple spectral counting. Several high-throughput proteomics studies have focused on the proteome of Schistosoma parasites. The proteomes of several developmental stages of S. japonicum, as well as tissues at the host-parasite interface, were characterized in tandem with transcriptomics (Liu et al. 2006), in parallel with a proteomic study of the host proteins that are associated with S. japonicum (Liu et al. 2007) and the parasite's excretory/secretory proteins (Liu et al. 2009). Label-free quantitative proteomics of the early gametocyte phase of P. falciparum showed that proteins involved in erythrocyte remodelling were enriched (Silvestrini et al. 2010).
An alternative to spectral counting is to align separate LC-MS/MS runs of peptide mixtures and to calculate the differences in intensities of the same peptides detected in each run. This approach tends to be more accurate than spectral counting but requires expensive instrumentation to ensure reproducibility. Software, both commercial and free, is available to perform the alignment and ion intensity comparison functions. To date no examples of its application to host-parasite interactions have been published, but the potential for increased accuracy with this approach means that its application is unlikely to be overlooked by the field.
Absolute protein quantification
The approaches described so far provide relative quantification. The precise determination of the concentration of a specific protein is known as absolute quantification. This is performed by stable isotope dilution, in which a reference standard incorporating a stable isotope is added in known amounts to the sample mixture. The reference peptide is in all respects the same as the analyte peptide apart from the mass difference due to the isotopic label. When analyzed by MS, the ratio of the intensities of the analyte and standard ions allows the concentration of the analyte to be calculated, since the concentration of the standard is known. The reference standards can be synthesized chemically and individually, e.g. AQUA peptides (Gerber et al. 2003), or expressed from synthetic genes in E. coli using stable isotopically enriched media, e.g. QconCAT (Beynon et al. 2005). Absolute quantification can provide the copy number of proteins in a cell under a certain state, an important input for systems biology modelling. These absolute values can be compared across several studies, including those from different groups, and be more easily integrated with transcriptomic and metabolic system data. The use of absolute quantification in parasite studies is only just starting to emerge. The relative and absolute amounts of schistosome tegument proteins have been determined using a QconCAT methodology (Castro-Borges et al. 2011). Of course any isotope dilution experiment requires that the analyte be already characterized (in contrast to discovery-based proteomics).
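The isotope dilution calculation described above reduces to a simple ratio, sketched here in Python. The sketch assumes that the unlabelled (‘light’) analyte and the labelled (‘heavy’) standard give an identical MS response, that digestion is complete, and that one peptide corresponds to one protein molecule; the function names and the numerical values are hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def analyte_amount_fmol(light_intensity, heavy_intensity, standard_fmol):
    """Amount of analyte from the light/heavy intensity ratio and the spiked-in standard."""
    if heavy_intensity <= 0:
        raise ValueError("the heavy standard was not detected")
    return (light_intensity / heavy_intensity) * standard_fmol


def copies_per_cell(amount_fmol, n_cells):
    """Convert an absolute amount in femtomoles into molecules per cell."""
    return amount_fmol * 1e-15 * AVOGADRO / n_cells


# Hypothetical example: 50 fmol of heavy standard spiked into a lysate of 1e6 cells
amount = analyte_amount_fmol(light_intensity=2.0e6, heavy_intensity=1.0e6, standard_fmol=50.0)
copies = copies_per_cell(amount, n_cells=1e6)
```

In this hypothetical case an analyte signal twice that of the standard corresponds to 100 fmol of analyte, or roughly 60 000 copies per cell, which is the kind of value that can be fed directly into a systems model.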
Targeted quantitative proteomics
So far we have discussed high-throughput proteomics approaches which have made no a priori assumptions as to the proteins to be analysed. Whilst this is highly valuable as a tool to help generate assumption-free hypotheses, it has the disadvantage of pushing separation techniques and instrumentation to its limits because of the diversity of protein species and the dynamic range of the targets. Such an approach is unlikely ever to unravel a truly complete proteome and is even less able to make good quantitative measurements across such a wide dynamic range. Targeted proteomics approaches are now being developed that enable the MS instrument measurements to be focused only on specific peptides from predetermined sets of selected proteins. These techniques have the potential for greater accuracy since the instrumentation and bioinformatics can be tuned to a relatively small sub-set of protein targets. Selected reaction monitoring (SRM) is typically used to selectively record fragmentation events that are specific for the peptides of interest (Lange et al. Reference Lange, Picotti, Domon and Aebersold2008; Bertsch et al. Reference Bertsch, Jung, Zerck, Pfeifer, Nahnsen, Henneges, Nordheim and Kohlbacher2010). Targeted proteomics allows a rapid and accurate quantitative profiling of a repeated set of proteins across samples from different conditions. A triple quadrupole mass spectrometer (QQQ) is used to achieve peptide targeting in SRM experiments. The first quadrupole is used to isolate precursor ions in a narrow mass range and the selected ions are then fragmented in the second quadrupole. The third quadrupole is used to specifically detect a set of fragment ions that is characteristic for the target peptides. This sequential isolation of targeted ions enables a great reduction in background noise and makes this approach the most sensitive MS strategy available. 
An application of this strategy has been used to obtain the absolute quantity of low abundance proteins in P. falciparum crude cell extracts (Southworth et al. Reference Southworth, Hyde and Sims2011).
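To make the idea of an SRM transition concrete, the sketch below computes the precursor m/z that the first quadrupole would isolate and a few fragment (y-ion) m/z values that the third quadrupole would monitor. It is an illustration only: the peptide sequence is an arbitrary example, the choice of y4 to y6 ions is an assumption, and real assays select transitions empirically and must also account for modifications and labelled standards.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def precursor_mz(peptide, charge=2):
    """m/z of the intact, unmodified peptide ion (selected in Q1)."""
    neutral = sum(MONO[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


def y_ion_mz(peptide, n):
    """m/z of the singly charged y-ion holding the last n residues (Q3)."""
    if not 1 <= n < len(peptide):
        raise ValueError("n must be between 1 and len(peptide) - 1")
    return sum(MONO[aa] for aa in peptide[-n:]) + WATER + PROTON


def srm_transitions(peptide, charge=2, y_ions=(4, 5, 6)):
    """List of (Q1 m/z, Q3 m/z) pairs for one target peptide."""
    q1 = precursor_mz(peptide, charge)
    return [(round(q1, 3), round(y_ion_mz(peptide, n), 3)) for n in y_ions]


transitions = srm_transitions("LVNELTEFAK")  # arbitrary example peptide
for q1, q3 in transitions:
    print(f"Q1 {q1:.3f} -> Q3 {q3:.3f}")
```

Each (Q1, Q3) pair is one transition; monitoring several transitions per peptide is what gives the method its specificity.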
PROTEIN MODIFICATIONS AND INTERACTIONS
Post-translational modifications
Proteins can undergo a great range of chemical modifications after translation. These post-translational modifications (PTMs) can determine protein localisation, activity state, turnover and structure, as well as interactions with other proteins, cells or organisms. While >1000 PTMs have been assembled in UNIMOD (www.unimod.org), with more likely to be found, PTMs are generally not well characterized in parasites. Understanding the roles of these PTMs in parasite regulation, survival and pathogenesis, as well as their contribution to the adaptation and evolution of the host, requires both highly sensitive and precise detection and reliable high-throughput methodologies to quantify protein changes in a complex mixture.
The low dynamic range, variable stability and sometimes transient nature of protein modifications, combined with the difficulty of relating these modifications to biological events, present a challenge to modern technologies. There are many proteomic approaches to studying PTMs, including bottom-up and top-down mass spectrometry, gel-based and gel-free techniques and affinity based methodologies.
Classically, gel-based techniques paired with mass spectrometry (MS) have been used to highlight PTMs. Two-dimensional electrophoresis separates proteins by their charge and molecular weight. The resolving power of this technique can separate differentially expressed modified forms of a given protein. Further selectivity can be achieved by using certain stains, metabolic labelling, antibodies or specific probes, which aid in the detection and identification of PTMs and have been used to identify phosphorylated proteins in erythrocytes infected by the human malaria parasite Plasmodium falciparum (Wu et al. Reference Wu, Nelson, Quaile, Xia, Wastling and Craig2009). Fluorescent or colorimetric stains for gels or western blots (e.g. Pro-Q Diamond stain, Invitrogen) allow simple selective detection of phosphoproteins (Nunes et al. Reference Nunes, Okada, Scheidig-Benatar, Cooke and Scherf2010). More sensitive techniques for phosphoprotein detection are the radiolabelling of proteins by 32P incorporation (Leykauf et al. Reference Leykauf, Treeck, Gilson, Nebl, Braulke, Cowman, Gilberger and Crabb2010) or immunoblotting (Wu et al. Reference Wu, Nelson, Quaile, Xia, Wastling and Craig2009). Glycosylated proteins can also be detected using specific stains such as Pro-Q Emerald (Invitrogen), conjugated lectins or differential glycosidase digestion (Rebello et al. Reference Rebello, Barros, Mota, Carvalho, Perales, Lenzi and Neves-Ferreira2011).
The scarcity of many PTMs requires the enrichment of the sub-population of selected modified proteins. Affinity based enrichment can be performed at the protein or peptide level and targets specific PTMs or groups of PTMs. Immobilised metal ion affinity chromatography (IMAC) utilises the affinity of chelated Fe(III) or Ga(III) ions for the phosphate group of phosphopeptides. Crude protein mixtures from Leishmania donovani extracts were enriched for phosphoproteins using IMAC, then digested with trypsin and analysed for life stage specific phosphoprotein abundance (Hem et al. Reference Hem, Gherardini, Osorio y Fortéa, Hourdel, Morales, Watanabe, Pescher, Kuzyk, Smith, Borchers, Zilberstein, Helmer-Citterich, Namane and Späth2010). Oxides of metals such as titanium, zirconium and aluminium can also be used to isolate phosphoproteins selectively. Anti-pSer/pThr/pTyr antibodies also facilitate the enrichment of phosphoproteins by immunoprecipitation followed by separation by 1-DE or 2-DE gels. Specific antibodies can also act to isolate other PTMs, for example in the global analysis of acetylation, methylation and nitration of peptides. Carbohydrate-binding proteins (lectins) are used to enrich for glycoproteins and glycopeptides using affinity chromatography. Affinity resins that bind polyubiquitin protein conjugates are commercially available.
PTMs can also be specifically targeted by chemical derivatization. Affinity tags can be introduced by beta elimination of phosphoric acid from pSer or pThr followed by the addition of affinity groups such as biotin to allow enrichment of phosphoproteins by chromatography. Solid phase extraction of glycopeptides can be achieved by immobilisation of carbohydrates to a hydrazide activated resin followed by release by PNGase F and analysis with LC-MS/MS. Hydrophilic interaction liquid chromatography (HILIC)-based methods can also be used to isolate glycopeptides. Recently hexapeptide libraries have been applied to large-scale glycomics analysis (Huhn et al. Reference Huhn, Ruhaak, Wuhrer and Deelder2011) and arrays have been used to profile glycans (Lepenies and Seeberger, Reference Lepenies and Seeberger2010; Lonardi et al. Reference Lonardi, Balog, Deelder and Wuhrer2010; Ruhaak et al. Reference Ruhaak, Zauner, Huhn, Bruggink, Deelder and Wuhrer2010).
Covalent PTMs of cysteine are mediators of redox regulation and signalling. Cysteines are involved in many biochemical reactions, are crucial in redox reactions and, when involved in disulfide bonds, influence protein structure and stability. S-nitrosylation, S-glutathionylation, palmitoylation and prenylation are all PTMs of cysteine found in parasites (Jortzik et al., Reference Jortzik, Wang and Becker2011).
Protein-protein interactions
Most cellular processes are governed by protein-protein interactions. These can range from the interaction of two proteins to the formation of large macromolecular complexes consisting of many different proteins in differing ratios. Interactions can be strong or transient. There are several methods for experimentally determining protein-protein interactions. The two most widely used are the yeast two-hybrid system and affinity purification coupled with MS (tandem affinity purification, TAP), where the C-terminus of a bait protein is fused to a TAP tag. The TAP tag consists of a calmodulin-binding peptide (CBP) and an IgG binding domain from protein A, separated by a TEV protease cleavage site. The TAP-tagged bait protein is isolated from a cell lysate using IgG-coated beads. After washing, the bait protein is released from the beads by incubating with TEV protease. A second round of purification uses calmodulin-coated beads to isolate the bait protein (and associated binding partners) via the CBP tag. Bound proteins are eluted and analysed with SDS-PAGE and mass spectrometry. The yeast two-hybrid system measures the interaction between a bait protein, which is fused to the DNA binding domain of the yeast protein Gal4, and a prey protein, which is fused to the transactivation domain of Gal4. When the bait and prey interact, a downstream reporter gene is activated. A variation of this technique has been used to investigate interaction networks in P. falciparum (LaCount et al., Reference LaCount, Vignali, Chettier, Phansalkar, Bell, Hesselberth, Schoenfeld, Ota, Sahasrabudhe, Kurschner, Fields and Hughes2005), identifying 2846 unique interactions involving 1312 proteins and highlighting a group of interacting proteins involved with host cell invasion, including 19 uncharacterized proteins.
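Counts such as "unique interactions" and "proteins involved" follow naturally once bait-prey pairs are treated as an undirected network. The sketch below shows one simple way of doing this; it is not the analysis used in the cited study, and the protein identifiers are hypothetical placeholders.

```python
from collections import defaultdict


def build_network(pairs):
    """Undirected interaction network from (bait, prey) pairs."""
    network = defaultdict(set)
    for bait, prey in pairs:
        if bait == prey:
            continue  # self-interactions are ignored in this sketch
        network[bait].add(prey)
        network[prey].add(bait)
    return network


def unique_interactions(network):
    """Each interaction counted once, whichever protein was the bait."""
    return {frozenset((a, b)) for a, partners in network.items() for b in partners}


# Hypothetical screen results; protA/protB was detected in both orientations.
pairs = [
    ("protA", "protB"),
    ("protB", "protA"),
    ("protA", "protC"),
    ("protD", "protC"),
]
network = build_network(pairs)
n_proteins = len(network)
n_interactions = len(unique_interactions(network))
hub = max(network, key=lambda p: len(network[p]))  # a most-connected protein
print(n_proteins, "proteins,", n_interactions, "unique interactions; hub:", hub)
```

Highly connected proteins, or groups of proteins that interact mainly with each other, are the kind of feature that can then be examined for shared function.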
BIOINFORMATICS RESOURCES FOR PARASITE PROTEOMICS
There are two major limiting steps in any proteomics experiment. One is the limit imposed by the MS instrumentation itself and the second, equally crucial, is the bioinformatic processing of the large quantity of data generated by modern proteomics platforms. The involvement of various bioinformatics tasks in processing and interpreting proteomics data is summarised in Fig. 2. The scale and complexity of the data generated by such a workflow means that it is essential to develop integrated database pipelines if complex proteomics data are generated for multiple host-parasite systems. Proteomics databases for parasites are an essential component of ensuring that these data are stored and rendered accessible for easy use by the community. Subsequently, the focus is on downstream interpretation in relation to protein function and localization prediction, pathway and network analysis, since these are the aspects of bioinformatics which have the potential to turn an elegant data gathering exercise into one which can reveal genuine insights into function. KEGG (Kanehisa and Goto, Reference Kanehisa and Goto2000), MetaCyc (Caspi et al. Reference Caspi, Altman, Dreher, Fulcher, Subhraveti, Keseler, Kothari, Krummenacker, Latendresse, Mueller, Ong, Paley, Pujar, Shearer, Travers, Weerasinghe, Zhang and Karp2011) and Reactome (Croft et al. Reference Croft, O'Kelly, Wu, Haw, Gillespie, Matthews, Caudy, Garapati, Gopinath, Jassal, Jupe, Kalatskaya, Mahajan, May, Ndegwa, Schmidt, Shamovsky, Yung, Birney, Hermjakob, D'Eustachio and Stein2011) are examples of tools developed to facilitate pathway browsing and data analysis.
Data repositories for parasite proteomics data
Several public repositories host proteomics data for the research communities, such as the Proteomics identifications database (PRIDE) (Jones et al. Reference Jones, Cote, Martens, Quinn, Taylor, Derache, Hermjakob and Apweiler2006), the Global Proteome Machine databases (GPMDB) (Craig et al. Reference Craig and Beavis2004) and PeptideAtlas (Desiere et al. Reference Desiere, Deutsch, King, Nesvizhskii, Mallick, Eng, Chen, Eddes, Loevenich and Aebersold2006). While these databases are useful for storage and re-querying of proteomics data generated, the integration of proteomics data with organism-specific genomic and proteomics resources provides an essential technical step to data interpretation (Xia et al. Reference Xia, Sanderson, Jones, Prieto, Yates, Bromley, Tomley, Lal, Sinden, Brunk, Roos and Wastling2008). For parasites the most advanced example of this is the hosting of proteomics data in EuPathDB (Aurrecoechea et al. Reference Aurrecoechea, Brestelli, Brunk, Fischer, Gajria, Gao, Gingle, Grant, Harb, Heiges, Innamorato, Iodice, Kissinger, Kraemer, Li, Miller, Nayak, Pennington, Pinney, Roos, Ross, Srinivasamoorthy, Stoeckert, Thibodeau, Treatman and Wang2010). Although these data deal with only protozoan parasites, the easily accessible format has opened up proteomics data to the entire research community in a way that was difficult to envisage, even a few years ago. Proteomics data repositories for helminths are generally less unified and less well resourced, possibly reflecting the fact that the respective genome sequencing projects lag behind those of the protozoa, although this is likely to change in the near future. WormBase (Yook et al. Reference Yook, Harris, Bieri, Cabunoc, Chan, Chen, Davis, de la Cruz, Duong, Fang, Ganesan, Grove, Howe, Kadam, Kishore, Lee, Li, Muller, Nakamura, Nash, Ozersky, Paulini, Raciti, Rangarajan, Schindelman, Shi, Schwarz, Ann Tuli, Van Auken, Wang, Wang, Williams, Hodgkin, Berriman, Durbin, Kersey, Spieth, Stein and Sternberg2011), for example, now supports 15 helminth species with growing proteomics data resources.
Proteomics resources at EuPathDB
EuPathDB acts as a portal to eukaryotic pathogens (Aurrecoechea et al. Reference Aurrecoechea, Brestelli, Brunk, Fischer, Gajria, Gao, Gingle, Grant, Harb, Heiges, Innamorato, Iodice, Kissinger, Kraemer, Li, Miller, Nayak, Pennington, Pinney, Roos, Ross, Srinivasamoorthy, Stoeckert, Thibodeau, Treatman and Wang2010). It is an integrated genome database composed of a family of dedicated pathogen databases including PlasmoDB, ToxoDB (also serves Neospora caninum), CryptoDB, GiardiaDB, TrichDB, TriTrypDB, AmoebaDB and MicrosporidiaDB. More detailed introductions to these databases have been published through online tutorials, individual websites and journal publications (Heiges et al. Reference Heiges, Wang, Robinson, Aurrecoechea, Gao, Kaluskar, Rhodes, Wang, He, Su, Miller, Kraemer and Kissinger2006; Gajria et al. Reference Gajria, Bahl, Brestelli, Dommer, Fischer, Gao, Heiges, Iodice, Kissinger, Mackey, Pinney, Roos, Stoeckert, Wang and Brunk2008; Aurrecoechea et al. Reference Aurrecoechea, Brestelli, Brunk, Carlton, Dommer, Fischer, Gajria, Gao, Gingle, Grant, Harb, Heiges, Innamorato, Iodice, Kissinger, Kraemer, Li, Miller, Morrison, Nayak, Pennington, Pinney, Roos, Ross, Stoeckert, Sullivan, Treatman and Wang2009a, Reference Aurrecoechea, Brestelli, Brunk, Dommer, Fischer, Gajria, Gao, Gingle, Grant, Harb, Heiges, Innamorato, Iodice, Kissinger, Kraemer, Li, Miller, Nayak, Pennington, Pinney, Roos, Ross, Stoeckert, Treatman and Wangb, Reference Aurrecoechea, Brestelli, Brunk, Fischer, Gajria, Gao, Gingle, Grant, Harb, Heiges, Innamorato, Iodice, Kissinger, Kraemer, Li, Miller, Nayak, Pennington, Pinney, Roos, Ross, Srinivasamoorthy, Stoeckert, Thibodeau, Treatman and Wang2010; Aslett et al. Reference Aslett, Aurrecoechea, Berriman, Brestelli, Brunk, Carrington, Depledge, Fischer, Gajria, Gao, Gardner, Gingle, Grant, Harb, Heiges, Hertz-Fowler, Houston, Innamorato, Iodice, Kissinger, Kraemer, Li, Logan, Miller, Mitra, Myler, Nayak, Pennington, Phan, Pinney, Ramasamy, Rogers, Roos, Ross, Sivam, Smith, Srinivasamoorthy, Stoeckert, Subramanian, Thibodeau, Tivey, Treatman, Velarde and Wang2010). In this review, a selection of the important features that are relevant to parasite proteomics research will be highlighted.
Proteomics data of T. gondii and C. parvum pioneered the full integration of proteomics data into EuPathDB (Sanderson et al. Reference Sanderson, Xia, Prieto, Yates, Heiges, Kissinger, Bromley, Lal, Sinden, Tomley and Wastling2008; Xia et al. Reference Xia, Sanderson, Jones, Prieto, Yates, Bromley, Tomley, Lal, Sinden, Brunk, Roos and Wastling2008). The latest version of EuPathDB (v2.12) hosts proteomics data for 26,035 proteins from 16 species (http://eupathdb.org/eupathdb/). Data analysis tools have been developed to facilitate the browsing, functional prediction and comparison of proteomics data with other types of genomic data.
Fig. 3 shows how protein expression data for particular genes can be viewed on individual gene record pages using ToxoDB v7.2 as an example, where colour coded peptides are mapped to the gene sequence according to the experiments that identified them. Protein expression evidence can also be queried according to the experiments and samples using the ‘Identify Genes based on Mass Spec. Evidence.’ tool (Fig. 4). Once a group of proteins of interest has been acquired, based either on existing proteomics studies or user supplied lists, additional data analysis tools can be used to interact with genome information, functional predictions and other types of genome wide ‘omics’ data using the ‘Add Step.’ tool in the results page. Fig. 4 also shows the comparison of proteomics data with mRNA expression data from RNA sequencing experiments, where the relationship of the two datasets can be analysed side by side. This function has vastly improved the interaction of proteomics data with existing knowledge and other genomics data and is a first step to enabling a truly ‘trans-omics’ approach to studying parasite biology. In addition to these text based tools, the Generic Genome Browser (GBrowse) has also been incorporated in ToxoDB to improve visualization of data mining. GBrowse is a web-based application for displaying genomic annotations and other features developed by GMOD (Generic Model Organism Database project) (Stein et al. Reference Stein, Mungall, Shu, Caudy, Mangone, Day, Nickerson, Stajich, Harris, Arva and Lewis2002). The implementation in ToxoDB allows expressed peptides to be visualized in relation to various gene models and the genomic region from which the sequence is predicted to have been produced. Fig. 4 shows the peptides identified from one of the proteomics datasets aligned with unified MS/MS peptides and one RNA-Seq dataset for a particular gene, TGME49_100100.
A more recent development involves the use of an automated proteogenomic pipeline for integration of mass spectrometry (MS) based proteomics evidence into genome databases (Krishna et al. Reference Krishna, Wastling and Jones2011). The pipeline uses MS data for confirming official gene models on the database, but also examines whether there is supporting evidence for alternate annotations at particular loci and for identifying novel genes. The pipeline is currently being used to assist proteomics data integration and gene annotation for a number of EuPathDB supported species, including T. gondii, N. caninum and C. parvum. Similar to other generic tools developed in EuPathDB, the algorithms used in the pipeline are not specific to any organism, thus enabling the pipeline to be used in conjunction with any genome.
Around 590 T. gondii genes are annotated to ToxoDB metabolic pathways, which were automatically reconstructed from KEGG pathway maps. Many of the pathways are not organism specific and are based on the presence of one or two enzymes present in other pathways or on less precise partial EC number annotations. A recent effort has been made to develop a manually curated metabolic pathways database for apicomplexan parasites using biochemical and physiological evidence available in the literature as well as proteomics evidence and gene annotations available on EuPathDB. The result is currently hosted at the Liverpool Library of Apicomplexan Metabolic Pathways (www.llamp.net).
With the increasing availability of quantitative parasite proteomics data and that collected from host cell systems, new interfaces need to be developed to enable analysis of these data. While quantitative proteomic data can probably be visualised in a similar way to transcriptomic data, a lack of data analysis tools for other protein features is holding back the wider integration and analysis of multiple ‘omics’ data. Not least is the complete lack of a platform on which host response data can be displayed and analysed alongside parasite data – something that is essential if we are to use these tools to answer questions about host-parasite interactions.
In a recent development outside EuPathDB, MOPED (Model Organism Protein Expression Database) provides for the rapid browsing of protein expression information in several model organisms including Caenorhabditis elegans (Kolker et al. Reference Kolker, Higdon, Haynes, Welch, Broomall, Lancet, Stanberry and Kolker2012). It also offers data comparison tools to produce overlap plots and heatmaps between existing data and user-uploaded data with user-specified expression value thresholding (Kolker et al. Reference Kolker, Higdon, Haynes, Welch, Broomall, Lancet, Stanberry and Kolker2012). A local data analysis tool, GProX (Graphical Proteomics Data Explorer), is also available for comprehensive analysis, inspection and visualization of quantitative proteomics data (Rigbolt et al. Reference Rigbolt, Vanselow and Blagoev2011). Although MOPED and GProX are currently limited to proteomics data only, a similar interface could be readily adapted on genome databases to facilitate large scale quantitative trans-omics studies.
Post-identification bioinformatic analysis
Once protein identification and quantification have been completed, signature-based resources can be used to infer function and subcellular localization where one or more protein signatures can be identified. Protein signatures are defined by either a regular expression method that shows patterns of conserved amino acid residues (Sigrist et al. Reference Sigrist, Cerutti, Hulo, Gattiker, Falquet, Pagni, Bairoch and Bucher2002) or the Hidden Markov Model (HMM) method which provides a statistical profile based on probabilities of finding an amino acid at a given position in the sequence (Krogh et al. Reference Krogh, Brown, Mian, Sjolander and Haussler1994). There are many publicly available signature databases of protein families and domains, including sequence-based PROSITE (Sigrist et al. Reference Sigrist, Cerutti, Hulo, Gattiker, Falquet, Pagni, Bairoch and Bucher2002), Pfam (Finn et al. Reference Finn, Tate, Mistry, Coggill, Sammut, Hotz, Ceric, Forslund, Eddy, Sonnhammer and Bateman2008), PRINTS (Attwood et al. Reference Attwood, Bradley, Flower, Gaulton, Maudling, Mitchell, Moulton, Nordle, Paine, Taylor, Uddin and Zygouri2003), PANTHER (Mi et al. Reference Mi, Guo, Kejariwal and Thomas2007) and structure-based SUPERFAMILY (Wilson et al., Reference Wilson, Madera, Vogel, Chothia and Gough2007) and Gene3D (Yeats et al. Reference Yeats, Maibaum, Marsden, Dibley, Lee, Addou and Orengo2006). Protein signatures can be used in combination to predict protein functions. For example, proteins with no significant sequence similarity but which have similar functions might be expected to share some common features like post-translational modifications, protein-sorting signals and similar sub-cellular localizations. In parasitology research, the identification of signal peptides and transmembrane domains is of special interest.
The entry of virtually all proteins into the secretory pathway is controlled by signal peptides (Gierasch, Reference Gierasch1989; von Heijne, Reference von Heijne1990) and transmembrane proteins support essential biological functions acting as receptors, transporters or channels, which are essential in host-parasite interactions (Dowse and Soldati, Reference Dowse and Soldati2005; O'Donnell et al. Reference O'Donnell, Hackett, Howell, Treeck, Struck, Krnajski, Withers-Martinez, Gilberger and Blackman2006; Baxt et al. Reference Baxt, Baker, Singh and Urban2008).
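The regular expression style of signature mentioned above can be illustrated with a small converter from PROSITE pattern syntax to a Python regular expression. This is a simplified sketch that handles only the common syntax elements (x for any residue, square brackets for allowed residues, curly braces for excluded residues, parentheses for repeat counts and the < and > terminal anchors). The pattern used, N-{P}-[ST]-{P}, is the PROSITE N-glycosylation site motif; the protein sequence is invented for the example.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # excluded residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                      # any residue
        element = element.replace("<", "^").replace(">", "$")    # terminal anchors
        parts.append(element)
    return "".join(parts)


def find_sites(pattern, sequence):
    """1-based start positions of all matches, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() + 1 for m in regex.finditer(sequence)]


n_glyc = "N-{P}-[ST]-{P}"
sequence = "MKNGSANPTLLNHTQ"  # invented toy sequence
print(prosite_to_regex(n_glyc))
print(find_sites(n_glyc, sequence))
```

The asparagine at position 7 of the toy sequence is not reported because it is followed by proline, which the pattern excludes. HMM-based signatures, by contrast, score every position probabilistically and cannot be reduced to a single expression of this kind.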
Universal software packages have been developed to predict certain protein features based on a set of trained rules, such as SignalP for signal peptide prediction (Bendtsen et al. Reference Bendtsen, Nielsen, von Heijne and Brunak2004), TMHMM for transmembrane domain prediction (Krogh et al. Reference Krogh, Larsson, von Heijne and Sonnhammer2001) and PSORTb for general sub-cellular localization prediction (Yu et al. Reference Yu, Wagner, Laird, Melli, Rey, Lo, Dao, Sahinalp, Ester, Foster and Brinkman2010). Additional organism-specific prediction tools and databases have also been developed to predict important features in the organism under study, with more targeted training data. Examples for apicomplexan parasites include PSEApred (Verma et al. Reference Verma, Tiwari, Kaur, Varshney and Raghava2008), PlasMit (Bender et al. Reference Bender, van Dooren, Ralph, McFadden and Schneider2003) and ApiLoc (http://apiloc.biochem.unimelb.edu.au/apiloc/apiloc).
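As a rough illustration of what sequence-based feature prediction involves, the sketch below scans a sequence with a sliding-window Kyte-Doolittle hydropathy average, a classical heuristic for flagging possible membrane-spanning segments. It is emphatically not the method used by TMHMM or SignalP, which rely on trained statistical models; the window length of 19 and the threshold of 1.6 are conventional choices for this heuristic, and the test sequence is invented.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every window of the given length."""
    scores = []
    for i in range(len(sequence) - window + 1):
        segment = sequence[i:i + window]
        scores.append(sum(KD[aa] for aa in segment) / window)
    return scores


def candidate_tm_windows(sequence, window=19, threshold=1.6):
    """1-based start positions of windows hydrophobic enough to flag."""
    profile = hydropathy_profile(sequence, window)
    return [i + 1 for i, score in enumerate(profile) if score >= threshold]


# Invented sequence: a 19-residue hydrophobic stretch between polar ends.
toy = "MKTDE" + "LLIVALLVAIGLVLLAVIA" + "KRDESNQ"
hits = candidate_tm_windows(toy)
print(hits)
```

A heuristic of this kind cannot distinguish a signal peptide from a transmembrane helix and has no notion of topology, which is one reason why trained, and where possible organism-specific, predictors are preferred in practice.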
TRANSCRIPTOMICS AND PROTEOMICS IN PARASITE SYSTEMS
The gene expression process can be simply summarized using the central dogma of Gene-Transcription-Translation. However, the many levels of control and regulation that act during this process introduce uncertainty to the system, so that a simple one-to-one relationship between mRNA and protein expression is not achieved. Early comparisons between transcriptomics data and proteomics data have generally indicated a weak correlation (de Sousa Abreu et al. Reference de Sousa Abreu, Penalva, Marcotte and Vogel2009; Maier et al. Reference Maier, Guell and Serrano2009). The same phenomenon has also been observed in some parasites and has been summarized in reviews of model systems such as Apicomplexa (Kooij et al. Reference Kooij, Janse and Waters2006; Wastling et al. Reference Wastling, Xia, Sohal, Chaussepied, Pain and Langsley2009) and Schistosoma (Hokke et al. Reference Hokke, Fitzpatrick and Hoffmann2007). These studies highlighted the discrepancies between mRNA and protein expression and the important involvement of the regulation of expression, which is likely to involve biological explanations such as selective protein degradation and variations in protein turn-over rates (Yen et al. Reference Yen, Xu, Chou, Zhao and Elledge2008; Doherty et al. Reference Doherty, Hammond, Clague, Gaskell and Beynon2009) as well as post-transcriptional regulation such as mRNA decay and translational repression (Hakimi and Deitsch, Reference Hakimi and Deitsch2007; Shock et al. Reference Shock, Fischer and DeRisi2007; Filipowicz et al. Reference Filipowicz, Bhattacharyya and Sonenberg2008).
Beyond these biological reasons, technical factors have also restricted full insight into the correlation between transcriptome and proteome, namely incomplete transcriptomics surveys and rather basic quantitative proteomics techniques. The lack of simultaneously collected samples for both proteomics and transcriptomics analyses also contributes to the discrepancies observed. The recent introduction of high-throughput sequencing of mRNA and microRNA (Hall, Reference Hall2007; Lister et al. Reference Lister, Gregory and Ecker2009; Wang et al. Reference Wang, Gerstein and Snyder2009), and the development of quantitative proteomics techniques, in particular the absolute quantification methods, have significantly improved our ability to measure the correlation between transcriptome and proteome in biological systems. Studies carried out by Schwanhäusser et al. (Reference Schwanhausser, Busse, Li, Dittmar, Schuchhardt, Wolf, Chen and Selbach2011) on mouse NIH3T3 cells and Nagaraj et al. (Reference Nagaraj, Wisniewski, Geiger, Cox, Kircher, Kelso, Paabo and Mann2011) on human HeLa cells are among the first large-scale comparisons between RNA-Seq data and intensity-based absolute quantification (iBAQ) proteomics data from simultaneously collected samples and report correlation coefficients of between 0·41 and 0·6 (Spearman) (Nagaraj et al. Reference Nagaraj, Wisniewski, Geiger, Cox, Kircher, Kelso, Paabo and Mann2011; Schwanhausser et al. Reference Schwanhausser, Busse, Li, Dittmar, Schuchhardt, Wolf, Chen and Selbach2011).
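For readers unfamiliar with the statistic, a Spearman coefficient of this kind is simply a Pearson correlation calculated on the ranks of the paired mRNA and protein abundances. The sketch below implements it in plain Python with average ranks for ties; the abundance values are invented for illustration and are not taken from either study.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation between paired measurements."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists of at least 2 values")
    return pearson(average_ranks(x), average_ranks(y))


# Invented values for six genes: mRNA abundance and protein copies per cell.
mrna = [12.0, 150.0, 3.5, 48.0, 900.0, 27.0]
protein = [4.0e3, 9.0e3, 1.5e3, 6.0e4, 2.0e5, 2.0e3]
rho = spearman(mrna, protein)
print(f"Spearman rho = {rho:.2f}")
```

Because only ranks are used, the coefficient is insensitive to the very different scales and skewed distributions of transcript and protein abundance data, which is why it is commonly preferred for this comparison.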
CONCLUSIONS
In the first decade of the 21st Century we have witnessed the beginnings of a new era in biomedical research in which we see scientific reductionism being challenged by the availability of vast collections of biological information. Much of this information, gathered in defined experimental contexts, constitutes genomic and genome-related expression data generated by technologies such as transcriptomics and proteomics. Together with other related data such as metabolomics, glycomics and lipidomics, attention has turned to ways in which individual streams of information can be processed as a whole, rather than remaining as isolated descriptors of the action of individual components. The desire to combine these data into a unified system is in part driven by the pragmatic view that biological events are the result of the concerted action of individual system components. However, more than that, biological systems also exhibit emergent properties, and these properties cannot be fully predicted by the sum of the component parts. In host-parasite interactions, emergent properties are likely to be even more complex because the host-parasite relationship is a product of the interaction of two genetically distinct biological systems. Such systems interact in often unpredictable ways – after all, if they were completely predictable, science would have been far more successful in developing vaccines and drugs that were less susceptible to resistance, and in solving some of the paradoxes of parasite evolutionary biology. It is an interesting observation that the most successful vaccines are live-attenuated vaccines whose development has often by-passed a detailed understanding of the ‘black-box’ of the host-parasite system itself. When we try to break our knowledge down into its component parts, to develop sub-unit vaccines for example, the outcomes are far less impressive.
Moreover, efforts to understand the system as a whole must extend not just in the scope of the data, but in scale also; instead of considering the system only at an individual organism level we must be cognisant of population responses of both parasites and hosts.
So has Descartes' reductionist vision really had its time – soon to be eclipsed by a return to a holistic view of the system? And if so, what are the implications for the facilitating technologies of systems biology such as proteomics? As with most paradigm shifts, any change will represent more of an evolution than a revolution. Few will seriously contend that we no longer need to examine the fine detail of discrete biological components and their biochemical actions. On the contrary, experiments on discrete gene function are arguably as important as ever since reliable information on protein function, localisation and modification is an essential element in generating accurate systems models. It is worth considering also the vast number of genes in parasite genome databases that are still annotated as ‘unknown function’, even though protein expression data generated by proteomics experiments such as those described here show clearly that they exist as functional molecules. Similarly, our understanding of protein expression, including information on post-translational modifications and absolute protein expression and turnover, requires far more refinement before we can be confident that it will fulfil its role in meaningful systems modelling. As we have seen in this review, coverage of accurate quantitative proteomics is still relatively poor and focused on proteins with high to medium level expression. A trade-off still exists between our ability to identify proteins (which we can do in large numbers) and our ability to make accurate quantitative measurements. Finally, the non-linear relationship between transcription, protein expression and activity still needs to be defined in host and parasite. These advances will require continued advancement of instrumentation and concomitant development in proteome bioinformatics as dramatic as any of those we have seen in the last decade.
ACKNOWLEDGEMENTS
We thank the British Society for Parasitology for support in publishing this review.