Hostname: page-component-586b7cd67f-vdxz6 Total loading time: 0 Render date: 2024-11-28T00:43:43.211Z Has data issue: false hasContentIssue false

Functional genome annotation of Drosophila seminal fluid proteins using transcriptional genetic networks

Published online by Cambridge University Press:  22 December 2011

JULIEN F. AYROLES
Affiliation:
Department of Genetics, North Carolina State University, Raleigh, NC 27695, USA W. M. Keck Center for Behavioral Biology, North Carolina State University, Raleigh, NC 27695, USA
BROOKE A. LAFLAMME
Affiliation:
Departments of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853-2703, USA
ERIC A. STONE
Affiliation:
Department of Genetics, North Carolina State University, Raleigh, NC 27695, USA W. M. Keck Center for Behavioral Biology, North Carolina State University, Raleigh, NC 27695, USA
MARIANA F. WOLFNER
Affiliation:
Departments of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853-2703, USA
TRUDY F. C. MACKAY*
Affiliation:
Department of Genetics, North Carolina State University, Raleigh, NC 27695, USA W. M. Keck Center for Behavioral Biology, North Carolina State University, Raleigh, NC 27695, USA
*
*Corresponding author: North Carolina State University, Raleigh, NC, USA.
Rights & Permissions [Opens in a new window]

Summary

Predicting functional gene annotations remains a significant challenge, even in well-annotated genomes such as yeast and Drosophila. One promising, high-throughput method for gene annotation is to use correlated gene expression patterns to annotate target genes based on the known function of focal genes. The Drosophila melanogaster transcriptome varies genetically among wild-derived inbred lines, with strong genetic correlations among the transcripts. Here, we leveraged the genetic correlations in gene expression among known seminal fluid protein (SFP) genes and the rest of the genetically varying transcriptome to identify 176 novel candidate SFPs (cSFPs). We independently validated the correlation in gene expression between seven of the cSFPs and a known SFP gene, as well as expression in male reproductive tissues. We argue that this method can be extended to other systems for which information on genetic variation in gene expression is available.

Type
Research Papers
Copyright
Copyright © Cambridge University Press 2011

1. Introduction

The diminishing cost of high-throughput technologies such as whole genome transcript profiling, high-density genotyping and whole genome re-sequencing has shifted the focus of genomic sciences from data production to data interpretation. Foremost among the challenges in interpretation is functional gene annotation, through experimental validation or computational prediction. Even for the best-annotated genomes, a significant proportion of genes are yet to be functionally characterized (Peña-Castillo & Hughes, Reference Peña-Castillo and Hughes2007; Costello et al., Reference Costello, Dalkilic, Beason, Gehlhausen, Patwardhan, Middha, Eads and Andrews2009); less than half in Drosophila (Costello et al., Reference Costello, Dalkilic, Beason, Gehlhausen, Patwardhan, Middha, Eads and Andrews2009).

Most knowledge regarding gene function in eukaryotes comes from mutagenesis, single-gene knock-outs and RNA interference (RNAi) knock-down experiments performed in yeasts, Drosophila, Caenorhabditis elegans, mouse and Arabidopsis (Winzeler et al., Reference Winzeler, Shoemaker, Astromoff, Liang, Anderson, Andre, Bangham, Benito, Boeke, Bussey, Chu, Connelly, Davis, Dietrich, Dow, El Bakkoury, Foury, Friend, Gentalen, Giaever, Hegemann, Jones, Laub, Liao, Liebundguth, Lockhart, Lucau-Danila, Lussier, M'Rabet, Menard, Mittmann, Pai, Rebischung, Revuelta, Riles, Roberts, Ross-MacDonald, Scherens, Snyder, Sookhai-Mahadeo, Storms, Véronneau, Voet, Volckaert, Ward, Wysocki, Yen, Yu, Zimmermann, Philippsen, Johnston and Davis1999; Alonso et al., Reference Alonso, Stepanova, Leisse, Kim, Chen, Shinn, Stevenson, Zimmerman, Barajas, Cheuk, Gadrinab, Heller, Jeske, Koesema, Meyers, Parker, Prednis, Ansari, Choy, Deen, Geralt, Hazari, Hom, Karnes, Mulholland, Ndubaku, Schmidt, Guzman, Aguilar-Henonin, Schmid, Weigel, Carter, Marchand, Risseeuw, Brogden, Zeko, Crosby, Berry and Ecker2003; Kamath & Ahringer, Reference Kamath and Ahringer2003; Bellen et al., Reference Bellen, Levis, Liao, He, Carlson, Tsang, Evans-Holm, Hiesinger, Schulze, Rubin, Hoskins and Spradling2004; Dietzl et al., Reference Dietzl, Chen, Schnorrer, Su, Barinova, Fellner, Gasser, Kinsey, Oppel, Scheiblauer, Couto, Marra, Keleman and Dickson2007; Ni et al., Reference Ni, Liu, Binari, Hardy, Shim, Cavallaro, Booker, Pfeiffer, Markstein, Wang, Villalta, Laverty, Perkins and Perrimon2009; Guan et al., Reference Guan, Ye, Yang and Gao2010; Spirek et al., Reference Spirek, Benko, Carnecka, Rumpf, Cipak, Batova, Marova, Nam, Kim, Park, Hayles, Hoe, Nurse and Gregan2010). These approaches have provided functions for a large number of genes in many organisms and the basis for making gene function predictions based on gene sequence similarities. However, screening large mutant collections for quantitative phenotypes is highly laborious. Furthermore, unique mutations in the same gene, or the same mutation in multiple genetic backgrounds can give different phenotypes, further complicating the interpretation of such screens (Flint & Mackay, Reference Flint and Mackay2009; Mackay et al., Reference Mackay, Stone and Ayroles2009; Dowell et al., Reference Dowell, Ryan, Jansen, Cheung, Agarwala, Danford, Bernstein, Rolfe, Heisler, Chin, Nislow, Giaever, Phillips, Fink, Gifford and Boone2010).

Computational methods for gene annotation complement experimental approaches. Computational methods rely on the detection of particular sequence motifs (e.g. a binding domain) (Hrmova & Fincher, Reference Hrmova and Fincher2009); strong orthology with a gene of known function in a closely related species; or ‘guilt-by-association’ (Bréhélin et al., Reference Bréhélin, Florent, Gascuel and Maréchal2010). The last approach is based on correlative evidence, such as the co-regulation of gene expression or the existence of known protein–protein interactions. In all cases, the functional annotation of a known gene is transferred to its interacting or correlated partner, providing a hypothesis that can be verified experimentally.

Traditionally, guilt-by-association annotation has been used in the context of environmental perturbations (Walker et al., Reference Walker, Volkmuth, Sprinzak, Hodgson and Klingler1999; Reverter et al., Reference Reverter, Ingham and Dalrymple2008; Vandepoele et al., Reference Vandepoele, Quimbaya, Casneuf, De Veylder and Van de Peer2009; Klie et al., Reference Klie, Nikoloski and Selbig2010). A complementary approach is to utilize natural variation in genetically correlated transcriptional networks to identify co-regulated transcripts. Previously, we used genome wide transcript profiles from 40 lines from the Drosophila Genetic Reference Panel (DGRP; Ayroles et al., Reference Ayroles, Carbone, Stone, Jordan, Lyman, Magwire, Rollmann, Duncan, Lawrence, Anholt and Mackay2009), a set of inbred lines recently derived from the wild, as a source of genetic variation in gene expression. The genetic variation among these inbred lines greatly exceeds that which can be obtained by mutagenesis screens or standard genetic crosses, while sampling multiple genetically identical individuals from each line reduces environmental variance. The genetically variable transcripts are highly correlated among the lines, forming 241 transcriptional co-expression modules (Ayroles et al., Reference Ayroles, Carbone, Stone, Jordan, Lyman, Magwire, Rollmann, Duncan, Lawrence, Anholt and Mackay2009). These co-expression modules were enriched for common Gene Ontology (GO) categories, expression in the same tissues, common transcriptional factor binding sites and associations of gene expression with the same quantitative traits. These observations suggest that genetic correlation of gene expression with a co-expression module may be due to co-regulation and that transcripts genetically correlated with a target gene of known function are plausibly involved in the same biological process or molecular function as the target gene (Luo et al., Reference Luo, Yang, Zhong, Gao, Khan, Thompson and Zhou2007; Ayroles et al., Reference Ayroles, Carbone, Stone, Jordan, Lyman, Magwire, Rollmann, Duncan, Lawrence, Anholt and Mackay2009). Here, we test this hypothesis using seminal fluid proteins (SFPs) as the focal genes.

We chose SFPs as focal genes for two reasons. First, many of the gene products of the secretory tissues of the male reproductive tract that produce the SFPs are well understood in Drosophila melanogaster (Wolfner, Reference Wolfner2009). This is especially true for the male accessory glands (AGs), which produce proteins collectively known as ACcessory gland Proteins (ACPs). ACPs are transferred to females in the seminal fluid and affect a number of post-mating processes (Wolfner, Reference Wolfner2009), including sperm storage and maintenance (Neubaum & Wolfner, Reference Neubaum and Wolfner1999; Tram & Wolfner, Reference Tram and Wolfner1999; Ravi Ram & Wolfner, Reference Ravi Ram and Wolfner2007, Reference Ravi Ram and Wolfner2009), egg production and mating receptivity (Heifetz et al., Reference Heifetz, Lung, Frongillo and Wolfner2000; Chapman et al., Reference Chapman, Bangham, Vinti, Seifried, Lung, Wolfner, Smith and Partridge2003; Liu and Kubli, Reference Liu and Kubli2003), female feeding behaviour (Carvalho et al., Reference Carvalho, Kapahi, Anderson and Benzer2006) and sleep patterns (Isaac et al., Reference Isaac, Li, Leedale and Shirras2010). Proteomic (Findlay et al., Reference Findlay, Yi, Maccoss and Swanson2008, Reference Findlay, MacCoss and Swanson2009) and gene expression (Swanson et al., Reference Swanson, Clark, Waldrip-Dail, Wolfner and Aquadro2001) studies have identified 187 SFPs, most of which are ACPs. Second, we observed strong genetic correlations in expression among the known ACPs (Ayroles et al., Reference Ayroles, Carbone, Stone, Jordan, Lyman, Magwire, Rollmann, Duncan, Lawrence, Anholt and Mackay2009), suggesting that new SFPs, and potentially genes important for the production or function of these proteins, could be found by analysing the correlation structure between genetically variable transcripts.

Using the DGRP expression data (Ayroles et al., Reference Ayroles, Carbone, Stone, Jordan, Lyman, Magwire, Rollmann, Duncan, Lawrence, Anholt and Mackay2009), we identified transcripts whose expression patterns correlated with known SFPs. These correlated transcripts are candidates for both previously unknown SFPs and genes that are required for regulation of SFP production. Very little is known about how SFP genes are regulated in the male; this method provides a means to identify candidate regulatory genes for further study. As a proof of principle, the only known transcription factor required for the expression of specific SFP genes (Xue & Noll, Reference Xue and Noll2002) was among the candidate genes we identified. Although proteins encoded by regulatory genes would not necessarily be transferred to females during mating, and are therefore not SFPs per se, we refer to our set of candidate SFPs as cSFPs.

We identified 176 cSFP genes. For validation, we selected seven candidates with varying levels of correlation to known SFP genes and used quantitative real-time PCR (qRT-PCR) to validate the correlation patterns. We also used RT-PCR to test the tissue of expression for these seven genes. We propose that this method can be widely applied to similar datasets, beyond the example of the SFP functional annotation we present.

2. Methods

(i) Gene expression data

The gene expression data are from Ayroles et al. (Reference Ayroles, Carbone, Stone, Jordan, Lyman, Magwire, Rollmann, Duncan, Lawrence, Anholt and Mackay2009) . Whole genome expression was quantified using Affymetrix Drosophila 2.0 arrays for two replicate pools of 3–5-day-old mated males and females for each of 40 DGRP lines. We median-centred the perfect match (PM) data and removed probes that were identified as likely single feature polymorphisms. We used the median log2 signal intensity of the remaining PM probes in each probe set as the measure of expression. A total of 14 840 (78·9%) of the 18 767 transcripts on the array were expressed. Because we focus here on highly male-biased transcripts, we only used the male gene expression data to identify genetically variable transcripts. We fitted the following model to the expression data: Y=L+e, where Y is the median log2 signal intensity, L is the line effect and e is the residual. We identified 7151 transcripts as genetically variable at a False Discovery Rate (FDR)<0·01.

The raw microarray data are deposited in the ArrayExpress database (http://www.ebi.ac.uk/arrayexpress) under accession number E-MEXP-1594. The DGRP stocks are available from the Bloomington Drosophila Stock Center (Bloomington, Indiana).

(ii) cSFPs

Of the 187 known SFP genes, 107 had genetically variable expression levels in the DGRP lines. We computed pairwise Pearson correlations between the 107 genetically variable SFPs and all 7151 genetically variable transcripts, 1. We then calculated an ‘SFP score’ for each of the 7151 transcripts by tallying the number of significant correlations (P<0·01) with known SFPs, divided by 107. For a given transcript, a score of 100 indicates that it is correlated with all 107 known SFPs, and a score of 0·93 (1/107×100) indicates the absence of significant correlation between the focal gene and any of the known SFP genes (i.e. only showing correlation to itself). The thresholds used to compute this score are arbitrary, but this method is both simple and intuitive, and gives similar results to more sophisticated statistics such as the identification of eigengenes (Langfelder & Horvath, Reference Langfelder and Horvath2007) following the construction of co-expression gene networks and using the Principal Component Analysis (PCA) loadings to identify correlated transcripts.

In addition to the correlation structure, we used several criteria to identify transcripts as putative SFPs (proteins that are predominately or exclusively expressed in the male reproductive tract and likely to be transferred to females), or as potential regulatory genes (those that produce proteins unlikely to be transferred to females) but whose expression is also predominately limited to male reproductive tissues. We used FlyAtlas (Chintapalli et al., Reference Chintapalli, Wang and Dow2007), a database of tissue-specific expression for D. melanogaster, to examine the tissues of expression for each gene with an SFP score of greater than 8. In addition, because SFPs are secreted proteins, we used SignalP software (http://www.cbs.dtu.dk/services/SignalP/) to identify the presence of predicted signal sequences. The program calculates the probability that the input amino acid sequence contains an N-terminal secretion signal. Here, we used the signal peptide probability score given from the SignalP-HMM prediction method. Signal peptides are usually 15–30 amino acids long and contain a stereotypical pattern of charged, hydrophobic and uncharged residues, although the amino acid sequence itself is not conserved (Emanuelsson et al., Reference Findlay, MacCoss and Swanson2007). However, not all secreted proteins contain predicted signal sequences (Findlay et al., Reference Findlay, Yi, Maccoss and Swanson2008), and not all proteins with secretion signals are secreted (Emanuelsson et al., Reference Emanuelsson, Brunak, von Heijne and Nielsen2007). Therefore, we do not exclude genes as being SFPs or ACP candidates based solely on a low SignalP score.

(iii) Experimental validation of cSFPs

We chose seven genes identified as cSFPs for validation of the guilt-by-association results as well as further characterization. These genes have a range of SFP scores and a few have predicted biochemical functions, though none were predicted to be involved with SFP function. In addition to the seven candidates, we also included a known ACP gene (CG9997; Swanson et al., Reference Swanson, Clark, Waldrip-Dail, Wolfner and Aquadro2001; Ravi Ram & Wolfner, Reference Ravi Ram and Wolfner2007), and a known ejaculatory duct (ED) protein gene (Dup99B; Saudan et al., Reference Saudan, Hauck, Soller, Choffat, Ottiger, Spörri, Ding, Hess, Gehrig, Klauser, Hunziker and Kubli2002), both of whose products are transferred to females, as positive controls. We expect cSFP genes, including those expressed in the ED or bulb, to correlate in expression with the known SFP, CG9997. We included CG34422 as a negative control, given its low SFP score and wide expression pattern across tissues, including the male AGs, brain, eye and hindgut. This gene should not show a significant correlation to CG9997 in the qRT-PCR experiment, in contrast to the seven cSFPs.

We independently validated the tissue-biased expression results from FlyAtlas (Chintapalli et al., Reference Chintapalli, Wang and Dow2007) for these 10 genes. We reared Canton-S males on standard yeast-glucose medium under uncrowded conditions at ~24°C. We dissected 50–60 testes (T), AGs, EDs, ejaculatory bulbs (EB) and male carcasses (C; no reproductive tract). Dissected tissues were placed directly into TRIzol Reagent (Invitrogen) on ice. We collected two biological replicates for each RNA extraction.

We used qRT-PCR to validate the correlation structure between the genes that had been inferred from the microarray experiment. We randomly selected 20 of the 40 DGRP lines used in the microarray study (Ayroles et al., Reference Ayroles, Carbone, Stone, Jordan, Lyman, Magwire, Rollmann, Duncan, Lawrence, Anholt and Mackay2009), and isolated total RNA from two biological replicates, each with 8–12 males of each line (3–7 days post-eclosion). We then estimated the correlation of gene expression with the known SFP, CG9997.

(iv) RNA extractions and cDNA synthesis

We extracted total RNA by grinding dissected tissues in 150 μl of TRIzol Reagent (Invitrogen), following the manufacturer's recommendations for RNA isolation, except that 0·5 ml of chloroform was used for every 1 ml of TRIzol. Total RNA was treated with DNase1 (Invitrogen) and converted to cDNA with Superscript II Reverse Transcriptase (Invitrogen) and oligo-dT primers as recommended by the manufacturer. We used 500 ng of total RNA per 20 μl reverse transcription reaction. Negative controls without reverse transcriptase were tested once for all genes and all cDNA samples to exclude potential genomic DNA contamination.

v qRT-PCR

We quantified mRNA levels by qRT-PCR in 25 μl reactions with the SYBR green detection method (iQ SYBR Green Supermix, Bio-Rad) according to the protocol from MyiQ Single-Color Real-Time PCR Detection System (Bio-Rad). Each reaction was performed with 2 pg of total cDNA, using a BioRadMyiQ Single-Color Real-Time PCR Detection system. We used the actin5C gene as an internal standard. We used Primer3 (http://frodo.wi.mit.edu/primer3/) to design transcript-specific primers to amplify 85–148 bp regions of the genes of interest. CG34422 primers were designed to encompass the common regions of alternative transcripts. The starting template concentration of each transcript was calculated from the standard curve of that primer pair according to the method described by Qiagen (http://www1.qiagen.com/literature/brochures/pcr/qt/1037490_ag_pcr_0206_int_lr.pdf). We used the linear regression model Y=mX+b to quantify transcript abundance, where Y is the critical threshold (Ct) values from the qRT-PCR experiment, m is the slope, b is the intercept of the standard curve and X is the transcript abundance. We standardized this estimate by dividing by the transcript abundance of actin5C in the same sample.

vi GO analysis

We used the GO analysis to assign functional categories to the cSFP genes tested. We computed the genetic correlations between each of the seven new focal genes with the remainder of the genetically variable transcriptome. We then performed a GO enrichment analysis for the genes most strongly correlated to the focal gene (P<0·001 and |r|>0·5). The conclusions regarding enrichment were the same if the threshold was increased to P<0·0001. We performed this analysis using DAVID 6.7 (Huang et al., Reference Huang, Sherman and Lempicki2009).

3. Results and Discussion

Of the 187 known SFPs, 107 had genetically variable transcripts among the 40 DGRP lines (Ayroles et al., Reference Ayroles, Carbone, Stone, Jordan, Lyman, Magwire, Rollmann, Duncan, Lawrence, Anholt and Mackay2009). The 107 known SFPs were highly genetically correlated (Fig. 1), reinforcing the idea that gene co-expression may be a reflection of shared function. We attempted to cluster this correlation matrix further into modules using various clustering algorithms, including Modulated Modularity Clustering (MMC) (Stone & Ayroles, Reference Stone and Ayroles2009), but did not find strong community structure in the graph resulting from this correlation matrix. In addition, we did not find evidence supporting the idea that genes sharing a similar GO term were more strongly correlated with each other than they were to the rest of the genes.

Fig. 1. Graphical representation of the correlation among known SFPs. Each node represents a gene and each edge the correlation between two genes. The thickness of each edge is scaled proportional to the strength of the correlation between two genes. The absolute value of all correlations depicted is greater than 0·5 (P<0·001).

We then analysed the correlation matrix between the 107 known SFPs and 7151 transcripts that were genetically variable in males. We assigned an SFP score to each of the genetically variable transcripts based on the number of significant correlations with known SFPs (Supplementary Table 1 available at http://cambridge.journals.org/GRH). We next asked whether this approach would allow us to recover the known SFPs. We ranked the vector of SFP scores from the highest to the lowest and applied the filter that cSFPs should be expressed in male reproductive tissues based on FlyAtlas (Chintapalli et al., Reference Chintapalli, Wang and Dow2007) data. We found that 78% of the known SFPs are in the top 500 transcripts.

We identified 176 cSFP genes that have correlated expression patterns to at least 7 of the 107 genetically variable known seminal protein genes and are expressed in male reproductive tissues (Supplementary Table 1). A total of 37 of the 176 candidates have no known or predicted functions or GO terms. An additional 13 transcripts correspond to probe sets on the Affymetrix array but not annotated genes, and could correspond to new genes. Independent confirmation of cSFP identification comes from a proteomic screen aimed at identifying male proteins transferred during mating (Findlay et al., Reference Findlay, Yi, Maccoss and Swanson2008, Reference Findlay, MacCoss and Swanson2009). Two candidate transcripts were confirmed as bona fide SFPs: CG34002 (with an SFP score of 15) and Sfp26Ad (with an SFP score of 41). Sfp26Ad was not annotated as a gene at the time we performed this experiment and corresponded to probe set 637742 at on the Affymetrix array.

We chose seven cSFP genes (CG9720, CG11828, CG31413, CG31493, CG31496, CG32985 and CG34002), as well as two positive control genes (the ACP gene CG9997 and the ED protein gene Dup99B) and one negative control gene (CG34422, with an SFP score of 0·93) for validation of the microarray correlation results using qRT-PCR in 20 of the DGRP lines. The candidate genes have SFP scores ranging from moderately low (8) to very high (42, the highest SFP score found) (Table 1). The RT-PCR results confirmed the correlation between all seven cSFPs and the known ACP gene CG9997 across the 20 lines (Fig. 2). As predicted, expression of the negative control CG34422 was not genetically correlated with that of CG9997. However, expression of the ED protein gene Dup99B, whose gene product is transferred with the seminal fluid to females, was genetically correlated with CG9997, demonstrating that non-ACP SFPs canals to be identified with this method.

Fig. 2. Correlation of qRT-PCR estimates of gene expression between cSFP genes and positive and negative SFP control genes (y-axis) to a known ACP gene (CG9997, x-axis) among males of 20 inbred lines. All estimates of gene expression are normalized to that of actin5C. The linear regression line is shown, along with the t-test P-value and the estimate of the correlation coefficient, r. (a) CG11828, r=0·68, P=0·001. (b) CG31413, r=0·81, P=0·000014. (c) CG31493, r=0·77, P=0·000063. (d) CG31496, r=0·51, P=0·022. (e) CG32985, r=0·53, P=0·015. (f) CG34002, r=0·66, P=0·0017. (g) CG9720, r=0·66, P=0·0016. (h) Dup99B (positive control), r=0·55, P=0·012. (i) CG34422 (negative control), r=0·12, P=0·61.

Table 1. Genes selected for experimental validation. The SFP score is the fraction of known SFPs with which the gene had correlated expression. Sprob is the predicted probability of a secretion signal sequence as given by SignalP. Tissue of expression is given from the FlyAtlas compilation and our RT-PCR data from the male reproductive tract and carcass. ED and bulb are not represented in FlyAtlas. Bold font denotes tissues of predominant expression. AG, accessory glands; ED, ejaculatory duct; EB, ejaculatory bulb; T, testis; LSG, larval salivary glands; HT, heart; HD, head. ED and EB are not represented in FlyAtlas.

Table 1 gives SFP scores, secretion signal peptide probability and tissue of expression for these seven genes and for the positive and negative controls. Three genes with high SFP scores were not predicted to have secretion signals. These genes’ products may be secreted nevertheless, as has been seen in other cases (Findlay et al., Reference Findlay, Yi, Maccoss and Swanson2008), or they may be non-SFP genes that are important for the regulation of other SFPs.

Among the seven genes, all that were predicted to be expressed in AGs (Chintapalli et al., Reference Chintapalli, Wang and Dow2007) were confirmed as expressed in that tissue (Table 1, Fig. 3). To gain insight into the possible biological processes and molecular functions of the candidate genes chosen for validation, we used a GO enrichment analysis implemented in DAVID (6.7) (Huang et al., Reference Huang, Sherman and Lempicki2009). For each candidate gene, we analysed the function of its most correlated transcripts (P<0·001 and r>0·5). Four of the seven candidate genes (CG11828, CG31413, CG31493 and CG34002) were significantly associated with serine-type endopeptidase inhibitor activity, a predicted function shared by several other SFPs (Wolfner, Reference Wolfner2009). However, it is important to note that CG11828, CG31413 and CG31493 do not contain conserved protease domains but do contain other types of predicted conserved domains. No significant GO-class enrichment was observed for CG9720, CG31496 or CG32985.

Fig. 3. RT-PCR analysis of gene expression for cSFP genes and positive and negative SFP control genes in five male tissues: AG, testes (T), EB, ED and carcass (C, non-reproductive tissues). The subscripts denote the two biological replicates of each tissue. The number of PCR cycles for each gene was normalized to give non-saturation results. Actin5C was used as a control for cDNA synthesis. Whole male cDNA was used as a positive (+) PCR control, and no DNA template was used as a negative (−) PCR control.

It is possible that some of the cSFP genes are important for SFP expression and function but may not encode proteins that are transferred to females as part of the seminal fluid. As proof of principle that such genes can be identified by this method, our analysis detected paired (SFP score=16), which encodes a transcription factor important in AG development and ACP expression (Supplementary Table 2 available online at http://cambridge.journals.org/GRH). This Pax gene has a dual function in Drosophila: it acts first as a pair-rule gene in early embryo development (Nüsslein-Volhard & Weischaus, Reference Nüsslein-Volhard and Wieschaus1980; Kilchherr et al., Reference Kilchherr, Schumaker, Phillips and Curtiss1986) and later is required for viability and male fertility (Bertuccioli et al., Reference Bertuccioli, Fasano, Jun, Wang, Sheng and Desplan1996; Xue & Noll, Reference Xue and Noll1996, Reference Xue and Noll2000). AG formation and expression of at least two SFPs expressed in the AG (ACP26Aa and SP) both require the function of paired (Xue & Noll, Reference Xue and Noll2000, Reference Xue and Noll2002).

Guilt-by-association methods most frequently rely on clustering algorithms to identify the functional membership of a candidate gene or transcript (Aravind, Reference Aravind2000; Miozzi et al., Reference Miozzi, Piro, Rosa, Ala, Silengo, Di Cunto and Provero2008; Reverter et al., Reference Reverter, Ingham and Dalrymple2008; Klie et al., Reference Klie, Nikoloski and Selbig2010). In its most common use, guilt-by-association is used to assign functions to any or all unannotated genes that respond to a given treatment or are differentially regulated under disease conditions. Here, we have demonstrated the use of guilt-by-association methods in another context: to identify genes in a specific functional class using correlated genetic variation in gene expression among wild-derived inbred lines. This method removes the requirement for relying on arbitrary clustering or reliance on GO terms to assign candidate functions to new genes. Instead, a group of genes that has been annotated and functionally clustered experimentally is used to find correlated transcripts that can then be included in the group. In this case, we used SFPs, a group defined by a biological phenomenon rather than a biochemical function. As in potentially many other cases, for example, identifying genes involved in specific behaviours, GO terms do not define our selected group of genes as belonging to a biologically significant group. The group of genes we identified (cSFPs) have diverse GO functions (ranging from proteases to pro-hormones). A given cSFP gene could not be predicted as an SFP on the basis of GO membership.

SFP genes are well suited for this study since their expression is specific to, or highly biased in, the male reproductive tract, facilitating their confirmation as SFPs; and expression of the known SFPs is genetically variable in the population of lines surveyed. An increasing number of studies are taking advantage of natural genetic variation to better understand the genetic basis of phenotypic variation (Mackay et al., Reference Mackay, Stone and Ayroles2009). In the future, the availability of sequence information for the D. melanogaster population used in this study will allow us to associate co-expression with expression quantitative trait locus (eQTL) analysis (Mackay et al., Reference Mackay, Stone and Ayroles2009). This additional layer of information will further our understanding of what genetic factors are driving co-expression between SFP genes, and may lead us to rethink what information should be considered when annotating a segment of sequence.

To complement this study, and generalize the simple analysis presented in this manuscript, we have created a web tool (http://dgrp.statgen.ncsu.edu) that allows the user to input the Affymetrix Drosophila 2.0 ID of any focal gene of interest and retrieve a vector of genes, their ranked correlation with the focal gene, as well as the GO of the correlated transcripts. This tool integrates FlyAtlas information (Chintapalli et al., Reference Chintapalli, Wang and Dow2007), allowing users to restrict the computation of correlations to genes expressed in specific tissue or to genes with strong tissue-biased expression.

Many studies using natural genetic variation to study phenotypic variation also investigate variation in gene expression and gene co-expression (Mackay et al., Reference Mackay, Stone and Ayroles2009). However, very rarely is this information translated in the form of hypothetical functional annotation for any unannotated genes involved. We advocate that such datasets be used more routinely as patterns should emerge across studies and this information will greatly improve our understanding of genes, their function and regulation. In particular, directed analyses such as the one presented here, in which genes involved in an experimentally defined group are sought, may help to uncover pleiotropy among previously annotated genes and increase our understanding of how various biological systems function together.

This work was funded by National Institutes of Health grants R01 GM45146 (T. F. C. M.) and R0I HD038921 (M. F. W.). B. A. L. was supported by an NSF Predoctoral Fellowship. We thank G. Findlay and E. Kelleher for comments on the manuscript, and J. Mezey and MBG colleagues for use of the qPCR machine. This is a publication of the W. M. Keck Center for Behavioral Biology.

References

Alonso, J. M., Stepanova, A. N., Leisse, T. J., Kim, C. J., Chen, H., Shinn, P., Stevenson, D. K., Zimmerman, J., Barajas, P., Cheuk, R., Gadrinab, C., Heller, C., Jeske, A., Koesema, E., Meyers, C. C., Parker, H., Prednis, L., Ansari, Y., Choy, N., Deen, H., Geralt, M., Hazari, N., Hom, E., Karnes, M., Mulholland, C., Ndubaku, R., Schmidt, I., Guzman, P., Aguilar-Henonin, L., Schmid, M., Weigel, D., Carter, D. E., Marchand, T., Risseeuw, E., Brogden, D., Zeko, A., Crosby, W. L., Berry, C. C. & Ecker, J. R. (2003). Genome-wide insertional mutagenesis of Arabidopsis thaliana. Science 301, 653657.CrossRefGoogle ScholarPubMed
Aravind, L. (2000). Guilt by association: contextual information in genome analysis. Genome Research 10, 10741077.CrossRefGoogle ScholarPubMed
Ayroles, J. F., Carbone, M. A., Stone, E. A., Jordan, K. W., Lyman, R. F., Magwire, M. M., Rollmann, S. M., Duncan, L. H., Lawrence, F., Anholt, R. R. H. & Mackay, T. F. C. (2009). Systems genetics of complex traits in Drosophila melanogaster. Nature Genetics 41, 299307.Google Scholar
Bellen, H. J., Levis, R. W., Liao, G., He, Y., Carlson, J. W., Tsang, G., Evans-Holm, M., Hiesinger, P. R., Schulze, K. L., Rubin, G. M., Hoskins, R. A. & Spradling, A. C. (2004). The BDGP gene disruption project: single transposon insertions associated with 40% of Drosophila genes. Genetics 67, 761781.Google Scholar
Bertuccioli, C., Fasano, L., Jun, S., Wang, S., Sheng, G. & Desplan, C. (1996). In vivo requirement for the paired domain and homeodomain of the paired segmentation gene product. Development 122, 26732685.Google Scholar
Bréhélin, L., Florent, I., Gascuel, O. & Maréchal, E. (2010). Assessing functional annotation transfers with inter-species conserved coexpression: application to Plasmodium falciparum. BMC Genomics 11, 35.Google Scholar
Carvalho, G. B., Kapahi, P., Anderson, D. J. & Benzer, S. (2006). Allocrine modulation of feeding behavior by the sex peptide of Drosophila. Current Biology 16, 692696.CrossRefGoogle ScholarPubMed
Chapman, T., Bangham, J., Vinti, G., Seifried, B., Lung, O., Wolfner, M. F., Smith, H. K. & Partridge, L. (2003). The sex peptide of Drosophila melanogaster: female post-mating responses analyzed by using RNA interference. Proceedings of the National Academy of Sciences USA 100, 99239928.Google Scholar
Chintapalli, V. R., Wang, J. & Dow, J. A. (2007). Using FlyAtlas to identify better Drosophila melanogaster models of human disease. Nature Genetics 39, 715720.CrossRefGoogle ScholarPubMed
Costello, J. C., Dalkilic, M. M., Beason, S. M., Gehlhausen, J. R., Patwardhan, R., Middha, S., Eads, B. D. & Andrews, J. R. (2009). Gene networks in Drosophila melanogaster: integrating experimental data to predict gene function. Genome Biology 10, R97.CrossRefGoogle ScholarPubMed
Dietzl, G., Chen, D., Schnorrer, F., Su, K. C., Barinova, Y., Fellner, M., Gasser, B., Kinsey, K., Oppel, S., Scheiblauer, S., Couto, A., Marra, V., Keleman, K. & Dickson, B. J. (2007). A genome-wide transgenic RNAi library for conditional gene inactivation in Drosophila. Nature 448, 151156.CrossRefGoogle ScholarPubMed
Dowell, R. D., Ryan, O., Jansen, A., Cheung, D., Agarwala, S., Danford, T., Bernstein, D. A., Rolfe, P. A., Heisler, L. E., Chin, B., Nislow, C., Giaever, G., Phillips, P. C., Fink, G. R., Gifford, D. K. & Boone, C. (2010). Genotype to phenotype: a complex problem. Science 328, 469.Google Scholar
Emanuelsson, O., Brunak, S., von Heijne, G. & Nielsen, H. (2007). Locating proteins in the cell using TargetP, SignalP and related tools. Nature Protocols 2, 953971.CrossRefGoogle ScholarPubMed
Findlay, G. D., MacCoss, M. J. & Swanson, W. J. (2009). Proteomic discovery of previously unannotated, rapidly evolving seminal fluid genes in Drosophila. Genome Research 19, 886896.CrossRefGoogle ScholarPubMed
Findlay, G. D., Yi, X., Maccoss, M. J. & Swanson, W. J. (2008). Proteomics reveals novel Drosophila seminal fluid proteins transferred at mating. Public Library of Science Biology 6, e178.Google ScholarPubMed
Flint, J. & Mackay, T. F. C. (2009). Genetic architecture of quantitative traits in mice, flies, and humans. Genome Research 19, 723733.Google Scholar
Guan, C., Ye, C., Yang, X. & Gao, J. (2010). A review of current large-scale mouse knockout efforts. Genesis 48, 7385.Google Scholar
Heifetz, Y., Lung, O., Frongillo, E. A. Jr. & Wolfner, M. F. (2000). The Drosophila seminal fluid protein Acp26Aa stimulates release of oocytes by the ovary. Current Biology 10, 99–102.CrossRefGoogle ScholarPubMed
Hrmova, M. & Fincher, G. B. (2009). Functional genomics and structural biology in the definition of gene function. Methods in Molecular Biology 513, 199227.Google Scholar
Huang, D. W., Sherman, B. T. & Lempicki, R. A. (2009). Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nature Protocols 4, 4457.CrossRefGoogle ScholarPubMed
Isaac, R. E., Li, C., Leedale, A. E. & Shirras, A. D. (2010). Drosophila male sex peptide inhibits siesta sleep and promotes locomotor activity in the post-mated female. Proceedings of the Biological Sciences 277, 6570.Google Scholar
Kamath, R. S. & Ahringer, J. (2003). Genome-wide RNAi screening in Caenorhabditis elegans. Methods 30, 313321.CrossRefGoogle ScholarPubMed
Kilchherr, E., Schumaker, V. N., Phillips, M. L. & Curtiss, L. K. (1986). Activation of the first component of human complement, C1, by monoclonal antibodies directed against different domains of subcomponent C1q. Journal of Immunology 137, 255262.Google Scholar
Klie, S., Nikoloski, Z. & Selbig, J. (2010). Biological cluster evaluation for gene function prediction. Journal of Computational Biology (Epublication ahead of print).Google Scholar
Langfelder, P. & Horvath, S. (2007). Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 1, 54.CrossRefGoogle ScholarPubMed
Liu, H. & Kubli, E. (2003). Sex-peptide is the molecular basis of the sperm effect in Drosophila melanogaster. Proceedings of the National Academy of Sciences USA 100, 99299933.CrossRefGoogle ScholarPubMed
Luo, F., Yang, Y., Zhong, J., Gao, H., Khan, L., Thompson, D. K. & Zhou, J. (2007). Constructing gene co-expression networks and predicting functions of unknown genes by random matrix theory. BMC Bioinformatics 8, 299.CrossRefGoogle ScholarPubMed
Mackay, T. F. C., Stone, E. A. & Ayroles, J. F. (2009). The genetics of quantitative traits: challenges and prospects. Nature Reviews Genetics 10, 565677.CrossRefGoogle ScholarPubMed
Miozzi, L., Piro, R. M., Rosa, F., Ala, U., Silengo, L., Di Cunto, F. & Provero, P. (2008). Functional annotation and identification of candidate disease genes by computational analysis of normal tissue gene expression data. Public Library of Science One 6, e2439.Google Scholar
Neubaum, D. M. & Wolfner, M. F. (1999). Wise, winsome, or weird? Mechanisms of sperm storage in female animals. Current Topics in Developmental Biology 41, 6797.CrossRefGoogle ScholarPubMed
Ni, J. Q., Liu, L. P., Binari, R., Hardy, R., Shim, H. S., Cavallaro, A., Booker, M., Pfeiffer, B. D., Markstein, M., Wang, H., Villalta, C., Laverty, T. R., Perkins, L. A. & Perrimon, N. (2009). A Drosophila resource of transgenic RNAi lines for neurogenetics. Genetics 182, 10891100.Google Scholar
Nüsslein-Volhard, C. & Wieschaus, E. (1980). Mutations affecting segment number and polarity in Drosophila. Nature 287, 795801.Google Scholar
Peña-Castillo, L. & Hughes, T. R. (2007). Why are there still over 1000 uncharacterized yeast genes? Genetics 176, 7–14.Google Scholar
Ravi Ram, K. & Wolfner, M. F. (2007). Seminal influences: Drosophila ACPs and the molecular interplay between males and females during reproduction. Integrative and Comparative Biology 47, 427445.Google Scholar
Ravi Ram, K. & Wolfner, M. F. (2009). A network of interactions among seminal proteins underlies the long-term postmating response in Drosophila. Proceedings of the National Academy of Sciences USA 106, 1538415389.Google Scholar
Reverter, A., Ingham, A. & Dalrymple, B. P. (2008). Mining tissue specificity, gene connectivity and disease association to reveal a set of genes that modify the action of disease causing genes. BioData Mining 1, 8.Google Scholar
Saudan, P., Hauck, K., Soller, M., Choffat, Y., Ottiger, M., Spörri, M., Ding, Z., Hess, D., Gehrig, P. M., Klauser, S., Hunziker, P. & Kubli, E. (2002). Ductus ejaculatorius peptide 99B (DUP99B), a novel Drosophila melanogaster sex-peptide pheromone. European Journal of Biochemistry 269, 989997.Google Scholar
Spirek, M., Benko, Z., Carnecka, M., Rumpf, C., Cipak, L., Batova, M., Marova, I., Nam, M., Kim, D. U., Park, H. O., Hayles, J., Hoe, K. L., Nurse, P. & Gregan, J. (2010). S. pombe genome deletion project: an update. Cell Cycle 9, 23992402.CrossRefGoogle ScholarPubMed
Stone, E. A. & Ayroles, J. F. (2009). Modulated modularity clustering as an exploratory tool for functional genomic inference. Public Library of Science Genetics 5, e1000479.Google Scholar
Swanson, W. J., Clark, A. G., Waldrip-Dail, H. M., Wolfner, M. F. & Aquadro, C. F. (2001). Evolutionary EST analysis identifies rapidly evolving male reproductive proteins in Drosophila. Proceedings of the National Academy of Sciences USA 98, 73757379.Google Scholar
Tram, U. & Wolfner, M. F. (1999). Male seminal fluid proteins are essential for sperm storage in Drosophila melanogaster. Genetics 153, 837844.Google Scholar
Vandepoele, K., Quimbaya, M., Casneuf, T., De Veylder, L. & Van de Peer, Y. (2009). Unraveling transcriptional control in Arabidopsis using cis-regulatory elements and coexpression networks. Plant Physiology 150, 535546.Google Scholar
Walker, M. G., Volkmuth, W., Sprinzak, E., Hodgson, D. & Klingler, T. (1999). Prediction of gene function by genome-scale expression analysis: prostate cancer-associated genes. Genome Research 9, 11981203.Google Scholar
Winzeler, E. A., Shoemaker, D. D., Astromoff, A., Liang, H., Anderson, K., Andre, B., Bangham, R., Benito, R., Boeke, J. D., Bussey, H., Chu, A. M., Connelly, C., Davis, K., Dietrich, F., Dow, S. W., El Bakkoury, M., Foury, F., Friend, S. H., Gentalen, E., Giaever, G., Hegemann, J. H., Jones, T., Laub, M., Liao, H., Liebundguth, N., Lockhart, D. J., Lucau-Danila, A., Lussier, M., M'Rabet, N., Menard, P., Mittmann, M., Pai, C., Rebischung, C., Revuelta, J. L., Riles, L., Roberts, C. J., Ross-MacDonald, P., Scherens, B., Snyder, M., Sookhai-Mahadeo, S., Storms, R. K., Véronneau, S., Voet, M., Volckaert, G., Ward, T. R., Wysocki, R., Yen, G. S., Yu, K., Zimmermann, K., Philippsen, P., Johnston, M. & Davis, R. W. (1999). Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 285, 901906.CrossRefGoogle Scholar
Wolfner, M. F. (2009). Battle and ballet: molecular interactions between the sexes in Drosophila. Journal of Heredity 100, 399410.Google Scholar
Xue, L. & Noll, M. (1996). The functional conservation of proteins in evolutionary alleles and the dominant role of enhancers in evolution. European Molecular Biology Organization Journal 15, 37223731.Google Scholar
Xue, L. & Noll, M. (2000). Drosophila female sexual behavior induced by sterile males showing copulation complementation. Proceedings of the National Academy of Sciences USA 97, 32723275.Google Scholar
Xue, L. & Noll, M. (2002). Dual role of the Pax gene paired in accessory gland development of Drosophila. Development 129, 339346.Google Scholar
Figure 0

Fig. 1. Graphical representation of the correlation among known SFPs. Each node represents a gene and each edge the correlation between two genes. The thickness of each edge is scaled proportional to the strength of the correlation between two genes. The absolute value of all correlations depicted is greater than 0·5 (P<0·001).

Figure 1

Fig. 2. Correlation of qRT-PCR estimates of gene expression between cSFP genes and positive and negative SFP control genes (y-axis) to a known ACP gene (CG9997, x-axis) among males of 20 inbred lines. All estimates of gene expression are normalized to that of actin5C. The linear regression line is shown, along with the t-test P-value and the estimate of the correlation coefficient, r. (a) CG11828, r=0·68, P=0·001. (b) CG31413, r=0·81, P=0·000014. (c) CG31493, r=0·77, P=0·000063. (d) CG31496, r=0·51, P=0·022. (e) CG32985, r=0·53, P=0·015. (f) CG34002, r=0·66, P=0·0017. (g) CG9720, r=0·66, P=0·0016. (h) Dup99B (positive control), r=0·55, P=0·012. (i) CG34422 (negative control), r=0·12, P=0·61.

Figure 2

Table 1. Genes selected for experimental validation. The SFP score is the fraction of known SFPs with which the gene had correlated expression. Sprob is the predicted probability of a secretion signal sequence as given by SignalP. Tissue of expression is given from the FlyAtlas compilation and our RT-PCR data from the male reproductive tract and carcass. ED and bulb are not represented in FlyAtlas. Bold font denotes tissues of predominant expression. AG, accessory glands; ED, ejaculatory duct; EB, ejaculatory bulb; T, testis; LSG, larval salivary glands; HT, heart; HD, head. ED and EB are not represented in FlyAtlas.

Figure 3

Fig. 3. RT-PCR analysis of gene expression for cSFP genes and positive and negative SFP control genes in five male tissues: AG, testes (T), EB, ED and carcass (C, non-reproductive tissues). The subscripts denote the two biological replicates of each tissue. The number of PCR cycles for each gene was normalized to give non-saturation results. Actin5C was used as a control for cDNA synthesis. Whole male cDNA was used as a positive (+) PCR control, and no DNA template was used as a negative (−) PCR control.

Supplementary material: File

Ayroles Supplementary Material

Table.xls

Download Ayroles Supplementary Material(File)
File 813.6 KB
Supplementary material: File

Ayroles Supplementary Material

Table.xls

Download Ayroles Supplementary Material(File)
File 127.5 KB