Structural genomics projects envision almost routine
protein structure determination, which is currently feasible
only for small proteins with molecular weights below 25,000
Da. For larger proteins, structural insight can be obtained
by breaking them into small segments of amino acid sequences
that can fold into native structures, even when isolated
from the rest of the protein. Such segments are termed autonomously
folding units (AFU) and have sizes suitable for fast structural
analyses. Here, we propose to extend an intuitive procedure
often employed for identifying biologically important domains
into an automatic method for detecting putative folded protein
fragments. The procedure is based on the recognition that
large proteins can be regarded as a combination of independent
domains conserved among diverse organisms. We have thus
developed a program that reorganizes the output of BLAST
searches and detects regions with a large number of similar
sequences. To automate detection, we reduce the task
to a simple geometrical problem: recognizing rectangular
elevations in a graph that plots the number of similar
sequences at each residue of the query sequence. We used
our program to quantitatively corroborate the premise that
segments with conserved sequences correspond to domains
that fold into native structures. We applied our program
to a test data set composed of 99 amino acid sequences
containing 150 segments with structures listed in the Protein
Data Bank, and thus known to fold into native structures.
Overall, the fragments identified by our program have an
almost 50% probability of forming a native structure, and
comparable results are observed with sequences containing
domain linkers classified in SCOP. Furthermore, we verified
that our program identifies AFU in libraries from various
organisms, and we found a significant number of AFU candidates
for structural analysis, covering an estimated 5 to 20%
of the genomic databases. Altogether, these results argue
that methods based on sequence similarity can be useful
for dissecting large proteins into small autonomously folding
domains, and that such methods may provide efficient support
for structural genomics projects.
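
The geometrical formulation described above (counting similar sequences at
each residue of the query, then looking for rectangular elevations in that
graph) can be illustrated with a minimal sketch. This is not the program
described here: the function names coverage_profile and find_elevations,
the (start, end) hit format, and the two thresholds min_height and
min_length are all hypothetical, and a plain threshold on the count stands
in for whatever elevation criterion the actual program applies. Hits are
assumed to be already parsed from BLAST output as 1-based, inclusive query
coordinates.

```python
from typing import List, Tuple


def coverage_profile(query_len: int, hits: List[Tuple[int, int]]) -> List[int]:
    """Return, for each query residue, the number of hits covering it.

    Each hit is a (start, end) pair in 1-based inclusive query coordinates
    (an assumed input format, not a BLAST output format).
    """
    # Difference array: +1 where a hit begins, -1 just past where it ends.
    diff = [0] * (query_len + 2)
    for start, end in hits:
        start = max(1, start)
        end = min(query_len, end)
        if start > end:
            continue  # hit lies entirely outside the query
        diff[start] += 1
        diff[end + 1] -= 1

    # The running sum of the difference array is the per-residue count.
    profile = []
    running = 0
    for pos in range(1, query_len + 1):
        running += diff[pos]
        profile.append(running)
    return profile


def find_elevations(
    profile: List[int], min_height: int, min_length: int
) -> List[Tuple[int, int]]:
    """Return (start, end) segments where the count stays >= min_height
    for at least min_length consecutive residues (1-based inclusive)."""
    segments = []
    run_start = None
    for pos, count in enumerate(profile, start=1):
        if count >= min_height:
            if run_start is None:
                run_start = pos  # an elevation begins here
        elif run_start is not None:
            # The elevation ended at the previous residue.
            if pos - run_start >= min_length:
                segments.append((run_start, pos - 1))
            run_start = None
    # Close an elevation that runs to the end of the query.
    if run_start is not None and len(profile) + 1 - run_start >= min_length:
        segments.append((run_start, len(profile)))
    return segments


# Toy example: three overlapping hits form one elevation, a lone hit does not.
toy_hits = [(3, 10), (4, 10), (3, 9), (15, 18)]
toy_profile = coverage_profile(20, toy_hits)
toy_segments = find_elevations(toy_profile, min_height=2, min_length=5)
```

With the toy input, toy_segments is [(3, 10)]: residues 3 to 10 are covered
by at least two hits, whereas residues 15 to 18 are covered by only one hit
and fall below the height threshold. In practice both thresholds would need
to be tuned against sequences with known domain boundaries.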