
Digital search trees and chaos game representation*

Published online by Cambridge University Press:  21 February 2009

Peggy Cénac
Affiliation:
INRIA Rocquencourt and Université Paul Sabatier (Toulouse III) – INRIA Domaine de Voluceau, B.P. 105, 78153 Le Chesnay Cedex, France.
Brigitte Chauvin
Affiliation:
LAMA, UMR CNRS 8100, Bâtiment Fermat, Université de Versailles – Saint-Quentin, 78035 Versailles, France.
Stéphane Ginouillac
Affiliation:
LAMA, UMR CNRS 8100, Bâtiment Fermat, Université de Versailles – Saint-Quentin, 78035 Versailles, France.
Nicolas Pouyanne
Affiliation:
LAMA, UMR CNRS 8100, Bâtiment Fermat, Université de Versailles – Saint-Quentin, 78035 Versailles, France.

Abstract

In this paper, we consider a possible representation of a DNA sequence in a quaternary tree, in which one can visualize repetitions of subwords (seen as suffixes of subsequences). The CGR-tree turns a sequence of letters into a Digital Search Tree (DST), obtained from the suffixes of the reversed sequence. Several results are known concerning the height and the insertion depth for DSTs built from independent successive random sequences having the same distribution. Here the successively inserted words are strongly dependent. We give the asymptotic behaviour of the insertion depth and of the length of branches for the CGR-tree obtained from the suffixes of a reversed i.i.d. or Markovian sequence. This behaviour turns out to be, at first order, the same as in the case of independent words. As a by-product, asymptotic results on the length of the longest runs in a Markovian sequence are obtained.
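
To make the construction concrete, here is a minimal Python sketch of one way to build such a tree. It rests on assumptions that the abstract does not spell out: the n-th inserted word is taken to be the first n letters of the sequence read backwards, the root stands for the empty word, and each word is placed at the first free node met while following its letters from the root. The function name `cgr_tree_depths` and the nested-dictionary layout are illustrative choices, not the authors' notation.

```python
def cgr_tree_depths(sequence):
    """Insert the reversed prefixes of `sequence` into a digital search tree.

    Returns the insertion depth of each word, in insertion order.
    Assumption: the root represents the empty word and holds no key.
    """
    root = {}  # a node is a dict mapping a letter to its child node
    depths = []
    for n in range(1, len(sequence) + 1):
        # n-th word: the first n letters, read backwards (U_n ... U_1).
        word = sequence[:n][::-1]
        node = root
        depth = 0
        for letter in word:
            depth += 1
            if letter not in node:
                # First free slot along the path: the word is placed here.
                node[letter] = {}
                break
            node = node[letter]
        # A free slot is always found: before the n-th insertion the tree
        # has only n - 1 non-root nodes, so no path of length n is full.
        depths.append(depth)
    return depths


if __name__ == "__main__":
    print(cgr_tree_depths("AACA"))
```

In this sketch a long run of a single letter produces a long branch (inserting `AAAA` gives depths 1, 2, 3, 4), which illustrates why branch lengths and longest runs are related quantities. Successive words share most of their letters, which is the strong dependence the abstract refers to.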

Type
Research Article
Copyright
© EDP Sciences, SMAI, 2009

* Partially supported by the French Agence Nationale de la Recherche, project SADA ANR-05-BLAN-0372.