Published online by Cambridge University Press: 14 July 2016
Computer analysis of biological sequences often detects deviations from a random model. In the usual model, sequence letters are chosen independently, according to some fixed distribution over the relevant alphabet. Real biological sequences often contain simple repeats, however, which can be broadly characterized as multiple contiguous copies (usually inexact) of a specific word. This paper quantifies inexact simple repeats as local sums in a Markov additive process (MAP). The maximum of the local sums has an asymptotic distribution with two parameters (λ and k), which are given by general MAP formulas. The general MAP formulas are usually computationally intractable, but an essential simplification in the case of repeats permits λ and k to be computed from matrices whose dimension equals the size of the relevant alphabet. The simplification applies to some MAPs where the summand distributions depend not on consecutive pairs of Markov states, as is usual, but on pairs of states separated by a fixed time lag larger than one.
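To make the notion of a local sum over lagged letter pairs concrete, the following is a minimal sketch, not the paper's own procedure. It assumes a hypothetical scoring scheme in which the letter at position i is compared with the letter a fixed lag earlier, a match adds a positive score, and a mismatch adds a negative one; the function name `max_local_repeat_score` and the default scores (+1 and -2) are illustrative assumptions, since the abstract does not specify them. The maximum local sum is tracked with a running total that is reset to zero whenever it would become negative.

```python
def max_local_repeat_score(seq, lag, match=1.0, mismatch=-2.0):
    """Return the maximum local sum of lagged match/mismatch scores.

    Assumed (hypothetical) scoring: position i contributes `match` if
    seq[i] == seq[i - lag] and `mismatch` otherwise. A local sum is the
    total contribution over any contiguous run of positions.
    """
    if lag < 1:
        raise ValueError("lag must be a positive integer")

    best = 0.0      # largest local sum seen so far
    running = 0.0   # best local sum ending at the current position

    for i in range(lag, len(seq)):
        step = match if seq[i] == seq[i - lag] else mismatch
        # Restart the local sum at zero if extending it would go negative.
        running = max(0.0, running + step)
        best = max(best, running)

    return best


if __name__ == "__main__":
    # An inexact repeat of the word "ACG" with one substituted letter.
    print(max_local_repeat_score("ACGACGACTACGACG", lag=3))
    # A sequence with no period-2 repeat.
    print(max_local_repeat_score("ACGT", lag=2))
```

Under these assumptions each summand depends on the pair of letters at positions i - lag and i rather than on a consecutive pair, which is the lagged dependence described above. A high maximum for a given lag suggests a tandem repeat whose word length equals that lag; judging whether the maximum is unusually high would require the parameters λ and k, which this sketch does not compute.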