Published online by Cambridge University Press: 18 December 2014
Basic concepts
Approximate string matching, also called “string matching allowing errors,” is the problem of finding a pattern p in a text T when a limited number k of differences is permitted between the pattern and its occurrences in the text.
Among the many existing models defining a “difference,” we focus on the most popular one, called Levenshtein distance or edit distance [Lev65]. Other more complex models exist, especially in computational biology, but the edit distance model has received the most attention and the most effective algorithms have been developed for it. Some of these algorithms can be extended to more complex models.
Under edit distance, one difference equals one edit operation: a character insertion, deletion, or substitution. That is, the edit distance between two strings x and y, ed(x, y), is the minimum number of edit operations required to convert x into y, or vice versa. For example, ed(annual, annealing) = 4: substitute u by e, then insert i, n, and g. The approximate string matching problem becomes that of finding all occurrences in T of every p′ that satisfies ed(p, p′) ≤ k. To ensure a linear-size output it is customary to report only the starting or ending positions of the occurrences.
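The definition of ed(x, y) can be illustrated with a short Python sketch of the classical dynamic programming recurrence. The function name `edit_distance` and the row-by-row layout are illustrative choices, not notation taken from the text.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance between x and y (illustrative sketch)."""
    # prev[j] holds ed(x[:i-1], y[:j]); initially ed("", y[:j]) = j insertions.
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # ed(x[:i], "") = i deletions
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # delete cx from x
                            curr[j - 1] + 1,      # insert cy into x
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("annual", "annealing"))  # 4
```

Only two rows are kept at a time, so the sketch uses O(|y|) space and O(|x|·|y|) time.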
Note that the problem only makes sense for 0 < k < m, where m is the length of p, because for k ≥ m every text substring of length m can be converted into p by substituting its m characters. The case k = 0 corresponds to exact string matching. We call α = k/m the “error level.” It gives a measure of the “fraction” of the pattern that can be altered.
We concentrate on algorithms that are the fastest in the cases that are likely to be of use in some foreseeable application, particularly text retrieval and computational biology. In particular, α < 1/2 in most cases of interest.
We present four approaches. The first approach, which is also the oldest and most flexible, adapts a dynamic programming algorithm that computes edit distance.
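As a minimal sketch of how such an adaptation might look (an assumption, since the details are not given at this point), the edit distance recurrence can be kept unchanged while the cost of the empty pattern prefix is fixed at zero for every text position, so that an occurrence may begin anywhere in T. The function name `approx_search_end_positions` and the choice to report 1-based ending positions are hypothetical.

```python
from typing import List


def approx_search_end_positions(p: str, t: str, k: int) -> List[int]:
    """Report 1-based text positions where an occurrence of p ends with
    at most k differences (illustrative sketch, not the book's code)."""
    m = len(p)
    if not 0 <= k < m:
        # k = 0 is exact matching; k >= m would match everywhere.
        raise ValueError("k must satisfy 0 <= k < len(p)")
    # col[i] = minimum edit distance between p[:i] and any substring
    # of t ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for pos, ct in enumerate(t, start=1):
        prev_diag = col[0]  # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if p[i - 1] == ct else 1
            col[i] = min(old + 1,           # text character ct is extra
                         col[i - 1] + 1,    # pattern character p[i-1] is missing
                         prev_diag + cost)  # match or substitute
            prev_diag = old
        if col[m] <= k:
            ends.append(pos)
    return ends


print(approx_search_end_positions("abc", "xxabcxx", 1))  # [4, 5, 6]
```

With k = 1 the pattern abc is reported as ending at positions 4 (ab, one deletion), 5 (abc, exact), and 6 (abcx, one insertion). The sketch runs in O(mn) time for a text of length n and keeps a single column of m + 1 values.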