Hostname: page-component-cd9895bd7-8ctnn Total loading time: 0 Render date: 2024-12-27T08:54:59.832Z Has data issue: false hasContentIssue false

Universal Data Compression Algorithm Based on Approximate String Matching

Published online by Cambridge University Press:  27 July 2009

Ilan Sadeh
Affiliation:
Department of Communications Engineering, Center for Technological Education and Research, Holon affiliated with Tel Aviv University, P.O. Box 305, Holon, 58102, Israel

Abstract

A practical source coding scheme based on approximate string matching is proposed. It is an approximate fixed-length string matching data compression combined with a block-coder based on the empirical distribution. A lemma on approximate string matching, which is an extension of the Kac Lemma, is proved. It is shown, based on the lemma, that the deterministic algorithm converts the stationary and ergodic source, u, into an output process v, and under the assumption that v is a stationary process, after the scheme has run for an infinite time, the optimal compression ratio R(D) is achieved. This reduces the problem of the universal lossy coder to the proof of stationarity of the output process ν in the proposed algorithm. The main advantages of the proposed method are the asymptotic sequential behavior of the encoder and the simplicity of the decoder.

Type
Research Article
Copyright
Copyright © Cambridge University Press 1996

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

1.Algoet, P.H. & Cover, T.M. (1988). A sandwich proof of the Shannon McMillan Breiman theorem. Annals of Probability 16(2): 899909.CrossRefGoogle Scholar
2.Alon, N. & Spencer, J. (1992). The probabilistic method. New York: John Wiley.Google Scholar
3.Arratia, R.Gordon, L. & Waterman, M. (1990). The Erdos Renyi law in distribution for coin tossing and sequence matching. Annals of Statistics 18: 539570.CrossRefGoogle Scholar
4.Arratia, R. & Waterman, M. (1989). The Erdos Renyi strong law for pattern matching with given proportion of mismatches. Annals of Probability 17: 11521169.CrossRefGoogle Scholar
5.Blahut, R.E. (1987). Principles and practice of information theory. Reading, MA: Addison-Wesley.Google Scholar
6.Breiman, L. (1957). The individual ergodic theorem of information theory. Annals of Mathematical Statistics 28: 809811.CrossRefGoogle Scholar
7.Elias, P. (1975). Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory IT-21: 194203.CrossRefGoogle Scholar
8.Galil, Z. & Giancarlo, R. (1988). Data structures and algorithms for approximate string matching. Journal of Complexity 4: 3372.CrossRefGoogle Scholar
9.Gilbert, E. & Kadota, T. (1992). The Lempel Ziv algorithm and the message complexity. IEEE Transactions on Information Theory IT-38: 18391842.CrossRefGoogle Scholar
10.Gray, R.M. (1975). Sliding block source coding. IEEE Transactions on Information Theory IT-21: 357368.CrossRefGoogle Scholar
11.Gray, R.M. (1990). Entropy and information theory. New York: Springer-Verlag.CrossRefGoogle Scholar
12.Jacquet, P. & Szpankowski, W. (1996). Asymptotic behaviour of the Lempel Ziv parsing scheme and digital search trees. Theoretical Computer Science June.Google Scholar
13.Kac, M. (1947). On the notion of recurrence in discrete stochastic processes. Bulletin of the American Mathematical Society 53: 10021010.CrossRefGoogle Scholar
14.Knuth, D.E. (1981). The art of computer science. Sorting and searching, Vol. 3. Reading, MA: Addison-Wesley.Google Scholar
15.Landau, G.M. & Vishkin, U. (1986). Introducing efficient parallelism into approximate string matching and a new serial algorithm. Proceedings of the 18th Annual ACM Symposium on the Theory of Computing, Berkeley, CA.Google Scholar
16.Le, Gall D. (1991). MPEG: A video compression standard for multimedia applications. Communications of the Association for Computing Machines 34: 4658.Google Scholar
17.Lempel, A. & Ziv, J. (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 1T-23: 337343.Google Scholar
18.Lempel, A. & Ziv, J. (1978). Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory IT-24: 530536.Google Scholar
19.Luczak, T. & Szpankowski, W. (to appear). A lossy data compression based on an approximate pattern matching. Submitted for publication.Google Scholar
20.Manber, U. & Myers, E.W. (1993). Suffix trees: A new method for on-line string searchers. SIAM Journal on Computing 22(5): 935948.CrossRefGoogle Scholar
21.McCreight, E.M. (1976). A space economical suffix tree construction algorithm. Journal of the Association for Computing Machines 32(2): 262272.CrossRefGoogle Scholar
22.Ornstein, D.S. & Shields, P.C. (1990). Universal almost sure data compression. Annals of Probability 18: 441452.CrossRefGoogle Scholar
23.Ornstein, D.S. & Weiss, B. (1993). Entropy and data compression schemes. IEEE Transactions on Information Theory 39(1).CrossRefGoogle Scholar
24.Pittel, B. (1985). Asymptotic growth of a class of random trees. Annals of Probability 13: 414427.CrossRefGoogle Scholar
25.Plotnik, E.Weinberger, M. & Ziv, J. (1992). Upper bounds on the probability of sequences emitted by finite state sources and on the redundancy of the Lempel Ziv algorithm. IEEE Transactions on Information Theory 38: 6672.CrossRefGoogle Scholar
26.Rodeh, M.Pratt, V. & Even, S. (1981). Linear algorithm for data compression via string matching. Journal of the Association for Computing Machines 28: 1624.CrossRefGoogle Scholar
27.Padmanabha Rao, R. & Pearlman, W.A. (1991). On entropy of pyramid structures. IEEE Transactions on Information Theory 37(2).Google Scholar
28.Pearlman, W. (1994). Personal communications.Google Scholar
29.Sadeh, I. (1993). Data compression in computer networks. Ph.D. dissertation, Tel-Aviv University.Google Scholar
30.Sadeh, I. (1993). On approximate string matching. Proceedings of the 1993 Data Compression Conference DCC '93, 03, Snowbird, Utah.Google Scholar
31.Sadeh, I. (1995). Operational rate distortion theory. Journal of Applied Mathematics and Computer Science 5(1): 139169.Google Scholar
32.Steinberg, Y. & Gutman, M. (1993). An algorithm for source coding subject to a fidelity criterion, based on string matching. IEEE Transactions on Information Theory 39: 877887.CrossRefGoogle Scholar
33.Storer, J.A. (1990). Lossy on line dynamic data compression. New York: Springer-Verlag.CrossRefGoogle Scholar
34.Szpankowski, W. (1993). A generalized suffix tree and its unexpected asymptotic behaviours. SIAM Journal of Computing 22: 11761198.CrossRefGoogle Scholar
35.Welch, T.A. (1984). A technique for high performance data compression. IEEE Transactions on Computers 17(6): 819.Google Scholar
36.Wyner, A.D. & Ziv, J. (1989). Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression. IEEE Transactions on Information Theory 35: 12501258.CrossRefGoogle Scholar
37.Ziv, J. (1978). Coding theorem for individual sequences. IEEE Transactions on Information Theory IT-24: 405412.CrossRefGoogle Scholar