TR05-125 Authors: Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, Amir Shpilka, Adam Smith

Publication: 4th November 2005 17:31

Downloads: 2956

Keywords:

We raise the question of approximating compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE) and Lempel-Ziv (LZ), and present algorithms and lower bounds for approximating compressibility with respect to both schemes. While we obtain much stronger results for RLE in terms of the efficiency of the algorithms, our investigation of LZ yields results whose interest goes beyond the initial questions we set out to study. In particular, we prove combinatorial structural lemmas that relate compressibility of a string with respect to Lempel-Ziv to the number of distinct short substrings contained in it. We also show that compressibility with respect to LZ is related to approximating the support size of a distribution. This problem has been considered under different guises in the literature. We prove a strong lower bound for it, at the heart of which is a construction of two positive integer random variables, $X$ and $Y$, with very different expectations and the following condition on the moments up to $k$:

$E[X]/E[Y] = E[X^2]/E[Y^2] = ... = E[X^k]/E[Y^k].$