Weizmann Logo
Electronic Colloquium on Computational Complexity

Under the auspices of the Computational Complexity Foundation (CCF)

Login | Register | Classic Style



TR05-125 | 2nd November 2005 00:00

Sublinear Algorithms for Approximating String Compressibility and the Distribution Support Size



We raise the question of approximating compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE) and Lempel-Ziv (LZ), and present algorithms and lower bounds for approximating compressibility with respect to both schemes. While we obtain much stronger results for RLE in terms of the efficiency of the algorithms, our investigation of LZ yields results whose interest goes beyond the initial questions we set out to study. In particular, we prove combinatorial structural lemmas that relate compressibility of a string with respect to Lempel-Ziv to the number of distinct short substrings contained in it. We also show that compressibility with respect to LZ is related to approximating the support size of a distribution. This problem has been considered under different guises in the literature. We prove a strong lower bound for it, at the heart of which is a construction of two positive integer random variables, $X$ and $Y$, with very different expectations and the following condition on the moments up to $k$:
$E[X]/E[Y] = E[X^2]/E[Y^2] = ... = E[X^k]/E[Y^k].$

ISSN 1433-8092 | Imprint