ECCC-Report TR05-125https://eccc.weizmann.ac.il/report/2005/125Comments and Revisions published for TR05-125en-usFri, 04 Nov 2005 17:31:07 +0200
Paper TR05-125
| Sublinear Algorithms for Approximating String Compressibility and the Distribution Support Size |
Sofya Raskhodnikova,
Dana Ron,
Ronitt Rubinfeld,
Amir Shpilka,
Adam Smith
https://eccc.weizmann.ac.il/report/2005/125We raise the question of approximating compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE) and Lempel-Ziv (LZ), and present algorithms and lower bounds for approximating compressibility with respect to both schemes. While we obtain much stronger results for RLE in terms of the efficiency of the algorithms, our investigation of LZ yields results whose interest goes beyond the initial questions we set out to study. In particular, we prove combinatorial structural lemmas that relate compressibility of a string with respect to Lempel-Ziv to the number of distinct short substrings contained in it. We also show that compressibility with respect to LZ is related to approximating the support size of a distribution. This problem has been considered under different guises in the literature. We prove a strong lower bound for it, at the heart of which is a construction of two positive integer random variables, $X$ and $Y$, with very different expectations and the following condition on the moments up to $k$:
$E[X]/E[Y] = E[X^2]/E[Y^2] = ... = E[X^k]/E[Y^k].$
Fri, 04 Nov 2005 17:31:07 +0200https://eccc.weizmann.ac.il/report/2005/125