Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, Amir Shpilka, Adam Smith

We raise the question of approximating compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE) and Lempel-Ziv (LZ), and present algorithms and lower bounds for approximating compressibility with respect to ... more >>>

Gregory Valiant, Paul Valiant

We prove two new multivariate central limit theorems; the first relates the sum of independent distributions to the multivariate Gaussian of corresponding mean and covariance, under the earthmover distance matric (also known as the Wasserstein metric). We leverage this central limit theorem to prove a stronger but more specific central ... more >>>

Gregory Valiant, Paul Valiant

We introduce a new approach to characterizing the unobserved portion of a distribution, which provides sublinear-sample additive estimators for a class of properties that includes entropy and distribution support size. Together with the lower bounds proven in the companion paper [29], this settles the longstanding question of the sample complexities ... more >>>

Jayadev Acharya, Hirakendu Das, Alon Orlitsky, Ananda Theertha Suresh

The advent of data science has spurred interest in estimating properties of discrete distributions over large alphabets. Fundamental symmetric properties such as support size, support coverage, entropy, and proximity to uniformity, received most attention, with each property estimated using a different technique and often intricate analysis tools.

Motivated by the ... more >>>