ECCC-Report TR17-133https://eccc.weizmann.ac.il/report/2017/133Comments and Revisions published for TR17-133en-usFri, 08 Sep 2017 14:01:55 +0300
Paper TR17-133
| Sample-Optimal Identity Testing with High Probability |
Ilias Diakonikolas,
Themis Gouleakis,
John Peebles,
Eric Price
https://eccc.weizmann.ac.il/report/2017/133We study the problem of testing identity against a given distribution (a.k.a. goodness-of-fit) with a focus on the high confidence regime. More precisely, given samples from an unknown distribution $p$ over $n$ elements, an explicitly given distribution $q$, and parameters $0< \epsilon, \delta < 1$, we wish to distinguish, {\em with probability at least $1-\delta$}, whether
the distributions are identical versus $\epsilon$-far in total variation (or statistical) distance. Existing work has focused on the constant confidence regime, i.e., the case that $\delta = \Omega(1)$, for which the sample complexity of identity testing is known to be $\Theta(\sqrt{n}/\epsilon^2)$.
Typical applications of distribution property testing require small values of the confidence parameter $\delta$ (which correspond to small ``$p$-values'' in the statistical hypothesis testing terminology). Prior work achieved arbitrarily small values of $\delta$ via black-box amplification, which multiplies the required number of samples by $\Theta(\log(1/\delta))$. We show that this upper bound is suboptimal for any $\delta = o(1)$, and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is
\[
\Theta\left( \frac{1}{\epsilon^2}\left(\sqrt{n \log(1/\delta)} + \log(1/\delta) \right)\right)
\]
for any $n, \epsilon$, and $\delta$. For the special case of uniformity testing, where the given distribution is the uniform distribution $U_n$ over the domain, our new tester is surprisingly simple: to test whether $p = U_n$ versus $\dtv (p, U_n) \geq \epsilon$, we simply threshold $\dtv(\wh{p}, U_n)$, where $\wh{p}$ is the empirical probability distribution. We believe that our novel analysis techniques may be useful for other distribution testing problems as well.Fri, 08 Sep 2017 14:01:55 +0300https://eccc.weizmann.ac.il/report/2017/133