__
TR16-178 | 11th November 2016 04:51
__

#### Collision-based Testers are Optimal for Uniformity and Closeness

**Abstract:**
We study the fundamental problems of (i) uniformity testing of a discrete distribution,

and (ii) closeness testing between two discrete distributions with bounded $\ell_2$-norm.

These problems have been extensively studied in distribution testing

and sample-optimal estimators are known for them~\cite{Paninski:08, CDVV14, VV14, DKN:15}.

In this work, we show that the original collision-based testers proposed for these problems

~\cite{GRdist:00, BFR+:00} are sample-optimal, up to constant factors.

Previous analyses showed sample complexity upper bounds for these testers that are optimal

as a function of the domain size $n$, but suboptimal by polynomial factors

in the error parameter $\epsilon$. Our main contribution is a new tight analysis

establishing that these collision-based testers are information-theoretically optimal,

up to constant factors, both in the dependence on $n$ and in the dependence on $\epsilon$.

__
Comment #1 to TR16-178 | 6th September 2017 16:30
__
#### On the optimal analysis of the collision probability tester

**Abstract:**
The collision probability tester, introduced by Goldreich and Ron ({\em ECCC}, TR00-020, 2000),

distinguishes the uniform distribution over $[n]$ from any distribution that is $\e$-far from this distribution using $\poly(1/\epilon)\cdot{\sqrt n}$ samples.

While the original analysis established only an

upper bound of $O(1/\epsilon)^4\cdot{\sqrt n}$ on the sample complexity, a recent analysis of Diakonikolas, Gouleakis, Peebles, and Price

({\em ECCC}, TR16-178, 2016) established the optimal upper bound of $O(1/\epsilon)^2\cdot{\sqrt n}$.

In this note we survey their analysis,

while highlighting the sources of improvement. Specifically:

\begin{itemize}

\item

While the original analysis reduces the testing problem to approximating the collision probability of the unknown distribution up to a $1+\epsilon^2$ factor, the improved analysis capitalizes on the fact that the latter problem

needs only be solved ``at the extreme'' (i.e., it suffices to distinguish the uniform distribution, which has collision probability $1/n$, from any distribution that has collision probability exceeding $(1+4\epsilon^2)/n$).

\item

While the original analysis provides an almost

optimal analysis of the variance of the estimator when $\epsilon=\Omega(1)$, a more careful analysis yields a significantly better bound for the case of $\epsilon=o(1)$, which is the case that is relevant here.

\end{itemize}