TR18-079 Authors: Jayadev Acharya, Clement Canonne, Himanshu Tyagi

Publication: 25th April 2018 04:04

Downloads: 94

Keywords:

Independent samples from an unknown probability distribution $\mathbf{p}$ on a domain of size $k$ are distributed across $n$ players, with each player holding one sample. Each player can communicate $\ell$ bits to a central referee in a simultaneous message passing (SMP) model of communication to help the referee infer a property of the unknown $\mathbf{p}$. When $\ell\geq\log k$ bits, the problem reduces to the well-studied collocated case where all the samples are available in one place. In this work, we focus on the communication-starved setting of $\ell < \log k$, in which the landscape may change drastically. We

propose a general formulation for inference problems in this distributed setting, and instantiate it to two prototypical inference questions: Learning and uniformity testing.

One natural way to solve any distributed inference problem would be for the referee to simulate the distribution $\mathbf{p}$, i.e., generate enough samples from $\mathbf{p}$ using the players' messages, and then perform inference in a centralized fashion. This leads us to the following fundamental problem of distribution simulation, interesting in its own right: Can the referee simulate samples from $\mathbf{p}$ upon observing the messages from the players? Our first result shows that if $\ell<\log k$ perfect simulation of distribution is impossible for any finite number of players. This is perhaps surprising since $\log k$ bits from a single party would have sufficed. However, we provide next a Las Vegas protocol for distributed simulation that can generate a sample from the unknown distribution $\mathbf{p}$ using an expected $O(k/2^\ell)$ players, which we prove is optimal. It follows that any inference task can be solved in the distributed setting using a ``simulate-and-infer'' protocol with at most an $O(k/2^{\ell})$ multiplicative blow-up in the sample complexity compared to the collocated setting.

We then consider two core inference problems in the distributed setting: {Learning} and {uniformity testing} of $k$-ary distributions. The first is known to require $\Theta(k/\varepsilon^2)$ samples in the collocated setting, implying that the aforementioned simulate-and-infer protocol will work with $O(k^2/2^{\ell}\varepsilon^2)$-player protocol in the distributed setting. Furthermore, this number of players can be seen to be optimal using recent results, which in turn establishes that the ``simulate-and-infer'' approach is optimal for distribution learning.

The second inference problem of uniformity testing requires $\Theta(\sqrt{k}/\varepsilon^2)$ samples in the collocated setting [Paninski'08]. This leads to an $O(k^{3/2}/2^{\ell}\varepsilon^2)$-player protocol in the distributed setting using the simulate-and-infer scheme above. We show this is suboptimal if public randomness is allowed. Specifically, we provide a distributed uniformity testing protocol that requires only $O(\sqrt{k^2/2^\ell}/\varepsilon^2)$ players and prove that it is optimal up to constant factors.