TR99-035 Authors: Leonard J. Schulman

Publication: 15th September 1999 16:34

Downloads: 1545

Keywords:

We address the problem of partitioning a set of $n$ points into

clusters, so as to minimize the sum, over all intracluster pairs of

points, of the cost associated with each pair. We obtain a randomized

approximation algorithm for this problem, for the cost functions

$\ell_2^2,\ell_1$ and $\ell_2$, as well as any cost function

isometrically embeddable in $\ell_2^2$.

The maximization problem (maximize the costs of intercluster edges) is

approximated with high probability to within multiplicative factor

$(1-\ep)$; while the minimization problem either receives a

multiplicative $1+\ep$ approximation, or else the optimal clustering

is correctly identified except for a mislabelling of an $\ep$ fraction

of the point set. Given a fixed approximation parameter $\ep$, the

runtime is linear in $n$ for $\ell_2^2$ problems of dimension

$o(\log n / \log \log n)$; and $n^{O(\log \log n)}$ in the general

case.

The case $\ell_2^2$ is addressed by combining three elements: (a)

Variable-probability sampling of the given points, to reduce the size

of the data set. (b) Near-isometric dimension reduction. (c) A

deterministic exact algorithm which runs in time exponential in the

dimension (rather than the number of points). The remaining cases are

addressed by reduction to $\ell_2^2$.