We address the problem of partitioning a set of $n$ points into
clusters, so as to minimize the sum, over all intracluster pairs of
points, of the cost associated with each pair. We obtain a randomized
approximation algorithm for this problem, for the cost functions
$\ell_2^2,\ell_1$ and $\ell_2$, as well as any cost function
isometrically embeddable in $\ell_2^2$.
The maximization problem (maximize the costs of intercluster edges) is
approximated with high probability to within multiplicative factor
$(1-\ep)$; while the minimization problem either receives a
multiplicative $1+\ep$ approximation, or else the optimal clustering
is correctly identified except for a mislabelling of an $\ep$ fraction
of the point set. Given a fixed approximation parameter $\ep$, the
runtime is linear in $n$ for $\ell_2^2$ problems of dimension
$o(\log n / \log \log n)$; and $n^{O(\log \log n)}$ in the general
case.
The case $\ell_2^2$ is addressed by combining three elements: (a)
Variable-probability sampling of the given points, to reduce the size
of the data set. (b) Near-isometric dimension reduction. (c) A
deterministic exact algorithm which runs in time exponential in the
dimension (rather than the number of points). The remaining cases are
addressed by reduction to $\ell_2^2$.