Identifying Clusters from Positive Data

John Case; Sanjay Jain; Eric Martin; Arun Sharma; Frank Stephan

Electronic Colloquium on Computational Complexity

Under the auspices of the Computational Complexity Foundation (CCF)

REPORTS > DETAIL:

Paper:

TR04-058 | 28th May 2004 00:00

Identifying Clusters from Positive Data

TR04-058 Authors: John Case, Sanjay Jain, Eric Martin, Arun Sharma, Frank Stephan
Publication: 16th July 2004 21:11
Downloads: 3354

Keywords:

Clustering, inductive inference, Numbering, Topological and Geomertical Properties of Clusterable Classes, Turing degree

Abstract:

The present work studies clustering from an abstract point of view
and investigates its properties in the framework of inductive inference.
Any class $S$ considered is given by a numbering
$A_0,A_1,...$ of nonempty subsets of the natural numbers
or the rational k-dimensional vector space as a hypothesis space.
A clustering task is a finite and nonempty set of indices of
pairwise disjoint sets. The class $S$ is said to be
clusterable if there is an algorithm which, for every clustering
task $I$, converges in the limit on any text for the union of the
$A_i$ with $i$ in $I$ to a finite set $J$ of indices of pairwise
disjoint clusters having the same union.
A class is called semiclusterable if there is such an
algorithm which finds a $J$ with the last condition relaxed to
the union of the $A_j$ with $j$ in $J$ is a superset
of the union of the $A_i$ with $i$ in $I$.

The relationship between natural topological properties
and clusterability is investigated.
Topological properties can provide sufficient or necessary
conditions for clusterability but they cannot characterize it.
On one hand, many interesting conditions make use of both the
topological structure of the class and a well-chosen numbering.
On the other hand, the clusterability of a class does not depend on the
decision which numbering of the class is used as a hypothesis space
for the clusterer.

These ideas are demonstrated in the context of geometrically
defined classes. Clustering of many of these classes requires
besides the text for the clustering task some additional
information: the class of
convex hulls of finitely many points in a rational vector space
can be clustered with the number of clusters as additional information.
Interestingly, the class of polygons (together with their interiors)
is clusterable if the number of clusters and the overall number of
vertices of these clusters is given to the clusterer as additional
information. Intriguingly this additional information is not
sufficient for classes including figures with holes.

While some classes are unclusterable due to their topological structure,
others are only computationally intractable. Oracles can be used
to distinguish between both cases: the former cannot be clustered
using any oracle while the latter can be clustered using some oracle.
It is shown that there are maximal oracles that allow clustering
of all problems that can be clustered with the help of some oracle.
In particular, $E$ is maximal iff either $K$ is Turing reducible
to $E$ or $K'$ is Turing reducible to $E'$.
Furthermore, no 1-generic oracle below $K$ and no 2-generic
oracle permits to cluster any class which is not
clusterable without an oracle.

ISSN 1433-8092 | Imprint