TR04-058 Authors: John Case, Sanjay Jain, Eric Martin, Arun Sharma, Frank Stephan

Publication: 16th July 2004 21:11

Downloads: 3178

Keywords:

The present work studies clustering from an abstract point of view

and investigates its properties in the framework of inductive inference.

Any class $S$ considered is given by a numbering

$A_0,A_1,...$ of nonempty subsets of the natural numbers

or the rational k-dimensional vector space as a hypothesis space.

A clustering task is a finite and nonempty set of indices of

pairwise disjoint sets. The class $S$ is said to be

clusterable if there is an algorithm which, for every clustering

task $I$, converges in the limit on any text for the union of the

$A_i$ with $i$ in $I$ to a finite set $J$ of indices of pairwise

disjoint clusters having the same union.

A class is called semiclusterable if there is such an

algorithm which finds a $J$ with the last condition relaxed to

the union of the $A_j$ with $j$ in $J$ is a superset

of the union of the $A_i$ with $i$ in $I$.

The relationship between natural topological properties

and clusterability is investigated.

Topological properties can provide sufficient or necessary

conditions for clusterability but they cannot characterize it.

On one hand, many interesting conditions make use of both the

topological structure of the class and a well-chosen numbering.

On the other hand, the clusterability of a class does not depend on the

decision which numbering of the class is used as a hypothesis space

for the clusterer.

These ideas are demonstrated in the context of geometrically

defined classes. Clustering of many of these classes requires

besides the text for the clustering task some additional

information: the class of

convex hulls of finitely many points in a rational vector space

can be clustered with the number of clusters as additional information.

Interestingly, the class of polygons (together with their interiors)

is clusterable if the number of clusters and the overall number of

vertices of these clusters is given to the clusterer as additional

information. Intriguingly this additional information is not

sufficient for classes including figures with holes.

While some classes are unclusterable due to their topological structure,

others are only computationally intractable. Oracles can be used

to distinguish between both cases: the former cannot be clustered

using any oracle while the latter can be clustered using some oracle.

It is shown that there are maximal oracles that allow clustering

of all problems that can be clustered with the help of some oracle.

In particular, $E$ is maximal iff either $K$ is Turing reducible

to $E$ or $K'$ is Turing reducible to $E'$.

Furthermore, no 1-generic oracle below $K$ and no 2-generic

oracle permits to cluster any class which is not

clusterable without an oracle.