One of the most serious problems affecting current machine learning
techniques is dataset dimensionality. In many applications to real-world
problems, we deal with data having anywhere from a few dozen to many
thousands of dimensions. Such high-dimensional data spaces are often
encountered in areas such as medicine or biology, where DNA microarray
technology can produce a large number of measurements at once; in the
clustering of text documents, where, if a word-frequency vector is used,
the number of dimensions equals the size of the dictionary; and in many
other fields, including data integration and management, and social
network analysis. In all these cases, the dimensionality of the data
makes learning problems barely tractable.
In particular, high dimensionality is a critical factor for the
clustering task. The following problems need to be faced when clustering
high-dimensional data:
- When the dimensionality is high, the volume of the space increases so
fast that the available data become sparse; since clusters are
aggregations of data points, no reliable clusters can be found in such
sparse data (curse of dimensionality; see the first sketch after this
list).
- The concept of distance becomes less precise as the number of
dimensions grows, since the distances between any two points in a given
dataset converge to the same value (concentration effects; see the
second sketch after this list).
- Different clusters might be found in different subspaces, so a
global filtering of attributes is not sufficient (local feature
relevance problem).
- Given a large number of attributes, it is likely that some
attributes are correlated. Hence, clusters might exist in arbitrarily
oriented affine subspaces.
- High-dimensional data are likely to include irrelevant features,
which may obscure the effect of the relevant ones.
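
To make the sparsity problem concrete, the following Python sketch (our
own illustration under stated assumptions: data drawn uniformly from the
unit hypercube, Euclidean distance, and an arbitrary sample size n)
estimates the fraction of points that fall within a fixed radius of the
centre of [0, 1]^d; the fraction collapses toward zero as d grows, so
any fixed-size neighbourhood quickly becomes empty.

```python
# Sketch of the curse of dimensionality: a fixed-radius neighbourhood
# captures a vanishing fraction of uniformly distributed data as the
# dimension d grows. Assumptions: uniform data in [0, 1]^d, Euclidean
# distance, n chosen only to stabilise the estimate.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # sample size (illustrative choice)

for d in (1, 2, 5, 10, 20):
    points = rng.random((n, d))                   # uniform in [0, 1]^d
    centre = np.full(d, 0.5)                      # centre of the hypercube
    within = np.linalg.norm(points - centre, axis=1) <= 0.5
    print(f"d={d:2d}  fraction within radius 0.5 of the centre: "
          f"{within.mean():.5f}")
```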
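
The concentration effect can be observed just as directly. The sketch
below (again our own illustration, under the same uniform-data
assumption) computes the relative contrast (max - min)/min of the
Euclidean distances from a random query point to a data sample; as d
grows, the contrast typically drops by orders of magnitude, meaning the
nearest and farthest neighbours of the query become nearly
indistinguishable.

```python
# Sketch of the concentration effect: the relative contrast between the
# nearest and farthest neighbour of a query point shrinks as d grows.
# Assumptions: uniform data in [0, 1]^d, Euclidean distance; n and the
# list of dimensions are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000  # number of data points (illustrative choice)

for d in (2, 10, 100, 1000):
    data = rng.random((n, d))                     # uniform in [0, 1]^d
    query = rng.random(d)                         # random query point
    dists = np.linalg.norm(data - query, axis=1)  # distances to the sample
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  relative contrast (max - min)/min = {contrast:.4f}")
```

It is this loss of contrast that makes distance-based cluster
definitions unreliable in the full-dimensional space.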
The project aims to study the current approaches for clustering
high-dimensional data, with particular stress on relational clustering,
data reduction using rough and fuzzy sets, biclustering/co-clustering,
and related methods for intrinsic dimension estimation and for
clustering comparison.