Abstract:
|
Topic detection is usually treated as a decision process implemented
in some relevant context, for example clustering. In this case, clusters correspond
to the topics that should be identified. Density-based clustering, for example, uses only
a density level ε and a lower bound for the number of points in a cluster. As the
density level is hard to estimate, a stochastic process, called the DBSCAN-Martingale,
is constructed to combine several outputs of DBSCAN for
various values of ε, selected at random from the uniform distribution on a
predefined closed interval [0, ε_max]. We have observed that most of the clusters are extracted
in the interval [0, ε_max/2], and moreover that in the interval [ε_max/2, ε_max] the DBSCAN-Martingale
stochastic process is less innovative, i.e. it extracts only a few or no clusters.
Therefore, non-symmetric, skewed distributions are needed to generate density levels
so that all clusters are extracted quickly. In this work we show that skewed
distributions may be used instead of the uniform distribution, so as to extract all clusters as quickly
as possible. Experiments on real datasets show that the average innovation time of
the DBSCAN-Martingale stochastic process is reduced when skewed distributions are
employed, so less time is needed to extract all clusters. |
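
The idea of drawing density levels from a skewed rather than a uniform distribution can be illustrated with a minimal Python sketch. It does not run DBSCAN; it only compares how often sampled density levels fall in the lower half of the interval, where most clusters were observed to be extracted. The Beta(2, 5) distribution and the names `sample_density_levels` and `fraction_in_lower_half` are assumptions chosen for illustration, not the distributions or code used in this work.

```python
import random


def sample_density_levels(eps_max, n, skewed, rng):
    """Draw n candidate density levels in [0, eps_max].

    skewed=False: uniform on [0, eps_max].
    skewed=True:  eps_max * Beta(2, 5), an illustrative right-skewed choice
                  (assumption, not the distribution used in the paper).
    """
    if skewed:
        # Beta(2, 5) has mean 2/7, so most of its mass lies below 0.5.
        return [eps_max * rng.betavariate(2.0, 5.0) for _ in range(n)]
    return [rng.uniform(0.0, eps_max) for _ in range(n)]


def fraction_in_lower_half(levels, eps_max):
    """Share of sampled density levels that fall in [0, eps_max / 2]."""
    return sum(1 for eps in levels if eps <= eps_max / 2) / len(levels)


rng = random.Random(0)  # fixed seed so the comparison is repeatable
eps_max = 1.0           # hypothetical upper bound for the density level

uniform_levels = sample_density_levels(eps_max, 10_000, skewed=False, rng=rng)
skewed_levels = sample_density_levels(eps_max, 10_000, skewed=True, rng=rng)

print("uniform:", fraction_in_lower_half(uniform_levels, eps_max))
print("skewed: ", fraction_in_lower_half(skewed_levels, eps_max))
```

Under these assumptions, roughly half of the uniform draws land in the lower half of the interval, while the skewed draws concentrate there, so fewer DBSCAN runs would be spent on density levels that rarely yield new clusters.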