org.apache.mahout.clustering.streaming.cluster
Class StreamingKMeans

java.lang.Object
  extended by org.apache.mahout.clustering.streaming.cluster.StreamingKMeans
All Implemented Interfaces:
Iterable<Centroid>

public class StreamingKMeans
extends Object
implements Iterable<Centroid>

Implements a streaming k-means algorithm for weighted vectors. The goal is to cluster points one at a time, which is especially useful for MapReduce mappers that receive their inputs one at a time.

A rough description of the algorithm: suppose there are l clusters at some point and a new point p is added. The new point can either be added to one of the existing l clusters or become a new cluster. To decide:
- let c be the closest cluster to point p;
- let d be the distance between c and p;
- if d > distanceCutoff, create a new cluster from p (p is too far away from the clusters to be part of them; distanceCutoff represents the largest distance from a point to its assigned cluster's centroid);
- else (d <= distanceCutoff), create a new cluster with probability d / distanceCutoff (the probability of creating a new cluster increases as d increases).

There will be either l clusters or l + 1 clusters after processing a new point.

As the number of clusters increases, it will eventually exceed the numClusters limit (numClusters represents a recommendation for the number of clusters that there should be at the end). To decrease the number of clusters, the existing clusters are treated as data points and are re-clustered (collapsed). This tends to reduce the number of clusters. If the number of clusters is still too high, distanceCutoff is increased.

For more details, see:
- "Streaming k-means approximation" by N. Ailon, R. Jaiswal, C. Monteleoni, http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf
- "Fast and Accurate k-means for Large Datasets" by M. Shindler, A. Wong, A. Meyerson, http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
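
The decision rule and the collapse step described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Mahout's implementation: the class StreamingStepSketch and its members are hypothetical, Euclidean distance stands in for a configurable DistanceMeasure, a point that joins a cluster moves the centroid to the weighted mean, and a linear scan replaces the UpdatableSearcher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative sketch of the streaming step; not Mahout's implementation. */
final class StreamingStepSketch {
    final List<double[]> centroids = new ArrayList<>();
    final List<Double> weights = new ArrayList<>();
    final Random random;
    double distanceCutoff;

    StreamingStepSketch(double distanceCutoff, long seed) {
        this.distanceCutoff = distanceCutoff;
        this.random = new Random(seed);
    }

    // Assumption: plain Euclidean distance stands in for a DistanceMeasure.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Processes one weighted point; returns true if it became a new cluster. */
    boolean add(double[] p, double weight) {
        int closest = -1;
        double d = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
            double candidate = distance(centroids.get(i), p);
            if (candidate < d) {
                d = candidate;
                closest = i;
            }
        }
        // No clusters yet, or p is farther than distanceCutoff: always start a new cluster.
        // Otherwise start one with probability d / distanceCutoff.
        if (closest < 0 || d > distanceCutoff || random.nextDouble() < d / distanceCutoff) {
            centroids.add(p.clone());
            weights.add(weight);
            return true;
        }
        // Assumption: joining a cluster moves its centroid to the weighted mean.
        double[] c = centroids.get(closest);
        double w = weights.get(closest);
        for (int i = 0; i < c.length; i++) {
            c[i] = (c[i] * w + p[i] * weight) / (w + weight);
        }
        weights.set(closest, w + weight);
        return false;
    }

    /** Re-clusters the centroids as weighted points; raises the cutoff by beta while too many remain. */
    void collapse(int maxClusters, double beta) {
        recluster();
        while (centroids.size() > maxClusters) {
            distanceCutoff *= beta;
            recluster();
        }
    }

    // Feeds the current centroids back through add() as if they were data points.
    private void recluster() {
        List<double[]> oldCentroids = new ArrayList<>(centroids);
        List<Double> oldWeights = new ArrayList<>(weights);
        centroids.clear();
        weights.clear();
        for (int i = 0; i < oldCentroids.size(); i++) {
            add(oldCentroids.get(i), oldWeights.get(i));
        }
    }
}
```

The sketch assumes distanceCutoff > 0, beta > 1 and maxClusters >= 1. Because each collapsed centroid carries the summed weight of the points it absorbed, the total weight is preserved across collapses.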


Constructor Summary
StreamingKMeans(UpdatableSearcher searcher, int numClusters)
          Calls the six-argument constructor with a default distanceCutoff and beta = 1.3, clusterLogFactor = 10, clusterOvershoot = 2.
StreamingKMeans(UpdatableSearcher searcher, int numClusters, double distanceCutoff)
          Calls StreamingKMeans(searcher, numClusters, distanceCutoff, 1.3, 10, 2).
StreamingKMeans(UpdatableSearcher searcher, int numClusters, double distanceCutoff, double beta, double clusterLogFactor, double clusterOvershoot)
          Creates a new StreamingKMeans instance given a searcher and the number of clusters to generate.
 
Method Summary
 UpdatableSearcher cluster(Centroid datapoint)
          Cluster one data point.
 UpdatableSearcher cluster(Iterable<Centroid> datapoints)
          Cluster the data points in an Iterable.
 UpdatableSearcher cluster(Matrix data)
          Cluster the rows of a matrix, treating them as Centroids with weight 1.
 double getDistanceCutoff()
           
 DistanceMeasure getDistanceMeasure()
           
 int getNumClusters()
           
 Iterator<Centroid> iterator()
           
 void reindexCentroids()
           
 void setDistanceCutoff(double distanceCutoff)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

StreamingKMeans

public StreamingKMeans(UpdatableSearcher searcher,
                       int numClusters)
Calls the six-argument constructor with a default distanceCutoff and beta = 1.3, clusterLogFactor = 10, clusterOvershoot = 2.

See Also:
StreamingKMeans(org.apache.mahout.math.neighborhood.UpdatableSearcher, int, double, double, double, double)

StreamingKMeans

public StreamingKMeans(UpdatableSearcher searcher,
                       int numClusters,
                       double distanceCutoff)
Calls StreamingKMeans(searcher, numClusters, distanceCutoff, 1.3, 10, 2).

See Also:
StreamingKMeans(org.apache.mahout.math.neighborhood.UpdatableSearcher, int, double, double, double, double)

StreamingKMeans

public StreamingKMeans(UpdatableSearcher searcher,
                       int numClusters,
                       double distanceCutoff,
                       double beta,
                       double clusterLogFactor,
                       double clusterOvershoot)
Creates a new StreamingKMeans instance given a searcher and the number of clusters to generate.

Parameters:
searcher - A Searcher that is used for performing nearest neighbor search. It MUST BE EMPTY initially because it will be used to keep track of the cluster centroids.
numClusters - An estimated number of clusters to generate for the data points. This can be adjusted, but the actual number will depend on the data.
distanceCutoff - The initial distance cutoff: the distance between a point and its closest centroid beyond which the point is always assigned to a new cluster.
beta - Ratio of geometric progression to use when increasing distanceCutoff. After n increases, distanceCutoff becomes distanceCutoff * beta^n. A smaller value increases the distanceCutoff less aggressively.
clusterLogFactor - Value multiplied with the number of points counted so far to estimate the number of clusters to aim for. If the final number of clusters is known and this clustering is only for a sketch of the data, this can be the final number of clusters, k.
clusterOvershoot - Multiplicative slack factor for slowing down the collapse of the clusters.
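
The geometric growth controlled by beta can be illustrated with a short calculation. This is a minimal sketch, assuming only the formula stated for beta above (distanceCutoff * beta^n after n increases); the class CutoffGrowthSketch and the method cutoffAfter are hypothetical names, not part of the Mahout API.

```java
/** Illustrative sketch of how distanceCutoff grows; not part of Mahout. */
final class CutoffGrowthSketch {
    /** Returns initialCutoff * beta^increases, computed by repeated multiplication. */
    static double cutoffAfter(double initialCutoff, double beta, int increases) {
        double cutoff = initialCutoff;
        for (int i = 0; i < increases; i++) {
            cutoff *= beta;
        }
        return cutoff;
    }

    public static void main(String[] args) {
        // Compare a gentle ratio (1.3) with a more aggressive one (2.0).
        for (int n = 0; n <= 5; n++) {
            System.out.println(n + " increases: beta=1.3 -> " + cutoffAfter(1.0, 1.3, n)
                    + ", beta=2.0 -> " + cutoffAfter(1.0, 2.0, n));
        }
    }
}
```

A ratio closer to 1 raises the cutoff in smaller steps, so more collapse passes may be needed before the number of clusters falls back under the limit.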

Method Detail

iterator

public Iterator<Centroid> iterator()
Specified by:
iterator in interface Iterable<Centroid>
Returns:
an Iterator to the Centroids contained in this clusterer.

cluster

public UpdatableSearcher cluster(Matrix data)
Cluster the rows of a matrix, treating them as Centroids with weight 1.

Parameters:
data - matrix whose rows are to be clustered.
Returns:
the UpdatableSearcher containing the resulting centroids.
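
Treating matrix rows as unit-weight points can be pictured with plain arrays. This is a minimal sketch, not the Mahout code path: WeightedPoint is a hypothetical stand-in for Centroid, a double[][] stands in for Matrix, and using the row index as the key is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of turning matrix rows into weight-1 points; not part of Mahout. */
final class MatrixRowsSketch {
    /** Hypothetical stand-in for a weighted centroid: key, coordinates, weight. */
    static final class WeightedPoint {
        final int key;
        final double[] vector;
        final double weight;

        WeightedPoint(int key, double[] vector, double weight) {
            this.key = key;
            this.vector = vector;
            this.weight = weight;
        }
    }

    /** Wraps each row as a point of weight 1, keyed by its row index (an assumption). */
    static List<WeightedPoint> rowsAsUnitWeightPoints(double[][] data) {
        List<WeightedPoint> points = new ArrayList<>(data.length);
        for (int row = 0; row < data.length; row++) {
            // Copy the row so later changes to the input do not affect the points.
            points.add(new WeightedPoint(row, data[row].clone(), 1.0));
        }
        return points;
    }
}
```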

cluster

public UpdatableSearcher cluster(Iterable<Centroid> datapoints)
Cluster the data points in an Iterable.

Parameters:
datapoints - Iterable whose elements are to be clustered.
Returns:
the UpdatableSearcher containing the resulting centroids.

cluster

public UpdatableSearcher cluster(Centroid datapoint)
Cluster one data point.

Parameters:
datapoint - the data point to be clustered.
Returns:
the UpdatableSearcher containing the resulting centroids.

getNumClusters

public int getNumClusters()
Returns:
the number of clusters computed from the points until now.

reindexCentroids

public void reindexCentroids()

getDistanceCutoff

public double getDistanceCutoff()
Returns:
the distanceCutoff (an upper bound for the maximum distance within a cluster).

setDistanceCutoff

public void setDistanceCutoff(double distanceCutoff)

getDistanceMeasure

public DistanceMeasure getDistanceMeasure()


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.