org.apache.mahout.clustering.streaming.mapreduce
Class StreamingKMeansDriver

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.mahout.common.AbstractJob
          extended by org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansDriver
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public final class StreamingKMeansDriver
extends AbstractJob

Classifies the vectors into different clusters found by the clustering algorithm.


Field Summary
static String ESTIMATED_DISTANCE_CUTOFF
          The initial estimated distance cutoff between two points for forming new clusters.
static String ESTIMATED_NUM_MAP_CLUSTERS
          The number of clusters that Mappers will use should be \(O(k \log n)\), where k is the number of clusters to get at the end and n is the number of points to cluster.
static String IGNORE_WEIGHTS
          Whether to correct the weights of the centroids after the clustering is done.
static float INVALID_DISTANCE_CUTOFF
           
static String MAX_NUM_ITERATIONS
          After mapping finishes, we get an intermediate set of vectors that represent approximate clusterings of the data from each Mapper.
static String NUM_BALLKMEANS_RUNS
          The number of BallKMeans runs in the reducer that determine the centroids to return.
static String NUM_PROJECTIONS_OPTION
          The number of projections to use when using a projection searcher like ProjectionSearch or FastProjectionSearch.
static String RANDOM_INIT
          Whether to use k-means++ initialization or random initialization of the seed centroids.
static String REDUCE_STREAMING_KMEANS
          Whether to run another pass of StreamingKMeans on the reducer's points before BallKMeans.
static String SEARCH_SIZE_OPTION
          When using approximate searches (anything that's not BruteSearch), more than just the seemingly closest element must be considered.
static String SEARCHER_CLASS_OPTION
          The Searcher class when performing nearest neighbor search in StreamingKMeans.
static String TEST_PROBABILITY
          The percentage of points that go into the "test" set when evaluating BallKMeans runs in the reducer.
static String TRIM_FRACTION
          The "ball" aspect of ball k-means means that only the closest points to the centroid will actually be used for updating.
 
Fields inherited from class org.apache.mahout.common.AbstractJob
argMap, inputFile, inputPath, outputFile, outputPath, tempPath
 
Method Summary
static void configureOptionsForWorkers(org.apache.hadoop.conf.Configuration conf, int numClusters, int estimatedNumMapClusters, float estimatedDistanceCutoff, int maxNumIterations, float trimFraction, boolean randomInit, boolean ignoreWeights, float testProbability, int numBallKMeansRuns, String measureClass, String searcherClass, int searchSize, int numProjections, String method, boolean reduceStreamingKMeans)
          Checks the parameters for a StreamingKMeans job and prepares a Configuration with them.
static void main(String[] args)
           
static int run(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output)
          Iterate over the input vectors to produce clusters and, if requested, use the results of the final iteration to cluster the input vectors.
 int run(String[] args)
           
static int runMapReduce(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path input, org.apache.hadoop.fs.Path output)
           
 
Methods inherited from class org.apache.mahout.common.AbstractJob
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getConf, getDimensions, getFloat, getFloat, getGroup, getInputFile, getInputPath, getInt, getInt, getOption, getOption, getOption, getOptions, getOutputFile, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, prepareJob, setConf, setS3SafeCombinedInputPath, shouldRunNextPhase
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ESTIMATED_NUM_MAP_CLUSTERS

public static final String ESTIMATED_NUM_MAP_CLUSTERS
The number of clusters that Mappers will use should be \(O(k \log n)\), where k is the number of clusters to get at the end and n is the number of points to cluster. This doesn't need to be exact. It will be adjusted at runtime.

See Also:
Constant Field Values
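
As an illustration of the \(O(k \log n)\) guideline, the following minimal sketch computes a starting estimate. The class MapClusterEstimate and the method estimateNumMapClusters are hypothetical and not part of Mahout; the natural logarithm and a constant factor of 1 are assumptions, which is acceptable only because the value does not need to be exact.

```java
// Hypothetical helper, not part of Mahout: a rough k * ln(n) starting estimate.
final class MapClusterEstimate {

    /** Returns ceil(k * ln(n)), but never less than k. */
    static int estimateNumMapClusters(int k, long n) {
        if (k < 1 || n < 1) {
            throw new IllegalArgumentException("k and n must be positive");
        }
        // Assumption: natural log with constant factor 1; another base only changes the constant.
        double estimate = k * Math.log((double) n);
        return (int) Math.max(k, Math.ceil(estimate));
    }

    public static void main(String[] args) {
        // k = 10 final clusters, n = 1,000,000 points: 10 * ln(1e6) is about 138.16, so this prints 139.
        System.out.println(estimateNumMapClusters(10, 1_000_000L));
    }
}
```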

ESTIMATED_DISTANCE_CUTOFF

public static final String ESTIMATED_DISTANCE_CUTOFF
The initial estimated distance cutoff between two points for forming new clusters. Defaults to 10e-6.

See Also:
Constant Field Values

MAX_NUM_ITERATIONS

public static final String MAX_NUM_ITERATIONS
After mapping finishes, we get an intermediate set of vectors that represent approximate clusterings of the data from each Mapper. These can be clustered by the Reducer using BallKMeans in memory. This variable is the maximum number of iterations in the final BallKMeans algorithm. Defaults to 10.

See Also:
Constant Field Values

TRIM_FRACTION

public static final String TRIM_FRACTION
The "ball" aspect of ball k-means means that only the points closest to the centroid will actually be used for updating. The points used are those whose distance to the center is within trimFraction * distance to the closest other center. Defaults to 0.9.

See Also:
Constant Field Values
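
The trimming rule described for TRIM_FRACTION can be sketched as follows. This is not Mahout's BallKMeans code: the class BallTrimSketch and its methods are hypothetical, Euclidean distance is assumed, and "distance to the closest other center" is read here as the distance from the cluster's own center to the nearest other center.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of ball trimming; not the Mahout implementation.
final class BallTrimSketch {

    /** Euclidean distance between two vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns the points assigned to centers[c] that lie within
     * trimFraction * (distance from centers[c] to its closest other center).
     */
    static List<double[]> pointsUsedForUpdate(double[][] centers, int c,
                                              List<double[]> assigned, double trimFraction) {
        double closestOther = Double.POSITIVE_INFINITY;
        for (int j = 0; j < centers.length; j++) {
            if (j != c) {
                closestOther = Math.min(closestOther, distance(centers[c], centers[j]));
            }
        }
        double radius = trimFraction * closestOther;
        List<double[]> kept = new ArrayList<>();
        for (double[] point : assigned) {
            if (distance(point, centers[c]) <= radius) {
                kept.add(point); // inside the ball: contributes to the centroid update
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        double[][] centers = {{0, 0}, {10, 0}};
        List<double[]> assigned = new ArrayList<>();
        assigned.add(new double[] {1, 0});   // distance 1 from center 0
        assigned.add(new double[] {-3, 4});  // distance 5 from center 0
        assigned.add(new double[] {-20, 0}); // distance 20: an outlier beyond the ball
        // With trimFraction 0.9 the radius is 9, so the outlier is excluded.
        System.out.println(pointsUsedForUpdate(centers, 0, assigned, 0.9).size());
    }
}
```

Under this reading, trimming mainly discards outliers on the far side of a cluster, since points lying between two centers are already closer than half the center-to-center distance.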

RANDOM_INIT

public static final String RANDOM_INIT
Whether to use k-means++ initialization or random initialization of the seed centroids. k-means++ generally provides better clusters but takes longer. Random initialization takes less time but produces worse clusters, tends to fail more often, and needs multiple runs to compare to k-means++. If set, random initialization is used.

See Also:
BallKMeans, Constant Field Values

IGNORE_WEIGHTS

public static final String IGNORE_WEIGHTS
Whether to correct the weights of the centroids after the clustering is done. The weights end up being wrong because of the trimFraction and possible train/test splits. In some cases, especially in a pipeline, having an accurate count of the weights is useful. If set, ignores the final weights.

See Also:
Constant Field Values

TEST_PROBABILITY

public static final String TEST_PROBABILITY
The percentage of points that go into the "test" set when evaluating BallKMeans runs in the reducer.

See Also:
Constant Field Values
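
One way to picture the train/test evaluation is a simple holdout split, sketched below. The class TrainTestSplitSketch and its split method are hypothetical; shuffling first and rounding the test-set size are assumptions, and the split performed inside Mahout's reducer may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical holdout split; the reducer's actual splitting logic may differ.
final class TrainTestSplitSketch {

    /**
     * Shuffles a copy of the points and returns [train, test], where the test
     * list holds round(testProbability * size) points.
     */
    static <T> List<List<T>> split(List<T> points, double testProbability, Random random) {
        if (testProbability < 0.0 || testProbability >= 1.0) {
            throw new IllegalArgumentException("testProbability must be in [0, 1)");
        }
        List<T> shuffled = new ArrayList<>(points);
        Collections.shuffle(shuffled, random);
        int testSize = (int) Math.round(testProbability * shuffled.size());
        int trainSize = shuffled.size() - testSize;
        List<T> train = new ArrayList<>(shuffled.subList(0, trainSize));
        List<T> test = new ArrayList<>(shuffled.subList(trainSize, shuffled.size()));
        return Arrays.asList(train, test);
    }

    public static void main(String[] args) {
        List<Integer> points = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            points.add(i);
        }
        List<List<Integer>> parts = split(points, 0.1, new Random(42));
        // Clusters would be fit on parts.get(0) and the error measured on parts.get(1).
        System.out.println(parts.get(0).size() + " train, " + parts.get(1).size() + " test");
    }
}
```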

NUM_BALLKMEANS_RUNS

public static final String NUM_BALLKMEANS_RUNS
The number of BallKMeans runs in the reducer that determine the centroids to return. Clusters are computed for the training set and the error is computed on the test set.

See Also:
Constant Field Values

SEARCHER_CLASS_OPTION

public static final String SEARCHER_CLASS_OPTION
The Searcher class when performing nearest neighbor search in StreamingKMeans. Defaults to ProjectionSearch.

See Also:
Constant Field Values

NUM_PROJECTIONS_OPTION

public static final String NUM_PROJECTIONS_OPTION
The number of projections to use when using a projection searcher like ProjectionSearch or FastProjectionSearch. Projection searches work by projecting all the vectors onto a set of basis vectors and searching for the projected query in that totally ordered set. This, however, can produce false positives (vectors that are closer when projected than they actually are). So, there must be more than one projection vector in the basis. This variable is the number of vectors in a basis. Defaults to 3.

See Also:
Constant Field Values
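
The idea behind projection search can be sketched in a few lines. This is an illustration only, not the code of org.apache.mahout.math.neighborhood.ProjectionSearch: the class ProjectionSearchSketch is hypothetical, it assumes Gaussian random directions and Euclidean distance, and it treats searchSize as the number of neighbors examined on each side of the query's position in each ordering.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch of projection-based approximate nearest neighbor search.
final class ProjectionSearchSketch {
    private final double[][] data;
    private final double[][] basis;  // numProjections random directions
    private final Integer[][] order; // order[p]: data indices sorted by projection onto basis[p]
    private final int searchSize;

    ProjectionSearchSketch(double[][] data, int numProjections, int searchSize, long seed) {
        this.data = data;
        this.searchSize = searchSize;
        Random random = new Random(seed);
        int dim = data[0].length;
        basis = new double[numProjections][dim];
        order = new Integer[numProjections][];
        for (int p = 0; p < numProjections; p++) {
            for (int d = 0; d < dim; d++) {
                basis[p][d] = random.nextGaussian();
            }
            final double[] direction = basis[p];
            Integer[] indices = new Integer[data.length];
            for (int i = 0; i < indices.length; i++) {
                indices[i] = i;
            }
            // Each projection yields a totally ordered set of scalar values.
            Arrays.sort(indices, Comparator.comparingDouble(index -> dot(data[index], direction)));
            order[p] = indices;
        }
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return sum;
    }

    /** Returns the index of the best candidate found; exact only if searchSize covers all data. */
    int nearest(double[] query) {
        Set<Integer> candidates = new HashSet<>();
        for (int p = 0; p < basis.length; p++) {
            double projectedQuery = dot(query, basis[p]);
            int lo = 0;
            int hi = data.length;
            while (lo < hi) { // binary search for the query's position in this ordering
                int mid = (lo + hi) >>> 1;
                if (dot(data[order[p][mid]], basis[p]) < projectedQuery) {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            int from = Math.max(0, lo - searchSize);
            int to = Math.min(data.length, lo + searchSize);
            for (int i = from; i < to; i++) {
                candidates.add(order[p][i]);
            }
        }
        // Re-rank candidates by true distance to weed out false positives from projection.
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int candidate : candidates) {
            double d = squaredDistance(data[candidate], query);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {5, 5}, {10, 0}, {0, 10}, {-7, 3}};
        // 3 projections and a search size of 2 mirror the documented defaults.
        ProjectionSearchSketch searcher = new ProjectionSearchSketch(data, 3, 2, 7L);
        System.out.println(searcher.nearest(new double[] {4.5, 5.2}));
    }
}
```

Using several projections makes it less likely that a true neighbor is missed in every ordering, and a larger search size trades speed for accuracy.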

SEARCH_SIZE_OPTION

public static final String SEARCH_SIZE_OPTION
When using approximate searches (anything that's not BruteSearch), more than just the seemingly closest element must be considered. This variable has different meanings depending on the actual Searcher class used, but it is a measure of how many candidates will be considered. See the ProjectionSearch, FastProjectionSearch, and LocalitySensitiveHashSearch classes for more details. Defaults to 2.

See Also:
Constant Field Values

REDUCE_STREAMING_KMEANS

public static final String REDUCE_STREAMING_KMEANS
Whether to run another pass of StreamingKMeans on the reducer's points before BallKMeans. On some data sets with a large number of mappers, the intermediate number of clusters passed to the reducer is too large to fit into memory directly, hence the option to collapse the clusters further with StreamingKMeans.

See Also:
Constant Field Values

INVALID_DISTANCE_CUTOFF

public static final float INVALID_DISTANCE_CUTOFF
See Also:
Constant Field Values
Method Detail

run

public int run(String[] args)
        throws Exception
Throws:
Exception

configureOptionsForWorkers

public static void configureOptionsForWorkers(org.apache.hadoop.conf.Configuration conf,
                                              int numClusters,
                                              int estimatedNumMapClusters,
                                              float estimatedDistanceCutoff,
                                              int maxNumIterations,
                                              float trimFraction,
                                              boolean randomInit,
                                              boolean ignoreWeights,
                                              float testProbability,
                                              int numBallKMeansRuns,
                                              String measureClass,
                                              String searcherClass,
                                              int searchSize,
                                              int numProjections,
                                              String method,
                                              boolean reduceStreamingKMeans)
                                       throws ClassNotFoundException
Checks the parameters for a StreamingKMeans job and prepares a Configuration with them.

Parameters:
conf - the Configuration to populate
numClusters - k, the number of clusters at the end
estimatedNumMapClusters - O(k log n), the number of clusters requested from each mapper
estimatedDistanceCutoff - an estimate of the minimum distance that separates two clusters (can be smaller and will be increased dynamically)
maxNumIterations - the maximum number of iterations of BallKMeans
trimFraction - the fraction of the points to be considered in updating a ball k-means
randomInit - whether to initialize the ball k-means seeds randomly
ignoreWeights - whether to ignore the invalid final ball k-means weights
testProbability - the percentage of vectors assigned to the test set for selecting the best final centers
numBallKMeansRuns - the number of BallKMeans runs in the reducer that determine the centroids to return (clusters are computed for the training set and the error is computed on the test set)
measureClass - string, name of the distance measure class; theory works for Euclidean-like distances
searcherClass - string, name of the searcher that will be used for nearest neighbor search
searchSize - the number of closest neighbors to look at for selecting the closest one in approximate nearest neighbor searches
numProjections - the number of projected vectors to use for faster searching (only useful for ProjectionSearch or FastProjectionSearch); see org.apache.mahout.math.neighborhood.ProjectionSearch
Throws:
ClassNotFoundException

run

public static int run(org.apache.hadoop.conf.Configuration conf,
                      org.apache.hadoop.fs.Path input,
                      org.apache.hadoop.fs.Path output)
               throws IOException,
                      InterruptedException,
                      ClassNotFoundException,
                      ExecutionException
Iterate over the input vectors to produce clusters and, if requested, use the results of the final iteration to cluster the input vectors.

Parameters:
input - the directory pathname for input points.
output - the directory pathname for output points.
Returns:
0 on success, -1 on failure.
Throws:
IOException
InterruptedException
ClassNotFoundException
ExecutionException

runMapReduce

public static int runMapReduce(org.apache.hadoop.conf.Configuration conf,
                               org.apache.hadoop.fs.Path input,
                               org.apache.hadoop.fs.Path output)
                        throws IOException,
                               ClassNotFoundException,
                               InterruptedException
Throws:
IOException
ClassNotFoundException
InterruptedException

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.