org.apache.mahout.clustering.lda.cvb
Class CachingCVB0Mapper

java.lang.Object
  extended by org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.IntWritable,VectorWritable,org.apache.hadoop.io.IntWritable,VectorWritable>
      extended by org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper
Direct Known Subclasses:
CVB0DocInferenceMapper

public class CachingCVB0Mapper
extends org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.IntWritable,VectorWritable,org.apache.hadoop.io.IntWritable,VectorWritable>

Runs ensemble learning by loading the ModelTrainer with two TopicModel instances: one from the previous iteration and one empty. Inference runs against the first; learning updates accumulate in the second and are emitted only at cleanup().
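The read/write split described above can be illustrated with a minimal sketch in plain Java. TwoModelSketch and its methods are hypothetical names for illustration only; they are not part of Mahout, and the per-term update is deliberately simplified (no document-topic prior, no smoothing) compared with whatever ModelTrainer actually does.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the read/write model pair; NOT Mahout's ModelTrainer. */
final class TwoModelSketch {
  private final double[][] readModel;   // topic x term counts from the previous iteration
  private final double[][] writeModel;  // starts empty; receives this iteration's updates

  TwoModelSketch(double[][] previous) {
    int numTopics = previous.length;
    int numTerms = previous[0].length;
    readModel = new double[numTopics][];
    for (int k = 0; k < numTopics; k++) {
      readModel[k] = Arrays.copyOf(previous[k], numTerms);
    }
    writeModel = new double[numTopics][numTerms];
  }

  /** Analogue of map(): infer against readModel, accumulate into writeModel. */
  void train(int[] termIds, double[] termCounts) {
    for (int i = 0; i < termIds.length; i++) {
      int term = termIds[i];
      double norm = 0.0;
      for (double[] topic : readModel) {
        norm += topic[term];
      }
      if (norm == 0.0) {
        continue; // term unseen in the previous model; nothing to distribute
      }
      for (int k = 0; k < readModel.length; k++) {
        // Simplified assumption: split the term count across topics in
        // proportion to the previous iteration's counts for that term.
        writeModel[k][term] += termCounts[i] * readModel[k][term] / norm;
      }
    }
  }

  /** Analogue of cleanup(): the accumulated updates are handed out once, at the end. */
  double[][] emit() {
    return writeModel;
  }

  /** Exposed only so callers can check that inference never mutates the read model. */
  double[][] readModel() {
    return readModel;
  }
}
```

In this sketch, train() may be called any number of times without changing what later calls infer against, which is why the updates only become visible in the next iteration.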

Two obvious performance improvements remain. The memory footprint of this Mapper could be halved by accumulating model updates onto the model used for inference; this might also speed up convergence, since learning would take effect during an iteration rather than only after it completes. The model most likely does not need to accumulate double values either; floats would probably suffice. Together, these two changes could improve memory efficiency by another factor of 4.
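The factor of 4 is just the product of the two savings: one model instead of two, and 4-byte floats instead of 8-byte doubles. A small sketch of the arithmetic follows. It assumes a dense topic-by-term matrix and ignores object overhead; ModelMemorySketch and the example sizes are hypothetical and not taken from Mahout.

```java
/** Hypothetical helper for the back-of-envelope estimate; not part of Mahout. */
final class ModelMemorySketch {
  /**
   * Bytes needed to hold numModels dense topic x term matrices,
   * ignoring JVM object headers and any sparse representation.
   */
  static long modelBytes(long numTopics, long numTerms, int bytesPerValue, int numModels) {
    return numTopics * numTerms * bytesPerValue * numModels;
  }
}
```

For example, with 100 topics and 50,000 terms, modelBytes(100, 50000, 8, 2) gives 80,000,000 bytes for the current two-model double layout, while modelBytes(100, 50000, 4, 1) gives 20,000,000 bytes for a single float model.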

In terms of CPU, the p(topic|doc) distribution is re-learned from scratch on every iteration. This usually takes only 10 fixed-point iterations per document, but that is still 10x the cost of a single one. Avoiding it would require a map-side join of the unchanging corpus with the continually improving p(topic|doc) matrix, plus multiple outputs from the mappers so that the reduce-side model averaging can still be done. Tricky, but possibly worth it.
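To make the per-document cost concrete, here is a minimal sketch of a fixed-point loop that re-estimates p(topic|doc) from a uniform start. It is an EM-style simplification written for illustration: DocTopicFixedPointSketch is a hypothetical class, the input is assumed to be an already normalized p(term|topic) matrix, and the smoothing terms that a real CVB0 update would include are omitted.

```java
import java.util.Arrays;

/** Hypothetical illustration of per-document fixed-point inference; not Mahout code. */
final class DocTopicFixedPointSketch {
  /**
   * @param termGivenTopic assumed p(term|topic), indexed [topic][term]
   * @param termIds        ids of the terms present in the document
   * @param termCounts     count of each listed term in the document
   * @param maxIters       number of fixed-point iterations (the text mentions 10)
   * @return estimated p(topic|doc), summing to 1 unless no term has any support
   */
  static double[] infer(double[][] termGivenTopic, int[] termIds,
                        double[] termCounts, int maxIters) {
    int numTopics = termGivenTopic.length;
    double[] docTopics = new double[numTopics];
    Arrays.fill(docTopics, 1.0 / numTopics); // "starting from scratch"

    for (int iter = 0; iter < maxIters; iter++) {
      double[] next = new double[numTopics];
      double total = 0.0;
      for (int i = 0; i < termIds.length; i++) {
        int term = termIds[i];
        double norm = 0.0;
        for (int k = 0; k < numTopics; k++) {
          norm += docTopics[k] * termGivenTopic[k][term];
        }
        if (norm == 0.0) {
          continue; // no topic explains this term
        }
        for (int k = 0; k < numTopics; k++) {
          // Share of this term's count attributed to topic k.
          double share = termCounts[i] * docTopics[k] * termGivenTopic[k][term] / norm;
          next[k] += share;
          total += share;
        }
      }
      if (total == 0.0) {
        break; // nothing to normalize; keep the previous estimate
      }
      for (int k = 0; k < numTopics; k++) {
        next[k] /= total;
      }
      docTopics = next;
    }
    return docTopics;
  }
}
```

If the previous iteration's p(topic|doc) were available as the starting point instead of the uniform fill, a single pass of the loop might be enough, which is the saving the map-side join is meant to buy.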

ModelTrainer already takes advantage of multi-core availability (though perhaps not in the nicest way) by doing multithreaded learning; see that class for details.


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Mapper.Context
 
Constructor Summary
CachingCVB0Mapper()
           
 
Method Summary
protected  void cleanup(org.apache.hadoop.mapreduce.Mapper.Context context)
           
protected  int getMaxIters()
           
protected  ModelTrainer getModelTrainer()
           
protected  int getNumTopics()
           
 void map(org.apache.hadoop.io.IntWritable docId, VectorWritable document, org.apache.hadoop.mapreduce.Mapper.Context context)
           
protected  void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
           
 
Methods inherited from class org.apache.hadoop.mapreduce.Mapper
run
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CachingCVB0Mapper

public CachingCVB0Mapper()

Method Detail

getModelTrainer

protected ModelTrainer getModelTrainer()

getMaxIters

protected int getMaxIters()

getNumTopics

protected int getNumTopics()

setup

protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
              throws IOException,
                     InterruptedException
Overrides:
setup in class org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.IntWritable,VectorWritable,org.apache.hadoop.io.IntWritable,VectorWritable>
Throws:
IOException
InterruptedException

map

public void map(org.apache.hadoop.io.IntWritable docId,
                VectorWritable document,
                org.apache.hadoop.mapreduce.Mapper.Context context)
         throws IOException,
                InterruptedException
Overrides:
map in class org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.IntWritable,VectorWritable,org.apache.hadoop.io.IntWritable,VectorWritable>
Throws:
IOException
InterruptedException

cleanup

protected void cleanup(org.apache.hadoop.mapreduce.Mapper.Context context)
                throws IOException,
                       InterruptedException
Overrides:
cleanup in class org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.IntWritable,VectorWritable,org.apache.hadoop.io.IntWritable,VectorWritable>
Throws:
IOException
InterruptedException


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.