org.apache.mahout.vectorizer.collocations.llr
Class CollocMapper

java.lang.Object
  extended by org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.Text,StringTuple,GramKey,Gram>
      extended by org.apache.mahout.vectorizer.collocations.llr.CollocMapper

public class CollocMapper
extends org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.Text,StringTuple,GramKey,Gram>

Pass 1 of the collocation discovery job, which generates ngrams and emits each ngram and its component (n-1)grams. Input is a SequenceFile, where the key is a document id and the value is the tokenized document.
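
As a rough illustration of what generating ngrams from one tokenized document involves, the sketch below slides a window over a token list. It is plain Java and only a stand-in for the Lucene ShingleFilter that the mapper uses; the class name ShingleSketch, the method shingles, and the choice to emit every size from 2 up to the maximum are assumptions made for this example, not part of the Mahout API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {

    // Hypothetical helper: returns every run of 2..maxShingleSize consecutive
    // tokens, joined by single spaces. Not the Lucene ShingleFilter itself.
    static List<String> shingles(List<String> tokens, int maxShingleSize) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int size = 2;
                 size <= maxShingleSize && start + size <= tokens.size();
                 size++) {
                out.add(String.join(" ", tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One tokenized document, as it might arrive in the map input value.
        List<String> tokens = Arrays.asList("click", "and", "clack");
        // prints [click and, click and clack, and clack]
        System.out.println(shingles(tokens, 3));
    }
}
```

The window size plays the role of the maximum shingle size; how that value is configured for the real job is not shown on this page.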


Nested Class Summary
static class CollocMapper.Count
           
 
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Mapper.Context
 
Field Summary
static String MAX_SHINGLE_SIZE
           
 
Constructor Summary
CollocMapper()
           
 
Method Summary
protected  void map(org.apache.hadoop.io.Text key, StringTuple value, org.apache.hadoop.mapreduce.Mapper.Context context)
          Collocation finder: pass 1 map phase.
protected  void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
           
 
Methods inherited from class org.apache.hadoop.mapreduce.Mapper
cleanup, run
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MAX_SHINGLE_SIZE

public static final String MAX_SHINGLE_SIZE
See Also:
Constant Field Values
Constructor Detail

CollocMapper

public CollocMapper()
Method Detail

map

protected void map(org.apache.hadoop.io.Text key,
                   StringTuple value,
                   org.apache.hadoop.mapreduce.Mapper.Context context)
            throws IOException,
                   InterruptedException
Collocation finder: pass 1 map phase.

Receives a token stream, which is passed through a Lucene ShingleFilter. The ShingleFilter delivers ngrams of the appropriate size, which are then decomposed into head and tail subgrams and collected in the following manner:

 k:head_key,           v:head_subgram
 k:head_key,ngram_key, v:ngram
 k:tail_key,           v:tail_subgram
 k:tail_key,ngram_key, v:ngram
 

The 'head' or 'tail' prefix specifies whether the subgram in question is the head or the tail of the ngram. In this implementation the head of the ngram is an (n-1)gram, and the tail is a (1)gram.

For example, given 'click and clack' and an ngram length of 3:

 k: head_'click and'                         v:head_'click and'
 k: head_'click and',ngram_'click and clack' v:ngram_'click and clack'
 k: tail_'clack'                             v:tail_'clack'
 k: tail_'clack',ngram_'click and clack'     v:ngram_'click and clack'
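
The decomposition in this example can be sketched in plain Java as follows. This is a minimal illustration, assuming whitespace-delimited tokens and rendering each record as a string; the class HeadTailSketch and its methods are hypothetical, and the real mapper emits GramKey and Gram objects rather than text.

```java
import java.util.ArrayList;
import java.util.List;

public class HeadTailSketch {

    // Splits an ngram at its last space: element 0 is the leading (n-1)gram
    // (the head), element 1 is the final unigram (the tail).
    static String[] split(String ngram) {
        int lastSpace = ngram.lastIndexOf(' ');
        if (lastSpace < 0) {
            throw new IllegalArgumentException("need at least two tokens: " + ngram);
        }
        return new String[] {
            ngram.substring(0, lastSpace),
            ngram.substring(lastSpace + 1)
        };
    }

    // Renders the four key/value records for one ngram as text, in the same
    // order as the listing: head subgram, head+ngram, tail subgram, tail+ngram.
    static List<String> records(String ngram) {
        String[] parts = split(ngram);
        String head = "head_'" + parts[0] + "'";
        String tail = "tail_'" + parts[1] + "'";
        String full = "ngram_'" + ngram + "'";
        List<String> out = new ArrayList<>();
        out.add("k: " + head + " v: " + head);
        out.add("k: " + head + "," + full + " v: " + full);
        out.add("k: " + tail + " v: " + tail);
        out.add("k: " + tail + "," + full + " v: " + full);
        return out;
    }

    public static void main(String[] args) {
        for (String record : records("click and clack")) {
            System.out.println(record);
        }
    }
}
```

Each ngram therefore produces four records: every subgram is emitted once on its own and once paired with the ngram it came from.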
 

Also counts the total number of ngrams encountered and adds it to the counter CollocMapper.Count.NGRAM_TOTAL.

Overrides:
map in class org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.Text,StringTuple,GramKey,Gram>
Throws:
IOException - if there's a problem with the ShingleFilter reading data or the collector collecting output.
InterruptedException

setup

protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
              throws IOException,
                     InterruptedException
Overrides:
setup in class org.apache.hadoop.mapreduce.Mapper<org.apache.hadoop.io.Text,StringTuple,GramKey,Gram>
Throws:
IOException
InterruptedException


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.