org.apache.mahout.vectorizer.encoders
Class TextValueEncoder

java.lang.Object
  extended by org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder
      extended by org.apache.mahout.vectorizer.encoders.TextValueEncoder
Direct Known Subclasses:
CachingTextValueEncoder, LuceneTextValueEncoder

public class TextValueEncoder
extends FeatureVectorEncoder

Encodes text that is tokenized on non-alphanumeric separators. Each word is encoded using a settable word encoder, which by default is a StaticWordValueEncoder that gives all words the same weight.

See Also:
LuceneTextValueEncoder
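
The description above can be illustrated with a small, self-contained sketch. It does not use Mahout: the class HashedTextSketch, the separator pattern [^A-Za-z0-9]+, the plain double[] standing in for a Vector, and the use of String.hashCode() as the hash are all assumptions made for illustration. The real encoder's hash function, probes, and handling of repeated words are not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for illustration only; not the Mahout implementation. */
final class HashedTextSketch {

    /** Splits on runs of characters outside A-Z, a-z, 0-9 and drops empty tokens. */
    static List<String> tokenize(CharSequence text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toString().split("[^A-Za-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Adds the same weight at a hashed index for every token occurrence. */
    static double[] encode(CharSequence text, double weight, int size) {
        double[] vector = new double[size];
        for (String token : tokenize(text)) {
            // Assumption: String.hashCode() is a placeholder for the encoder's real hash.
            int index = Math.floorMod(token.hashCode(), size);
            vector[index] += weight;
        }
        return vector;
    }
}
```

Because every word receives the same weight in this sketch, the resulting array depends only on which words occur and how often, which mirrors the "all words the same weight" default described above.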

Field Summary
 
Fields inherited from class org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder
CONTINUOUS_VALUE_HASH_SEED, WORD_LIKE_VALUE_HASH_SEED
 
Constructor Summary
TextValueEncoder(String name)
           
 
Method Summary
 void addText(byte[] originalForm)
          Adds text to the internal word counter, but delays converting it to vector form until flush is called.
 void addText(CharSequence text)
          Adds text to the internal word counter, but delays converting it to vector form until flush is called.
 void addToVector(byte[] originalForm, double weight, Vector data)
          Adds a value to a vector after tokenizing it by splitting on non-alphanumeric characters.
 String asString(String originalForm)
          Converts a value into a form that would help a human understand the internals of how the value is being interpreted.
 void flush(double weight, Vector data)
          Adds all of the tokens that we counted up to a vector.
protected  Iterable<Integer> hashesForProbe(byte[] originalForm, int dataSize, String name, int probe)
          Returns all of the hashes for this probe.
protected  int hashForProbe(byte[] originalForm, int dataSize, String name, int probe)
          Provides the unique hash for a particular probe.
 void setWordEncoder(FeatureVectorEncoder wordEncoder)
           
protected  Iterable<String> tokenize(CharSequence originalForm)
          Tokenizes a string using the simplest method.
 
Methods inherited from class org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder
addToVector, addToVector, addToVector, bytesForString, getName, getProbes, getWeight, hash, hash, hash, hash, hash, isTraceEnabled, setProbes, setTraceDictionary, trace, trace
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TextValueEncoder

public TextValueEncoder(String name)
Method Detail

addToVector

public void addToVector(byte[] originalForm,
                        double weight,
                        Vector data)
Adds a value to a vector after tokenizing it by splitting on non-alphanumeric characters.

Specified by:
addToVector in class FeatureVectorEncoder
Parameters:
originalForm - The original form of the value as a byte array.
weight - The weight to apply to the encoded value.
data - The vector to which the value should be added.

addText

public void addText(byte[] originalForm)
Adds text to the internal word counter, but delays converting it to vector form until flush is called.

Parameters:
originalForm - The original text encoded as UTF-8

addText

public void addText(CharSequence text)
Adds text to the internal word counter, but delays converting it to vector form until flush is called.

Parameters:
text - The original text as a character sequence

flush

public void flush(double weight,
                  Vector data)
Adds all of the tokens that we counted up to a vector.
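
The addText/flush pattern can be sketched as follows. This is a self-contained illustration, not the Mahout code: DeferredTextCounterSketch, pendingWords, the double[] in place of a Vector, and the String.hashCode() index are hypothetical. In particular, the sketch adds weight times the raw count for each word, which is an assumption; the real encoder may scale counts differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified stand-in for illustration only; not the Mahout implementation. */
final class DeferredTextCounterSketch {
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    /** Counts the words in the text; nothing is written to any vector yet. */
    void addText(CharSequence text) {
        for (String token : text.toString().split("[^A-Za-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    /** Writes all counted words into data, then resets the counter. */
    void flush(double weight, double[] data) {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int index = Math.floorMod(e.getKey().hashCode(), data.length);
            // Assumption: each word contributes weight * raw count.
            data[index] += weight * e.getValue();
        }
        counts.clear();
    }

    /** Number of distinct words waiting to be flushed. */
    int pendingWords() {
        return counts.size();
    }
}
```

Deferring the conversion lets several calls to addText accumulate into one set of counts, so text arriving in pieces is encoded once when flush is called.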


hashForProbe

protected int hashForProbe(byte[] originalForm,
                           int dataSize,
                           String name,
                           int probe)
Description copied from class: FeatureVectorEncoder
Provides the unique hash for a particular probe. For all encoders except text, this is all that is needed, and the default implementation of hashesForProbe will do the right thing. For text and similar values, hashesForProbe should be overridden and this method should not be used.

Specified by:
hashForProbe in class FeatureVectorEncoder
Parameters:
originalForm - The original byte array value
dataSize - The length of the vector being encoded
name - The name of the variable being encoded
probe - The probe number
Returns:
The hash of the current probe

hashesForProbe

protected Iterable<Integer> hashesForProbe(byte[] originalForm,
                                           int dataSize,
                                           String name,
                                           int probe)
Description copied from class: FeatureVectorEncoder
Returns all of the hashes for this probe. For most encoders, this is a singleton, but for text, many hashes are returned, one for each word (unique or not). Most implementations should only implement hashForProbe for simplicity.

Overrides:
hashesForProbe in class FeatureVectorEncoder
Parameters:
originalForm - The original byte array value.
dataSize - The length of the vector being encoded
name - The name of the variable being encoded
probe - The probe number
Returns:
an Iterable of the hashes
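
As a rough illustration of "one hash for each word (unique or not)", the sketch below returns one index per word occurrence. It is self-contained and does not call Mahout: HashesPerWordSketch and the way name, word, and probe are mixed into a hash are placeholders, and decoding the bytes as UTF-8 is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for illustration only; not the Mahout implementation. */
final class HashesPerWordSketch {

    /** Returns one hashed index per word occurrence, duplicates included. */
    static List<Integer> hashesForProbe(byte[] originalForm, int dataSize, String name, int probe) {
        // Assumption: the byte array holds UTF-8 text.
        String text = new String(originalForm, StandardCharsets.UTF_8);
        List<Integer> hashes = new ArrayList<>();
        for (String word : text.split("[^A-Za-z0-9]+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Placeholder mixing of variable name, word, and probe number.
            int h = 31 * (31 * name.hashCode() + word.hashCode()) + probe;
            hashes.add(Math.floorMod(h, dataSize));
        }
        return hashes;
    }
}
```

A repeated word yields the same index each time it occurs, so the returned list has as many entries as the text has words.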

tokenize

protected Iterable<String> tokenize(CharSequence originalForm)
Tokenizes a string using the simplest method. This should be overridden for more subtle tokenization.

See Also:
LuceneTextValueEncoder
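
Overriding tokenize in a subclass might look like the sketch below. To stay self-contained it uses a hypothetical base class, BaseEncoderSketch, in place of TextValueEncoder; the helper tokens, the separator pattern, and the lowercasing and minimum-length rules are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for the base encoder; not the Mahout class. */
class BaseEncoderSketch {

    /** Simplest tokenization: split on runs of non-alphanumeric characters. */
    protected Iterable<String> tokenize(CharSequence originalForm) {
        List<String> tokens = new ArrayList<>();
        for (String t : originalForm.toString().split("[^A-Za-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Helper for inspection: collects whatever tokenize produces. */
    final List<String> tokens(CharSequence text) {
        List<String> out = new ArrayList<>();
        for (String t : tokenize(text)) {
            out.add(t);
        }
        return out;
    }
}

/** Subclass with slightly subtler tokenization. */
final class LowercasingEncoderSketch extends BaseEncoderSketch {

    /** Lowercases tokens and drops single-character ones. */
    @Override
    protected Iterable<String> tokenize(CharSequence originalForm) {
        List<String> tokens = new ArrayList<>();
        for (String t : super.tokenize(originalForm)) {
            if (t.length() > 1) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```

Only tokenize changes in the subclass; everything that consumes the tokens is inherited unchanged, which is the point of making the method protected and overridable.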

asString

public String asString(String originalForm)
Converts a value into a form that would help a human understand the internals of how the value is being interpreted. For text-like things, this is likely to be a list of the terms found with associated weights (if any).

Specified by:
asString in class FeatureVectorEncoder
Parameters:
originalForm - The original form of the value as a string.
Returns:
A string that a human can read.

setWordEncoder

public final void setWordEncoder(FeatureVectorEncoder wordEncoder)
Sets the encoder used to encode each word. By default this is a StaticWordValueEncoder, which gives all words the same weight.

Parameters:
wordEncoder - The encoder to apply to each word


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.