java.lang.Object
  org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder
    org.apache.mahout.vectorizer.encoders.TextValueEncoder
public class TextValueEncoder extends FeatureVectorEncoder
Encodes text that is tokenized on non-alphanumeric separators. Each word is encoded using a settable encoder, which by default is a StaticWordValueEncoder that gives all words the same weight.
See Also: LuceneTextValueEncoder
Field Summary |
---|
Fields inherited from class org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder |
---|
CONTINUOUS_VALUE_HASH_SEED, WORD_LIKE_VALUE_HASH_SEED |
Constructor Summary | |
---|---|
TextValueEncoder(String name) | |
Method Summary | |
---|---|
void | addText(byte[] originalForm): Adds text to the internal word counter, but delays converting it to vector form until flush is called. |
void | addText(CharSequence text): Adds text to the internal word counter, but delays converting it to vector form until flush is called. |
void | addToVector(byte[] originalForm, double weight, Vector data): Adds a value to a vector after tokenizing it by splitting on non-alphanumeric characters. |
String | asString(String originalForm): Converts a value into a form that would help a human understand the internals of how the value is being interpreted. |
void | flush(double weight, Vector data): Adds all of the tokens that we counted up to a vector. |
protected Iterable<Integer> | hashesForProbe(byte[] originalForm, int dataSize, String name, int probe): Returns all of the hashes for this probe. |
protected int | hashForProbe(byte[] originalForm, int dataSize, String name, int probe): Provides the unique hash for a particular probe. |
void | setWordEncoder(FeatureVectorEncoder wordEncoder) |
protected Iterable<String> | tokenize(CharSequence originalForm): Tokenizes a string using the simplest method. |
Methods inherited from class org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder |
---|
addToVector, addToVector, addToVector, bytesForString, getName, getProbes, getWeight, hash, hash, hash, hash, hash, isTraceEnabled, setProbes, setTraceDictionary, trace, trace |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public TextValueEncoder(String name)
Method Detail |
---|
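public void addToVector(byte[] originalForm, double weight, Vector data)
Adds a value to a vector after tokenizing it by splitting on non-alphanumeric characters.
addToVector in class FeatureVectorEncoder
Parameters:
originalForm - The original form of the value as a string.
data - The vector to which the value should be added.

public void addText(byte[] originalForm)
Adds text to the internal word counter, but delays converting it to vector form until flush is called.
Parameters:
originalForm - The original text encoded as UTF-8.

public void addText(CharSequence text)
Adds text to the internal word counter, but delays converting it to vector form until flush is called.
Parameters:
text - The original text.

public void flush(double weight, Vector data)
Adds all of the tokens that we counted up to a vector.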
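
The count-now, encode-later pattern described for addText and flush can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Mahout's implementation: the class name `DeferredTextCounterSketch` and the helper `pendingTokens` are hypothetical, a `double[]` stands in for Mahout's `Vector`, `String.hashCode()` stands in for the real hash function, and each token contributes `weight * count` (the actual class may scale counts differently).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the encoder; not part of Mahout.
final class DeferredTextCounterSketch {
    // Internal word counter: token -> number of occurrences seen so far.
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    /** Counts the tokens in text; nothing is written to any vector yet. */
    void addText(CharSequence text) {
        for (String token : text.toString().split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    /** Adds weight * count for every counted token to data, then resets the counter. */
    void flush(double weight, double[] data) {
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            // Assumption: one slot per token, chosen by String.hashCode().
            int index = Math.floorMod(entry.getKey().hashCode(), data.length);
            data[index] += weight * entry.getValue();
        }
        counts.clear();
    }

    /** Number of distinct tokens waiting to be flushed. */
    int pendingTokens() {
        return counts.size();
    }

    public static void main(String[] args) {
        DeferredTextCounterSketch encoder = new DeferredTextCounterSketch();
        double[] data = new double[16];
        encoder.addText("to be or not");
        encoder.addText("to be");
        encoder.flush(1.0, data);
        double total = 0;
        for (double v : data) {
            total += v;
        }
        System.out.println(total); // prints 6.0 (six token occurrences, weight 1.0)
    }
}
```

Deferring the conversion lets several addText calls accumulate into one set of counts, so each distinct word is written to the vector once per flush rather than once per occurrence.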

protected int hashForProbe(byte[] originalForm, int dataSize, String name, int probe)
Description copied from class: FeatureVectorEncoder
Provides the unique hash for a particular probe.
hashForProbe in class FeatureVectorEncoder
Parameters:
originalForm - The original byte array value.
dataSize - The length of the vector being encoded.
name - The name of the variable being encoded.
probe - The probe number.

protected Iterable<Integer> hashesForProbe(byte[] originalForm, int dataSize, String name, int probe)
Description copied from class: FeatureVectorEncoder
Returns all of the hashes for this probe.
hashesForProbe in class FeatureVectorEncoder
Parameters:
originalForm - The original byte array value.
dataSize - The length of the vector being encoded.
name - The name of the variable being encoded.
probe - The probe number.
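
One plausible reading of the two hashing methods is that a single value maps to one location per probe, while a text value containing several tokens maps to several locations for the same probe. The sketch below illustrates that reading only. The class `ProbeHashSketch` is hypothetical, the arithmetic hash is a stand-in chosen for brevity rather than the hash function Mahout uses, and the per-token behaviour of `hashesForProbe` is an assumption rather than something the documentation above confirms.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of Mahout.
final class ProbeHashSketch {
    /** Maps (name, value, probe) to an index in [0, dataSize). */
    static int hashForProbe(byte[] originalForm, int dataSize, String name, int probe) {
        // Stand-in hash: mixes the variable name, the value bytes, and the probe number.
        int h = 31 * (31 * name.hashCode() + Arrays.hashCode(originalForm)) + probe;
        return Math.floorMod(h, dataSize);
    }

    /** Assumed behaviour: one index per token of the UTF-8 text, all for the same probe. */
    static List<Integer> hashesForProbe(byte[] originalForm, int dataSize, String name, int probe) {
        List<Integer> hashes = new ArrayList<>();
        String text = new String(originalForm, StandardCharsets.UTF_8);
        for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                byte[] tokenBytes = token.getBytes(StandardCharsets.UTF_8);
                hashes.add(hashForProbe(tokenBytes, dataSize, name, probe));
            }
        }
        return hashes;
    }

    public static void main(String[] args) {
        byte[] text = "red green blue".getBytes(StandardCharsets.UTF_8);
        // Each probe yields one candidate index per token.
        System.out.println(hashesForProbe(text, 100, "color", 0));
        System.out.println(hashesForProbe(text, 100, "color", 1));
    }
}
```

Using more than one probe spreads each token over several vector positions, which reduces the damage done when two tokens collide at any single position.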

protected Iterable<String> tokenize(CharSequence originalForm)
Tokenizes a string using the simplest method.
See Also: LuceneTextValueEncoder
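
As a rough illustration of what "the simplest method" could look like, the sketch below splits on runs of non-alphanumeric characters and drops empty tokens. It is an assumption-laden example rather than the class's actual code: the name `TokenizeSketch` is hypothetical, and "alphanumeric" is taken here to mean Unicode letters and decimal digits, which may differ from the pattern the real encoder uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of Mahout.
final class TokenizeSketch {
    /** Splits text on runs of non-alphanumeric characters, dropping empty tokens. */
    static List<String> tokenize(CharSequence text) {
        List<String> tokens = new ArrayList<>();
        // Assumption: letters (\p{L}) and decimal digits (\p{Nd}) count as alphanumeric.
        for (String piece : text.toString().split("[^\\p{L}\\p{Nd}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Hello, world, It, s, 2024]
        System.out.println(tokenize("Hello, world! It's 2024."));
    }
}
```

Note how the apostrophe splits "It's" into two tokens. Cases like this are where a more careful tokenizer, such as a Lucene analyzer, would be worth the extra setup.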
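
public String asString(String originalForm)
Converts a value into a form that would help a human understand the internals of how the value is being interpreted.
asString in class FeatureVectorEncoder
Parameters:
originalForm - The original form of the value as a string.
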
public final void setWordEncoder(FeatureVectorEncoder wordEncoder)