Package org.apache.mahout.text

Class Summary
LuceneIndexHelper Utility for checking whether a field exists in a Lucene index.
LuceneSegmentInputFormat InputFormat implementation which splits a Lucene index at the segment level.
LuceneSegmentInputSplit InputSplit implementation that represents a Lucene segment.
LuceneSegmentRecordReader RecordReader implementation for Lucene segments.
LuceneStorageConfiguration Holds all the configuration for SequenceFilesFromLuceneStorage, which generates a sequence file with id as the key and a content field as value.
MailArchivesClusteringAnalyzer Custom Lucene Analyzer designed for aggressive feature reduction when clustering the ASF Mail Archives: it uses an extended set of stop words, excludes non-alphanumeric tokens, and applies Porter stemming.
MultipleTextFileInputFormat Used together with the WholeFileRecordReader class to combine a large number of text files into one text input reader.
PrefixAdditionFilter Default parser for parsing text into sequence files.
ReadOnlyFileSystemDirectory This class implements a read-only Lucene Directory on top of a general FileSystem.
SequenceFilesFromDirectory Converts a directory of text documents into SequenceFiles of specified chunkSize.
SequenceFilesFromDirectoryFilter Implement this interface if you wish to extend SequenceFilesFromDirectory with your own parsing logic.
SequenceFilesFromDirectoryMapper Map class for the SequenceFilesFromDirectory MapReduce job.
SequenceFilesFromLuceneStorage Generates a sequence file from a Lucene index with a specified id field as the key and a content field as the value.
SequenceFilesFromLuceneStorageDriver Driver class for the lucene2seq program.
SequenceFilesFromLuceneStorageMapper Maps document IDs to key/value pairs with the ID field as the key and the concatenated stored field(s) as the value.
SequenceFilesFromLuceneStorageMRJob Generates a sequence file from a Lucene index via MapReduce.
SequenceFilesFromMailArchives Converts a directory of gzipped mail archives into SequenceFiles of specified chunkSize.
SequenceFilesFromMailArchivesMapper Map class for the SequenceFilesFromMailArchives job.
TextParagraphSplittingJob  
TextParagraphSplittingJob.SplitMap  
WholeFileRecordReader RecordReader used with the MultipleTextFileInputFormat class to read full files as k/v pairs and groups of files as single input splits.
WikipediaToSequenceFile Create and run the Wikipedia Dataset Creator.
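
The feature reduction described for MailArchivesClusteringAnalyzer can be illustrated with a minimal plain-Java sketch. It does not use the Lucene or Mahout APIs and is not the class's actual implementation. The class name FeatureReductionSketch, the method reduce, the short stop-word list, and the rule that a token must consist only of letters and digits are all assumptions made for illustration. Porter stemming is left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of org.apache.mahout.text.
class FeatureReductionSketch {

    // Assumed stop-word list; the real analyzer uses its own extended set.
    private static final Set<String> STOP_WORDS =
            Set.of("the", "and", "a", "of", "to", "re", "fwd", "wrote");

    // Lowercases whitespace-separated tokens, then drops stop words and
    // any token that is not made up purely of letters and digits.
    static List<String> reduce(String text) {
        List<String> kept = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (token.isEmpty()) {
                continue; // leading whitespace yields an empty first token
            }
            if (!token.matches("[a-z0-9]+")) {
                continue; // assumed rule: exclude non-alphanumeric tokens
            }
            if (STOP_WORDS.contains(token)) {
                continue; // exclude stop words
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(reduce("Re: the patch fixes MAHOUT-1234 and the build"));
    }
}
```

In this sketch, tokens such as "Re:" and "MAHOUT-1234" are dropped because they contain punctuation, which shows why such filtering shrinks the feature space before clustering.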
 

Enum Summary
SequenceFilesFromLuceneStorageMapper.DataStatus  
 



Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.