org.apache.mahout.text
Class SequenceFilesFromDirectory

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.mahout.common.AbstractJob
          extended by org.apache.mahout.text.SequenceFilesFromDirectory
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable, org.apache.hadoop.util.Tool

public class SequenceFilesFromDirectory
extends AbstractJob

Converts a directory of text documents into SequenceFiles of a specified chunk size. This class takes a parent directory containing subfolders of text documents, reads the files recursively, and creates SequenceFiles that map docid => content. The docid is the document's path relative to the parent directory, prepended with a specified prefix. You can also specify the input encoding of the text files. The content of the output SequenceFiles is encoded as UTF-8 text.
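The docid and encoding rules described above can be illustrated with a minimal, self-contained sketch. It does not use or reproduce Mahout's code: the class DocIdSketch, its methods docId and toUtf8, the sample paths, and the "/" separator placed between the prefix and the relative path are all assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration only; not part of the Mahout API.
class DocIdSketch {

    // Builds a docid from the document's path relative to the parent
    // directory, prepended with a prefix. The "/" joining the prefix and
    // the relative path is an assumption, not confirmed Mahout behaviour.
    static String docId(String prefix, Path parentDir, Path document) {
        Path relative = parentDir.relativize(document);
        // Normalise to forward slashes so keys look the same on every platform.
        String relativeName = relative.toString().replace('\\', '/');
        return prefix + "/" + relativeName;
    }

    // Decodes raw file bytes using the input encoding and re-encodes
    // the text as UTF-8, mirroring the "content is UTF-8" rule.
    static byte[] toUtf8(byte[] raw, Charset inputCharset) {
        return new String(raw, inputCharset).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Path parent = Paths.get("/data/corpus");
        Path doc = Paths.get("/data/corpus/sports/article1.txt");
        System.out.println(docId("news", parent, doc)); // prints news/sports/article1.txt

        byte[] latin1 = {(byte) 0xE9}; // "é" in ISO-8859-1
        System.out.println(toUtf8(latin1, StandardCharsets.ISO_8859_1).length); // prints 2
    }
}
```

The key point is that the docid depends only on the prefix and the position of the file under the parent directory, so the same corpus layout yields the same keys wherever the parent directory lives.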


Field Summary
static String BASE_INPUT_PATH
           
static String[] FILE_FILTER_CLASS_OPTION
           
static String[] KEY_PREFIX_OPTION
           
 
Fields inherited from class org.apache.mahout.common.AbstractJob
argMap, inputFile, inputPath, outputFile, outputPath, tempPath
 
Constructor Summary
SequenceFilesFromDirectory()
           
 
Method Summary
protected  void addOptions()
          Override this method in order to add additional options to the command line of the SequenceFilesFromDirectory job.
static void main(String[] args)
           
protected  Map<String,String> parseOptions()
          Override this method in order to parse your additional options from the command line.
 int run(String[] args)
           
 
Methods inherited from class org.apache.mahout.common.AbstractJob
addFlag, addInputOption, addOption, addOption, addOption, addOption, addOutputOption, buildOption, buildOption, getAnalyzerClassFromOption, getCLIOption, getConf, getDimensions, getFloat, getFloat, getGroup, getInputFile, getInputPath, getInt, getInt, getOption, getOption, getOption, getOptions, getOutputFile, getOutputPath, getOutputPath, getTempPath, getTempPath, hasOption, keyFor, maybePut, parseArguments, parseArguments, parseDirectories, prepareJob, prepareJob, prepareJob, prepareJob, setConf, setS3SafeCombinedInputPath, shouldRunNextPhase
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

FILE_FILTER_CLASS_OPTION

public static final String[] FILE_FILTER_CLASS_OPTION

KEY_PREFIX_OPTION

public static final String[] KEY_PREFIX_OPTION

BASE_INPUT_PATH

public static final String BASE_INPUT_PATH
See Also:
Constant Field Values
Constructor Detail

SequenceFilesFromDirectory

public SequenceFilesFromDirectory()
Method Detail

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception

run

public int run(String[] args)
        throws Exception
Throws:
Exception

addOptions

protected void addOptions()
Override this method to add options to the command line of the SequenceFilesFromDirectory job. Remember to call super.addOptions(); otherwise none of the standard options (input/output directories, etc.) will be available.


parseOptions

protected Map<String,String> parseOptions()
Override this method to parse your additional options from the command line. Remember to call super.parseOptions(); otherwise the standard options (input/output directories, etc.) will not be available.

Returns:
Map of options
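
The override pattern for addOptions and parseOptions can be sketched without any Mahout dependency. The classes below are hypothetical stand-ins, not AbstractJob or SequenceFilesFromDirectory: BaseJobSketch, CustomJobSketch, the options and rawArgs fields, and the "language" option are all invented for illustration. The sketch only shows why each override should delegate to the superclass method first.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the parent job; NOT Mahout's API.
class BaseJobSketch {
    protected final Set<String> options = new LinkedHashSet<>();
    protected final Map<String, String> rawArgs; // already-tokenised command line

    BaseJobSketch(Map<String, String> rawArgs) {
        this.rawArgs = rawArgs;
    }

    // Registers the standard options.
    protected void addOptions() {
        options.add("input");
        options.add("output");
    }

    // Collects the values of the standard options.
    protected Map<String, String> parseOptions() {
        Map<String, String> parsed = new LinkedHashMap<>();
        parsed.put("input", rawArgs.get("input"));
        parsed.put("output", rawArgs.get("output"));
        return parsed;
    }
}

// Hypothetical subclass adding one extra option named "language".
class CustomJobSketch extends BaseJobSketch {
    CustomJobSketch(Map<String, String> rawArgs) {
        super(rawArgs);
    }

    @Override
    protected void addOptions() {
        super.addOptions();      // keep the standard input/output options
        options.add("language"); // then register the additional one
    }

    @Override
    protected Map<String, String> parseOptions() {
        Map<String, String> parsed = super.parseOptions(); // standard options first
        parsed.put("language", rawArgs.get("language"));
        return parsed;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new HashMap<>();
        raw.put("input", "/in");
        raw.put("output", "/out");
        raw.put("language", "en");

        CustomJobSketch job = new CustomJobSketch(raw);
        job.addOptions();
        System.out.println(job.options);        // prints [input, output, language]
        System.out.println(job.parseOptions()); // prints {input=/in, output=/out, language=en}
    }
}
```

If either override omitted the call to its superclass method, the "input" and "output" entries would be missing from the result, which is the failure the method descriptions warn about.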


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.