org.apache.mahout.text.wikipedia
Class WikipediaXmlSplitter

java.lang.Object
  extended by org.apache.mahout.text.wikipedia.WikipediaXmlSplitter

public final class WikipediaXmlSplitter
extends Object

The Bayes example package provides helper classes for training the Naive Bayes classifier on the Twenty Newsgroups data. See PrepareTwentyNewsgroups for details on running the trainer and on formatting the Twenty Newsgroups data for training.

The easiest way to prepare the data is to use the ant task in core/build.xml:

ant extract-20news-18828

This runs the arg line:

-p ${working.dir}/20news-18828/ -o ${working.dir}/20news-18828-collapse -a ${analyzer} -c UTF-8

To run the Wikipedia examples (this assumes you have built the Mahout job jar):

  1. Download the Wikipedia dataset using the Ant target: ant enwiki-files
  2. Chunk the data using the WikipediaXmlSplitter (run from the Hadoop home directory): bin/hadoop jar $MAHOUT_HOME/target/mahout-examples-0.x org.apache.mahout.text.wikipedia.WikipediaXmlSplitter -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml -o $MAHOUT_HOME/examples/work/wikipedia/chunks/ -c 64
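
The chunking step above can be pictured with a minimal sketch. This is not the WikipediaXmlSplitter implementation; it only illustrates grouping whole page elements into chunks under a size limit, assuming the -c option gives an approximate chunk size. The class PageChunker, the method chunk, and the byte-based limit are hypothetical names and choices made for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Mahout. */
public final class PageChunker {
  private PageChunker() {}

  /**
   * Groups pages into chunks whose UTF-8 size stays at or under maxBytes.
   * A single page larger than maxBytes still gets a chunk of its own,
   * because pages are never split in this sketch.
   */
  public static List<List<String>> chunk(List<String> pages, long maxBytes) {
    List<List<String>> chunks = new ArrayList<>();
    List<String> current = new ArrayList<>();
    long currentBytes = 0;
    for (String page : pages) {
      long size = page.getBytes(StandardCharsets.UTF_8).length;
      // Start a new chunk when adding this page would exceed the limit.
      if (!current.isEmpty() && currentBytes + size > maxBytes) {
        chunks.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(page);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      chunks.add(current);
    }
    return chunks;
  }

  public static void main(String[] args) {
    // Tiny stand-ins for <page> elements; a real dump is far larger.
    List<String> pages =
        List.of("<page>a</page>", "<page>bb</page>", "<page>ccc</page>");
    List<List<String>> chunks = chunk(pages, 30);
    System.out.println("chunks: " + chunks.size());
  }
}
```

Keeping each page whole means a downstream job can parse every chunk independently, at the cost of chunks that only approximate the requested size.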


Method Summary
static void main(String[] args)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

main

public static void main(String[] args)
                 throws IOException
Throws:
IOException


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.