Apache Mahout: Getting Started
So I did a small project in order to understand how Apache Mahout works. I decided to use Apache Maven 2 to manage all the dependencies, so I will start with the POM file.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.acme</groupId>
    <artifactId>mahout</artifactId>
    <version>0.94</version>
    <name>Mahout Examples</name>
    <description>Scalable machine learning library examples</description>
    <packaging>jar</packaging>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <apache.mahout.version>0.4</apache.mahout.version>
    </properties>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <encoding>UTF-8</encoding>
                    <source>1.6</source>
                    <target>1.6</target>
                    <optimize>true</optimize>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>org.apache.mahout</groupId>
            <artifactId>mahout-core</artifactId>
            <version>${apache.mahout.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.mahout</groupId>
            <artifactId>mahout-math</artifactId>
            <version>${apache.mahout.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.mahout</groupId>
            <artifactId>mahout-utils</artifactId>
            <version>${apache.mahout.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-jcl</artifactId>
            <version>1.6.0</version>
        </dependency>
    </dependencies>
</project>
Then I looked into the Apache Mahout examples and algorithms available for the text classification problem. The simplest and most accurate one is the Naive Bayes classifier. Here is a code snippet:
package org.acme;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.TrainClassifier;
import org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm;
import org.apache.mahout.classifier.bayes.common.BayesParameters;
import org.apache.mahout.classifier.bayes.datastore.InMemoryBayesDatastore;
import org.apache.mahout.classifier.bayes.exceptions.InvalidDatastoreException;
import org.apache.mahout.classifier.bayes.interfaces.Algorithm;
import org.apache.mahout.classifier.bayes.interfaces.Datastore;
import org.apache.mahout.classifier.bayes.model.ClassifierContext;
import org.apache.mahout.common.nlp.NGrams;

public class Starter {
    public static void main( final String[] args ) {
        // Parameters used for both training and classification
        final BayesParameters params = new BayesParameters();
        params.setGramSize( 1 );
        params.set( "verbose", "true" );
        params.set( "classifierType", "bayes" );
        params.set( "defaultCat", "OTHER" );
        params.set( "encoding", "UTF-8" );
        params.set( "alpha_i", "1.0" );
        params.set( "dataSource", "hdfs" );
        params.set( "basePath", "/tmp/output" );

        try {
            // Train the model: training files are read from /tmp/input,
            // the resulting model is written to /tmp/output
            Path input = new Path( "/tmp/input" );
            TrainClassifier.trainNaiveBayes( input, "/tmp/output", params );

            // Load the trained model into an in-memory datastore and
            // initialize the classifier context
            Algorithm algorithm = new BayesAlgorithm();
            Datastore datastore = new InMemoryBayesDatastore( params );
            ClassifierContext classifier = new ClassifierContext( algorithm, datastore );
            classifier.initialize();

            // Classify the file passed as the first command-line argument, line by line
            final BufferedReader reader = new BufferedReader( new FileReader( args[ 0 ] ) );
            try {
                String entry = reader.readLine();
                while( entry != null ) {
                    List< String > document = new NGrams( entry,
                        Integer.parseInt( params.get( "gramSize" ) ) )
                        .generateNGramsWithoutLabel();
                    // result holds the category the classifier assigned to this line
                    ClassifierResult result = classifier.classifyDocument(
                        document.toArray( new String[ document.size() ] ),
                        params.get( "defaultCat" ) );
                    entry = reader.readLine();
                }
            } finally {
                reader.close();
            }
        } catch( final IOException ex ) {
            ex.printStackTrace();
        } catch( final InvalidDatastoreException ex ) {
            ex.printStackTrace();
        }
    }
}
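The snippet above computes a ClassifierResult for every line but does not show what to do with it. Below is a minimal, hedged sketch of how the result might be consumed once a ClassifierContext has been initialized as above; classifyAndPrint is a hypothetical helper of my own, and it assumes the Mahout 0.4 ClassifierResult API (getLabel() and getScore()).

package org.acme;

import java.util.List;

import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.classifier.bayes.exceptions.InvalidDatastoreException;
import org.apache.mahout.classifier.bayes.model.ClassifierContext;
import org.apache.mahout.common.nlp.NGrams;

public class ResultPrinter {
    // Hypothetical helper: classifies one line of text with an already
    // initialized ClassifierContext and prints the predicted category.
    public static void classifyAndPrint( final ClassifierContext classifier,
            final String text, final int gramSize, final String defaultCategory )
            throws InvalidDatastoreException {
        final List< String > document =
            new NGrams( text, gramSize ).generateNGramsWithoutLabel();
        final ClassifierResult result = classifier.classifyDocument(
            document.toArray( new String[ document.size() ] ), defaultCategory );
        // Assumes getLabel()/getScore() on ClassifierResult, as in Mahout 0.4
        System.out.println( text + " => "
            + result.getLabel() + " (" + result.getScore() + ")" );
    }
}

Called from inside the while loop above (for example, classifyAndPrint( classifier, entry, 1, "OTHER" )), it would print each input line together with the category the classifier assigned to it.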
There is one important note here: the system must be trained before classification can start. To do that, it's necessary to provide examples of the different text categories (the more, the better). These should be simple files where each line starts with the category, separated by a tab from the text itself. For example:
SUGGESTION That's a great suggestion
QUESTION Do you sell Microsoft Office?
...
The more files you provide, the more precise the classification you can get. All files must be placed in the "/tmp/input" folder; they will be processed by Apache Hadoop first. :)
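To make the expected file layout concrete, here is a small, hedged sketch that writes such a training file; the file name train.txt is my own assumption, and in the setup above the file would ultimately need to end up under /tmp/input (on HDFS, since dataSource is set to hdfs).

package org.acme;

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class TrainingFileWriter {
    public static void main( final String[] args ) throws IOException {
        // Each line: CATEGORY <tab> sample text, one example per line.
        // More lines (and more files) generally mean better accuracy.
        final PrintWriter out = new PrintWriter( new FileWriter( "train.txt" ) );
        try {
            out.println( "SUGGESTION\tThat's a great suggestion" );
            out.println( "QUESTION\tDo you sell Microsoft Office?" );
        } finally {
            out.close();
        }
    }
}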
Reference: Apache Mahout: Getting started from our JCG partner Andrey Redko at the Andriy Redko {devmind} blog.
Translated from: https://www.javacodegeeks.com/2012/02/apache-mahout-getting-started.html