Get Started with Machine Learning in Java Using Spark NLP

澹台鸿光
2023-12-01

It is increasingly common for software developers to require Machine Learning technology in their applications. While Python is the de facto standard environment for Machine Learning, this might not be an ideal fit when building web applications or enterprise software. Learn below how to train and run Machine Learning models in Java, using the Spark NLP open source library.

At Parito, we are building AI-driven applications that apply machine learning algorithms to text data. For the back-end we are using Java and Spring Boot, which provides a robust and secure framework for building REST web services. While looking at options for the Machine Learning component, we came across Spark NLP, an open source library for Natural Language Processing based around the Machine Learning library (MLlib) in Apache Spark.

The claim was that this would be easy to use in Java, as it’s Scala-based. The reality is that it took a bit of fiddling to get things going, so we thought it would be worth sharing our findings, as good Java examples seem to be a bit scarce.

Note that this is purely a technical article on getting Spark NLP running within a Java/Spring Boot REST service. In further articles I will explain more about what Spark NLP has to offer functionally, and evaluate the effectiveness of integrating it in this fashion.

For a fully working example, check here: https://github.com/willprice76/sparknlp-examples/tree/develop/sparknlp-java-example

Basic setup — Dependencies

Firstly, it’s important to understand the underlying dependencies of Spark NLP. At the time of writing, the latest version was built on Apache Spark 2.4.4, and uses Scala 2.11 (the Scala version is included in the name of the artifact).

The most important consequence of this is that you can only run it on Java 8. If this is an issue, you are just going to have to wait for the guys at Spark NLP to work out Spark 3.0/Scala 2.12 support.

I will show how to get a REST API working with Spring Boot Web, so you need to add the following dependencies:

  • Spring Boot
  • Spark MLlib
  • Spark NLP
  • Logback

You will end up with a basic pom.xml something like this:

To avoid Slf4j conflicts you will need to exclude log4j from the spark MLlib dependency (or you can just use log4j if you prefer).

Now we can add the Spring Boot application class:

…and a controller with a test hello world method:

Now compile and run your application. You should be able to test the controller using a GET request to the URL http://localhost:8080/hello and get a response with the text “Hello world”.

Initialize Spark NLP and download a pre-trained pipeline

Now we are going to check that we can run Spark NLP. Update your controller as follows:

All we are doing here is initializing Spark NLP by starting a Spark session in the constructor, and downloading a pre-trained pipeline for sentiment analysis (more explanation to follow).

Now compile and run the application to check everything is working OK.

If you get an exception like this:

…Constructor threw exception; nested exception is java.lang.IllegalArgumentException: Unsupported class file major version 55

…you are not running Java 8 — check your IDE run/debug configuration, as this can be different to what’s specified in your project pom.

Pipelines are simply a number of processing stages which transform the data from the previous stage. Spark NLP caches the downloaded pipeline in the cached_pretrained folder in your home directory.

If you open up this folder you will see the sentiment analysis pipeline in a folder named something like analyze_sentiment_en_2.4.0_2.4_1580483464667. Within it, the pipeline stages appear as numbered subfolders inside the stages folder. In this case we have 5 stages:

  1. 0_document_b52f0e78aa1c
  2. 1_SENTENCE_199413a74375
  3. 2_REGEX_TOKENIZER_b6ca4d4dc972
  4. 3_SPELL_e4ea67180337
  5. 4_VIVEKN_8ac4c5e76848

I won’t go into details, but the first four stages prepare the text data by breaking it into sentences, tokenizing it, and correcting spelling mistakes. The final stage is a machine learning model which can infer the sentiment from the processed text data. It’s important to understand a bit about pipelines if you are going to work with Spark NLP, especially when you train your own models (coming later in this article).

Generate some insights

We are now ready to start using machine learning to generate insights from data. This is sometimes also known as scoring or annotating.

We are going to use the pre-trained model for sentiment analysis which we downloaded in the previous step. This means you can provide text data as input, and the model will infer whether the sentiment in the text is positive or negative. In a commercial context this could be useful if you are trying to automatically gauge overall customer satisfaction based on large volumes of data from, for example, product reviews, customer support incidents, or social media postings.

Add a score method to your controller as follows:

Here we simply use the annotate method on the pipeline we downloaded to infer the sentiments for an array of input strings. Getting the sentiment results out is a bit of a fiddle, mostly due to conversion between Scala and Java objects. I am new to this, so if you have a neater way please let me know!

Run the application and use curl, Postman, or some other tool to POST some data to your controller to try it out.

This gives a response corresponding to the two input sentences:

[“positive”, “negative”]

Train your own model

So far so good, but what happens if the pre-trained model doesn’t infer the insights you expect? This is often the case; pre-trained models are trained on data from a different context or domain than the one where you want to apply them, which gives them a particular bias. You probably want to train your own model, on your own data.

In order to do this, you will need to create your own pipeline, and you will need some data with known outcomes (in our case pre-labelled with sentiment; positive or negative).

We create a training pipeline using the same type of model as in the pre-trained pipeline. It’s not exactly the same pipeline: for simplicity, we skip the stages that break the text into sentences and do spell checking. Add the following method to your controller:

private Pipeline getSentimentTrainingPipeline() {
    // Stage 1: wrap the raw "text" column in a document annotation
    DocumentAssembler document = new DocumentAssembler();
    document.setInputCol("text");
    document.setOutputCol("document");

    // Stage 2: split each document into tokens
    String[] tokenizerInputCols = {"document"};
    Tokenizer tokenizer = new Tokenizer();
    tokenizer.setInputCols(tokenizerInputCols);
    tokenizer.setOutputCol("token");

    // Stage 3: the trainable Vivekn sentiment model; "label" is the column
    // holding the known sentiment in the training data
    String[] sentimentInputCols = {"document", "token"};
    ViveknSentimentApproach sentimentApproach = new ViveknSentimentApproach();
    sentimentApproach.setInputCols(sentimentInputCols);
    sentimentApproach.setOutputCol("sentiment");
    sentimentApproach.setSentimentCol("label");
    sentimentApproach.setCorpusPrune(0);

    // Chain the stages; each stage's output column feeds the next stage's input
    Pipeline pipeline = new Pipeline();
    pipeline.setStages(new PipelineStage[]{document, tokenizer, sentimentApproach});
    return pipeline;
}

This simply defines the minimum pipeline we need to train this particular sentiment analysis model, ensuring that the name of the output column of each stage matches the input columns for subsequent stages.

Now add a new class to represent the input data (text + sentiment):

package org.example.sparknlp;

// Simple POJO for labelled training examples; Spark's bean-based
// createDataFrame (and Jackson, for the request body) rely on the getters and setters
public class TextData {
    private String text;
    private String label;

    public TextData(String text, String label) {
        this.text = text;
        this.label = label;
    }

    public String getText() {
        return text;
    }

    public void setText(String text) {
        this.text = text;
    }

    public String getLabel() {
        return label;
    }

    public void setLabel(String label) {
        this.label = label;
    }
}

And create an endpoint to train the pipeline with a list of TextData elements:

@PostMapping("/sentiment/train")
public String train(@RequestBody List<TextData> data) throws IOException {
    Instant start = Instant.now();
    Dataset<Row> input = spark.createDataFrame(data, TextData.class);
    LOG.debug("Running training with {} rows of text data", data.size());
    Pipeline pipeline = getSentimentTrainingPipeline();
    PipelineModel newPipelineModel = pipeline.fit(input);
    long trainingTime = Duration.between(start, Instant.now()).toMillis();
    //Overwrite the existing scoring pipeline
    scoringPipeline = new LightPipeline(newPipelineModel, false);
    return String.format("Training completed in %s milliseconds", trainingTime);
}

The Pipeline.fit() method is where the training happens, returning a PipelineModel which can be used for scoring. We convert this into a LightPipeline, as this is a more efficient way to score when working with smaller datasets on a single machine.

Finally we need to update the pom.xml with some additional dependencies, which help us work with data using Spark.

<dependency>
	<groupId>org.apache.spark</groupId>
	<artifactId>spark-sql_2.11</artifactId>
	<version>${spark.version}</version>
	<exclusions>
		<exclusion>
			<groupId>org.slf4j</groupId>
			<artifactId>slf4j-log4j12</artifactId>
		</exclusion>
	</exclusions>
</dependency>
<!-- spark SQL needs older version of janino -->
<dependency>
	<groupId>org.codehaus.janino</groupId>
	<artifactId>commons-compiler</artifactId>
	<version>3.0.8</version>
</dependency>
<dependency>
	<groupId>org.codehaus.janino</groupId>
	<artifactId>janino</artifactId>
	<version>3.0.8</version>
</dependency>

Normally you would need a significant amount of labelled data (hundreds or thousands of rows) to train a model properly, but just to check that things are working, we can do a simple POST request to this new endpoint:

curl -X POST -H "Content-Type: application/json" \
  --data "[{\"text\": \"I love Spark NLP\", \"label\": \"positive\"}, {\"text\": \"I hate using Scala objects in Java\", \"label\": \"negative\"}]" \
  http://localhost:8080/sentiment/train

And then test the (overwritten) scoring pipeline with a request to the /sentiment/score endpoint.

It’s important to understand that the Spark pipeline concept relies on a full data set when you do retraining, not increments. We have now overwritten the pre-trained pipeline with a very badly trained pipeline (only 2 rows of training data), so this example is not particularly useful except to walk through how to code up the training process.

If you want some decent sized datasets for sentiment analysis there are plenty out there, but for the best quality results use data you have curated from the domain where you want to apply machine learning.

Takeaway

There’s a lot more to Spark NLP, including training and using deep learning models, which I plan to share in future articles. For now, though, you should be able to integrate the Spark NLP library into your own Java application, create and train simple Machine Learning pipelines, and use them to derive insights from text data.

It’s great that a Java application developer can get started with NLP-based Machine Learning without too much difficulty, but if you are considering using Spark NLP, it’s important to evaluate the impact of being tied to Java 8 before you seriously consider using it for a production project.

As we added more functionality to our application we came across more and more dependency and compatibility issues and had to downgrade Guava and Jackson versions among other libraries.

It’s also a pain to mangle Scala objects in Java and finally, remember Spark is designed for heavy loads and distributed computing — it’s not really intended to train or score large datasets within a single lightweight microservice.

For these reasons it might be best to plan to isolate the training part of your application in a separate Scala-based service, with access to a Spark cluster for the heavy lifting.

Many thanks to the Spark NLP team at John Snow Labs for making their work open source and easy to integrate. If you want to know more, check the website, documentation, and the repo of pretrained models and pipelines. They also operate a Slack channel: spark-nlp.slack.com

Translated from: https://medium.com/parito-labs-blog/get-started-with-machine-learning-in-java-using-spark-nlp-9eb8ef2ea2ce
