Wednesday, March 16, 2022

Train Stanford CoreNLP about the sentiment of domain-specific phrases

Make sure your Java application knows that a “five-star” review is positive, not neutral, when it performs sentiment analysis.

If you want to determine whether text—such as a social media post or customer review—is positive or negative, you can perform sentiment analysis from a Java application using a library such as Stanford CoreNLP. It’s a powerful tool for understanding general-purpose text. However, what if your information domain uses specialized vocabulary or phrases?

The first article in this three-part series, “Perform textual sentiment analysis in Java using a deep learning model,” showed how to perform sentiment analysis on single sentences. The second article, “Sentiment analysis in Java: Analyzing multi-sentence text blocks,” talked about computing a composite sentiment score for multisentence text blocks. This final article retrains the existing general-purpose model used in the Stanford CoreNLP sentiment tool with your own data to understand domain-specific phrases.

Domain-specific phrases

The meaning (and the sentiment) of certain words or phrases can be specific to a certain domain and, therefore, can be correctly identified only when using an appropriate domain-specific model. For example, consider the following statement: “I give it five stars.”

This sentence would be considered neutral when evaluated by a general-purpose natural language processing (NLP) model. However, you and I know that the sentence should be considered positive or very positive in the context of product review analysis, as five stars is the highest customer rating a product can receive on many retail websites. Thus, if this phrase (or, alternatively, “I give it zero stars” or “I give it a thumbs-down”) appears in your business use case or domain, you need to train the model to learn that meaning.

How does Stanford CoreNLP’s sentiment classifier work? As explained in the first article of this series, a major strength of the library is that it can capture the compositional effects of sentiment: the underlying model scores a sentence as a whole, built up from its phrases, rather than scoring each word in isolation.

Consider the following sentence: “Not five stars.”

Here, the phrase five stars can be considered positive in the context of product review analysis, while the sentence as a whole is negative. For the sentiment classifier to learn this, the training sample must label the full sentence and its constituent phrases separately, as follows:

1 Not five stars.
2 five
2 stars
3 five stars
2 .

On the first line, you have the entire sentence labeled with 1 (negative), because of the word not. The other lines contain the tokens and phrases derived from the sentence, labeled mostly with 2 (neutral). The only exception is the phrase five stars, which is labeled with 3 (positive). Remember, on this scale, sentiment ranges from 0 (very negative) to 4 (very positive).

The above example reveals what allows the model to capture the compositional effects of sentiment rather than merely evaluating words in isolation. When given this sample in particular, the model learns that the phrase five stars conveys positive sentiment, while the words five and stars appearing separately should be identified as neutral.
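To make the labeling scheme concrete, here is a minimal sketch that splits one labeled line into its label and phrase and maps the label to a class name on the 0-to-4 scale. The LabeledPhrase class and its methods are hypothetical helpers written for this illustration; they are not part of CoreNLP, and the class names in the list are assumed to match the ones the library prints.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of CoreNLP): splits a labeled line such as
// "3 five stars" into its numeric label and its phrase.
final class LabeledPhrase {
    // Class names for labels 0..4, in order (assumed to match CoreNLP's output).
    private static final List<String> NAMES = Arrays.asList(
        "Very negative", "Negative", "Neutral", "Positive", "Very positive");

    final int label;
    final String phrase;

    private LabeledPhrase(int label, String phrase) {
        this.label = label;
        this.phrase = phrase;
    }

    // Parses "<label> <phrase>" and rejects labels outside 0..4.
    static LabeledPhrase parse(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Expected '<label> <phrase>': " + line);
        }
        int label = Integer.parseInt(parts[0]);
        if (label < 0 || label >= NAMES.size()) {
            throw new IllegalArgumentException("Label must be 0-4: " + line);
        }
        return new LabeledPhrase(label, parts[1]);
    }

    String labelName() {
        return NAMES.get(label);
    }

    public static void main(String[] args) {
        // The sample from this section, one labeled phrase per line.
        for (String line : Arrays.asList(
                "1 Not five stars.", "2 five", "2 stars", "3 five stars", "2 .")) {
            LabeledPhrase p = parse(line);
            System.out.println(p.labelName() + " (" + p.label + "): " + p.phrase);
        }
    }
}
```

Running main lists each phrase next to its class name, which makes it easy to see that only the phrase five stars carries a positive label while the full sentence is negative.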

Preparing the training data

To make the above structure ready for use in the model training process, first turn it into a binarized tree. This can be done with the following call, where the sample.txt file contains the sentence along with the phrases and individual tokens derived from it and labeled as shown previously:

$ java -mx5g edu.stanford.nlp.sentiment.BuildBinarizedDataset -input sample.txt

As a result, the binarized tree generated for this sample will look as follows:

(1 (1 (1 Not) (3 (2 five) (2 stars))) (2 .))
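If the nested parentheses are hard to read, the following sketch walks such a tree and lists every node with its label and the words it covers. It is a hypothetical reader written for this illustration, not CoreNLP’s own tree parser, and it assumes well-formed, single-line input in the format shown above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for trees such as
// "(1 (1 (1 Not) (3 (2 five) (2 stars))) (2 .))".
// It only illustrates how the labels nest; it assumes well-formed input.
final class BinarizedTreeDemo {
    private final String text;
    private int pos = 0;
    // One "label<TAB>words" entry per node, in pre-order.
    private final List<String> nodes = new ArrayList<>();

    private BinarizedTreeDemo(String text) {
        this.text = text;
    }

    // Returns one "label\twords" entry per tree node, root first.
    static List<String> labeledSpans(String tree) {
        BinarizedTreeDemo reader = new BinarizedTreeDemo(tree.trim());
        reader.parseNode();
        return reader.nodes;
    }

    // Parses "(label child child)" or "(label word)" and returns the covered words.
    private String parseNode() {
        expect('(');
        int start = pos;
        while (Character.isDigit(text.charAt(pos))) pos++;
        String label = text.substring(start, pos);
        int slot = nodes.size();
        nodes.add(null); // reserve this node's pre-order position
        StringBuilder words = new StringBuilder();
        skipSpaces();
        while (text.charAt(pos) != ')') {
            String part;
            if (text.charAt(pos) == '(') {
                part = parseNode(); // nested phrase
            } else {
                int wordStart = pos; // leaf token
                while (text.charAt(pos) != ')' && text.charAt(pos) != ' ') pos++;
                part = text.substring(wordStart, pos);
            }
            if (words.length() > 0) words.append(' ');
            words.append(part);
            skipSpaces();
        }
        expect(')');
        nodes.set(slot, label + "\t" + words);
        return words.toString();
    }

    private void expect(char c) {
        if (pos >= text.length() || text.charAt(pos) != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at position " + pos);
        }
        pos++;
    }

    private void skipSpaces() {
        while (pos < text.length() && text.charAt(pos) == ' ') pos++;
    }

    public static void main(String[] args) {
        String tree = "(1 (1 (1 Not) (3 (2 five) (2 stars))) (2 .))";
        for (String node : labeledSpans(tree)) {
            System.out.println(node);
        }
    }
}
```

For the tree above, the listing contains seven nodes. The node covering five stars carries label 3, while the nodes that contain Not carry label 1, which is the compositional effect described earlier.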

Of course, you’re not limited to a single sentence per run of BuildBinarizedDataset. The accompanying samples.txt file contains a few more, and you can pass in an arbitrary number of sentences. Here’s what the samples.txt file contains.

1 Not five stars.
2 five
2 stars
3 five stars
2 .

3 I like it.
2 I
2 it
3 I like
3 like it
2 .

3 I give it five stars.
2 I
2 give
2 it
2 I give
2 give it
2 five
2 stars
3 five stars
2 .

Feel free to add your own sentences to samples.txt, separating each sentence block from the next with a blank line. Start each block with the full sentence, and then split it into the phrases that carry your domain-specific terminology, putting one phrase per line and labeling its sentiment as negative with a 1, neutral with a 2, or positive with a 3. (The full scale also includes 0 for very negative and 4 for very positive.) All the other phrases derivable from a sentence, but which are not labeled, will take on the main sentence’s sentiment label when converted into a binarized tree by BuildBinarizedDataset.
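Before converting a hand-edited samples file, it can help to catch obvious slips such as a missing label or a phrase that does not occur in its sentence. The following is a minimal sketch of such a check. SamplesCheck is a hypothetical helper, not part of CoreNLP, and it assumes that every labeled phrase should appear verbatim in the first line of its block; the comparison is a plain substring test, so it is looser than real token matching.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical pre-flight check for a samples file; not part of CoreNLP.
final class SamplesCheck {
    // Returns human-readable problems; the list is empty if none were found.
    static List<String> check(List<String> lines) {
        List<String> problems = new ArrayList<>();
        String sentence = null; // first line of the current blank-line-separated block
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {
                sentence = null; // a blank line ends the current block
                continue;
            }
            String[] parts = line.split("\\s+", 2);
            if (parts.length < 2 || !parts[0].matches("[0-4]")) {
                problems.add("Line " + (i + 1) + ": expected '<0-4> <text>'");
                continue;
            }
            if (sentence == null) {
                sentence = parts[1]; // the block's full sentence
            } else if (!sentence.contains(parts[1])) {
                // Naive substring test, assumed good enough for a quick check.
                problems.add("Line " + (i + 1) + ": phrase '" + parts[1]
                    + "' not found in sentence '" + sentence + "'");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "1 Not five stars.", "2 five", "2 stars", "3 five stars", "2 .",
            "",
            "3 I like it.", "2 I", "2 it", "3 I like", "3 like it", "2 .");
        List<String> problems = check(sample);
        if (problems.isEmpty()) {
            System.out.println("No problems found.");
        } else {
            for (String problem : problems) {
                System.out.println(problem);
            }
        }
    }
}
```

To check a real file, you could pass the result of Files.readAllLines(Paths.get("samples.txt")) to check() instead of the built-in sample.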

With the following command, convert the labeled sentences found in samples.txt into binarized trees, saving the results to the binarized_trees.txt file for further use:

$ java -mx5g edu.stanford.nlp.sentiment.BuildBinarizedDataset -input samples.txt > binarized_trees.txt

As you no doubt have realized, the most time-consuming part when preparing data for conversion into binarized trees might be annotating your data (each phrase in a sample sentence) with a sentiment label.

If you don’t want to do this task manually to prepare an entire data set for retraining the model, you can look for an existing general-purpose data set and add your own samples to it. For example, you can use and modify the data set downloadable from the Stanford CoreNLP Sentiment Analysis page, as follows:

1. Click the Train,Dev,Test Splits in PTB Tree Format link found in the Dataset Downloads sidebar on the right side of that web page to download the trainDevTestTrees_PTB.zip file. (I am not providing the direct link here because it might change.)

2. Unpack the .zip file into a local folder on your machine.

3. Open the train.txt file and append to it the trees you generated and saved to binarized_trees.txt previously.
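If you would rather script step 3 than edit train.txt by hand, the following sketch appends the generated trees with the standard java.nio.file API. AppendTrees is a hypothetical helper; it assumes both files are UTF-8 text with one tree per line, as in the binarized tree shown earlier.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: appends the trees in one file to a training file.
final class AppendTrees {
    // Appends every non-blank line of 'extra' to 'train'; returns how many were added.
    static int append(Path train, Path extra) throws IOException {
        List<String> trees = new ArrayList<>();
        for (String line : Files.readAllLines(extra)) {
            if (!line.trim().isEmpty()) {
                trees.add(line.trim());
            }
        }
        // If the training file does not end with a newline, start with an empty
        // line so the first appended tree is not glued to the last existing one.
        byte[] existing = Files.exists(train) ? Files.readAllBytes(train) : new byte[0];
        boolean needsNewline = existing.length > 0 && existing[existing.length - 1] != '\n';
        List<String> toWrite = new ArrayList<>();
        if (needsNewline) {
            toWrite.add("");
        }
        toWrite.addAll(trees);
        Files.write(train, toWrite, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return trees.size();
    }

    public static void main(String[] args) throws IOException {
        Path train = Paths.get(args.length > 0 ? args[0] : "train.txt");
        Path extra = Paths.get(args.length > 1 ? args[1] : "binarized_trees.txt");
        if (!Files.exists(extra)) {
            System.out.println("Nothing to append: " + extra + " not found.");
            return;
        }
        System.out.println("Appended " + append(train, extra) + " trees to " + train);
    }
}
```

Keep a backup of the original train.txt, because running the helper twice appends the same trees twice.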

Retraining the sentiment model with your data

You’re now ready to start the training process. It can take several hours, depending on your hardware’s capabilities and the training options you select. To familiarize yourself with the available options, you might look at the following source files in the Stanford CoreNLP GitHub repository: RNNOptions.java and RNNTrainOptions.java. (In this context, RNN stands for recursive neural network, the tree-structured model behind the sentiment tool.)

One way to shorten the training time is to reduce the number of epochs. (An epoch is a complete pass through the training set.) By default, this parameter is set to 400 epochs. However, if your goal is just to train a model for illustration purposes or to see what happens after a tweak to the training data set, you can set it to, say, 100, which cuts the training time significantly, probably at the expense of some accuracy.

The following command starts the training process:

$ java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -epochs 100 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz

Once the training process is completed, you will find a new model in the model.ser.gz file.

Testing your custom model

Now that you have the custom model, how can you make your program load it for use? You can do this with a single line of code you insert into the init() method of the nlpPipeline class introduced in the first article in this series. The line to be inserted is in the following listing:

public static void init()
{
    Properties props = new Properties();
    // Point the sentiment annotator at the retrained model instead of the default one.
    props.put("sentiment.model", "model.ser.gz");
    props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
    pipeline = new StanfordCoreNLP(props);
}

To test the newly created model, use a sentence with a phrase shown to the model during the training process. For example, it could be the following sentence: “I have posted five stars.” For that, edit the SentenceSentiment class, which was also introduced in the first article, as follows:

public class SentenceSentiment
{
  public static void main(String[] args)
  {
    String text = "I have posted five stars.";
    nlpPipeline.init();
    nlpPipeline.estimatingSentiment(text);
  }
}

Recompile nlpPipeline and SentenceSentiment, and then run the latter as follows:

$ javac nlpPipeline.java

$ javac SentenceSentiment.java

$ java SentenceSentiment

The new sentiment model used during this call should identify the sample sentence as follows:

Positive 3 I have posted five stars.

If you revert to the default model by removing the props.put(..) line from the init() method of the nlpPipeline class and recompiling it, SentenceSentiment will then identify the same sample as follows:

Neutral 2 I have posted five stars.

Source: oracle.com
