Commit e042183

Merge pull request #197 from JohnSnowLabs/fix-vivekn-train-with-col
Fixed Vivekn sentiment analysis when training from dataset
2 parents 44bbd0f + 547322c commit e042183

File tree

6 files changed: +122 −11 lines changed

CHANGELOG

Lines changed: 8 additions & 6 deletions

@@ -13,19 +13,21 @@ Bug fixes
 * Fixed a bug causing the library to fail when trying to save or read an annotator with an unset Feature without default
 * Added missing default Param value to SentenceDetector. Thanks @superman24-7
 * Symmetric spell checker now utilizes List instead of ListBuffer on its prediction layer
-
----------------
-Other
----------------
-* Downloader now works retroactively when a newer version finds a model of a previous release
-* Renamed folder argument to remote_loc for downloader remote location, which caused confusion. Thanks @AtulSehgal
+* Fixed Vivekn Sentiment Analysis failing when training with a sentiment column

 ---------------
 Models
 ---------------
 * Symmetric Spell Checker pretrained model now works well and may be downloaded
 * Vivekn Sentiment pretrained model now defaults to "token" input column instead of "spell"

+---------------
+Other
+---------------
+* Downloader now works retroactively when a newer version finds a model of a previous release
+* Renamed folder argument to remote_loc for downloader remote location, which caused confusion. Thanks @AtulSehgal
+* Added new Scala example in example folder, also available on website
+
 ========
 1.5.2
 ========

docs/notebooks.html

Lines changed: 33 additions & 1 deletion

@@ -79,6 +79,34 @@ <h1 class="doc-title"><span aria-hidden="true" class="icon icon_genius"></span>
 <div id="showcase" class="doc-body">
 <div class="doc-content">
 <div class="content-inner">
+<section id="ScalaNotebook" class="doc-section">
+<h2 id="scala-theme-start" class="section-title" style="margin-bottom: 10px;" >Scala notebooks</h2>
+<p>
+In this section, we present different example use cases of both training and running
+predictions with Spark NLP in Scala. Please look up our
+<a href="http://nlp.johnsnowlabs.com/components.html">Annotators</a> page for reference.
+</p>
+<div>
+<h4 id="scala-vivekn-notebook" class="section-block"> Vivekn Sentiment Analysis</h4>
+<p>
+In the following example, we walk through Sentiment Analysis training and
+prediction using Spark NLP Annotators, Light Pipelines and Spark ML Pipelines.
+</p>
+<p>
+The ViveknSentimentApproach annotator applies Vivek Narayanan's algorithm, training either from
+a column in the training dataset whose rows are labelled 'positive' or 'negative', or from a folder
+of positive texts and a folder of negative texts. Using n-grams and negation of sequences,
+this statistical model can achieve high accuracy if trained properly.
+</p>
+<p>
+In this use case we train with Spark datasets passed to fit() and transform().
+Since we are dealing with small amounts of data, we put LightPipelines into practice.
+</p>
+<p>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.3/example/src/TrainViveknSentiment.scala" target="_blank"> Take me to notebook!</a>
+</p>
+</div>
+</section>
 <section id="Notebook" class="doc-section">
 <h2 id="theme-start" class="section-title" style="margin-bottom: 10px;" >Python notebooks</h2>
 <p>

@@ -243,7 +271,11 @@ <h4 id="downloader-notebook" class="section-block"> Retrieving Pretrained models
 <ul id="doc-menu" class="nav doc-menu hidden-xs" data-spy="affix">

 <li>
-<a class="scrollto" href="#Notebook">Notebook</a>
+<a class="scrollto" href="#ScalaNotebook">Scala Notebooks</a>
+<ul class="nav doc-sub-menu">
+<li><a class="scrollto" href="#scala-vivekn-notebook">Vivekn Sentiment Analysis</a></li>
+</ul>
+<a class="scrollto" href="#Notebook">Python Notebooks</a>
 <ul class="nav doc-sub-menu">
 <li><a class="scrollto" href="#vivekn-notebook">Vivekn Sentiment Analysis</a></li>
 <li><a class="scrollto" href="#sentiment-notebook">Rule-based Sentiment Analysis</a></li>
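The "n-grams and negation of sequences" idea behind the Vivekn annotator can be sketched in plain Scala. This is a hedged illustration, not spark-nlp's implementation: the NegationSketch object, its word lists, and the negateSequence/bigrams helpers are hypothetical, modelled on Narayanan's description (the library does use a "not_" prefix internally, as seen in ViveknSentimentApproach in this commit).

```scala
// Hypothetical sketch of Vivekn-style negation handling: after a negation
// word, subsequent tokens are prefixed with "not_" until a punctuation-like
// boundary, so "good" and "not_good" become distinct features.
object NegationSketch {
  val negations = Set("not", "no", "never", "dont", "don't", "cannot")
  val boundaries = Set(".", ",", "!", "?", ";", ":")

  def negateSequence(tokens: Seq[String]): Seq[String] = {
    var negate = false
    tokens.map { t =>
      val lower = t.toLowerCase
      val out = if (negate && !boundaries.contains(lower)) "not_" + lower else lower
      if (negations.contains(lower)) negate = true
      else if (boundaries.contains(lower)) negate = false
      out
    }
  }

  // Simple bigram features over the (possibly negated) unigrams
  def bigrams(tokens: Seq[String]): Seq[String] =
    tokens.sliding(2).filter(_.size == 2).map(_.mkString(" ")).toSeq
}
```

On "I did not like it", the sketch keeps the negation word itself intact but flips everything after it, which is what lets a naive-Bayes-style counter treat negated context as separate vocabulary.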
example/src/TrainViveknSentiment.scala

Lines changed: 73 additions & 0 deletions

@@ -0,0 +1,73 @@
+import com.johnsnowlabs.nlp.annotator._
+import com.johnsnowlabs.nlp.base._
+import com.johnsnowlabs.util.Benchmark
+import org.apache.spark.ml.Pipeline
+import org.apache.spark.sql.SparkSession
+
+object TrainViveknSentiment extends App {
+
+  val spark: SparkSession = SparkSession
+    .builder()
+    .appName("test")
+    .master("local[*]")
+    .config("spark.driver.memory", "4G")
+    .config("spark.kryoserializer.buffer.max", "200M")
+    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
+    .getOrCreate()
+
+  spark.sparkContext.setLogLevel("WARN")
+
+  import spark.implicits._
+
+  val training = Seq(
+    ("I really liked this movie!", "positive"),
+    ("The cast was horrible", "negative"),
+    ("Never going to watch this again or recommend it to anyone", "negative"),
+    ("It's a waste of time", "negative"),
+    ("I loved the protagonist", "positive"),
+    ("The music was really really good", "positive")
+  ).toDS.toDF("train_text", "train_sentiment")
+
+  val testing = Array(
+    "I don't recommend this movie, it's horrible",
+    "Dont waste your time!!!"
+  )
+
+  val document = new DocumentAssembler()
+    .setInputCol("train_text")
+    .setOutputCol("document")
+
+  val token = new Tokenizer()
+    .setInputCols("document")
+    .setOutputCol("token")
+
+  val normalizer = new Normalizer()
+    .setInputCols("token")
+    .setOutputCol("normal")
+
+  val vivekn = new ViveknSentimentApproach()
+    .setInputCols("document", "normal")
+    .setOutputCol("result_sentiment")
+    .setSentimentCol("train_sentiment")
+
+  val finisher = new Finisher()
+    .setInputCols("result_sentiment")
+    .setOutputCols("final_sentiment")
+
+  val pipeline = new Pipeline().setStages(Array(document, token, normalizer, vivekn, finisher))
+
+  val sparkPipeline = pipeline.fit(training)
+
+  val lightPipeline = new LightPipeline(sparkPipeline)
+
+  Benchmark.time("Light pipeline quick annotation") { lightPipeline.annotate(testing) }
+
+  Benchmark.time("Spark pipeline, this may be too much for just two rows!") {
+    val testingDS = testing.toSeq.toDS.toDF("testing_text")
+    println("Updating DocumentAssembler input column")
+    document.setInputCol("testing_text")
+    sparkPipeline.transform(testingDS).show()
+  }
+
+}
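Benchmark.time above is spark-nlp's own utility in com.johnsnowlabs.util; a minimal sketch of what such a helper could look like, without Spark (the TimeSketch name and the exact print format are assumptions, not the library's code):

```scala
// Hypothetical timing helper in the spirit of Benchmark.time: run a block,
// print the elapsed wall-clock time under a label, and return the result
// so it can be used inline without changing the surrounding code.
object TimeSketch {
  def time[T](label: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block          // evaluate the by-name block exactly once
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(f"$label: $elapsedMs%.2f ms")
    result
  }
}
```

Because the block is a by-name parameter, callers can time arbitrary expressions, e.g. `TimeSketch.time("annotate") { lightPipeline.annotate(testing) }`, exactly as the example does with Benchmark.time.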

src/main/scala/com/johnsnowlabs/nlp/AnnotatorApproach.scala

Lines changed: 1 addition & 1 deletion

@@ -39,7 +39,7 @@ abstract class AnnotatorApproach[M <: Model[M]]

   /** requirement for pipeline transformation validation. It is called on fit() */
   override final def transformSchema(schema: StructType): StructType = {
-    require(validate(schema), s"Wrong annotators in pipeline. Make sure the following annotator types are present in inputCols: " +
+    require(validate(schema), s"Wrong annotators in pipeline for ${this.uid}. Make sure the following annotator types are present in inputCols: " +
       s"${requiredAnnotatorTypes.mkString(", ")}")
     getInputCols.foreach {
       annotationColumn =>
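The one-line change above only adds the failing stage's uid to the error message, so that in a pipeline with several annotators the exception names the stage that rejected the schema. The pattern, sketched with hypothetical names outside of Spark:

```scala
// Sketch of uid-tagged validation errors: a stage carries a unique id and
// includes it in its require() message, so a failure in a multi-stage
// pipeline immediately identifies which stage's inputCols are wrong.
class StageSketch(val uid: String, required: Seq[String]) {
  def validateColumns(present: Set[String]): Unit =
    require(required.forall(present.contains),
      s"Wrong annotators in pipeline for $uid. Make sure the following annotator types " +
      s"are present in inputCols: ${required.mkString(", ")}")
}
```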

src/main/scala/com/johnsnowlabs/nlp/annotators/sda/vivekn/ViveknSentimentApproach.scala

Lines changed: 2 additions & 0 deletions

@@ -71,6 +71,8 @@ class ViveknSentimentApproach(override val uid: String)
   import ResourceHelper.spark.implicits._
   val positiveDS = new MapAccumulator()
   val negativeDS = new MapAccumulator()
+  dataset.sparkSession.sparkContext.register(positiveDS)
+  dataset.sparkSession.sparkContext.register(negativeDS)
   val prefix = "not_"
   val tokenColumn = dataset.schema.fields
     .find(f => f.metadata.contains("annotatorType") && f.metadata.getString("annotatorType") == AnnotatorType.TOKEN)

src/main/scala/com/johnsnowlabs/util/spark/MapAccumulator.scala

Lines changed: 5 additions & 3 deletions

@@ -12,12 +12,14 @@ class MapAccumulator(defaultMap: MMap[String, Long] = MMap.empty[String, Long].w

   override def add(v: (String, Long)): Unit = mmap(v._1) += v._2

-  override def value: Map[String, Long] = mmap.toMap
+  override def value: Map[String, Long] = mmap.toMap.withDefaultValue(0)

-  override def copy(): AccumulatorV2[(String, Long), Map[String, Long]] = new MapAccumulator(MMap[String, Long](value.toSeq:_*))
+  override def copy(): AccumulatorV2[(String, Long), Map[String, Long]] =
+    new MapAccumulator(MMap[String, Long](value.toSeq:_*).withDefaultValue(0))

   override def isZero: Boolean = mmap.isEmpty

-  override def merge(other: AccumulatorV2[(String, Long), Map[String, Long]]): Unit = other.value.foreach{case (k, v) => mmap(k) += v}
+  override def merge(other: AccumulatorV2[(String, Long), Map[String, Long]]): Unit =
+    other.value.foreach{case (k, v) => mmap(k) += v}

 }
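The withDefaultValue(0) fix above matters because Scala's toMap does not carry a mutable map's default value over to the resulting immutable map: without the fix, looking up a word never seen during training would throw instead of returning a count of 0. A self-contained demonstration of that behavior (the DefaultValueSketch object is illustrative, not library code):

```scala
import scala.collection.mutable.{Map => MMap}

// Demonstrates why MapAccumulator.value needs .withDefaultValue(0):
// converting a mutable map with a default to an immutable Map via toMap
// silently drops the default, so lookups of absent keys throw.
object DefaultValueSketch {
  def demo(): (Boolean, Long) = {
    val counts: MMap[String, Long] = MMap.empty[String, Long].withDefaultValue(0L)
    counts("seen") += 3  // default of 0 lets += work on a fresh key

    val withoutDefault = counts.toMap                    // default NOT carried over
    val withDefault = counts.toMap.withDefaultValue(0L)  // the fix in this commit

    val throws = try { withoutDefault("unseen"); false }
                 catch { case _: NoSuchElementException => true }
    (throws, withDefault("unseen"))
  }
}
```

This is also why copy() re-applies withDefaultValue(0): the cloned accumulator must keep behaving like the original on unseen keys.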
