Commit e78c00c

Merge pull request #242 from JohnSnowLabs/160-release-candidate
Release Candidate 1.6.0
2 parents 2b1dd2f + 3b8e2e4 commit e78c00c
File tree

14 files changed, +118 -55 lines changed
CHANGELOG

Lines changed: 55 additions & 0 deletions
@@ -1,3 +1,58 @@
+========
+1.6.0
+========
+---------------
+Overview
+---------------
+We're late! But it was worth it. We're glad to release 1.6.0, which brings new features, lots of enhancements, and many bugfixes. First of all, we are thankful to the community for participating on Slack and GitHub by reporting feedback and issues.
+In this release, we have a new annotator, the Chunker, which allows grabbing pieces of text that follow a particular Part-of-Speech pattern.
+On the other hand, we have a brand-new OCR to Spark DataFrame utility, which is bundled as an optional component of Spark-NLP. It requires Tesseract 4.x+ to be installed on your system, and may be downloaded from our website or readme pages.
+Aside from that, we improved many areas, from the DocumentAssembler working better with OCR output, down to our Deep Learning models gaining better consistency and accuracy. Word Embedding based annotators also received improvements for working in cluster environments.
+Finally, we are glad a user contributed a fix to the AWS dependency issue, which happens particularly in Cloudera environments. We're still waiting for feedback, and gladly accept it.
+We'll be working on the documentation as this release follows. Thank you.
+
+---------------
+New Features
+---------------
+* New annotator: Chunker. This annotator takes a regex over Part-of-Speech tags and returns the chunks of text that follow such patterns
+* OCR to Spark-NLP: As an optional jar module, users may use the OcrHelper class to convert PDF files into a Spark Dataset, ready to be consumed by Spark-NLP's DocumentAssembler. May be used without Spark-NLP. Requires Tesseract 4.x on your system.
+
+---------------
+Enhancements
+---------------
+* TextMatcher now has a caseSensitive (setCaseSensitive) Param, which allows matching with or without case sensitivity (ignored if the Normalizer already handled it). The returned word is still the original.
+* LightPipelines in Python should now be faster thanks to an optimization that prefetches results into Python memory instead of going through the py4j bridge
+* LightPipelines can now handle embedded Pipelines
+* PerceptronApproach now trains using a fully Spark-distributed algorithm. Still experimental. PerceptronApproachLegacy may still be used, which might be better for local, non-cluster setups.
+* Tokenizer now has a param 'includeDefaults' which may be set to False to disable all preset rules.
+* WordEmbedding based annotators may now normalize tokens before matching embedding vectors through the 'useNormalizedTokensForEmbeddings' Param. Generally improves consistency and reduces overfitting.
+* DocumentAssembler may now better deal with large amounts of text by using 'trimAndClearNewLines' to work better with OCR output and be better prepared for further Sentence Detection
+* Improved SentenceDetector handling of enumerations and lists
+* Slightly improved SentenceDetector performance through non-tail-recursive optimizations
+* Finisher no longer has default delimiters when outputting into String (not Array) (thanks @S_L)
+
+---------------
+Bug fixes
+---------------
+* AWS library dependency conflict now resolved (Thanks to @apiltamang for proposing the solution, and thanks to the community for the follow-up). The solution is experimental; we are waiting for feedback.
+* Fixed wrong order of additionally added Tokenizer infixPatterns in Python (Thanks @sethah)
+* Training annotators that use Word Embeddings in a distributed cluster no longer throws file-not-found exceptions sporadically
+* Fixed NerDLModel returning non-deterministic results during prediction
+* Deep-Learning based models and graphs may now run on CPU if trained on GPU and no GPU is available on the client
+* WordEmbeddings temporary location is no longer in the HOME dir; it moved to tmp.dir
+* Fixed SentenceDetector incorrectly bounding sentences with non-English characters (Thanks @lorenz-nlp)
+* Python Spark-NLP annotator models should now have all appropriate setter and getter functions for Params
+* Fixed wrong format of the column when showing metadata through Finisher's output as Array
+* Added missing Python Finisher include metadata function (thanks @PinusSilvestris for reporting the bug)
+* Fixed Symmetric Delete Spell Checker throwing the wrong error when training with an empty dataset (Thanks @ankush)
+
+---------------
+Developer API
+---------------
+* Deep Learning models may now be read through the SavedModelBundle API into TensorFlow for Java in TensorflowWrapper
+* WordEmbeddings now allow checking whether a word exists with contains()
+* Included a tool that converts text into CoNLL format for further labeling for training NER models (
 ========
 1.5.4
 ========
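The new Chunker annotator described above matches a regular expression over a sentence's Part-of-Speech tags and returns the tokens the match covers. As a rough illustration only (this is not Spark-NLP's implementation; the `<TAG>` encoding, the `chunk` helper, and the sample tags are assumptions of this sketch), the idea can be expressed in plain Python:

```python
import re

def chunk(tokens, tags, pattern):
    """Return token chunks whose POS tag sequence matches `pattern`.

    Tags are joined as '<TAG>' markers so a regex such as '<DT>(<JJ>)*<NN>'
    can describe a chunk purely in terms of tag positions.
    """
    tagged = "".join(f"<{t}>" for t in tags)
    chunks = []
    for m in re.finditer(pattern, tagged):
        # Map the character span of the match back to tag (= token) indices.
        start = tagged[:m.start()].count("<")
        end = start + m.group(0).count("<")
        chunks.append(" ".join(tokens[start:end]))
    return chunks

tokens = ["the", "quick", "brown", "fox", "jumps"]
tags = ["DT", "JJ", "JJ", "NN", "VBZ"]
# Determiner + any number of adjectives + noun:
print(chunk(tokens, tags, r"<DT>(<JJ>)*<NN>"))  # → ['the quick brown fox']
```

In Spark-NLP itself the pattern is supplied to the Chunker annotator and matching runs over POS annotations inside a Pipeline; this standalone sketch only mirrors the matching idea.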

README.md

Lines changed: 16 additions & 14 deletions
@@ -14,18 +14,18 @@ Questions? Feedback? Request access sending an email to [email protected]
 
 This library has been uploaded to the spark-packages repository https://spark-packages.org/package/JohnSnowLabs/spark-nlp .
 
-To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.5.4` to your spark command
+To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.6.0` to your spark command
 
 ```sh
-spark-shell --packages JohnSnowLabs:spark-nlp:1.5.4
+spark-shell --packages JohnSnowLabs:spark-nlp:1.6.0
 ```
 
 ```sh
-pyspark --packages JohnSnowLabs:spark-nlp:1.5.4
+pyspark --packages JohnSnowLabs:spark-nlp:1.6.0
 ```
 
 ```sh
-spark-submit --packages JohnSnowLabs:spark-nlp:1.5.4
+spark-submit --packages JohnSnowLabs:spark-nlp:1.6.0
 ```
 
 ## Jupyter Notebook
@@ -35,23 +35,23 @@ export SPARK_HOME=/path/to/your/spark/folder
 export PYSPARK_DRIVER_PYTHON=jupyter
 export PYSPARK_DRIVER_PYTHON_OPTS=notebook
 
-pyspark --packages JohnSnowLabs:spark-nlp:1.5.4
+pyspark --packages JohnSnowLabs:spark-nlp:1.6.0
 ```
 
 ## Apache Zeppelin
 This way will work for both Scala and Python
 ```
-export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.5.4"
+export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:1.6.0"
 ```
 Alternatively, add the following Maven Coordinates to the interpreter's library list
 ```
-com.johnsnowlabs.nlp:spark-nlp_2.11:1.5.4
+com.johnsnowlabs.nlp:spark-nlp_2.11:1.6.0
 ```
 
 ## Python without explicit Spark installation
 If you installed pyspark through pip, you can now install sparknlp through pip
 ```
-pip install --index-url https://test.pypi.org/simple/ spark-nlp==1.5.4
+pip install --index-url https://test.pypi.org/simple/ spark-nlp==1.6.0
 ```
 Then you'll have to create a SparkSession manually, for example:
 ```
@@ -65,11 +65,13 @@ spark = SparkSession.builder \
     .getOrCreate()
 ```
 
-## Pre-compiled Spark-NLP
+## Pre-compiled Spark-NLP and Spark-NLP-OCR
 You may download the fat-jar from here:
-[Spark-NLP 1.5.4 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-assembly-1.5.4.jar)
+[Spark-NLP 1.6.0 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-assembly-1.6.0.jar)
 or the non-fat jar from here:
-[Spark-NLP 1.5.4 PKG JAR](http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.5.4/spark-nlp_2.11-1.5.4.jar)
+[Spark-NLP 1.6.0 PKG JAR](http://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/1.6.0/spark-nlp_2.11-1.6.0.jar)
+Spark-NLP-OCR Module (requires native Tesseract 4.x+ for image-based OCR; it does not require Spark-NLP to work, but using them together is highly suggested):
+[Spark-NLP-OCR 1.6.0 FAT-JAR](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/spark-nlp-ocr-assembly-1.6.0.jar)
 
 ## Maven central
 
@@ -81,19 +83,19 @@ Our package is deployed to maven central. In order to add this package as a depe
 <dependency>
   <groupId>com.johnsnowlabs.nlp</groupId>
   <artifactId>spark-nlp_2.11</artifactId>
-  <version>1.5.4</version>
+  <version>1.6.0</version>
 </dependency>
 ```
 
 #### SBT
 ```sbtshell
-libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.5.4"
+libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.6.0"
 ```
 
 If you are using `scala 2.11`
 
 ```sbtshell
-libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.5.4"
+libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.6.0"
 ```
 
 ## Using the jar manually
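For the pip route above, the README notes that a SparkSession must be created manually. A minimal configuration sketch follows (not taken from this commit: the app name, master, and memory settings are placeholder assumptions; only the package coordinate matches the `--packages` examples above):

```python
from pyspark.sql import SparkSession

# Configuration sketch: resolve spark-nlp at session start-up using the
# same coordinate as the `--packages` command-line examples.
spark = SparkSession.builder \
    .appName("spark-nlp-starter") \
    .master("local[*]") \
    .config("spark.driver.memory", "4g") \
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:1.6.0") \
    .getOrCreate()
```

This is a config fragment rather than a runnable test: it needs a local Spark installation and network access to resolve the package at start-up.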

build.sbt

Lines changed: 15 additions & 5 deletions
@@ -9,7 +9,7 @@ name := "spark-nlp"
 
 organization := "com.johnsnowlabs.nlp"
 
-version := "1.5.4"
+version := "1.6.0"
 
 scalaVersion in ThisBuild := scalaVer
 
@@ -94,9 +94,12 @@ lazy val testDependencies = Seq(
 lazy val utilDependencies = Seq(
   "com.typesafe" % "config" % "1.3.0",
   "org.rocksdb" % "rocksdbjni" % "5.1.4",
-  "org.slf4j" % "slf4j-api" % "1.7.25",
-  "org.apache.commons" % "commons-compress" % "1.15",
-  "com.amazonaws" % "aws-java-sdk" % "1.7.4",
+  "com.amazonaws" % "aws-java-sdk" % "1.7.4"
+    exclude("com.fasterxml.jackson.core", "jackson-core")
+    exclude("com.fasterxml.jackson.core", "jackson-annotations")
+    exclude("com.fasterxml.jackson.core", "jackson-databind")
+    exclude("com.fasterxml.jackson.dataformat", "jackson-dataformat-smile")
+    exclude("com.fasterxml.jackson.datatype", "jackson-datatype-joda"),
   "org.tensorflow" % "tensorflow" % "1.8.0"
   /** Enable the following for tensorflow GPU support */
   //"org.tensorflow" % "libtensorflow" % "1.8.0",
@@ -124,10 +127,17 @@ val ocrMergeRules: String => MergeStrategy = {
   case _ => MergeStrategy.deduplicate
 }
 
+assemblyMergeStrategy in assembly := {
+  case PathList("com.fasterxml.jackson") => MergeStrategy.first
+  case x =>
+    val oldStrategy = (assemblyMergeStrategy in assembly).value
+    oldStrategy(x)
+}
+
 lazy val ocr = (project in file("ocr"))
   .settings(
     name := "spark-nlp-ocr",
-    version := "1.5.4",
+    version := "1.6.0",
     libraryDependencies ++= ocrDependencies ++
       analyticsDependencies ++
       testDependencies,

docs/index.html

Lines changed: 2 additions & 2 deletions
@@ -78,8 +78,8 @@ <h2 class="title">High Performance NLP with Apache Spark </h2>
 </p>
 <a class="btn btn-info btn-cta" style="float: center;margin-top: 10px;" href="mailto:[email protected]?subject=SparkNLP%20Slack%20access" target="_blank"> Questions? Join our Slack</a>
 <b/><p/><p/>
-<p><span class="label label-warning">2018 May 19th - Update!</span> 1.5.4 Released! Better Normalizer with slang dictionary, improvements in annotators and fixed python2 support
-Learn more <a href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/CHANGELOG">HERE</a> and check out updated documentation below</p>
+<p><span class="label label-warning">2018 Jul 7th - Update!</span> 1.6.0 Released! OCR PDF to Spark-NLP capabilities, a new Chunker annotator, fixed AWS compatibility, better performance, and much more.
+Learn about the changes <a href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/CHANGELOG">HERE</a> and check out the updated documentation below</p>
 </div>
 <div id="cards-wrapper" class="cards-wrapper row">
 <div class="item item-green col-md-4 col-sm-6 col-xs-6">

docs/notebooks.html

Lines changed: 9 additions & 9 deletions
@@ -103,7 +103,7 @@ <h4 id="scala-vivekn-notebook" class="section-block"> Vivekn Sentiment Analysis<
 Since we are dealing with small amounts of data, we put in practice LightPipelines.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/example/src/TrainViveknSentiment.scala" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/example/src/TrainViveknSentiment.scala" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 </section>
@@ -135,7 +135,7 @@ <h4 id="vivekn-notebook" class="section-block"> Vivekn Sentiment Analysis</h4>
 better Sentiment Analysis accuracy
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/python/example/vivekn-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/vivekn-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -157,7 +157,7 @@ <h4 id="sentiment-notebook" class="section-block"> Rule-based Sentiment Analysis
 Each of these sentences will be used for giving a score to text
 </p>
 </p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/python/example/dictionary-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/dictionary-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -177,7 +177,7 @@ <h4 id="crfner-notebook" class="section-block"> CRF Named Entity Recognition</h4
 approach to use the same pipeline for tagging external resources.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/python/example/crf-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/crf-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -196,7 +196,7 @@ <h4 id="dlner-notebook" class="section-block"> CNN Deep Learning NER</h4>
 and it will leverage batch-based distributed calls to native TensorFlow libraries during prediction.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/python/example/dl-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/dl-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -211,7 +211,7 @@ <h4 id="text-notebook" class="section-block"> Simple Text Matching</h4>
 This annotator is an AnnotatorModel and does not require training.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/python/example/text-matcher/extractor.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/text-matcher/extractor.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -226,7 +226,7 @@ <h4 id="assertion-notebook" class="section-block"> Assertion Status with LogReg<
 dataset will return the appropriate result.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/python/example/logreg-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/logreg-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -241,7 +241,7 @@ <h4 id="dlassertion-notebook" class="section-block"> Deep Learning Assertion Sta
 graphs may be redesigned if needed.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/python/example/dl-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/dl-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -260,7 +260,7 @@ <h4 id="downloader-notebook" class="section-block"> Retrieving Pretrained models
 Such components may then be injected seamlessly into further pipelines, and so on.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/python/example/model-downloader/ModelDownloaderExample.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/model-downloader/ModelDownloaderExample.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 </section>
