CHANGELOG
55 lines changed: 55 additions & 0 deletions
@@ -1,3 +1,58 @@
+========
+1.6.0
+========
+---------------
+Overview
+---------------
+We're late, but it was worth it. We're glad to release 1.6.0, which brings new features, lots of enhancements and many bug fixes. First of all, we thank the community for participating on Slack and GitHub by sharing feedback and reporting issues.
+This release adds a new annotator, the Chunker, which extracts pieces of text that follow a particular Part-of-Speech pattern.
+We also have a brand new OCR to Spark DataFrame utility, bundled as an optional component of Spark-NLP. The utility requires Tesseract 4.x+ to be installed on your system, and may be downloaded from our website or readme pages.
+Aside from that, we improved many areas, from the DocumentAssembler, which now works better with OCR output, down to our Deep Learning models, which gain consistency and accuracy. Word Embedding based annotators also receive improvements when working in cluster environments.
+Finally, we are glad that a user contributed a fix for the AWS dependency issue, which occurred particularly in Cloudera environments. We're still waiting for feedback on it, and gladly accept it.
+We'll be working on the documentation following this release. Thank you.
+
+---------------
+New Features
+---------------
+* New annotator: Chunker. This annotator takes regular expressions over Part-of-Speech tags and returns the chunks of text that match such patterns
+* OCR to Spark-NLP: Shipped as an optional jar module, the OcrHelper class converts PDF files into a Spark Dataset, ready to be used by Spark-NLP's DocumentAssembler. May be used without Spark-NLP. Requires Tesseract 4.x on your system.
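The idea behind the Chunker entry can be pictured with a small, library-free Python sketch. This is not Spark-NLP's implementation or API: the function name `chunk_by_pos`, the angle-bracket tag encoding and the sample sentence are all assumptions made for illustration. It only shows how a regular expression written over Part-of-Speech tags can select the matching pieces of text.

```python
import re


def chunk_by_pos(tokens, tags, pattern):
    """Return the text chunks whose POS tag sequence matches `pattern`.

    `pattern` is an ordinary regex over tags wrapped in angle brackets,
    for example r"(?:<DT>)?(?:<JJ>)*(?:<NN>)+" (hypothetical encoding).
    """
    # Encode the tags as one string and remember which token index
    # corresponds to each tag boundary in that string.
    boundaries = {}
    encoded = ""
    for index, tag in enumerate(tags):
        boundaries[len(encoded)] = index
        encoded += "<" + tag + ">"
    boundaries[len(encoded)] = len(tags)

    chunks = []
    for match in re.finditer(pattern, encoded):
        first = boundaries.get(match.start())
        last = boundaries.get(match.end())
        # Skip empty matches and matches that do not align with whole tags.
        if first is None or last is None or first == last:
            continue
        chunks.append(" ".join(tokens[first:last]))
    return chunks


tokens = ["The", "quick", "fox", "jumps", "over", "the", "lazy", "dog"]
tags = ["DT", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"]

# Optional determiner, any number of adjectives, one or more nouns.
noun_phrases = chunk_by_pos(tokens, tags, r"(?:<DT>)?(?:<JJ>)*(?:<NN>)+")
print(noun_phrases)
```

In the real annotator the tags would come from a Part-of-Speech annotator earlier in the pipeline; here they are supplied by hand to keep the sketch self-contained.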
+
+---------------
+Enhancements
+---------------
+* TextMatcher now has a caseSensitive (setCaseSensitive) Param which controls whether matching is case sensitive (ignored if the Normalizer already did it). The returned word is still the original.
+* LightPipelines in Python should now be faster thanks to an optimization that prefetches results into Python memory instead of going through the py4j bridge
+* LightPipelines can now handle embedded Pipelines
+* PerceptronApproach now trains using a fully distributed Spark algorithm. Still experimental. PerceptronApproachLegacy may still be used, and might be better for local, non-cluster setups.
+* Tokenizer now has a param 'includeDefaults' which may be set to False to disable all preset rules.
+* WordEmbedding based annotators may now normalize tokens before matching embedding vectors through the 'useNormalizedTokensForEmbeddings' Param. Generally improves consistency and reduces overfitting.
+* DocumentAssembler now deals better with large amounts of text through 'trimAndClearNewLines', which works better with OCR output and prepares the text for later Sentence Detection
+* Improved SentenceDetector handling of enumerations and lists
+* Slightly improved SentenceDetector performance through non-tail-recursive optimizations
+* Finisher no longer has default delimiters when outputting to String (not Array) (thanks @S_L)
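The TextMatcher entry says that matching may ignore case while the returned word is still the original. A minimal plain-Python sketch of that behaviour follows. It is not Spark-NLP code: `match_phrases` and its `case_sensitive` argument are hypothetical names chosen to mirror the changelog wording, and the real annotator may work differently.

```python
def match_phrases(tokens, phrases, case_sensitive=True):
    """Find each phrase in `tokens` and return the matched original text."""

    def key(word):
        # Comparison key only; the original token is what gets returned.
        return word if case_sensitive else word.lower()

    wanted = {tuple(key(word) for word in phrase.split()) for phrase in phrases}
    keyed = [key(token) for token in tokens]

    matches = []
    for size in sorted({len(phrase) for phrase in wanted}):
        for start in range(len(tokens) - size + 1):
            if tuple(keyed[start:start + size]) in wanted:
                matches.append(" ".join(tokens[start:start + size]))
    return matches


tokens = "I live in New York and work in new york".split()

strict = match_phrases(tokens, ["new york"])
relaxed = match_phrases(tokens, ["new york"], case_sensitive=False)
print(strict)
print(relaxed)
```

Lowercasing is applied only to the comparison keys, which is why the relaxed run can still hand back "New York" with its original capitalisation.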
+
+---------------
+Bug fixes
+---------------
+* AWS library dependency conflict now resolved (thanks to @apiltamang for proposing the solution, and to the community for the follow-up). The solution is experimental, waiting for feedback.
+* Fixed wrong order of further added Tokenizer's infixPatterns in Python (thanks @sethah)
+* Training annotators that use Word Embeddings in a distributed cluster no longer sporadically throws file not found exceptions
+* Fixed NerDLModel returning non-deterministic results during prediction
+* Deep-Learning based models and graphs trained on GPU can now run on CPU when no GPU is available on the client
+* WordEmbeddings temporary location no longer in HOME dir, moved to tmp.dir
-<p><span class="label label-warning">2018 May 19th - Update!</span> 1.5.4 Released! Better Normalizer with slang dictionary, improvements in annotators and fixed python2 support
-Learn more <a href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/CHANGELOG">HERE</a> and check out updated documentation below</p>
+<p><span class="label label-warning">2018 Jul 7th - Update!</span> 1.6.0 Released! OCR PDF to Spark-NLP capabilities, new Chunker annotator, fixed AWS compatibility, better performance and much more.
+Learn changes <a href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/CHANGELOG">HERE</a> and check out for updated documentation below</p>
 Since we are dealing with small amounts of data, we put in practice LightPipelines.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/example/src/TrainViveknSentiment.scala" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/example/src/TrainViveknSentiment.scala" target="_blank"> Take me to notebook!</a>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/python/example/vivekn-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/vivekn-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
 Each of these sentences will be used for giving a score to text
 </p>
 </p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/python/example/dictionary-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/dictionary-sentiment/sentiment.ipynb" target="_blank"> Take me to notebook!</a>
 approach to use the same pipeline for tagging external resources.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/python/example/crf-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/crf-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
 and it will leverage batch-based distributed calls to native TensorFlow libraries during prediction.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/python/example/dl-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/dl-ner/ner.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -211,7 +211,7 @@ <h4 id="text-notebook" class="section-block"> Simple Text Matching</h4>
 This annotator is an AnnotatorModel and does not require training.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/python/example/text-matcher/extractor.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/text-matcher/extractor.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -226,7 +226,7 @@ <h4 id="assertion-notebook" class="section-block"> Assertion Status with LogReg<
 dataset will return the appropriate result.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/python/example/logreg-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/logreg-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
 </p>
 </div>
 <div>
@@ -241,7 +241,7 @@ <h4 id="dlassertion-notebook" class="section-block"> Deep Learning Assertion Sta
 graphs may be redesigned if needed.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/python/example/dl-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/dl-assertion/assertion.ipynb" target="_blank"> Take me to notebook!</a>
 Such components may then be injected seamlessly into further pipelines, and so on.
 </p>
 <p>
-<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.5.4/python/example/model-downloader/ModelDownloaderExample.ipynb" target="_blank"> Take me to notebook!</a>
+<a class="btn btn-warning btn-cta" style="float: center;margin-top: 10px;" href="https://github.com/JohnSnowLabs/spark-nlp/blob/1.6.0/python/example/model-downloader/ModelDownloaderExample.ipynb" target="_blank"> Take me to notebook!</a>