Releases: JohnSnowLabs/spark-nlp
John Snow Labs Spark-NLP 2.2.0-rc1: BERT improvements, OCR Coordinates, python evaluation
We are glad to present the first release candidate of this new release. Last time, following a release candidate schedule allowed
us to move from 2.1.0 straight to 2.2.0! Fortunately, there were no breaking bugs thanks to careful testing of releases alongside the community,
which resulted in various pull requests.
This huge release features OCR-based coordinate highlighting, a BERT embeddings refactor and tuning, more tools for accuracy evaluation in Python, and much more.
We welcome your feedback in our Slack channels, as always!
New Features
- OCRHelper now returns coordinate positions matrix for text converted from PDF
- New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
- Evaluation module now also ported to Python
- WordEmbeddings now includes coverage metadata information, and the new static functions `withCoverageColumn` and `overallCoverage` offer metric analysis
- Progress bar report when downloading models and loading embeddings
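The new coverage helpers above might be used roughly as follows. This is a sketch only: the exact static-method signatures and the `percentage` field are assumptions based on the release note, and `resultDf` stands in for a DataFrame already processed by a pipeline.

```scala
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel

// Assuming `resultDf` already contains an "embeddings" column produced by
// a WordEmbeddingsModel stage earlier in a pipeline.

// Adds a per-row coverage column describing how many tokens were found
// in the embeddings vocabulary (column names are illustrative).
val withCoverage =
  WordEmbeddingsModel.withCoverageColumn(resultDf, "embeddings", "coverage")

// Computes a single coverage metric over the whole dataset.
val coverage = WordEmbeddingsModel.overallCoverage(resultDf, "embeddings")
println(coverage.percentage) // fraction of tokens covered by the vocabulary
```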
Enhancements
- BERT Embeddings now integrates much better with Spark NLP, returning state-of-the-art accuracy numbers for NER (details will be expanded). Thank you for the community feedback.
- Models and pipeline cache are now managed more efficiently and include CRC (not retroactive)
- Finisher and LightPipeline now deal with embeddings properly, including them in the pre-processed result (thank you Will Held)
- Tokenizer now allows regular expressions in the list of Exceptions (Thank you @atomobianco)
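The regex-exceptions enhancement above might look like this in practice. A minimal sketch, assuming the `setExceptions`/`addException` setters; the patterns shown are illustrative only.

```scala
import com.johnsnowlabs.nlp.annotators.Tokenizer

// Keep certain expressions together as single tokens.
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
  .setExceptions(Array("New York", "e-mail")) // plain-string exceptions
  // Per this release, exception entries may also be regular expressions,
  // e.g. keeping decimal numbers whole (illustrative pattern):
  .addException("\\d+\\.\\d+")
```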
Bugfixes
- Fixed a bug in NerConverter caused by empty entities, returning an error when flushing entities
- Fixed a bug when creating BERT Models from python, where contrib libraries were not loaded
- Fixed missing setters for whitelist param in NerConverter
John Snow Labs Spark-NLP 2.1.0: DocumentAssembler and Tokenizer redesigned
Thank you for following up with the release candidates. This release is backwards-breaking because two basic annotators have been redesigned.
The Tokenizer now has easier-to-customize params and simplified exception management.
DocumentAssembler's trimAndClearNewLines was redesigned into a cleanupMode for further control over the cleanup process.
Tokenizer now supports pretrained models, meaning you'll be capable of accessing any of our language-based Tokenizers.
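Loading a pretrained Tokenizer might look like the sketch below. This assumes the default `pretrained()` call resolves to an English model; model names and defaults are assumptions, not confirmed by these notes.

```scala
import com.johnsnowlabs.nlp.annotators.Tokenizer

// Downloads and loads a language-specific pretrained Tokenizer
// (default model name/language are assumed here).
val tokenizer = Tokenizer.pretrained()
  .setInputCols("document")
  .setOutputCol("token")
```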
Another big introduction is the eval module: an optional Spark NLP sub-module that provides evaluation scripts, to
make it easier to measure how your own models perform against a validation dataset, now using MLflow.
Some work also began on metrics during training, starting now with the NerDLApproach.
Finally, we'll have Scaladocs ready for easy library reference.
Thank you for your feedback in our Slack channels.
Particular thanks to @csnardi for fixing a bug in one of the release candidates.
New Features
- Spark NLP Eval module includes functions to evaluate NER and spell checkers with MLflow (Python support and more annotators to come)
Enhancements
- DocumentAssembler's new param `cleanupMode` allows the user to decide what kind of cleanup to apply to the source
- Tokenizer has been significantly enhanced to allow easier and more intuitive customization
- Norvig and Symmetric spell checkers now report confidence scores in metadata
- NerDLApproach now reports metrics and F1 scores with automated dataset splitting through `setTrainValidationProp`
- Began making progress towards OCR reporting more meaningful metadata (noise levels, confidence scores, etc.), laying the groundwork for further development
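The `cleanupMode` change in the list above might be used as sketched here. The `"shrink"` value is one plausible mode; the full set of supported values should be checked against the parameter docs for your version.

```scala
import com.johnsnowlabs.nlp.DocumentAssembler

// cleanupMode replaces the old trimAndClearNewLines behavior with a
// configurable cleanup policy ("shrink" is an illustrative value).
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")
```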
Bugfixes
- Fixed Dependency Parser not reporting offsets correctly
- Dependency Parser now only shows head token as part of the result, instead of pairs
- Fixed NerDLModel not allowing users to pick noncontrib versions on Linux
- Fixed a bug in embeddingsRef validation allowing the user to override ref when not possible
- Removed unintentional GC calls causing some performance issues
Framework
- ResourceDownloader now capable of utilizing credentials from AWS standard means (variables, credentials folder)
Documentation
- Scaladocs for Spark NLP reference
- Added Google Colab walkthrough guide
- Added Approach and Model class names in the reference documentation
- Fixed various typos and outdated pieces in documentation
John Snow Labs Spark-NLP 2.0.9: EmbeddingsRef fixes, disabled rule factory debugging
This release fixes a bug in the embeddingsRef param that caused embeddings not to be loadable when setIncludeEmbeddings was set to false
Bugfixes
- WordEmbeddingsModel can now be loaded using embeddingsRef correctly
- Disabled RuleFactory debug mode spam messages
John Snow Labs Spark-NLP 2.1.0-rc2: Bugfixes in resource downloader
Release candidate #2 for 2.1.0
- Fixed Tokenizer missing `pretrained()` functions
- Fixed issue in metadata not allowing retrieval of models with anonymous S3 credentials
- Added metadata options to differentiate internally between pipelines and models
- Fixed resource downloader to correctly resolve release candidate build versions
John Snow Labs Spark-NLP 2.1.0-rc1: Tokenizer revamped, NerDLApproach metrics and eval module
This is a pre-release for 2.1.0. The tokenizer has been revamped, and some of the DocumentAssembler defaults changed.
For this reason, many pipelines and models may now change in accuracy and performance. Old tokenizer default rules
will be translated into a new English-specific pretrained Tokenizer.
NerDLApproach will now report metrics if setTrainValidationProp has been set, and spell checkers now report confidence scores.
DependencyParser output has been reviewed, and a number of other bugs in the embeddings scope have been fixed.
Please send feedback and report bugs, and remember: this is a pre-release, not yet intended for production use.
Join Slack!
Enhancements
- Norvig and Symmetric spell checkers now report confidence scores in metadata
- Tokenizer has been significantly enhanced to allow easier and faster customization
- NerDLApproach now reports metrics and F1 scores with automated dataset splitting through `setTrainValidationProp`
- Began making progress towards OCR reporting more meaningful metadata (noise levels, confidence scores, etc.), laying the groundwork for further development
- Added `spark-nlp-eval` evaluation module with multiple scripts that help users evaluate their models and pipelines. To be improved.
Bugfixes
- Fixed Dependency Parser not reporting offsets correctly
- Dependency Parser now only shows head token as part of the result, instead of pairs
- Fixed NerDLModel not allowing users to pick noncontrib versions on Linux
- Fixed a bug in embeddingsRef validation allowing the user to override ref when not possible
Framework
- ResourceDownloader now capable of utilizing credentials from AWS standard means (variables, credentials folder)
Documentation
- Added Google Colab walkthrough guide
- Added Approach and Model class names in reference documentation
- Fixed various typos and outdated pieces in documentation
John Snow Labs Spark-NLP 2.0.8: Model compatibility bugfixes
This release fixes a few small but meaningful issues that caused newly trained models to have internal compatibility issues.
Bugfixes
- Fixed wrong logic when checking embeddingsRef is being overwritten in a WordEmbeddingsModel
- Deleted unnecessary chunk index from tokens
- Fixed some of the new trained models compatibility issues when python API had mismatching pretrained models compared to scala
John Snow Labs Spark-NLP 2.0.7: Cluster compatibility improvements
This release addresses bugs related to cluster support, improving error messages and fixing various potential bugs that depend
on the cluster configuration, such as Kryo serialization or non-default filesystems.
Bugfixes
- Fixed a bug introduced in 2.0.5 that caused NerDL not to work in clusters with Kryo serialization enabled
- NerDLModel was not properly reading user provided config proto bytes during prediction
- Improved cluster embeddings message to hint users about cluster mode without shared filesystems
- Removed lazy model downloading in PretrainedPipeline so the model is downloaded at instantiation
- Fixed URI construction for cluster embeddings on non-default FS configurations, improving cluster compatibility
John Snow Labs Spark-NLP 2.0.6: NerDL customizable graphs and bug fixes
Following the 2.0.5 release (read the notes below), this release fixes a bug when disabling the useContrib param in NerDLApproach on non-Windows operating systems.
Bugfixes
- Fixed NerDLApproach failing when training with setUseContrib(false)
John Snow Labs Spark-NLP 2.0.5: NerDL customizable graphs and cluster fixes
This release bumps Spark NLP to Apache Spark 2.4.3 by default. Spark had been undergoing testing with Scala 2.12 and is back on 2.11 now, so this should be a stable release.
In this version, we fixed a series of pretrained models and focused on improving the flexibility of the NerDL annotator, which is arguably the most popular one based on user feedback.
Users can point to graphs they create without having to re-compile the library; graph options, as well as whether to use TensorFlow contrib, are now user-defined.
Particular thanks to @CyborgDroid for reporting important, well-documented bugs that helped us improve Spark NLP.
Thank you for reporting issues and feedback, and we always welcome more. Join us on Slack!
Enhancements
- ViveknSentiment annotator now includes confidence score in metadata
- NerDL now has setGraphFolder to allow pointing to a folder with custom-generated graphs created using Python/TensorFlow code
- NerDL now has setConfigProtoBytes to allow users to submit their own serialized ConfigProto for the graph settings
- NerDLApproach now has setUseContrib to let the training user decide whether or not to use contrib. Contrib LSTM cells have proven to return more accurate results, but do not work on Windows yet.
- Updated default TensorFlow settings to include GPU allow_growth by default, and disabled log-device-placement spam messages
- Spark version bumped to 2.4.3
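The new NerDL knobs above might be combined as sketched below. The folder path and values are illustrative assumptions, and a full training setup (input columns, label column, data) is elided.

```scala
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach

val ner = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")
  .setLabelColumn("label")
  // Point to a folder containing custom-generated TensorFlow graphs
  // (illustrative path):
  .setGraphFolder("/tmp/custom_graphs")
  // Opt out of contrib LSTM cells, e.g. for Windows compatibility:
  .setUseContrib(false)
// setConfigProtoBytes additionally accepts a serialized ConfigProto
// for fine-grained TensorFlow session settings.
```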
Bugfixes
- Fixed contrib NerDL models not working properly in clusters such as Databricks (thanks @CyborgDroid)
- Fixed sparknlp.start(include_ocr=True) missing dependencies for OCR
- Fixed DependencyParser pretrained models not working properly in Python
Models and Pipelines
- NerDL will download a noncontrib model if Windows is detected, for better compatibility
- Noncontrib versions of pipelines with NerDL have been uploaded, as well as new models. Check the documentation for the complete list
- Improved error message when a user on Windows tries to load a contrib NerDL model
- Fixed ViveknSentimentModel not working properly (Thanks @CyborgDroid)
Developer API
- Embeddings in python moved to annotator module for consistency
- SourceStream ResourceHelper class now properly handles cluster files for Dependency Parser
- Metadata model reader now ignores empty lines instead of failing
- Unified the lang attribute name (instead of language) in the pretrained API
John Snow Labs Spark-NLP 2.0.4: Fixes for dependency parser and pretrained models
We are excited about the Spark NLP workshop (spark-nlp-workshop repository) being so useful for many users.
Now we have also made a step forward by moving the website's documentation to an easy-to-maintain Jekyll template with Markdown. The Spark NLP library received key bug fixes
in this release. Thanks to the community for reporting issues on GitHub. Much more to come, as always.
Bugfixes
- Fixed DependencyParser and TypedDependencyParser producing inaccurate results
- Fixed a bug preventing the WordEmbeddingsModel class from loading in Python
- Fixed wrong pre-trained model names preventing some pre-trained models from working properly
- Fixed BertEmbeddings not being capable of loading from file due to a reader exception
Documentation
- Website documentation migrated to GitHub wiki page (WIP)
Developer API
- OcrHelper now reports failed file name when throwing exceptions (Thanks @kgeis)
- Fixed Annotation function explodeAnnotations to consider replacing output column scenarios
- Fixed Travis CI unit tests