Releases: JohnSnowLabs/spark-nlp
John Snow Labs Spark-NLP 2.2.0-rc1: BERT improvements, OCR Coordinates, python evaluation
We are glad to present the first release candidate of this new release. Last time, following a release candidate schedule allowed
us to move from 2.1.0 straight to 2.2.0! Fortunately, there were no breaking bugs thanks to careful testing of releases alongside the community,
which resulted in various pull requests.
This huge release features OCR-based coordinate highlighting, a BERT embeddings refactor and tuning, more tools for accuracy evaluation in Python, and much more.
We welcome your feedback in our Slack channels, as always!
New Features
- OCRHelper now returns coordinate positions matrix for text converted from PDF
- New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
- Evaluation module now also ported to Python
- WordEmbeddings now includes coverage metadata information, and the new static functions `withCoverageColumn` and `overallCoverage` offer metric analysis
- Progress bar report when downloading models and loading embeddings
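The new coverage helpers above might be used roughly as follows. This is a sketch only: the exact static-method signatures and the `percentage` field are assumptions based on the release note, and `resultDf` stands in for a DataFrame already processed by a pipeline.

```scala
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel

// Assuming `resultDf` already contains an "embeddings" column produced by
// a WordEmbeddingsModel stage earlier in a pipeline.

// Adds a per-row coverage column describing how many tokens were found
// in the embeddings vocabulary (column names are illustrative).
val withCoverage =
  WordEmbeddingsModel.withCoverageColumn(resultDf, "embeddings", "coverage")

// Computes a single coverage metric over the whole dataset.
val coverage = WordEmbeddingsModel.overallCoverage(resultDf, "embeddings")
println(coverage.percentage) // fraction of tokens covered by the vocabulary
```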
Enhancements
- BERT Embeddings now integrates much better with Spark NLP, returning state-of-the-art accuracy numbers for NER (details will be expanded). Thank you for the community feedback.
- Models and pipeline cache are now managed more efficiently and include CRC (not retroactive)
- Finisher and LightPipeline now deal with embeddings properly, including them in the pre-processed result (thank you Will Held)
- Tokenizer now allows regular expressions in the list of Exceptions (Thank you @atomobianco)
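The regex-exceptions enhancement above might look like this in practice. A minimal sketch, assuming the `setExceptions`/`addException` setters; the patterns shown are illustrative only.

```scala
import com.johnsnowlabs.nlp.annotators.Tokenizer

// Keep certain expressions together as single tokens.
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
  .setExceptions(Array("New York", "e-mail")) // plain-string exceptions
  // Per this release, exception entries may also be regular expressions,
  // e.g. keeping decimal numbers whole (illustrative pattern):
  .addException("\\d+\\.\\d+")
```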
Bugfixes
- Fixed a bug in NerConverter caused by empty entities, returning an error when flushing entities
- Fixed a bug when creating BERT Models from python, where contrib libraries were not loaded
- Fixed missing setters for whitelist param in NerConverter
John Snow Labs Spark-NLP 2.1.0: DocumentAssembler and Tokenizer redesigned
Thank you for following up with the release candidates. This release is backwards-breaking because two basic annotators have been redesigned.
The Tokenizer now has easier-to-customize params and simplified exception management.
DocumentAssembler's trimAndClearNewLines was redesigned into a cleanupMode for further control over the cleanup process.
Tokenizer now supports pretrained models, meaning you'll be capable of accessing any of our language-based Tokenizers.
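Loading a pretrained Tokenizer might look like the sketch below. This assumes the default `pretrained()` call resolves to an English model; model names and defaults are assumptions, not confirmed by these notes.

```scala
import com.johnsnowlabs.nlp.annotators.Tokenizer

// Downloads and loads a language-specific pretrained Tokenizer
// (default model name/language are assumed here).
val tokenizer = Tokenizer.pretrained()
  .setInputCols("document")
  .setOutputCol("token")
```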
Another big introduction is the eval module: an optional Spark NLP sub-module that provides evaluation scripts, to
make it easier to measure how your own models perform against a validation dataset, now using MLflow.
Some work also began on metrics during training, starting now with the NerDLApproach.
Finally, we'll have Scaladocs ready for easy library reference.
Thank you for your feedback in our Slack channels.
Particular thanks to @csnardi for fixing a bug in one of the release candidates.
New Features
- Spark NLP Eval module includes functions to evaluate NER and spell checkers with MLflow (Python support and more annotators to come)
Enhancements
- DocumentAssembler's new param `cleanupMode` allows the user to decide what kind of cleanup to apply to the source
- Tokenizer has been significantly enhanced to allow easier and more intuitive customization
- Norvig and Symmetric spell checkers now report confidence scores in metadata
- NerDLApproach now reports metrics and F1 scores with automated dataset splitting through `setTrainValidationProp`
- Began making progress towards OCR reporting more meaningful metadata (noise levels, confidence scores, etc.), laying the groundwork for further development
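The `cleanupMode` change in the list above might be used as sketched here. The `"shrink"` value is one plausible mode; the full set of supported values should be checked against the parameter docs for your version.

```scala
import com.johnsnowlabs.nlp.DocumentAssembler

// cleanupMode replaces the old trimAndClearNewLines behavior with a
// configurable cleanup policy ("shrink" is an illustrative value).
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")
```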
Bugfixes
- Fixed Dependency Parser not reporting offsets correctly
- Dependency Parser now only shows head token as part of the result, instead of pairs
- Fixed NerDLModel not allowing users to pick noncontrib versions on Linux
- Fixed a bug in embeddingsRef validation allowing the user to override ref when not possible
- Removed unintentional GC calls causing some performance issues
Framework
- ResourceDownloader now capable of utilizing credentials from AWS standard means (variables, credentials folder)
Documentation
- Scaladocs for Spark NLP reference
- Added Google Colab walkthrough guide
- Added Approach and Model class names in the reference documentation
- Fixed various typos and outdated pieces in documentation
John Snow Labs Spark-NLP 2.0.9: EmbeddingsRef fixes, disabled rule factory debugging
This release fixes a bug in the embeddingsRef param that caused embeddings not to be loadable when setIncludeEmbeddings was set to false
Bugfixes
- WordEmbeddingsModel can now be loaded using embeddingsRef correctly
- Disabled RuleFactory debug mode spam messages
John Snow Labs Spark-NLP 2.1.0-rc2: Bugfixes in resource downloader
Release candidate #2 for 2.1.0
- Fixed Tokenizer missing `pretrained()` functions
- Fixed issue in metadata not allowing retrieval of models with anonymous S3 credentials
- Added metadata options to differentiate internally between pipelines and models
- Fixed resource downloader to correctly resolve release candidate build versions
John Snow Labs Spark-NLP 2.1.0-rc1: Tokenizer revamped, NerDLApproach metrics and eval module
This is a pre-release for 2.1.0. The tokenizer has been revamped, and some of the DocumentAssembler defaults changed.
For this reason, many pipelines and models may now change in accuracy and performance. Old tokenizer default rules
will be translated into a new English-specific pretrained Tokenizer.
NerDLApproach will now report metrics if setTrainValidationProp has been set, and spell checkers now report confidence scores.
DependencyParser output has been reviewed, and a number of other bugs in the embeddings scope have been fixed.
Please send feedback and report bugs, and remember: this is a pre-release, not yet intended for production use.
Join Slack!
Enhancements
- Norvig and Symmetric spell checkers now report confidence scores in metadata
- Tokenizer has been significantly enhanced to allow easier and faster customization
- NerDLApproach now reports metrics and F1 scores with automated dataset splitting through `setTrainValidationProp`
- Began making progress towards OCR reporting more meaningful metadata (noise levels, confidence scores, etc.), laying the groundwork for further development
- Added `spark-nlp-eval` evaluation module with multiple scripts that help users evaluate their models and pipelines. To be improved.
Bugfixes
- Fixed Dependency Parser not reporting offsets correctly
- Dependency Parser now only shows head token as part of the result, instead of pairs
- Fixed NerDLModel not allowing users to pick noncontrib versions on Linux
- Fixed a bug in embeddingsRef validation allowing the user to override ref when not possible
Framework
- ResourceDownloader now capable of utilizing credentials from AWS standard means (variables, credentials folder)
Documentation
- Added Google Colab walkthrough guide
- Added Approach and Model class names in reference documentation
- Fixed various typos and outdated pieces in documentation
John Snow Labs Spark-NLP 2.0.8: Model compatibility bugfixes
This release fixes a few small but meaningful issues that caused newly trained models to have internal compatibility issues.
Bugfixes
- Fixed wrong logic when checking embeddingsRef is being overwritten in a WordEmbeddingsModel
- Deleted unnecessary chunk index from tokens
- Fixed some of the new trained models compatibility issues when python API had mismatching pretrained models compared to scala
John Snow Labs Spark-NLP 2.0.7: Cluster compatibility improvements
This release addresses bugs related to cluster support, improving error messages and fixing various potential bugs that depend
on the cluster configuration, such as Kryo serialization or non-default filesystems.
Bugfixes
- Fixed a bug introduced in 2.0.5 that caused NerDL not to work in clusters with Kryo serialization enabled
- NerDLModel was not properly reading user provided config proto bytes during prediction
- Improved cluster embeddings message to hint users about cluster mode without shared filesystems
- Removed lazy model downloading in PretrainedPipeline so the model is downloaded at instantiation
- Fixed URI construction for cluster embeddings on non-default FS configurations, improving cluster compatibility
John Snow Labs Spark-NLP 2.0.6: NerDL customizable graphs and bug fixes
Following the 2.0.5 release (read the notes below), this release fixes a bug when disabling the useContrib param in NerDLApproach on non-Windows operating systems.
Bugfixes
- Fixed NerDLApproach failing when training with setUseContrib(false)
John Snow Labs Spark-NLP 2.0.5: NerDL customizable graphs and cluster fixes
This release bumps Spark NLP to Apache Spark 2.4.3 by default. Spark had been undergoing testing with Scala 2.12 and is back on 2.11 now, so this should be a stable release.
In this version, we fixed a series of pretrained models and focused on improving the flexibility of the NerDL annotator, which is arguably the most popular one based on user feedback.
Users can point to graphs they create without having to re-compile the library; graph options, as well as whether to use TensorFlow contrib, are now user-defined.
Particular thanks to @CyborgDroid for reporting important, well-documented bugs that helped us improve Spark NLP.
Thank you for reporting issues and feedback, and we always welcome more. Join us on Slack!
Enhancements
- ViveknSentiment annotator now includes confidence score in metadata
- NerDL now has setGraphFolder to allow pointing to a folder with custom-generated graphs created using Python/TensorFlow code
- NerDL now has setConfigProtoBytes to allow users to submit their own serialized ConfigProto for the graph settings
- NerDLApproach now has setUseContrib to let the training user decide whether or not to use contrib. Contrib LSTM cells have proven to return more accurate results, but do not work on Windows yet.
- Updated default TensorFlow settings to include GPU allow_growth by default, and disabled log-device-placement spam messages
- Spark version bumped to 2.4.3
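The new NerDL knobs above might be combined as sketched below. The folder path and values are illustrative assumptions, and a full training setup (input columns, label column, data) is elided.

```scala
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach

val ner = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")
  .setLabelColumn("label")
  // Point to a folder containing custom-generated TensorFlow graphs
  // (illustrative path):
  .setGraphFolder("/tmp/custom_graphs")
  // Opt out of contrib LSTM cells, e.g. for Windows compatibility:
  .setUseContrib(false)
// setConfigProtoBytes additionally accepts a serialized ConfigProto
// for fine-grained TensorFlow session settings.
```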
Bugfixes
- Fixed contrib NerDL models not working properly in clusters such as Databricks (thanks @CyborgDroid)
- Fixed sparknlp.start(include_ocr=True) missing dependencies for OCR
- Fixed DependencyParser pretrained models not working properly in Python
Models and Pipelines
- NerDL will download a noncontrib model if Windows is detected, for better compatibility
- Noncontrib versions of pipelines with NerDL have been uploaded, as well as new models. Check the documentation for the complete list
- Improved error message when a user on Windows tries to load a contrib NerDL model
- Fixed ViveknSentimentModel not working properly (Thanks @CyborgDroid)
Developer API
- Embeddings in python moved to annotator module for consistency
- SourceStream ResourceHelper class now properly handles cluster files for Dependency Parser
- Metadata model reader now ignores empty lines instead of failing
- Unified the lang attribute name (instead of language) in the pretrained API
John Snow Labs Spark-NLP 2.0.4: Fixes for dependency parser and pretrained models
We are excited about the Spark NLP workshop (spark-nlp-workshop repository) being so useful for many users.
Now we have also made a step forward by moving the website's documentation to an easy-to-maintain Jekyll template with Markdown. The Spark NLP library received key bug fixes
in this release. Thanks to the community for reporting issues on GitHub. Much more to come, as always.
Bugfixes
- Fixed DependencyParser and TypedDependencyParser producing inaccurate results
- Fixed a bug preventing the WordEmbeddingsModel class from loading in Python
- Fixed wrong pre-trained model names preventing some pre-trained models from working properly
- Fixed BertEmbeddings not being capable of loading from file due to a reader exception
Documentation
- Website documentation migrated to GitHub wiki page (WIP)
Developer API
- OcrHelper now reports failed file name when throwing exceptions (Thanks @kgeis)
- Fixed Annotation function explodeAnnotations to consider replacing output column scenarios
- Fixed Travis CI unit tests