Releases: JohnSnowLabs/spark-nlp

John Snow Labs Spark-NLP 2.3.2: Multiple fixes and enhancements

08 Nov 16:05

This release delivers multiple bug fixes and an enhancement that reduces memory consumption in BertEmbeddings.


Bugfixes

  • Fix missing EmbeddingsFinisher in Scala and Python
  • Reverted the embeddings move-to-copy change due to a CRC issue
  • Fix IndexOutOfBoundsException in SentenceEmbeddings

Enhancement

  • Optimize BertEmbeddings memory consumption

John Snow Labs Spark-NLP 2.3.1: EmbeddingsHelper and Lemmatizer fix

29 Oct 20:44

This quick release addresses a bug in the Lemmatizer load/pretrained functions that caused them not to work in 2.3.0.
We took the chance to include a feature that did not make it into the base 2.3.0 release, slightly changed protected variables for
a better Java API, and added a pretrained function compatible with Java. Thanks again for the quick issue feedback!


New Features

  • New EmbeddingsFinisher specializes in handling the output of embedding annotators; the traditional Finisher still behaves the same as in 2.3.0 (see the sketch below)
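
A minimal Python sketch of how EmbeddingsFinisher might be wired after an embeddings stage. The column names and the pretrained model used here are illustrative assumptions, not part of this release note:

    import sparknlp
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler, EmbeddingsFinisher
    from sparknlp.annotator import Tokenizer, WordEmbeddingsModel

    spark = sparknlp.start()

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    embeddings = WordEmbeddingsModel.pretrained() \
        .setInputCols(["document", "token"]).setOutputCol("embeddings")

    # EmbeddingsFinisher pulls the embedding vectors out of the annotation structure,
    # optionally as Spark ML Vectors, so they can feed downstream Spark ML stages
    finisher = EmbeddingsFinisher() \
        .setInputCols(["embeddings"]) \
        .setOutputCols(["finished_embeddings"]) \
        .setOutputAsVector(True)

    pipeline = Pipeline(stages=[document, tokenizer, embeddings, finisher])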

Bugfixes

  • Fixed a bug from the previous release that prevented LemmatizerModel from being loaded or retrieved via pretrained
  • Fixed the pretrained() function to return the proper type in Java

John Snow Labs Spark-NLP 2.3.0: More embedding builders and better Java support

25 Oct 22:33

Thanks for your contributions and feedback on Slack. This amazing release comes with many new features around embeddings, allowing pipeline builders to retrieve embeddings for specific bodies of text in whatever form they are given, from sentences to chunks or n-grams.
We also worked hard on making sure Spark NLP in Java works as intended, and we improved AWS profile compatibility for frameworks that use multiple credential profiles. Unfortunately, we have deprecated the Eval and OCR modules due to internal patents covering some of the latest improvements John Snow Labs has contributed.


New Features

  • New SentenceEmbeddings annotator utilizes WordEmbeddings or BertEmbeddings to generate sentence or document embeddings (see the sketch after this list)
  • New ChunkEmbeddings annotator utilizes WordEmbeddings or BertEmbeddings to generate chunk embeddings from Chunker, NGramGenerator, or NerConverter outputs
  • New StopWordsCleaner integrates the Spark ML StopWordsRemover functionality into the Spark NLP pipeline
  • New NGramGenerator annotator integrates the Spark ML NGram function into Spark NLP, with a new cumulative feature to also generate n-grams over a range of lengths, as in the scikit-learn library
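
As a rough illustration, here is a minimal Python sketch of SentenceEmbeddings pooling WordEmbeddings into one vector per document. The pooling strategy, column names, and pretrained model are assumptions chosen for the example:

    import sparknlp
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, SentenceEmbeddings

    spark = sparknlp.start()

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    word_embeddings = WordEmbeddingsModel.pretrained() \
        .setInputCols(["document", "token"]).setOutputCol("embeddings")

    # Pools the word embeddings of each document into a single vector
    sentence_embeddings = SentenceEmbeddings() \
        .setInputCols(["document", "embeddings"]) \
        .setOutputCol("sentence_embeddings") \
        .setPoolingStrategy("AVERAGE")  # "SUM" is the other assumed option

    pipeline = Pipeline(stages=[document, tokenizer, word_embeddings, sentence_embeddings])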

Enhancements

  • Fixed Java incompatibilities in the Pretrained and LightPipeline APIs; examples added
  • New parse-embeddings-vector flag in Finisher and LightPipeline allows optional vector processing to save memory and improve performance
  • setInputCols in Python can now be passed as *args (see the example after this list)
  • New enableScore param in SentimentDetector switches the output type between confidence score and result (Thanks @maxwellpaulm)
  • The spark_nlp profile name is now used by default in the AWS config, keeping downloads compatible with multiple credential profiles
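
A small illustration of the *args enhancement; the annotator and column names are just an example and assume a Spark NLP session has already been started:

    from sparknlp.annotator import NerDLModel

    # Before 2.3.0 a Python list was required
    ner = NerDLModel.pretrained().setInputCols(["sentence", "token", "embeddings"])

    # From 2.3.0 the same columns can also be passed as *args
    ner = NerDLModel.pretrained().setInputCols("sentence", "token", "embeddings")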

Bugfixes

  • Fixed POS training dataset creator to improve performance

Deprecations

  • OCR Module dropped from open source support
  • Eval Module dropped from open source support

John Snow Labs Spark-NLP 2.2.2: Better Evaluation module in python, fixed duplicate coordinates, graph script

26 Sep 19:48

Thank you again for all your feedback and questions in our Slack channel. Such feedback from users and contributors
(thank you Stuart Lynn @sllynn) helped us find several Python module bugs. We also fixed and improved OCR support
for extracting page coordinates and fixed the NerDL evaluator in Python.


Enhancements

  • Added a create_models.py Python script to generate graphs for NerDL without the need for Jupyter
  • Added a new annotator Token2Chunk to convert all tokens to chunk types (useful for extracting token coordinates from OCR; see the sketch after this list)
  • Added OCR page dimensions
  • Python setInputCols now accepts *args, so passing a list is no longer required
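
A minimal Python sketch of Token2Chunk, assuming a standard document-to-token pipeline upstream and an already started Spark NLP session; the column names are illustrative:

    from sparknlp.annotator import Token2Chunk

    # Converts every TOKEN annotation into a CHUNK annotation, so downstream
    # consumers that expect chunks (such as coordinate extraction) can work on plain tokens
    token2chunk = Token2Chunk() \
        .setInputCols(["token"]) \
        .setOutputCol("token_chunks")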

Bugfixes

  • Fixed Python support for NerDL evaluation not taking all params appropriately
  • Fixed a bug in case-sensitivity matching of the embeddings format in Python (Thanks @sllynn)
  • Fixed a bug in the Python DateMatcher where the dateFormat param was not working (Thanks @sllynn)
  • Fixed a bug in PositionFinder reporting duplicate coordinate elements

Developer API

  • Renamed trainValidationProp to validationSplit in NerDLApproach

Documentation

  • Added documentation for several previously undocumented annotators on the docs page

John Snow Labs Spark-NLP 2.2.1: Python PipelineModel bugfixes

28 Aug 18:18

This short release addresses a few issues uncovered in the previous 2.2.0 release. Thank you all for the quick feedback.


Enhancements

  • New includeValidationProp param in NerDLApproach allows partitioning the training set and excluding a fraction of it
  • NerDLApproach trainValidationProp now samples the data randomly instead of taking the first rows

Bugfixes

  • Fixed a bug in ResourceHelper causing folder resources to fail when a folder is empty (affects various annotators)
  • Fixed a bug where the Python embeddings format was not parsed to upper case
  • Fixed a bug in Python that prevented PipelineModels from being loaded after loading embeddings

John Snow Labs Spark-NLP 2.2.0: BERT improvements, OCR Coordinates, python evaluation

23 Aug 06:06

Last time, following a release candidate schedule proved to be quite an effective way to avoid silly bugs right after release!
Fortunately, careful testing of the release candidates alongside the community, which resulted in various pull requests, left no breaking bugs.
This huge release features OCR-based coordinate highlighting, a BERT embeddings refactor and tuning, more tools for accuracy evaluation in Python, and much more.
We welcome your feedback in our Slack channels, as always!


New Features

  • OCRHelper now returns a coordinate position matrix for text converted from PDF
  • New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
  • Evaluation module now also ported to Python
  • WordEmbeddings now includes coverage metadata, and the new static functions withCoverageColumn and overallCoverage offer metric analysis (see the sketch after this list)
  • NerDL now has an includeConfidence param that enables confidence scores in prediction metadata
  • NerDLApproach now has enableOutputLog, which outputs training metric logs to a file
  • New poolingLayer param in BERT allows for pooling layer selection
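
A hedged Python sketch of the new coverage helpers. The Python signatures shown here are assumptions modeled on the Scala static functions named above, and result_df is assumed to be a DataFrame that already contains an "embeddings" column produced by WordEmbeddings:

    from sparknlp.annotator import WordEmbeddingsModel

    # Adds a per-row coverage column describing how many tokens were found
    # in the embeddings vocabulary
    covered_df = WordEmbeddingsModel.withCoverageColumn(result_df, "embeddings", "coverage")

    # Returns an overall coverage summary for the whole dataset
    overall = WordEmbeddingsModel.overallCoverage(result_df, "embeddings")
    print(overall.percentage)  # assumed field on the returned coverage result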

Enhancements

  • BERT embeddings now integrate much better with Spark NLP, returning state-of-the-art accuracy numbers for NER (details to be expanded). Thank you for the community feedback.
  • Progress bar and size estimate report when downloading pretrained models and loading embeddings
  • Models and pipeline cache now managed more efficiently and includes CRC checks (not retroactive)
  • Finisher and LightPipeline now deal with embeddings properly, including them in the pre-processed result (Thank you Will Held)
  • Tokenizer now allows regular expressions in the list of exceptions (Thank you @atomobianco)
  • PretrainedPipeline now offers a fullAnnotate function to retrieve full Annotation information (see the example after this list)
  • New DocumentAssembler cleanup modes each, each_full, and delete_full allow more control over text cleanup (different ways of dealing with newlines and tabs)
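
As an example, a minimal Python sketch of fullAnnotate on a pretrained pipeline. The pipeline name and output column are illustrative assumptions:

    from sparknlp.pretrained import PretrainedPipeline

    pipeline = PretrainedPipeline("explain_document_dl", lang="en")  # example pipeline

    # annotate() returns plain strings, while fullAnnotate() returns the full Annotation
    # objects with begin/end offsets, result, and metadata
    result = pipeline.fullAnnotate("John Snow Labs is a Delaware company.")[0]
    for annotation in result["entities"]:  # "entities" is an assumed output column
        print(annotation.begin, annotation.end, annotation.result, annotation.metadata)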

Bugfixes

  • Fixed a bug in NerConverter, caused by empty entities, that returned an error when flushing entities
  • Fixed a bug when creating BERT models from Python, where contrib libraries were not loaded
  • Fixed missing setters for the whitelist param in NerConverter
  • Fixed a bug where parameters from a BERT model were read incorrectly from Python because they were not serialized correctly
  • Fixed a bug where ResourceDownloader conflicted S3 credentials with public model access (Thank you Dimitris Manikis)
  • Fixed Context Spell Checker bugs with performance improvements (pretrained model disabled until we get a better one)

John Snow Labs Spark-NLP 2.2.0-rc3: BERT improvements, OCR Coordinates, python evaluation

20 Aug 23:29

We are so glad to present this release candidate. Last time, following a release candidate schedule proved
to be quite an effective way to avoid silly bugs right after release! Fortunately, careful testing of the release candidates alongside the community,
which resulted in various pull requests, left no breaking bugs. This huge release features OCR-based coordinate highlighting, a BERT embeddings refactor and tuning, more tools for accuracy evaluation in Python, and much more.
We welcome your feedback in our Slack channels, as always!


New Features

  • OCRHelper now returns a coordinate position matrix for text converted from PDF
  • New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
  • Evaluation module now also ported to Python
  • WordEmbeddings now includes coverage metadata, and the new static functions withCoverageColumn and overallCoverage offer metric analysis
  • NerDL now has an includeConfidence param that enables confidence scores in prediction metadata
  • New poolingLayer param in BERT allows for pooling layer selection

Enhancements

  • BERT embeddings now integrate much better with Spark NLP, returning state-of-the-art accuracy numbers for NER (details to be expanded). Thank you for the community feedback.
  • Progress bar and size estimate report when downloading pretrained models and loading embeddings
  • Models and pipeline cache now managed more efficiently and includes CRC checks (not retroactive)
  • Finisher and LightPipeline now deal with embeddings properly, including them in the pre-processed result (Thank you Will Held)
  • Tokenizer now allows regular expressions in the list of exceptions (Thank you @atomobianco)
  • PretrainedPipeline now offers a fullAnnotate function to retrieve full Annotation information

Bugfixes

  • Fixed a bug in NerConverter, caused by empty entities, that returned an error when flushing entities
  • Fixed a bug when creating BERT models from Python, where contrib libraries were not loaded
  • Fixed missing setters for the whitelist param in NerConverter
  • Fixed a bug where parameters from a BERT model were read incorrectly from Python because they were not serialized correctly
  • Fixed a bug where ResourceDownloader conflicted S3 credentials with public model access (Thank you Dimitris Manikis)
  • Fixed Context Spell Checker bugs with performance improvements (pretrained model disabled until we get a better one)

John Snow Labs Spark-NLP 2.2.0-rc2: BERT improvements, OCR Coordinates, python evaluation

18 Aug 15:05

We are so glad to present this release candidate. Last time, following a release candidate schedule proved
to be quite an effective way to avoid silly bugs right after release! Fortunately, careful testing of the release candidates alongside the community,
which resulted in various pull requests, left no breaking bugs. This huge release features OCR-based coordinate highlighting, a BERT embeddings refactor and tuning, more tools for accuracy evaluation in Python, and much more.
We welcome your feedback in our Slack channels, as always!


New Features

  • OCRHelper now returns a coordinate position matrix for text converted from PDF
  • New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
  • Evaluation module now also ported to Python
  • WordEmbeddings now includes coverage metadata, and the new static functions withCoverageColumn and overallCoverage offer metric analysis
  • NerDL now has an includeConfidence param that enables confidence scores in prediction metadata
  • New poolingLayer param in BERT allows for pooling layer selection
  • Progress bar report when downloading models and loading embeddings

Enhancements

  • BERT embeddings now integrate much better with Spark NLP, returning state-of-the-art accuracy numbers for NER (details to be expanded). Thank you for the community feedback.
  • Models and pipeline cache now managed more efficiently and includes CRC checks (not retroactive)
  • Finisher and LightPipeline now deal with embeddings properly, including them in the pre-processed result (Thank you Will Held)
  • Tokenizer now allows regular expressions in the list of exceptions (Thank you @atomobianco)
  • PretrainedPipeline now offers a fullAnnotate function to retrieve full Annotation information

Bugfixes

  • Fixed a bug in NerConverter, caused by empty entities, that returned an error when flushing entities
  • Fixed a bug when creating BERT models from Python, where contrib libraries were not loaded
  • Fixed missing setters for the whitelist param in NerConverter
  • Fixed a bug where parameters from a BERT model were read incorrectly from Python because they were not serialized correctly

John Snow Labs Spark-NLP 2.1.1: Fixed flush entities bug, added missing setters to NerConverter

18 Aug 01:37

Thank you so much for your feedback on Slack. This release extends the life of the 2.1.x line with important bugfixes from upstream.


Bugfixes

  • Fixed a bug in NerConverter, caused by empty entities, that returned an error when flushing entities
  • Fixed a bug when creating BERT models from Python, where contrib libraries were not loaded
  • Fixed missing setters for the whitelist param in NerConverter

John Snow Labs Spark-NLP 2.2.0-rc1: BERT improvements, OCR Coordinates, python evaluation

16 Aug 04:22

We are so glad to present the first release candidate of this release. Last time, following a release candidate schedule allowed
us to move from 2.1.0 straight to 2.2.0! Fortunately, careful testing of the releases alongside the community,
which resulted in various pull requests, left no breaking bugs.
This huge release features OCR-based coordinate highlighting, a BERT embeddings refactor and tuning, more tools for accuracy evaluation in Python, and much more.
We welcome your feedback in our Slack channels, as always!


New Features

  • OCRHelper now returns a coordinate position matrix for text converted from PDF
  • New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
  • Evaluation module now also ported to Python
  • WordEmbeddings now includes coverage metadata, and the new static functions withCoverageColumn and overallCoverage offer metric analysis
  • Progress bar report when downloading models and loading embeddings

Enhancements

  • BERT embeddings now integrate much better with Spark NLP, returning state-of-the-art accuracy numbers for NER (details to be expanded). Thank you for the community feedback.
  • Models and pipeline cache now managed more efficiently and includes CRC checks (not retroactive)
  • Finisher and LightPipeline now deal with embeddings properly, including them in the pre-processed result (Thank you Will Held)
  • Tokenizer now allows regular expressions in the list of Exceptions (Thank you @atomobianco)

Bugfixes

  • Fixed a bug in NerConverter, caused by empty entities, that returned an error when flushing entities
  • Fixed a bug when creating BERT models from Python, where contrib libraries were not loaded
  • Fixed missing setters for the whitelist param in NerConverter