Releases: data-prep-kit/data-prep-kit
v1.1.7
What's Changed
- preparing for a new release by @swith005 in #1527
- replacing existing text_encoder with lanceDB integration by @klwuibm in #1512
- Doc id fix by @touma-I in #1534
- update deployed module names, fixes OpenSearch input parameters by @roytman in #1528
- updated docling2parquet to remove image data from contents field when… by @swith005 in #1531
- Enhancement to README file by @shahrokhDaijavad in #1526
- changing embeddings_in_parquet flag to embeddings_in_lanceDB to fix b… by @klwuibm in #1542
- Add Spark support for doc_quality and docling2parquet. by @Mohammad-nassar10 in #1544
- added pytests for batch generation by @swith005 in #1543
- add a Rich-based Log handler by @roytman in #1519
- tkn2arrow-restructured folder and test ray job by @touma-I in #1539
- Ededup fix by @touma-I in #1536
- modified requirements.txt data_processing to remove burden for model_… by @swith005 in #1546
- Collapse: Added rayjob yaml file by @touma-I in #1548
- resize: Added rayjob yaml by @touma-I in #1549
- remove fullpath printing for log levels greater than Debug by @roytman in #1552
- Pii image notebook by @shahrokhDaijavad in #1553
- Small fixes in the logger formatter by @roytman in #1555
- remove a duplicated line by @roytman in #1556
- Folder-to-Parquet Transform by @touma-I in #1507
- Show some selected output cells by @shahrokhDaijavad in #1560
- Fix a bug in transform.py when GPU is available by @klwuibm in #1564
- add an option to set dpw log handlers by @roytman in #1568
- Move SinkHandler to data_processing lib. by @revit13 in #1567
- update boto3 version in data-processing-lib requirements by @swith005 in #1570
- added changes for installing with python 13 by @swith005 in #1554
- fix type mismatch str/int in ededup when using int_doc_id by @touma-I in #1559
- LF TSC changes by @shahrokhDaijavad in #1573
- Image Transforms : Added model loader for downloading yolov in runtime by @Raghav-Bell in #1572
- updated files to utilize uv by @swith005 in #1574
- fixing web2parquet notebook by @shahrokhDaijavad in #1577
- adding dev6 release for regression testing by @swith005 in #1579
- preparing for a new release (1.1.7) by @swith005 in #1580
Full Changelog: v1.1.6...v1.1.7
v.1.1.6
What's Changed
- Tekton pipelines Feature branch for experimentation by @matouma in #1454
- Tekton 2 by @touma-I in #1461
- added role for pipeline service account on ocp by @matouma in #1463
- Fix Access denied for docling image by @touma-I in #1465
- Build and use new image with kubectl and oc commands by @touma-I in #1473
- preparing for a new release by @swith005 in #1476
- Restructured folder separating tasks from pipelines by @touma-I in #1479
- Reference feature/tekton in repo task by @touma-I in #1482
- updated rag_pdf_example with version 1.1.4 and docling2parquet by @swith005 in #1478
- Documentation : fixed the parameter int_id_columnname by @Raghav-Bell in #1475
- Truncated exception message by @Raghav-Bell in #1480
- Update doc_id-ray.ipynb with correct int column name by @swith005 in #1490
- updated blocklist to accept blocked_domain_list locally without daf by @swith005 in #1497
- hacktoberfest - Fix dpk_docling2parquet and other imports in notebooks that were broken in many places - issue 1481 by @maaleemkazmi-code in #1488
- Added crypto example to the PII recipe by @shahrokhDaijavad in #1486
- DocID: Create and test new Python Dockerfile and Python job by @touma-I in #1498
- DocId: Tested to show deprecated warning by @touma-I in #1501
- added easyocr to requirements since no longer included in latest ver… by @swith005 in #1502
- Enable Tekton pipeline to use python runtime by @touma-I in #1492
- Examples notebook for newly added transform blocklist #1315 by @mahadevroy84 in #1496
- Combine all dpk logs into a single one by @roytman in #1499
- Added notes for hugging face gated models. by @Raghav-Bell in #1504
- Docling2parquet options for extractings binaries for images/pages by @swith005 in #1505
- Tekton pipelines running on K8s by @touma-I in #1460
- Allow runtime to process zip, ndjson and json in addition to parquet by @touma-I in #1506
- Fix Doc id to be able to process lists content -Rebased by @touma-I in #1520
- Mention support for new formats in README by @shahrokhDaijavad in #1515
- Opensearch transform by @touma-I in #1449
- Multimodal refactor by @swith005 in #1516
- Release 3 Multi-modal transforms by @matouma in #1278
- removed all kfp workflows by @swith005 in #1522
- adding dev1 release for regression testing by @swith005 in #1521
- preparing for a new release (1.1.6) by @swith005 in #1523
New Contributors
- @maaleemkazmi-code made their first contribution in #1488
- @mahadevroy84 made their first contribution in #1496
Full Changelog: v1.1.5...v1.1.6
v1.1.5
What's Changed
- Change in the governance personnel by @shahrokhDaijavad in #1444
- PII Redactor: restructure to use new runtime file names and tested for crypto by @touma-I in #1459
- Gneissweb: use new convention for runtime file names by @touma-I in #1458
- added options for vlm granite docling in docling2parquet by @swith005 in #1456
- Fixed typo in CONTRIBUTING.md file by @shahrokhDaijavad in #1452
- Documentation - PDF File link correction by @Raghav-Bell in #1464
- Added python multiprocessing job and fix multiprocessing boto pickle e… by @touma-I in #1457
- Changes to recipe notebooks due to release 1.1.5.dev0 by @shahrokhDaijavad in #1467
- Gw multiprocessing and rayjob by @touma-I in #1445
- restrict version of matplotlib in requirements.txt for flair in pii_r… by @swith005 in #1468
- check filter_criteria for None by @swith005 in #1470
New Contributors
- @Raghav-Bell made their first contribution in #1464
Full Changelog: v1.1.4...v1.1.5
v1.1.4
What's Changed
- preparing for a new release by @swith005 in #1425
- [bug] Extend tokenization to process list of strings by @matouma in #1433
- Updating code to put -1 when text content is empty by @santoshborse in #1424
- fix additional secrets by @roytman in #1431
- updated filter transform to return empty table with original schema r… by @swith005 in #1434
- Avoid latest release of polars 1.33 that is breaking the code by @touma-I in #1439
- added support for binary transforms and data in chain, updated tests, readme, … by @swith005 in #1429
- Drop lower bound on boto3 dependency. by @revit13 in #1437
- updated logging to remove access and secret key from config if there by @swith005 in #1440
- adding dev1 release for regression testing by @swith005 in #1441
- preparing for a new release (1.1.4) by @swith005 in #1442
Full Changelog: v1.1.3...v1.1.4
v1.1.3
What's Changed
- preparing for next release (post 1.1.2) by @swith005 in #1358
- Interchanged the 2 notebooks for code_quality transform by @shahrokhDaijavad in #1350
- Fixed the bug with MD file as input for docling2parquet v2 by @shahrokhDaijavad in #1360
- Added kfp_ray folder and files for the code_profile transform by @shahrokhDaijavad in #1335
- Fixing the logo in README.md by @Ibrahim2595 in #1364
- fix typo in release note text by @matouma in #1373
- Update release-notes.md by @touma-I in #1374
- Upgrade Pyarrow to 17.0.0 and resolve pandas/numpy conflict with Google Colab by @matouma in #1372
- Adjust tagged dependencies to work with Google Colab by @touma-I in #1367
- get rid of removed_column by @matouma in #1371
- patch v1.1.2 to address bug in filter code and relax requirements for testing with later releases of pydantic by @touma-I in #1385
- add the ABC class as a base class to all transforms by @roytman in #1383
- Prepare post1 release with patches by @touma-I in #1386
- Updated the Gneissweb notebook with the latest release and new API for the filter by @shahrokhDaijavad in #1388
- Merge patch fixes with dev branch by @touma-I in #1389
- Contributing C4 Annotator by @santoshborse in #1322
- finalize numpy<=1.26.4 across all requirements.txt files by @matouma in #1390
- Relax requirements for xxhash for work Haifa is doing by @matouma in #1395
- Parse metadata.json at end of the run and flag for exceptions by @matouma in #1396
- latest release of setuptools is breaking the build by @touma-I in #1403
- updated model_loader to utilize data_access_s3 for s3 load by @swith005 in #1404
- Removing not required torch dependency by @santoshborse in #1411
- Mem tests by @swith005 in #1407
- updated valid config for io for data_access_local by @swith005 in #1414
- Fixing Divide by 0 in fine web quality annotator by @santoshborse in #1394
- adding dev1 release for regression testing by @swith005 in #1418
- preparing for new release 1.1.3 by @swith005 in #1419
New Contributors
- @Ibrahim2595 made their first contribution in #1364
Full Changelog: v.1.1.2...v1.1.3
v.1.1.2
What's Changed
- Debug issue with Ededup kfp v1 failing in fork by @matouma in #877
- Transforms 1.0.0a0 refactored language transforms by @matouma in #879
- Restructure Html2Parquet with its own dpk_ namespace by @touma-I in #809
- Restructure Pdf2Parquet with its own dpk_ namespace by @touma-I in #813
- Restructure text_encoder with its own dpk_ namespace by @touma-I in #826
- refactored doc quality transform as its own module with its own dpk_ namespace by @touma-I in #854
- Refactor doc_id with its own dpk_ module name by @touma-I in #860
- first cut at refactoring with own dpk_lang_id name space by @touma-I in #864
- Refactor hap transform as its own dpk_ module by @touma-I in #866
- Refactor tokenization transform as its own dpk_tokenization module by @touma-I in #869
- Refactored ededup with own dpk_ededup namespace by @matouma in #878
- pull changes from fork to main repo by @matouma in #892
- FDedup Refactored as its own dpk_ module by @matouma in #893
- refactor tokenization transform as its own named dpk_ module by @matouma in #886
- Fix transforms 1.0 alpha release so it uses docid to generate int id required by fdedup by @matouma in #899
- Added checkmarks for Code Profile in the README by @shahrokhDaijavad in #894
- Initial Pass at Similarity Transform by @AnLiGentile in #897
- Refactored filter transform as its own dpk_filter named module by @matouma in #900
- PII data file by @PoojaHolkar in #828
- Enhance header cleanser module with multi-processing and timeout by @takuyagt in #849
- Relax requirements on pandas and requests by @touma-I in #901
- Add image_pull_secrets parameter to add_settings_to_comp for kfp v2 by @revit13 in #915
- Fixing the broken links in the main README file by @shahrokhDaijavad in #917
- Update README.md for the Similarity transform by @shahrokhDaijavad in #911
- Update README.md for the filter transform by @shahrokhDaijavad in #919
- Refactoring code profiler transform to new pythonic code layout #913 by @pankajskku in #916
- Adding support for c sharp by @pankajskku in #926
- Added TRANSFROM_NAME to docker build arg by @matouma in #929
- Updated readme and added notebook by @yash-kalathiya in #845
- fix: updated broken links and paths in kfp v2 documentation by @juancappi in #907
- Support the case where an arbitrary user id runs the ray docker images by @revit13 in #934
- Deleting obsolete notebooks by @shahrokhDaijavad in #933
- Fix path issues when running superworkflow pipeline sample for kfp v2 by @revit13 in #935
- added missing ray notebooks for doc_quality and filter by @matouma in #927
- Deleting 3 "run first .." notebooks from the example folder and the links to them from the main README file + new notebook by @shahrokhDaijavad in #938
- Remove DOCKER_REMOTE_IMAGE from .make.defaults by @touma-I in #890
- Update KFP_DOCKER_VERSION. by @revit13 in #937
- Update Readme.md removing confusion on version 0.2.3 vs 1.0.0 by @matouma in #939
- [agentic-exploration branch] Minor updates to dpk_intro_1_langchain notebook. by @revit13 in #942
- Remove KFP_DOCKER_VERSION. by @revit13 in #943
- README update for Similarity Transform by @AnLiGentile in #944
- Adding rules to the semantic rule set by @pankajskku in #941
- Refactoring pii_redactor as its own dpk_ named module by @matouma in #895
- Relax fasttext requirements >=0.9.2 by @matouma in #950
- Cleanup documentation for 1.0.0 by @touma-I in #945
- Fixed sample notebook location for html2parquet by @sujee in #948
- refactor noop transform to use dpk_ structures by @daw3rd in #951
- refactored profiler transform by @matouma in #966
- initial refactoring resize by @matouma in #960
- Updated Resources page per latest DPK announcement by @agoyal26 in #961
- Starting new release cycle after cutoff 1.0.0 by @matouma in #968
- Updating the semantic rules in csv file by @pankajskku in #963
- [KFP] Obtain the Ray cluster run ID from the user for KFP v2. by @revit13 in #956
- Ordered in reverse chronology and added Dates for events by @agoyal26 in #976
- add exception handling in mkdocs hook by @shivdeep-singh-ibm in #984
- added quick patch disabling fcntl for Windows by @matouma in #987
- Updating rag-html-1 example by @sujee in #949
- update maintainers by @touma-I in #986
- designate folder for all data-files used by various examples and tutorial by @matouma in #994
- added Optional step for enabling kfp by @touma-I in #992
- Add extreme tokenize and readability transforms by @cmadam in #965
- Making column names lowercase to make output table schema compatible with the Lakehouse by @pankajskku in #979
- Documentation adjustments by @cmadam in #999
- README files for supporting native windows by @shahrokhDaijavad in #991
- gneissweb_classification by @ran-iwamoto in #974
- DPK processing of text data for finetuning by @PoojaHolkar in #973
- Fix some typos in contribute-your-own-transform.md by @shahrokhDaijavad in #1004
- Rep removal by @swith005 in #953
- Dev 1.0.1.dev1 by @matouma in #1006
- Fdedup package versioning and windows fixes by @cmadam in #1003
- Testing dev1 release by @matouma in #1014
- Reorganized landing page readme and added readme to examples folder by @agoyal26 in #1001
- Update contribute-your-own-transform.md by @shahrokhDaijavad in #1019
- pdf-processing-1 example updated by @sujee in #998
- Updating URLs to point to main data prep kit repo by @sujee in #1022
- Upgrade Docling to v2.21 by @dolfim-ibm in #1031
- Cargo fix by @swith005 in #1016
- Readability transform: performance improvement and adding score_list argument by @cmadam in #1026
- added writeup for building dev wheel by @matouma in #1025
- DPK LLM Agent by @Mohammad-nassar10 in #1021
- Rag pdf 2 by @sujee in #955
- change data files location to 'examples/data-files/pdf-processing-1' by @sujee in ...
v1.1.1
What's Changed
- Debug issue with Ededup kfp v1 failing in fork by @matouma in #877
- Transforms 1.0.0a0 refactored language transforms by @matouma in #879
- Restructure Html2Parquet with its own dpk_ namespace by @touma-I in #809
- Restructure Pdf2Parquet with its own dpk_ namespace by @touma-I in #813
- Restructure text_encoder with its own dpk_ namespace by @touma-I in #826
- refactored doc quality transform as its own module with its own dpk_ namespace by @touma-I in #854
- Refactor doc_id with its own dpk_ module name by @touma-I in #860
- first cut at refactoring with own dpk_lang_id name space by @touma-I in #864
- Refactor hap transform as its own dpk_ module by @touma-I in #866
- Refactor tokenization transform as its own dpk_tokenization module by @touma-I in #869
- Refactored ededup with own dpk_ededup namespace by @matouma in #878
- pull changes from fork to main repo by @matouma in #892
- FDedup Refactored as its own dpk_ module by @matouma in #893
- refactor tokenization transform as its own named dpk_ module by @matouma in #886
- Fix transforms 1.0 alpha release so it uses docid to generate int id required by fdedup by @matouma in #899
- Added checkmarks for Code Profile in the README by @shahrokhDaijavad in #894
- Initial Pass at Similarity Transform by @AnLiGentile in #897
- Refactored filter transform as its own dpk_filter named module by @matouma in #900
- PII data file by @PoojaHolkar in #828
- Enhance header cleanser module with multi-processing and timeout by @takuyagt in #849
- Relax requirements on pandas and requests by @touma-I in #901
- Add image_pull_secrets parameter to add_settings_to_comp for kfp v2 by @revit13 in #915
- Fixing the broken links in the main README file by @shahrokhDaijavad in #917
- Update README.md for the Similarity transform by @shahrokhDaijavad in #911
- Update README.md for the filter transform by @shahrokhDaijavad in #919
- Refactoring code profiler transform to new pythonic code layout #913 by @pankajskku in #916
- Adding support for c sharp by @pankajskku in #926
- Added TRANSFROM_NAME to docker build arg by @matouma in #929
- Updated readme and added notebook by @yash-kalathiya in #845
- fix: updated broken links and paths in kfp v2 documentation by @juancappi in #907
- Support the case where an arbitrary user id runs the ray docker images by @revit13 in #934
- Deleting obsolete notebooks by @shahrokhDaijavad in #933
- Fix path issues when running superworkflow pipeline sample for kfp v2 by @revit13 in #935
- added missing ray notebooks for doc_quality and filter by @matouma in #927
- Deleting 3 "run first .." notebooks from the example folder and the links to them from the main README file + new notebook by @shahrokhDaijavad in #938
- Remove DOCKER_REMOTE_IMAGE from .make.defaults by @touma-I in #890
- Update KFP_DOCKER_VERSION. by @revit13 in #937
- Update Readme.md removing confusion on version 0.2.3 vs 1.0.0 by @matouma in #939
- [agentic-exploration branch] Minor updates to dpk_intro_1_langchain notebook. by @revit13 in #942
- Remove KFP_DOCKER_VERSION. by @revit13 in #943
- README update for Similarity Transform by @AnLiGentile in #944
- Adding rules to the semantic rule set by @pankajskku in #941
- Refactoring pii_redactor as its own dpk_ named module by @matouma in #895
- Relax fasttext requirements >=0.9.2 by @matouma in #950
- Cleanup documentation for 1.0.0 by @touma-I in #945
- Fixed sample notebook location for html2parquet by @sujee in #948
- refactor noop transform to use dpk_ structures by @daw3rd in #951
- refactored profiler transform by @matouma in #966
- initial refactoring resize by @matouma in #960
- Updated Resources page per latest DPK announcement by @agoyal26 in #961
- Starting new release cycle after cutoff 1.0.0 by @matouma in #968
- Updating the semantic rules in csv file by @pankajskku in #963
- [KFP] Obtain the Ray cluster run ID from the user for KFP v2. by @revit13 in #956
- Ordered in reverse chronology and added Dates for events by @agoyal26 in #976
- add exception handling in mkdocs hook by @shivdeep-singh-ibm in #984
- added quick patch disabling fcntl for Windows by @matouma in #987
- Updating rag-html-1 example by @sujee in #949
- update maintainers by @touma-I in #986
- designate folder for all data-files used by various examples and tutorial by @matouma in #994
- added Optional step for enabling kfp by @touma-I in #992
- Add extreme tokenize and readability transforms by @cmadam in #965
- Making column names lowercase to make output table schema compatible with the Lakehouse by @pankajskku in #979
- Documentation adjustments by @cmadam in #999
- README files for supporting native windows by @shahrokhDaijavad in #991
- gneissweb_classification by @ran-iwamoto in #974
- DPK processing of text data for finetuning by @PoojaHolkar in #973
- Fix some typos in contribute-your-own-transform.md by @shahrokhDaijavad in #1004
- Rep removal by @swith005 in #953
- Dev 1.0.1.dev1 by @matouma in #1006
- Fdedup package versioning and windows fixes by @cmadam in #1003
- Testing dev1 release by @matouma in #1014
- Reorganized landing page readme and added readme to examples folder by @agoyal26 in #1001
- Update contribute-your-own-transform.md by @shahrokhDaijavad in #1019
- pdf-processing-1 example updated by @sujee in #998
- Updating URLs to point to main data prep kit repo by @sujee in #1022
- Upgrade Docling to v2.21 by @dolfim-ibm in #1031
- Cargo fix by @swith005 in #1016
- Readability transform: performance improvement and adding score_list argument by @cmadam in #1026
- added writeup for building dev wheel by @matouma in #1025
- DPK LLM Agent by @Mohammad-nassar10 in #1021
- Rag pdf 2 by @sujee in #955
- change data files location to 'examples/data-files/pdf-processing-1' by @sujee in ...
v1.1.0
What's Changed
- [agentic-exploration branch] Minor updates to dpk_intro_1_langchain notebook. by @revit13 in #942
- Starting new release cycle after cutoff 1.0.0 by @matouma in #968
- Updating the semantic rules in csv file by @pankajskku in #963
- [KFP] Obtain the Ray cluster run ID from the user for KFP v2. by @revit13 in #956
- Ordered in reverse chronology and added Dates for events by @agoyal26 in #976
- add exception handling in mkdocs hook by @shivdeep-singh-ibm in #984
- added quick patch disabling fcntl for Windows by @matouma in #987
- Updating rag-html-1 example by @sujee in #949
- update maintainers by @touma-I in #986
- designate folder for all data-files used by various examples and tutorial by @matouma in #994
- added Optional step for enabling kfp by @touma-I in #992
- Add extreme tokenize and readability transforms by @cmadam in #965
- Making column names lowercase to make output table schema compatible with the Lakehouse by @pankajskku in #979
- Documentation adjustments by @cmadam in #999
- README files for supporting native windows by @shahrokhDaijavad in #991
- gneissweb_classification by @ran-iwamoto in #974
- DPK processing of text data for finetuning by @PoojaHolkar in #973
- Fix some typos in contribute-your-own-transform.md by @shahrokhDaijavad in #1004
- Rep removal by @swith005 in #953
- Dev 1.0.1.dev1 by @matouma in #1006
- Fdedup package versioning and windows fixes by @cmadam in #1003
- Testing dev1 release by @matouma in #1014
- Reorganized landing page readme and added readme to examples folder by @agoyal26 in #1001
- Update contribute-your-own-transform.md by @shahrokhDaijavad in #1019
- pdf-processing-1 example updated by @sujee in #998
- Updating URLs to point to main data prep kit repo by @sujee in #1022
- Upgrade Docling to v2.21 by @dolfim-ibm in #1031
- Cargo fix by @swith005 in #1016
- Readability transform: performance improvement and adding score_list argument by @cmadam in #1026
- added writeup for building dev wheel by @matouma in #1025
- DPK LLM Agent by @Mohammad-nassar10 in #1021
- Rag pdf 2 by @sujee in #955
- change data files location to 'examples/data-files/pdf-processing-1' by @sujee in #1036
- Updates the doc to show how to pip install and run a transform at the CLI by @daw3rd in #928
- KFP v2: Fix wrong Ray cluster name by @revit13 in #1039
- Extreme Tokenize transform fails when the number of documents is not equal to the number of tokens sets by @cmadam in #1053
- GneissWeb_recipe_notebook by @Hajar-Emami in #1055
- Fixed Readme git website by @agoyal26 in #1049
- Add shorter alternative flags to options in execute_ray_job_multi_s3.py. by @revit13 in #1067
- Update super pipeline kfp v2. by @revit13 in #1066
- Update transform.py by @ian-cho in #1056
- Add Supported Languages Table to Lang_id transform by @shahrokhDaijavad in #1068
- test pr target by @matouma in #1075
- test using env variable by @matouma in #1076
- added pull request target to code quality and gneissweb by @matouma in #1077
- Update main README.md to fix two broken links in the table by @shahrokhDaijavad in #1074
- Fix PDF with RAG url in readme by @dpkshetty in #1062
- updated embedding model and LLM for rag-pdf-1 example by @sujee in #1060
- trigger on pull request by @matouma in #1082
- use None value rather than None string by @matouma in #1083
- Change gneissweb classification workflow to use PR Target by @matouma in #1095
- Update transform.py by @ian-cho in #1058
- Clear the notebook of the run details by @Hajar-Emami in #1071
- Share secret securely across fork by @matouma in #1084
- Enabling gneissweb_classification transform by using multiple fasttext classifiers simultaneously by @ran-iwamoto in #1046
- Fdedup::transform() return 0 for success or error code by @cmadam in #1041
- Avoid exposing Hugging Face token in lang-id kfp pipeline by @revit13 in #1099
- Tokenization2Arrow - New Transform to tokenize data and generate .arrow and metadata files by @santoshborse in #1033
- Update docling to 2.25 and enable XML/JATS by @dolfim-ibm in #1108
- Implementing Bloom Annotator by @ian-cho in #978
- GneissWeb Notebook that uses dev2 by @shahrokhDaijavad in #1103
- Dev3 testing by @matouma in #1111
- relax requirements for boto3 by @matouma in #1018
- toolkit release 0.2.4 and transforms release 1.1.0 by @matouma in #1115
New Contributors
- @ran-iwamoto made their first contribution in #974
- @swith005 made their first contribution in #953
- @Hajar-Emami made their first contribution in #1055
- @dpkshetty made their first contribution in #1062
Full Changelog: v1.0.0...v1.1.0
v1.0.0
What's Changed
- Debug issue with Ededup kfp v1 failing in fork by @matouma in #877
- Transforms 1.0.0a0 refactored language transforms by @matouma in #879
- Restructure Html2Parquet with its own dpk_ namespace by @touma-I in #809
- Restructure Pdf2Parquet with its own dpk_ namespace by @touma-I in #813
- Restructure text_encoder with its own dpk_ namespace by @touma-I in #826
- refactored doc quality transform as its own module with its own dpk_ namespace by @touma-I in #854
- Refactor doc_id with its own dpk_ module name by @touma-I in #860
- first cut at refactoring with own dpk_lang_id name space by @touma-I in #864
- Refactor hap transform as its own dpk_ module by @touma-I in #866
- Refactor tokenization transform as its own dpk_tokenization module by @touma-I in #869
- Refactored ededup with own dpk_ededup namespace by @matouma in #878
- pull changes from fork to main repo by @matouma in #892
- FDedup Refactored as its own dpk_ module by @matouma in #893
- refactor tokenization transform as its own named dpk_ module by @matouma in #886
- Fix transforms 1.0 alpha release so it uses docid to generate int id required by fdedup by @matouma in #899
- Added checkmarks for Code Profile in the README by @shahrokhDaijavad in #894
- Initial Pass at Similarity Transform by @AnLiGentile in #897
- Refactored filter transform as its own dpk_filter named module by @matouma in #900
- PII data file by @PoojaHolkar in #828
- Enhance header cleanser module with multi-processing and timeout by @takuyagt in #849
- Relax requirements on pandas and requests by @touma-I in #901
- Add image_pull_secrets parameter to add_settings_to_comp for kfp v2 by @revit13 in #915
- Fixing the broken links in the main README file by @shahrokhDaijavad in #917
- Update README.md for the Similarity transform by @shahrokhDaijavad in #911
- Update README.md for the filter transform by @shahrokhDaijavad in #919
- Refactoring code profiler transform to new pythonic code layout #913 by @pankajskku in #916
- Adding support for c sharp by @pankajskku in #926
- Added TRANSFROM_NAME to docker build arg by @matouma in #929
- Updated readme and added notebook by @yash-kalathiya in #845
- fix: updated broken links and paths in kfp v2 documentation by @juancappi in #907
- Support the case where an arbitrary user id runs the ray docker images by @revit13 in #934
- Deleting obsolete notebooks by @shahrokhDaijavad in #933
- Fix path issues when running superworkflow pipeline sample for kfp v2 by @revit13 in #935
- added missing ray notebooks for doc_quality and filter by @matouma in #927
- Deleting 3 "run first .." notebooks from the example folder and the links to them from the main README file + new notebook by @shahrokhDaijavad in #938
- Remove DOCKER_REMOTE_IMAGE from .make.defaults by @touma-I in #890
- Update KFP_DOCKER_VERSION. by @revit13 in #937
- Update Readme.md removing confusion on version 0.2.3 vs 1.0.0 by @matouma in #939
- Remove KFP_DOCKER_VERSION. by @revit13 in #943
- README update for Similarity Transform by @AnLiGentile in #944
- Adding rules to the semantic rule set by @pankajskku in #941
- Refactoring pii_redactor as its own dpk_ named module by @matouma in #895
- Relax fasttext requirements >=0.9.2 by @matouma in #950
- Cleanup documentation for 1.0.0 by @touma-I in #945
- Fixed sample notebook location for html2parquet by @sujee in #948
- refactor noop transform to use dpk_ structures by @daw3rd in #951
- refactored profiler transform by @matouma in #966
- initial refactoring resize by @matouma in #960
- Updated Resources page per latest DPK announcement by @agoyal26 in #961
- Cut-off release for refactored language transforms by @matouma in #967
New Contributors
- @AnLiGentile made their first contribution in #897
- @yash-kalathiya made their first contribution in #845
Full Changelog: v0.2.3...v1.0.0
v0.2.3
What's Changed
- Fuzzy dedup by @Kibnelson in #699
- Doc Quality Transform: update readme and add sample notebook by @dtsuzuku-ibm in #790
- Fix for inability to read some parquet files (issue #816) by @daw3rd in #817
- Updated Resources webpage with latest talks and links by @agoyal26 in #846
- HAP transform: Update README.md and add sample notebook by @ian-cho in #821
- publish transforms==0.2.3.dev0 pre-release to pypi with dependency on toolkit==0.2.2 by @touma-I in #837
- Semantic profiler and report generation module integration by @pankajskku in #824
- Update doc for doc_id and ededup to follow template in issue #753 by @cmadam in #836
- Update README.md for check-marking the table with Python and Spark versions of fdedup by @shahrokhDaijavad in #855
- Added links to example notebooks - issue #848 fix by @cmadam in #861
- Hap score - Example Notebook by @AishaDarga in #840
- Simplified fix for issue 803 by @cmadam in #839
- Html rag 1 -- Crawl a website / process HTML / run RAG queries by @sujee in #838
- fix usage of pandas 2.1.x by @dolfim-ibm in #867
- Bug fix for Agda language in code profiler transform by @pankajskku in #865
- Release 0.2.3.dev1 per Constantin's request by @touma-I in #875
- Create pre-release wheels for code_profiler using transform 0.2.3.dev1 and toolkit 0.2.3.dev0 by @touma-I in #857
- Grant non-root users the necessary permissions to the ray directory by @revit13 in #881
- Start of a new release cycle with 1.0.0 by @matouma in #885
New Contributors
- @Kibnelson made their first contribution in #699
- @agoyal26 made their first contribution in #846
- @AishaDarga made their first contribution in #840
Full Changelog: v0.2.2...v0.2.3