Improved performance and security for ContentStream_readInlineImage. #740

sekrause · 2022-04-12T17:17:26Z

Recreated pull-request of the accidentally closed PR #331.

It fixes #329 by raising an exception when the stream ends and we haven't the end token for the inline image.

It also fixes #330 by using a more efficient parsing algorithm. For large inline images this change speeds up this method by many orders of magnitude:

Instead of only reading single bytes from the stream it reads larger chunks of 8 kB and uses Python's really fast find() method to check for the E the token. Only when the token is found it falls back to the normal algorithm that detects the end of the inline image.
Instead of using an immutable data it uses BytesIO to collect the output which support much faster appends.

MartinThoma · 2022-04-12T20:17:15Z

@MasterOdin What do you think about the PR? Do you see anything that this could break?

PyPDF2/pdf.py

codecov-commenter · 2022-04-15T11:18:21Z

Codecov Report

Merging #740 (3d5548d) into main (0890b06) will decrease coverage by 0.06%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##             main     #740      +/-   ##
==========================================
- Coverage   69.53%   69.47%   -0.07%     
==========================================
  Files           9        9              
  Lines        3309     3315       +6     
  Branches      782      783       +1     
==========================================
+ Hits         2301     2303       +2     
- Misses        767      769       +2     
- Partials      241      243       +2

Impacted Files	Coverage Δ
PyPDF2/pdf.py	`72.17% <50.00%> (-0.15%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0890b06...3d5548d. Read the comment docs.

MartinThoma · 2022-04-15T11:27:42Z

I'm currently trying to find a PDF with an inline image so that the code at least runs once

MartinThoma · 2022-04-15T11:28:29Z

Ah, just found your comment again:

from reportlab.pdfgen import canvas
c = canvas.Canvas("test.pdf")
c.drawInlineImage("test.png", 100, 100, 100, 100)
c.drawString(200, 100, "Test")
c.showPage()
c.save()

MartinThoma · 2022-04-15T11:57:51Z

Thank you so much for all the time you invested into this over so 5 years! 🙏

Security (SEC): - ContentStream_readInlineImage had potential infinite loop (#740) Bug fixes (BUG): - Fix merging encrypted files (#757) - CCITTFaxDecode decodeParms can be an ArrayObject (#756) Robustness improvements (ROBUST): - title sometimes None (#744) Documentation (DOC): - Adjust short description of the package Tests and Test setup (TST): - Rewrite JS tests from unittest to pytest (#746) - Increase Test coverage, mainly with filters (#756) - Add test for inline images (#758) Developer Experience Improvements (DEV): - Remove unused Travis-CI configuration (#747) - Show code coverage (#754, #755) - Add mutmut (#760) Miscellaneous: - STY: Closing file handles, explicit exports, ... (#743) All changes: 1.27.4...1.27.5

Improved performance and security for ContentStream_readInlineImage.

81c888f

sekrause mentioned this pull request Apr 12, 2022

Improved performance and security for ContentStream_readInlineImage #331

Closed

MartinThoma added nf-security Non-functional change: Security Tiny Pull requests that make a tiny change - and thus should be easy to merge labels Apr 12, 2022

Merge branch 'main' into fast-inline-images

e66c809

MartinThoma reviewed Apr 15, 2022

View reviewed changes

PyPDF2/pdf.py Outdated Show resolved Hide resolved

MartinThoma reviewed Apr 15, 2022

View reviewed changes

PyPDF2/pdf.py Outdated Show resolved Hide resolved

Minor style change

894c618

MartinThoma mentioned this pull request Apr 15, 2022

TST: Add test for inline images #758

Merged

Merge branch 'main' into fast-inline-images

3d5548d

MartinThoma merged commit d71fb3e into py-pdf:main Apr 15, 2022

MartinThoma mentioned this pull request Apr 16, 2022

Uses buffered input for inline images to speed up reading #390

Closed

MartinThoma mentioned this pull request Jun 26, 2022

Endless Loop When Processing Certain Large PDF with PdfFileWriter #358

Closed

pubpub-zz mentioned this pull request Sep 6, 2022

PdfReadError: Unexpected end of stream #1090

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improved performance and security for ContentStream_readInlineImage. #740

Improved performance and security for ContentStream_readInlineImage. #740

Uh oh!

sekrause commented Apr 12, 2022

Uh oh!

MartinThoma commented Apr 12, 2022

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Apr 15, 2022 •

edited

Loading

Uh oh!

MartinThoma commented Apr 15, 2022

Uh oh!

MartinThoma commented Apr 15, 2022

Uh oh!

MartinThoma commented Apr 15, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Improved performance and security for ContentStream_readInlineImage. #740

Improved performance and security for ContentStream_readInlineImage. #740

Uh oh!

Conversation

sekrause commented Apr 12, 2022

Uh oh!

MartinThoma commented Apr 12, 2022

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Apr 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

MartinThoma commented Apr 15, 2022

Uh oh!

MartinThoma commented Apr 15, 2022

Uh oh!

MartinThoma commented Apr 15, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov-commenter commented Apr 15, 2022 •

edited

Loading