Skip to content

Conversation

@sekrause
Copy link
Contributor

Recreated pull-request of the accidentally closed PR #331.

It fixes #329 by raising an exception when the stream ends and we haven't the end token for the inline image.

It also fixes #330 by using a more efficient parsing algorithm. For large inline images this change speeds up this method by many orders of magnitude:

  • Instead of only reading single bytes from the stream it reads larger chunks of 8 kB and uses Python's really fast find() method to check for the E the token. Only when the token is found it falls back to the normal algorithm that detects the end of the inline image.
  • Instead of using an immutable data it uses BytesIO to collect the output which support much faster appends.

@MartinThoma MartinThoma added nf-security Non-functional change: Security Tiny Pull requests that make a tiny change - and thus should be easy to merge labels Apr 12, 2022
@MartinThoma
Copy link
Member

@MasterOdin What do you think about the PR? Do you see anything that this could break?

@codecov-commenter
Copy link

codecov-commenter commented Apr 15, 2022

Codecov Report

Merging #740 (3d5548d) into main (0890b06) will decrease coverage by 0.06%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##             main     #740      +/-   ##
==========================================
- Coverage   69.53%   69.47%   -0.07%     
==========================================
  Files           9        9              
  Lines        3309     3315       +6     
  Branches      782      783       +1     
==========================================
+ Hits         2301     2303       +2     
- Misses        767      769       +2     
- Partials      241      243       +2     
Impacted Files Coverage Δ
PyPDF2/pdf.py 72.17% <50.00%> (-0.15%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0890b06...3d5548d. Read the comment docs.

@MartinThoma
Copy link
Member

I'm currently trying to find a PDF with an inline image so that the code at least runs once

@MartinThoma
Copy link
Member

Ah, just found your comment again:

from reportlab.pdfgen import canvas
c = canvas.Canvas("test.pdf")
c.drawInlineImage("test.png", 100, 100, 100, 100)
c.drawString(200, 100, "Test")
c.showPage()
c.save()

@MartinThoma MartinThoma merged commit d71fb3e into py-pdf:main Apr 15, 2022
@MartinThoma
Copy link
Member

Thank you so much for all the time you invested into this over so 5 years! 🙏

MartinThoma added a commit that referenced this pull request Apr 15, 2022
Security (SEC):

- ContentStream_readInlineImage had potential infinite loop (#740)

Bug fixes (BUG):

- Fix merging encrypted files (#757)
- CCITTFaxDecode decodeParms can be an ArrayObject (#756)

Robustness improvements (ROBUST):

- title sometimes None (#744)

Documentation (DOC):

- Adjust short description of the package

Tests and Test setup (TST):

- Rewrite JS tests from unittest to pytest (#746)
- Increase Test coverage, mainly with filters (#756)
- Add test for inline images (#758)

Developer Experience Improvements (DEV):

- Remove unused Travis-CI configuration (#747)
- Show code coverage (#754, #755)
- Add mutmut (#760)

Miscellaneous:

- STY: Closing file handles, explicit exports, ... (#743)

All changes: 1.27.4...1.27.5
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

nf-security Non-functional change: Security Tiny Pull requests that make a tiny change - and thus should be easy to merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ContentStream._readInlineImage is really slow on large inline images Manipulated inline images can force PyPDF2 into an infinite loop

3 participants