Cache dict is overridden after write_pdf (breaking it) #2228

Closed
jbtwist opened this issue Aug 14, 2024 · 5 comments
Labels
crash Problems preventing documents from being rendered

@jbtwist

jbtwist commented Aug 14, 2024

I'm using WeasyPrint 61.2 in my web app to send emails in bulk to many users. These emails might have attached PDFs that I generate using the library. For performance, I use the cache kwarg to avoid generating my pictures multiple times.

When I start sending emails in bulk, WeasyPrint generates the PNG file in the first iteration and stores it in my cache after calling Document.build_formatting_structure. This method generates this cache:
Screenshot 2024-08-14 at 16 05 23

Here every picture has two entries: one with a hash identifying it, and another with the image URL.

The html.write_pdf method then continues executing normally, but once it finishes, my cache has changed to this:
Screenshot 2024-08-14 at 16 13 41

The problem comes with the next iterations of html.write_pdf.
My cache has changed, and the method Document.build_formatting_structure does not return it to its previous state.

When the code continues and reaches the point of getting the data from my PNG in RasterImage.get_x_object, self.image.data does not contain the PNG bytes we see in the first picture but the bytes we see in the second one, causing it to fail by raising UnidentifiedImageError('cannot identify image file <_io.BytesIO object at 0x11db6c5e0>').

I have tried with all my heart to solve this by myself, as you might see, but I can't find the place where the cache is changing, nor do I understand what is being written in its place.
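
One way to locate where a cache entry changes is to pass a dict subclass that records every overwrite together with a stack trace. The following is only a minimal sketch: `TracingCache` is a hypothetical helper, not part of WeasyPrint, and it assumes the library writes to the cache through plain item assignment (`cache[key] = value`); writes made through `dict.update` or `dict.setdefault` would not be seen by it.

```python
import traceback


class TracingCache(dict):
    """Hypothetical dict subclass that records overwrites of existing keys."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Each entry is (key, old_value, new_value, formatted_stack).
        self.overwrites = []

    def __setitem__(self, key, value):
        # Only record the case where an existing key gets a different value.
        if key in self and self[key] != value:
            # Drop the last frame, which is this method itself.
            stack = ''.join(traceback.format_stack(limit=8)[:-1])
            self.overwrites.append((key, self[key], value, stack))
        super().__setitem__(key, value)


# Simulated usage with made-up keys and bytes (not real WeasyPrint data).
cache = TracingCache()
cache['img-hash'] = b'\x89PNG\r\n\x1a\nfull-file'
cache['img-hash'] = b'transformed-bytes'  # simulated second write

for key, old, new, stack in cache.overwrites:
    print(f'{key!r} overwritten: {old[:8]!r} -> {new[:8]!r}')
    print(stack)
```

In a real run, the recorded stack for each overwrite would point at the call site that replaced the entry, which is the information that was missing in the investigation above.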

@liZe liZe added the crash Problems preventing documents from being rendered label Aug 15, 2024
@liZe
Member

liZe commented Aug 15, 2024

There’s definitely a problem, thanks for reporting.

@alexandergitter
Contributor

While I can't say I completely understand the code, it appears to be related to how RasterImage uses the same cache key for different data (at least in the case of png):

The data written by get_x_object is not the entire png file though - it gets mangled by _get_png_data, which e.g. removes the png header.
Another run with the same cache then fails because get_x_object itself expects to read a full png file in different places.

This also appears to be closely related to #1942, which afaict fails for the same underlying reason.
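
The failure mode described above can be reproduced in isolation with a toy model. This sketch is not WeasyPrint's code: `render_buggy` and `render_fixed` are hypothetical stand-ins, and stripping the 8-byte PNG signature is only a placeholder for whatever _get_png_data really does. The second function shows one possible remedy (separate keys for the source file and the transformed data); it is not claimed to be the fix that was actually committed.

```python
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'  # the standard 8-byte PNG file signature


def render_buggy(cache, key, source_bytes):
    """Toy renderer that reuses one cache key for two kinds of data."""
    data = cache.setdefault(key, source_bytes)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError('cannot identify image file')
    stripped = data[len(PNG_SIGNATURE):]  # placeholder for header removal
    cache[key] = stripped  # bug: replaces the full file with stripped data
    return stripped


def render_fixed(cache, key, source_bytes):
    """Same toy renderer, but with distinct keys per kind of data."""
    data = cache.setdefault(('source', key), source_bytes)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError('cannot identify image file')
    stripped = data[len(PNG_SIGNATURE):]
    cache[('stream', key)] = stripped  # the source entry stays intact
    return stripped


png = PNG_SIGNATURE + b'fake-chunks'  # made-up payload, not a real image

# Buggy variant: the first run works, the second finds mangled data.
buggy_cache = {}
render_buggy(buggy_cache, 'logo', png)
try:
    render_buggy(buggy_cache, 'logo', png)
    second_run_failed = False
except ValueError:
    second_run_failed = True

# Fixed variant: repeated runs with the same cache keep working.
fixed_cache = {}
first = render_fixed(fixed_cache, 'logo', png)
second = render_fixed(fixed_cache, 'logo', png)
```

The point of the sketch is only that a reader expecting a full file and a writer storing transformed data must not share a key when the cache outlives a single run.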

@liZe
Member

liZe commented Aug 20, 2024

While I can't say I completely understand the code, it appears to be related to how RasterImage uses the same cache key for different data (at least in the case of png):

Yes, you’re right. I’ll fix the bug as soon as I can.

@jbtwist
Author

jbtwist commented Aug 20, 2024

Many thanks for the quick action and for finding the actual bug. I tried my best to do it, but I felt incapable (or at least it would have taken me a lot of time, and I had more work to do) due to my lack of familiarity with the whole project. I gave all the info I was able to discover by myself.

@liZe liZe closed this as completed in 176bd71 Aug 22, 2024
@liZe liZe added this to the 63.0 milestone Aug 22, 2024
@liZe
Member

liZe commented Aug 22, 2024

Thanks a lot for the report and the investigation. Feedback is welcome!
