
Table.traverse() is extremely slow when table has many blanks #46

Open
GPHemsley-RELX opened this issue Aug 15, 2024 · 2 comments

@GPHemsley-RELX

I have an ODS file that includes the following silly row at the end (saved from Excel):

<table:table-row table:number-rows-repeated="1048563" table:style-name="ro1"><table:table-cell table:number-columns-repeated="16384"/></table:table-row>

Attempting to run Table.traverse() on this file (or any of the row functions that rely on it) without first calling Table.optimize_width() takes an extremely long time for what is essentially just a bunch of blanks.
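
For context, the slow path is roughly this (a minimal sketch, assuming Document() and body.get_table() as the odfdo entry points; the filename is made up):

```python
from odfdo import Document

# Load the spreadsheet and grab the first sheet.
doc = Document("saved_from_excel.ods")
table = doc.body.get_table(position=0)

# Without trimming first, traverse() walks every repeated row, so the
# trailing 1,048,563 blank rows dominate the runtime.
for row in table.traverse():
    pass

# Calling table.optimize_width() beforehand avoids this, but it rewrites
# the table's trailing blank area.
```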

Is this an inherent speed limitation of Python, or can the traverse repeat algorithm be optimized somehow?

@jdum
Owner

jdum commented Aug 16, 2024

A complex question. In short: odfdo is inherently slow.

What traverse() does (and why it is slow):

  • odfdo is clearly slow because of its design: it directly accesses the underlying XML structure. The reason is to allow direct modification of any XML element in the document without using an intermediate format.

  • In ODF, because of the "number-rows-repeated" feature, there is no way to access a row or cell directly without traversing all the previous ones. So naive row iteration would be an O(n²) algorithm and cell iteration an O(n⁴) algorithm (a sketch of the naive scan versus a cached lookup follows this list).

  • The traverse() method (and the other get_row methods) builds and uses a cache: the second time an item is requested, it is no longer necessary to count all the previous rows to find it. So the first access pays a little extra time to build the cache.

  • Another very expensive operation is creating the Row() or Cell() object if it is not already in the cache. So number-rows-repeated="1048563" is nearly a zip bomb: about a million underlying Python and XML objects are created.
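
To make the cost concrete, here is a toy model of the repeat/cache trade-off. This is purely illustrative and not odfdo's actual implementation; the data and function names are made up for the sketch:

```python
import bisect

# Toy model: each entry stands for one <table:table-row> element and
# the number of logical rows it represents via number-rows-repeated.
rows = [("header", 1), ("data", 3), ("blank", 1_048_563)]

def row_at_naive(rows, index):
    """Scan every previous entry to find the one covering `index`:
    O(n) per lookup, hence O(n^2) when iterating all rows this way."""
    position = 0
    for row, repeated in rows:
        if position + repeated > index:
            return row
        position += repeated
    raise IndexError(index)

def build_offset_cache(rows):
    """Cumulative starting offsets, paid for once on the first pass."""
    offsets, total = [], 0
    for _, repeated in rows:
        offsets.append(total)
        total += repeated
    return offsets

def row_at_cached(rows, offsets, index):
    """With the cache, a lookup is a binary search instead of a scan."""
    return rows[bisect.bisect_right(offsets, index) - 1][0]

offsets = build_offset_cache(rows)
assert row_at_naive(rows, 3) == row_at_cached(rows, offsets, 3) == "data"
```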

I don't see much room for optimization. One could imagine stripping the document on opening (with optimize_width() or something), but that would change the actual content of the document, which is bad. In your example, all rows have a style, so maybe the author of the original document wants every row to have a specific appearance: styling a row or column as, say, "blue background" is exactly how you can end up with such a big "number-rows-repeated". The optimize_width() method does cut the document when there is no real content below except the styles, but the document is modified, so this must be a deliberate choice of the user. And maybe you really do want to edit the 100,000th row.
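
If trimming is acceptable, it has to be the user's explicit decision, along these lines (a sketch only; Document(), get_table() and save() are used as I understand the odfdo API, and the filenames are invented):

```python
from odfdo import Document

doc = Document("styled_rows.ods")
table = doc.body.get_table(position=0)

# Deliberate user choice: cut trailing rows/columns that carry no real
# content (only styles remain below this point).
table.optimize_width()

# Saving writes the trimmed table back, so the resulting file differs
# from the original document.
doc.save("styled_rows_trimmed.ods")
```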

@GPHemsley-RELX
Author

GPHemsley-RELX commented Aug 16, 2024

  • odfdo is clearly slow because of its design: it directly accesses the underlying XML structure. The reason is to allow direct modification of any XML element in the document without using an intermediate format.

Yeah, that's the sticking point, I think. The example in question was just a test document Save As'd from XLSX. The real-world documents I'm handling take about 10 minutes to process for a fairly reasonable number of records, though I'm wondering: is that affected by every cell including table:number-columns-spanned="1" table:number-rows-spanned="1"?

I don't see much room for optimization. One could imagine stripping the document on opening (with optimize_width() or something), but that would change the actual content of the document, which is bad. In your example, all rows have a style, so maybe the author of the original document wants every row to have a specific appearance: styling a row or column as, say, "blue background" is exactly how you can end up with such a big "number-rows-repeated". The optimize_width() method does cut the document when there is no real content below except the styles, but the document is modified, so this must be a deliberate choice of the user. And maybe you really do want to edit the 100,000th row.

Yeah, I'm only reading in files to get at their data, so I was trying to avoid this, but it looks like Table.rstrip() may be what I need to squeeze what I can out of things.
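
Roughly what I have in mind, for the record (a sketch; row.get_values() is the accessor I'm assuming, and the filename is made up):

```python
from odfdo import Document

doc = Document("report.ods")
table = doc.body.get_table(position=0)

# Read-only workaround: trim the in-memory copy before iterating and
# never save, so the file on disk stays untouched.
table.rstrip()
data = [row.get_values() for row in table.traverse()]
```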
