
Table.traverse() is extremely slow when table has many blanks #46

Open
GPHemsley-RELX opened this issue Aug 15, 2024 · 2 comments

@GPHemsley-RELX

I have an ODS file that includes the following silly row at the end (saved from Excel):

<table:table-row table:number-rows-repeated="1048563" table:style-name="ro1"><table:table-cell table:number-columns-repeated="16384"/></table:table-row>

Attempting to run Table.traverse() on this file (or any of the row functions that rely on it) without first calling Table.optimize_width() takes an extremely long time for what is essentially just a bunch of blanks.
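
For context, the slow path is roughly this (a minimal sketch, assuming Document() and body.get_table() as the odfdo entry points; the filename is made up):

```python
from odfdo import Document

# Load the spreadsheet and grab the first sheet.
doc = Document("saved_from_excel.ods")
table = doc.body.get_table(position=0)

# Without trimming first, traverse() walks every repeated row, so the
# trailing 1,048,563 blank rows dominate the runtime.
for row in table.traverse():
    pass

# Calling table.optimize_width() beforehand avoids this, but it rewrites
# the table's trailing blank area.
```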

Is this an inherent speed limitation of Python, or can the traverse repeat algorithm be optimized somehow?

@jdum
Owner

jdum commented Aug 16, 2024

A complex question. In short: odfdo is inherently slow.

What traverse() does (and why it is slow):

  • odfdo is clearly slow because of its design: it directly accesses the underlying XML structure. The reason is to allow direct modification of any XML element in the document without using an intermediate format.

  • In ODF, because of the "number-rows-repeated" feature, there is no way to access a row or cell directly without traversing all the previous ones. So naive row iteration would be an O(n²) algorithm and cell iteration an O(n⁴) algorithm (a sketch of the naive scan versus a cached lookup follows this list).

  • The traverse() method (and the other get_row methods) builds and uses a cache: the second time an item is requested, it is no longer necessary to count all the previous rows to find it. So the first access pays a little extra time to build the cache.

  • Another very expensive operation is creating the Row() or Cell() object if it is not already in the cache. So number-rows-repeated="1048563" is nearly a zip bomb: about a million underlying Python and XML objects are created.
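
To make the cost concrete, here is a toy model of the repeat/cache trade-off. This is purely illustrative and not odfdo's actual implementation; the data and function names are made up for the sketch:

```python
import bisect

# Toy model: each entry stands for one <table:table-row> element and
# the number of logical rows it represents via number-rows-repeated.
rows = [("header", 1), ("data", 3), ("blank", 1_048_563)]

def row_at_naive(rows, index):
    """Scan every previous entry to find the one covering `index`:
    O(n) per lookup, hence O(n^2) when iterating all rows this way."""
    position = 0
    for row, repeated in rows:
        if position + repeated > index:
            return row
        position += repeated
    raise IndexError(index)

def build_offset_cache(rows):
    """Cumulative starting offsets, paid for once on the first pass."""
    offsets, total = [], 0
    for _, repeated in rows:
        offsets.append(total)
        total += repeated
    return offsets

def row_at_cached(rows, offsets, index):
    """With the cache, a lookup is a binary search instead of a scan."""
    return rows[bisect.bisect_right(offsets, index) - 1][0]

offsets = build_offset_cache(rows)
assert row_at_naive(rows, 3) == row_at_cached(rows, offsets, 3) == "data"
```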

I don't see much room for optimization. One could imagine stripping the document on opening (with optimize_width() or something), but that would change the actual content of the document, which is bad. In your example, all rows have a style, so maybe the author of the original document wants every row to have a specific appearance: styling a row or column as, say, "blue background" is exactly how you can end up with such a big "number-rows-repeated". The optimize_width() method does cut the document when there is no real content below except the styles, but the document is modified, so this must be a deliberate choice of the user. And maybe you really do want to edit the 100,000th row.
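
If trimming is acceptable, it has to be the user's explicit decision, along these lines (a sketch only; Document(), get_table() and save() are used as I understand the odfdo API, and the filenames are invented):

```python
from odfdo import Document

doc = Document("styled_rows.ods")
table = doc.body.get_table(position=0)

# Deliberate user choice: cut trailing rows/columns that carry no real
# content (only styles remain below this point).
table.optimize_width()

# Saving writes the trimmed table back, so the resulting file differs
# from the original document.
doc.save("styled_rows_trimmed.ods")
```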

@GPHemsley-RELX
Author

GPHemsley-RELX commented Aug 16, 2024

  • odfdo is clearly slow because of its design: it directly accesses the underlying XML structure. The reason is to allow direct modification of any XML element in the document without using an intermediate format.

Yeah, that's the sticking point, I think. The example in question was just a test document Save As'd from XLSX. The real-world documents I'm handling take about 10 minutes to process for a fairly reasonable number of records, though I'm wondering: is that affected by every cell including table:number-columns-spanned="1" table:number-rows-spanned="1"?

I don't see much room for optimization. One could imagine stripping the document on opening (with optimize_width() or something), but that would change the actual content of the document, which is bad. In your example, all rows have a style, so maybe the author of the original document wants every row to have a specific appearance: styling a row or column as, say, "blue background" is exactly how you can end up with such a big "number-rows-repeated". The optimize_width() method does cut the document when there is no real content below except the styles, but the document is modified, so this must be a deliberate choice of the user. And maybe you really do want to edit the 100,000th row.

Yeah, I'm only reading in files to get at their data, so I was trying to avoid this, but it looks like Table.rstrip() may be what I need to squeeze what I can out of things.
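
Roughly what I have in mind, for the record (a sketch; row.get_values() is the accessor I'm assuming, and the filename is made up):

```python
from odfdo import Document

doc = Document("report.ods")
table = doc.body.get_table(position=0)

# Read-only workaround: trim the in-memory copy before iterating and
# never save, so the file on disk stays untouched.
table.rstrip()
data = [row.get_values() for row in table.traverse()]
```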
