Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data Validation in XML reader #3592

Closed
1 task done
SlowFox71 opened this issue May 31, 2023 · 5 comments · Fixed by #3605
Closed
1 task done

Data Validation in XML reader #3592

SlowFox71 opened this issue May 31, 2023 · 5 comments · Fixed by #3605

Comments

@SlowFox71
Copy link

This is:

- [ ] a bug report
- [x] a feature request
- [x] **not** a usage question (ask them on https://stackoverflow.com/questions/tagged/phpspreadsheet or https://gitter.im/PHPOffice/PhpSpreadsheet)

What is the expected behavior?

Respect datavalidation in the XML reader

What is the current behavior?

Validation is not yet parsed.

What features do you think are causing the issue

  • Reader

Does an issue affect all spreadsheet file formats? If not, which formats are affected?

Only XML format is affected

Which versions of PhpSpreadsheet and PHP are affected?

All versions are affected

I attach a simple XML file which should cover the most relevant cases - if you need anything more just ask.
As I am still stuck with an ancient version I implemented this myself (quite rude, but it still might be used as a starting point - actually it seems easier than I expected. $children_x are the Excel-namespaced children of the worksheet).

Care must be taken with the "hide"-Options: XML uses "show" as a default and can only hide; in PhpSpreadsheet the defaults are set the other way round.

One big issue seems to be performance if ranges are used. Column A in the sample file uses a simple, single validation rule, column B uses one with exception of B1 - the XML reflects how Excel thinks of this. I have not found a way to replicate this into PhpSpreadsheet, but using Coordinate::extractAllCellReferencesInRange() to split ranges into single cells is really expensive.
validation.txt
validate.txt

@oleibman
Copy link
Collaborator

oleibman commented Jun 1, 2023

In order to avoid the performance problem that you mention with ranges, the Xlsx Reader calls $sheet->shrinkRangeToFit($cell); this causes initialization of cells only up to the maximum row and column used on a sheet, not the maximum row and column in the range. That could easily be done with your code as well. It is certainly an imperfect solution. I don't know if it can be improved upon; I can imagine many complications.

@oleibman
Copy link
Collaborator

oleibman commented Jun 1, 2023

At any rate, you have created an unusually good example. One of your data validations exposes a nasty problem. Your "any of the texts in E1:E5", which Xlsx would store as <formula1>E1:E5</formula1>, Xml stores as <Value>R[-1]C[1]:R[3]C[1]</Value>. If we load the Xml file, and save it as Xlsx, passing the Xml Value as an Xlsx formula unchanged, Excel will complain that the Xlsx file is corrupt. Identifying when the Value needs to be changed and performing the change are probably activities that occur elsewhere in the codebase, but it will take a while to pin down, and to test adequately.

@SlowFox71
Copy link
Author

Thanks for the hint to shrinkRangeToFit. This does the trick for my purposes, though in general it could be a problem, but for all formats.

To explore this bug I added a simple formula to my test sheet:
validation.txt
If I open it with Excel and save as Xlsx, both formula and validation are converted to A1 notation.

If I open it with PhpSpreadsheet and save as Xlsx the following file is produced:
validation_result.xlsx

PhpSpreadsheet tries to convert the formula but leaves the validation rule unchanged - is there any syntactical difference or could the same code be used? But in either way the code seems to be flawed (see the example) as PhpSpreadsheet ruins the formula. If helpful and if I did not make any mistake during conversion, I could open a separate bug report for this.

@oleibman
Copy link
Collaborator

oleibman commented Jun 2, 2023

No need for a separate bug report. I am aware of the additional problem and am working on it.

@oleibman
Copy link
Collaborator

oleibman commented Jun 2, 2023

And ... the DataValidator isValid function is incomplete - it currently validates only one type. That will need to be fixed as part of the PR I'm working on.

oleibman added a commit to oleibman/PhpSpreadsheet that referenced this issue Jun 3, 2023
Fix PHPOffice#3592. Fix PHPOffice#3594. The original issue asked that Data Validation be added for Xml spreadsheets. However, doing so exposed some other Data Validation problems, which are corrected along with the Xml portion.

The main other problem is that the code for `Cell\DataValidator::isValid` covered almost no possibilities. It is expanded to include all possibilities, except for Custom Types, which I will continue to think about. Many tests are added.

An obscure problem is that Xlsx Reader does not quite work properly when the Data Validation cells are beyond the high-used column or row. We use the high-used values because it is completely impractical to set all the cells in a range when an entire column or row is selected. This change causes the leftmost top cell in a range to be explicitly allocated, affecting the high-used values. It is not 100% effective, but it will be much more difficult to get into the problem situation, which I will continue to think about.

Although our expectation is that we will not see any Xml Spreadsheets using non-standard namespacing, much of the Xml Reader code is changed to use proper namespace techniques. A few areas which just had nothing to do with this change continue to use less robust techniques; I plan to address those soon after implementing this change.

One of the Data Validation types is for time of day. Testing revealed that `Shared\Date::convertIsoDate` was insufficiently robust when determining whether its input consisted of a time of day without date. It is now more robust.
oleibman added a commit that referenced this issue Jun 8, 2023
Fix #3592. Fix #3594. The original issue asked that Data Validation be added for Xml spreadsheets. However, doing so exposed some other Data Validation problems, which are corrected along with the Xml portion.

The main other problem is that the code for `Cell\DataValidator::isValid` covered almost no possibilities. It is expanded to include all possibilities, except for Custom Types, which I will continue to think about. Many tests are added.

An obscure problem is that Xlsx Reader does not quite work properly when the Data Validation cells are beyond the high-used column or row. We use the high-used values because it is completely impractical to set all the cells in a range when an entire column or row is selected. This change causes the leftmost top cell in a range to be explicitly allocated, affecting the high-used values. It is not 100% effective, but it will be much more difficult to get into the problem situation, which I will continue to think about.

Although our expectation is that we will not see any Xml Spreadsheets using non-standard namespacing, much of the Xml Reader code is changed to use proper namespace techniques. A few areas which just had nothing to do with this change continue to use less robust techniques; I plan to address those soon after implementing this change.

One of the Data Validation types is for time of day. Testing revealed that `Shared\Date::convertIsoDate` was insufficiently robust when determining whether its input consisted of a time of day without date. It is now more robust.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

2 participants