Fix parsing of row_per_reading format with malformed dates #3967

ldodds · 2024-10-09T16:36:55Z

Half-hourly data can be labelled either at the start or the end of the half-hourly period. E.g. the usage between 1.30am and 2.00am might be labelled as "01:30" or "02:00".

For CSV/spreadsheet formats where there is a single row per day, there is generally no ambiguity, as we parse the 48 HH columns in the order presented.

But for row_per_reading formats there is some ambiguity. Without more context, a reading labelled as "8 Oct 2024 00:00" may refer to the usage from 23:30 to midnight on 7th October, or the usage from midnight to 00:30 on 8th October.

When processing timestamps, our SingleReadConverter has been written with the assumption that fields are labelled at the end of the half hour, so "8 Oct 2024 00:00" is taken to refer to the final half-hourly period of 7th October. The data is then shifted to match.

Reviewing the EDF data shows that EDF label their periods at the start of the half-hour (which is also our internal default). This means we're currently interpreting their timestamps incorrectly.
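To make the difference concrete, here is a minimal sketch of how the same timestamp lands in a different day and half-hour slot depending on the labelling convention. `half_hour_slot` and its `labelled_at:` parameter are hypothetical names used for illustration only; this is not the SingleReadConverter code.

```ruby
require 'time'
require 'date'

# Hypothetical helper (an assumption for illustration, not project code).
# Returns [date, half-hour index 0..47] for a reading timestamp.
def half_hour_slot(timestamp, labelled_at: :start)
  # End-labelled readings describe the 30 minutes *before* the timestamp,
  # so step back to the start of the period before deriving date and index.
  period_start = labelled_at == :end ? timestamp - (30 * 60) : timestamp
  index = (period_start.hour * 2) + (period_start.min / 30)
  [period_start.to_date, index]
end

reading = Time.utc(2024, 10, 8, 0, 0)
half_hour_slot(reading, labelled_at: :start) # date 2024-10-08, index 0
half_hour_slot(reading, labelled_at: :end)   # date 2024-10-07, index 47
```

With start labelling the midnight reading is the first period of 8th October; with end labelling it is the last period of 7th October, which is the interpretation that was wrongly being applied to EDF.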

This PR fixes this by:

* doing a fairly extensive rework of the supporting tests to remove a lot of boilerplate and expose the error (the behaviour was previously being tested as OK)
* adding a new flag to AmrDataFeedConfig to indicate how the half-hourly periods are labelled. This is only expected to be populated for row_per_reading formats
* adding some additional constraints to AmrDataFeedConfig to reduce the likelihood of incorrect configs being created (although this hasn't happened yet)
* reworking the SingleReadConverter to use this configuration to fix the issue, along with some refactoring to better expose the different ways in which we derive the half-hourly index (e.g. from an index in the original file, a time column or via a timestamp)
* updating the EDF config to use the new setting
* some additional rework of the parsing of timestamps for other formats to simplify the code and fix a spec that was failing locally but not on GitHub
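As a rough illustration of the kind of constraint described above, the sketch below models a labelling flag that is only accepted for row_per_reading configs. The class name `ExampleFeedConfig`, the `period_labelling` attribute and the exact rules are assumptions made for this example; the real AmrDataFeedConfig validations may differ.

```ruby
# Hypothetical stand-in for a feed config (names and rules are assumptions).
class ExampleFeedConfig
  LABELLINGS = %i[start end].freeze

  attr_reader :row_per_reading, :period_labelling

  def initialize(row_per_reading:, period_labelling: nil)
    @row_per_reading = row_per_reading
    @period_labelling = period_labelling
  end

  # Returns a list of problems; an empty list means the config is acceptable.
  def errors
    problems = []
    return problems if period_labelling.nil?

    unless LABELLINGS.include?(period_labelling)
      problems << 'period_labelling must be :start or :end'
    end
    unless row_per_reading
      problems << 'period_labelling only applies to row_per_reading formats'
    end
    problems
  end

  def valid?
    errors.empty?
  end
end

ExampleFeedConfig.new(row_per_reading: true, period_labelling: :start).valid?  # => true
ExampleFeedConfig.new(row_per_reading: false, period_labelling: :start).valid? # => false
```

Rejecting the flag on other formats keeps it from being set where it would have no effect, which is one way to reduce the chance of a misleading config.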

Rework specs to expose error

ec564e4

ldodds changed the title ~~Rework specs to expose error~~ Fix parsing of row_per_reading format with malformed dates Oct 9, 2024

ldodds added 15 commits October 9, 2024 17:50

More tidying

c0612cb

Tweak iso spec

d5d633d

Tidy

e482327

Finish refactoring specs

6183bd2

Add config for labelling hh periods, rework index code

887e80c

Tidy specs and improve factory

3f1905f

Fix up specs

2f1374b

Add validations

e6f9a91

Reconfigure EDF

515ee70

Test

829641b

Another test

58ba69a

Rework code

1bae6b1

Rubocop

1a714a2

Raise error for invalid datetime

a5a9256

Tidy up

cd24bd2

ldodds requested a review from tbhi October 11, 2024 16:49

ldodds marked this pull request as ready for review October 11, 2024 16:49
