Add new timestamp formats and simplify timestamp key handling in clp-s: #262

gibber9809 · 2024-02-04T18:13:18Z

Add new timestamp formats for clp, clp-s, and glt
Remove obsolete FLOATDATESTRING MST node
Handle timestamp keys in JsonParser rather than column writers (fixes clp-s: Nested timestamps result in incorrect mst node names #261)

References

Fixes #261

Description

Add some obvious variations of existing timestamp patterns to clp, clp-s, and glt
Add "%E" format specifier to clp-s to capture unix epoch millisecond timestamps found in strings
Refactor and simplify timestamp column compression code in clp-s
Fix a nested timestamp bug
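
To illustrate what a "%E"-style match involves, here is a minimal sketch of parsing a string that holds a Unix epoch millisecond timestamp. The function name `parse_epoch_millis` is hypothetical and is not taken from the clp-s source; the actual TimestampPattern code may differ.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical helper (not the clp-s API): parses a string that holds a
// decimal integer as a Unix epoch timestamp in milliseconds.
std::optional<int64_t> parse_epoch_millis(std::string_view text) {
    if (text.empty()) {
        return std::nullopt;
    }
    int64_t value{};
    auto const* begin = text.data();
    auto const* end = begin + text.size();
    auto [parse_end, error_code] = std::from_chars(begin, end, value);
    // Reject overflow and any trailing input such as ".123456789"
    if (std::errc{} != error_code || parse_end != end) {
        return std::nullopt;
    }
    return value;
}
```

Under these assumptions, `parse_epoch_millis("1707026480000")` yields 1707026480000, while a string containing a '.' is rejected and left for a different pattern to handle.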

Validation performed

Compressed data with the new timestamp formats and verified that the timestamps were parsed successfully
Verified that nested timestamps now exhibit expected behaviour for compression/decompression/search
Added unit tests in clp for new timestamp formats
Ingested cockroachdb dataset and performed searches on timestamp column

wraymo

Great work! Just did the first pass. One thing I'm a little bit concerned about is that if we mix timestamps in milliseconds with timestamps in nanoseconds, we can get wrong results during search. For example, if we have two log messages {"timestamp": "1707026480000"} and {"timestamp": "1707026455.123456789"}, and the query is timestamp > date(1707026480000), we should return nothing, but we will actually return the second log message.
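
The mismatch in this example can be reproduced with plain integer comparisons. This is only a sketch: it assumes the first message is stored as epoch milliseconds and the second as epoch nanoseconds, which is an illustration of the concern rather than a description of the actual storage layout in clp-s. All names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed stored values (not taken from the clp-s source):
// "1707026480000" kept as epoch milliseconds
constexpr int64_t cMillisecondMessage{1'707'026'480'000LL};
// "1707026455.123456789" kept as epoch nanoseconds
constexpr int64_t cNanosecondMessage{1'707'026'455'123'456'789LL};
// Literal from date(1707026480000), interpreted as milliseconds
constexpr int64_t cQueryLiteral{1'707'026'480'000LL};

constexpr int64_t cNanosPerMilli{1'000'000};

// Naive filter: compares the raw stored integer with the literal, ignoring units
constexpr bool matches_naive(int64_t stored, int64_t literal) {
    return stored > literal;
}

// Unit-aware filter: both operands must already be in nanoseconds
constexpr bool matches_in_nanoseconds(int64_t stored_nanos, int64_t literal_nanos) {
    return stored_nanos > literal_nanos;
}
```

With these values, `matches_naive(cNanosecondMessage, cQueryLiteral)` is true even though that message is 25 seconds earlier than the query bound, whereas `matches_in_nanoseconds(cNanosecondMessage, cQueryLiteral * cNanosPerMilli)` is false.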

components/core/src/clp_s/TimestampPattern.cpp

gibber9809 · 2024-02-05T21:46:10Z

Right, this is a good point. I modified how we parse literals supplied inside of date() so that it's at least consistent with how we parse the log. E.g. if we have timestamps like "1707026455.123456789" in the log and you copy and paste timestamps to specify a filter like timestamp: date(1707026455.123456789), then the timestamp will match as expected. You are right, though, that when the user starts mixing formats they can get unexpected results.

The core of the problem is that our various timestamp formats already implicitly have different levels of precision (second, millisecond, nanosecond) but we haven't really reconciled these different precisions at the storage, search, or archive metadata levels.

We sort of deal with this problem halfway in TimestampPattern as timestamps which specify second-level precision get stored with millisecond precision, but the same can't be said for raw integer or floating point timestamps. One way to handle this might be to transform all timestamps into nanosecond precision timestamps no matter how they're parsed -- this would make storage, search, and archive metadata consistent in an easy to reason about way. The downside is that this creates an encoding challenge for raw int and float timestamp columns if we want to decompress in a 1:1 way.
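
One possible shape for the "transform everything into nanoseconds" idea is sketched below. It covers integer timestamps only and keeps the original precision next to the normalized value so the input can be restored on decompression. The type and function names are hypothetical, and overflow checks are omitted for brevity.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical precision tag recorded alongside each normalized timestamp
enum class Precision : uint8_t {
    Seconds,
    Milliseconds,
    Nanoseconds
};

constexpr int64_t nanos_per_unit(Precision precision) {
    switch (precision) {
        case Precision::Seconds:
            return 1'000'000'000;
        case Precision::Milliseconds:
            return 1'000'000;
        case Precision::Nanoseconds:
            return 1;
    }
    return 1;
}

struct NormalizedTimestamp {
    int64_t epoch_nanos;           // value used for search and archive metadata
    Precision original_precision;  // kept so decompression can restore the input
};

// Scales an integer timestamp to nanoseconds (no overflow check in this sketch)
constexpr NormalizedTimestamp normalize(int64_t value, Precision precision) {
    return {value * nanos_per_unit(precision), precision};
}

// Recovers the original integer value from the normalized form
constexpr int64_t restore(NormalizedTimestamp timestamp) {
    return timestamp.epoch_nanos / nanos_per_unit(timestamp.original_precision);
}
```

For example, `normalize(1707026480, Precision::Seconds)` and `normalize(1707026480000, Precision::Milliseconds)` both produce 1707026480000000000 nanoseconds, so they compare equal during search. Floating point columns would need more than a precision tag to decompress 1:1, which is the encoding challenge mentioned above.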

The other issue is that "store everything as nanosecond precision" is a bit loaded because it might sometimes be ambiguous/hard to figure out what the precision of some timestamp is supposed to be, so there's still a need to let users disambiguate via configuration for a robust solution (granted for most timestamps it should be pretty obvious).

Anyway, if we do something like this (which we should at some point), I'm not sure if we want to include it in this PR (even though it's slowly becoming the fix-every-timestamp-issue-we-notice PR).

wraymo · 2024-02-05T21:48:45Z

And can we delete TimestampEntry::TimestampEncoding and associated ingest_timestamp(double)?

gibber9809 · 2024-02-05T21:51:55Z

Not yet because we still allow raw floating point columns to be timestamps.

wraymo · 2024-02-05T21:53:30Z

Oh, that's right

kirkrodrigues

Deferring to @wraymo for the validity of the clp-s timestamp-related changes. Added some comments about the new formats added and minor style changes.

components/core/src/glt/TimestampPattern.cpp

components/core/src/clp/TimestampPattern.cpp

components/core/src/clp_s/ColumnWriter.hpp

components/core/src/clp_s/TimestampPattern.hpp

components/core/src/clp_s/TimestampPattern.cpp

kirkrodrigues · 2024-02-07T21:25:28Z

components/core/src/clp_s/TimestampPattern.cpp

+                        return false;
+                    }
+                    auto dot_position = line.find('.');
+                    auto nanosecond_start = dot_position + 1;


nanoseconds_begin_pos?
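
For context, a self-contained sketch of the kind of parsing this snippet belongs to is shown below, using the suggested name `nanoseconds_begin_pos`. Everything else (the function name, the 9-digit limit on the fraction, and the rejection of negative values) is an assumption made for illustration and is not the code under review.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <limits>
#include <optional>
#include <string_view>

// Hypothetical helper: parses "<seconds>.<1 to 9 fractional digits>" into
// epoch nanoseconds. Strings without a '.' are not treated as float timestamps.
std::optional<int64_t> parse_float_timestamp_nanos(std::string_view line) {
    constexpr int64_t cNanosPerSecond{1'000'000'000};

    auto const dot_position = line.find('.');
    if (std::string_view::npos == dot_position) {
        return std::nullopt;
    }
    auto const nanoseconds_begin_pos = dot_position + 1;
    auto const seconds_text = line.substr(0, dot_position);
    auto const fraction_text = line.substr(nanoseconds_begin_pos);
    if (seconds_text.empty() || fraction_text.empty() || fraction_text.size() > 9) {
        return std::nullopt;
    }

    // Parse the whole-second part; negative values are rejected in this sketch
    int64_t seconds{};
    auto const* seconds_end = seconds_text.data() + seconds_text.size();
    auto [parse_end, error_code] = std::from_chars(seconds_text.data(), seconds_end, seconds);
    if (std::errc{} != error_code || parse_end != seconds_end || seconds < 0) {
        return std::nullopt;
    }
    // Guard against overflow when scaling to nanoseconds
    if (seconds >= std::numeric_limits<int64_t>::max() / cNanosPerSecond) {
        return std::nullopt;
    }

    // Parse the fractional digits, then pad on the right to nanosecond scale
    int64_t fraction{0};
    for (char const digit : fraction_text) {
        if (digit < '0' || digit > '9') {
            return std::nullopt;
        }
        fraction = fraction * 10 + (digit - '0');
    }
    for (auto i = fraction_text.size(); i < 9; ++i) {
        fraction *= 10;
    }
    return seconds * cNanosPerSecond + fraction;
}
```

Under these assumptions, "1707026455.123456789" parses to 1707026455123456789, and "1707026480000" is rejected because it contains no '.'.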

components/core/src/clp_s/TimestampPattern.cpp

components/core/src/clp_s/TimestampPattern.hpp

kirkrodrigues

Commit msg: Add new timestamp formats and simplify timestamp column compression in clp-s (fixes #261).?

Although we should include some text about what the fix for #261 was.

gibber9809 added 5 commits February 3, 2024 17:35

Add new timestamp formats to clp/clp-s/glt

7ec27c0

Add more timestamp patterns to clp-s

1bd92d9

Only treat timestamps as floatdates if they contain '.'

ab1e5e4

Fix a nested timestamp bug, and refactor column writers

c3df5c4

Remove unnecessary FloatDateStringColumnWriter class

fbbe349

gibber9809 requested review from kirkrodrigues and wraymo February 5, 2024 15:32

gibber9809 added 7 commits February 5, 2024 18:25

Add unit tests for new timestamp patterns

f602bcf

clp-s: Add nanosecond floating point timestamp format for string time…

966a117

…stamps

Fix syntax issue

af38f6c

clp-s: Remove FLOATDATESTRING column and FloatDateT Literal type

5b4236b

Merge remote-tracking branch 'upstream/main' into add-timestamp-formats

8fce10d

Fix remaining FloatDateString instance after merge

ab51910

Fix bug in TimestampDictionaryReader

cf9e000

wraymo requested changes Feb 5, 2024

components/core/src/clp_s/TimestampPattern.cpp

Apply code review suggestion

bf07c0c

Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>

kirkrodrigues requested changes Feb 7, 2024

gibber9809 commented Feb 7, 2024

components/core/src/clp_s/TimestampPattern.hpp

gibber9809 and others added 4 commits February 7, 2024 17:00

Apply suggestions from code review

a3529f7

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>

Remove duplicated timestamp format

703ba86

Re-order timestamp patterns, and add missing YYYY-MM-DDTHH:MM:SS pattern

e97917e

clp-s: Minor refactor and optimization in TimestampDictionaryWriter

d6d0afb

gibber9809 requested review from kirkrodrigues and wraymo February 8, 2024 03:47

kirkrodrigues approved these changes Feb 8, 2024

wraymo approved these changes Feb 8, 2024

gibber9809 merged commit 7de16f9 into y-scope:main Feb 8, 2024
5 checks passed

kirkrodrigues changed the title ~~Add new timestamp formats and refactor timestamp code in clp-s (fixes #261)~~ Add new timestamp formats and simplify timestamp key handling in clp-s: Feb 9, 2024
