Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new timestamp formats and simplify timestamp key handling in clp-s: #262

Merged
merged 17 commits into from
Feb 8, 2024

Conversation

gibber9809
Copy link
Contributor

@gibber9809 gibber9809 commented Feb 4, 2024

References

Fixes #261

Description

  • Add some obvious variations of existing timestamp patterns to clp, clp-s, and glt
  • Add "%E" format specifier to clp-s to capture unix epoch millisecond timestamps found in strings
  • Refactor and simplify timestamp column compression code in clp-s
  • Fix a nested timestamp bug

Validation performed

  • Compressed data with the new timestamp formats and verified that the timestamps were parsed succesfully
  • Verified that nested timestamps now exhibit expected behaviour for compression/decompression/search
  • Added unit tests in clp for new timestamp formats
  • Ingested cockroachdb dataset and performed searches on timestamp column

Copy link
Contributor

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Just did the first pass. One thing I'm little bit concerned about is if we mix timestamps in milliseconds with timestamp in nanoseconds, we can get wrong results during search. For example, if we have two log messages {"timestamp": "1707026480000"} and {"timestamp": "1707026455.123456789"}, and the query is timestamp > date(1707026480000), we should return nothing but actually we will return the second log message.

components/core/src/clp_s/TimestampPattern.cpp Outdated Show resolved Hide resolved
@gibber9809
Copy link
Contributor Author

Great work! Just did the first pass. One thing I'm little bit concerned about is if we mix timestamps in milliseconds with timestamp in nanoseconds, we can get wrong results during search. For example, if we have two log messages {"timestamp": "1707026480000"} and {"timestamp": "1707026455.123456789"}, and the query is timestamp > date(1707026480000), we should return nothing but actually we will return the second log message.

Right this is a good point. I modified how we parse literals supplied inside of date() so that its at least consistent with how we parse the log. E.g. if we have timestamps like "1707026455.123456789" in the log and you copy and paste timestamps to specify a filter like timestamp: date(1707026455.123456789) then the timestamp will match as expected. You are right though that when the user starts mixing formats they can get unexpected results.

The core of the problem is that our various timestamp formats already implicitly have different levels of precision (second, millisecond, nanosecond) but we haven't really reconciled these different precisions at the storage, search, or archive metadata levels.

We sort of deal with this problem halfway in TimestampPattern as timestamps which specify second-level precision get stored with millisecond precision, but the same can't be said for raw integer or floating point timestamps. One way to handle this might be to transform all timestamps into nanosecond precision timestamps no matter how they're parsed -- this would make storage, search, and archive metadata consistent in an easy to reason about way. The downside is that this creates an encoding challenge for raw int and float timestamp columns if we want to decompress in a 1:1 way.

The other issue is that "store everything as nanosecond precision" is a bit loaded because it might sometimes be ambiguous/hard to figure out what the precision of some timestamp is supposed to be, so there's still a need to let users disambiguate via configuration for a robust solution (granted for most timestamps it should be pretty obvious).

Anyway if we do something like this (which we should at some point) I'm not sure if we want to include it in this PR (even though its slowly becoming the fix-every-timestamp-issue-we-notice PR).

Co-authored-by: wraymo <37269683+wraymo@users.noreply.github.com>
@wraymo
Copy link
Contributor

wraymo commented Feb 5, 2024

And can we delete TimestampEntry::TimestampEncoding and associated ingest_timestamp(double)?

@gibber9809
Copy link
Contributor Author

And can we delete TimestampEntry::TimestampEncoding and associated ingest_timestamp(double)?

Not yet because we still allow raw floating point columns to be timestamps.

@wraymo
Copy link
Contributor

wraymo commented Feb 5, 2024

And can we delete TimestampEntry::TimestampEncoding and associated ingest_timestamp(double)?

Not yet because we still allow raw floating point columns to be timestamps.

Oh, that's right

Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferring to @wraymo for the validity of the clp-s timestamp-related changes. Added some comments about the new formats added and minor style changes.

components/core/src/glt/TimestampPattern.cpp Show resolved Hide resolved
components/core/src/clp/TimestampPattern.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/ColumnWriter.hpp Outdated Show resolved Hide resolved
components/core/src/clp_s/TimestampPattern.hpp Outdated Show resolved Hide resolved
components/core/src/clp_s/TimestampPattern.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/TimestampPattern.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/TimestampPattern.cpp Outdated Show resolved Hide resolved
return false;
}
auto dot_position = line.find('.');
auto nanosecond_start = dot_position + 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nanoseconds_begin_pos?

components/core/src/clp_s/TimestampPattern.cpp Outdated Show resolved Hide resolved
Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit msg: Add new timestamp formats and simplify timestamp column compression in clp-s (fixes #261).?

Although we should include some text about what was the fix for #261.

@gibber9809 gibber9809 merged commit 7de16f9 into y-scope:main Feb 8, 2024
5 checks passed
@kirkrodrigues kirkrodrigues changed the title Add new timestamp formats and refactor timestamp code in clp-s (fixes #261) Add new timestamp formats and simplify timestamp key handling in clp-s: Feb 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

clp-s: Nested timestamps result in incorrect mst node names
3 participants