Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shopify: Incremental and other improvements #222

Merged
merged 14 commits into from
Jul 24, 2023
Merged

Shopify: Incremental and other improvements #222

merged 14 commits into from
Jul 24, 2023

Conversation

steinitzu
Copy link
Collaborator

Tell us what you do here

  • implementing verified source (please link a relevant issue labelled as verified source)
  • fixing a bug (please link a relevant bug report)
  • improving, documenting, customizing existing source (please link an issue or describe below)
  • anything else (please link an issue or describe below)

Relevant issue

closes #217

More PR info

@steinitzu steinitzu requested a review from rudolfix July 21, 2023 18:27
Copy link
Contributor

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@steinitzu thanks! looks good. besides a few suggestions to simplify code
@adrianbr if you could look at the comment about created_at behavior

sources/shopify_dlt/__init__.py Outdated Show resolved Hide resolved

Returns:
Iterable[TDataItem]: A generator of products.
"""
page = shopify.Product.find(
updated_at_min=updated_at.last_value, created_at_min=start_date
params = dict(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah interesting question if we should set created_at_min=start_date as we did in the previous version.

  • if we set it items that were created before start_date will never be extracted, even if they were updated after that date
  • if we do not set it, any updated item may be extracted
    @adrianbr any opinions? if not, I'd set it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better if we use a separate min_created_at argument in the source I think.
It's a little problematic with end value loading to use the same start date, because an item's created and updated can each be in range of different chunks so they'd never get loaded even if it should be in the total range.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes you are right. we need to handle backloading differently.

  • when end date is specified we use only created_at_min and created_at_max to select range
  • when we have incremental load we specify created_at_min (set to start_date - which is also initial_value) and updated_at_min set to last_value
    there's a grey are here: should we also track updates of records from backloaded range? in case of the above it is not the case. we'd need another parameter to set created_at and updated_at separately

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah I'd say it makes sense to load updates from backloaded range too.
What is the idea behind the created_at cutoff originally?
I'm struggling with seeing the usefulness in it beyond just filtering by updated_at. I'd think old orders you want to ignore aren't updated anyway usually.

Copy link
Contributor

@rudolfix rudolfix Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are always getting the most recent state of the entity and update/create just filter which entities you want to get. so there's no way IMO to get previous state of it. anyway. I kind of agree with what you say.

  1. let's have our start/end date settings working only with updated_at_min and updated_at_max (this one you need to add)
  2. I'd add another argument created_at_min to cutoff the entities created before that date. the argument is optional and defaults to FIRST_DAY_OF_MILLENNIUM

the typical usage would be like in the incremental_load_with_backloading example: backloading and then incremental load. all based on updated_at.
On top of that we give possibility to filter out all records that were created before initial value of current_start_date in the demo. so the updated_at_min argument is always set to it, during backloading and on incremental load.

WDYT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes a lot of sense, and I think it's more clear with the separate argument.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added created_at_min arg now and test for it.
updated_at_max is used now too.

elif isinstance(value, date):
return pendulum.datetime(value.year, value.month, value.day)
elif isinstance(value, str):
return pendulum.parse(value) # type: ignore[return-value]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pendulum parse cannot be trusted, my take is to accept a few ISO like dates, we have it in dlt.common.time module: parse_iso_like_datetime. I'm also not sure that pendulum.parse always returns datetime with timezone

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I'm using parse_iso_like_datetime now. It does currently create date objects from iso dates so I'm checking for that https://github.com/dlt-hub/verified-sources/pull/222/files#diff-3c1b45073fee3f98a1147e876dab634e0676b984b509690c76a4957e62d9efa0R32-R33

sources/shopify_dlt/helpers.py Outdated Show resolved Hide resolved

Returns:
Iterable[TDataItem]: A generator of products.
"""
page = shopify.Product.find(
updated_at_min=updated_at.last_value, created_at_min=start_date
params = dict(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes you are right. we need to handle backloading differently.

  • when end date is specified we use only created_at_min and created_at_max to select range
  • when we have incremental load we specify created_at_min (set to start_date - which is also initial_value) and updated_at_min set to last_value
    there's a grey are here: should we also track updates of records from backloaded range? in case of the above it is not the case. we'd need another parameter to set created_at and updated_at separately

Copy link
Contributor

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! this is really good now!

@rudolfix rudolfix merged commit 6da4f69 into master Jul 24, 2023
8 checks passed
@rudolfix rudolfix deleted the fix/shopify branch July 24, 2023 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add support for statuses for shopify and standardize incremental loading
2 participants