
129 Adds verified source: airtable #218

Merged
2 commits merged into dlt-hub:master on Aug 30, 2023

Conversation

@willi-mueller (Collaborator) commented on Jul 20, 2023

Tell us what you do here

  • [x] implementing verified source Airtable
  • [ ] fixing a bug (please link a relevant bug report)
  • [ ] improving, documenting, customizing existing source (please link an issue or describe below)
  • [ ] anything else (please link an issue or describe below)

Relevant issue

#129

More PR info

TODO Before Merge

  • implement type annotations
  • lint the code
  • implement test cases
  • add the Event Planning base to the Airtable CI account so that we also test that dlt creates join tables. Here is how:

Step 1: (screenshot)

Step 2: (screenshot)

Step 3: Share the table as read-only with me or with the public so that I can point the references in our test suite and example pipeline to it.

Thank you!

Future work

Implement an API similar to the facebook_ads connector that allows users to specify column names and data types for their Airtables.
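
A minimal sketch of what such an API could look like, assuming dlt's apply_hints(columns=...) mechanism; the table_hints parameter and the data shapes are hypothetical, not part of this PR:

    from typing import Any, Dict, Iterable, List, Optional

    import dlt
    from dlt.sources import DltResource

    # Hypothetical: per-table column hints the user could pass in, e.g.
    # {"Speakers": {"Start date": {"data_type": "timestamp"}}}
    def airtable_resources_with_hints(
        tables_data: Dict[str, Iterable[Dict[str, Any]]],
        table_hints: Optional[Dict[str, Dict[str, Any]]] = None,
    ) -> List[DltResource]:
        resources = []
        for name, rows in tables_data.items():
            resource = dlt.resource(rows, name=name, write_disposition="replace")
            hints = (table_hints or {}).get(name)
            if hints:
                # dlt resources accept column schemas as hints after creation
                resource.apply_hints(columns=hints)
            resources.append(resource)
        return resources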

@willi-mueller changed the title from "129 airtable source" to "129 Adds verified source: airtable" on Jul 20, 2023
@willi-mueller (Collaborator, Author) commented on Jul 20, 2023

Open questions

How should users refer to Airtable tables?

I see two options:

  • A: table names (user-defined, mutable)
  • B: table ids (airtable-defined, immutable)

There are two use cases for names:

  1. naming the dlt resource and thus the table in the destination
  2. providing a whitelist of tables to be loaded into the destination

Additional Write Dispositions?

I'm not sure how to support additional write dispositions; I'd have to study the sql_database code more deeply. If you have any pointers, I'd be happy to hear them.

@willi-mueller (Collaborator, Author) commented on Jul 20, 2023

I'd also appreciate help on how to fix the following linting error:

sources/airtable/__init__.py:38: error: Returning Any from function declared to return "DltResource"  [no-any-return]

The function returns a call to dlt.resource and there is no conditional statement, so I don't know how to convince the type checker.

@willi-mueller (Collaborator, Author) commented on Jul 20, 2023

Currently, this draft PR has at least two implementation problems that I'm stuck on because they seem to be related to what's going on deep in the core:

  1. Warnings about incomplete columns and data binding
  2. Emojis in table names

I've been testing with this Airtable table provided by dltHub as well as this quickstart template provided by Airtable. In the latter, each table name starts with an emoji.

See screenshot: (screenshot of the quickstart template's table list, each name starting with an emoji)

Question: Should the source connector handle emojis in resource names or is this something for the dlt framework?

1. Warnings about incomplete columns

After deleting my local DuckDB database and running the pipeline with python airtable_pipeline.py, I get the following warnings:

2023-07-20 16:03:25,617|[WARNING ]|81710|dlt|reference.py|_verify_schema:195|A column name in table sheet1 in schema airtable_source is incomplete. It was not bound to the data during normalizations stage and its data type is unknown. Did you add this column manually in code ie. as a merge key?
... # omitting the "successfully loaded" messages
2023-07-20 16:03:31,205|[WARNING ]|81710|dlt|reference.py|_verify_schema:195|A column name in table sheet1 in schema airtable_source is incomplete. It was not bound to the data during normalizations stage and its data type is unknown. Did you add this column manually in code ie. as a merge key?
2023-07-20 16:03:31,205|[WARNING ]|81710|dlt|reference.py|_verify_schema:195|A column activity in table _schedule in schema airtable_source is incomplete. It was not bound to the data during normalizations stage and its data type is unknown. Did you add this column manually in code ie. as a merge key?
2023-07-20 16:03:31,205|[WARNING ]|81710|dlt|reference.py|_verify_schema:195|A column name in table _speakers in schema airtable_source is incomplete. It was not bound to the data during normalizations stage and its data type is unknown. Did you add this column manually in code ie. as a merge key?

2. Running the pipeline repeatedly creates runtime errors due to emojis in table names

Then, running the pipeline again produces:

2023-07-20 16:03:58,284|[WARNING ]|81769|dlt|reference.py|_verify_schema:195|A column name in table sheet1 in schema airtable_source is incomplete. It was not bound to the data during normalizations stage and its data type is unknown. Did you add this column manually in code ie. as a merge key?
2023-07-20 16:03:58,284|[WARNING ]|81769|dlt|reference.py|_verify_schema:195|A column activity in table _schedule in schema airtable_source is incomplete. It was not bound to the data during normalizations stage and its data type is unknown. Did you add this column manually in code ie. as a merge key?
2023-07-20 16:03:58,284|[WARNING ]|81769|dlt|reference.py|_verify_schema:195|A column name in table _speakers in schema airtable_source is incomplete. It was not bound to the data during normalizations stage and its data type is unknown. Did you add this column manually in code ie. as a merge key?
Traceback (most recent call last):
File "/Users/vilasa/.pyenv/versions/3.9.15/lib/python3.9/base64.py", line 37, in _bytes_from_decode_data
return s.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f4c6' in position 8539: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/vilasa/Library/Caches/pypoetry/virtualenvs/dlt-verified-sources-oVE4GiVX-py3.9/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 350, in load
runner.run_pool(load.config, load)
File "/Users/vilasa/Library/Caches/pypoetry/virtualenvs/dlt-verified-sources-oVE4GiVX-py3.9/lib/python3.9/site-packages/dlt/common/runners/pool_runner.py", line 48, in run_pool
while _run_func():
File "/Users/vilasa/Library/Caches/pypoetry/virtualenvs/dlt-verified-sources-oVE4GiVX-py3.9/lib/python3.9/site-packages/dlt/common/runners/pool_runner.py", line 41, in _run_func
run_metrics = run_f.run(cast(TPool, pool))
File "/Users/vilasa/Library/Caches/pypoetry/virtualenvs/dlt-verified-sources-oVE4GiVX-py3.9/lib/python3.9/site-packages/dlt/load/load.py", line 325, in run
self.load_single_package(load_id, schema)
File "/Users/vilasa/Library/Caches/pypoetry/virtualenvs/dlt-verified-sources-oVE4GiVX-py3.9/lib/python3.9/site-packages/dlt/load/load.py", line 246, in load_single_package
applied_update = job_client.update_storage_schema(only_tables=set(all_tables+dlt_tables), expected_update=expected_update)
File "/Users/vilasa/Library/Caches/pypoetry/virtualenvs/dlt-verified-sources-oVE4GiVX-py3.9/lib/python3.9/site-packages/dlt/destinations/job_client_impl.py", line 90, in update_storage_schema
schema_info = self.get_schema_by_hash(self.schema.stored_version_hash)
File "/Users/vilasa/Library/Caches/pypoetry/virtualenvs/dlt-verified-sources-oVE4GiVX-py3.9/lib/python3.9/site-packages/dlt/destinations/job_client_impl.py", line 204, in get_schema_by_hash
return self._row_to_schema_info(query, version_hash)
File "/Users/vilasa/Library/Caches/pypoetry/virtualenvs/dlt-verified-sources-oVE4GiVX-py3.9/lib/python3.9/site-packages/dlt/destinations/job_client_impl.py", line 298, in _row_to_schema_info
schema_bytes = base64.b64decode(schema_str, validate=True)
File "/Users/vilasa/.pyenv/versions/3.9.15/lib/python3.9/base64.py", line 80, in b64decode
s = _bytes_from_decode_data(s)
File "/Users/vilasa/.pyenv/versions/3.9.15/lib/python3.9/base64.py", line 39, in _bytes_from_decode_data
raise ValueError('string argument should contain only ASCII characters')
ValueError: string argument should contain only ASCII characters

@rudolfix (Contributor) commented

How should users refer to Airtable tables?

  1. naming the dlt resource and thus the table in the destination
    My take is that a regular user expects to see table names, not table IDs. The consequence is that dlt will create a new table if the name changes. We could also parametrize this behavior, but IMO that is an unnecessary complication.
  2. providing a whitelist of tables to be loaded into the destination
    You could accept both and check both names and IDs. On top of that, if someone passes a table ID, you could also name the table in the destination by that ID (see the sketch below).

Btw, if the code stays as simple as it is right now, almost anyone can hack it and use IDs for names. dlt init inserts the source code into the current project, so hacking is easy.
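
A minimal sketch of that accept-both check, assuming table metadata dicts with "id" and "name" keys as returned by Airtable's metadata API; the function name is illustrative:

    from typing import Any, Dict, List, Optional

    def filter_tables(
        all_tables: List[Dict[str, Any]],
        table_names_or_ids: Optional[List[str]] = None,
    ) -> List[Dict[str, Any]]:
        """Keep tables whose user-facing name or immutable Airtable id is whitelisted."""
        if not table_names_or_ids:
            return all_tables  # no whitelist given: load every table in the base
        wanted = set(table_names_or_ids)
        return [t for t in all_tables if t["id"] in wanted or t["name"] in wanted]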

Additional Write Dispositions?

I'm not sure how to support additional write dispositions; I'd have to study the sql_database code more deeply. If you have any pointers, I'd be happy to hear them.

Everything is on a good track. You already add primary keys, which is an essential step.

In dlt, resources can be customized after they are created. In sql_database, all of them are created as append and then changed per the user's wish:
https://github.com/dlt-hub/verified-sources/blob/master/sources/sql_database_pipeline.py
In this demo we add a merge disposition and incremental loading via a cursor column; a minimal sketch of this pattern follows below.
Here's some general info: https://dlthub.com/docs/general-usage/resource#adjust-schema
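
A minimal sketch of that pattern, assuming dlt's apply_hints API; the resource and the cursor column are illustrative:

    import dlt

    # illustrative resource; in the real pipeline this would stream Airtable rows
    @dlt.resource(write_disposition="append", primary_key="id")
    def events():
        yield {"id": 1, "updated_at": "2023-07-20T00:00:00Z"}

    # adjust the resource after creation: merge disposition plus
    # incremental loading via a cursor column
    events.apply_hints(
        write_disposition="merge",
        incremental=dlt.sources.incremental("updated_at"),
    )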

What I'd suggest:

  1. How big can Airtables get? If they are large enough, we should add support for incremental loading; in sql_database, see sql_table and incremental.
  2. IMO it makes sense to add write_disposition to airtable_resource (we should do it in sql_database too).

@rudolfix (Contributor) commented

The function returns a call to dlt.resource and there is no conditional statement, so I don't know how to convince the type checker.

dlt.resource is overloaded and mypy is very picky then. What IMO happens is this:

    return dlt.resource(
        pyairtable.Table(access_token, base_id, table.get("id")).iterate(),
        name=table.get("name"),  # TODO: clarify if stable id or user-chosen name is preferred
        primary_key=[primary_key_field["name"]],
        write_disposition="replace",
    )

Here:

  • name=table.get("name") probably returns type Any but we expect a string
  • primary_key=[primary_key_field["name"]], same (but we expect a string or a list)

So the type checker is not able to match any of the overloaded signatures. Type the things above and it will work, IMO.
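
A minimal sketch of the typed version, assuming the same call as in the snippet above; looking up the primary field via table["fields"][0] is illustrative, not the PR's actual logic:

    from typing import Any, Dict

    import dlt
    import pyairtable
    from dlt.sources import DltResource

    def airtable_resource(access_token: str, base_id: str, table: Dict[str, Any]) -> DltResource:
        # explicit str annotations instead of the Any that .get() returns,
        # so mypy can match one of dlt.resource's overloads
        name: str = table["name"]
        primary_key_field_name: str = table["fields"][0]["name"]  # illustrative lookup
        return dlt.resource(
            pyairtable.Table(access_token, base_id, table["id"]).iterate(),
            name=name,
            primary_key=[primary_key_field_name],
            write_disposition="replace",
        )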

@rudolfix (Contributor) commented

  1. Warnings about incomplete columns and data binding
    This comes from Airtables that you define as a resource but that in the end do not return any data. In dlt, a partial table is created which has an incomplete column schema. In particular: you define primary_key, but it never gets bound to a data type because dlt never received any data for it. Thus the table has an incomplete schema and we show a warning...
    Workaround: maybe you have a row count in the table props and can skip tables with 0 rows? If not, let's just ignore the warning.
  2. Emojis in table names
    Thanks for spotting this! This is simply very interesting. I'll dig deeper, try to reproduce it, and file a bug for it. Most probably, somewhere I expect the schema content to be an ASCII string (and it should be!). Emojis should be normalized to _ in the table names...

@willi-mueller (Collaborator, Author) commented on Jul 20, 2023

  1. Warnings about incomplete columns and data binding

This comes from Airtables that you define as a resource but that in the end do not return any data. In dlt, a partial table is created which has an incomplete column schema. In particular: you define primary_key, but it never gets bound to a data type because dlt never received any data for it. Thus the table has an incomplete schema and we show a warning...
Workaround: maybe you have a row count in the table props and can skip tables with 0 rows? If not, let's just ignore the warning.

@rudolfix
I think there must be a different reason, because all Airtables and dlt tables contain data! Still, the warning appears.

  1. Emojis in table names

Thanks for spotting this! This is simply very interesting. I'll dig deeper, try to reproduce it, and file a bug for it. Most probably, somewhere I expect the schema content to be an ASCII string (and it should be!). Emojis should be normalized to _ in the table names...

Thank you! Yes, on the first load they do get normalized to _. However, the second load (even though write_disposition="replace") does not normalize them.

Here is a screenshot after the first run: (screenshot)

I think I can make a minimal example reproducing this and file a bug.

@rudolfix (Contributor) commented

@willi-mueller it just came to my mind: should we add some metadata to each row coming from Airtable? Do they have any internal unique ID? Is the row number important?

@willi-mueller (Collaborator, Author) commented on Jul 20, 2023

@willi-mueller it just came to my mind: should we add some metadata to each row coming from Airtable? Do they have any internal unique ID? Is the row number important?

Yes, every row (record) has an internal unique ID, and it already seems to exist in the destination. The IDs start with "rec". See the first column: (screenshot)

Or did you have something else in mind with "add some metadata"?
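
For reference, a minimal sketch of how that record ID could be carried next to the fields; the record shape (id, createdTime, fields) follows Airtable's REST API, while the helper name is ours:

    from typing import Any, Dict, Iterable, Iterator

    def with_record_metadata(records: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
        """Flatten Airtable records so the internal "rec..." id lands in the destination."""
        for record in records:
            yield {
                "id": record["id"],                     # internal unique id, starts with "rec"
                "created_time": record["createdTime"],  # Airtable-provided timestamp
                **record["fields"],                     # the user-visible columns
            }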

@willi-mueller (Collaborator, Author) commented

Regarding the warning about binding: for every table, dlt complains about the first column. That's a pattern.

@rudolfix (Contributor) commented

@willi-mueller I've managed to reproduce the encoding error and the fix is coming: dlt-hub/dlt#509
Thanks! That was a really important edge case; now we'll have it properly tested.

@willi-mueller marked this pull request as ready for review on August 11, 2023 11:11
@willi-mueller (Collaborator, Author) commented

@rudolfix I'd appreciate your help in getting the CI checks to pass. It seems that something is wrong with the way I import the third-party dependency pyairtable. The library claims that it is fully typed.

Implemented strict type annotations on all functions and methods

Still, the mypy checks fail.

I'd like to hear your thoughts on whether:
a) I made a mistake during the import,
b) I should ignore the type checks, like for facebook, or
c) I should implement stubs, because you think the type annotations of pyairtable are incomplete.

@rudolfix (Contributor) commented

@willi-mueller you have to add this dependency to the dev dependency group with poetry, as described here: https://github.com/dlt-hub/verified-sources/blob/master/CONTRIBUTING.md#source-specific-dependencies-requirements-file

And you are right to think that the requirements in the pipeline should be enough. This is actually a really good idea (we started with the groups before there were any requirements for pipelines).

@willi-mueller (Collaborator, Author) commented

@willi-mueller you have to add this dependency to the dev dependency group with poetry, as described here: https://github.com/dlt-hub/verified-sources/blob/master/CONTRIBUTING.md#source-specific-dependencies-requirements-file

And you are right to think that the requirements in the pipeline should be enough. This is actually a really good idea (we started with the groups before there were any requirements for pipelines).

Oh, fantastic! Thank you for the pointer to the docs. I'm sorry I hadn't seen that before.

Still, the linter and test runner errors are unchanged.
Linter:

Cannot find implementation or library stub for module named "pyairtable" [import]

Test runner:

ModuleNotFoundError: No module named 'pyairtable'

I'd appreciate advice on how we can move forward. Thank you!

@rudolfix (Contributor) commented on Aug 13, 2023

@willi-mueller you are doing everything right. It is poetry not adding your dependency to pyproject.toml:
python-poetry/poetry#7230
In our case it was the black linter config added in the wrong place. I fixed our pyproject.toml and will push to GitHub; then you can do poetry add pyairtable --group=airtable. I will ping you when done.
Thanks for finding another problem :)

@willi-mueller (Collaborator, Author) commented on Aug 15, 2023 via email

@rudolfix (Contributor) commented

I read the comments on the issue you linked. I can test it, make the PR, and only reach out if the issue is more complex. You must already have tons of things on your head.

OK! Here's the PR fixing black, but it may take a while to merge it: https://github.com/dlt-hub/verified-sources/pull/243/files#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711

@rudolfix (Contributor) commented

@willi-mueller the problem in pyproject.toml got fixed in #243

@willi-mueller force-pushed the 129-airtable-source branch 2 times, most recently from b49395c to 67dc58f, on August 17, 2023 10:50
@willi-mueller (Collaborator, Author) commented on Aug 17, 2023

@willi-mueller the problem in pyproject.toml got fixed in #243

🥳 Yes, it fixes most of the CI runs!

Now, in the dlthub repo the tests are red due to the missing airtable secrets. I'll reach out on slack.

I'm not sure if I understood your intention in #243:

I added DLT_SECRETS_TOML with my airtable access_token. However, the airtable tests are skipped as defined in test_on_local_destinations_forks.yml – and thus I'm not sure how to run my test suite in the CI of my fork. See the run here

@rudolfix (Contributor) commented

Now, in the dlthub repo the tests are red due to the missing airtable secrets. I'll reach out on slack.

I'm not sure if I understood your intention in #243:

I added DLT_SECRETS_TOML with my airtable access_token. However, the airtable tests are skipped as defined in test_on_local_destinations_forks.yml – and thus I'm not sure how to run my test suite in the CI of my fork. See the run here

dlt complains that it does not see your access token... You should have something like this:

    access_token="...."

or

    [sources.airtable]
    access_token="...."

Please make sure that you did one of the above. TOML has its quirks, i.e. if you provide the access token as in the first example, it must be at the very top of the file, above any table (square brackets).

Btw, maybe you've found another bug, let's see :)

@willi-mueller (Collaborator, Author) commented

Now, in the dlthub repo the tests are red due to the missing airtable secrets. I'll reach out on slack.
I'm not sure if I understood your intention in #243:
I added DLT_SECRETS_TOML with my airtable access_token. However, the airtable tests are skipped as defined in test_on_local_destinations_forks.yml – and thus I'm not sure how to run my test suite in the CI of my fork. See the run here

dlt complains that it does not see your access token... You should have something like this:

    access_token="...."

or

    [sources.airtable]
    access_token="...."

Please make sure that you did one of the above. TOML has its quirks, i.e. if you provide the access token as in the first example, it must be at the very top of the file, above any table (square brackets).

Btw, maybe you've found another bug, let's see :)

Thank you for your quick reply. Here is how I set it up after examining the contribution docs. The GitHub Actions secret for my fork:

(screenshot)

This is my sources/secrets.toml file, which has been working great locally. I copied lines 3 and 4 into the GitHub secret:

(screenshot)

Can you spot a mistake?

Would I be able to reproduce this behavior in my fork by doing:

  • pick another connector
  • make a nonsense change
  • add the expected secret
  • watch the CI run?

@rudolfix (Contributor) commented

@willi-mueller absolutely. Poke the facebook ads source maybe, and I'll give you a refresh token on Slack :). I cannot really find any problem with your code. If our experiment with facebook does not work, I'll try it myself from my private account.

@rudolfix added the "ci from fork" label (allows running tests from PRs coming from forks) on Aug 20, 2023
@rudolfix removed and re-added the "ci from fork" label on Aug 20, 2023
@rudolfix (Contributor) left a review

@willi-mueller this looks good! I have a few small suggestions to try; see the review.

Review comments on sources/airtable/README.md and sources/airtable/__init__.py (outdated, resolved).
@rudolfix removed and re-added the "ci from fork" label on Aug 27, 2023
Commits in this PR:

- creates source for a base which returns resources for airtables
- supports whitelisting airtables by names or ids
- implements test suite calling live API
- README for Airtable source

improves typing …ence warnings and improve documentation

refactoring according to the latest recommendations of pyairtable

simplify airtable filters by following the paradigm of the pyairtable API, which treats table names and table IDs equally

updates pyairtable dependency
@rudolfix removed and re-added the "ci from fork" label on Aug 30, 2023
@rudolfix self-assigned this on Aug 30, 2023
@rudolfix (Contributor) left a review

this is excellent! thanks!

@rudolfix merged commit a0f8f12 into dlt-hub:master on Aug 30, 2023. 16 checks passed.