
129 Adds verified source: airtable #218

Merged
2 commits merged into dlt-hub:master on Aug 30, 2023

Conversation

@willi-mueller (Collaborator) commented on Jul 20, 2023

Tell us what you do here

  • [x] implementing verified source Airtable
  • [ ] fixing a bug (please link a relevant bug report)
  • [ ] improving, documenting, customizing existing source (please link an issue or describe below)
  • [ ] anything else (please link an issue or describe below)

Relevant issue

#129

More PR info

TODO Before Merge

  • implement type annotations
  • lint the code
  • implement test cases
  • add the Event Planning base to the Airtable CI account so that we also test that dlt creates join tables. Here is how:

Step 1: (screenshot)

Step 2: (screenshot)

Step 3: Share the table as read-only with me or with the public so that I can point the references in our test suite and example pipeline to it.

Thank you!

Future work

Implement an API similar to the facebook_ads connector that allows users to specify column names and data types for their Airtables.
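
A minimal sketch of what such an API could look like, assuming dlt's apply_hints(columns=...) mechanism; the table_hints parameter and the data shapes are hypothetical, not part of this PR:

    from typing import Any, Dict, Iterable, List, Optional

    import dlt
    from dlt.sources import DltResource

    # Hypothetical: per-table column hints the user could pass in, e.g.
    # {"Speakers": {"Start date": {"data_type": "timestamp"}}}
    def airtable_resources_with_hints(
        tables_data: Dict[str, Iterable[Dict[str, Any]]],
        table_hints: Optional[Dict[str, Dict[str, Any]]] = None,
    ) -> List[DltResource]:
        resources = []
        for name, rows in tables_data.items():
            resource = dlt.resource(rows, name=name, write_disposition="replace")
            hints = (table_hints or {}).get(name)
            if hints:
                # dlt resources accept column schemas as hints after creation
                resource.apply_hints(columns=hints)
            resources.append(resource)
        return resources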

@willi-mueller changed the title from "129 airtable source" to "129 Adds verified source: airtable" on Jul 20, 2023
@willi-mueller (Collaborator, Author) commented on Jul 20, 2023

Open questions

How should users refer to Airtable tables?

I see two options:

  • A: table names (user-defined, mutable)
  • B: table ids (airtable-defined, immutable)

There are two use cases for names:

  1. naming the dlt resource and thus the table in the destination
  2. providing a whitelist of tables to be loaded into the destination

Additional Write Dispositions?

I'm not sure how to support additional write dispositions; I'd have to study the sql_database code more deeply. If you have any pointers, I'd be happy to hear them.

@willi-mueller (Collaborator, Author) commented on Jul 20, 2023

I'd also appreciate help on how to fix the following linting error:

sources/airtable/__init__.py:38: error: Returning Any from function declared to return "DltResource"  [no-any-return]

The function returns a call to dlt.resource and there is no conditional statement, so I don't know how to convince the type checker.

@willi-mueller (Collaborator, Author) commented on Jul 20, 2023

Currently, this draft PR has at least two implementation problems that I'm stuck on because they seem to be related to what's going on deep in the core:

  1. Warnings about incomplete columns and data binding
  2. Emojis in table names

I've been testing with this Airtable table provided by dltHub as well as this quickstart template provided by Airtable. In the latter, each table name starts with an emoji.

See screenshot: (screenshot of the quickstart template's table list, each name starting with an emoji)

Question: Should the source connector handle emojis in resource names or is this something for the dlt framework?

1. Warnings about incomplete columns

After deleting my local DuckDB database and running the pipeline with python airtable_pipeline.py, I get the following warnings:

2023-07-20 16:03:25,617|[WARNING ]|81710|dlt|reference.py|_verify_schema:195|A column name in table sheet1 in schema airtable_source is incomplete. It was not bound to the data during normalizations stage and its data type is unknown. Did you add this column manually in code ie. as a merge key?
... # omitting the "successfully loaded" messages
2023-07-20 16:03:31,205|[WARNING ]|81710|dlt|reference.py|_verify_schema:195|A column name in table sheet1 in schema airtable_source is incomplete. It was not bound to the data during normalizations stage and its data type is unknown. Did you add this column manually in code ie. as a merge key?
2023-07-20 16:03:31,205|[WARNING ]|81710|dlt|reference.py|_verify_schema:195|A column activity in table _schedule in schema airtable_source is incomplete. It was not bound to the data during normalizations stage and its data type is unknown. Did you add this column manually in code ie. as a merge key?
2023-07-20 16:03:31,205|[WARNING ]|81710|dlt|reference.py|_verify_schema:195|A column name in table _speakers in schema airtable_source is incomplete. It was not bound to the data during normalizations stage and its data type is unknown. Did you add this column manually in code ie. as a merge key?

2. Running the pipeline repeatedly creates runtime errors due to emojis in table names

Then, running the pipeline again produces:

2023-07-20 16:03:58,284|[WARNING ]|81769|dlt|reference.py|_verify_schema:195|A column name in table sheet1 in schema airtable_source is incomplete. It was not bound to the data during normalizations stage and its data type is unknown. Did you add this column manually in code ie. as a merge key?
2023-07-20 16:03:58,284|[WARNING ]|81769|dlt|reference.py|_verify_schema:195|A column activity in table _schedule in schema airtable_source is incomplete. It was not bound to the data during normalizations stage and its data type is unknown. Did you add this column manually in code ie. as a merge key?
2023-07-20 16:03:58,284|[WARNING ]|81769|dlt|reference.py|_verify_schema:195|A column name in table _speakers in schema airtable_source is incomplete. It was not bound to the data during normalizations stage and its data type is unknown. Did you add this column manually in code ie. as a merge key?
Traceback (most recent call last):
File "/Users/vilasa/.pyenv/versions/3.9.15/lib/python3.9/base64.py", line 37, in _bytes_from_decode_data
return s.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f4c6' in position 8539: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/vilasa/Library/Caches/pypoetry/virtualenvs/dlt-verified-sources-oVE4GiVX-py3.9/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 350, in load
runner.run_pool(load.config, load)
File "/Users/vilasa/Library/Caches/pypoetry/virtualenvs/dlt-verified-sources-oVE4GiVX-py3.9/lib/python3.9/site-packages/dlt/common/runners/pool_runner.py", line 48, in run_pool
while _run_func():
File "/Users/vilasa/Library/Caches/pypoetry/virtualenvs/dlt-verified-sources-oVE4GiVX-py3.9/lib/python3.9/site-packages/dlt/common/runners/pool_runner.py", line 41, in _run_func
run_metrics = run_f.run(cast(TPool, pool))
File "/Users/vilasa/Library/Caches/pypoetry/virtualenvs/dlt-verified-sources-oVE4GiVX-py3.9/lib/python3.9/site-packages/dlt/load/load.py", line 325, in run
self.load_single_package(load_id, schema)
File "/Users/vilasa/Library/Caches/pypoetry/virtualenvs/dlt-verified-sources-oVE4GiVX-py3.9/lib/python3.9/site-packages/dlt/load/load.py", line 246, in load_single_package
applied_update = job_client.update_storage_schema(only_tables=set(all_tables+dlt_tables), expected_update=expected_update)
File "/Users/vilasa/Library/Caches/pypoetry/virtualenvs/dlt-verified-sources-oVE4GiVX-py3.9/lib/python3.9/site-packages/dlt/destinations/job_client_impl.py", line 90, in update_storage_schema
schema_info = self.get_schema_by_hash(self.schema.stored_version_hash)
File "/Users/vilasa/Library/Caches/pypoetry/virtualenvs/dlt-verified-sources-oVE4GiVX-py3.9/lib/python3.9/site-packages/dlt/destinations/job_client_impl.py", line 204, in get_schema_by_hash
return self._row_to_schema_info(query, version_hash)
File "/Users/vilasa/Library/Caches/pypoetry/virtualenvs/dlt-verified-sources-oVE4GiVX-py3.9/lib/python3.9/site-packages/dlt/destinations/job_client_impl.py", line 298, in _row_to_schema_info
schema_bytes = base64.b64decode(schema_str, validate=True)
File "/Users/vilasa/.pyenv/versions/3.9.15/lib/python3.9/base64.py", line 80, in b64decode
s = _bytes_from_decode_data(s)
File "/Users/vilasa/.pyenv/versions/3.9.15/lib/python3.9/base64.py", line 39, in _bytes_from_decode_data
raise ValueError('string argument should contain only ASCII characters')
ValueError: string argument should contain only ASCII characters

@rudolfix (Contributor) commented

How should users refer to Airtable tables?

  1. naming the dlt resource and thus the table in the destination
    My take is that a regular user expects to see table names, not table IDs. The consequence is that dlt will create a new table if the name changes. We could also parametrize this behavior, but IMO that is an unnecessary complication.
  2. providing a whitelist of tables to be loaded into the destination
    You could accept both and check both names and IDs. On top of that, if someone passes a table ID, you could also name the table in the destination by that ID (see the sketch below).

Btw, if the code stays as simple as it is right now, almost anyone can hack it and use IDs for names. dlt init inserts the source code into the current project, so hacking is easy.
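
A minimal sketch of that accept-both check, assuming table metadata dicts with "id" and "name" keys as returned by Airtable's metadata API; the function name is illustrative:

    from typing import Any, Dict, List, Optional

    def filter_tables(
        all_tables: List[Dict[str, Any]],
        table_names_or_ids: Optional[List[str]] = None,
    ) -> List[Dict[str, Any]]:
        """Keep tables whose user-facing name or immutable Airtable id is whitelisted."""
        if not table_names_or_ids:
            return all_tables  # no whitelist given: load every table in the base
        wanted = set(table_names_or_ids)
        return [t for t in all_tables if t["id"] in wanted or t["name"] in wanted]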

Additional Write Dispositions?

I'm not sure how to support additional write dispositions; I'd have to study the sql_database code more deeply. If you have any pointers, I'd be happy to hear them.

Everything is on a good track. You already add primary keys, which is an essential step.

In dlt, resources can be customized after they are created. In sql_database, all of them are created as append and then changed per the user's wish:
https://github.com/dlt-hub/verified-sources/blob/master/sources/sql_database_pipeline.py
In this demo we add a merge disposition and incremental loading via a cursor column; a minimal sketch of this pattern follows below.
Here's some general info: https://dlthub.com/docs/general-usage/resource#adjust-schema
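
A minimal sketch of that pattern, assuming dlt's apply_hints API; the resource and the cursor column are illustrative:

    import dlt

    # illustrative resource; in the real pipeline this would stream Airtable rows
    @dlt.resource(write_disposition="append", primary_key="id")
    def events():
        yield {"id": 1, "updated_at": "2023-07-20T00:00:00Z"}

    # adjust the resource after creation: merge disposition plus
    # incremental loading via a cursor column
    events.apply_hints(
        write_disposition="merge",
        incremental=dlt.sources.incremental("updated_at"),
    )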

What I'd suggest:

  1. How big can Airtables get? If they are large enough, we should add support for incremental loading; in sql_database, see sql_table and incremental.
  2. IMO it makes sense to add write_disposition to airtable_resource (we should do it in sql_database too).

@rudolfix (Contributor) commented

The function returns a call to dlt.resource and there is no conditional statement, so I don't know how to convince the type checker.

dlt.resource is overloaded and mypy is very picky then. What IMO happens is this:

    return dlt.resource(
        pyairtable.Table(access_token, base_id, table.get("id")).iterate(),
        name=table.get("name"),  # TODO: clarify if stable id or user-chosen name is preferred
        primary_key=[primary_key_field["name"]],
        write_disposition="replace",
    )

Here:

  • name=table.get("name") probably returns type Any but we expect a string
  • primary_key=[primary_key_field["name"]], same (but we expect a string or a list)

So the type checker is not able to match any of the overloaded signatures. Type the things above and it will work, IMO.
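
A minimal sketch of the typed version, assuming the same call as in the snippet above; looking up the primary field via table["fields"][0] is illustrative, not the PR's actual logic:

    from typing import Any, Dict

    import dlt
    import pyairtable
    from dlt.sources import DltResource

    def airtable_resource(access_token: str, base_id: str, table: Dict[str, Any]) -> DltResource:
        # explicit str annotations instead of the Any that .get() returns,
        # so mypy can match one of dlt.resource's overloads
        name: str = table["name"]
        primary_key_field_name: str = table["fields"][0]["name"]  # illustrative lookup
        return dlt.resource(
            pyairtable.Table(access_token, base_id, table["id"]).iterate(),
            name=name,
            primary_key=[primary_key_field_name],
            write_disposition="replace",
        )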

@rudolfix (Contributor) commented

  1. Warnings about incomplete columns and data binding
    This comes from Airtables that you define as a resource but that in the end do not return any data. In dlt, a partial table is created which has an incomplete column schema. In particular: you define primary_key, but it never gets bound to a data type because dlt never received any data for it. Thus the table has an incomplete schema and we show a warning...
    Workaround: maybe you have a row count in the table props and can skip tables with 0 rows? If not, let's just ignore the warning.
  2. Emojis in table names
    Thanks for spotting this! This is simply very interesting. I'll dig deeper, try to reproduce it, and file a bug for it. Most probably, somewhere I expect the schema content to be an ASCII string (and it should be!). Emojis should be normalized to _ in the table names...

@willi-mueller (Collaborator, Author) commented on Jul 20, 2023

  1. Warnings about incomplete columns and data binding

This comes from Airtables that you define as a resource but that in the end do not return any data. In dlt, a partial table is created which has an incomplete column schema. In particular: you define primary_key, but it never gets bound to a data type because dlt never received any data for it. Thus the table has an incomplete schema and we show a warning...
Workaround: maybe you have a row count in the table props and can skip tables with 0 rows? If not, let's just ignore the warning.

@rudolfix
I think there must be a different reason, because all Airtables and dlt tables contain data! Still, the warning appears.

  1. Emojis in table names

Thanks for spotting this! This is simply very interesting. I'll dig deeper, try to reproduce it, and file a bug for it. Most probably, somewhere I expect the schema content to be an ASCII string (and it should be!). Emojis should be normalized to _ in the table names...

Thank you! Yes, on the first load they do get normalized to _. However, the second load (even though write_disposition="replace") does not normalize them.

Here is a screenshot after the first run: (screenshot)

I think I can make a minimal example reproducing this and file a bug.

@rudolfix (Contributor) commented

@willi-mueller it just came to my mind: should we add some metadata to each row coming from Airtable? Do they have any internal unique ID? Is the row number important?

@willi-mueller (Collaborator, Author) commented on Jul 20, 2023

@willi-mueller it just came to my mind: should we add some metadata to each row coming from Airtable? Do they have any internal unique ID? Is the row number important?

Yes, every row (record) has an internal unique ID, and it already seems to exist in the destination. The IDs start with "rec". See the first column: (screenshot)

Or did you have something else in mind with "add some metadata"?
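
For reference, a minimal sketch of how that record ID could be carried next to the fields; the record shape (id, createdTime, fields) follows Airtable's REST API, while the helper name is ours:

    from typing import Any, Dict, Iterable, Iterator

    def with_record_metadata(records: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
        """Flatten Airtable records so the internal "rec..." id lands in the destination."""
        for record in records:
            yield {
                "id": record["id"],                     # internal unique id, starts with "rec"
                "created_time": record["createdTime"],  # Airtable-provided timestamp
                **record["fields"],                     # the user-visible columns
            }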

@willi-mueller (Collaborator, Author) commented

Regarding the warning about binding: for every table, dlt complains about the first column. That's a pattern.

@rudolfix (Contributor) commented

@willi-mueller I've managed to reproduce the encoding error and the fix is coming: dlt-hub/dlt#509
Thanks! That was a really important edge case; now we'll have it properly tested.

@willi-mueller marked this pull request as ready for review on August 11, 2023 11:11
@willi-mueller (Collaborator, Author) commented

@rudolfix I'd appreciate your help in getting the CI checks to pass. It seems that something is wrong with the way I import the third-party dependency pyairtable. The library claims that it is fully typed.

Implemented strict type annotations on all functions and methods

Still, the mypy checks fail.

I'd like to hear your thoughts on whether:
a) I made a mistake during the import,
b) I should ignore the type checks, like for facebook, or
c) I should implement stubs, because you think the type annotations of pyairtable are incomplete.

@rudolfix (Contributor) commented

@willi-mueller you have to add this dependency to the dev dependency group with poetry, as described here: https://github.com/dlt-hub/verified-sources/blob/master/CONTRIBUTING.md#source-specific-dependencies-requirements-file

And you are right to think that the requirements in the pipeline should be enough. This is actually a really good idea (we started with the groups before there were any requirements for pipelines).

@willi-mueller (Collaborator, Author) commented

@willi-mueller you have to add this dependency to the dev dependency group with poetry, as described here: https://github.com/dlt-hub/verified-sources/blob/master/CONTRIBUTING.md#source-specific-dependencies-requirements-file

And you are right to think that the requirements in the pipeline should be enough. This is actually a really good idea (we started with the groups before there were any requirements for pipelines).

Oh, fantastic! Thank you for the pointer to the docs. I'm sorry I hadn't seen that before.

Still, the linter and test runner errors are unchanged.
Linter:

Cannot find implementation or library stub for module named "pyairtable" [import]

Test runner:

ModuleNotFoundError: No module named 'pyairtable'

I'd appreciate advice on how we can move forward. Thank you!

@rudolfix (Contributor) commented on Aug 13, 2023

@willi-mueller you are doing everything right. It is poetry not adding your dependency to pyproject.toml:
python-poetry/poetry#7230
In our case it was the black linter config added in the wrong place. I fixed our pyproject.toml and will push to GitHub; then you can do poetry add pyairtable --group=airtable. I will ping you when done.
Thanks for finding another problem :)

@willi-mueller (Collaborator, Author) commented on Aug 15, 2023 via email

@rudolfix (Contributor) commented

I read the comments on the issue you linked. I can test it, make the PR, and only reach out if the issue is more complex. You must already have tons of things on your head.

OK! Here's the PR fixing black, but it may take a while to merge it: https://github.com/dlt-hub/verified-sources/pull/243/files#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711

@rudolfix (Contributor) commented

@willi-mueller the problem in pyproject.toml got fixed in #243

@willi-mueller force-pushed the 129-airtable-source branch 2 times, most recently from b49395c to 67dc58f, on August 17, 2023 10:50
@willi-mueller (Collaborator, Author) commented on Aug 17, 2023

@willi-mueller the problem in pyproject.toml got fixed in #243

🥳 Yes, it fixes most of the CI runs!

Now, in the dlthub repo the tests are red due to the missing airtable secrets. I'll reach out on slack.

I'm not sure if I understood your intention in #243:

I added DLT_SECRETS_TOML with my airtable access_token. However, the airtable tests are skipped as defined in test_on_local_destinations_forks.yml – and thus I'm not sure how to run my test suite in the CI of my fork. See the run here

@rudolfix (Contributor) commented

Now, in the dlthub repo the tests are red due to the missing airtable secrets. I'll reach out on slack.

I'm not sure if I understood your intention in #243:

I added DLT_SECRETS_TOML with my airtable access_token. However, the airtable tests are skipped as defined in test_on_local_destinations_forks.yml – and thus I'm not sure how to run my test suite in the CI of my fork. See the run here

dlt complains that it does not see your access token... You should have something like this:

    access_token="...."

or

    [sources.airtable]
    access_token="...."

Please make sure that you did one of the above. TOML has its quirks, i.e. if you provide the access token as in the first example, it must be at the very top of the file, above any table (square brackets).

Btw, maybe you've found another bug, let's see :)

@willi-mueller (Collaborator, Author) commented

Now, in the dlthub repo the tests are red due to the missing airtable secrets. I'll reach out on slack.
I'm not sure if I understood your intention in #243:
I added DLT_SECRETS_TOML with my airtable access_token. However, the airtable tests are skipped as defined in test_on_local_destinations_forks.yml – and thus I'm not sure how to run my test suite in the CI of my fork. See the run here

dlt complains that it does not see your access token... You should have something like this:

    access_token="...."

or

    [sources.airtable]
    access_token="...."

Please make sure that you did one of the above. TOML has its quirks, i.e. if you provide the access token as in the first example, it must be at the very top of the file, above any table (square brackets).

Btw, maybe you've found another bug, let's see :)

Thank you for your quick reply. Here is how I set it up after examining the contribution docs. The GitHub Actions secret for my fork:

(screenshot)

This is my sources/secrets.toml file, which has been working great locally. I copied lines 3 and 4 into the GitHub secret:

(screenshot)

Can you spot a mistake?

Would I be able to reproduce this behavior in my fork by doing:

  • pick another connector
  • make a nonsense change
  • add the expected secret
  • watch the CI run?

@rudolfix (Contributor) commented

@willi-mueller absolutely. Poke the facebook ads source maybe, and I'll give you a refresh token on Slack :). I cannot really find any problem with your code. If our experiment with facebook does not work, I'll try it myself from my private account.

@rudolfix added the "ci from fork" label (allows running tests from PRs coming from forks) on Aug 20, 2023
@rudolfix removed and re-added the "ci from fork" label on Aug 20, 2023
@rudolfix (Contributor) left a review

@willi-mueller this looks good! I have a few small suggestions to try; see the review.

Review comments on sources/airtable/README.md and sources/airtable/__init__.py (outdated, resolved).
@rudolfix removed and re-added the "ci from fork" label on Aug 27, 2023
Commits in this PR:

- creates source for a base which returns resources for airtables
- supports whitelisting airtables by names or ids
- implements test suite calling live API
- README for Airtable source

improves typing …ence warnings and improve documentation

refactoring according to the latest recommendations of pyairtable

simplify airtable filters by following the paradigm of the pyairtable API, which treats table names and table IDs equally

updates pyairtable dependency
@rudolfix removed and re-added the "ci from fork" label on Aug 30, 2023
@rudolfix self-assigned this on Aug 30, 2023
@rudolfix (Contributor) left a review

this is excellent! thanks!

@rudolfix merged commit a0f8f12 into dlt-hub:master on Aug 30, 2023. 16 checks passed.