[new]: `get_file(destination_table, duckdb_sql)` #165

AntoineGiraud · 2024-10-01T12:01:39Z

Check the idea has not already been suggested

I could not find my idea in existing issues

Edit the title above with self-explanatory function name and argument names

The function name and the argument names I entered in the title above seems self explanatory to me.

BigFunction Description as it would appear in the documentation

With this function, we will use duckdb web reading capability to download data out of :

url
S3 & other bucket (GCS / HuggingFace ...)

& let BigFunction's load it into a bigQuery schema.table of your choice

Function inspired from recent bigFunctions.get_csv

Examples of (arguments, expected output) as they would appear in the documentation

example :

select bigfunction.eu.load_file('bigfun_loads.station_info', 'from "https://gbfs.velobixi.com/gbfs/en/station_information.json"')

select bigfunction.eu.load_file('bigfun_loads.mtl_sectors', '''
  with t as (
      select unnest(features) feat
      from read_json_auto("https://www.donneesquebec.ca/recherche/dataset/b57cdeb1-98e7-4db7-bb84-32530f0367eb/resource/95ab084b-727e-4322-9433-0fed7baa690d/download/artm-sm-od13.geojson", sample_size=-1)
  )
  select
      feat.properties.SM13::int sector_id,
      feat.properties.SM13_nom sector_name,
      ST_GeomFromGeoJSON(feat.geometry::json) sector_geom,
      ST_Centroid(sector_geom) sector_centroid,
      1
  from t
''')

The text was updated successfully, but these errors were encountered:

AntoineGiraud · 2024-10-02T13:37:02Z

In a 2nd phase,
we could add DuckDB secret support :)

CREATE SECRET secret1 (
    TYPE S3,
    KEY_ID 'AKIAIOSFODNN7EXAMPLE',
    SECRET 'wJalrXUtnFEMI/xxxxxx/bPxRfiCYEXAMPLEKEY',
    REGION 'us-east-1'
);

before executing user's query

SELECT *
FROM 's3://my-bucket/file.parquet';

AntoineGiraud · 2024-10-02T19:36:19Z

Other option : configure some file types that we can query through DuckDB

select bigfunction.eu.load_file(
    'your_project.your_dataset.random_sales', -- table_name
    'csv', -- src_type
    "https://xxxxx", -- url
    '{}' -- optional_args
);

solution : build SQL query using ibis dataframe API

import ibis
con = ibis.duckdb.connect()

some simple csv

select bigfunction.eu.load_file(
    'your_project.your_dataset.random_sales',
    'csv',
    "https://raw.githubusercontent.com/AntoineGiraud/dbt_hypermarche/refs/heads/main/input/achats.csv",
    '{}'
);

will be interpreted as

achats = con.read_csv(
    "https://raw.githubusercontent.com/AntoineGiraud/dbt_hypermarche/refs/heads/main/input/achats.csv"
)
print(achats)
print(achats.to_pandas())

crappy csv : codes_postaux

select bigfunction.eu.load_file(
    'your_project.your_dataset.dim_french_postalcodes',
    'csv',
    "https://www.data.gouv.fr/fr/datasets/r/2f75293b-3ee5-4cb5-971b-93e754dc96ea",
    '''{
      "columns": {
          "code_commune_insee": "VARCHAR",
          "nom_commune_insee": "VARCHAR",
          "code_postal": "VARCHAR",
          "lb_acheminement": "VARCHAR",
          "ligne_5": "VARCHAR",
      },
      "delim": ";",
      "skip": 1,
    }'''
);

will be interpreted as

codes_postaux = con.read_csv(
    "https://www.data.gouv.fr/fr/datasets/r/2f75293b-3ee5-4cb5-971b-93e754dc96ea",
    columns={
        "code_commune_insee": "VARCHAR",
        "nom_commune_insee": "VARCHAR",
        "code_postal": "VARCHAR",
        "lb_acheminement": "VARCHAR",
        "ligne_5": "VARCHAR",
    },
    delim=";",
    skip=1,
)
print(codes_postaux)
print(codes_postaux.to_pandas())

some json

select bigfunction.eu.load_file(
    'your_project.your_dataset.dim_french_departements',
    'json',
    "https://geo.api.gouv.fr/departements?fields=nom,code,codeRegion,region",
    '{}'
);

dep = con.read_json(
    "https://geo.api.gouv.fr/departements?fields=nom,code,codeRegion,region"
)
print(dep)
print(dep.to_pandas())

AntoineGiraud · 2024-10-03T14:56:29Z

implemented in a new commit in PR #166 :)

* feat: init function * fix: remove ; * feat: add constraint thanks to ibis

AntoineGiraud added the new-bigfunction Suggest a New BigFunction label Oct 1, 2024

AntoineGiraud mentioned this issue Oct 2, 2024

feat(#165): add load_file with duckdb #166

Merged

unytics pushed a commit that referenced this issue Oct 4, 2024

feat(#165): add load_file with duckdb (#166)

caa9f08

* feat: init function * fix: remove ; * feat: add constraint thanks to ibis

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[new]: `get_file(destination_table, duckdb_sql)` #165

[new]: `get_file(destination_table, duckdb_sql)` #165

AntoineGiraud commented Oct 1, 2024 •

edited

Loading

AntoineGiraud commented Oct 2, 2024

AntoineGiraud commented Oct 2, 2024 •

edited

Loading

AntoineGiraud commented Oct 3, 2024 •

edited

Loading

[new]: get_file(destination_table, duckdb_sql) #165

[new]: get_file(destination_table, duckdb_sql) #165

Comments

AntoineGiraud commented Oct 1, 2024 • edited Loading

Check the idea has not already been suggested

Edit the title above with self-explanatory function name and argument names

BigFunction Description as it would appear in the documentation

Examples of (arguments, expected output) as they would appear in the documentation

AntoineGiraud commented Oct 2, 2024

AntoineGiraud commented Oct 2, 2024 • edited Loading

some simple csv

crappy csv : codes_postaux

some json

AntoineGiraud commented Oct 3, 2024 • edited Loading

[new]: `get_file(destination_table, duckdb_sql)` #165

[new]: `get_file(destination_table, duckdb_sql)` #165

AntoineGiraud commented Oct 1, 2024 •

edited

Loading

AntoineGiraud commented Oct 2, 2024 •

edited

Loading

AntoineGiraud commented Oct 3, 2024 •

edited

Loading