Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[new]: get_file(destination_table, duckdb_sql) #165

Open
2 tasks done
AntoineGiraud opened this issue Oct 1, 2024 · 3 comments
Open
2 tasks done

[new]: get_file(destination_table, duckdb_sql) #165

AntoineGiraud opened this issue Oct 1, 2024 · 3 comments
Labels
new-bigfunction Suggest a New BigFunction

Comments

@AntoineGiraud
Copy link
Contributor

AntoineGiraud commented Oct 1, 2024

Check the idea has not already been suggested

Edit the title above with self-explanatory function name and argument names

  • The function name and the argument names I entered in the title above seems self explanatory to me.

BigFunction Description as it would appear in the documentation

With this function, we will use duckdb web reading capability to download data out of :

  • url
  • S3 & other bucket (GCS / HuggingFace ...)

& let BigFunction's load it into a bigQuery schema.table of your choice

Function inspired from recent bigFunctions.get_csv

Examples of (arguments, expected output) as they would appear in the documentation

example :

select bigfunction.eu.load_file('bigfun_loads.station_info', 'from "https://gbfs.velobixi.com/gbfs/en/station_information.json"')
select bigfunction.eu.load_file('bigfun_loads.mtl_sectors', '''
  with t as (
      select unnest(features) feat
      from read_json_auto("https://www.donneesquebec.ca/recherche/dataset/b57cdeb1-98e7-4db7-bb84-32530f0367eb/resource/95ab084b-727e-4322-9433-0fed7baa690d/download/artm-sm-od13.geojson", sample_size=-1)
  )
  select
      feat.properties.SM13::int sector_id,
      feat.properties.SM13_nom sector_name,
      ST_GeomFromGeoJSON(feat.geometry::json) sector_geom,
      ST_Centroid(sector_geom) sector_centroid,
      1
  from t
''')
@AntoineGiraud
Copy link
Contributor Author

In a 2nd phase,
we could add DuckDB secret support :)

CREATE SECRET secret1 (
    TYPE S3,
    KEY_ID 'AKIAIOSFODNN7EXAMPLE',
    SECRET 'wJalrXUtnFEMI/xxxxxx/bPxRfiCYEXAMPLEKEY',
    REGION 'us-east-1'
);

before executing user's query

SELECT *
FROM 's3://my-bucket/file.parquet';

@AntoineGiraud
Copy link
Contributor Author

AntoineGiraud commented Oct 2, 2024

Other option : configure some file types that we can query through DuckDB

select bigfunction.eu.load_file(
    'your_project.your_dataset.random_sales', -- table_name
    'csv', -- src_type
    "https://xxxxx", -- url
    '{}' -- optional_args
);

solution : build SQL query using ibis dataframe API

import ibis
con = ibis.duckdb.connect()

some simple csv

select bigfunction.eu.load_file(
    'your_project.your_dataset.random_sales',
    'csv',
    "https://raw.githubusercontent.com/AntoineGiraud/dbt_hypermarche/refs/heads/main/input/achats.csv",
    '{}'
);

will be interpreted as

achats = con.read_csv(
    "https://raw.githubusercontent.com/AntoineGiraud/dbt_hypermarche/refs/heads/main/input/achats.csv"
)
print(achats)
print(achats.to_pandas())

crappy csv : codes_postaux

select bigfunction.eu.load_file(
    'your_project.your_dataset.dim_french_postalcodes',
    'csv',
    "https://www.data.gouv.fr/fr/datasets/r/2f75293b-3ee5-4cb5-971b-93e754dc96ea",
    '''{
      "columns": {
          "code_commune_insee": "VARCHAR",
          "nom_commune_insee": "VARCHAR",
          "code_postal": "VARCHAR",
          "lb_acheminement": "VARCHAR",
          "ligne_5": "VARCHAR",
      },
      "delim": ";",
      "skip": 1,
    }'''
);

will be interpreted as

codes_postaux = con.read_csv(
    "https://www.data.gouv.fr/fr/datasets/r/2f75293b-3ee5-4cb5-971b-93e754dc96ea",
    columns={
        "code_commune_insee": "VARCHAR",
        "nom_commune_insee": "VARCHAR",
        "code_postal": "VARCHAR",
        "lb_acheminement": "VARCHAR",
        "ligne_5": "VARCHAR",
    },
    delim=";",
    skip=1,
)
print(codes_postaux)
print(codes_postaux.to_pandas())

image

some json

select bigfunction.eu.load_file(
    'your_project.your_dataset.dim_french_departements',
    'json',
    "https://geo.api.gouv.fr/departements?fields=nom,code,codeRegion,region",
    '{}'
);
dep = con.read_json(
    "https://geo.api.gouv.fr/departements?fields=nom,code,codeRegion,region"
)
print(dep)
print(dep.to_pandas())

image

@AntoineGiraud
Copy link
Contributor Author

AntoineGiraud commented Oct 3, 2024

implemented in a new commit in PR #166 :)

unytics pushed a commit that referenced this issue Oct 4, 2024
* feat: init function

* fix: remove ;

* feat: add constraint thanks to ibis
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new-bigfunction Suggest a New BigFunction
Projects
None yet
Development

No branches or pull requests

1 participant