Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maybe add a DataCatalog.from_file method #2967

Closed
noklam opened this issue Aug 24, 2023 · 14 comments
Closed

Maybe add a DataCatalog.from_file method #2967

noklam opened this issue Aug 24, 2023 · 14 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@noklam
Copy link
Contributor

noklam commented Aug 24, 2023

Description

Add a convenience method that creates a catalog from a single file

From

conf_loader = OmegaConfigLoader(conf_source= "conf")
conf_catalog = conf_loader.get("catalog")
catalog = DataCatalog.from_config(conf_catalog)

to

catalog = DataCatalog.from_file("catalog.yml")

Context

Originally posted by @astrojuanlu in #2819 (comment)

@noklam noklam added the Issue: Feature Request New feature or improvement to existing feature label Aug 24, 2023
@datajoely
Copy link
Contributor

This is bloody brilliant and I'm annoyed we didn't think of it before

@noklam
Copy link
Contributor Author

noklam commented Sep 13, 2023

Reflecting on this, #2965 seems like a pre-requisite on this one.

The from_file is simply a syntactic sugar factory method that avoid the need of config_loader. The root problem is a separate issue and we still need this for DataCatalog.from_file?

Cc @merelcht

@astrojuanlu
Copy link
Member

@yetudada :

does this couple the Data Catalog and ConfigLoader? And if so, what are the implications of that?

I'd say it's fine for a "syntactic sugar" method to introduce some coupling - in this case with the OmegaConfigLoader. Users that want to use a different config loader can always use the longer version:

conf_loader = MyFancyCustomConfigLoader(conf_source= "conf")
conf_catalog = conf_loader.get("catalog")
catalog = DataCatalog.from_config(conf_catalog)

@datajoely
Copy link
Contributor

What about:
DataCatalog.from_files("**/*_catalog.yml", config_loader=MyConfigLoader)

@astrojuanlu
Copy link
Member

@idanov and @merelcht made a point today in a meeting that we really shouldn't couple the catalog with the config loader, so maybe keep a from_file("file.yml") and do a simple yaml.load instead of using the conf loader at all?

@deepyaman
Copy link
Member

@idanov and @merelcht made a point today in a meeting that we really shouldn't couple the catalog with the config loader, so maybe keep a from_file("file.yml") and do a simple yaml.load instead of using the conf loader at all?

Maybe even use simple yaml.load by default, and provide option to provide optional config_loader argument like @datajoely suggests, if you want to keep the ability to use that, in contexts where it's available (including in the Kedro framework code). I guess it comes down to how often somebody could be using ConfigLoader + DataCatalog outside of scope where it's fully managed by Kedro.

@datajoely
Copy link
Contributor

I think a simple single file mode defeats the object of this

@idanov
Copy link
Member

idanov commented Sep 14, 2023

I don't think it's a good idea to couple the catalog with the loader. The tradeoff is not great - do bad engineering only to save users from typing or reading 1 extra line of code (which usually occurs only once ever in your notebook).

conf_loader = OmegaConfigLoader(conf_source= "conf")
catalog = DataCatalog.from_config(conf_loader["catalog"])

https://en.wikipedia.org/wiki/Coupling_(computer_programming)#Disadvantages_of_tight_coupling

@inigohidalgo
Copy link
Contributor

As a user I will say that being able to directly instantiate a catalog from a single config file, skipping the environment replacing and stuff would be a very cool way to minimize the entry barrier to start using the catalog, as defining datasets in a single yaml is a much lower-friction action than having to setup a conf/ directory structure each time.

It would then be trivial to add some jupyter line magic which will instantiate a catalog in the notebook if there's a "catalog.yml" file in the same directory or something along those lines.

You do run into the credentials issue though, as those wouldn't be injected, but as a one-off analysis a lot of the time I'll just hardcode whatever credentials at the top (we don't usually use tokens when coding interactively, as all our data is hosted on azure which allows for interactive AAD Authentication)

@noklam
Copy link
Contributor Author

noklam commented Sep 15, 2023

I agree we shouldn't go too far to tie Catalog with ConfigLoader, this is mostly a syntactic sugar.

If we keep this simple as a yaml.load for a single file, do you still think it's a bad idea? @idanov @merelcht

@merelcht
Copy link
Member

Using yaml.load will help with not coupling the config loader and data catalog, but I'm wondering if it will be useful enough for users. If that's the way we implement this, we wouldn't be able to load any omegaconf specific syntax properly. So it could only be used on very basic catalogs without templating etc.

Arguably, this from_file() method could be a step towards using Kedro as a Framework and then users would gradually transition to use from_config() and have more advanced syntax in their config. That could be a nice experience, but I don't know how we can make it clear to users that from_file() will be using a plain yaml.load under the hood, while from_config() will leverage our OmegaConfigLoader.

@astrojuanlu
Copy link
Member

astrojuanlu commented Sep 25, 2023

Using yaml.load will help with not coupling the config loader and data catalog, but I'm wondering if it will be useful enough for users.

I contend that this could be useful enough for simple cases, and we could tell our users "if you want to use OmegaConfigLoader advanced capabilities, then, well, go ahead and instantiate it the right way".

As an alternative syntactic sugar that (1) doesn't introduce coupling between DataCatalog and OmegaConfigLoader and (2) retains functionality beyond plain YAML:

from kedro.util import catalog_from_file

catalog = catalog_from_file("catalog.yml")

(exact location and name of this function to be defined)

Slightly annoying that this requires an extra import and that the DataCatalog class isn't visible, but at least fulfills the original desire of simplifying the API for a simple use case. I'd still rather have DataCatalog.from_file loading a single file with plain YAML (no templating).

@astrojuanlu
Copy link
Member

Spotted today in the Intake docs:

from intake import open_catalog
cat = open_catalog('catalog.yaml')

(granted, they don't support advanced features)

@astrojuanlu
Copy link
Member

Internal users seemed to love this API, but from our internal discussions, it has become apparent that we cannot do this without either (a) introducing coupling between the DataCatalog and the OmegaConfigLoader class, which has been regarded as a bad idea, or (b) introducing a stripped-down version of config loading (plain YAML only) that is not satisfactory either.

For cases in which someone wants to teach or explain Kedro concepts (and this will be ourselves when we have better "how to use Kedro from Python" docs, see #2855 and #410), it was noted that one can instantiate the DataCatalog without the config loader using a Python dictionary:

catalog = DataCatalog.from_config({
    "reviews": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/reviews.csv",
    }
})

Let's close this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
Archived in project
Development

No branches or pull requests

7 participants