[DataCatalog]: Simplify the way to access catalog #3923

Open
ElenaKhaustova opened this issue Jun 4, 2024 · 3 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@ElenaKhaustova
Contributor

Description

Currently, there are two ways to access the catalog: call the DataCatalog.from_config() method, or instantiate a KedroSession, load the context, and access the catalog from there.

Users point out that:

  • accessing the catalog from a Kedro session is complex and requires an understanding of framework details, such as project creation and environment setup;
  • acquiring the catalog involves writing a lot of code and navigating through parameters that are out of the context of their work;
  • creating a Kedro session is too heavy for simple catalog reading tasks.

We propose exploring the feasibility of a clear and intuitive API for accessing the catalog directly from a Kedro project, either eliminating the need for a session or hiding session creation.

Context

The current method for acquiring the Data Catalog is cumbersome: users who simply want to access the catalog must first initiate a Kedro session and create a context. The pain point is the complexity and inconsistency of accessing the data catalog from a Kedro project. One user reports that obtaining the catalog typically means searching the Kedro documentation for the appropriate code snippet to copy and paste, which is inefficient. To work around this, the user wrote a custom function, catalog_from_project(), to streamline the process. The function simplifies the task, and the user suggests that a similar utility would improve accessibility and user experience if it were included in Kedro itself.

[Image: Screenshot 2024-06-04 at 14 24 08]

Frequent changes to the method for acquiring a Kedro catalog across versions (such as the changes from Kedro 0.16 to 0.17) make it difficult to maintain compatibility. Plugins such as Kedro-Viz have to implement complex logic to adapt to these version differences.

Some users suggest a read-only DataCatalog instance: a way to create a data catalog, at least for read-only use cases, that does not rely on creating a full-blown Kedro session.
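One possible reading of the read-only suggestion is a thin wrapper that forwards load() and rejects save(). The sketch below is an assumption about what users might mean, not an existing Kedro class: ReadOnlyCatalog and InMemoryCatalog are hypothetical names, and InMemoryCatalog is only a stand-in so the example runs without Kedro. How the wrapped catalog is built without a session is left open here.

```python
from typing import Any, Dict


class InMemoryCatalog:
    """Hypothetical stand-in exposing the load/save shape of a catalog."""

    def __init__(self, data: Dict[str, Any]) -> None:
        self._data = dict(data)

    def load(self, name: str) -> Any:
        return self._data[name]

    def save(self, name: str, data: Any) -> None:
        self._data[name] = data


class ReadOnlyCatalog:
    """Hypothetical wrapper: forwards reads, refuses writes."""

    def __init__(self, catalog: Any) -> None:
        self._catalog = catalog

    def load(self, name: str) -> Any:
        # Reads are delegated unchanged to the wrapped catalog.
        return self._catalog.load(name)

    def save(self, name: str, data: Any) -> None:
        # Writes are rejected so the wrapper cannot modify any dataset.
        raise PermissionError(f"Cannot save '{name}': catalog is read-only")


read_only = ReadOnlyCatalog(InMemoryCatalog({"cars": [1, 2, 3]}))
cars = read_only.load("cars")
```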

Implementation Notes

The session creation step is needed to apply hooks that can change the catalog upon loading, so it may be hard to eliminate session creation completely. We can consider encapsulating the session creation logic and providing an interface such as from kedro.framework.project.session.context import catalog and/or from kedro.framework.project import catalog, with or without session creation.
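A minimal sketch of what encapsulating session creation could look like, in the spirit of the catalog_from_project() helper mentioned in the Context section. This is neither the user's implementation from the screenshot nor an existing Kedro API: the function signature and the decision to import Kedro lazily inside the function are assumptions for illustration. It uses bootstrap_project and KedroSession.create, so a session is still created and hooks still apply when the context builds the catalog.

```python
from pathlib import Path
from typing import Optional, Union


def catalog_from_project(
    project_path: Union[str, Path] = ".", env: Optional[str] = None
) -> "DataCatalog":
    """Hypothetical helper: return the catalog of the Kedro project at project_path."""
    # Imported lazily (an assumption of this sketch) so that defining the
    # helper does not itself require the Kedro framework to be importable.
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    resolved_path = Path(project_path).resolve()

    # Read the project metadata and configure the project settings.
    bootstrap_project(resolved_path)

    # The session is hidden from the caller but still created, so hooks
    # that modify the catalog on loading are applied.
    with KedroSession.create(project_path=resolved_path, env=env) as session:
        return session.load_context().catalog
```

With a helper like this, callers would write catalog = catalog_from_project("path/to/project") and never touch the session directly, although the cost of creating the session remains.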

@ElenaKhaustova ElenaKhaustova added the Issue: Feature Request New feature or improvement to existing feature label Jun 4, 2024
@astrojuanlu
Member

astrojuanlu commented Jun 6, 2024

The boilerplate required to extract the catalog from the session is clear.

Do we have any insight on what's difficult about

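from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog

# Load configuration from the project's conf/ directory (base + local environments)
conf_loader = OmegaConfigLoader(conf_source="conf", base_env="base", default_run_env="local")
# Resolve the catalog configuration
conf_catalog = conf_loader["catalog"]

# Build the catalog directly from configuration, without a session
catalog = DataCatalog.from_config(conf_catalog)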

?

(Asking because this was discussed in #2967)

@astrojuanlu
Member

If we focus this issue on how to access the catalog for an existing project or session though, this is more of a Kedro Framework issue and not a DataCatalog API issue (which should stand on its own, unaware of the Framework).

@merelcht
Member

merelcht commented Jun 6, 2024

From reading this issue, it sounds to me that these users aren't aware that they can get the catalog via the config loader, as @astrojuanlu shows in the snippet above. We worked on improving that massively for the 0.19.0 release, so I would personally leave this for now and not do anything other than maybe going back to the people who mentioned this and sending them the docs.
