Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(datasets): Added the Experimental ExternalTableDataset for Databricks #885

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

MinuraPunchihewa
Copy link
Contributor

@MinuraPunchihewa MinuraPunchihewa commented Oct 11, 2024

Description

This PR adds the ExternalTableDataset to support interactions with external tables in Databricks (Unity Catalog).

Fixes #817

Development notes

The ExternalTableDataset has been implemented by extending the BaseTableDataset that was added here.

These changes have been tested,

  1. Manually, by running the code against a Unity Catalog enabled workspace.
  2. Via the existing and newly added unit tests.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
@MinuraPunchihewa MinuraPunchihewa marked this pull request as ready for review October 11, 2024 12:03
@MinuraPunchihewa
Copy link
Contributor Author

Hey @noklam, @ankatiyar,
I've added the ExternalTableDataset as an experimental dataset here. Is there anything else I need to do here, like tests etc.?

@noklam
Copy link
Contributor

noklam commented Oct 11, 2024

@MinuraPunchihewa Tests are super welcomed, although we don't have a databricks cluster to run these tests unless there are ways to run this locally.

Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
@MinuraPunchihewa
Copy link
Contributor Author

MinuraPunchihewa commented Oct 11, 2024

@MinuraPunchihewa Tests are super welcomed, although we don't have a databricks cluster to run these tests unless there are ways to run this locally.

@noklam I just added some tests to cover the logic specific to ExternalTableDataset (outside of BaseTableDataset). I also made a few other changes to the existing tstssuch as removing some duplicate code and moving all of the fixtures to conftest.py.

Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
@noklam
Copy link
Contributor

noklam commented Oct 14, 2024

FAILED tests/databricks/test_base_table_dataset.py::TestBaseTableDataset::test_save_overwrite

Is this an expected fail?

Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
@MinuraPunchihewa
Copy link
Contributor Author

FAILED tests/databricks/test_base_table_dataset.py::TestBaseTableDataset::test_save_overwrite

Is this an expected fail?

@noklam It has been my experience that when overwriting data to an external table of a format other than Delta, the location has to be provided. Since I am an external location fixture here which requires an environment variable called DATABRICKS_EXTERNAL_LOCATION that does not exist in the test environment, yes, this is expected to fail.

I should have specified a different format though. I just made a commit with that change.

Can you give me an explanation of how these tests are running? There should be a Spark environment available for them to run at all, right?

@noklam
Copy link
Contributor

noklam commented Oct 14, 2024

Since this is an experimental dataset, test will go to here: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets_experimental/tests

This mean tests are not going to be run per every commit in CI. If you look at other Spark tests we run them in a local mode. With Databricks it's unclear whether it possible to run it in a local mode, especially with the platform UnityCatalog (which is different from the open source version).

Signed-off-by: Minura Punchihewa <minurapunchihewa17@gmail.com>
@MinuraPunchihewa
Copy link
Contributor Author

Since this is an experimental dataset, test will go to here: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets_experimental/tests

This mean tests are not going to be run per every commit in CI. If you look at other Spark tests we run them in a local mode. With Databricks it's unclear whether it possible to run it in a local mode, especially with the platform UnityCatalog (which is different from the open source version).

Got it. I just moved the tests; I had to copy the conftest.py from as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kedro-datasets: ExternalTableDataset for Databricks
2 participants