Avoid storing duplicate information in the database #87

Open
PGijsbers opened this issue Nov 3, 2023 · 2 comments
Labels
database Something to address on a database level enhancement New feature or request

Comments

@PGijsbers
Contributor

Information may be stored multiple times in the database; this came to light in openml/openml-python#1289 (comment). We should avoid storing duplicate information in the database, because it can easily lead to multiple conflicting truths. This issue can be used to keep track of all duplicated data, with the intention of refactoring our database in the future to avoid these pitfalls:

  • Feature attribute information (e.g., ignore_attributes) is duplicated between the expdb.dataset table and the expdb.data_features table.
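
To make the "multiple truths" risk concrete, here is a minimal sketch of a consistency check between the two copies. It runs against an in-memory SQLite database, not the real one. The column names (`did`, `ignore_attribute`, `name`, `is_ignore`), the comma-separated storage format, and the helper `find_mismatches` are all assumptions made for illustration; the actual schema may differ.

```python
import sqlite3


def find_mismatches(conn):
    """Return ids of datasets whose two copies of the ignore list disagree."""
    mismatched = []
    datasets = conn.execute(
        "SELECT did, ignore_attribute FROM dataset ORDER BY did"
    ).fetchall()
    for did, ignore_csv in datasets:
        # Copy 1: comma-separated names stored on the dataset row (assumed format).
        from_dataset = {name for name in (ignore_csv or "").split(",") if name}
        # Copy 2: per-feature flag in the feature table (assumed column).
        rows = conn.execute(
            "SELECT name FROM data_features WHERE did = ? AND is_ignore = 1",
            (did,),
        )
        from_features = {name for (name,) in rows}
        if from_dataset != from_features:
            mismatched.append(did)
    return mismatched


# Hypothetical toy data: dataset 2 lists 'row_id' as ignored,
# but its feature row does not carry the flag.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (did INTEGER PRIMARY KEY, ignore_attribute TEXT);
CREATE TABLE data_features (did INTEGER, name TEXT, is_ignore INTEGER);
INSERT INTO dataset VALUES (1, 'id'), (2, 'row_id');
INSERT INTO data_features VALUES
    (1, 'id', 1), (1, 'age', 0), (2, 'row_id', 0), (2, 'label', 0);
""")
print(find_mismatches(conn))  # [2]
```

A check like this only reports disagreements; it cannot say which copy is correct, which is the core problem with storing the value twice.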
@PGijsbers PGijsbers added enhancement New feature or request database Something to address on a database level labels Nov 3, 2023
@amueller

amueller commented Nov 6, 2023

I assume this was done for efficiency; should we be using automatic view materialization instead? Or was it by accident?
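
As a rough illustration of the view idea, the sketch below keeps the per-feature flag as the only stored copy and derives the dataset-level list through a view. It uses SQLite, which only has plain (non-materialized) views, so it shows the single-source-of-truth aspect and not the caching aspect. The table layout, the view name `dataset_ignored`, and the helper `ignored_features` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Single stored copy: one flag per feature (assumed layout).
CREATE TABLE data_features (did INTEGER, name TEXT, is_ignore INTEGER);
INSERT INTO data_features VALUES
    (1, 'id', 1), (1, 'age', 0), (2, 'row_id', 1), (2, 'ts', 1);

-- Derived, never written to directly (hypothetical view name).
CREATE VIEW dataset_ignored AS
    SELECT did, name FROM data_features WHERE is_ignore = 1;
""")


def ignored_features(conn, did):
    """Return the sorted names of ignored features for one dataset."""
    rows = conn.execute(
        "SELECT name FROM dataset_ignored WHERE did = ? ORDER BY name", (did,)
    )
    return [name for (name,) in rows]


# Updating the one stored copy is immediately visible through the view.
conn.execute("UPDATE data_features SET is_ignore = 0 WHERE did = 2 AND name = 'ts'")
print(ignored_features(conn, 2))  # ['row_id']
```

A materialized variant would trade this always-current behaviour for read speed and would need a refresh strategy, which is the kind of trade-off that benchmarking would have to settle.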

@PGijsbers
Contributor Author

I wasn't involved with the database design, so I can't comment on why the duplication exists. I hope to discuss this with Jan later, but changes to the database likely won't happen in the next few months, as we are focusing on a (mostly) faithful reimplementation of the PHP REST API first. While this issue doesn't specifically mention it, potential changes to the database will be benchmarked and put into context with usage statistics, which will help us evaluate the alternatives. But in principle, the change outlined is something that should be looked at.
