Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partitioning fixes #9207

Merged
merged 5 commits into from
Feb 14, 2024
Merged

Partitioning fixes #9207

merged 5 commits into from
Feb 14, 2024

Conversation

esheppa
Copy link
Contributor

@esheppa esheppa commented Feb 12, 2024

Which issue does this PR close?

Closes #9206.

Rationale for this change

  • Aligns the code behaviour with the documentation

What changes are included in this PR?

  • Altering the ListingTableUrl::contains method to skip hive partitions where they appear as a segment in the url
  • Makes the list_all_files method public to allow users to more easily see what files are being included with the ListingTableUrl

Are these changes tested?

Yes a new test is added to ensure that the changes continue to behave as expected

Are there any user-facing changes?

  • An additional public function on ListingTableUrl
  • Extra files may be listed for a given ListingTableUrl which previously were not included. This could cause query results to change

@github-actions github-actions bot added the core Core DataFusion crate label Feb 12, 2024
@alamb
Copy link
Contributor

alamb commented Feb 13, 2024

cc @devinjdangelo

@@ -234,7 +239,7 @@ impl ListingTableUrl {
}

/// List all files identified by this [`ListingTableUrl`] for the provided `file_extension`
pub(crate) async fn list_all_files<'a>(
pub async fn list_all_files<'a>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it related to the PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to separate this out if preferred. I made this change because it allows easier debugging for future users who are unsure which files the ListingTableUrl will actually list. This means one can see whether the issue is with the url itself, or another issue

@alamb
Copy link
Contributor

alamb commented Feb 13, 2024

Sorry I didn't make it to this PR today -- I plan to review it tomorrow if no one beats me to it

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @esheppa -- this change makes sense to me. I think it is well documented and well tested. Nice work.

I had one small comment suggestion but I don't think it is required and we can merge this PR without it.

Thanks again

};

// remove any segments that contain `=` as they are allowed even
// when ignore subdirectories is `true`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// when ignore subdirectories is `true`
// when ignore subdirectories is `true`.
// For example a directory like `"year=2024"` is a hive
// partition containing data for the year 2024.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, easy fix

}
}

/// Returns `true` if `path` refers to a collection of objects
pub fn is_collection(&self) -> bool {
self.url.as_str().ends_with('/')
self.url.as_str().ends_with(DELIMITER)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated this to use the DELIMITER as elsewhere

.next()
.map_or(false, |file_name| glob.matches(file_name))
} else {
let stripped = segments.join(DELIMITER);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated this to use the DELIMITER as elsewhere

let has_subdirectory = segments.count() > 1;
!has_subdirectory
} else {
true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to change this to something like:

None => !ignore_subdirectory || segments.count() <= 1,

however perhaps that is making things less clear

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you have added the unit test coverage I think it would be fine to change. You could also do as a follow on PR if you prefer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excellent, I've added one more commit

@alamb alamb merged commit 528fdfb into apache:main Feb 14, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 14, 2024

Thanks again @esheppa

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ExecutionOptions::ignore_subdirectory skipping hive style partitions
3 participants