Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41755: [C++][ORC] Ensure setting detected ORC version #41767

Merged
merged 1 commit into from
May 27, 2024

Conversation

h-vetinari
Copy link
Contributor

@h-vetinari h-vetinari commented May 21, 2024

FindorcAlt.cmake doesn't set orcAlt_VERSION when it finds ORC by find_library()/find_path(). If orcAlt_VERSION isn't set, ORC version detection by caller is failed.

cpp/src/arrow/adapters/orc/adapter.cc uses detected ORC version. If detected ORC version isn't correct, needless time zone database check is used.

Deployed in conda-forge through conda-forge/arrow-cpp-feedstock#1424 and confirmed as working in conda-forge/pyarrow-feedstock#122

@h-vetinari h-vetinari requested a review from wgtmac as a code owner May 21, 2024 21:36
@h-vetinari h-vetinari changed the title teach CheckTimeZoneDatabaseAvailability how to inspect conda environm… teach CheckTimeZoneDatabaseAvailability how to inspect conda environments May 21, 2024
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@github-actions github-actions bot added the awaiting review Awaiting review label May 21, 2024
@h-vetinari h-vetinari changed the title teach CheckTimeZoneDatabaseAvailability how to inspect conda environments GH-41755: teach CheckTimeZoneDatabaseAvailability how to inspect conda environments May 21, 2024
Copy link

⚠️ GitHub issue #41755 has been automatically assigned in GitHub to PR creator.

@wgtmac wgtmac changed the title GH-41755: teach CheckTimeZoneDatabaseAvailability how to inspect conda environments GH-41755: [C++][ORC] Teach CheckTimeZoneDatabaseAvailability how to inspect conda environments May 22, 2024
return Status::Invalid(
"IANA time zone database is unavailable but required by ORC."
" Please install it to /usr/share/zoneinfo or set TZDIR env to the installed"
" directory");
" directory. If you are using conda, simply install the package `tzdata`.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, it would be annoying to non-conda users to see this irrelevant error message. Perhaps we can add this only when CONDA_PREFIX is set but TZDB is not installed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how that would be annoying TBH. It just adds one more way to fix the problem, and the user can choose the appropriate solution. I mean, I can do what you ask, but I feel it's implementation complexity for very little (or even negative) gain.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou WDYT?

Copy link
Member

@kou kou May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either is OK.
I don't know why this is happen with conda. Can we add tzdata to dependencies of arrow-cpp or orc conda packages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why this is happen with conda.

I described it in the issue (#41755) and in #36026: it's due to apache/orc#1882, which was the only way to avoid injecting TZDIR as an environment variable into every conda environment using arrow on windows, which would have been very intrusive.

Can we add tzdata to dependencies of arrow-cpp or orc conda packages?

That already happened in conda-forge/pyarrow-feedstock#122 (and backports to 15.x, 14.x, 13.x). In general, someone needs to sync the feedstocks back to the repo here, but that's unrelated to this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ORC 2.0.1 can find tzdb without setting TZDIR with conda (CONDA_PREFIX), right?

Yes. If the environment variable CONDA_PREFIX is set, the code will prefer looking in $CONDA_PREFIX/share/zoneinfo, which is what's shipped through tzdata.

libarrow conda package uses ORC 2.0.1, right?

As of conda-forge/arrow-cpp-feedstock#1421 (and the various backports), yes.

So this message will not be shown for conda users. If this message is shown with ORC 2.0.1, ORC version detection may be failed.

Indeed something might be going wrong there, because conda-forge/pyarrow-feedstock#122 failed before I backported this PR (in conda-forge/arrow-cpp-feedstock#1424), despite building against orc 2.0.1 already

Can we add tzdata to dependencies of orc only on Windows? If we can do it, it will solve this problem.

We can add tzdata as a dependency on orc, but I want to make this unconditional. It's better to always rely on conda-forge's tzdata, rather than potentially out-of-date stuff in /usr/share/zoneinfo (aside from not introducing unnecessary divergence between platforms).

In any case, the presence of tzdata itself is not enough. We either need to ensure that ARROW_ORC_NEED_TIME_ZONE_DATABASE_CHECK correctly gets set to false on orc >=2.0.1, or do something like this PR. I think this PR is better, because you cannot know at compile-time if the package will end up in a conda-environment, and IMO this check should match what orc does, i.e. check CONDA_PREFIX.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed something might be going wrong there

Could you try this?

diff --git a/cpp/cmake_modules/FindorcAlt.cmake b/cpp/cmake_modules/FindorcAlt.cmake
index 289416678a..ce8cd11b4c 100644
--- a/cpp/cmake_modules/FindorcAlt.cmake
+++ b/cpp/cmake_modules/FindorcAlt.cmake
@@ -71,4 +71,5 @@ if(orcAlt_FOUND)
                           PROPERTIES IMPORTED_LOCATION "${ORC_STATIC_LIB}"
                                      INTERFACE_INCLUDE_DIRECTORIES "${ORC_INCLUDE_DIR}")
   endif()
+  set(orcAlt_VERSION ${ORC_VERSION})
 endif()

We can add tzdata as a dependency on orc, but I want to make this unconditional. It's better to always rely on conda-forge's tzdata, rather than potentially out-of-date stuff in /usr/share/zoneinfo (aside from not introducing unnecessary divergence between platforms).

It makes sense.

We either need to ensure that ARROW_ORC_NEED_TIME_ZONE_DATABASE_CHECK correctly gets set to false on orc >=2.0.1, or do something like this PR. I think this PR is better

We don't want to maintain this code as much as possible ORC 2.0.0 or later have the same check. So we want to choose the former.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you try this?

Confirmed in conda-forge/arrow-cpp-feedstock#1432 that this works, thank you! Do you want to open a separate PR, or should I change this PR to your one-line suggestion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I change this PR to your one-line suggestion?

Could you do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@github-actions github-actions bot added awaiting committer review Awaiting committer review awaiting changes Awaiting changes and removed awaiting review Awaiting review awaiting committer review Awaiting committer review labels May 22, 2024
Suggested-By: Sutou Kouhei <kou@clear-code.com>
@h-vetinari h-vetinari changed the title GH-41755: [C++][ORC] Teach CheckTimeZoneDatabaseAvailability how to inspect conda environments GH-41755: [C++][ORC] ensure orc version gets set correctly May 27, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 27, 2024
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels May 27, 2024
@kou kou changed the title GH-41755: [C++][ORC] ensure orc version gets set correctly GH-41755: [C++][ORC] Ensure setting detected ORC version May 27, 2024
@kou kou merged commit ff9921f into apache:main May 27, 2024
36 of 37 checks passed
@kou kou removed the awaiting merge Awaiting merge label May 27, 2024
@h-vetinari h-vetinari deleted the orc branch May 27, 2024 08:05
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit ff9921f.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 114 possible false positives for unstable benchmarks that are known to sometimes produce them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants