
{lib}[foss/2022a] TensorFlow v2.9.1 w/ Python 3.10.4 + CUDA 11.7.0 #16620

Conversation

VRehnberg
Contributor

(created using eb --new-pr)

…tches: TensorFlow-2.9.1_fix-protobuf-include-def.patch
@VRehnberg VRehnberg marked this pull request as draft November 11, 2022 11:19
@VRehnberg
Contributor Author

Currently there is a problem with nsync_cv.h not being found from its dependency during the build. This might have to be solved similarly to how it was solved for protobuf.

@boegelbot
Collaborator

@VRehnberg: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/3444607748
Output from first failing test suite run:

FAIL: test_pr_sha256_checksums (test.easyconfigs.easyconfigs.EasyConfigTest)
Make sure changed easyconfigs have SHA256 checksums in place.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/easyconfigs.py", line 953, in test_pr_sha256_checksums
    self.assertTrue(len(checksum_issues) == 0, "No checksum issues:\n%s" % '\n'.join(checksum_issues))
AssertionError: No checksum issues:
Checksums missing for one or more sources/patches of extension astor in TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb: found 1 sources + 0 patches vs 0 checksums
Checksums missing for one or more sources/patches of extension dill in TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb: found 1 sources + 0 patches vs 0 checksums
Checksums missing for one or more sources/patches of extension tblib in TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb: found 1 sources + 0 patches vs 0 checksums
Checksums missing for one or more sources/patches of extension TensorFlow in TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb: found 1 sources + 10 patches vs 10 checksums
Checksums missing for one or more sources/patches of extension tensorflow-datasets in TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb: found 1 sources + 0 patches vs 0 checksums

======================================================================
FAIL: test_style_conformance (test.easyconfigs.styletests.StyleTest)
Check the easyconfigs for style
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/styletests.py", line 68, in test_style_conformance
    self.assertEqual(result, 0, error_msg)
  File "/opt/hostedtoolcache/Python/2.7.18/x64/lib/python2.7/site-packages/easybuild/base/testing.py", line 116, in assertEqual
    raise AssertionError("%s:\nDIFF%s:\n%s" % (msg, limit, ''.join(diff[:self.ASSERT_MAX_DIFF])))
AssertionError: There shouldn't be any code style errors (and/or warnings), found 2:
/home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb:175:121: E501 line too long (127 > 120 characters)
/home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb:196:121: E501 line too long (143 > 120 characters)

: 2 != 0:
DIFF:
- 2

----------------------------------------------------------------------
Ran 15691 tests in 678.346s

FAILED (failures=2)
ERROR: Not all tests were successful

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid,
or submit a pull request to https://github.com/boegel/boegelbot to fix the problem.

@surak
Contributor

surak commented Nov 16, 2022

In file included from tensorflow/core/platform/protobuf.cc:16:
./tensorflow/core/platform/protobuf.h:28:10: fatal error: google/protobuf/io/coded_stream.h: No such file or directory
   28 | #include "google/protobuf/io/coded_stream.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.

I've seen this error before, but I don't remember the fix.

@VRehnberg
Contributor Author

VRehnberg commented Nov 16, 2022 via email

@surak
Contributor

surak commented Nov 16, 2022

Adding

+    "coded_stream": ("google/protobuf/io/coded_stream.h", []),

right after the port_def line fixes this error.
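
For readers following along, a minimal sketch of what that mapping could look like after the change. The dict name WELL_KNOWN_PROTOS and the (header, imports) tuple layout are taken from the comments in this thread; the file location and the surrounding entries are only illustrative, not the actual patch.

# Illustrative sketch only -- the real file and neighbouring entries may differ.
# Each entry maps a short name to (header path, list of headers it pulls in).
WELL_KNOWN_PROTOS = {
    "port_def": ("google/protobuf/port_def.inc", []),
    # Added entry: lets the build resolve coded_stream.h from the protobuf dependency.
    "coded_stream": ("google/protobuf/io/coded_stream.h", []),
    # ... remaining existing entries ...
}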

@surak
Contributor

surak commented Nov 16, 2022

Adding the aforementioned line gets further, but then I get this:

ERROR: /easybuild/2020/build/TensorFlow/2.9.1/foss-2022a-CUDA-11.7.0/TensorFlow/tensorflow-2.9.1/tensorflow/core/BUILD:1640:16: Compiling tensorflow/core/util/work_sharder.cc failed: undeclared inclusion(s) in rule '//tensorflow/core:framework_internal_impl':
this rule is missing dependency declarations for the following files included by 'tensorflow/core/util/work_sharder.cc':
  'bazel-out/k8-opt/bin/external/com_google_protobuf/google/protobuf/io/coded_stream.h'
Target //tensorflow/tools/pip_package:build_pip_package failed to build

@VRehnberg
Contributor Author

One possible reason for the latest error you describe is that the includes in coded_stream.h need to be in the list of imports, i.e. the second value in the include tuples (see the sketch after the list below).

I've seen the following include errors so far:

  • <google/protobuf/port_def.inc>
  • "google/protobuf/io/coded_stream.h"
  • "nsync_cv.h" // NOLINT
  • "double-conversion/double-conversion.h"
  • "google/protobuf/duration.pb.h"

Adding duration.pb.h to the WELL_KNOWN_PROTOS dict gives a similar error to coded_stream.h, at least if the import list is empty. I haven't had time to test my hypothesis yet.
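
If that hypothesis holds, an entry whose header itself includes other protobuf headers would need those headers listed as its imports, roughly as in this sketch. Which headers coded_stream.h actually pulls in is an assumption here and would have to be checked against the protobuf sources.

# Hypothetical sketch: declare the headers that coded_stream.h itself includes
# as imports, so they are resolvable too (the import list below is a guess).
WELL_KNOWN_PROTOS = {
    "coded_stream": ("google/protobuf/io/coded_stream.h", [
        "google/protobuf/port_def.inc",  # example only; verify against protobuf
    ]),
    "duration": ("google/protobuf/duration.pb.h", []),
}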

@VRehnberg
Contributor Author

What I don't understand is this PR, tensorflow/tensorflow#44144, which seems to have removed many of the include files that we now want to link to.

@VRehnberg
Contributor Author

At that time it should have been sufficient to rebuild with an empty disk_cache, but either something has changed or that cache is very persistent. I've been building on different nodes and in different module trees with similar symptoms.

@VRehnberg
Contributor Author

VRehnberg commented Nov 18, 2022

This issue tensorflow/tensorflow#37861 describes our issue quite well. However, while the symptom is the same, the cause doesn't seem to be: CPATH is specified correctly in the builds, but include files are not found anyway.

The command that is run is
external/local_config_cuda/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc
which describes itself as a "Crosstool wrapper for compiling CUDA programs" and (presumably) uses nvcc.

Perhaps nvcc doesn't honor CPATH? I've got no idea. But it seems that going back to symlinking files is probably the wrong approach and should perhaps be avoided.
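
One quick way to test that suspicion would be to compile a tiny file that includes one of the missing headers through nvcc with CPATH set, along the lines of this sketch. The include path is a placeholder and needs to point at the protobuf module's include directory.

# Hypothetical diagnostic: does nvcc pick up headers from CPATH?
import os
import subprocess
import tempfile

src = '#include "google/protobuf/io/coded_stream.h"\nint main() { return 0; }\n'
with tempfile.NamedTemporaryFile("w", suffix=".cu", delete=False) as fh:
    fh.write(src)
    cu_file = fh.name

env = dict(os.environ, CPATH="/path/to/protobuf/include")  # placeholder path
result = subprocess.run(["nvcc", "-c", cu_file, "-o", os.devnull],
                        env=env, capture_output=True, text=True)
print(result.returncode)
print(result.stderr)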

@surak
Contributor

surak commented Nov 22, 2022

I couldn't care less about the protobuf version, and I would be happy to have a second protobuf in the system just for TensorFlow.

They mention protobuf 3.6.1 in their CI build, so I tried that.

As it also depends on protobuf-python, I created both from the 3.19.4 versions, just removing the checksums and changing the versions. protobuf-python 3.6.1 needs a little patch, which I made here: python3 patch for protobuf-python 3.6.0

But then I get this:

tensorflow/core/grappler/mutable_graph_view.cc:345:18: error: const class google::protobuf::MessageLite has no member named ShortDebugString

which is correct: protobuf 3.6.1 doesn't have ShortDebugString (protobuf 3.6 source; the first version that has it is 3.8.0).

So I tried 3.8.0 to see if there was any improvement. Which brought me to...

The same place:

./tensorflow/core/platform/protobuf.h:28:10: fatal error: google/protobuf/io/coded_stream.h: No such file or directory
   28 | #include "google/protobuf/io/coded_stream.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.

I don't know what to do from here.

@VRehnberg
Contributor Author

./tensorflow/core/platform/protobuf.h:28:10: fatal error: google/protobuf/io/coded_stream.h: No such file or directory
   28 | #include "google/protobuf/io/coded_stream.h"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.

google/protobuf/io/coded_stream.h exists under $PROTOBUF_INCLUDE_PATH, so either CPATH=$PROTOBUF_INCLUDE_PATH:$CPATH should be specified explicitly, or the build tool should pick up the existing CPATH. From the logs (building with eb --debug --trace) I can see that the correct CPATH is used for some SUBCOMMANDs in some actions.

With Bazel I learnt that the --action_env command-line option should be used to pass this along. At least on my system it is used, and CPATH does reach some actions, but not all of them. Right now I'm trying to find the first commit that introduced the error; then I should at least know which rule to patch.

This is how it has been done in the past: tensorflow/tensorflow#43019
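
For illustration, passing an environment variable through to Bazel actions with --action_env looks roughly like the sketch below when the build command is assembled from Python, as an EasyBlock would. The variable list and the loop are placeholders for illustration, not the actual EasyBlock code; the build target is the one from the logs above.

import os

# Sketch: forward selected environment variables into Bazel's action environment.
action_env_vars = ["CPATH", "LIBRARY_PATH", "LD_LIBRARY_PATH"]

bazel_cmd = ["bazel", "build"]
for var in action_env_vars:
    if var in os.environ:
        # --action_env=VAR exposes the current value of VAR to build actions.
        bazel_cmd.append("--action_env=%s" % var)
bazel_cmd.append("//tensorflow/tools/pip_package:build_pip_package")

print(" ".join(bazel_cmd))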

Other avenues are:

  • Figure out what goes wrong by understanding Bazel and TensorFlow such that you can tell how to fix it by looking at the log (I gave up on this in favor of brute force)
  • Create an issue with TensorFlow and hope that they know how to fix it (should probably create an issue sooner or later anyway)

@Flamefire
Contributor

Flamefire commented Dec 21, 2022

With Bazel I learnt that the --action_env command-line option should be used to pass this along. At least on my system it is used, and CPATH does reach some actions, but not all of them. Right now I'm trying to find the first commit that introduced the error; then I should at least know which rule to patch.

This is how it has been done in the past: tensorflow/tensorflow#43019

I started getting back into the TensorFlow ECs now and got 2.7.1 working on all platforms. Then I went to the next version we have, which is 2.8.4; it doesn't have a CUDA EC, so I added one and am testing it now.

I run into the same issue you see here: the CPATH action_env is not passed to some commands, resulting in e.g.

In file included from bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/compiler/mlir/tools/kernel_gen/compile_cache_item.pb.cc:4:
bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/compiler/mlir/tools/kernel_gen/compile_cache_item.pb.h:10:10: fatal error: google/protobuf/port_def.inc: No such file or directory
   10 | #include <google/protobuf/port_def.inc>

So the issue appeared somewhere between 2.7.1 and 2.8.4, and only for the CUDA version; the non-CUDA version compiles fine for me.

As for crosstool_wrapper_driver_is_not_gcc: this is a wrapper around the compiler, which can be either GCC or nvcc depending on what the build process (i.e. Bazel) passes to it. Not much going on there.

Previously I was able to "solve" the issue by replacing "exec_tools=" with "tools=" in the build files, but that doesn't seem to work now. I'm now trying to find the commit that introduced the issue in TF.
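
For reference, that earlier workaround boiled down to a plain text substitution over the Bazel build files, roughly like this sketch. The directory walked, the file filter, and the rationale in the comment (that this keeps the affected tools out of the exec configuration, where --action_env does not apply) are my assumptions here, not part of the original workaround's description.

import os

# Sketch of the earlier workaround: replace "exec_tools" with "tools" in
# BUILD/.bzl files, so the affected tools are (assumption) no longer built in
# the exec configuration, which does not see --action_env=CPATH.
for root, _dirs, files in os.walk("tensorflow"):
    for name in files:
        if name in ("BUILD", "BUILD.bazel") or name.endswith(".bzl"):
            path = os.path.join(root, name)
            with open(path) as fh:
                text = fh.read()
            if "exec_tools" in text:
                with open(path, "w") as fh:
                    fh.write(text.replace("exec_tools", "tools"))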

@Flamefire
Contributor

Happy new year! Just as an update on this: on the last working day of last year I was able to figure out why the includes were missing. I summarized that in an issue against Bazel, pointing out the commit I found via git bisect: tensorflow/tensorflow@07cbc7b

I'm also currently working on a fix which will solve the issue here and in other TF ECs by changing the EasyBlock. So I'd ask for a bit of patience while I'm testing this.

This patch doesn't actually solve the problem and is the wrong solution anyway.
@VRehnberg
Contributor Author

I'm also currently working on a fix which will solve the issue here and in other TF ECs by changing the EasyBlock. So I'd ask for a bit of patience while I'm testing this.

Happy new year and thanks for your effort! I started bisecting myself, but I kept running into other bugs for commits between releases, and feedback was so slow that I got tired of it and haven't touched it in a few weeks.

I wouldn't mind taking a look at something untested and could run tests on my end as well.

@boegelbot
Collaborator

@VRehnberg: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/3836480475
Output from first failing test suite run:

FAIL: test__parse_easyconfig_TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb (test.easyconfigs.easyconfigs.EasyConfigTest)
Test for easyconfig TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/easyconfigs.py", line 1515, in innertest
    template_easyconfig_test(self, spec_path)
  File "test/easyconfigs/easyconfigs.py", line 1409, in template_easyconfig_test
    self.assertTrue(os.path.isfile(ext_patch_full), msg)
AssertionError: Patch file /home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.9.1_fix-protobuf-include-def.patch is available for TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb

======================================================================
FAIL: test_pr_sha256_checksums (test.easyconfigs.easyconfigs.EasyConfigTest)
Make sure changed easyconfigs have SHA256 checksums in place.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/easyconfigs.py", line 960, in test_pr_sha256_checksums
    self.assertTrue(len(checksum_issues) == 0, "No checksum issues:\n%s" % '\n'.join(checksum_issues))
AssertionError: No checksum issues:
Checksums missing for one or more sources/patches of extension astor in TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb: found 1 sources + 0 patches vs 0 checksums
Checksums missing for one or more sources/patches of extension dill in TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb: found 1 sources + 0 patches vs 0 checksums
Checksums missing for one or more sources/patches of extension tblib in TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb: found 1 sources + 0 patches vs 0 checksums
Checksums missing for one or more sources/patches of extension TensorFlow in TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb: found 1 sources + 10 patches vs 10 checksums
Checksums missing for one or more sources/patches of extension tensorflow-datasets in TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb: found 1 sources + 0 patches vs 0 checksums

======================================================================
FAIL: test_style_conformance (test.easyconfigs.styletests.StyleTest)
Check the easyconfigs for style
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test/easyconfigs/styletests.py", line 68, in test_style_conformance
    self.assertEqual(result, 0, error_msg)
  File "/opt/hostedtoolcache/Python/2.7.18/x64/lib/python2.7/site-packages/easybuild/base/testing.py", line 116, in assertEqual
    raise AssertionError("%s:\nDIFF%s:\n%s" % (msg, limit, ''.join(diff[:self.ASSERT_MAX_DIFF])))
AssertionError: There shouldn't be any code style errors (and/or warnings), found 2:
/home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb:175:121: E501 line too long (127 > 120 characters)
/home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/easybuild/easyconfigs/t/TensorFlow/TensorFlow-2.9.1-foss-2022a-CUDA-11.7.0.eb:196:121: E501 line too long (143 > 120 characters)

: 2 != 0:
DIFF:
- 2

----------------------------------------------------------------------
Ran 16129 tests in 658.385s

FAILED (failures=3)
ERROR: Not all tests were successful

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid,
or submit a pull request to https://github.com/boegel/boegelbot to fix the problem.

@Flamefire
Contributor

Happy new year and thanks for your effort! I started bisecting myself, but I kept running into other bugs for commits between releases, and feedback was so slow that I got tired of it and haven't touched it in a few weeks.

I know your pain. It helped that I knew the bug was already present in 2.8.4 but not in 2.7.1, that I had 2 powerful machines (unintentional pun: PPC machines) for compiling, and that I was also using ccache with a hacked ad-hoc config.

I wouldn't mind taking a look at something untested and could run tests on my end as well.

Let me finish my changes to 2.8.4 first (which lacks a CUDA version, and that version runs into the same issue), and then we can tackle this one once the underlying issue is fixed.

@VRehnberg
Contributor Author

Closing in favor of #17092

@VRehnberg VRehnberg closed this Feb 15, 2023