support distributed file path in cloud environment #1787

Merged: 7 commits into NVIDIA:branch-0.5 on Mar 18, 2021

Conversation

@NvTimLiu (Collaborator) commented Feb 22, 2021

LOCAL_JAR_PATH is set in the cloud environment to support alternate local jars that are NOT built from the source code.
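
For context, a minimal sketch of the intended use (the jar directory is an assumed example; the glob lines are taken from the diff reviewed below):

```bash
# Illustrative only: point the script at pre-built jars instead of a source build.
export LOCAL_JAR_PATH=/opt/prebuilt-jars   # assumed location, not from the PR

# From the PR diff: the script then resolves the test and UDF-example jars there.
TEST_JARS=$(echo "$LOCAL_JAR_PATH"/rapids-4-spark-integration-tests*.jar)
UDF_EXAMPLE_JARS=$(echo "$LOCAL_JAR_PATH"/rapids-4-spark-udf-examples*.jar)
```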

@NvTimLiu changed the title to "[WIP] support distributed file path in cloud environment" Feb 22, 2021
@NvTimLiu (Collaborator, Author):

Related issue #1640

@NvTimLiu (Collaborator, Author):

Related issues:
#1421
#1568

@NvTimLiu (Collaborator, Author) commented Feb 22, 2021

--rootdir does not support distributed file paths: #1785

@NvTimLiu (Collaborator, Author):

build

@pxLi added the "build" label (Related to CI / CD or cleanly building) Feb 23, 2021
@NvTimLiu changed the title to "support distributed file path in cloud environment" (dropped [WIP]) Feb 23, 2021
@NvTimLiu (Collaborator, Author):

build

In integration_tests/run_pyspark_from_build.sh:

```bash
TEST_JARS=$(echo "$LOCAL_JAR_PATH"/rapids-4-spark-integration-tests*.jar)
UDF_EXAMPLE_JARS=$(echo "$LOCAL_JAR_PATH"/rapids-4-spark-udf-examples*.jar)
## reset SCRIPTPATH to support alternate scripts path on the cloud environment
SCRIPTPATH="$LOCAL_JAR_PATH/integration_tests"
```
@NvTimLiu commented on this diff Feb 24, 2021:

No need to reset SCRIPTPATH; revert this line.

@revans2 (Collaborator) replied:
But it is not the SCRIPTPATH any longer. SCRIPTPATH is where this script resides and we are finding other things relative to that path. I would prefer to have a new variable like INTEGRATION_TEST_ROOT or something that can be set to SCRIPTPATH in the common case, and then set here to what we want it to be.

@NvTimLiu (Collaborator, Author):

@revans2 Could you please help review? Thanks!

@revans2 (Collaborator) left a review:

In addition I really think we need to update the document for this script to include LOCAL_JAR_PATH in integration_tests/README.md.

Before this change LOCAL_JAR_PATH was clear: it was the path that set where the dependency jars come from. After this change it looks like we are forcing a very specific directory layout, which needs to be documented very clearly, and we need to be sure none of our CI builds are using LOCAL_JAR_PATH differently; otherwise we need to separate how to set where to find the integration_tests directory from how to find the local jars.

I am also not sure how this fixes the issues you were seeing with the tests finding std_input_path.

@NvTimLiu (Collaborator, Author):

> Why was this causing issues where it does not cause issues in other places? It would be good to add a comment explaining this, because a few lines below it in the else clause.

Better to add this. Though it does not cause an error currently, PLUGIN_JARS=$(echo "$LOCAL_JAR_PATH"/rapids-4-spark*.jar) will find rapids-4-spark_2.12-SNAPSHOT.jar, rapids-4-spark-integration-tests*.jar, and rapids-4-spark-udf-examples*.jar. Here we only want PLUGIN_JARS=rapids-4-spark_2.12-SNAPSHOT.jar.

@revans2 (Collaborator) commented Feb 24, 2021:

> PLUGIN_JARS=$(echo "$LOCAL_JAR_PATH"/rapids-4-spark*.jar) will find rapids-4-spark_2.12-SNAPSHOT.jar, rapids-4-spark-integration-tests*.jar, and rapids-4-spark-udf-examples*.jar. Here we only want PLUGIN_JARS=rapids-4-spark_2.12-SNAPSHOT.jar.

Then let's update it in both places so it is consistent.
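
A minimal sketch of the narrowing being discussed; the tighter glob (pinning the underscore before the Scala version) is an illustration, not necessarily the PR's exact change:

```bash
# Broad glob: matches the plugin jar AND the integration-tests / udf-examples jars.
PLUGIN_JARS=$(echo "$LOCAL_JAR_PATH"/rapids-4-spark*.jar)

# Tighter glob (assumed fix): the underscore excludes rapids-4-spark-integration-tests*
# and rapids-4-spark-udf-examples*, leaving only rapids-4-spark_2.12-SNAPSHOT.jar.
PLUGIN_JARS=$(echo "$LOCAL_JAR_PATH"/rapids-4-spark_*.jar)
```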

@NvTimLiu (Collaborator, Author) commented Feb 24, 2021:

> But it is not the SCRIPTPATH any longer.

Sorry, we need NOT change SCRIPTPATH here, because [SCRIPTPATH="$( cd "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"; cd "$SCRIPTPATH"](https://github.com/NVIDIA/spark-rapids/blob/branch-0.4/integration_tests/run_pyspark_from_build.sh#L17) makes sure SCRIPTPATH is the script's own path (we already put run_pyspark_from_build.sh under the integration_tests dir). I'll revert it.

@NvTimLiu (Collaborator, Author) commented Feb 24, 2021:

> I also am not sure how this fixes the issues you were seeing with the tests finding std_input_path.

Since --rootdir can NOT work with an HDFS path (meaning we cannot use HDFS) and --std_input_path can NOT work with a relative path, I put integration_tests on all the YARN nodes under the absolute dir '$SCRIPTPATH'; the absolute path $SCRIPTPATH then works for both --rootdir and --std_input_path.

Why did it not work before? Previously we only put integration_tests on the YARN master node and transferred it to the YARN worker nodes via --files/--archives, so the integration_tests path was relative (which did NOT work with --std_input_path).
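
A hypothetical illustration of that earlier approach (archive name and alias are made up): shipping the test tree via spark-submit lands it in each executor's working directory under a relative path, which --std_input_path could not consume:

```bash
# Hypothetical: distribute the packaged test tree to the YARN workers.
# It is unpacked as ./integration_tests in each container's working
# directory, i.e. a relative path.
spark-submit \
  --master yarn \
  --archives integration_tests.tgz#integration_tests \
  runtests.py
```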

@NvTimLiu (Collaborator, Author):

> In addition I really think we need to update the document for this script to include LOCAL_JAR_PATH in integration_tests/README.md.

Will update README.md.

@revans2 (Collaborator) commented Feb 25, 2021:

This code change as it now stands looks fine, but it has very little to do with the title of this issue:

> support distributed file path in cloud environment

It also has almost nothing to do with any of the issues/bugs mentioned in this PR.

Before approving this I just want to be sure I understand how this fits into those other issues. Will there be follow-on work, especially for --std_input_path as we discussed in #1785, or do you think this is enough to resolve it?

@NvTimLiu (Collaborator, Author) commented Mar 1, 2021:

> It also has almost nothing to do with any of the issues/bugs mentioned in this PR.

Updated the Jenkins scripts to download the latest udf-example jar with _2.12 appended in the jar name.

TEST_TYPE is NOT set; do we need to configure TEST_TYPE=nightly for all the cloud environment IT pipelines? @jlowe @revans2
Test results on EMR are below:

```
23:49:36 SKIPPED [1] ../../../../../../integration_tests/src/main/python/conftest.py:468: rapids_udf_example_native is not configured to run
23:49:36 SKIPPED [105] ../../../../../../integration_tests/src/main/python/conftest.py:452: TPC-DS not configured to run
23:49:36 SKIPPED [4] ../../../../../../integration_tests/src/main/python/conftest.py:381: TPCxBB not configured to run
23:49:36 SKIPPED [11] ../../../../../../integration_tests/src/main/python/conftest.py:461: cudf_udf not configured to run
23:49:36 4715 passed, 21 skipped, 145 xfailed, 7 xpassed, 104 warnings in 5248.88s (1:27:28)
23:49:36
23:49:36 End of LogType:stdout
23:49:36 ***********************************************************************
```

PR #1829: retargeted to branch-0.5.

> Before approving this I just want to be sure I understand how this fits into those other issues. Will there be follow-on work, especially for --std_input_path as we discussed in #1785, or do you think this is enough to resolve it?

Yes, for the --std_input_path issue, we put the integration test scripts onto all of the distributed filesystem nodes, so $SCRIPTPATH can be found locally.
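
A hypothetical sketch of that sync step (node names and target path are assumptions):

```bash
# Hypothetical: place the test tree at the same absolute path on every node,
# so $SCRIPTPATH resolves locally everywhere.
for host in worker-1 worker-2 worker-3; do
  rsync -a integration_tests/ "$host:/opt/spark-rapids/integration_tests/"
done
```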

@NvTimLiu self-assigned this Mar 1, 2021
@NvTimLiu (Collaborator, Author) commented Mar 1, 2021:

Update the document for how to set 'LOCAL_JAR_PATH'

@NvTimLiu (Collaborator, Author) commented Mar 1, 2021:

build

@revans2 (Collaborator) commented Mar 1, 2021:

> Yes, for the --std_input_path issue, we put the integration test scripts onto all of the distributed filesystem nodes, so $SCRIPTPATH can be found locally.

I really don't like that. How is it less work to keep that data in sync on all of those nodes than it is to have a script copy it to a distributed file system? Or worse, it is not kept in sync, and when we change something in the data the tests start to fail.

Like I said, I think this code is fine as is, and if you want to merge it in we can, but I don't like the current plan for the input test files; to fix it I think we will need changes to these scripts.

@NvTimLiu (Collaborator, Author) commented Mar 1, 2021:

> I really don't like that. How is it less work to keep that data in sync on all of those nodes than it is to have a script copy it to a distributed file system? Or worse, it is not kept in sync, and when we change something in the data the tests start to fail.
>
> Like I said, I think this code is fine as is, and if you want to merge it in we can, but I don't like the current plan for the input test files; to fix it I think we will need changes to these scripts.

I agree with you; syncing the data onto all the nodes' SCRIPTPATH dir is ugly. But this way we do NOT need to change run_pyspark_from_build.sh.

I'd prefer to sync --std_input_path onto the distributed filesystem, too.

Since --rootdir does NOT support distributed file paths, to utilize the distributed file system for --std_input_path we would need to introduce, besides $SCRIPTPATH, 2 extra ENVs into run_pyspark_from_build.sh, e.g.:

1. DISTRIBUTEPATH=hdfs/s3a/gs/integration_tests/src/test/resources for --std_input_path=$INPUTPATH
2. LOCAL_ROOTDIR=./integration_tests for --rootdir=$LOCAL_ROOTDIR $LOCAL_ROOTDIR/src/main/python

```diff
-  "$SCRIPTPATH"/runtests.py --rootdir "$SCRIPTPATH" "$SCRIPTPATH"/src/main/python \
+  "$SCRIPTPATH"/runtests.py --rootdir "$LOCAL_ROOTDIR" "$LOCAL_ROOTDIR"/src/main/python \
       -v -rfExXs "$TEST_TAGS" \
-      --std_input_path="$SCRIPTPATH"/src/test/resources/ \
+      --std_input_path="$DISTRIBUTEPATH"/src/test/resources/ \
```

Does it seem OK to introduce the two extra ENVs into run_pyspark_from_build.sh?

@NvTimLiu changed the base branch from branch-0.4 to branch-0.5 Mar 2, 2021
@revans2 (Collaborator) commented Mar 2, 2021:

I am fine with having extra environment variables to help control run_pyspark_from_build.sh; having one that would override the default value for std_input_path sounds like a good change. And the LOCAL_ROOTDIR change also sounds fine.

NvTimLiu and others added 2 commits March 10, 2021 19:06
Under the cloud environment, overwrite the '--rootdir' param to point to the working directory of each executor
Under the cloud environment, overwrite the '--std_input_path' param to point to the distributed file path

Signed-off-by: Tim Liu <timl@nvidia.com>
@NvTimLiu (Collaborator, Author):

build

@NvTimLiu (Collaborator, Author):

> I am fine with having extra environment variables to help control run_pyspark_from_build.sh; having one that would override the default value for std_input_path sounds like a good change. And the LOCAL_ROOTDIR change also sounds fine.

@revans2 Added --rootdir LOCAL_ROOTDIR & --std_input_path CLOUD_INPUTPATH; could you please help review? Thanks!

@revans2 (Collaborator) left a review:

Just a minor nit.

```bash
## Under cloud environment, overwrite the '--rootdir' param to point to the working directory of each executor
LOCAL_ROOTDIR=${LOCAL_ROOTDIR:-"$SCRIPTPATH"}
## Under cloud environment, overwrite the '--std_input_path' param to point to the distributed file path
CLOUD_INPUTPATH=${CLOUD_INPUTPATH:-"$SCRIPTPATH"}
```
@revans2 (Collaborator):

nit: this is not a CLOUD_INPUTPATH in all cases. I think it would be better to just call it INPUT_PATH, i.e.

```bash
INPUTPATH=${INPUTPATH:-"$SCRIPTPATH"}
```

@NvTimLiu (Collaborator, Author):

Sounds good, updated!
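
A usage sketch under the final naming (per the commit below the variable becomes INPUT_PATH; the HDFS location is a made-up example that mirrors the integration_tests layout, since the script appends /src/test/resources):

```bash
# Local/default run: both variables fall back to $SCRIPTPATH, so nothing changes.
./integration_tests/run_pyspark_from_build.sh

# Cloud run (illustrative values): pytest collects from the executor-local tree,
# while the standard input data is read from a distributed path.
LOCAL_ROOTDIR=./integration_tests \
INPUT_PATH=hdfs:///data/integration_tests \
./integration_tests/run_pyspark_from_build.sh
```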

Change the var 'CLOUD_INPUTPATH' to 'INPUT_PATH'

Signed-off-by: Tim Liu <timl@nvidia.com>
@NvTimLiu (Collaborator, Author):

build

@revans2 (Collaborator) commented Mar 17, 2021:

@NvTimLiu could you please upmerge your branch? It looks like some changes to cudf went in and broke backwards compatibility.

@NvTimLiu (Collaborator, Author):

build

@NvTimLiu merged commit a8d3aea into NVIDIA:branch-0.5 Mar 18, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* support distributed file path in cloud environment

Signed-off-by: Tim Liu <timl@nvidia.com>

* reset SCRIPTPATH to support alternate scripts path on the cloud environment

* set PLUGIN_JARS correctly

* Update the document for how to set 'LOCAL_JAR_PATH'

* Overwrite parameters

Under the cloud environment, overwrite the '--rootdir' param to point to the working directory of each executor
Under the cloud environment, overwrite the '--std_input_path' param to point to the distributed file path

Signed-off-by: Tim Liu <timl@nvidia.com>

* change to use a more readable var

Change the var 'CLOUD_INPUTPATH' to 'INPUT_PATH'

Signed-off-by: Tim Liu <timl@nvidia.com>
Labels: build (Related to CI / CD or cleanly building)