Add Alluxio auto mount feature (#5925)
* Add Alluxio auto mount feature

Mount the cloud bucket to Alluxio when the driver converts FileSourceScanExec to a GPU plan
The Alluxio master must be on the same node as the Spark driver when using this feature
Introduce new configs:
    spark.rapids.alluxio.automount.enabled
    spark.rapids.alluxio.bucket.regex
    spark.rapids.alluxio.mount.cmd

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Set access key and secret key when mounting

Read access key and secret key from spark config or environment variables
Use the key when running alluxio mount
Default ALLUXIO_HOME as /opt/alluxio-2.8.0

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Print thread id

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Check mounted point

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Fix parameter mistake

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Add log

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Fix mount command

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Use whitespace to split

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Update docs

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Add synchronized to mount command

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Update some logs

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Update docs

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Use Properties to read Alluxio config

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Fix build error

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Add empty line to pass mvn verify

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Fix comments

Check both access key and secret
Update document to refer to auto mount section
Explain more about limitation
Use /bucket in mountedBucket to match fs mount output
Use camel case to name variable
Use URI to parse the fs mount output

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Address comments

Use logDebug
Write new functions to return the replaceFunc
Use URI to parse the scheme and bucket

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Fix the command without su user

Support running the alluxio command without su via Process(String)

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Don't use URI since s3 path may include space

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/AlluxioUtils.scala

Remove risk log

* Fix comments

Update docs
Add a space in runAlluxioCmd

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Fix the indentation

Signed-off-by: Gary Shen <gashen@nvidia.com>

* Set default value of alluxio.cmd

correct indent

Signed-off-by: Gary Shen <gashen@nvidia.com>
GaryShen2008 authored Jul 26, 2022
1 parent a08f74c commit 98f2571
Showing 4 changed files with 358 additions and 69 deletions.
5 changes: 4 additions & 1 deletion docs/configs.md
@@ -29,7 +29,10 @@ scala> spark.conf.set("spark.rapids.sql.incompatibleOps.enabled", true)

Name | Description | Default Value
-----|-------------|--------------
<a name="alluxio.pathsToReplace"></a>spark.rapids.alluxio.pathsToReplace|List of paths to be replaced with corresponding alluxio scheme. Eg, when configureis set to "s3:/foo->alluxio://0.1.2.3:19998/foo,gcs:/bar->alluxio://0.1.2.3:19998/bar", which means: s3:/foo/a.csv will be replaced to alluxio://0.1.2.3:19998/foo/a.csv and gcs:/bar/b.csv will be replaced to alluxio://0.1.2.3:19998/bar/b.csv|None
<a name="alluxio.automount.enabled"></a>spark.rapids.alluxio.automount.enabled|Enable the feature of auto-mounting cloud storage to Alluxio. It requires the Alluxio master to be on the same node as the Spark driver. When true, it requires the environment variable ALLUXIO_HOME to be set properly; the default value of ALLUXIO_HOME is "/opt/alluxio-2.8.0". You can set it as an environment variable when running spark-submit, or use spark.yarn.appMasterEnv.ALLUXIO_HOME to set it on Yarn. The Alluxio master's host and port are read from alluxio.master.hostname and alluxio.master.rpc.port (default: 19998) in ALLUXIO_HOME/conf/alluxio-site.properties. A cloud path matching spark.rapids.alluxio.bucket.regex, such as "s3://bar/b.csv", is then replaced with "alluxio://0.1.2.3:19998/bar/b.csv", and the bucket "s3://bar" is mounted to "/bar" in Alluxio automatically.|false
<a name="alluxio.bucket.regex"></a>spark.rapids.alluxio.bucket.regex|A regex to decide which buckets should be auto-mounted to Alluxio. E.g. when set to "^s3://bucket.*", a bucket which starts with "s3://bucket" will be mounted to Alluxio, and the path "s3://bucket-foo/a.csv" will be replaced with "alluxio://0.1.2.3:19998/bucket-foo/a.csv". It is only effective when spark.rapids.alluxio.automount.enabled=true. The default value matches all buckets with the "s3://" or "s3a://" scheme.|^s3a{0,1}://.*
<a name="alluxio.cmd"></a>spark.rapids.alluxio.cmd|Provide the Alluxio command, which is used to mount buckets or get information. The default value is "su,ubuntu,-c,/opt/alluxio-2.8.0/bin/alluxio", which means: run Process(Seq("su", "ubuntu", "-c", "/opt/alluxio-2.8.0/bin/alluxio fs mount --readonly /bucket-foo s3://bucket-foo")) to mount s3://bucket-foo to /bucket-foo. The delimiter "," is used to convert the value to a Seq[String] when you need a special user to run the mount command.|List(su, ubuntu, -c, /opt/alluxio-2.8.0/bin/alluxio)
<a name="alluxio.pathsToReplace"></a>spark.rapids.alluxio.pathsToReplace|List of paths to be replaced with the corresponding Alluxio scheme. E.g. when the config is set to "s3://foo->alluxio://0.1.2.3:19998/foo,gs://bar->alluxio://0.1.2.3:19998/bar", it means: "s3://foo/a.csv" will be replaced with "alluxio://0.1.2.3:19998/foo/a.csv" and "gs://bar/b.csv" will be replaced with "alluxio://0.1.2.3:19998/bar/b.csv". To use this config, you have to mount the buckets to Alluxio by yourself. If you set this config, spark.rapids.alluxio.automount.enabled won't take effect.|None
<a name="cloudSchemes"></a>spark.rapids.cloudSchemes|Comma separated list of additional URI schemes that are to be considered cloud based filesystems. Schemes already included: abfs, abfss, dbfs, gs, s3, s3a, s3n, wasbs. Cloud based stores generally would be totally separate from the executors and likely have a higher I/O read cost. Many times the cloud filesystems also get better throughput when you have multiple readers in parallel. This is used with spark.rapids.sql.format.parquet.reader.type|None
<a name="gpu.resourceName"></a>spark.rapids.gpu.resourceName|The name of the Spark resource that represents a GPU that you want the plugin to use if using custom resources with Spark.|gpu
<a name="memory.gpu.allocFraction"></a>spark.rapids.memory.gpu.allocFraction|The fraction of available (free) GPU memory that should be allocated for pooled memory. This must be less than or equal to the maximum limit configured via spark.rapids.memory.gpu.maxAllocFraction, and greater than or equal to the minimum limit configured via spark.rapids.memory.gpu.minAllocFraction.|1.0
42 changes: 42 additions & 0 deletions docs/get-started/getting-started-alluxio.md
@@ -244,10 +244,14 @@ NM_hostname_2
```

For other filesystems, please refer to [this site](https://www.alluxio.io/).
We also provide an auto-mount feature for easier usage.
Please refer to [Alluxio auto mount for AWS S3 buckets](#alluxio-auto-mount-for-aws-s3-buckets).

## RAPIDS Configuration

There are two ways to leverage Alluxio in RAPIDS.
We also provide an auto-mount option for AWS S3 buckets if Alluxio is installed in your Spark cluster.
Please refer to [Alluxio auto mount for AWS S3 buckets](#alluxio-auto-mount-for-aws-s3-buckets).

1. Explicitly specify the Alluxio path

@@ -312,6 +316,44 @@ There are two ways to leverage Alluxio in RAPIDS.
--conf spark.executor.extraJavaOptions="-Dalluxio.conf.dir=${ALLUXIO_HOME}/conf" \
```

## Alluxio auto mount for AWS S3 buckets

There's a more user-friendly way to use Alluxio with RAPIDS when accessing S3 buckets.
Suppose that a user has multiple buckets on AWS S3.
Using `spark.rapids.alluxio.pathsToReplace` requires mounting all the buckets beforehand
and listing the path replacements one by one in that config, which becomes tedious when
there are many buckets. To solve this problem, we added the Alluxio auto mount feature,
which mounts S3 buckets automatically as they are found in the input paths on the Spark driver.
This feature requires that the node running the Spark driver has Alluxio installed,
which means that node is also the master of the Alluxio cluster. It uses the `alluxio fs mount`
command to mount the buckets in Alluxio, so the uid running the Spark application must be able
to run the alluxio command: for example, the uid of the Spark application is the same as the
uid of the Alluxio service, or the Spark application can use `su alluxio_uid` to run the
alluxio command.
The simplest way to enable the Alluxio auto mount feature is to set the config below, without
setting `spark.rapids.alluxio.pathsToReplace`, which takes precedence over the auto mount feature.
``` shell
--conf spark.rapids.alluxio.automount.enabled=true
```
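Putting this together, a spark-submit invocation enabling auto mount might look like the sketch below. The application jar name is a placeholder, and the `spark.yarn.appMasterEnv.ALLUXIO_HOME` setting is only needed on Yarn when Alluxio lives in a non-default location:

``` shell
spark-submit \
  --conf spark.rapids.alluxio.automount.enabled=true \
  --conf spark.yarn.appMasterEnv.ALLUXIO_HOME=/opt/alluxio-2.8.0 \
  your-spark-app.jar
```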
If Alluxio is not installed in /opt/alluxio-2.8.0, you should set the environment variable `ALLUXIO_HOME`.
Additional configs:
``` shell
--conf spark.rapids.alluxio.bucket.regex="^s3a{0,1}://.*"
```
The regex is used to match the S3 URI, to decide which buckets should be auto-mounted.
The default value matches all URIs which start with `s3://` or `s3a://`.
For example, `^s3a{1,1}://foo.*` will match only `s3a://` buckets whose names start with `foo`.
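To see what the regex and replacement do, the following sketch emulates the match and the path rewrite with standard shell tools. This is an illustration only, not the plugin's actual (Scala) implementation; the master address is the example value used above:

``` shell
# Emulate the default bucket regex and the resulting path rewrite.
ALLUXIO_MASTER="0.1.2.3:19998"    # example master host:port, for illustration
BUCKET_REGEX='^s3a{0,1}://.*'     # default of spark.rapids.alluxio.bucket.regex

rewrite_path() {
  # If the URI matches the bucket regex, replace the cloud scheme
  # with the alluxio:// scheme pointing at the master.
  if echo "$1" | grep -Eq "$BUCKET_REGEX"; then
    echo "$1" | sed -E "s|^s3a?://|alluxio://$ALLUXIO_MASTER/|"
  else
    echo "$1"   # non-matching paths are left untouched
  fi
}

rewrite_path "s3://bucket-foo/a.csv"    # -> alluxio://0.1.2.3:19998/bucket-foo/a.csv
rewrite_path "gs://other-bucket/b.csv"  # unchanged: scheme does not match
```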
```shell
--conf spark.rapids.alluxio.cmd="su,ubuntu,-c,/opt/alluxio-2.8.0/bin/alluxio"
```
This cmd config defines the command sequence used to run the alluxio command as a specific user,
typically a user with permission to run Alluxio. By default the command is run as the `ubuntu` user.
If you have a different user or command path, you can redefine it.
The default value is suitable for running Alluxio with RAPIDS on Databricks.
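The sketch below illustrates how the comma-separated config value can be split into a command sequence and combined with the `fs mount` arguments. It only prints the flattened command for illustration; the plugin itself builds a Scala `Process(Seq(...))` where the final element is a single string:

``` shell
# Illustrative only: turn spark.rapids.alluxio.cmd into a mount command line.
ALLUXIO_CMD_CONF="su,ubuntu,-c,/opt/alluxio-2.8.0/bin/alluxio"

build_mount_cmd() {
  bucket="$1"; src="$2"
  # Split the config on "," (the Seq[String] delimiter described above).
  prefix=$(echo "$ALLUXIO_CMD_CONF" | tr ',' ' ')
  # Append the alluxio sub-command and the mount arguments.
  echo "$prefix fs mount --readonly /$bucket $src"
}

build_mount_cmd "bucket-foo" "s3://bucket-foo"
# -> su ubuntu -c /opt/alluxio-2.8.0/bin/alluxio fs mount --readonly /bucket-foo s3://bucket-foo
```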

## Alluxio Troubleshooting

This section provides links about how to configure and tune Alluxio, as well as some troubleshooting tips.
