
Set OrcConf.INCLUDE_COLUMNS for ORC reading #4933

Merged

Conversation

sperlingxx (Collaborator)

Signed-off-by: sperlingxx <lovedreamf@gmail.com>

Closes #3026

Following SPARK-35783, set OrcConf.INCLUDE_COLUMNS, just in case it might be important for the ORC methods we call, either today or in the future.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@sperlingxx sperlingxx requested review from jlowe and wbo4958 and removed request for jlowe March 11, 2022 02:57
@sperlingxx
Collaborator Author

build

@sameerz sameerz added this to the Feb 28 - Mar 18 milestone Mar 11, 2022
@wbo4958
Collaborator

wbo4958 commented Mar 11, 2022

LGTM

BTW, does our current rapids plugin read all ORC columns?

@sperlingxx
Collaborator Author

> LGTM
>
> BTW, does our current rapids plugin read all ORC columns?

I don't think so. There is a specialized helper function calcOrcFileIncluded that excludes unnecessary columns on a per-file basis:

    /**
     * Compute an array of booleans, one for each column in the ORC file, indicating whether the
     * corresponding ORC column ID should be included in the file to be loaded by the GPU.
     *
     * @param evolution ORC schema evolution instance
     * @return per-column inclusion flags
     */
    private def calcOrcFileIncluded(evolution: SchemaEvolution): Array[Boolean] = {
      if (requestedMapping.isDefined) {
        // ORC schema has no column names, so need to filter based on index
        val orcSchema = orcReader.getSchema
        val topFields = orcSchema.getChildren
        val numFlattenedCols = orcSchema.getMaximumId
        val included = new Array[Boolean](numFlattenedCols + 1)
        util.Arrays.fill(included, false)
        // first column is the top-level schema struct, always add it
        included(0) = true
        // find each top-level column requested by top-level index and add it and all child columns
        requestedMapping.get.foreach { colIdx =>
          val field = topFields.get(colIdx)
          (field.getId to field.getMaximumId).foreach { i =>
            included(i) = true
          }
        }
        included
      } else {
        evolution.getFileIncluded
      }
    }
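The flattened-ID arithmetic in calcOrcFileIncluded can be illustrated with a minimal, self-contained sketch. The `Field` case class below is a hypothetical stand-in for ORC's `TypeDescription`, which assigns pre-order flattened IDs so that a field's `getId..getMaximumId` range covers the field and all of its children:

```scala
// Sketch of calcOrcFileIncluded's index arithmetic, assuming ORC's pre-order
// flattened column IDs: for struct<a:int, b:struct<c:int, d:int>> the root is
// id 0, a is 1, b is 2, b.c is 3, b.d is 4.
object IncludeFlagsSketch {
  // Hypothetical stand-in for TypeDescription's (getId, getMaximumId) pair.
  final case class Field(id: Int, maximumId: Int)

  def calcIncluded(topFields: Seq[Field], maxId: Int, requested: Seq[Int]): Array[Boolean] = {
    val included = Array.fill(maxId + 1)(false)
    // first entry is the top-level schema struct, always included
    included(0) = true
    // for each requested top-level column, include it and all of its children
    requested.foreach { colIdx =>
      val f = topFields(colIdx)
      (f.id to f.maximumId).foreach(included(_) = true)
    }
    included
  }

  def main(args: Array[String]): Unit = {
    // struct<a:int, b:struct<c:int, d:int>>: a covers [1,1], b covers [2,4]
    val topFields = Seq(Field(1, 1), Field(2, 4))
    // requesting only top-level column b marks ids 0, 2, 3, 4 as included
    val flags = calcIncluded(topFields, maxId = 4, requested = Seq(1))
    println(flags.mkString(","))  // true,false,true,true,true
  }
}
```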

wbo4958 previously approved these changes Mar 14, 2022
    // any difference. Just in case it might be important for the ORC methods called by us,
    // either today or in the future.
    val includeColumns = requestedColIds.filter(_ != -1).sorted.mkString(",")
    conf.set(OrcConf.INCLUDE_COLUMNS.getAttribute, includeColumns)
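The value-building part of that snippet can be exercised in isolation. A minimal sketch, assuming (as in Spark's ORC reader) that -1 in requestedColIds marks a requested column that is absent from the file:

```scala
// Sketch of how the INCLUDE_COLUMNS value string is built: drop the -1
// sentinels (columns absent from the file), sort the remaining file column
// ids, and comma-join them.
object IncludeColumnsValue {
  def includeColumns(requestedColIds: Array[Int]): String =
    requestedColIds.filter(_ != -1).sorted.mkString(",")

  def main(args: Array[String]): Unit = {
    println(includeColumns(Array(2, -1, 0)))  // 0,2
  }
}
```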
jlowe (Member)

Curious why this is conditional on !canPruneCols? The equivalent Spark code setting this is not similarly conditional. We're calling buildOrcReader in either case which will examine the INCLUDE_COLUMNS setting, so it seems prudent to set it whether or not we can prune, minimally to help keep this code in sync with the Spark version.

sperlingxx (Collaborator Author)

Hi @jlowe, following your question, I found that we should set this conf on the opposite conditional branch, where canPruneCols is TRUE, because canPruneCols indicates whether we trust ORC to prune the columns or do it ourselves.

@wbo4958 please correct me if I am wrong. Thank you!

sperlingxx (Collaborator Author)

I also simplified the requestedColumnIds method.

jlowe (Member)

> I also simplified the requestedColumnIds method.

Please revert this change. The code was intentionally trying to be similar to the Apache Spark version. The same is true for the code calling this function.

> I found that we should set this conf on the opposite conditional branch, where canPruneCols is TRUE, because canPruneCols indicates whether we trust ORC to prune the columns or do it ourselves.

IMO we need to do what Spark is doing with respect to when and how it sets up INCLUDE_COLUMNS. Looking at the Spark code, it is always setting INCLUDE_COLUMNS as long as requestedColPruneInfo is non-empty. We should do the same.

sperlingxx (Collaborator Author)

Hi @jlowe, I reverted it. But I found a bigger problem: setting OrcConf.INCLUDE_COLUMNS is NOT compatible with the schema pruning applied in checkSchemaCompatibility, which leads to a failure when constructing SchemaEvolution: https://github.com/NVIDIA/spark-rapids/blob/branch-22.04/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScanBase.scala#L951.

jlowe (Member)

checkSchemaCompatibility is also derived from Apache Spark, so does its version have the same issue with INCLUDE_COLUMNS when pruning is applied?

sperlingxx (Collaborator Author)

@jlowe I failed to find checkSchemaCompatibility or anything similar in Apache Spark. Alternatively, I made some changes to checkSchemaCompatibility: it now prunes the include-status array by the field IDs of the fields pruned from the read schema, at the same time as the read schema itself is pruned.
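The pruning idea described above can be sketched in a few lines. This is a minimal illustration, not the actual plugin code; the `Field` case class and helper name are hypothetical stand-ins for ORC's `TypeDescription` fields and their `getId..getMaximumId` ranges:

```scala
// Hedged sketch: when a field is pruned from the read schema, clear its whole
// flattened-ID range [getId, getMaximumId] in the per-column inclusion array,
// keeping the inclusion flags consistent with the pruned schema.
object PruneIncludeSketch {
  // Hypothetical stand-in for a TypeDescription field and its id range.
  final case class Field(name: String, id: Int, maximumId: Int)

  def pruneIncluded(included: Array[Boolean], prunedFields: Seq[Field]): Array[Boolean] = {
    val out = included.clone()
    prunedFields.foreach { f =>
      (f.id to f.maximumId).foreach(out(_) = false)
    }
    out
  }

  def main(args: Array[String]): Unit = {
    // struct<a:int, b:struct<c:int, d:int>> with everything included...
    val included = Array(true, true, true, true, true)
    // ...then b (ids 2..4) is pruned from the read schema
    val out = pruneIncluded(included, Seq(Field("b", 2, 4)))
    println(out.mkString(","))  // true,true,false,false,false
  }
}
```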

@sperlingxx
Collaborator Author

build

This reverts commit 315c99c.
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@sperlingxx
Collaborator Author

build

@sperlingxx
Collaborator Author

build

@sperlingxx sperlingxx merged commit ae8f21b into NVIDIA:branch-22.04 Mar 19, 2022
@sperlingxx sperlingxx deleted the add_orc_conf_include_column branch March 19, 2022 00:53
Successfully merging this pull request may close these issues.

[FEA] [Audit]: Set the list of read columns in the task configuration to reduce reading of ORC data