
Change parameters for memory limit in Parquet chunked reader #10718

Merged
merged 15 commits into NVIDIA:branch-24.06 from refactor_parquet_reader
Apr 23, 2024

Conversation

ttnghia
Collaborator

@ttnghia ttnghia commented Apr 16, 2024

The parameter for setting a limit on memory usage of the Parquet chunked reader was implemented in #9991. However, its name, CHUNKED_SUBPAGE_READER, does not convey its purpose, and it is even less relevant now that it is going to be used for the ORC chunked reader as well.

This PR does the following:

  • Renames the parameter CHUNKED_SUBPAGE_READER to something more expressive (although a bit longer): LIMIT_CHUNKED_READER_MEMORY_USAGE.
  • Adds a new parameter, CHUNKED_READER_MEMORY_USAGE_RATIO, allowing the memory limit to be adjusted via config.

The parameters changed in this PR are also designed to be generic enough to be used by both the Parquet and ORC chunked readers.
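To make the interaction of the two new parameters concrete, here is a minimal, hypothetical sketch of how a boolean on/off switch (LIMIT_CHUNKED_READER_MEMORY_USAGE) and a ratio (CHUNKED_READER_MEMORY_USAGE_RATIO) could combine into a byte limit. The object, field names, and default values below are illustrative assumptions, not the PR's actual implementation:

```scala
// Hypothetical sketch: how the two new configs could combine into a byte limit.
object ChunkedReaderConfSketch {
  // Stand-ins for the real conf entries; the defaults here are illustrative only.
  val limitMemoryUsage: Boolean = true   // LIMIT_CHUNKED_READER_MEMORY_USAGE
  val memoryUsageRatio: Double = 0.25    // CHUNKED_READER_MEMORY_USAGE_RATIO

  /** Compute the chunked reader's memory limit in bytes, or 0 meaning "no limit". */
  def chunkedReaderMemoryLimit(totalGpuMemoryBytes: Long): Long =
    if (limitMemoryUsage) (totalGpuMemoryBytes * memoryUsageRatio).toLong
    else 0L
}
```

With a ratio of 0.25, a GPU with 1000 bytes of memory (an artificially small number for illustration) would yield a 250-byte limit.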

@ttnghia ttnghia added P0 Must have for release task Work required that improves the product but is not user facing improve labels Apr 16, 2024
@ttnghia ttnghia self-assigned this Apr 16, 2024
Comment on lines -587 to -598
val CHUNKED_READER = conf("spark.rapids.sql.reader.chunked")
  .doc("Enable a chunked reader where possible. A chunked reader allows " +
    "reading highly compressed data that could not be read otherwise, but at the expense " +
    "of more GPU memory, and in some cases more GPU computation.")
  .booleanConf
  .createWithDefault(true)

val CHUNKED_SUBPAGE_READER = conf("spark.rapids.sql.reader.chunked.subPage")
  .doc("Enable a chunked reader where possible for reading data that is smaller " +
    "than the typical row group/page limit. Currently this only works for parquet.")
  .booleanConf
  .createWithDefault(true)
Collaborator

24.02 and 24.04 went out with this, and spark.rapids.sql.reader.chunked.subPage is not marked as an internal conf. We should consider deprecation instead of outright removal.

Collaborator Author

That sounds good.

Collaborator Author

I've added back the old config with a deprecation warning.

Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@ttnghia ttnghia changed the title Change parameter name for memory limit in Parquet chunked reader Change parameters for memory limit in Parquet chunked reader Apr 16, 2024
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Comment on lines 2555 to 2565
if (deprecatedConf.isDefined) {
  logWarning(s"'${CHUNKED_SUBPAGE_READER.key}' is deprecated and is replaced by " +
    s"'${LIMIT_CHUNKED_READER_MEMORY_USAGE.key}'.")
  if (hasLimit.isDefined && hasLimit.get != deprecatedConf.get) {
    throw new IllegalStateException(s"Both '${CHUNKED_SUBPAGE_READER.key}' and " +
      s"'${LIMIT_CHUNKED_READER_MEMORY_USAGE.key}' are set but using different values.")
  }
  deprecatedConf.get
} else {
  hasLimit.getOrElse(true)
}
Collaborator

I think we should

  • Give precedence to the new conf if it is defined.
  • Warn, if the deprecated conf is also defined, that it is being ignored.
  • Never throw.

Collaborator Author
@ttnghia ttnghia Apr 17, 2024

I don't think we should select one and ignore the other. This is similar to other configs in Spark (like spark.sql.legacy.parquet.int96RebaseModeInRead), where the legacy config is not ignored. So here I always make sure that if both configs are set, they have the same value; otherwise, I just parse the one config that has been set.

Collaborator

Probably better to check with @felixcheung on what the policy should be, to avoid inconsistencies in our project. Here is another example where deprecation does not do anything drastic:

lazy val multiThreadReadNumThreads: Int = {
  // Use the largest value set among all the options.
  val deprecatedConfs = Seq(
    PARQUET_MULTITHREAD_READ_NUM_THREADS,
    ORC_MULTITHREAD_READ_NUM_THREADS,
    AVRO_MULTITHREAD_READ_NUM_THREADS)
  val values = get(MULTITHREAD_READ_NUM_THREADS) +: deprecatedConfs.flatMap { deprecatedConf =>
    val confValue = get(deprecatedConf)
    confValue.foreach { _ =>
      logWarning(s"$deprecatedConf is deprecated, use $MULTITHREAD_READ_NUM_THREADS. " +
        "Conflicting multithreaded read thread count settings will use the largest value.")
    }
    confValue
  }
  values.max
}

Collaborator Author

IMO, if we decide to ignore the old config, it's better to throw than to just print a warning and proceed. Users may not notice such warnings at all and receive a wrong outcome.

Collaborator

Your suggestion definitely has merit and should be considered. However, as a user I would prefer an imperfect rule that can be applied and easily remembered over explaining a piecemeal set of exceptions.

Collaborator
@revans2 revans2 Apr 18, 2024

I agree with @gerashegalov here.

When we rename a config, the old config should be deprecated. We should first check whether the new config is set; if so, its value is used. If not, we check whether the old config is set; if so, that value is used and a warning is output. If neither is set, the default is used.

I am fine with extra warnings or errors if both are set and differ. But if one is set and the other is not, go with the one that is set.
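The precedence rule described in this comment can be sketched as a small generic helper. This is a hypothetical illustration of the policy, not code from the PR; the function name and the injected `logWarning` callback are assumptions:

```scala
// Hypothetical sketch of the "new conf wins, old conf is a warned fallback" policy:
// use the new conf if set; otherwise fall back to the deprecated conf with a
// warning; otherwise use the default.
def resolveConf[T](
    newValue: Option[T],
    deprecatedValue: Option[T],
    default: T,
    logWarning: String => Unit): T = {
  if (newValue.isEmpty && deprecatedValue.isDefined) {
    logWarning("the old conf is deprecated; please set the new conf instead")
  }
  newValue.orElse(deprecatedValue).getOrElse(default)
}
```

Under this rule, `resolveConf(Some(false), Some(true), default = true, ...)` would quietly return `false`, since the new conf takes precedence over the conflicting deprecated one.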

Collaborator Author
@ttnghia ttnghia Apr 18, 2024

Thanks all. I've changed the conf-reading logic as we discussed:

  • Warn if the deprecated conf is set
  • Throw if both confs are set but with different values
  • Prioritize reading value from the new conf
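The three bullets above can be sketched as a small resolution function. This is a hypothetical illustration of the agreed behavior (warn on the deprecated conf, throw on a conflict, otherwise prefer the new conf), not the PR's actual code; the function name and signature are assumptions:

```scala
// Hypothetical sketch of the final resolution logic: warn when the deprecated
// conf is set, throw when both confs are set with different values, and
// otherwise read the new conf first, falling back to the old one, then to the
// default (true).
def resolveChunkedReaderLimit(
    newConf: Option[Boolean],
    deprecatedConf: Option[Boolean],
    logWarning: String => Unit): Boolean = {
  deprecatedConf.foreach { _ =>
    logWarning("the deprecated conf is set; please migrate to the new conf")
  }
  (newConf, deprecatedConf) match {
    case (Some(n), Some(d)) if n != d =>
      throw new IllegalStateException("Both confs are set but with different values.")
    case _ => newConf.orElse(deprecatedConf).getOrElse(true)
  }
}
```

Note that this differs from a pure-precedence rule only in the conflicting-values case, where it fails fast instead of silently picking one value.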

Collaborator

What is the point of prioritizing if you end up looking at both if both are set, just so you can throw the exception?

Collaborator Author

Your point makes sense: the prioritization here is indeed redundant, since we end up checking both configs anyway to make sure there are no surprises in the output.

Anyway, I've filed an issue to remove the deprecated config: #10735

Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@@ -69,7 +69,7 @@ public static class ReadBuilder {
private long maxBatchSizeBytes = Integer.MAX_VALUE;
private long targetBatchSizeBytes = Integer.MAX_VALUE;
private boolean useChunkedReader = false;
private boolean useSubPageChunked = false;
private long maxChunkedReaderMemoryUsageSizeBytes = 0;
Collaborator Author
@ttnghia ttnghia Apr 19, 2024

I'm not happy with this very long name used throughout this PR, but I don't have a better candidate. If you have a good name, please suggest it.

@ttnghia
Collaborator Author

ttnghia commented Apr 22, 2024

build

revans2
revans2 previously approved these changes Apr 23, 2024
Collaborator
@abellina abellina left a comment

I think it looks mostly good. The config marked startup only should be good because it's being used in a static context.

Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@ttnghia ttnghia requested a review from abellina April 23, 2024 15:19
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@ttnghia
Collaborator Author

ttnghia commented Apr 23, 2024

build

@ttnghia ttnghia merged commit ea41afc into NVIDIA:branch-24.06 Apr 23, 2024
43 checks passed
@ttnghia ttnghia deleted the refactor_parquet_reader branch April 23, 2024 19:54