Change parameters for memory limit in Parquet chunked reader #10718
Conversation
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
val CHUNKED_READER = conf("spark.rapids.sql.reader.chunked")
  .doc("Enable a chunked reader where possible. A chunked reader allows " +
    "reading highly compressed data that could not be read otherwise, but at the expense " +
    "of more GPU memory, and in some cases more GPU computation.")
  .booleanConf
  .createWithDefault(true)

val CHUNKED_SUBPAGE_READER = conf("spark.rapids.sql.reader.chunked.subPage")
  .doc("Enable a chunked reader where possible for reading data that is smaller " +
    "than the typical row group/page limit. Currently this only works for parquet.")
  .booleanConf
  .createWithDefault(true)
24.02 and 24.04 went out with this, and spark.rapids.sql.reader.chunked.subPage is not marked as an internal conf. We should consider deprecation instead of outright removal.
That sounds good.
I've added back the old config with deprecation warning.
if (deprecatedConf.isDefined) {
  logWarning(s"'${CHUNKED_SUBPAGE_READER.key}' is deprecated and is replaced by " +
    s"'${LIMIT_CHUNKED_READER_MEMORY_USAGE.key}'.")
  if (hasLimit.isDefined && hasLimit.get != deprecatedConf.get) {
    throw new IllegalStateException(s"Both '${CHUNKED_SUBPAGE_READER.key}' and " +
      s"'${LIMIT_CHUNKED_READER_MEMORY_USAGE.key}' are set but using different values.")
  }
  deprecatedConf.get
} else {
  hasLimit.getOrElse(true)
}
I think we should:
- Give precedence to the new conf if defined.
- WARN, if the deprecated conf is also defined, that it is being ignored.
- Never throw.
I don't think we should select one and ignore the other, similar to other configs in Spark (like spark.sql.legacy.parquet.int96RebaseModeInRead) where the legacy config is not ignored. So here I always make sure that if both configs are set then they must have the same value; otherwise I just parse the one config that has been set.
It is probably better to check with @felixcheung what the policy should be, to avoid inconsistencies in our project. Here is another example where deprecation does not do anything drastic:
spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
Lines 2602 to 2617 in 66f2cc5
lazy val multiThreadReadNumThreads: Int = {
  // Use the largest value set among all the options.
  val deprecatedConfs = Seq(
    PARQUET_MULTITHREAD_READ_NUM_THREADS,
    ORC_MULTITHREAD_READ_NUM_THREADS,
    AVRO_MULTITHREAD_READ_NUM_THREADS)
  val values = get(MULTITHREAD_READ_NUM_THREADS) +: deprecatedConfs.flatMap { deprecatedConf =>
    val confValue = get(deprecatedConf)
    confValue.foreach { _ =>
      logWarning(s"$deprecatedConf is deprecated, use $MULTITHREAD_READ_NUM_THREADS. " +
        "Conflicting multithreaded read thread count settings will use the largest value.")
    }
    confValue
  }
  values.max
}
IMO, if we decide to ignore the old config, it's better to throw instead of just printing a warning and then proceeding. Users may not be aware of such warnings at all and receive a wrong outcome.
Your suggestion definitely has merit and should be considered. However, as a user I would prefer an imperfect rule that can be applied and easily remembered over a piecemeal set of exceptions that have to be explained.
I agree with @gerashegalov here.
When we rename a config, the old config should be deprecated, and the new config should first check whether it is set. If so, the value of the new config is used. If not, it checks whether the old config is set; if so, that value is used and a warning is output. Otherwise, a default is used.
I am fine with extra warnings or errors if both are set and are different. But if one is set and the other is not, then go with the one that is set.
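The fallback policy described above can be sketched as a small helper. This is a minimal illustration only; the object and parameter names are hypothetical, not the actual RapidsConf implementation:

```scala
// A sketch of the renamed-conf fallback policy: prefer the new key, fall
// back to the deprecated key (with a warning), then fall back to a default.
// All names here are illustrative, not the real plugin code.
object ConfFallbackSketch {
  def resolve(
      newValue: Option[Boolean],
      deprecatedValue: Option[Boolean],
      default: Boolean,
      logWarning: String => Unit): Boolean = {
    newValue.getOrElse {
      deprecatedValue match {
        case Some(v) =>
          // Old conf is still honored, but the user is told to migrate.
          logWarning("the old conf is deprecated; please use the new conf instead")
          v
        case None => default
      }
    }
  }
}
```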
Thanks all. I've changed the conf-reading logic as we discussed:
- Warn if the deprecated conf is set
- Throw if both confs are set but with different values
- Prioritize reading value from the new conf
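That agreed-upon behavior can be sketched roughly as follows. This is an illustrative standalone sketch, not the code merged in this PR; the names are hypothetical:

```scala
// Sketch of the agreed resolution logic: warn when the deprecated conf is
// set, throw when both confs are set to different values, and otherwise
// prefer the value of the new conf. Names are illustrative only.
object FinalResolutionSketch {
  def limitChunkedReaderMemoryUsage(
      newConf: Option[Boolean],
      deprecatedConf: Option[Boolean],
      logWarning: String => Unit): Boolean = {
    deprecatedConf.foreach { oldVal =>
      logWarning("deprecated conf is set; please migrate to the new conf")
      newConf.foreach { newVal =>
        if (newVal != oldVal) {
          throw new IllegalStateException(
            "Both confs are set but using different values.")
        }
      }
    }
    // New conf wins when present; fall back to the deprecated conf, then
    // to the default of true.
    newConf.orElse(deprecatedConf).getOrElse(true)
  }
}
```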
What is the point of prioritizing if you end up looking at both if both are set, just so you can throw the exception?
Your point makes sense. The priority here is indeed redundant: we end up checking both configs anyway, to make sure there will be no surprises in the output.
Anyway, I've filed an issue to remove the deprecated config: #10735
@@ -69,7 +69,7 @@ public static class ReadBuilder {
   private long maxBatchSizeBytes = Integer.MAX_VALUE;
   private long targetBatchSizeBytes = Integer.MAX_VALUE;
   private boolean useChunkedReader = false;
-  private boolean useSubPageChunked = false;
+  private long maxChunkedReaderMemoryUsageSizeBytes = 0;
I'm not happy with this long name used throughout this PR, but I don't have a better candidate for it. If you have a good name, please suggest one.
build
I think it looks mostly good. Marking the config as startup-only should be fine because it's being used in a static context.
build
The parameter for setting a limit on the memory usage of the Parquet chunked reader was implemented in #9991. However, its name (CHUNKED_SUBPAGE_READER) does not make much sense, and is especially irrelevant given that it is also going to be used for the ORC chunked reader.

This PR does the following:
- Renames CHUNKED_SUBPAGE_READER into something more expressive (although a bit longer): LIMIT_CHUNKED_READER_MEMORY_USAGE.
- Adds a new config, CHUNKED_READER_MEMORY_USAGE_RATIO, allowing the memory limit to be adjusted from config.

The parameters changed in this PR are also curated to be generic enough for use in both the Parquet and ORC chunked readers.