
Change parameters for memory limit in Parquet chunked reader #10718

Merged
merged 15 commits into NVIDIA:branch-24.06 from refactor_parquet_reader
Apr 23, 2024

Conversation

ttnghia
Collaborator

@ttnghia ttnghia commented Apr 16, 2024

The parameter for setting a limit on memory usage of the Parquet chunked reader was implemented in #9991. However, its name, CHUNKED_SUBPAGE_READER, does not convey its purpose, and it is even less relevant now that it is going to be used for the ORC chunked reader as well.

This PR does the following:

  • Renames the parameter CHUNKED_SUBPAGE_READER to something more expressive (although a bit longer): LIMIT_CHUNKED_READER_MEMORY_USAGE.
  • Adds a new parameter, CHUNKED_READER_MEMORY_USAGE_RATIO, allowing the memory limit to be adjusted via config.

The parameters changed in this PR are also designed to be generic enough to be used by both the Parquet and ORC chunked readers.
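To make the interaction of the two new parameters concrete, here is a minimal, hypothetical sketch of how a boolean on/off switch (LIMIT_CHUNKED_READER_MEMORY_USAGE) and a ratio (CHUNKED_READER_MEMORY_USAGE_RATIO) could combine into a byte limit. The object, field names, and default values below are illustrative assumptions, not the PR's actual implementation:

```scala
// Hypothetical sketch: how the two new configs could combine into a byte limit.
object ChunkedReaderConfSketch {
  // Stand-ins for the real conf entries; the defaults here are illustrative only.
  val limitMemoryUsage: Boolean = true   // LIMIT_CHUNKED_READER_MEMORY_USAGE
  val memoryUsageRatio: Double = 0.25    // CHUNKED_READER_MEMORY_USAGE_RATIO

  /** Compute the chunked reader's memory limit in bytes, or 0 meaning "no limit". */
  def chunkedReaderMemoryLimit(totalGpuMemoryBytes: Long): Long =
    if (limitMemoryUsage) (totalGpuMemoryBytes * memoryUsageRatio).toLong
    else 0L
}
```

With a ratio of 0.25, a GPU with 1000 bytes of memory (an artificially small number for illustration) would yield a 250-byte limit.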

@ttnghia ttnghia added P0 Must have for release task Work required that improves the product but is not user facing improve labels Apr 16, 2024
@ttnghia ttnghia self-assigned this Apr 16, 2024
Comment on lines -587 to -598
val CHUNKED_READER = conf("spark.rapids.sql.reader.chunked")
  .doc("Enable a chunked reader where possible. A chunked reader allows " +
    "reading highly compressed data that could not be read otherwise, but at the expense " +
    "of more GPU memory, and in some cases more GPU computation.")
  .booleanConf
  .createWithDefault(true)

val CHUNKED_SUBPAGE_READER = conf("spark.rapids.sql.reader.chunked.subPage")
  .doc("Enable a chunked reader where possible for reading data that is smaller " +
    "than the typical row group/page limit. Currently this only works for parquet.")
  .booleanConf
  .createWithDefault(true)
Collaborator

24.02 and 24.04 went out with this, and spark.rapids.sql.reader.chunked.subPage is not marked as an internal conf. We should consider deprecation instead of outright removal.

Collaborator Author

That sounds good.

Collaborator Author

I've added back the old config with a deprecation warning.

Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@ttnghia ttnghia changed the title Change parameter name for memory limit in Parquet chunked reader Change parameters for memory limit in Parquet chunked reader Apr 16, 2024
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Comment on lines 2555 to 2565
if (deprecatedConf.isDefined) {
  logWarning(s"'${CHUNKED_SUBPAGE_READER.key}' is deprecated and is replaced by " +
    s"'${LIMIT_CHUNKED_READER_MEMORY_USAGE.key}'.")
  if (hasLimit.isDefined && hasLimit.get != deprecatedConf.get) {
    throw new IllegalStateException(s"Both '${CHUNKED_SUBPAGE_READER.key}' and " +
      s"'${LIMIT_CHUNKED_READER_MEMORY_USAGE.key}' are set but using different values.")
  }
  deprecatedConf.get
} else {
  hasLimit.getOrElse(true)
}
Collaborator

I think we should

  • Give precedence to the new conf if it is defined.
  • Warn, if the deprecated conf is also defined, that it is being ignored.
  • Never throw.

Collaborator Author
@ttnghia ttnghia Apr 17, 2024

I don't think we should select one and ignore the other. This is similar to other configs in Spark (like spark.sql.legacy.parquet.int96RebaseModeInRead), where the legacy config is not ignored. So here I always make sure that if both configs are set, they have the same value; otherwise, I just parse the one config that has been set.

Collaborator

Probably better to check with @felixcheung on what the policy should be, to avoid inconsistencies in our project. Here is another example where deprecation does not do anything drastic:

lazy val multiThreadReadNumThreads: Int = {
  // Use the largest value set among all the options.
  val deprecatedConfs = Seq(
    PARQUET_MULTITHREAD_READ_NUM_THREADS,
    ORC_MULTITHREAD_READ_NUM_THREADS,
    AVRO_MULTITHREAD_READ_NUM_THREADS)
  val values = get(MULTITHREAD_READ_NUM_THREADS) +: deprecatedConfs.flatMap { deprecatedConf =>
    val confValue = get(deprecatedConf)
    confValue.foreach { _ =>
      logWarning(s"$deprecatedConf is deprecated, use $MULTITHREAD_READ_NUM_THREADS. " +
        "Conflicting multithreaded read thread count settings will use the largest value.")
    }
    confValue
  }
  values.max
}

Collaborator Author

IMO, if we decide to ignore the old config, it's better to throw than to just print a warning and proceed. Users may not notice such warnings at all and receive a wrong outcome.

Collaborator

Your suggestion definitely has merit and should be considered. However, as a user I would prefer an imperfect rule that can be applied and easily remembered over explaining a piecemeal set of exceptions.

Collaborator
@revans2 revans2 Apr 18, 2024

I agree with @gerashegalov here.

When we rename a config, the old config should be deprecated. We should first check whether the new config is set; if so, its value is used. If not, we check whether the old config is set; if so, that value is used and a warning is output. If neither is set, the default is used.

I am fine with extra warnings or errors if both are set and differ. But if one is set and the other is not, go with the one that is set.
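The precedence rule described in this comment can be sketched as a small generic helper. This is a hypothetical illustration of the policy, not code from the PR; the function name and the injected `logWarning` callback are assumptions:

```scala
// Hypothetical sketch of the "new conf wins, old conf is a warned fallback" policy:
// use the new conf if set; otherwise fall back to the deprecated conf with a
// warning; otherwise use the default.
def resolveConf[T](
    newValue: Option[T],
    deprecatedValue: Option[T],
    default: T,
    logWarning: String => Unit): T = {
  if (newValue.isEmpty && deprecatedValue.isDefined) {
    logWarning("the old conf is deprecated; please set the new conf instead")
  }
  newValue.orElse(deprecatedValue).getOrElse(default)
}
```

Under this rule, `resolveConf(Some(false), Some(true), default = true, ...)` would quietly return `false`, since the new conf takes precedence over the conflicting deprecated one.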

Collaborator Author
@ttnghia ttnghia Apr 18, 2024

Thanks all. I've changed the conf-reading logic as we discussed:

  • Warn if the deprecated conf is set
  • Throw if both confs are set but with different values
  • Prioritize reading value from the new conf
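The three bullets above can be sketched as a small resolution function. This is a hypothetical illustration of the agreed behavior (warn on the deprecated conf, throw on a conflict, otherwise prefer the new conf), not the PR's actual code; the function name and signature are assumptions:

```scala
// Hypothetical sketch of the final resolution logic: warn when the deprecated
// conf is set, throw when both confs are set with different values, and
// otherwise read the new conf first, falling back to the old one, then to the
// default (true).
def resolveChunkedReaderLimit(
    newConf: Option[Boolean],
    deprecatedConf: Option[Boolean],
    logWarning: String => Unit): Boolean = {
  deprecatedConf.foreach { _ =>
    logWarning("the deprecated conf is set; please migrate to the new conf")
  }
  (newConf, deprecatedConf) match {
    case (Some(n), Some(d)) if n != d =>
      throw new IllegalStateException("Both confs are set but with different values.")
    case _ => newConf.orElse(deprecatedConf).getOrElse(true)
  }
}
```

Note that this differs from a pure-precedence rule only in the conflicting-values case, where it fails fast instead of silently picking one value.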

Collaborator

What is the point of prioritizing if you end up looking at both if both are set, just so you can throw the exception?

Collaborator Author

Your point makes sense: the prioritization here is indeed redundant, since we end up checking both configs anyway to make sure there are no surprises in the output.

Anyway, I've filed an issue to remove the deprecated config: #10735

Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@@ -69,7 +69,7 @@ public static class ReadBuilder {
private long maxBatchSizeBytes = Integer.MAX_VALUE;
private long targetBatchSizeBytes = Integer.MAX_VALUE;
private boolean useChunkedReader = false;
private boolean useSubPageChunked = false;
private long maxChunkedReaderMemoryUsageSizeBytes = 0;
Collaborator Author
@ttnghia ttnghia Apr 19, 2024

I'm not happy with this very long name used throughout this PR, but I don't have a better candidate. If you have a good name, please suggest it.

@ttnghia
Collaborator Author

ttnghia commented Apr 22, 2024

build

revans2
revans2 previously approved these changes Apr 23, 2024
Collaborator
@abellina abellina left a comment

I think it looks mostly good. The config marked startup only should be good because it's being used in a static context.

Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@ttnghia ttnghia requested a review from abellina April 23, 2024 15:19
Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@ttnghia
Collaborator Author

ttnghia commented Apr 23, 2024

build

@ttnghia ttnghia merged commit ea41afc into NVIDIA:branch-24.06 Apr 23, 2024
43 checks passed
@ttnghia ttnghia deleted the refactor_parquet_reader branch April 23, 2024 19:54