
Fix a few minor things with scale test #9200

Merged: 2 commits into NVIDIA:branch-23.10 on Sep 12, 2023

Conversation

@revans2 (Collaborator) commented Sep 7, 2023

This fixes #9198

It also

  • Increases the types of columns that can be used for min/max operations.
  • Adds a human-readable name for every output column, so comparing results will be simpler.
  • Adjusts the ratio of SUM vs MIN/MAX columns to be even instead of random.
  • Removes duplication in the query names: previously a name was stored both in the map and in the TestQuery; now it lives only in the TestQuery, and the map is built from that.
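The even SUM vs MIN/MAX split mentioned above could be illustrated with a minimal sketch (the column names are hypothetical and this is not the PR's actual code; it only shows an alternating split instead of a random one):

```scala
// Hypothetical illustration: alternate columns between SUM and MIN/MAX
// instead of assigning each column a category at random.
val columns = Seq("a", "b", "c", "d", "e")

// partition by index parity: even positions go to SUM, odd to MIN/MAX,
// so the two groups differ in size by at most one.
val (sumCols, minMaxCols) = columns.zipWithIndex.partition {
  case (_, i) => i % 2 == 0
}

// sumCols.map(_._1)    == Seq("a", "c", "e")
// minMaxCols.map(_._1) == Seq("b", "d")
```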

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
@revans2 (Collaborator, Author) commented Sep 7, 2023

build

@razajafri (Collaborator) left a comment:

I'm not familiar with this code. I've made one pass and will go over it one more time in a bit.

@@ -73,6 +72,32 @@ class QuerySpecs(config: Config, spark: SparkSession) {
numericColumns
}

/**
* To get columns in a dataframe that work with min/max.
* Now numeric columns are limited to [byte, int, long, decimal, string]
The doc comment is missing TimestampType and DateType.

Comment on lines 90 to 97
    df.dtypes.filter {
      case (_, dataType) =>
        dataType == "ByteType" || dataType == "IntegerType" || dataType == "LongType" ||
          dataType.startsWith("DecimalType") || dataType == "StringType" ||
          dataType == "TimestampType" || dataType == "DateType"
    }.map {
      case (columnName, _) => columnName
    }
nit: consider pattern matching for readability

    val decimalTypePrefixRegex = "^DecimalType.*".r
    df.dtypes.collect {
      case (columnName, "ByteType") => columnName
      case (columnName, "IntegerType") => columnName
      ...
      case (columnName, decimalTypePrefixRegex) => columnName
    }
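Filling in that suggestion as a runnable sketch (the `dtypes` sample below is made up for illustration; in Spark, `df.dtypes` yields `(columnName, dataType.toString)` pairs like these):

```scala
// Stand-in for df.dtypes: (columnName, dataType.toString) pairs.
val dtypes = Seq(
  ("id", "LongType"),
  ("name", "StringType"),
  ("price", "DecimalType(10,2)"),
  ("flags", "ArrayType(IntegerType,true)")
)

// collect keeps only columns whose type supports min/max, using
// alternative patterns for the exact-match types and a guard for
// the parameterized DecimalType string.
val minMaxColumns = dtypes.collect {
  case (columnName, "ByteType" | "IntegerType" | "LongType" | "StringType"
                  | "TimestampType" | "DateType") => columnName
  case (columnName, dt) if dt.startsWith("DecimalType") => columnName
}
// minMaxColumns == Seq("id", "name", "price")
```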

@revans2 (Collaborator, Author) commented Sep 8, 2023

build

@gerashegalov (Collaborator) left a comment:

LGTM

Comment on lines +60 to +64
    df.schema.map { field =>
      (field.name, field.dataType)
    }.collect {
      case (columnName, ByteType | IntegerType | LongType | _: DecimalType) =>
        columnName

No need to fix; just a remark that even with this improvement the map is not really necessary (unless it causes issues with different Spark versions) if we collect with

case StructField(columnName, ByteType | IntegerType | LongType | _: DecimalType, _, _) => columnName
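Sketching out that remark in full (this assumes Spark is on the classpath, and the schema below is made up for illustration; `StructType` is a `Seq[StructField]`, so `collect` can match `StructField` directly without the intermediate tuple map):

```scala
import org.apache.spark.sql.types._

// A hand-built schema, so the pattern match can be shown without a
// running SparkSession.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("price", DecimalType(10, 2)),
  StructField("flags", ArrayType(IntegerType))
))

// Collect column names straight off the StructFields; the last two
// wildcards ignore the nullable flag and metadata.
val minMaxColumns = schema.collect {
  case StructField(columnName,
      ByteType | IntegerType | LongType | _: DecimalType, _, _) =>
    columnName
}
// minMaxColumns here would be Seq("id", "price")
```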

@revans2 (Collaborator, Author) commented Sep 8, 2023

@wjxiz1992 could you take a look too?

@wjxiz1992 (Collaborator) left a comment:

This LGTM.

@revans2 revans2 merged commit 1ccdf89 into NVIDIA:branch-23.10 Sep 12, 2023
28 checks passed
@revans2 revans2 deleted the add_query_number branch September 12, 2023 13:22
Successfully merging this pull request may close these issues.

[FEA] Add query number into the query description not just the iteration number.