[BUG] Wrong results when comparing double reading from CSV #5682

Closed
viadea opened this issue May 27, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@viadea
Collaborator

viadea commented May 27, 2022

When comparing double values read from CSV, the GPU run returns a different result than the CPU run.

Below is a minimum repro:

  1. Create 2 sample CSV files:
$  cat a.csv
1|7.5
$  cat b.csv
1|7.500
  2. Run the following Scala code in a Spark session:
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, DataFrame, SparkSession}

spark.conf.set("spark.rapids.sql.csv.read.double.enabled",true)

val aPath = "/home/xx/data/samplecsv/a.csv"
val bPath = "/home/xx/data/samplecsv/b.csv"

val acquisitionSchema = StructType(Array(
      StructField("loan_id", LongType),
      StructField("orig_interest_rate", DoubleType))
    )

val reader = spark.read.option("header", false)
                              .option("nullValue", "")
                              .option("delimiter", "|")
                              .option("parserLib", "univocity")
val a = reader
      .schema(acquisitionSchema)
      .csv(aPath)
val b = reader
      .schema(acquisitionSchema)
      .csv(bPath)

a.createOrReplaceTempView("a")
b.createOrReplaceTempView("b")

//GPU mode:
spark.conf.set("spark.rapids.sql.enabled",true)

spark.sql("""
select a.orig_interest_rate as orig_interest_rate_a, b.orig_interest_rate as orig_interest_rate_b
from a,b
where a.loan_id=b.loan_id
and a.orig_interest_rate <> b.orig_interest_rate

""").show(false)

GPU result:

+--------------------+--------------------+
|orig_interest_rate_a|orig_interest_rate_b|
+--------------------+--------------------+
|7.5                 |7.5                 |
+--------------------+--------------------+

CPU result:

+--------------------+--------------------+
|orig_interest_rate_a|orig_interest_rate_b|
+--------------------+--------------------+
+--------------------+--------------------+

The above was tested on the 22.04 GA release; the 22.06 snapshot appears to have fixed it.
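
For reference, the empty CPU result is the expected one: 7.5 is exactly representable in binary floating point, so `7.5` and `7.500` should parse to the same double and `<>` should be false. The sketch below checks this on the JVM without Spark. The object name `DoubleParseCheck` and the helper `bitsOf` are made up for illustration and are not part of the plugin or of Spark.

```scala
// Minimal sketch (hypothetical names): confirm that the two CSV literals
// from the repro parse to the same IEEE 754 double on the JVM.
object DoubleParseCheck {
  // Hex string of the raw 64-bit pattern of the parsed value.
  def bitsOf(s: String): String =
    java.lang.Long.toHexString(java.lang.Double.doubleToLongBits(s.toDouble))

  def main(args: Array[String]): Unit = {
    val a = "7.5".toDouble   // value from a.csv
    val b = "7.500".toDouble // value from b.csv
    // Equal values mean a <> b is false, matching the empty CPU result.
    println(s"a == b: ${a == b}")
    println(s"bits(a) = ${bitsOf("7.5")}, bits(b) = ${bitsOf("7.500")}")
  }
}
```

Any reader that produces different bit patterns for these two inputs would make the join predicate `a.orig_interest_rate <> b.orig_interest_rate` match, which is consistent with the GPU output shown above.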

@viadea added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on May 27, 2022
@andygrove
Contributor

I believe this was fixed by #4637

@mattahrens removed the ? - Needs Triage (Need team to review and classify) label on May 31, 2022