SparkUtils.flattenSchema method throws null pointer exception #466

Closed
codealways opened this issue Jan 29, 2022 · 16 comments
Labels
bug Something isn't working

Comments

@codealways
Contributor

Describe the bug

The SparkUtils.flattenSchema method throws an NPE for an empty data frame whose copybook has OCCURS clauses, i.e. whose schema contains arrays.

To Reproduce

Create a data frame from an empty data file and a copybook with OCCURS, then flatten the schema with the SparkUtils.flattenSchema method.
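A minimal reproduction sketch (the file paths are placeholders, and the SparkUtils import path is assumed to be the usual Cobrix one):

import za.co.absa.cobrix.spark.cobol.utils.SparkUtils

// spark is an existing SparkSession
val df = spark
  .read
  .format("cobol")
  .option("copybook", "/path/to/copybook_with_occurs.cpy")  // copybook containing OCCURS
  .load("/path/to/empty_file.dat")                          // zero-byte data file

// Throws NullPointerException when the DataFrame has no rows
val flatDf = SparkUtils.flattenSchema(df)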

Expected behaviour

It should return the flattened schema with zero records, since the dataset is empty.


@codealways codealways added the bug Something isn't working label Jan 29, 2022
@codealways
Contributor Author

codealways commented Jan 30, 2022

The line of code causing the NPE:

var maxInd = df.agg(max(expr(s"size($path${structField.name})"))).collect()(0)(0).toString.toInt

in the function flattenStructArray.

I understand the above code depends on the data because of the OCCURS DEPENDING ON implementation, and it throws an NPE when the dataframe has zero records.
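To illustrate why that line fails on an empty dataset, here is a standalone sketch with a hypothetical one-array schema:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{expr, max}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

val emptyDf = spark.createDataFrame(
  spark.sparkContext.emptyRDD[Row],
  StructType(Seq(StructField("GROUP", ArrayType(StringType))))
)

// Aggregating max(size(...)) over zero rows yields a single row containing null,
// so the value below is null and calling .toString on it throws the NPE.
val maxValue = emptyDf.agg(max(expr("size(GROUP)"))).collect()(0)(0)
// val maxInd = maxValue.toString.toInt   // NullPointerException here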

Potential Fix

Approach 1: Before calculating maxInd we can check whether the dataframe is empty. If it is empty, we can take the maximum index of a group from the copybook for OCCURS or OCCURS DEPENDING ON; for an empty dataset it is not necessary to derive maxInd from the data.

This will return a dataframe with the maximum possible number of columns from OCCURS or OCCURS DEPENDING ON.
For example, for OCCURS DEPENDING ON FROM 0 TO 3 we would consider 3. We have to pass the copybook contents to the required function in this case.

Approach 2:
We could update the code to read maxInd from the copybook only for plain OCCURS, and keep the existing code for OCCURS DEPENDING ON.

In my view, Approach 1 (sketched below) is better. Kindly let me know your view.
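A sketch of Approach 1, assuming a helper can derive the bound from the copybook (getMaxOccursFromCopybook is a hypothetical helper, not an existing Cobrix function):

val maxInd =
  if (df.isEmpty) {
    // For an empty dataset, take the upper bound of OCCURS / OCCURS DEPENDING ON
    // directly from the copybook contents instead of from the data.
    getMaxOccursFromCopybook(copybookContents, s"$path${structField.name}")
  } else {
    df.agg(max(expr(s"size($path${structField.name})"))).collect()(0)(0).toString.toInt
  }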

@yruslan
Collaborator

yruslan commented Jan 31, 2022

Thanks for the report! We will fix the NPE.

The issue with getting this info from OCCURS is that once the Cobol schema is converted to a Spark schema, the maximum array size is lost, because Spark arrays do not carry a maximum number of elements in their metadata. That's why the maximum number of array elements is determined from the actual data.
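For reference, a Spark ArrayType only carries the element type and nullability, so the OCCURS upper bound has nowhere to live on the type itself:

import org.apache.spark.sql.types.{ArrayType, StringType}

val arr = ArrayType(StringType, containsNull = true)
// arr.elementType and arr.containsNull are the only attributes;
// there is no "maximum number of elements" field on the type.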

I have some ideas, will let you know after I try them.

@codealways
Contributor Author

codealways commented Jan 31, 2022

Sure, thanks. Please check Approach 1, where we can pass the copybook contents as an optional parameter and derive the maximum occurrences from it.

For the copybook below, in the case of a zero-byte file, the dataset would be generated as shown below.

  01 RECORD.
      02 COUNT PIC 9(1).
      02 GROUP OCCURS 0 TO 2 TIMES DEPENDING ON COUNT.
         03 INNER-COUNT PIC 9(1).
         03 INNER-GROUP OCCURS 0 TO 3 TIMES
                            DEPENDING ON INNER-COUNT.
            04 FIELD PIC X.

+-----+-------------------+---------------------------+---------------------------+---------------------------+-------------------+---------------------------+---------------------------+---------------------------+
|COUNT|GROUP_0_INNER_COUNT|GROUP_0_INNER_GROUP_0_FIELD|GROUP_0_INNER_GROUP_1_FIELD|GROUP_0_INNER_GROUP_2_FIELD|GROUP_1_INNER_COUNT|GROUP_1_INNER_GROUP_0_FIELD|GROUP_1_INNER_GROUP_1_FIELD|GROUP_1_INNER_GROUP_2_FIELD|
+-----+-------------------+---------------------------+---------------------------+---------------------------+-------------------+---------------------------+---------------------------+---------------------------+
+-----+-------------------+---------------------------+---------------------------+---------------------------+-------------------+---------------------------+---------------------------+---------------------------+

yruslan added commits that referenced this issue Feb 1, 2022
@yruslan
Collaborator

yruslan commented Feb 1, 2022

The idea worked. When creating a Spark schema from a copybook, 'minElements' and 'maxElements' metadata fields are now added to arrays coming from OCCURS. This way the program no longer needs to determine the maximum array size from the data itself. If the metadata fields are not available, the program falls back to the old way of querying the data for the maximums.
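A minimal sketch of how the bound might then be recovered from the Spark field metadata, assuming the keys are literally "minElements" and "maxElements" as described above:

import org.apache.spark.sql.types.StructField

def maxElementsOf(field: StructField): Option[Long] =
  if (field.metadata.contains("maxElements")) {
    Some(field.metadata.getLong("maxElements"))
  } else {
    None // fall back to querying the data, as before
  }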

You can check the update in master, or wait until 2.4.8 is released.

@codealways
Contributor Author

@yruslan Thanks a lot. Any tentative date for the 2.4.8 release?

@yruslan
Collaborator

yruslan commented Feb 1, 2022

Probably by the end of the week.

But I would encourage you to check whether the updated flattenSchema() works for you from the current master, so that if it doesn't, changes can be made before the release.

@codealways
Contributor Author

Sure, let me pull the changes and check via unit test cases.

@codealways
Contributor Author

As we are currently on 2.1.3, which uses Spark 2.4.5, I suppose the newer version should be backward compatible, since we may not use Spark 3.x for now.

@codealways
Contributor Author

codealways commented Feb 1, 2022

@yruslan I checked with below copybook

01 RECORD.
    02 COUNT PIC 9(1).
    02 GROUP OCCURS 2 TIMES.
       03 INNER-COUNT PIC 9(1).
       03 INNER-GROUP OCCURS 3 TIMES.
          04 FIELD PIC X.

As per expectation, I should get the columns below:

|COUNT|GROUP_0_INNER_COUNT|GROUP_0_INNER_GROUP_0_FIELD|GROUP_0_INNER_GROUP_1_FIELD|GROUP_0_INNER_GROUP_2_FIELD|GROUP_1_INNER_COUNT|GROUP_1_INNER_GROUP_0_FIELD|GROUP_1_INNER_GROUP_1_FIELD|GROUP_1_INNER_GROUP_2_FIELD|

FIELD should appear 2*3 = 6 times
INNER-COUNT 2 times
COUNT 1 time

but I am getting:

|COUNT|GROUP_0_INNER_COUNT|GROUP_0_INNER_GROUP_0_FIELD|GROUP_1_INNER_COUNT|GROUP_1_INNER_GROUP_0_FIELD|

@yruslan
Collaborator

yruslan commented Feb 1, 2022

Did you use spark-cobol dependency with version 2.4.8-SNAPSHOT ?

@codealways
Contributor Author

@yruslan Yes, I am using the master branch, and it is version 2.4.8-SNAPSHOT.

@yruslan
Collaborator

yruslan commented Feb 1, 2022

Strange. What is the code snippet you are using?

@codealways
Contributor Author

I am running a simple test case with the code below, using the master branch itself. I didn't change anything in the POM or the main code.

Is it working for you as expected, i.e. creating the columns below?

COUNT|GROUP_0_INNER_COUNT|GROUP_0_INNER_GROUP_0_FIELD|GROUP_0_INNER_GROUP_1_FIELD|GROUP_0_INNER_GROUP_2_FIELD|GROUP_1_INNER_COUNT|GROUP_1_INNER_GROUP_0_FIELD|GROUP_1_INNER_GROUP_1_FIELD|GROUP_1_INNER_GROUP_2_FIELD|

val df = spark
  .read
  .format("cobol")
  .option("copybook", inputCopybookPath)
  .option("encoding", "ascii")
  .option("schema_retention_policy", "collapse_root")
  .load(inputDataPath)
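Followed by the flattening step that exercises the fix (a sketch; printSchema/show are only there to inspect the resulting columns):

import za.co.absa.cobrix.spark.cobol.utils.SparkUtils

val flatDf = SparkUtils.flattenSchema(df)
flatDf.printSchema()
flatDf.show(false)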

@yruslan
Collaborator

yruslan commented Feb 2, 2022

I confirm that inner OCCURS were not flattened properly for empty files. It is fixed now.

You can pull the latest master and try again. It is good that you checked, otherwise we wouldn't have spotted it!

@codealways
Contributor Author

@yruslan Let me test again after your commit.

@codealways
Contributor Author

Working fine

@yruslan yruslan closed this as completed Feb 4, 2022