
[Bug]: Data loss when exceptions are received from BQ while using Storage API Write in batch mode #26521

Closed · 2 of 15 tasks
johnjcasey opened this issue May 3, 2023 · 3 comments

@johnjcasey (Contributor)

What happened?

This affects versions 2.35-2.47, though it is most severe in 2.44-2.47, where #26520 causes guaranteed exceptions.

When we get an exception from BigQuery while attempting to append to a stream, we retry the current append on a new stream. In doing so, however, we abandon the existing stream. Messages on the abandoned stream that have not yet been committed never make it to BigQuery, resulting in data loss and consistency issues.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class StorageWriteApiRepro {
  public static void main(String[] args) {
    BigQueryOptions options = PipelineOptionsFactory.fromArgs(args).create().as(BigQueryOptions.class);
    // Flush appends in small batches of 5 records.
    options.setStorageApiAppendThresholdRecordCount(5);
    Pipeline p = Pipeline.create(options);

    p.apply("ReadLines", TextIO.read().from("gs://apache-beam-samples/shakespeare/kinglear.txt"))
        .apply("Save Events To BigQuery", BigQueryIO.<String>write()
            .to("google.com:clouddfe:reprodataset.reprotable")
            .withFormatFunction(s -> new TableRow().set("words", s))
            .withMethod(Write.Method.STORAGE_WRITE_API)
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}

This pipeline reproduces the issue on versions 2.44-2.47.
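
For background on why abandoning a stream drops rows: in batch mode the connector appends to PENDING-type write streams, and rows buffered on such a stream only become visible in the table after the stream is finalized and batch-committed. Below is a minimal sketch of that lifecycle using the raw google-cloud-bigquerystorage client (project, dataset, table, and field names are placeholders; this is an illustration of the API, not the Beam connector's internal code):

import com.google.api.core.ApiFuture;
import com.google.cloud.bigquery.storage.v1.AppendRowsResponse;
import com.google.cloud.bigquery.storage.v1.BatchCommitWriteStreamsRequest;
import com.google.cloud.bigquery.storage.v1.BigQueryWriteClient;
import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
import com.google.cloud.bigquery.storage.v1.TableName;
import com.google.cloud.bigquery.storage.v1.WriteStream;
import org.json.JSONArray;
import org.json.JSONObject;

public class PendingStreamLifecycle {
  public static void main(String[] args) throws Exception {
    // Placeholder identifiers for illustration only.
    TableName table = TableName.of("my-project", "my_dataset", "my_table");

    try (BigQueryWriteClient client = BigQueryWriteClient.create()) {
      // 1. Create a PENDING stream: appended rows are buffered, not yet visible in the table.
      WriteStream stream =
          client.createWriteStream(
              table.toString(),
              WriteStream.newBuilder().setType(WriteStream.Type.PENDING).build());

      // 2. Append rows to the stream and wait for acknowledgement.
      try (JsonStreamWriter writer =
          JsonStreamWriter.newBuilder(stream.getName(), stream.getTableSchema()).build()) {
        JSONArray rows = new JSONArray().put(new JSONObject().put("words", "some line"));
        ApiFuture<AppendRowsResponse> ack = writer.append(rows);
        ack.get();
      }

      // 3. Finalize and batch-commit: only now do the buffered rows become visible.
      client.finalizeWriteStream(stream.getName());
      client.batchCommitWriteStreams(
          BatchCommitWriteStreamsRequest.newBuilder()
              .setParent(table.toString())
              .addWriteStreams(stream.getName())
              .build());
      // If the stream is abandoned before step 3 (as happens on the retry path
      // described in this issue), its buffered rows never reach BigQuery.
    }
  }
}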

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner

@johnjcasey (Contributor, Author)

Fixed by #26503

@rszper (Contributor) commented May 8, 2023

As a workaround until the fix is available, use the File Loads method instead of the Storage Write API.
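
For example, on the repro pipeline above the switch is a one-line change to the write method (a minimal sketch reusing the pipeline p and identifiers from that snippet; FILE_LOADS is BigQueryIO's batch-load write method):

p.apply("ReadLines", TextIO.read().from("gs://apache-beam-samples/shakespeare/kinglear.txt"))
    .apply("Save Events To BigQuery", BigQueryIO.<String>write()
        .to("google.com:clouddfe:reprodataset.reprotable")
        .withFormatFunction(s -> new TableRow().set("words", s))
        // Workaround: write via load jobs instead of the Storage Write API.
        .withMethod(Write.Method.FILE_LOADS)
        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));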

Issue symptom that customers might see:

  • Data written to BigQuery is incomplete even though Dataflow reports the correct number of output writes.

@Dhaamodharan

Can someone help me fix/update the SDK version? I am currently using SDK 2.39.0 and can't find a way to update the SDK at runtime. Is there any way to do this without restarting the job?
