Spark rewrite Files Action OOM #10054

Open
Zhanxiao-Ma opened this issue Mar 28, 2024 · 11 comments
Labels
question Further information is requested

Comments

@Zhanxiao-Ma

Query engine

Spark

Question

V2 tables support equality deletes for row-level deletes. When I use the Java API to write a large number of delete records and then run the Spark rewrite files action, I get an OOM error. Is it not allowed to delete too many records? How can I resolve this issue?
Does the community have plans to improve this?
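
For context, the rewrite being described can be invoked roughly like this; a minimal sketch assuming an existing SparkSession `spark`, a loaded Iceberg `Table` named `table`, and a placeholder target file size:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;

// Minimal sketch of the Spark rewrite-data-files action; `spark` and `table`
// are assumed to exist, and the target file size is an illustrative placeholder.
SparkActions.get(spark)
    .rewriteDataFiles(table)
    .option("target-file-size-bytes", String.valueOf(512L * 1024 * 1024))
    .execute();
```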

@Zhanxiao-Ma Zhanxiao-Ma added the question Further information is requested label Mar 28, 2024
@Zhanxiao-Ma Zhanxiao-Ma changed the title Rewrite Files Action OOM Spark rewrite Files Action OOM Mar 28, 2024
@manuzhang
Contributor

It's not forbidden to delete too many records, but doing so can increase the memory required in the driver. If you are using position deletes, there's the rewrite_position_delete_files procedure. As for equality deletes, there was #2364 to rewrite equality deletes as position deletes, but it was never merged.
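
That procedure can be invoked via Spark SQL; a minimal sketch, assuming a SparkSession `spark` and placeholder catalog/table names:

```java
// Sketch of calling the rewrite_position_delete_files procedure;
// `my_catalog` and `db.sample` are placeholder names.
spark.sql("CALL my_catalog.system.rewrite_position_delete_files(table => 'db.sample')");
```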

@RussellSpitzer
Member

There really isn't enough information here to dig into the issue. How many records are there? What were the Spark settings? Did it OOM before there were deletes? Did it OOM during the shuffle? Did the executors OOM? Was an unreasonable amount of memory being consumed?

As a general statement, we are interested in improving performance, but OOMs can happen for many different reasons, so it's not something that can be universally fixed.

@Zhanxiao-Ma
Author

There really isn't enough information here to dig into the issue. How many records are there? What were the Spark settings? Did it OOM before there were deletes? Did it OOM during the shuffle? Did the executors OOM? Was an unreasonable amount of memory being consumed?

As a general statement, we are interested in improving performance, but OOMs can happen for many different reasons, so it's not something that can be universally fixed.

OK. Actually, this line of code causes the OOM error because it loads all equality-delete records into memory, so as the number of records to be deleted increases, the memory requirement also increases.

StructLikeSet deleteSet = deleteLoader().loadEqualityDeletes(deletes, deleteSchema);
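
In effect, the pattern is equivalent to materializing every equality-delete record into an in-memory set and filtering data rows by membership; a simplified illustration (not the actual Iceberg internals; the reader and projection helpers are hypothetical):

```java
import org.apache.iceberg.StructLike;
import org.apache.iceberg.util.StructLikeSet;

// Every equality-delete record is added to an in-memory set, so memory grows
// linearly with the number of delete records.
StructLikeSet deleteSet = StructLikeSet.create(deleteSchema.asStruct());
for (StructLike deleteRecord : readAllEqualityDeletes(deletes)) { // hypothetical reader
  deleteSet.add(deleteRecord);
}

// Each data row is then dropped if its equality-delete columns are in the set.
boolean isDeleted = deleteSet.contains(projectToDeleteSchema(dataRow)); // hypothetical projection
```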

@nk1506
Contributor

nk1506 commented Apr 5, 2024

@RussellSpitzer / @manuzhang, are we planning to make any fix for this? OOM has been observed with RewriteFiles too.
If we use this API to rewrite a large number of small files into new large files, it causes OOM.
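
For clarity, the core RewriteFiles usage in question looks roughly like this; a sketch assuming a loaded `table` and pre-built sets of the small files to remove and the compacted files to add:

```java
// Sketch of swapping many small data files for a few large ones in one commit.
// `table` (Table), `smallFiles` and `compactedFiles` (Set<DataFile>) are assumed.
table.newRewrite()
    .rewriteFiles(smallFiles, compactedFiles)
    .commit();
```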

@manuzhang
Contributor

@nk1506 Echoing Russell's comments: how many small files are there in your OOM case? How much memory did you set up?

@Zhanxiao-Ma
Author

@nk1506 Echoing Russell's comments: how many small files are there in your OOM case? How much memory did you set up?

@RussellSpitzer I believe increasing memory is not a good solution for handling a large number of deletes, because it is impossible to predict how much memory would be appropriate.

@Zhanxiao-Ma
Author

@RussellSpitzer I have implemented a disk-based map to solve this problem. Is this what Iceberg expects? If so, I will submit the code.

@nk1506
Contributor

nk1506 commented Apr 7, 2024

@nk1506 Echoing Russell's comments: how many small files are there in your OOM case? How much memory did you set up?

I didn't use the Spark engine for compaction; I was using the Java client API, so my queries might distract from the original problem. My requirement is to compact very large datasets (say, 10K data files) in a single commit, and using RewriteFiles for that can always cause OOM. So I am looking for something that can help manage manifest files more intelligently. I think I will start a different thread to discuss that other problem.

@manuzhang
Contributor

manuzhang commented May 8, 2024

I have implemented a disk-based map to solve this problem. Is this what Iceberg expects? If so, I will submit the code.

@Zhanxiao-Ma I think it will be valuable to the community. Please open a PR.

@pdames

pdames commented May 21, 2024

Any updates here @Zhanxiao-Ma? Would love to take a look at what you've implemented if you've got a pending PR to link back to this issue, and see if there's an opportunity to work together to improve the state of affairs here!

@manuzhang
Contributor

manuzhang commented Jul 9, 2024

I've created a draft PR which stores equality deletes in RocksDB. It's been verified in our environment, but it requires more work to integrate with the existing API, caching mechanism, etc.
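
As a rough sketch of that idea (not the actual PR code), the equality-delete keys can be spilled to a local RocksDB instance and probed per row, so driver memory stays bounded; `serializeKey(...)` is a hypothetical serializer for the equality-delete columns:

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;

// Spill delete keys to disk instead of holding them in a StructLikeSet.
// `deleteRecords` and `dataRowKey` are assumed inputs; exception handling elided.
RocksDB.loadLibrary();
try (Options options = new Options().setCreateIfMissing(true);
     RocksDB db = RocksDB.open(options, "/tmp/equality-deletes")) {
  byte[] emptyValue = new byte[0];
  for (StructLike deleteRecord : deleteRecords) {    // assumed iterable of delete rows
    db.put(serializeKey(deleteRecord), emptyValue);  // store keys only; values unused
  }
  // Per data row: a hit in RocksDB means the row has been deleted.
  boolean deleted = db.get(serializeKey(dataRowKey)) != null;
}
```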
