-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[server][dvc] Retry/Skip hosts that fail to connect or transfer files during blob transfer #1218
Conversation
… during blob transfer
internal/venice-common/src/main/java/com/linkedin/venice/blobtransfer/ServerBlobFinder.java
Outdated
Show resolved
Hide resolved
...inci-client/src/main/java/com/linkedin/davinci/blobtransfer/NettyP2PBlobTransferManager.java
Outdated
Show resolved
Hide resolved
...n/src/integrationTest/java/com/linkedin/venice/endToEnd/BlobP2PTransferAmongServersTest.java
Outdated
Show resolved
Hide resolved
Have we considered picking random hosts, rather than first available ? |
Yes, thanks for this good call! Add shuffling list step to ensure hosts are randomly picked. |
…lob transfer. 3. retry if connect errors. 4. skip host if no snapshot
...t-common/src/main/java/com/linkedin/venice/exceptions/VenicePeersCannotConnectException.java
Outdated
Show resolved
Hide resolved
...inci-client/src/main/java/com/linkedin/davinci/blobtransfer/NettyP2PBlobTransferManager.java
Outdated
Show resolved
Hide resolved
...inci-client/src/main/java/com/linkedin/davinci/blobtransfer/NettyP2PBlobTransferManager.java
Outdated
Show resolved
Hide resolved
...ent/src/main/java/com/linkedin/davinci/blobtransfer/client/P2PFileTransferClientHandler.java
Outdated
Show resolved
Hide resolved
...ent/src/main/java/com/linkedin/davinci/blobtransfer/client/P2PFileTransferClientHandler.java
Show resolved
Hide resolved
...venice-common/src/main/java/com/linkedin/venice/blobtransfer/BlobPeersDiscoveryResponse.java
Outdated
Show resolved
Hide resolved
...inci-client/src/main/java/com/linkedin/davinci/blobtransfer/NettyP2PBlobTransferManager.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making this change and adding a lot of tests, it is good practice to add some unit tests for critical behavior coverage.
I left a few comments, most of them are simple, but the sequential vs parallel blob transfer issue I think it is needed to be tackled.
...venice-common/src/main/java/com/linkedin/venice/blobtransfer/BlobPeersDiscoveryResponse.java
Outdated
Show resolved
Hide resolved
...inci-client/src/main/java/com/linkedin/davinci/blobtransfer/NettyP2PBlobTransferManager.java
Show resolved
Hide resolved
...inci-client/src/main/java/com/linkedin/davinci/blobtransfer/NettyP2PBlobTransferManager.java
Show resolved
Hide resolved
...inci-client/src/main/java/com/linkedin/davinci/blobtransfer/NettyP2PBlobTransferManager.java
Outdated
Show resolved
Hide resolved
...ts/da-vinci-client/src/main/java/com/linkedin/davinci/ingestion/DefaultIngestionBackend.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Thank you for the fixes!
… during blob transfer (linkedin#1218)
[server][dvc] Retry/Skip hosts that fail to connect or transfer files during blob transfer
There is an issue in the peer discovery process during blob transfers where only the first host provided by discoverPeers is used. If the initial host fails to establish a connection, the system does not attempt to retry or connect to other available hosts.
This problem arises because, within the
NettyP2PTransferManager
, the inputStream obtained fromnettyClient.get
is returned even when it is associated with a failed CompletableFuture due to a connection error. NettyP2PBlobTransferManager do not explicitly verify whether the inputStream has failed before return it. Consequently, the inputStream from the first host is returned immediately, causing the for-loop that iterates through discoverPeers to exit prematurely.This PR fix:
Error handling:
Relocates the completion check from the DefaultIngestionBackend to the NettyP2PBlobTransferManager. This adjustment allows for more clearer error handling and retry processes logic.
Minor: simplifies the server blob finder logic
Minor: shuffling discoverPeers list to ensure the hosts are randomly picked.
How was this PR tested?
Unit Test: Verify the retry mechanism and the ability to skip a faulty host and subsequently transfer files to the local host.
Integration Test: Ensure that if a server's snapshot does not exist, the mechanism moves to the next available host. If no hosts are available, the process should default to using Kafka for ingestion.
Does this PR introduce any user-facing changes?