[server][dvc] Retry/Skip hosts that fail to connect or transfer files during blob transfer #1218

jingy-li · 2024-10-04T20:39:18Z

There is an issue in the peer discovery process during blob transfers where only the first host provided by discoverPeers is used. If the initial host fails to establish a connection, the system does not attempt to retry or connect to other available hosts.

This problem arises because, within NettyP2PBlobTransferManager, the inputStream obtained from nettyClient.get is returned even when it is associated with a CompletableFuture that failed due to a connection error. NettyP2PBlobTransferManager does not explicitly verify whether the inputStream has failed before returning it. Consequently, the inputStream from the first host is returned immediately, causing the for-loop that iterates through discoverPeers to exit prematurely.
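
The failure mode described above can be illustrated with a minimal, self-contained sketch. It is not the actual NettyP2PBlobTransferManager code: the class name, the method names, and the `fakeClient` helper are hypothetical, and the client is modeled as a plain function from host name to `CompletableFuture<InputStream>`. The second method blocks on `join()` only to keep the illustration short.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Function;

class FirstHostOnlySketch {
  // Buggy shape: the first future is returned unconditionally,
  // so the loop never reaches the second peer even if the first one failed.
  static CompletableFuture<InputStream> getFirst(
      List<String> peers, Function<String, CompletableFuture<InputStream>> client) {
    for (String peer : peers) {
      return client.apply(peer); // may already be completed exceptionally
    }
    return null; // no peers at all
  }

  // Checked shape: a failed future is detected and the loop moves to the next peer.
  static Optional<InputStream> getFromAnyPeer(
      List<String> peers, Function<String, CompletableFuture<InputStream>> client) {
    for (String peer : peers) {
      try {
        return Optional.of(client.apply(peer).join());
      } catch (CompletionException e) {
        // This peer failed; fall through and try the next one.
      }
    }
    return Optional.empty();
  }

  // Hypothetical stand-in for a network client: one host always fails to connect.
  static Function<String, CompletableFuture<InputStream>> fakeClient(String badHost) {
    return host -> {
      CompletableFuture<InputStream> future = new CompletableFuture<>();
      if (host.equals(badHost)) {
        future.completeExceptionally(new RuntimeException("connection refused: " + host));
      } else {
        future.complete(new ByteArrayInputStream(host.getBytes()));
      }
      return future;
    };
  }
}
```

With peers `["bad", "good"]`, `getFirst` hands back the failed future from `bad`, while `getFromAnyPeer` returns a stream from `good`.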

This PR makes the following changes:

Error handling:
- [Fatal case] If NO peers are found for the requested blob, skip bootstrap via blob transfer.
- [Fatal case] If ALL discovered peers fail to connect or have no snapshot, skip bootstrap via blob transfer.
- [Fatal case] If any unexpected exception occurs during the file/metadata transfer, skip bootstrap via blob transfer to save time, and clean up.
- [Retry case] If a host has a connection error, retry up to max-retry times. After reaching the maximum retry limit, automatically switch to the next available host and keep trying to complete the file transfer.
- [Skip host case] If a host returns 404 (snapshot not found), skip it and move on to the next host.
Relocates the completion check from DefaultIngestionBackend to NettyP2PBlobTransferManager. This makes the error handling and retry logic clearer.
Minor: simplifies the server blob finder logic.
Minor: shuffles the discoverPeers list so that hosts are picked randomly.
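
The fatal, retry, and skip cases above can be sketched as a small decision loop. This is an illustration under stated assumptions rather than the code in this PR: `PeerRetrySketch`, `Outcome`, and `transferFromPeers` are hypothetical names, each transfer attempt is modeled as a function from host name to an outcome, and "max-retry" is interpreted here as the total number of attempts per host.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

class PeerRetrySketch {
  // Hypothetical classification of what one transfer attempt against one host can yield.
  enum Outcome { SUCCESS, CONNECT_ERROR, SNAPSHOT_NOT_FOUND, UNEXPECTED_ERROR }

  // Returns the host that served the blob, or empty if blob transfer should be skipped
  // (the caller would then fall back to regular ingestion).
  static Optional<String> transferFromPeers(
      List<String> peers, int maxAttemptsPerHost, Function<String, Outcome> attempt) {
    if (peers.isEmpty()) {
      return Optional.empty(); // fatal: no peers found
    }
    for (String peer : peers) {
      for (int i = 0; i < maxAttemptsPerHost; i++) {
        Outcome outcome = attempt.apply(peer);
        if (outcome == Outcome.SUCCESS) {
          return Optional.of(peer);
        }
        if (outcome == Outcome.UNEXPECTED_ERROR) {
          return Optional.empty(); // fatal: stop immediately instead of trying more hosts
        }
        if (outcome == Outcome.SNAPSHOT_NOT_FOUND) {
          break; // skip host: retrying cannot help, move to the next peer
        }
        // CONNECT_ERROR: loop again to retry the same host
      }
    }
    return Optional.empty(); // fatal: every peer failed or had no snapshot
  }
}
```

An empty result covers all three fatal cases, so the caller needs only one fallback path.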

How was this PR tested?

Unit Test: Verify the retry mechanism and the ability to skip a faulty host and subsequently transfer files to the local host.

Integration Test: Ensure that if a server's snapshot does not exist, the mechanism moves to the next available host. If no hosts are available, the process should default to using Kafka for ingestion.

Does this PR introduce any user-facing changes?

No. You can skip the rest of this section.
Yes. Make sure to explain your proposed changes and call out the behavior change.

internal/venice-common/src/main/java/com/linkedin/venice/blobtransfer/ServerBlobFinder.java

...inci-client/src/main/java/com/linkedin/davinci/blobtransfer/NettyP2PBlobTransferManager.java

...n/src/integrationTest/java/com/linkedin/venice/endToEnd/BlobP2PTransferAmongServersTest.java

eldernewborn · 2024-10-07T00:19:44Z

Have we considered picking random hosts rather than the first available one?
Picking the first can create uneven, and at times excessive, load on the hosts that happen to show up first on the list for the rest of the fleet.

jingy-li · 2024-10-07T17:28:33Z

> Have we considered picking random hosts rather than the first available one? Picking the first can create uneven, and at times excessive, load on the hosts that happen to show up first on the list for the rest of the fleet.

Yes, thanks for the good call! I added a step that shuffles the list so that hosts are picked randomly.
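
A minimal sketch of such a shuffle step, assuming the discovered peers arrive as a list of host names. The class and method names are hypothetical; `Collections.shuffle` is the standard JDK call. Shuffling a copy leaves the discovery result itself untouched.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class PeerShuffleSketch {
  // Returns a randomly ordered copy so that callers across the fleet
  // do not all start with the same first host.
  static List<String> shuffledCopy(List<String> discoveredPeers, Random random) {
    List<String> copy = new ArrayList<>(discoveredPeers);
    Collections.shuffle(copy, random);
    return copy;
  }
}
```

Passing the `Random` in makes the order reproducible in tests by using a fixed seed.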

...t-common/src/main/java/com/linkedin/venice/exceptions/VenicePeersCannotConnectException.java

...inci-client/src/main/java/com/linkedin/davinci/blobtransfer/NettyP2PBlobTransferManager.java

...ent/src/main/java/com/linkedin/davinci/blobtransfer/client/P2PFileTransferClientHandler.java

...venice-common/src/main/java/com/linkedin/venice/blobtransfer/BlobPeersDiscoveryResponse.java

...inci-client/src/main/java/com/linkedin/davinci/blobtransfer/NettyP2PBlobTransferManager.java

sixpluszero

Thanks for making this change and adding a lot of tests; it is good practice to add unit tests that cover critical behavior.
I left a few comments. Most of them are simple, but I think the sequential vs. parallel blob transfer issue needs to be tackled.

...venice-common/src/main/java/com/linkedin/venice/blobtransfer/BlobPeersDiscoveryResponse.java

...inci-client/src/main/java/com/linkedin/davinci/blobtransfer/NettyP2PBlobTransferManager.java

...ts/da-vinci-client/src/main/java/com/linkedin/davinci/ingestion/DefaultIngestionBackend.java

sixpluszero

Looks good to me! Thank you for the fixes!

… during blob transfer (linkedin#1218)

[server][dvc] Retry/Skip hosts that fail to connect or transfer files…

3bc6ed7

… during blob transfer

jingy-li requested review from adamxchen and sixpluszero October 4, 2024 20:39

adamxchen reviewed Oct 4, 2024

address code review

a3ba116

jingy-li added 2 commits October 7, 2024 09:59

fix unit test error

a63b649

Shuffle the list to guarantee a random selection of hosts.

ac87b73

jingy-li requested a review from adamxchen October 7, 2024 18:03

jingy-li added 4 commits October 7, 2024 11:13

Merge branch 'main' into retry-peer-finding

d5d44de

fix unit test error

c2a3f73

1. remove 5-hrs timeout limitation. 2. if timeout in one node, skip b…

8ef5092

…lob transfer. 3. retry if connect errors. 4. skip host if no snapshot

Merge branch 'main' into retry-peer-finding

ddeab87

adamxchen reviewed Oct 9, 2024

sixpluszero reviewed Oct 9, 2024

address code review

0cd27ae

jingy-li requested review from sixpluszero and adamxchen October 9, 2024 17:43

fix unit test error

9876676

sixpluszero approved these changes Oct 9, 2024

adamxchen approved these changes Oct 9, 2024

jingy-li enabled auto-merge (squash) October 9, 2024 23:40

jingy-li merged commit 5cc32a7 into linkedin:main Oct 9, 2024
45 checks passed

jingy-li deleted the retry-peer-finding branch October 9, 2024 23:42

kvargha pushed a commit to kvargha/venice that referenced this pull request Oct 11, 2024

[server][dvc] Retry/Skip hosts that fail to connect or transfer files…

b163746

… during blob transfer (linkedin#1218)
