Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[server][dvc] Retry/Skip hosts that fail to connect or transfer files during blob transfer #1218

Merged
merged 10 commits into from
Oct 9, 2024

Conversation

jingy-li
Copy link
Collaborator

@jingy-li jingy-li commented Oct 4, 2024

[server][dvc] Retry/Skip hosts that fail to connect or transfer files during blob transfer

There is an issue in the peer discovery process during blob transfers where only the first host provided by discoverPeers is used. If the initial host fails to establish a connection, the system does not attempt to retry or connect to other available hosts.

This problem arises because, within the NettyP2PTransferManager, the inputStream obtained from nettyClient.get is returned even when it is associated with a failed CompletableFuture due to a connection error. NettyP2PBlobTransferManager do not explicitly verify whether the inputStream has failed before return it. Consequently, the inputStream from the first host is returned immediately, causing the for-loop that iterates through discoverPeers to exit prematurely.


This PR fix:

  1. Error handling:

    • [Fatal case] if there is NO peers found for the requested blob, skip bootstrap via blob transfer.
    • [Fatal case] if ALL founded peers failed to connect or has no snapshot, skip bootstrap via blob transfer.
    • [Fatal case] if any unexpected exception occurs during the file/metadata transfer, skip bootstrap via blob transfer for saving time, do cleanup.
    • [Retry case] if one host has connect error, it will retry up to max-retry times. Automatically switches to the next available host after reaching the maximum retry limit, ensuring continued attempts to complete the file transfer.
    • [Skip host case] If one host return 404 snapshot not found, it will skip and move to next host.
  2. Relocates the completion check from the DefaultIngestionBackend to the NettyP2PBlobTransferManager. This adjustment allows for more clearer error handling and retry processes logic.

  3. Minor: simplifies the server blob finder logic

  4. Minor: shuffling discoverPeers list to ensure the hosts are randomly picked.

How was this PR tested?

Unit Test: Verify the retry mechanism and the ability to skip a faulty host and subsequently transfer files to the local host.

Integration Test: Ensure that if a server's snapshot does not exist, the mechanism moves to the next available host. If no hosts are available, the process should default to using Kafka for ingestion.

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • Yes. Make sure to explain your proposed changes and call out the behavior change.

@eldernewborn
Copy link
Collaborator

Have we considered picking random hosts, rather than first available ?
picking the first can create un-even and at times excessive load on the hosts that happen to show up as first on the list for the rest of the fleet.

@jingy-li
Copy link
Collaborator Author

jingy-li commented Oct 7, 2024

Have we considered picking random hosts, rather than first available ? picking the first can create un-even and at times excessive load on the hosts that happen to show up as first on the list for the rest of the fleet.

Yes, thanks for this good call! Add shuffling list step to ensure hosts are randomly picked.

Copy link
Contributor

@sixpluszero sixpluszero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this change and adding a lot of tests, it is good practice to add some unit tests for critical behavior coverage.
I left a few comments, most of them are simple, but the sequential vs parallel blob transfer issue I think it is needed to be tackled.

Copy link
Contributor

@sixpluszero sixpluszero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thank you for the fixes!

@jingy-li jingy-li enabled auto-merge (squash) October 9, 2024 23:40
@jingy-li jingy-li merged commit 5cc32a7 into linkedin:main Oct 9, 2024
45 checks passed
@jingy-li jingy-li deleted the retry-peer-finding branch October 9, 2024 23:42
kvargha pushed a commit to kvargha/venice that referenced this pull request Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants