[FEATURE]: Get Training Backend to Update User on Training Progress with AWS Appsync #920

Open
dwu359 opened this issue Aug 21, 2023 · 11 comments
Assignees
Labels
backend backend tasks enhancement New feature or request

Comments

@dwu359
Contributor

dwu359 commented Aug 21, 2023

Feature Name

Get Training Backend to Update User on Training Progress with AWS Appsync

Your Name

Daniel Wu

Description

HTTP, the protocol we use between the frontend and backend, follows a request/response model: the backend can only send data in response to a request from the frontend. Reporting training progress requires the backend to send multiple messages to the frontend after the initial training request. Luckily, AWS AppSync handles this for us with its WebSocket pub/sub APIs, which allow bidirectional communication between the frontend and backend. More specifically, both the frontend and backend can listen on AppSync's WebSocket endpoint for messages with a particular channel ID, and both can send messages with that same channel ID by making GraphQL API requests to AppSync.

Use AWS AppSync to update the user on the progress of a particular training request. For now, let's define training progress as the number of epochs completed.
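As a rough illustration of the channel-ID idea described above, here is a minimal in-memory sketch in Python. It does not use AppSync or any network transport. The names `PubSub` and `train` and the message fields `epochs_completed` and `total_epochs` are hypothetical, chosen only for this example. It shows just the routing behavior: a listener subscribed to one channel ID receives the progress messages published to that ID and nothing else.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Message = Dict[str, int]


class PubSub:
    """Tiny in-memory publish/subscribe hub keyed by channel id (illustrative only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Message], None]]] = defaultdict(list)

    def subscribe(self, channel_id: str, callback: Callable[[Message], None]) -> None:
        # Register a listener for one channel id.
        self._subscribers[channel_id].append(callback)

    def publish(self, channel_id: str, message: Message) -> None:
        # Only listeners registered on this channel id receive the message.
        for callback in list(self._subscribers.get(channel_id, [])):
            callback(message)


def train(hub: PubSub, channel_id: str, total_epochs: int) -> None:
    """Stand-in for a training loop that reports progress after every epoch."""
    for epoch in range(1, total_epochs + 1):
        # One epoch of real training would run here.
        hub.publish(
            channel_id,
            {"epochs_completed": epoch, "total_epochs": total_epochs},
        )
```

In a real setup the frontend would hold the subscription and the backend would do the publishing, with a managed service or a WebSocket server sitting where `PubSub` is here.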

@dwu359 dwu359 added enhancement New feature or request backend backend tasks labels Aug 21, 2023
@github-actions
Contributor

Hello @dwu359! Thank you for submitting the Feature Request Form. We appreciate your contribution. 👋

We will look into it and provide a response as soon as possible.

To work on this feature request, you can follow these branch setup instructions:

  1. Check out the main branch (nextjs):
```
 git checkout nextjs
```
  2. Pull the latest changes from the remote main branch:
```
 git pull origin nextjs
```
  3. Create a new branch specific to this feature request using the issue number:
```
 git checkout -b feature-920
```

Feel free to make the necessary changes in this branch and submit a pull request when you're ready.

Best regards,
Deep Learning Playground (DLP) Team

@karkir0003
Member

@dwu359 can you provide more detail?

@karkir0003
Member

can you provide more detail here?

@andrewpeng02
Contributor

Why AWS AppSync instead of plain WebSockets?

@karkir0003
Member

@dwu359

@dwu359
Contributor Author

dwu359 commented Feb 13, 2024

AppSync seems to handle the WebSocket plumbing for us, but if you can find a way to implement it directly with WebSockets, go for it. I will say, though, that I looked into implementing it with WebSockets before, and the library support for WebSockets isn't as good as it is for REST APIs.

@andrewpeng02
Contributor

I just don't see the need to use another service, and it'll also complicate development (we'd have to deploy to some staging environment every time we want to test something?). I'll look into libraries.

@andrewpeng02
Contributor

What other uses of WebSockets do you think we'd want to add in the future?

@andrewpeng02
Contributor

andrewpeng02 commented Feb 13, 2024

Django Channels seems to be the accepted library for WebSockets, and the implementation won't be too bad. The one catch is that we'd probably have to port our training methods to new WebSocket consumers and handle authentication a bit differently, in middleware. So the options seem to be:

  1. Port the entire train endpoints to WebSockets. I'm not sure Ninja schemas and the like would be supported.
  2. Define a WebSocket that only reports the current training epoch. This requires two separate requests, and figuring out how to connect them will be annoying, but it involves less refactoring (we can likely just create a job UUID on the client side and pass it to both the endpoint and the WebSocket).
  3. Retain the original HTTP training endpoints, so we don't have to create new authentication and we keep schema support for the input. Instead of doing the training in the endpoint, create a task via Celery and return the job ID to the user (this is better for long-running tasks too). The user then opens a WebSocket with Django Channels, and the Celery task periodically updates the WebSocket group with the progress and eventually returns the result. Long-term, Celery tasks would be best if we're planning on long-running training, especially with image data.

In terms of effort, 2 < 1 = 3 (option 2 is the least work; options 1 and 3 are about equal).
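The control flow of option 3 can be sketched with the Python standard library alone. This is only a sketch under stated assumptions: a thread stands in for the Celery task, a per-job set of queues stands in for a Django Channels group, and `start_training` stands in for the HTTP endpoint. None of these names (`ProgressGroups`, `run_training_job`, `start_training`, `collect_updates`) come from the project or from either library. The message history is an extra assumption, added so that a client that connects after the job has started still sees earlier updates.

```python
import queue
import threading
import uuid


class ProgressGroups:
    """In-memory stand-in for per-job WebSocket groups (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._history = {}      # job_id -> messages sent so far
        self._subscribers = {}  # job_id -> list of subscriber queues

    def join(self, job_id):
        """Register a listener and replay earlier messages to it."""
        inbox = queue.Queue()
        with self._lock:
            for message in self._history.get(job_id, []):
                inbox.put(message)
            self._subscribers.setdefault(job_id, []).append(inbox)
        return inbox

    def send(self, job_id, message):
        """Record a message and deliver it to every listener of this job."""
        with self._lock:
            self._history.setdefault(job_id, []).append(message)
            for inbox in self._subscribers.get(job_id, []):
                inbox.put(message)


def run_training_job(groups, job_id, num_epochs):
    """Stand-in for the background task: report progress, then the result."""
    for epoch in range(1, num_epochs + 1):
        # One epoch of real training would run here.
        groups.send(job_id, {"type": "progress", "epochs_completed": epoch})
    groups.send(job_id, {"type": "result", "epochs_completed": num_epochs})


def start_training(groups, num_epochs):
    """Stand-in for the HTTP endpoint: start the job, return its id at once."""
    job_id = str(uuid.uuid4())
    worker = threading.Thread(
        target=run_training_job, args=(groups, job_id, num_epochs)
    )
    worker.start()
    return job_id, worker


def collect_updates(groups, job_id, timeout=5.0):
    """Stand-in for the client's WebSocket: read until the result arrives."""
    inbox = groups.join(job_id)
    updates = []
    while True:
        message = inbox.get(timeout=timeout)
        updates.append(message)
        if message["type"] == "result":
            return updates
```

A caller would run `job_id, worker = start_training(groups, 3)` and then `collect_updates(groups, job_id)`. The job ID returned by the first call is the only thing linking the two requests, which is the same linking problem option 2 has to solve.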

@karkir0003
Member

@dwu359 ?

@dwu359
Contributor Author

dwu359 commented Feb 14, 2024

Django Channels seems like a good start. Keep in mind that you will need some way to host the WebSocket server (likely on EC2) and to access it (likely through API Gateway or something similar). I'm sorry I can't help much further; I'm no longer a direct contributor to this project, and at this point you seem to know more about WebSockets than I do.

Projects
Status: Todo
Development

No branches or pull requests

5 participants