Add local Llama2 support from llama2-wrapper backend #400

Merged
merged 7 commits into Shaunwei:main on Aug 31, 2023

Conversation

@liltom-eth liltom-eth (Contributor) commented Aug 25, 2023

Hi @Shaunwei @pycui,
I am working on the project llama2-wrapper, which makes it easy to call a Llama2 model locally as an LLM backend.
To follow up on the Twitter discussion, I made this PR as a showcase of running Realchar and Llama2 locally on an M2 MacBook Air.
Here is the demo:
[demo video: realchar]

How to run on Mac:

Run an OpenAI-compatible API serving Llama2 models:

pip install llama2-wrapper
python -m llama2_wrapper.server  --port 8001
# Llama2 running on http://localhost:8001
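
To sanity-check the local server before starting Realchar, you can hit the OpenAI-compatible endpoint (a minimal sketch; it assumes the llama2-wrapper server exposes the standard /v1/models route):

import requests

# Quick check that the OpenAI-compatible server is answering (assumed /v1/models route).
resp = requests.get("http://localhost:8001/v1/models", timeout=5)
print(resp.status_code, resp.json())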

Start Realchar:

python cli.py web-build
python cli.py run-uvicorn

Implementation

I found it hard to load a local LLM object directly as the backend, since Realchar uses langchain.chat_models as its LLM interface.
Thus I chose to run the local LLM behind an OpenAI-compatible API and then call langchain.chat_models.ChatOpenAI against the local URL.
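
In other words, the local model is reached through the same ChatOpenAI class, only with the API base pointed at the local server. A minimal sketch of the idea (the keyword values here are illustrative; the exact arguments live in the PR's Llama2wrapperLlm):

from langchain.chat_models import ChatOpenAI

# Point LangChain's OpenAI chat client at the local llama2-wrapper server
# instead of api.openai.com; the local backend serves whichever Llama2 model
# it was started with, regardless of the model name passed here.
llm = ChatOpenAI(
    model_name="meta-llama/Llama-2-7b-chat-hf",  # illustrative placeholder
    temperature=0.5,
    streaming=True,
    openai_api_base="http://localhost:8001/v1",
)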

Issues

Right now the PR still has an issue with automatically passing a custom URL from .env to the LLM as the model URL. I haven't figured out how to add a new LLM option to the new Realchar Web UI, so for now the URL is hard-coded to make Realchar run on llama2-wrapper.

Showcase

This showcase runs Realchar and Llama2 on a Mac (13.70 tokens/sec through llama.cpp).
Another interesting showcase might be running Realchar and Llama2 on a free Colab T4 GPU (18.19 tokens/sec through GPTQ).

@Shaunwei Shaunwei self-requested a review August 26, 2023 00:00
@pycui pycui (Collaborator) left a comment

Thanks for making this! A few comments

streaming=True,
# openai_api_base=url,
# temporarily use a fixed url
openai_api_base="http://localhost:8001/v1",
Collaborator
Can make this an env variable.
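
For example (a sketch only, using the LOCAL_LLM_URL variable this PR later adds to .env.example):

import os

# Read the local server URL from the environment instead of hard-coding it,
# falling back to the default llama2-wrapper port.
openai_api_base = os.getenv("LOCAL_LLM_URL", "http://localhost:8001/v1")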

# need to figure out how to set up llama2wrapper in the frontend
from realtime_ai_character.llm.llama2wrapper_llm import Llama2wrapperLlm

return Llama2wrapperLlm(url=model)
Collaborator
let's keep the branching logic for the formal PR

Collaborator
Also, we might need a convention to route to local; e.g. maybe just call it "local" for now.

Contributor Author
Thanks! I found that OPENAI_API_KEY in .env is always required; if it is missing, this error is raised:

openai.error.AuthenticationError: Incorrect API key provided: YOUR_API_KEY. You can find your API key at https://platform.openai.com/account/api-keys.
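
If only the local backend is being exercised, one workaround (an assumption on my part, not something this PR settles) is to pass an explicit placeholder key to the locally-pointed client, since an OpenAI-compatible local server typically does not validate it:

from langchain.chat_models import ChatOpenAI

# Placeholder key for the local server only; requests that actually reach
# api.openai.com still need a real OPENAI_API_KEY.
local_llm = ChatOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8001/v1",
    streaming=True,
)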

@liltom-eth liltom-eth (Contributor Author) Aug 26, 2023

let's keep the branching logic for the formal PR

If I keep the branching logic here, the model arg will always be "gpt-3.5-turbo-16k", which then initializes an OpenaiLlm.
I think the reason is that there is no local button on the frontend, so my choice of GPT-3.5 always sets the model arg to gpt-3.5-turbo-16k.
And LLM_MODEL_USE from .env is overwritten by the frontend choice.

Screenshot 2023-08-26 021701


(presumably this is unusable on 3090 / too slow, right? ) @liltom-eth - do you have an a100 - or 2x 4090s?

Collaborator

I understand your current code makes showcasing the demo easier, but for us to merge into the code base we should still aim to integrate with the existing logic. I suggest we first make the backend part ready.

For the frontend selection, we can add an environment variable or an advanced UI option to enable local Llama inference. When this is toggled, the model string passed to the backend can be your choice here in the backend. The frontend part can be a separate PR if you would like. For testing only, you can change the model string of the "Llama-2-70b" button.
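
A rough sketch of the backend side of that suggestion (the get_llm name and the "localhost" convention are assumptions for illustration; Llama2wrapperLlm is the class added in this PR):

import os

from realtime_ai_character.llm.llama2wrapper_llm import Llama2wrapperLlm

def get_llm(model: str):
    # Hypothetical routing: when the frontend (or an env toggle) selects the
    # local backend, hand the local URL to Llama2wrapperLlm; otherwise fall
    # through to the existing OpenAI / Anthropic / Anyscale branches.
    local_url = os.getenv("LOCAL_LLM_URL")
    if local_url and model == "localhost":
        return Llama2wrapperLlm(url=local_url)
    ...  # existing branching logic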

@liltom-eth liltom-eth (Contributor Author) Aug 28, 2023

(presumably this is unusable on 3090 / too slow, right? ) @liltom-eth - do you have an a100 - or 2x 4090s?

I believe it is usable on a 3090 (running a GPTQ model at 18.85 tokens/sec on a 2080 Ti).
But right now, when I run on Windows WSL2 to demo on the 2080 Ti, I get some errors in Realchar.

Contributor Author

I understand your current code makes showcasing the demo easier, but for us to merge into the code base we should still aim to integrate with the existing logic. I suggest we first make the backend part ready.

For the frontend selection, we can add an environment variable or an advanced UI option to enable local Llama inference. When this is toggled, the model string passed to the backend can be your choice here in the backend. The frontend part can be a separate PR if you would like. For testing only, you can change the model string of the "Llama-2-70b" button.

Thank you! I will test it by using the "Llama-2-70b" button in this PR. Another PR for the frontend would be helpful.

Contributor Author

@pycui When I tried the frontend button "Llama-2-70b", it always threw an error like:

Screenshot 2023-08-28 at 8 56 35 PM

Is that error happening because of the Anyscale key check?

Collaborator

It seems to be because using a non-3.5 model directs you to Firebase auth, but you probably don't have a working Firebase app. You can probably edit client/web/src/App.jsx L218 so that (in your test) your model name doesn't require a sign-in.

@johndpope

Before this gets merged, there are some caveats to be mindful of with these local LLMs.
Mostly, the model supplied by Facebook out of the box is somewhat unusable on consumer hardware (I'm not sure if this PR is directly targeting that file format):
it has high floating-point precision, making the model huge and VRAM-intensive.
Here's an article explaining the ins and outs of this:
https://brandolosaria.medium.com/setting-up-metaais-code-llama-34b-instruct-model-fc009aa937f6

So everyone is using the quantized, smaller 4- or 5-bit models to get anything usable.
They also use Hugging Face to download the models, so it becomes trivial to get the latest models, e.g. via text-generation-webui.
For example, there have been two new ones for CodeLlama in the last 24 hours.

Screenshot from 2023-08-29 14-08-42

There's also contention over which models get merged, and this becomes a tech spike.
I raised this issue in another repo, suggesting yielding to a flexible upstream model provider:
nomic-ai/gpt4all#1238

This one seems great; then it becomes their problem to keep the models up to date:
https://github.com/lmstudio-ai/model-catalog/blob/main/catalog.json

# need to figure out how to set up model=url in the frontend
# if the "Llama-2-70b" button is selected on the frontend,
# model here will be "meta-llama/Llama-2-70b-chat-hf"
model = os.getenv('LOCAL_LLM_URL')
Contributor Author

@pycui Thank you! I have made some updates based on your suggestions.
If I select the "Llama-2-70b" button on the frontend, model here will be "meta-llama/Llama-2-70b-chat-hf".
Thus I load the URL temporarily from .env here.

Collaborator

Thanks, I made some changes to still use the model param. For testing, you can modify the frontend to pass localhost as the model name.

.env.example Outdated
@@ -24,6 +24,9 @@ OPENAI_API_KEY=YOUR_API_KEY
ANTHROPIC_API_KEY=YOUR_API_KEY
# Anyscale Endpoint API Key
ANYSCALE_ENDPOINT_API_KEY=
# Local LLM with OpenAI Compatible API
# LOCAL_LLM_URL="http://localhost:8001/v1"
Collaborator

Suggested change
# LOCAL_LLM_URL="http://localhost:8001/v1"
# Example value: "http://localhost:8001/v1"

Contributor Author

OK.

temperature=0.5,
streaming=True,
openai_api_base=url,
# openai_api_base="http://localhost:8001/v1",
Collaborator

remove this?

Contributor Author

Thanks! Made an update to clean this up.

@liltom-eth
Contributor Author

This one seems great; then it becomes their problem to keep the models up to date: https://github.com/lmstudio-ai/model-catalog/blob/main/catalog.json

Thanks! That is a good idea. A model catalog can be helpful for users and developers.

@pycui pycui merged commit 0e0bd26 into Shaunwei:main Aug 31, 2023
Shaunwei pushed a commit that referenced this pull request Sep 3, 2023
* add llama2-wrapper as local backend

* update local llm backend

* update local llm backend

* update

* Update __init__.py

* Update __init__.py

---------

Co-authored-by: Piaoyang Cui <bcstyle@gmail.com>
Shaunwei added a commit that referenced this pull request Sep 19, 2023

Shaunwei added a commit that referenced this pull request Sep 19, 2023

Shaunwei added a commit that referenced this pull request Sep 19, 2023

Shaunwei added a commit that referenced this pull request Sep 19, 2023
Shaunwei added a commit that referenced this pull request Sep 19, 2023
* Add local Llama2 support from llama2-wrapper backend (#400)

* add llama2-wrapper as local backend

* update local llm backend

* update local llm backend

* update

* Update __init__.py

* Update __init__.py

---------

Co-authored-by: Piaoyang Cui <bcstyle@gmail.com>

* Fix style issues and refine code (#425)

* Minor fix

* fix build

* fix style issues and refine code

* Use consistent name style

* Add API_HOST to react-web (#426)

* Update style to fit tablet screens (#427)

* Update README.md (minor typo) 😅 (#429)

* Add a Render deployment guide (#431)

* Add a Render deployment guide

* Update render_deploy.md

* Lint

* Format

* Lei/use zustand (#428)

* update page logic

* Apply zustand, fix minor bugs

* Solve the scroll issue

* Upload zustand files

* minor fix

* update cli to support next-web (#432)

* Update .gitignore (#433)

* Update .gitignore (#436)

* Reduce VAD latency. (#430)

* Lei/mobile next web (#437)

* Fix the avatar size in home page

* Update home page style to support mobile device

* Add mobile support for most of the page

* Remove 'add character' when small screen

* Lei/mobile next web (#439)

* Fix the avatar size in home page

* Update home page style to support mobile device

* Add mobile support for most of the page

* Remove 'add character' when small screen

* Finish hamburger menu and update page layout

* Fix minor layout issues

* Add ion (#442)

* Avatar embedding (#441)

* fix: update audio

* feat: avatar generation embedding

* chore: move embedding to top

* Lei/mobile next web (#439)

* Fix the avatar size in home page

* Update home page style to support mobile device

* Add mobile support for most of the page

* Remove 'add character' when small screen

* Finish hamburger menu and update page layout

* Fix minor layout issues

* fix: no audio in other character

---------

Co-authored-by: Lei Qiu <amethystlei@gmail.com>

* Add info loggers showing latencies of STT, LLM, TTS processes (#445)

* deployment working except for voice cloning

* update README: new issue about tts doesn't speak due to bad llm response

* deployment successful; essential features all function

* update README

* prepare to merge with main

* Add info loggers showing latencies of STT, LLM, TTS processes

* update .gitignore

* untrack reset_databash.sh

* update README

* Add info loggers showing latencies of STT, LLM, TTS processes

* Add more latency monitors specific for the APIs

* Refactor the timers into decorators; Report latencies together

* Add terms of service page (#453)

* Implement next-web functionalities.

* fix small issue recorderSlice.js (#455)

---------

Co-authored-by: Tom <plain1994@gmail.com>
Co-authored-by: Piaoyang Cui <bcstyle@gmail.com>
Co-authored-by: Lei Qiu <amethystlei@gmail.com>
Co-authored-by: Devansh <mdevansh28@gmail.com>
Co-authored-by: Fangbai Chai <139947087+hksfang@users.noreply.github.com>
Co-authored-by: Edwin Wong <73209427+HongSiu@users.noreply.github.com>
Co-authored-by: Yi Guo <guoyi0328@gmail.com>
Co-authored-by: Fangbai Chai <fangbaichai@gmail.com>