Add local Llama2 support from llama2-wrapper backend #400

Merged
merged 7 commits into Shaunwei:main on Aug 31, 2023

Conversation

@liltom-eth liltom-eth (Contributor) commented Aug 25, 2023

Hi @Shaunwei @pycui,
I am working on the project llama2-wrapper, which makes it easy to call a Llama2 model locally as an LLM backend.
To follow up on the Twitter discussion, I made this PR as a showcase of running Realchar and Llama2 locally on an M2 MacBook Air.
Here is the demo:
[demo video: realchar]

How to run on Mac:

Run an OpenAI-compatible API serving Llama2 models:

pip install llama2-wrapper
python -m llama2_wrapper.server  --port 8001
# Llama2 running on http://localhost:8001
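
To sanity-check the local server before starting Realchar, you can hit the OpenAI-compatible endpoint (a minimal sketch; it assumes the llama2-wrapper server exposes the standard /v1/models route):

import requests

# Quick check that the OpenAI-compatible server is answering (assumed /v1/models route).
resp = requests.get("http://localhost:8001/v1/models", timeout=5)
print(resp.status_code, resp.json())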

Start Realchar:

python cli.py web-build
python cli.py run-uvicorn

Implementation

I found it hard to load a local LLM object directly as the backend, since Realchar uses langchain.chat_models as its LLM interface.
Thus I chose to run the local LLM behind an OpenAI-compatible API and then call langchain.chat_models.ChatOpenAI against the local URL.
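
In other words, the local model is reached through the same ChatOpenAI class, only with the API base pointed at the local server. A minimal sketch of the idea (the keyword values here are illustrative; the exact arguments live in the PR's Llama2wrapperLlm):

from langchain.chat_models import ChatOpenAI

# Point LangChain's OpenAI chat client at the local llama2-wrapper server
# instead of api.openai.com; the local backend serves whichever Llama2 model
# it was started with, regardless of the model name passed here.
llm = ChatOpenAI(
    model_name="meta-llama/Llama-2-7b-chat-hf",  # illustrative placeholder
    temperature=0.5,
    streaming=True,
    openai_api_base="http://localhost:8001/v1",
)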

Issues

Right now the PR still has an issue with automatically passing a custom URL from .env to the LLM as the model URL. I haven't figured out how to add a new LLM option to the new Realchar Web UI, so for now the URL is hard-coded to make Realchar run on llama2-wrapper.

Showcase

This showcase runs Realchar and Llama2 on a Mac (13.70 tokens/sec through llama.cpp).
Another interesting showcase might be running Realchar and Llama2 on a free Colab T4 GPU (18.19 tokens/sec through GPTQ).

@Shaunwei Shaunwei self-requested a review August 26, 2023 00:00
@pycui pycui (Collaborator) left a comment

Thanks for making this! A few comments

streaming=True,
# openai_api_base=url,
# temporarily use a fixed url
openai_api_base="http://localhost:8001/v1",
Collaborator
Can make this an env variable.
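
For example (a sketch only, using the LOCAL_LLM_URL variable this PR later adds to .env.example):

import os

# Read the local server URL from the environment instead of hard-coding it,
# falling back to the default llama2-wrapper port.
openai_api_base = os.getenv("LOCAL_LLM_URL", "http://localhost:8001/v1")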

# need to figure out how to set up llama2wrapper in the frontend
from realtime_ai_character.llm.llama2wrapper_llm import Llama2wrapperLlm

return Llama2wrapperLlm(url=model)
Collaborator
let's keep the branching logic for the formal PR

Collaborator
Also, we might need a convention to route to local; e.g. maybe just call it "local" for now.

Contributor Author
Thanks! I found that OPENAI_API_KEY in .env is always required; if it is missing, this error is raised:

openai.error.AuthenticationError: Incorrect API key provided: YOUR_API_KEY. You can find your API key at https://platform.openai.com/account/api-keys.
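
If only the local backend is being exercised, one workaround (an assumption on my part, not something this PR settles) is to pass an explicit placeholder key to the locally-pointed client, since an OpenAI-compatible local server typically does not validate it:

from langchain.chat_models import ChatOpenAI

# Placeholder key for the local server only; requests that actually reach
# api.openai.com still need a real OPENAI_API_KEY.
local_llm = ChatOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8001/v1",
    streaming=True,
)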

@liltom-eth liltom-eth (Contributor Author) Aug 26, 2023

let's keep the branching logic for the formal PR

If I keep the branching logic here, the model arg will always be "gpt-3.5-turbo-16k", which then initializes an OpenaiLlm.
I think the reason is that there is no local button on the frontend, so my choice of GPT-3.5 always sets the model arg to gpt-3.5-turbo-16k.
And LLM_MODEL_USE from .env is overwritten by the frontend choice.

Screenshot 2023-08-26 021701


(presumably this is unusable on 3090 / too slow, right? ) @liltom-eth - do you have an a100 - or 2x 4090s?

Collaborator

I understand your current code makes showcasing the demo easier, but for us to merge into the code base we should still aim to integrate with the existing logic. I suggest we first make the backend part ready.

For the frontend selection, we can add an environment variable or an advanced UI option to enable local Llama inference. When this is toggled, the model string passed to the backend can be your choice here in the backend. The frontend part can be a separate PR if you would like. For testing only, you can change the model string of the "Llama-2-70b" button.
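
A rough sketch of the backend side of that suggestion (the get_llm name and the "localhost" convention are assumptions for illustration; Llama2wrapperLlm is the class added in this PR):

import os

from realtime_ai_character.llm.llama2wrapper_llm import Llama2wrapperLlm

def get_llm(model: str):
    # Hypothetical routing: when the frontend (or an env toggle) selects the
    # local backend, hand the local URL to Llama2wrapperLlm; otherwise fall
    # through to the existing OpenAI / Anthropic / Anyscale branches.
    local_url = os.getenv("LOCAL_LLM_URL")
    if local_url and model == "localhost":
        return Llama2wrapperLlm(url=local_url)
    ...  # existing branching logic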

@liltom-eth liltom-eth (Contributor Author) Aug 28, 2023

(presumably this is unusable on 3090 / too slow, right? ) @liltom-eth - do you have an a100 - or 2x 4090s?

I believe it is usable on a 3090 (running a GPTQ model at 18.85 tokens/sec on a 2080 Ti).
But right now, when I run on Windows WSL2 to demo on the 2080 Ti, I get some errors in Realchar.

Contributor Author

I understand your current code makes showcasing the demo easier, but for us to merge into the code base we should still aim to integrate with the existing logic. I suggest we first make the backend part ready.

For the frontend selection, we can add an environment variable or an advanced UI option to enable local Llama inference. When this is toggled, the model string passed to the backend can be your choice here in the backend. The frontend part can be a separate PR if you would like. For testing only, you can change the model string of the "Llama-2-70b" button.

Thank you! I will test it by using the "Llama-2-70b" button in this PR. Another PR for the frontend would be helpful.

Contributor Author

@pycui When I tried the frontend button "Llama-2-70b", it always threw an error like:

Screenshot 2023-08-28 at 8 56 35 PM

Is that error happening because of the Anyscale key check?

Collaborator

It seems to be because using a non-3.5 model directs you to Firebase auth, but you probably don't have a working Firebase app. You can probably edit client/web/src/App.jsx L218 so that (in your test) your model name doesn't require a sign-in.

@johndpope

Before this gets merged, there are some caveats to be mindful of with these local LLMs.
Mostly, the model supplied by Facebook out of the box is somewhat unusable on consumer hardware (I'm not sure if this PR is directly targeting that file format):
it has high floating-point precision, making the model huge and VRAM-intensive.
Here's an article explaining the ins and outs of this:
https://brandolosaria.medium.com/setting-up-metaais-code-llama-34b-instruct-model-fc009aa937f6

So everyone is using the quantized, smaller 4- or 5-bit models to get anything usable.
They also use Hugging Face to download the models, so it becomes trivial to get the latest models, e.g. via text-generation-webui.
For example, there have been two new ones for CodeLlama in the last 24 hours.

Screenshot from 2023-08-29 14-08-42

There's also contention over which models get merged, and this becomes a tech spike.
I raised this issue in another repo, suggesting yielding to a flexible upstream model provider:
nomic-ai/gpt4all#1238

This one seems great; then it becomes their problem to keep the models up to date:
https://github.com/lmstudio-ai/model-catalog/blob/main/catalog.json

# need to figure out how to set up model=url in the frontend
# if the "Llama-2-70b" button is selected on the frontend,
# model here will be "meta-llama/Llama-2-70b-chat-hf"
model = os.getenv('LOCAL_LLM_URL')
Contributor Author

@pycui Thank you! I have made some updates based on your suggestions.
If I select the "Llama-2-70b" button on the frontend, model here will be "meta-llama/Llama-2-70b-chat-hf".
Thus I load the URL temporarily from .env here.

Collaborator

Thanks, I made some changes to still use the model param. For testing, you can modify the frontend to pass localhost as the model name.

.env.example Outdated
@@ -24,6 +24,9 @@ OPENAI_API_KEY=YOUR_API_KEY
ANTHROPIC_API_KEY=YOUR_API_KEY
# Anyscale Endpoint API Key
ANYSCALE_ENDPOINT_API_KEY=
# Local LLM with OpenAI Compatible API
# LOCAL_LLM_URL="http://localhost:8001/v1"
Collaborator

Suggested change
# LOCAL_LLM_URL="http://localhost:8001/v1"
# Example value: "http://localhost:8001/v1"

Contributor Author

OK.

temperature=0.5,
streaming=True,
openai_api_base=url,
# openai_api_base="http://localhost:8001/v1",
Collaborator

remove this?

Contributor Author

Thanks! Made an update to clean this up.

@liltom-eth
Contributor Author

This one seems great; then it becomes their problem to keep the models up to date: https://github.com/lmstudio-ai/model-catalog/blob/main/catalog.json

Thanks! That is a good idea. A model catalog can be helpful for users and developers.

@pycui pycui merged commit 0e0bd26 into Shaunwei:main Aug 31, 2023
Shaunwei pushed a commit that referenced this pull request Sep 3, 2023
* add llama2-wrapper as local backend

* update local llm backend

* update local llm backend

* update

* Update __init__.py

* Update __init__.py

---------

Co-authored-by: Piaoyang Cui <bcstyle@gmail.com>
Shaunwei added a commit that referenced this pull request Sep 19, 2023

Shaunwei added a commit that referenced this pull request Sep 19, 2023

Shaunwei added a commit that referenced this pull request Sep 19, 2023

Shaunwei added a commit that referenced this pull request Sep 19, 2023
Shaunwei added a commit that referenced this pull request Sep 19, 2023
* Add local Llama2 support from llama2-wrapper backend (#400)

* add llama2-wrapper as local backend

* update local llm backend

* update local llm backend

* update

* Update __init__.py

* Update __init__.py

---------

Co-authored-by: Piaoyang Cui <bcstyle@gmail.com>

* Fix style issues and refine code (#425)

* Minor fix

* fix build

* fix style issues and refine code

* Use consistent name style

* Add API_HOST to react-web (#426)

* Update style to fit tablet screens (#427)

* Update README.md (minor typo) 😅 (#429)

* Add a Render deployment guide (#431)

* Add a Render deployment guide

* Update render_deploy.md

* Lint

* Format

* Lei/use zustand (#428)

* update page logic

* Apply zustand, fix minor bugs

* Solve the scroll issue

* Upload zustand files

* minor fix

* update cli to support next-web (#432)

* Update .gitignore (#433)

* Update .gitignore (#436)

* Reduce VAD latency. (#430)

* Lei/mobile next web (#437)

* Fix the avatar size in home page

* Update home page style to support mobile device

* Add mobile support for most of the page

* Remove 'add character' when small screen

* Lei/mobile next web (#439)

* Fix the avatar size in home page

* Update home page style to support mobile device

* Add mobile support for most of the page

* Remove 'add character' when small screen

* Finish hamburger menu and update page layout

* Fix minor layout issues

* Add ion (#442)

* Avatar embedding (#441)

* fix: update audio

* feat: avatar generation embedding

* chore: move embedding to top

* Lei/mobile next web (#439)

* Fix the avatar size in home page

* Update home page style to support mobile device

* Add mobile support for most of the page

* Remove 'add character' when small screen

* Finish hamburger menu and update page layout

* Fix minor layout issues

* fix: no audio in other character

---------

Co-authored-by: Lei Qiu <amethystlei@gmail.com>

* Add info loggers showing latencies of STT, LLM, TTS processes (#445)

* deployment working except for voice cloning

* update README: new issue about tts doesn't speak due to bad llm response

* deployment successful; essential features all function

* update README

* prepare to merge with main

* Add info loggers showing latencies of STT, LLM, TTS processes

* update .gitignore

* untrack reset_databash.sh

* update README

* Add info loggers showing latencies of STT, LLM, TTS processes

* Add more latency monitors specific for the APIs

* Refactor the timers into decorators; Report latencies together

* Add terms of service page (#453)

* Implement next-web functionalities.

* fix small issue recorderSlice.js (#455)

---------

Co-authored-by: Tom <plain1994@gmail.com>
Co-authored-by: Piaoyang Cui <bcstyle@gmail.com>
Co-authored-by: Lei Qiu <amethystlei@gmail.com>
Co-authored-by: Devansh <mdevansh28@gmail.com>
Co-authored-by: Fangbai Chai <139947087+hksfang@users.noreply.github.com>
Co-authored-by: Edwin Wong <73209427+HongSiu@users.noreply.github.com>
Co-authored-by: Yi Guo <guoyi0328@gmail.com>
Co-authored-by: Fangbai Chai <fangbaichai@gmail.com>