
make some tests and choose an openai api compatible local llm server #7

Open · furlat opened this issue Feb 12, 2024 · 4 comments
Labels: enhancement (New feature or request)

Comments

@furlat (Contributor) commented Feb 12, 2024

https://github.com/ollama/ollama
https://github.com/abetlen/llama-cpp-python
https://github.com/vllm-project/vllm
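
All three candidates expose an OpenAI-compatible HTTP API, so the same client code can be pointed at whichever server is under test. A minimal smoke-test sketch, assuming the `openai` Python client; the base URL, port, and model name below are assumptions and differ per server:

```python
# Minimal smoke test against an OpenAI-compatible local server.
# Base URL, port, and model name are assumptions and differ per server
# (e.g. ollama defaults to port 11434, vLLM and llama-cpp-python to 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # whichever model the server loaded
    messages=[{"role": "user", "content": "Reply with the single word: pong"}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```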

furlat added the enhancement (New feature or request) label on Feb 12, 2024
furlat changed the title from "make some tests an choose an openai api compatible local llm server" to "make some tests and choose an openai api compatible local llm server" on Feb 12, 2024
@furlat (Contributor, Author) commented Feb 20, 2024

It could also be worth trying a Modal deployment of the LLM server:
https://modal.com/docs/examples/vllm_inference
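
A rough sketch of what such a deployment might look like, loosely following the linked Modal example. The app name, GPU type, and model ID are assumptions, and the Modal decorator names have varied between SDK versions (older releases use `modal.Stub` and `.call()` instead of `modal.App` and `.remote()`):

```python
# Hypothetical sketch of batched Mixtral generation with vLLM on Modal.
# App name, GPU type, and model ID are assumptions, not project decisions.
import modal

# Container image with vLLM installed.
image = modal.Image.debian_slim().pip_install("vllm")

app = modal.App("mixtral-vllm-inference")  # modal.Stub in older SDK versions


@app.function(image=image, gpu="A100")
def generate(prompts: list[str]) -> list[str]:
    from vllm import LLM, SamplingParams

    # Load the model once per container, then run one batched generation pass.
    llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1")
    params = SamplingParams(temperature=0.0, max_tokens=256)
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]


@app.local_entrypoint()
def main():
    print(generate.remote(["Summarize this document: ..."]))
```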

@furlat (Contributor, Author) commented Feb 23, 2024

@furlat (Contributor, Author) commented Mar 1, 2024

Interesting PRs for vLLM with respect to speculative decoding (vllm-project/vllm#2188) and fused MoE kernels (vllm-project/vllm#2913, vllm-project/vllm#2979).

@furlat (Contributor, Author) commented Mar 21, 2024

The neural network architecture used as the language model will be Mixtral. The server must meet the following requirements:

- Structured extraction using Pydantic (see the sketch after this list).
- Efficient batching, so that the same task can be repeated on thousands of different documents.
- A common-prefix system that reuses the KV cache for the system prompt and other shared prefixes.
- Speculative decoding, such as prompt n-gram caching, which lets the model propose draft tokens from text already present in the input.
- Support for quantized models, since we should be able to run Mixtral on a budget of 2x RTX 4090 per server, i.e. 48 GB of VRAM per server.
- FastAPI to serve the inference server to other machines within the servers' VPN/LAN.
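
A minimal sketch of the structured-extraction requirement, assuming Pydantic v2; the `DocumentFacts` fields are placeholders for whatever we actually want to extract, not part of the repository:

```python
# Hypothetical extraction schema; the fields are placeholders.
from pydantic import BaseModel, ValidationError


class DocumentFacts(BaseModel):
    title: str
    authors: list[str]
    summary: str


# The JSON schema can be embedded in the system prompt so the model knows the target shape.
schema_prompt = (
    "Return only a JSON object matching this schema:\n"
    + str(DocumentFacts.model_json_schema())
)


def parse_reply(raw_completion: str) -> DocumentFacts | None:
    # Validate the model's raw text output against the Pydantic schema.
    try:
        return DocumentFacts.model_validate_json(raw_completion)
    except ValidationError:
        return None  # caller can retry or re-prompt
```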

The goal and evaluation metric will be the read and write speed of the inference server. In particular, we are interested in how much time it takes to read and write one million tokens with the same structured extraction task on about a thousand documents in parallel.
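
A rough sketch of how that measurement could be scripted against an OpenAI-compatible endpoint; the base URL, model name, system prompt, and concurrency limit are assumptions:

```python
# Hypothetical throughput benchmark; endpoint, model name, and concurrency are assumptions.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
semaphore = asyncio.Semaphore(64)  # cap in-flight requests


async def extract_one(document: str) -> tuple[int, int]:
    async with semaphore:
        response = await client.chat.completions.create(
            model="mistralai/Mixtral-8x7B-Instruct-v0.1",
            messages=[
                {"role": "system", "content": "Extract the fields as JSON."},
                {"role": "user", "content": document},
            ],
            temperature=0.0,
        )
    usage = response.usage
    return usage.prompt_tokens, usage.completion_tokens


async def benchmark(documents: list[str]) -> None:
    start = time.perf_counter()
    counts = await asyncio.gather(*(extract_one(d) for d in documents))
    elapsed = time.perf_counter() - start
    read = sum(p for p, _ in counts)
    written = sum(c for _, c in counts)
    print(f"read {read} tok, wrote {written} tok in {elapsed:.1f}s "
          f"({(read + written) / elapsed:.0f} tok/s)")


# asyncio.run(benchmark(load_documents()))  # ~1000 documents in parallel
```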

---> the same evaluation should also be run on Modal
