-
Notifications
You must be signed in to change notification settings - Fork 29
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
It took annoyingly a lot of effort just to make this simple server. I tried rouille web framework first, but it didn't support getting chunked output to the client line-by-line. (seems that if it exposed more details about the underlying tiny-http package I could have hacked it to work). I went with Rocket because it had less async stuff and seemed decent. I got weird issues where it seemed as if memory use kept increasing and increasing. I may have got that fixed but I couldn't figure out what made it use so much memory, even tools like valgrind and heaptrack told me there isn't that much memory allocated but I can see RES increasing in `htop`. Switched to MiMalloc as it seems to slightly decrease memory use. Added details about the inference server to README.md. And also added an example Python script of it. I want to use this feature to later investigate how much do quantizations or f16/f32 affect output. Easier to do such things on Python.
- Loading branch information
Showing
10 changed files
with
1,335 additions
and
54 deletions.
There are no files selected for viewing
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
#!/usr/bin/env python3 | ||
|
||
""" | ||
This script uses the rllama API to generate tokens. | ||
It does not print the tokens nicely. | ||
""" | ||
|
||
import requests | ||
|
||
def main(): | ||
url = 'http://127.0.0.1:8080/rllama/v1/inference' | ||
req = { | ||
'prompt': 'Hello world!', | ||
'max_seq_len': 1024, | ||
'max_new_tokens': 200, | ||
'no_token_sampling': False | ||
} | ||
res = requests.post(url, json=req, stream=True) | ||
for line in res.iter_lines(): | ||
print(line.decode('utf-8')) | ||
|
||
|
||
if __name__ == '__main__': | ||
main() |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.