Bug: InternLM 2.5 Chat Tool Calls: Incorrect and Inconsistent Formatting #8405

Open
apresence opened this issue Jul 10, 2024 · 13 comments
Labels: bug-unconfirmed, low severity

@apresence

apresence commented Jul 10, 2024

What happened?

After having spent the better part of two days chasing my tail with this, I figured I'd try to save someone else the trouble.

I have tested the latest llama.cpp with the various GGUFs of internlm2_5-7b-chat, including the one provided by internlm and several community conversions, as well as HF transformers, in both chat and completion modes. I cannot get tool calls to work as described in the paper and summarized here and outlined further here.

I'm creating this issue for two reasons:

  1. When run through llama.cpp, the tool call output formatting varies when it should adhere to the strict format outlined in the paper.
  2. The checkpoint on HF appears to be broken too, but to a lesser degree. This isn't a llama.cpp issue. More on that below.

I posted a message about this issue on HF as well.

Name and Version

$ ./main --version
version: 3093 (7672ade)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Tokenization?

For the record, this doesn't appear to be a (de)tokenization issue, as those are (de/en)coding fine:

$ curl http://127.0.0.1:8080/tokenize -H "Content-Type: application/json" -d '{"content": "<|plugin|><|interpreter|><|action_end|><|action_start|><|im_end|><|im_start|>"}'; echo
{"tokens":[92538,92539,92540,92541,92542,92543]}
$ curl http://127.0.0.1:8080/detokenize -H "Content-Type: application/json" -d '{"tokens":[92538,92539,92540,92541,92542,92543]}'; echo
{"content":"<|plugin|><|interpreter|><|action_end|><|action_start|><|im_end|><|im_start|>"}

The HF model is (de/en)coding fine too.

Issue 1: Inconsistent Formatting

Command:

./main --predict 512 --gpu-layers 32 --temp 0.8 --top-p 0.8 --top-k 50 -r '<|im_end|>\n' -if --multiline-input --model models/internlm.internlm2_5-7b-chat-q4_k_m.gguf --special

NOTE: top-k and other parameters are taken from the model card on HF.

Input:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>system name=<|plugin|>
[{"name": "generate_image", "description": "Generates an image based on the given text prompt", "parameters": {"type": "object", "properties": {"prompt": {"type": "string", "description": "The text prompt used to guide image generation"}}, "required": ["prompt"]}}]<|im_end|>
<|im_start|>user
Draw a picture of a kitten.<|im_end|>
<|im_start|>assistant
\

Example Output 1:

I will call an image generation api to generate image<|action_start|><|plugin|>
{"name": "generate_image", "parameters": {"prompt": "A cute, fluffy kitten with big, expressive eyes, sitting on a soft, plush cushion, natural lighting, pastel colors, impressionism, high resolution, captured on a DSLR camera, detailed textures on the fur and cushion."}}<|action_end|>
<|im_end|><|im_end|>

There should be only one <|im_end|> token there, not two. Let's continue on and simulate the return for that tool call.

Input:

<|im_start|>environment name=<|plugin|>
{"image_url": "http://127.0.0.1/sd/out/123253252345.png"}<|im_end|>
<|im_start|>assistant
\

Output:

![alt text](http://127.0.0.1/sd/out/123253252345.png)

Here is the picture of a kitten:

[![alt text](http://127.0.0.1/sd/out/123253252345.png)](http://127.0.0.1/sd/out/123253252345.png)<|im_end|>

So it sorta works, but the tokens seem to be jumbled up, and it's linkin' the heck out of that image!

The other problem here is that the special tokens aren't passed through by llama.cpp's server when the /completion endpoint is used. I tried the --special option to see if that would help, but it didn't.
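For concreteness, this is the kind of request I'm sending to the /completion endpoint (a minimal sketch; it uses the server's prompt/n_predict/stop fields and the same raw prompt as above, with whatever host/port llama-server was started on):

import requests

# Same raw prompt as the llama-cli example above.
PROMPT = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>system name=<|plugin|>\n"
    '[{"name": "generate_image", "description": "Generates an image based on the given text prompt", '
    '"parameters": {"type": "object", "properties": {"prompt": {"type": "string", '
    '"description": "The text prompt used to guide image generation"}}, "required": ["prompt"]}}]<|im_end|>\n'
    "<|im_start|>user\nDraw a picture of a kitten.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": PROMPT, "n_predict": 512, "stop": ["<|im_end|>"]},
).json()

# With the stock GGUF, <|action_start|>/<|plugin|>/<|action_end|> never show up in
# resp["content"], whether or not the server was started with --special.
print(resp["content"])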

Example Output 2: <|api_name=generate_image|>...<|api_name_end|>

I didn't have --special turned on for this so the special tokens weren't displayed, but the model made up new ones!

I will call an image generation api to generate image<|im_end|>
<|api_name=generate_image|>{"parameters": {"prompt": "A playful kitten sitting on a windowsill, looking out at the world, bright eyes, fluffy fur, natural lighting, impressionism, high resolution, captured on a DSLR camera, Monet-style painting, with a touch of modern art."}}<|api_name_end|>

Issue 2: Incorrect Formatting

FWIW, the internlm/internlm2_5-7b-chat model does not adhere to the spec even when run with transformers, but at least the output is consistent. That is, it always returns only the JSON part of the tool call, without the surrounding <|action_start|><|plugin|>...<|action_end|> tags.

It might just not be returning the special tokens. I'm looking into that.

The following is a console snippet from a docker container created from the image huggingface/transformers-pytorch-gpu and run on a GeForce RTX 4090.

$ cat > kitten-example << EOF
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = "internlm/internlm2_5-7b-chat"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# This is a hack to stuff the plugin prompt in there without having to deal with all the (de)tokenization
meta_inst = """You are a helpful assistant.<|im_end|>
<|im_start|>system name=<|plugin|>
[{"name": "generate_image", "description": "Generates an image based on the given text prompt", "parameters": {"type": "object", "properties": {"prompt": {"type": "string", "description": "The text prompt used to guide image generation"}}, "required": ["prompt"]}}]
""".strip()
model = model.eval()
response, history = model.chat(
    tokenizer=tokenizer,
    query="Draw a picture of a kitten.",
    meta_instruction=meta_inst
)
print(response)
EOF
$ /usr/bin/python3 kitten-example
Loading checkpoint shards: 100%|████████| 8/8 [00:02<00:00,  3.69it/s]
I'm calling the API function 'generate_image' with the argument 'prompt' set to 'A kitten'. This API call will generate an image of a kitten based on the provided text prompt. I believe this API call is made because the user expressed interest in seeing a picture of a kitten. By using this function, I can fulfill the user's request and provide them with a visual representation of a kitten.
{"name": "generate_image", "parameters": {"prompt": "A kitten"}}
@dspasyuk
Contributor

@apresence it looks like you are using an old release. The guys have done a lot of work addressing issues with prompt templates, so everything should be working close to perfect right now. Also, main does not exist anymore; use llama-cli instead, and refer to the manual linked at the bottom of the README for more details.

@apresence
Author

apresence commented Jul 11, 2024

Well, that was a n00b mistake. I had pulled the latest from git in-place in the dir where I had cloned it previously, then recompiled. But I didn't realize the binary names had changed, and I didn't do a make clean first, so I kept using the old server/main instead of the new llama-* variants. In any case, I am now running the latest version, and the results are the same, at least with the /completion endpoint.

$ ./llama-cli --version
version: 3368 (dd07a123)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

@dspasyuk
Contributor

@apresence no worries, I am using qwen for this test:

./llama.cpp/llama-cli -ngl 15 -m ../../models/qwen2-7b-instruct-q2_k.gguf --predict 512 -p "<|im_start|>system\nAnswer as a pirate<|im_end|>\n<|im_start|>user\nWhat is the capital of Canada<|im_end|>\n<|im_start|>assistant" --special

and this is what I get:

<|im_start|>system
Answer as a pirate<|im_end|>
<|im_start|>user
What is the capital of Canada<|im_end|>
<|im_start|>assistant
Arrrr matey! The capital of Canada be Ottawa. It's located in the heart of the country, near the St. Lawrence River. Now, let's set sail to explore some treasure maps or perhaps raid some Spanish galleons! What be your next quest?<|im_end|> [end of text]

@apresence
Author

apresence commented Jul 11, 2024

I'm focusing specifically on the tool calling feature of InternLM 2.5.

There are a few main issues:

  1. Hidden tokens: The tokens associated with tool calling (i.e. <|action_start|>, <|action_end|> and friends) are marked as special tokens, so llama.cpp hides them. Interestingly, the same issue exists within the transformers library, I believe for the same reason. This makes it difficult to reliably parse the tool calls on the client side because the tokens are missing.
  2. Unreliable tool calling: Often the model will claim it does not have tools, or, even when it admits that it does, it says it doesn't have the capability to call them, or that the user has to call them. This is not an issue with the transformers implementation; it calls the tools every time. I've started digging into the custom InternLM code there to see if they're using some special tricks.
  3. Increasing perplexity as the prompt gets longer: As the prompt gets longer, the model starts getting confused, repeating itself, etc., even when its length is well below the configured context size. This might be due to some nuance of the RoPE scaling that isn't being taken into account. I've started looking into the InternLM-specific code in transformers for this, but I'll admit I don't know much about how the scaling works. This might be one of the reasons why issue 2 above is occurring.

Regarding Issue 1, there are two ways I see to fix this:

  1. Completion endpoint: I believe that if we remove the tool tokens from the special tokens list, llama.cpp will pass them through. This could be done in the GGUF metadata (which would require re-converting the copies already out there) or by overriding them within llama.cpp. I don't think the latter is standard practice, so I hesitate to take that approach.
  2. Chat endpoint: Mostly a matter of implementing the chat template within llama.cpp. InternLM 2.5 uses a modified chatml template, so you can mostly get away with using chatml unchanged, but there might be something particular about the tool calls that requires custom handling (see the sketch below). I didn't realize until I recently reviewed llama.cpp's code that it doesn't consume chat templates directly; rather, it uses a sort of fingerprinting to identify which template is required and maps that to custom code within llama.cpp. I think I have what I need to implement this. The only open question is whether I can get access to those hidden tokens within llama.cpp; I haven't had a chance to dig into that yet.
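For reference, here is roughly what applying that template amounts to in the tool-call case, based on the prompt format used earlier in this issue (a Python sketch of the format only, not llama.cpp code):

import json

def internlm2_tool_prompt(system: str, tools: list, messages: list) -> str:
    """Render the modified-chatml format InternLM 2.5 uses for tool calls,
    mirroring the hand-built prompts shown above."""
    out = f"<|im_start|>system\n{system}<|im_end|>\n"
    if tools:
        out += f"<|im_start|>system name=<|plugin|>\n{json.dumps(tools)}<|im_end|>\n"
    for msg in messages:
        role = msg["role"]  # "user", "assistant", or "environment" for tool results
        suffix = " name=<|plugin|>" if role == "environment" else ""
        out += f"<|im_start|>{role}{suffix}\n{msg['content']}<|im_end|>\n"
    return out + "<|im_start|>assistant\n"

Plain chatml has no equivalent of the name=<|plugin|> blocks, which is the part that would need custom handling.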

@apresence
Author

apresence commented Jul 11, 2024

OK, I modified tokenizer_config.json and cleared the "special" flag for the tool tokens ('<|plugin|>', '<|interpreter|>', '<|action_end|>', '<|action_start|>'), and now they are being passed properly for the /completion endpoint.

I'll have fixed GGUFs up on HF shortly.
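For anyone who wants to make the same change before re-converting, the edit boils down to something like this (a sketch; it assumes the tokens are declared under added_tokens_decoder in tokenizer_config.json, as in recent HF checkpoints):

import json

# Tool-call tokens whose "special" flag gets cleared so they survive detokenization.
TOOL_TOKENS = {"<|plugin|>", "<|interpreter|>", "<|action_start|>", "<|action_end|>"}

with open("tokenizer_config.json") as f:
    cfg = json.load(f)

for tok_id, entry in cfg.get("added_tokens_decoder", {}).items():
    if entry.get("content") in TOOL_TOKENS:
        entry["special"] = False  # clearing this was enough for the tokens to pass through after re-conversion

with open("tokenizer_config.json", "w") as f:
    json.dump(cfg, f, ensure_ascii=False, indent=2)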

@foldl
Contributor

foldl commented Jul 11, 2024

Here's my experiment. It looks ok.

https://github.com/foldl/chatllm.cpp/blob/master/scripts/tool_internlm.py

@apresence
Author

apresence commented Jul 12, 2024

> Here's my experiment. It looks ok.
>
> https://github.com/foldl/chatllm.cpp/blob/master/scripts/tool_internlm.py

It looks like you're using a modified version of llama.cpp there. It's possible your version handles special tokens differently, or you're using a different route to the backend ggml libs than I am.

Also, your code is using Chinese for the prompts. I doubt that would make a difference, but it's possible.

@foldl
Contributor

foldl commented Jul 12, 2024

> It looks like you're using a modified version of llama.cpp there. It's possible your version handles special tokens differently, or you're using a different route to the backend ggml libs than I am.
>
> Also, your code is using Chinese for the prompts. I doubt that would make a difference, but it's possible.

The Chinese prompt is from their official example:

https://github.com/InternLM/lagent/blob/main/lagent/agents/internlm2_agent.py

@apresence
Author

apresence commented Jul 12, 2024

After some more testing, I've found that the tool call works 100% of the time if --rope-scaling none is passed to llama-cli.

The following config output is from running the sample code in transformers with debugging turned on:

Model config InternLM2Config {
  "_name_or_path": "internlm/internlm2_5-7b-chat",
  "architectures": [
    "InternLM2ForCausalLM"
  ],
  "attn_implementation": "eager",
  "auto_map": {
    "AutoConfig": "internlm/internlm2_5-7b-chat--configuration_internlm2.InternLM2Config",
    "AutoModel": "internlm/internlm2_5-7b-chat--modeling_internlm2.InternLM2ForCausalLM",
    "AutoModelForCausalLM": "internlm/internlm2_5-7b-chat--modeling_internlm2.InternLM2ForCausalLM"
  },
  "bias": false,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "internlm2",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 2,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 2.0,
    "type": "dynamic"
  },
  "rope_theta": 1000000,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.43.0.dev0",
  "use_cache": true,
  "vocab_size": 92544
}

As you can see, rope scaling is set to dynamic. I'm guessing that neither of llama.cpp's options -- linear or yarn -- match:

         --rope-scaling {none,linear,yarn}
                                  RoPE frequency scaling method, defaults to linear unless specified by the model

I think this might be why tool calls are unreliable with rope scaling turned on.

Looking at the configuration_internlm2.py code from transformers, we can see this comment about the dynamic rope:

rope_scaling (`Dict`, *optional*):
    Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
    strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
    `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
    `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
    these scaling strategies behave:
    https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
    experimental feature, subject to breaking API changes in future versions.

Since it says that 'linear' is an option, I tried it with rope_scaling={"type": "linear", "factor": 2.0} and... guess what... tool calls are unreliable, and perplexity went up. So there you have it.
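For context, the 'dynamic' strategy in the transformers code (dynamic NTK scaling) leaves RoPE untouched until the sequence exceeds max_position_embeddings, and only then recomputes the base from the current sequence length. A simplified sketch of the idea, using the dim/base/factor/max-length values from the config above (not llama.cpp code):

import numpy as np

def dynamic_ntk_inv_freq(seq_len: int, dim: int = 128, base: float = 1_000_000.0,
                         factor: float = 2.0, max_pos: int = 32768) -> np.ndarray:
    """Inverse RoPE frequencies as dynamic NTK scaling computes them:
    unchanged below max_pos, length-dependent stretch of the base above it.
    dim = hidden_size / num_attention_heads = 4096 / 32 = 128."""
    if seq_len > max_pos:
        base = base * ((factor * seq_len / max_pos) - (factor - 1)) ** (dim / (dim - 2))
    return 1.0 / (base ** (np.arange(0, dim, 2, dtype=np.float64) / dim))

By contrast, llama.cpp's linear scaling (which the meta output further down shows it selecting for this model) rescales every position regardless of prompt length, which would distort even short prompts and could explain why --rope-scaling none behaves so much better here.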

Implementing dynamic RoPE scaling in llama.cpp is way over my head... any ideas out there?

Just to button it all up ...

This works 100% of the time:

./llama-cli --gpu-layers 32 --temp 0.8 --top-p 0.8 --top-k 50 -r '<|im_end|>\n' -if --multiline-input --model internlm2_5-7b-chat-Q4_K_M.gguf --rope-scaling none

This works only ~ 50% of the time:

./llama-cli --gpu-layers 32 --temp 0.8 --top-p 0.8 --top-k 50 -r '<|im_end|>\n' -if --multiline-input --model internlm2_5-7b-chat-Q4_K_M.gguf

And it's probably because:

llm_load_print_meta: rope scaling     = linear

@RunningLeon
Contributor

RunningLeon commented Jul 16, 2024

> OK, I modified tokenizer_config.json and cleared the "special" flag for the tool tokens ('<|plugin|>', '<|interpreter|>', '<|action_end|>', '<|action_start|>'), and now they are being passed properly for the /completion endpoint.
>
> I'll have fixed GGUFs up on HF shortly.

@apresence @dspasyuk hi guys, llama-server also has a --special argument; does it work for the /completion endpoint when using the function call feature?

BTW, we could open a pull request that sets the tool special tokens to SentencePieceTokenTypes.USER_DEFINED by changing convert_hf_to_gguf.py, if we want them to be output. But is that a good way to solve it?
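Something along these lines is what I mean (an illustrative sketch only, not actual convert_hf_to_gguf.py code; it assumes the tokens/toktypes lists that the InternLM2 set_vocab path builds, and the converter's SentencePiece token-type numbering):

# Hypothetical patch sketch: downgrade the tool-call tokens from CONTROL to
# USER_DEFINED so llama.cpp will render them in generated output.
TOOL_TOKENS = {"<|plugin|>", "<|interpreter|>", "<|action_start|>", "<|action_end|>"}

def mark_tool_tokens_user_defined(tokens, toktypes):
    """tokens/toktypes as assembled by set_vocab; modified in place."""
    USER_DEFINED = 4  # SentencePieceTokenTypes.USER_DEFINED (CONTROL is 3)
    for i, tok in enumerate(tokens):
        text = tok.decode("utf-8", errors="ignore") if isinstance(tok, bytes) else tok
        if text in TOOL_TOKENS:
            toktypes[i] = USER_DEFINED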

@RunningLeon
Contributor

> After some more testing, I've found that the tool call works 100% of the time if --rope-scaling none is passed to llama-cli. [...]

@apresence hi, this is interesting. How do you test? on a dataset?

@apresence
Author

apresence commented Jul 18, 2024

Thank you for taking the time to address this topic.

You are right, llama-cli does show the tokens when the --special flag is used. However, I discovered the issue originally with the /completion endpoint of llama-server; I just happened to use llama-cli to demonstrate it because it was easy to provide output that others could follow and verify on their own. As an interesting note, unlike the HF generate() function, I don't see a way to hide/unhide special tokens for llama-server's /completion endpoint, either as a command-line option (I show below that --special is ignored there) or as a JSON argument in the API call itself. The only way I'm aware of to change the behavior is to modify the GGUF metadata. That is exactly what I did, and it's the reason I posted models with those changes applied.

Let's remove llama-cli from the equation. To that end, I've written and used a little test program to call the /completion endpoint and demonstrate the issue. Below are clips of the output for different scenarios. I can provide the script and command line parameters upon request.
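The gist of the script is something like this (a simplified sketch of the harness, not the exact code; it drives llama-server's /completion endpoint and matches the reply against the expected-response patterns shown in the logs below):

import re
import requests

SERVER = "http://127.0.0.1:52756"  # port matches the llama-server args in the logs

def complete(prompt: str) -> str:
    """Send a raw prompt to /completion and return the generated text."""
    r = requests.post(f"{SERVER}/completion", json={"prompt": prompt, "n_predict": 512})
    return r.json()["content"]

def check(prompt: str, expected_pattern: str) -> bool:
    out = complete(prompt)
    print("[ >>> ]", repr(out))
    ok = re.search(expected_pattern, out) is not None
    print("[ SYS ] Test Result:", "PASS" if ok else "FAIL")
    return ok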

For the record, this is the version of llama-server I used for these tests:

$ ./llama-server --version
version: 3368 (dd07a123)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Without tool call fix

The tool call tokens are never included regardless of the --special option.

With --special

[ SYS ] === TEST MODEL: internlm.internlm2_5-7b-chat-q4_k_m.gguf ===
[ SYS ] Args: ['./llama-server', '--model', 'internlm.internlm2_5-7b-chat-q4_k_m.gguf', '--host', '127.0.0.1', '--port', '52756', '--gpu_layers', '32', '--split_mode', 'none', '--special']
[ <<< ] '<|im_start|>system\nYou are InternLM2-Chat, a harmless AI assistant<|im_end|>\n<|im_start|>system name=<|plugin|>\n[\n{\n"name": "get_current_weather",\n"description": "Get the current weather in a given location",\n"parameters": {\n"type": "object",\n"properties": {\n"location": {\n"type": "string",\n"description": "The city and state, e.g. San Francisco, CA",\n},\n"unit": {"type": "string"},\n},\n"required": ["location"],\n},\n}\n]\n<|im_end|>\n<|im_start|>user\nI want to know today\'s weather in Shanghai<|im_end|>\n<|im_start|>assistant\n'
[ SYS ] Expected response pattern: '^.*<\\|action_start\\|><|plugin|>\\n{"name": "get_current_weather", "parameters": {"location": "Shanghai"}}<\\|action_end\\|><\\|im_end\\|>$'
[ >>> ] 'I need to use the get_current_weather function to get the current weather in Shanghai.\n{"name": "get_current_weather", "parameters": {"location": "Shanghai"}}\n'
[ SYS ] Overall result for 'internlm.internlm2_5-7b-chat-q4_k_m.gguf': FAIL
[ SYS ] Reason for result: Response does not match expected pattern

Without --special

[ SYS ] === TEST MODEL: internlm.internlm2_5-7b-chat-q4_k_m.gguf ===
[ SYS ] Args: ['./llama-server', '--model', 'internlm.internlm2_5-7b-chat-q4_k_m.gguf', '--host', '127.0.0.1', '--port', '52756', '--gpu_layers', '32', '--split_mode', 'none']
[ >>> ] '<|im_start|>system\nYou are InternLM2-Chat, a harmless AI assistant<|im_end|>\n<|im_start|>system name=<|plugin|>\n[\n{\n"name": "get_current_weather",\n"description": "Get the current weather in a given location",\n"parameters": {\n"type": "object",\n"properties": {\n"location": {\n"type": "string",\n"description": "The city and state, e.g. San Francisco, CA",\n},\n"unit": {"type": "string"},\n},\n"required": ["location"],\n},\n}\n]\n<|im_end|>\n<|im_start|>user\nI want to know today\'s weather in Shanghai<|im_end|>\n<|im_start|>assistant\n'
[ SYS ] Expected response pattern: '^.*<\\|action_start\\|><|plugin|>\\n{"name": "get_current_weather", "parameters": {"location": "Shanghai"}}<\\|action_end\\|><\\|im_end\\|>$'
[ <<< ] 'To fulfill your request, I need to use the \\"get_current_weather\\" function and provide the location parameter as \\"Shanghai\\". I will also specify the unit of measurement as \\"metric\\" to ensure accuracy.\n{"name": "get_current_weather", "parameters": {"location": "Shanghai", "unit": "metric"}}\n'
[ SYS ] Overall result for 'internlm.internlm2_5-7b-chat-q4_k_m.gguf': FAIL
[ SYS ] Reason for result: Response does not match expected pattern

With tool call fix

The tool call tokens are always included regardless of the --special option.

With --special

[ SYS ] === TEST MODEL: apresence.internlm2_5-7b-chat-Q4_K_M.gguf ===
[ SYS ] Args: ['./llama-server', '--model', 'apresence.internlm2_5-7b-chat-Q4_K_M.gguf', '--host', '127.0.0.1', '--port', '52756', '--gpu_layers', '32', '--split_mode', 'none', '--special']
[ <<< ] '<|im_start|>system\nYou are InternLM2-Chat, a harmless AI assistant<|im_end|>\n<|im_start|>system name=<|plugin|>\n[\n{\n"name": "get_current_weather",\n"description": "Get the current weather in a given location",\n"parameters": {\n"type": "object",\n"properties": {\n"location": {\n"type": "string",\n"description": "The city and state, e.g. San Francisco, CA",\n},\n"unit": {"type": "string"},\n},\n"required": ["location"],\n},\n}\n]\n<|im_end|>\n<|im_start|>user\nI want to know today\'s weather in Shanghai<|im_end|>\n<|im_start|>assistant\n'
[ SYS ] Expected response pattern: '^.*<\\|action_start\\|><|plugin|>\\n{"name": "get_current_weather", "parameters": {"location": "Shanghai"}}<\\|action_end\\|><\\|im_end\\|>$'
[ >>> ] 'I need to use the get_current_weather function to get the current weather in Shanghai.<|action_start|><|plugin|>\n{"name": "get_current_weather", "parameters": {"location": "Shanghai", "unit": "metric"}}<|action_end|>\n'
[ SYS ] Test Result: PASS
[ <<< ] '<|im_start|>environment name=<|plugin|>\n{"temperature": 22}<|im_end|>\n<|im_start|>assistant\n'
[ SYS ] Expected response pattern: '^.*\\b22\\b.*$'
[ >>> ] 'The temperature is currently at 22 degrees Celsius.'
[ SYS ] Test Result: PASS
[ SYS ] Overall result for 'apresence.internlm2_5-7b-chat-Q4_K_M.gguf': PASS

Without --special

[ SYS ] === TEST MODEL: apresence.internlm2_5-7b-chat-Q4_K_M.gguf ===
[ SYS ] Args: ['./llama-server', '--model', 'apresence.internlm2_5-7b-chat-Q4_K_M.gguf', '--host', '127.0.0.1', '--port', '52756', '--gpu_layers', '32', '--split_mode', 'none']
[ >>> ] '<|im_start|>system\nYou are InternLM2-Chat, a harmless AI assistant<|im_end|>\n<|im_start|>system name=<|plugin|>\n[\n{\n"name": "get_current_weather",\n"description": "Get the current weather in a given location",\n"parameters": {\n"type": "object",\n"properties": {\n"location": {\n"type": "string",\n"description": "The city and state, e.g. San Francisco, CA",\n},\n"unit": {"type": "string"},\n},\n"required": ["location"],\n},\n}\n]\n<|im_end|>\n<|im_start|>user\nI want to know today\'s weather in Shanghai<|im_end|>\n<|im_start|>assistant\n'
[ SYS ] Expected response pattern: '^.*<\\|action_start\\|><|plugin|>\\n{"name": "get_current_weather", "parameters": {"location": "Shanghai"}}<\\|action_end\\|><\\|im_end\\|>$'
[ <<< ] 'I need to use the get_current_weather function to get the current weather in Shanghai.<|action_start|><|plugin|>\n{"name": "get_current_weather", "parameters": {"location": "Shanghai"}}<|action_end|>\n'
[ SYS ] Test Result: PASS
[ >>> ] '<|im_start|>environment name=<|plugin|>\n{"temperature": 22}<|im_end|>\n<|im_start|>assistant\n'
[ SYS ] Expected response pattern: '^.*\\b22\\b.*$'
[ >>> ] "It seems you're interested in the temperature, which is currently at 22 degrees Celsius. How can I assist you further today? Is there a specific task or information you need?"
[ SYS ] Test Result: PASS
[ SYS ] Overall result for 'apresence.internlm2_5-7b-chat-Q4_K_M.gguf': PASS

@RunningLeon
Contributor

@apresence hi, llama-server's handling of --special can be fixed as described in this comment: #8506 (comment)
