Insights: InternLM/lmdeploy
Overview
1 Release published by 1 person
- v0.5.0: LMDeploy Release V0.5.0 (published Jul 1, 2024)
89 Pull requests merged by 12 people
- Fix internvl2-40b awq inference (#2023, merged Jul 15, 2024)
- Avoid the same session id for openai endpoint (#1995, merged Jul 15, 2024)
- add chat template for codegeex4 (#2013, merged Jul 15, 2024)
- support internlm-xcomposer2d5-7b (#1932, merged Jul 15, 2024)
- Add exception handler to image encoder (#2010, merged Jul 13, 2024)
- docs: fix Ada compatibility (#2016, merged Jul 13, 2024)
- Support glm 4v (#1947, merged Jul 12, 2024)
- feat: support llama2 and internlm2 on 910B (#2011, merged Jul 12, 2024)
- Fix logprobs openai api (#1985, merged Jul 12, 2024)
- Fix the session_len assignment logic (#2007, merged Jul 12, 2024)
- Fix table rendering for readthedocs (#1998, merged Jul 12, 2024)
- docs: sync the core features in README to index.rst (#1988, merged Jul 11, 2024)
- fix mixtral and mistral cache_position (#1941, merged Jul 11, 2024)
- support internvl2-1b (#1983, merged Jul 11, 2024)
- fix unexpected argument error when deploying "cogvlm-chat-hf" (#1982, merged Jul 10, 2024)
- fix logprobs (#1968, merged Jul 10, 2024)
- refactor sampling layer setup (#1912, merged Jul 10, 2024)
- Fix internvl2-40b model export (#1979, merged Jul 10, 2024)
- docs: update kv quant doc (#1977, merged Jul 10, 2024)
- feat: support llama2 and internlm2 on 910B (#1889, merged Jul 9, 2024)
- fix: set PYTHONIOENCODING to UTF-8 before starting tritonserver (#1971, merged Jul 9, 2024)
- [ci] add internlm2.5 models into testcase (#1928, merged Jul 9, 2024)
- Upgrade gradio (#1930, merged Jul 9, 2024)
- Add tools to api_server for InternLM2 model (#1763, merged Jul 9, 2024)
- fix transformers version check for InternVL2 (#1952, merged Jul 9, 2024)
- fix llama3 chat template (#1956, merged Jul 9, 2024)
- feat: add gpu topo for check_env (#1944, merged Jul 8, 2024)
- refactor: update awq linear and rm legacy (#1940, merged Jul 8, 2024)
- docs: update compatibility section in README (#1946, merged Jul 8, 2024)
- support gemma2 in pytorch engine (#1924, merged Jul 5, 2024)
- fix: append _stats when size > 0 (#1809, merged Jul 5, 2024)
- misc: add transformers version check for TurboMind Tokenizer (#1917, merged Jul 5, 2024)
- Support internvl2 chat template (#1911, merged Jul 5, 2024)
- misc: add default api_server_url for api_client (#1922, merged Jul 5, 2024)
- vision model use tp number of gpu (#1854, merged Jul 5, 2024)
- Fix smem size for fused split-kv reduction (#1909, merged Jul 4, 2024)
- Remove deprecated chat cli and vl examples (#1899, merged Jul 4, 2024)
- [Doc]: Change to sphinx-book-theme in readthedocs (#1880, merged Jul 4, 2024)
- Optimize sampling on pytorch engine (#1853, merged Jul 3, 2024)
- Support phi3-vision (#1845, merged Jul 2, 2024)
- Add usage in stream response (#1876, merged Jul 2, 2024)
- docs: update faq for turbomind so not found (#1877, merged Jul 2, 2024)
- fix SamplingDecodeTest and SamplingDecodeTest2 unittest failure (#1874, merged Jul 1, 2024)
- drop stop words (#1823, merged Jul 1, 2024)
- Fix internlm-xcomposer2-vl awq search scale (#1890, merged Jul 1, 2024)
- Fix error link reference (#1881, merged Jul 1, 2024)
- misc: rm unnecessary files (#1875, merged Jul 1, 2024)
- bump version to v0.5.0 (#1852, merged Jul 1, 2024)
- docs: update cache-max-entry-count help message (#1892, merged Jul 1, 2024)
- [Doc]: Update docs for internlm2.5 (#1887, merged Jul 1, 2024)
- fix qwen2 cache_position for PyTorch Engine when transformers>4.41.2 (#1886, merged Jul 1, 2024)
- fix gradio vl "stop_words" (#1873, merged Jun 27, 2024)
- fix model name matching for internvl (#1867, merged Jun 27, 2024)
- Fix vl session-len (#1860, merged Jun 26, 2024)
- [side-effect] bring back "--cap" argument in chat cli (#1859, merged Jun 26, 2024)
- react test evaluation config (#1861, merged Jun 26, 2024)
- misc: align PyTorch Engine temperature with TurboMind (#1850, merged Jun 26, 2024)
- remove chat template config in turbomind engine (#1161, merged Jun 25, 2024)
- Add interfaces to the pipeline to obtain logits and ppl (see the sketch after this list) (#1652, merged Jun 25, 2024)
- Support Qwen2-1.5b awq (#1793, merged Jun 24, 2024)
- Harden stream callback (#1838, merged Jun 24, 2024)
- fix image encoder request queue (#1837, merged Jun 24, 2024)
- Support internvl-chat for pytorch engine (#1797, merged Jun 24, 2024)
- Add model revision & download_dir to cli (#1814, merged Jun 24, 2024)
- compat internlm2 for pytorch engine (#1825, merged Jun 24, 2024)
- Torch deepseek v2 (#1621, merged Jun 24, 2024)
- Update engine.py to fix small typos (#1829, merged Jun 24, 2024)
- Detokenize with prompt token ids (#1753, merged Jun 22, 2024)
- fix qwen-vl-chat hang (#1824, merged Jun 21, 2024)
- AsyncEngine: create cancel task on exception (#1807, merged Jun 21, 2024)
- Fix Request completed log (#1821, merged Jun 21, 2024)
- Add GLM-4-9B-Chat (#1724, merged Jun 21, 2024)
- PyTorchEngine adapts to the latest internlm2 modeling (#1798, merged Jun 21, 2024)
- Device dispatcher (#1775, merged Jun 21, 2024)
- fix best_match_model (#1812, merged Jun 20, 2024)
- check driver mismatch (#1811, merged Jun 20, 2024)
- fix pr test for newest internlm2 model (#1806, merged Jun 20, 2024)
- feat: auto set awq model_format from hf (#1799, merged Jun 19, 2024)
- Optimize kernel launch for triton2.2.0 and triton2.3.0 (#1499, merged Jun 19, 2024)
- [Feature]: Support llava for pytorch engine (#1641, merged Jun 19, 2024)
- More accurate time logging for ImageEncoder and fix concurrent image processing corruption (#1765, merged Jun 18, 2024)
- [side-effect] fix weight_type caused by PR #1702 (#1795, merged Jun 18, 2024)
- fix: prevent numpy breakage (#1791, merged Jun 18, 2024)
- Refine AsyncEngine exception handler (#1789, merged Jun 18, 2024)
- skip inference for oversized inputs (#1769, merged Jun 18, 2024)
- Encode raw image file to base64 (#1773, merged Jun 17, 2024)
- fix falcon attention (#1761, merged Jun 17, 2024)
- support qwen2 1.5b (#1782, merged Jun 17, 2024)
- Add anomaly handler (#1780, merged Jun 17, 2024)
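PR #1652 above adds pipeline interfaces for obtaining logits and perplexity. A minimal sketch of how they might be exercised, assuming the pipeline object exposes a `tokenizer` attribute and a `get_ppl` method as the PR title suggests; the model path and prompt are placeholders:

```python
# Minimal sketch of the logits/ppl pipeline interfaces added in #1652.
# Assumptions: `pipeline` returns an engine exposing `tokenizer` and
# `get_ppl`; the model path and prompt below are placeholders.
from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2-chat-7b')

# Encode a prompt and ask the engine for its perplexity.
input_ids = pipe.tokenizer.encode('Shanghai is a city that')
print(pipe.get_ppl(input_ids))
```

Note that closed issue #1950 below reports AttributeError: 'AsyncEngine' object has no attribute 'get_ppl' on older builds, so this applies only to v0.5.0 and later.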
14 Pull requests opened by 7 people
- Add Jetson platform support (by docker) (#1820, opened Jun 21, 2024)
- Maybe a workaround for qwen2 quantization NaN error (#1844, opened Jun 25, 2024)
- feat: decouple input_ids and output_ids (#1855, opened Jun 25, 2024)
- Support guided decoding for pytorch backend (#1856, opened Jun 26, 2024)
- Fix index error when profiling token generation with `-ct 1` (#1898, opened Jul 2, 2024)
- PyTorch Engine AWQ support (#1913, opened Jul 3, 2024)
- Remove deprecated arguments from API and clarify model_name and chat_template_name (#1931, opened Jul 5, 2024)
- torch engine optimize prefill for long context (#1962, opened Jul 9, 2024)
- support min_p sampling & do_sample setting (#1966, opened Jul 9, 2024)
- Phi3 awq (#1984, opened Jul 10, 2024)
- Remove the triton inference server backend "turbomind_backend" (#1986, opened Jul 10, 2024)
- Support glm4 awq (#1993, opened Jul 11, 2024)
- Add log info for prefix cache (#2018, opened Jul 13, 2024)
- bump version to v0.5.1 (#2022, opened Jul 15, 2024)
91 Issues closed by 39 people
- [Bug] Which GPU types does lmdeploy support, and which are explicitly unsupported? (#2015, closed Jul 15, 2024)
- Q: Continuous Batching without Turbomind? (#2025, closed Jul 15, 2024)
- [Feature] Can you please do INT4 Quantization for InternVL2-26B and InternVL2-40B (#1955, closed Jul 15, 2024)
- [Bug] InternVL2-40B generates nonsense outputs (#1965, closed Jul 15, 2024)
- [Bug] AWQ-quantized InternVL2-40B outputs meaningless results (#2017, closed Jul 15, 2024)
- About internvl2 support (#1919, closed Jul 15, 2024)
- [Bug] KeyError: 'plora_glb_GN' after quantization of internlm/internlm-xcomposer2-4khd-7b to 4-bit (#2014, closed Jul 15, 2024)
- [Bug] InternVL2-40B is unreachable after quantized deployment (#2009, closed Jul 15, 2024)
- Unable to infer on multiple CPUs (#2008, closed Jul 15, 2024)
- AWQ quantized model produces garbled output during multi-GPU inference (#1996, closed Jul 15, 2024)
- Quantization of internlm/internlm-xcomposer2-4khd-7b to 4-bit? (#2012, closed Jul 12, 2024)
- [Bug] Deploying a fine-tuned internvl-chat-v1_5 model leads to unstoppable output (#2000, closed Jul 12, 2024)
- Qwen 2 72b Instruct tp 8 performance degradation (#1904, closed Jul 12, 2024)
- About the claim that LMDeploy delivers up to 1.8x higher request throughput than vLLM (#2005, closed Jul 12, 2024)
- [Bug] Turbomind Docker fails after high load (#1954, closed Jul 11, 2024)
- [Bug] Segmentation fault occurs and the machine running openEuler automatically reboots (#1905, closed Jul 11, 2024)
- The Internvl2 API does not return results, while inference via transformers works (#1959, closed Jul 11, 2024)
- [Bug] internvl-v1-5 served with lmdeploy serve always generates up to the maximum length (#1958, closed Jul 11, 2024)
- [Bug] The official documentation does not automatically update (#1975, closed Jul 11, 2024)
- [Feature] Add function calling (tools) capability to InternVL2 (#1987, closed Jul 11, 2024)
- [Feature] support function calling (#1800, closed Jul 10, 2024)
- ScaleLLM inspiration (#1510, closed Jul 10, 2024)
- [Bug] KeyError: 'Qwen2ForCausalLM' for InternVL2 1B (#1963, closed Jul 10, 2024)
- [Bug] TCP error (port already in use) when deploying with PytorchEngine (#1925, closed Jul 10, 2024)
- [Bug] internlm2-chat-1_8b with 4-bit KV quantization cannot find key_stats.pth (#1720, closed Jul 10, 2024)
- [Feature] support InternVL-2.0 (#1900, closed Jul 9, 2024)
- Will torch 2.3.0 and triton 2.3.0 be supported? (#1914, closed Jul 9, 2024)
- [Bug] Llama3 chat template is not consistent with the Huggingface implementation (#1945, closed Jul 9, 2024)
- AttributeError: 'AsyncEngine' object has no attribute 'get_ppl' (#1950, closed Jul 9, 2024)
- Why doesn't offline conversion (lmdeploy convert) support internlm2.5 and Qwen2? (#1960, closed Jul 9, 2024)
- [Bug] lmdeploy 0.5.0 outputs no logprobs during batch inference (#1973, closed Jul 9, 2024)
- [Feature] Can MiniCPMv2.5 be supported? (#1969, closed Jul 9, 2024)
- [Bug] lmdeploy errors on OpenAI-style prompt requests (#1939, closed Jul 9, 2024)
- [Feature] Is there any plan to support internvl2 inference? (#1953, closed Jul 9, 2024)
- [Bug] assistant always replies "" (#1937, closed Jul 6, 2024)
- [Bug] assistant always replies "" (#1934, closed Jul 6, 2024)
- [Bug] assistant always replies "" (#1936, closed Jul 6, 2024)
- [Feature] support Gemma 2 (#1878, closed Jul 5, 2024)
- [Bug] ValueError: Tokenizer class Qwen2Tokenizer does not exist or is not currently imported (#1903, closed Jul 5, 2024)
- [Feature] diff tool for troubleshooting (#1908, closed Jul 5, 2024)
- [Bug] internvl-chat-v-1-5 predict (#1918, closed Jul 4, 2024)
- Nightly Build for LMDeploy (#1828, closed Jul 3, 2024)
- [Bug] lmdeploy - ERROR - Truncate max_new_tokens to 221 (#1841, closed Jul 2, 2024)
- [Bug] Quantization works with default parameters, but after setting --search-scale True --batch-size 8 the quantized model cannot run inference (#1883, closed Jul 1, 2024)
- [Bug] Mini-InternVL1.5-4B does not successfully initialize (#1721, closed Jul 1, 2024)
- [Feature] update the range of torch versions (#1857, closed Jul 1, 2024)
- [Bug] qwen 2 issue when transformers>4.41.2 for PyTorch Engine (#1885, closed Jul 1, 2024)
- need gemma2 support (#1888, closed Jul 1, 2024)
- [Bug] xcomposer 4khd lora weight error in lmdeploy (#1747, closed Jun 30, 2024)
- [Feature] Function call (#1882, closed Jun 28, 2024)
- [Bug] InternVL 1.5's bottleneck is the ViT; any plan to support the ViT on the TurboMind backend with TP inference? (#1869, closed Jun 28, 2024)
- [Bug] hang when many requests (#1619, closed Jun 27, 2024)
- How to quantize deepseek-ai/deepseek-vl-7b-chat (#1865, closed Jun 27, 2024)
- [Feature] Do multimodal models support online serving? (#1762, closed Jun 27, 2024)
- [Bug] In stream mode, breaking out of the generator early may leave the server stuck (#1848, closed Jun 26, 2024)
- [Bug] Multi-GPU PyTorch deployment of internlm-xcomposer2-vl-7b fails with KeyError: 'parameter name can\'t contain "."' (#1834, closed Jun 26, 2024)
- Adapting LLaVA models built on a different LLM base (#1655, closed Jun 25, 2024)
- [Bug] Poor results on multi-image inference (#1843, closed Jun 25, 2024)
- [Feature] Please add support for Qwen2 (#1805, closed Jun 25, 2024)
- [Feature] Run lmdeploy inference on already-constructed inputs (#1760, closed Jun 25, 2024)
- About getting deterministic answers from a VLM such as InternVL-Chat-V1-5-AWQ (#1783, closed Jun 24, 2024)
- [Bug] A smooth_quant-quantized model cannot run inference when reloaded (#1822, closed Jun 24, 2024)
- [Feature] Support DeepSeek-V2 Model (#1556, closed Jun 24, 2024)
- [Bug] Space is incorrectly removed from start of generated text for `/v1/completion` endpoint (#1743, closed Jun 23, 2024)
- [Bug] Task was destroyed but it is pending! ImageEncoder._forward_loop() (#1818, closed Jun 22, 2024)
- int8 kv cache and Flash Attention cannot be used together (#1816, closed Jun 20, 2024)
- [Feature] lmdeploy chat <model_path> --chat-template {json} (see the sketch after this list) (#1519, closed Jun 20, 2024)
- [Feature] Implement COG-VLM2 (#1622, closed Jun 20, 2024)
- [Bug] When serving cogvlm2, concurrent requests interfere: later requests pick up images from earlier requests (#1730, closed Jun 20, 2024)
- [Bug] Key Error loading OpenGVLab/Mini-InternVL-Chat-4B-V1-5 (#1756, closed Jun 20, 2024)
- "Aborted (core dumped)" when running Qwen2-7B-Instruct [Bug] (#1792, closed Jun 20, 2024)
- [Feature] qwen2 model series (#1777, closed Jun 20, 2024)
- [Bug] Review of a conditional check (#1757, closed Jun 20, 2024)
- [Feature] Layer Wise Calibration and Quantization of Models (To quantize model on Low VRAM GPU) (#1625, closed Jun 20, 2024)
- [Bug] Questions about streaming concurrency (#1557, closed Jun 20, 2024)
- logger in `lmdeploy/serve/async_engine.py` is hard coded (#1503, closed Jun 20, 2024)
- [Feature] Quantization example for multimodal models (#1483, closed Jun 20, 2024)
- batch inference (#1689, closed Jun 20, 2024)
- [Bug] ModuleNotFoundError: No module named '_turbomind' loading llava Mistral 7B (#1699, closed Jun 20, 2024)
- [Bug] lmdeploy got nccl error (#1803, closed Jun 19, 2024)
- [Feature] lmdeploy can launch a gradio app from the command line; can that app expose UI customization to users? (#1710, closed Jun 17, 2024)
- Can deployment of the mini_internvl_2b_1.5 model be supported? (#1774, closed Jun 16, 2024)
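One recurring request above is a JSON-driven chat template for the chat CLI (#1519). A hedged sketch of writing such a template and passing it via --chat-template; the field names below are assumptions drawn from lmdeploy's customized-chat-template documentation as best remembered, and should be verified against the docs before use:

```python
# Hedged sketch for #1519: write a JSON chat template, then pass it to the
# chat CLI via --chat-template. The field names are assumptions based on
# lmdeploy's customized-chat-template docs; verify them before relying on
# this. The tokens shown follow the ChatML-style templates lmdeploy uses
# for internlm2 and are placeholders here.
import json

template = {
    'model_name': 'my_template',        # name under which the template registers
    'system': '<|im_start|>system\n',   # prefix before the system prompt
    'meta_instruction': 'You are a helpful assistant.',
    'eosys': '<|im_end|>\n',
    'user': '<|im_start|>user\n',
    'eoh': '<|im_end|>\n',
    'assistant': '<|im_start|>assistant\n',
    'eoa': '<|im_end|>',
    'stop_words': ['<|im_end|>'],
}
with open('chat_template.json', 'w') as f:
    json.dump(template, f, indent=2)

# Then, with a placeholder model path:
#   lmdeploy chat internlm/internlm2-chat-7b --chat-template chat_template.json
```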
82 Issues opened by 64 people
- [Bug] After quantizing a fine-tuned qwen-vl-chat model to INT4 with the provided method, GPU memory usage does not decrease (#2028, opened Jul 15, 2024)
- How does multimodal batch inference work? (#2027, opened Jul 15, 2024)
- Deploying a fine-tuned glm-4v-9b with lmdeploy raises an error (#2026, opened Jul 15, 2024)
- What is the expected chat template for phi-3-vl? (#2024, opened Jul 15, 2024)
- [Docs] Backed by A100 compute, season 3 of the InternLM large-model hands-on camp is fully upgraded; a fun level-based challenge mode awaits (#2021, opened Jul 15, 2024)
- [Bug] Two questions about LoRA in lmdeploy (#2020, opened Jul 14, 2024)
- [Benchmark] PyTorch Engine Mixtral 8x7B performance issue (#2019, opened Jul 13, 2024)
- The model returns empty responses after deployment with lmdeploy (#2006, opened Jul 12, 2024)
- [Bug] InternVL2-26B deployed on two V100s gives no response in multimodal conversations (#2004, opened Jul 12, 2024)
- [Feature] Flash Attention 3 (#2003, opened Jul 11, 2024)
- [Bug] The service freezes and stops responding after serving for a while (#2001, opened Jul 11, 2024)
- [Feature] Can inference for the multimodal model donut be supported? (#1999, opened Jul 11, 2024)
- How does lmdeploy serve handle concurrency? (#1997, opened Jul 11, 2024)
- [Bug] InternVL-Chat-V1-5-AWQ deployed with lmdeploy works with the OpenAI client on 0.4.2 but hangs without any response on 0.5.0; the newly added cogvlm2 behaves similarly (#1992, opened Jul 11, 2024)
- internlm2_5-7B-chat deployed with lmdeploy returns empty responses (#1991, opened Jul 11, 2024)
- Could not use my local internVL mini model for inference (#1990, opened Jul 10, 2024)
- [Feature] Do we support inference of GPTQ-quantized models? (#1989, opened Jul 10, 2024)
- [Bug] MiniCPMV inference is broken (#1981, opened Jul 10, 2024)
- [Feature] Any plan to support MInference? (#1980, opened Jul 10, 2024)
- How can glm4-9b be quantized and run? (#1976, opened Jul 9, 2024)
- [Benchmark] TurboMind benchmark with GLM-4-9B-Chat and Qwen2-72B-Instruct vs vLLM (#1974, opened Jul 9, 2024)
- [Feature] Does turbomind plan to support cogvlm2? (#1970, opened Jul 9, 2024)
- [Feature] Support for CogVLM2-Video-LLama3-Chat in TorchEngine (#1964, opened Jul 9, 2024)
- [Bug] How to pass multi-image, image-text data to internvl through the OpenAI-style interface (see the sketch after this list) (#1961, opened Jul 9, 2024)
- [Bug] The response should contain 's, but only ' appears and the s is never output (#1951, opened Jul 8, 2024)
- How is multimodal batch inference implemented? (#1949, opened Jul 8, 2024)
- [Bug] Why is the value of logprobs None? (#1948, opened Jul 8, 2024)
- [Feature] Prefix cache hit/miss/eviction statistics to detect cache thrashing (#1942, opened Jul 7, 2024)
- [Bug] The same code works on an A800 but gets stuck on an A10 with MiniCPM-Llama3-V-2_5 (#1938, opened Jul 6, 2024)
- [Bug] unified_attention split kv for prefill with more workspace coredump (#1935, opened Jul 6, 2024)
- Obtaining logits (#1933, opened Jul 5, 2024)
- [Feature] Can embedding models be supported, similar to xinference? (#1927, opened Jul 5, 2024)
- [Bug] AWQ-quantized models cannot be deployed on multiple GPUs with lmdeploy (#1923, opened Jul 4, 2024)
- [Feature] Is there any plan to support InternLM-XComposer2.5 inference? (#1920, opened Jul 4, 2024)
- Can the glm-4v-9b model be supported? (#1916, opened Jul 4, 2024)
- [Bug] AWQ Model Fails Loading Adapter (#1915, opened Jul 3, 2024)
- [Bug] qwen2-0.5b-instruct (#1910, opened Jul 3, 2024)
- W4A16 quantization of minicpm-v brings no noticeable change in inference speed (#1906, opened Jul 3, 2024)
- When will quantization of CogVLM2 be supported? (#1902, opened Jul 3, 2024)
- Batched multi-turn conversations take abnormally long (#1901, opened Jul 3, 2024)
- [Bug] Using the turbomind engine, prompting more than 10k tokens will result in garbage output (#1896, opened Jul 2, 2024)
- [Bug] CUDA runtime error: an illegal memory access was encountered when 8bit kv quant was enabled (#1895, opened Jul 1, 2024)
- [Bug] (#1894, opened Jul 1, 2024)
- The parameter n in GenerationConfig has no effect (#1893, opened Jul 1, 2024)
- Can single-sample inference be done without stream_infer? (#1891, opened Jul 1, 2024)
- [Feature] Great work on KV cache: Mooncake (#1884, opened Jun 28, 2024)
- [Feature] long context inference optimization (#1879, opened Jun 27, 2024)
- [Docs] Speed comparison between the TurboMind and PyTorch inference engines (#1872, opened Jun 27, 2024)
- [Bug] Is acceleration not supported for qwen0.5b? And AWQ quantization for qwen0.5b? (#1870, opened Jun 27, 2024)
- [Bug] AttributeError: 'LlavaNextConfig' object has no attribute 'hidden_size' (#1868, opened Jun 27, 2024)
- [Bug] The internvl model answers questions about image content incorrectly (#1866, opened Jun 27, 2024)
- Loading Qwen1.5-32B-Chat via pipeline with tp=4 and prompting in OpenAI format to clean Chinese text, yet all replies are generated in English (#1864, opened Jun 26, 2024)
- How to extract the reply text from a response in OpenAI format? The returned response appears to be segmented (#1863, opened Jun 26, 2024)
- [Bug] How is single-turn interleaved image-text conversation implemented? (#1862, opened Jun 26, 2024)
- [Bug] Segmentation fault: address not mapped to object at address 0x2058 (#1849, opened Jun 25, 2024)
- [Bug] InternLM2MLP.forward() missing 1 required positional argument: 'im_mask' (#1847, opened Jun 25, 2024)
- How to set the model data type to f16 (#1846, opened Jun 25, 2024)
- [Docs] How should the api_server for multimodal models be deployed across multiple GPUs? (#1840, opened Jun 24, 2024)
- [Feature] How to support bf16 when inferencing Internvl-chat (#1839, opened Jun 24, 2024)
- [Bug] AWQ quantization of a fine-tuned qwen2 model raises an error (#1836, opened Jun 24, 2024)
- Error when using TurboMind inference integrated via Python code (#1835, opened Jun 24, 2024)
- [Bug] smoothquant fails to quantize the Baichuan2-7B-Chat model (#1831, opened Jun 23, 2024)
- Qwen-7B-Chat quantization fails with AttributeError: 'RMSNorm' object has no attribute 'variance_epsilon' (#1830, opened Jun 23, 2024)
- Model name id returned is weird, especially when using Docker [Bug] (#1827, opened Jun 21, 2024)
- [Bug] awq for Qwen2-72B-instruct (#1826, opened Jun 21, 2024)
- [Bug] After launching MiniCPM-llama3-V2_5, there is no reply when an image is sent by URL or base64 (#1819, opened Jun 21, 2024)
- [Feature] Option to also use host memory for the KV cache (#1817, opened Jun 21, 2024)
- [Bug] internlm2-chat-20b deployed with lmdeploy does not stop at <|im_end|> (#1815, opened Jun 20, 2024)
- [Bug] vl pipeline triggers cudaMemcpyAsync ERROR illegal memory access (#1813, opened Jun 20, 2024)
- [Bug] Converting qwen2-7b to AWQ fails after SFT on domain data (#1810, opened Jun 20, 2024)
- Is glm-4-9b supported? (#1808, opened Jun 19, 2024)
- [Bug] No way to specify a model revision? (#1804, opened Jun 19, 2024)
- [Bug] n_token = outputs.num_token raises AttributeError: 'tuple' object has no attribute 'num_token' (#1802, opened Jun 19, 2024)
- [Feature] Prefill/Decoding disaggregation substantially boosts throughput (#1801, opened Jun 19, 2024)
- [Bug] OOM when quantizing Llama-3-70B-Instruct (#1796, opened Jun 18, 2024)
- [Bug] KeyError: 'Phi3ForCausalLM' (#1794, opened Jun 18, 2024)
- [Feature] Inference speed benchmark for the multimodal api_server (#1790, opened Jun 17, 2024)
- Is the OpenAI parameter n supported? Setting n>1 still returns only one result (#1787, opened Jun 16, 2024)
- [Bug] Qwen/Qwen2-72B-Instruct AWQ Quantization NaN Error (#1786, opened Jun 16, 2024)
- [Docs] Is the throughput improvement mainly due to the rewritten GQA kernel? (#1785, opened Jun 16, 2024)
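Several opened issues ask how to send images through the OpenAI-style interface (#1961, #1819). A minimal sketch using the standard OpenAI vision message format against an lmdeploy api_server, assuming a server running locally on the default port 23333; the model name and image path are placeholders:

```python
# Minimal sketch for the "images over the OpenAI-style interface" questions
# (#1961, #1819). Assumes an lmdeploy api_server serving a VLM on the
# default port 23333; the model name and image path are placeholders.
import base64

from openai import OpenAI

# Encode a local image as a base64 data URL, as the OpenAI vision format expects.
with open('demo.jpg', 'rb') as f:
    image_b64 = base64.b64encode(f.read()).decode('utf-8')

client = OpenAI(base_url='http://localhost:23333/v1', api_key='none')
response = client.chat.completions.create(
    model='internvl-internlm2',  # placeholder; query /v1/models for the served id
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Describe this image.'},
            {'type': 'image_url',
             'image_url': {'url': f'data:image/jpeg;base64,{image_b64}'}},
        ],
    }],
)
print(response.choices[0].message.content)
```

For the multi-image case raised in #1961, additional image_url entries go in the same content list, per the OpenAI message format.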
22 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- [Bug] Why does the pipeline output only a single token? (#1766, commented on Jun 18, 2024 • 0 new comments)
- Can the GPTQ and AWQ inference kernels be used interchangeably? (#1623, commented on Jun 18, 2024 • 0 new comments)
- Error When loading 'openbmb/MiniCPM-Llama3-V-2_5' (#1771, commented on Jun 19, 2024 • 0 new comments)
- The multimodal base64 interface produces differing results (#1779, commented on Jun 20, 2024 • 0 new comments)
- Same prompt and sampling parameters, but outputs differ (#975, commented on Jun 21, 2024 • 0 new comments)
- [Feature] Support for the microsoft/Phi-3-vision-128k-instruct Vision Model (#1637, commented on Jun 25, 2024 • 0 new comments)
- [Feature] Grammar/structured output support (#1614, commented on Jun 25, 2024 • 0 new comments)
- [Feature] Support W4A8KV4 Quantization (QServe/QoQ) (#1587, commented on Jun 27, 2024 • 0 new comments)
- [Docs] How are multiple images handled? (#1686, commented on Jun 28, 2024 • 0 new comments)
- [Feature] Quantized inference on V100 (#1711, commented on Jun 28, 2024 • 0 new comments)
- [Feature] Any plan to support the GLM4V model? (#1713, commented on Jul 1, 2024 • 0 new comments)
- [Feature] support Nemotron-4 340B (#1784, commented on Jul 3, 2024 • 0 new comments)
- AWQ small batches optimization (#1707, commented on Jul 3, 2024 • 0 new comments)
- [Bug] lmdeploy chat model_name aborts with "Aborted (core dumped)" during conversation (#1706, commented on Jul 4, 2024 • 0 new comments)
- [Bug] tp=4 tp=8 no response (#1755, commented on Jul 8, 2024 • 0 new comments)
- [Feature] Does turbomind plan to support sliding window? (#1327, commented on Jul 8, 2024 • 0 new comments)
- [Bug] output differs when temperature is set to zero (#1688, commented on Jul 10, 2024 • 0 new comments)
- [Feature] Speculative Decoding (#1738, commented on Jul 11, 2024 • 0 new comments)
- [Docs] Guidance on setting `num_tokens_per_iter` and `max_prefill_iters` to optimal values (see the sketch below) (#1740, commented on Jul 12, 2024 • 0 new comments)
- [Benchmark] benchmarks on different CUDA architectures with models of various sizes (#815, commented on Jul 13, 2024 • 0 new comments)
- [Bug] KV Cache INT8 calibration warning: Token indices sequence length is longer than the specified maximum sequence length for this model (2874305 > 4096) (#1033, commented on Jul 15, 2024 • 0 new comments)
- support vl benchmark (#1662, commented on Jun 19, 2024 • 0 new comments)
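The #1740 conversation asks how to tune `num_tokens_per_iter` and `max_prefill_iters`. A hedged sketch of where those knobs live, assuming both are TurbomindEngineConfig fields as the docs discussion implies; the values are illustrative, not recommendations:

```python
# Hedged sketch for #1740: passing num_tokens_per_iter / max_prefill_iters
# through the engine config. Assumes both are TurbomindEngineConfig fields;
# the values and model path are illustrative placeholders.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    num_tokens_per_iter=256,  # tokens processed per forward iteration
    max_prefill_iters=4,      # cap on iterations spent prefilling one request
)
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=engine_config)
print(pipe('Hello'))
```

Smaller num_tokens_per_iter generally trades prefill throughput for smoother decoding latency under load; #1740 asks for concrete guidance on picking these values.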