diff --git a/README.md b/README.md index ffa7bacce..3992d901b 100644 --- a/README.md +++ b/README.md @@ -36,7 +36,7 @@ - [WeChat Group](#-Wechat-Group) ## 📝 Introduction -SWIFT supports training, inference, evaluation and deployment of **300+ LLMs and 40+ MLLMs** (multimodal large models). Developers can directly apply our framework to their own research and production environments to realize the complete workflow from model training and evaluation to application. In addition to supporting the lightweight training solutions provided by [PEFT](https://github.com/huggingface/peft), we also provide a complete **Adapters library** to support the latest training techniques such as NEFTune, LoRA+, LLaMA-PRO, etc. This adapter library can be used directly in your own custom workflow without our training scripts. +SWIFT supports training, inference, evaluation and deployment of **300+ LLMs and 50+ MLLMs** (multimodal large models). Developers can directly apply our framework to their own research and production environments to realize the complete workflow from model training and evaluation to application. In addition to supporting the lightweight training solutions provided by [PEFT](https://github.com/huggingface/peft), we also provide a complete **Adapters library** to support the latest training techniques such as NEFTune, LoRA+, LLaMA-PRO, etc. This adapter library can be used directly in your own custom workflow without our training scripts. To facilitate use by users unfamiliar with deep learning, we provide a Gradio web-ui for controlling training and inference, as well as accompanying deep learning courses and best practices for beginners. @@ -47,6 +47,10 @@ SWIFT has rich documentations for users, please check [here](https://github.com/ SWIFT web-ui is available both on [Huggingface space](https://huggingface.co/spaces/tastelikefeet/swift) and [ModelScope studio](https://www.modelscope.cn/studios/iic/Scalable-lightWeight-Infrastructure-for-Fine-Tuning/summary), please feel free to try! ## 🎉 News +- 2024.07.08: Support internlm-xcomposer2_5-7b-chat. You can check the best practice [here](docs/source_en/Multi-Modal/internlm-xcomposer2-best-practice.md). +- 2024.07.06: Support for the llava-next-video series models: llava-next-video-7b-instruct, llava-next-video-7b-32k-instruct, llava-next-video-7b-dpo-instruct, llava-next-video-34b-instruct. You can refer to [llava-video best practice](docs/source_en/Multi-Modal/llava-video-best-practice.md) for more information. +- 2024.07.06: Support internvl2 series: internvl2-2b, internvl2-4b, internvl2-8b, internvl2-26b. +- 2024.07.06: Support codegeex4-9b-chat. - 2024.07.04: Support internlm2_5-7b series: internlm2_5-7b, internlm2_5-7b-chat, internlm2_5-7b-chat-1m. - 2024.07.02: Support for using vLLM for accelerating inference and deployment of multimodal large models such as the llava series and phi3-vision models. You can refer to the [Multimodal & vLLM Inference Acceleration Documentation](docs/source_en/Multi-Modal/vllm-inference-acceleration.md) for more information. - 2024.07.02: Support for `llava1_6-vicuna-7b-instruct`, `llava1_6-vicuna-13b-instruct` and other llava-hf models. For best practices, refer to [here](docs/source_en/Multi-Modal/llava-best-practice.md). @@ -61,6 +65,8 @@ SWIFT web-ui is available both on [Huggingface space](https://huggingface.co/spa - 🔥2024.06.01: Supports **SimPO** training! See [document](https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/SimPO.md) to start training! 
- 🔥2024.06.01: Support for deploying large multimodal models, please refer to the [Multimodal Deployment Documentation](docs/source_en/Multi-Modal/mutlimodal-deployment.md) for more information. - 2024.05.31: Supports Mini-Internvl model. Use model_type `mini-internvl-chat-2b-v1_5` and `mini-internvl-chat-4b-v1_5` to train. +
More + - 2024.05.24: Supports Phi3-vision model, Use model_type `phi3-vision-128k-instruct` to train. - 2024.05.22: Supports DeepSeek-V2-Lite series models, model_type are `deepseek-v2-lite` and `deepseek-v2-lite-chat` - 2024.05.22: Supports TeleChat-12B-v2 model with quantized version, model_type are `telechat-12b-v2` and `telechat-12b-v2-gptq-int4` @@ -77,8 +83,6 @@ SWIFT web-ui is available both on [Huggingface space](https://huggingface.co/spa - 2024.04.29: Supports inference and fine-tuning of InternVL-Chat-V1.5 model. For best practice, you can refer to [here](https://github.com/modelscope/swift/tree/main/docs/source_en/Multi-Modal/internvl-best-practice.md). - 🔥2024.04.26: Support **LISA** and **unsloth** training! Specify `--lisa_activated_layers=2` to use LISA(to reduce the memory cost to 30 percent!), specify `--tuner_backend unsloth` to use unsloth to train a huge model(full or lora) with lesser memory(30 percent or lesser) and faster speed(5x)! - 🔥2024.04.26: Support the fine-tuning and inference of Qwen1.5-110B and Qwen1.5-110B-Chat model, use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/qwen1half_110b_chat/lora_ddp_ds/sft.sh) to start training! -
More - - 2024.04.24: Support for inference and fine-tuning of Phi3 series models. Including: [phi3-4b-4k-instruct](examples/pytorch/llm/scripts/phi3_4b_4k_instruct/lora), phi3-4b-128k-instruct. - 2024.04.22: Support for inference, fine-tuning, and deployment of **chinese-llama-alpaca-2** series models. This includes:chinese-llama-2-1.3b, chinese-llama-2-7b, chinese-llama-2-13b, chinese-alpaca-2-1.3b, chinese-alpaca-2-7b and chinese-alpaca-2-13b along with their corresponding 16k and 64k long text versions. - 2024.04.22: Support for inference and fine-tuning of Llama3 GPTQ-Int4, GPTQ-Int8, and AWQ series models. Support for inference and fine-tuning of chatglm3-6b-128k, Openbuddy-Llama3. @@ -387,6 +391,7 @@ swift sft \ #### Multi-node Multi-GPU ```shell +# If multiple machines share a disk, please additionally specify `--save_on_each_node false`. # node0 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ NNODES=2 \ @@ -507,7 +512,7 @@ The complete list of supported models and datasets can be found at [Supported Mo | Model Type | Model Introduction | Language | Model Size | Model Type | |------------------------------------------------|------------------------------------------------------------------------|--------------------|----------------------------------------|------------------------------------------- | | Qwen
Qwen1.5<br>Qwen2 | [Tongyi Qwen 1.0 and 1.5 series models](https://github.com/QwenLM) | Chinese<br>English | 0.5B-110B<br>including quantized versions | base model<br>chat model<br>MoE model<br>code model |
-| ChatGLM2<br>ChatGLM3<br>Codegeex2<br>GLM4 | [Zhipu ChatGLM series models](https://github.com/THUDM) | Chinese<br>English | 6B-9B | base model<br>chat model<br>code model<br>long text model |
+| ChatGLM2<br>ChatGLM3<br>Codegeex2<br>GLM4<br>Codegeex4 | [Zhipu ChatGLM series models](https://github.com/THUDM) | Chinese<br>English | 6B-9B | base model<br>chat model<br>code model<br>long text model |
| Baichuan<br>Baichuan2 | [Baichuan 1 and Baichuan 2](https://github.com/baichuan-inc) | Chinese<br>English | 7B-13B<br>including quantized versions | base model<br>chat model |
| Yuan2 | [Langchao Yuan series models](https://github.com/IEIT-Yuan) | Chinese<br>English | 2B-102B | instruct model |
| XVerse | [XVerse series models](https://github.com/xverse-ai) | Chinese<br>English | 7B-65B | base model<br>chat model<br>long text model<br>MoE model |
@@ -550,14 +555,14 @@ The complete list of supported models and datasets can be found at [Supported Mo
| Qwen-VL | [Tongyi Qwen vision model](https://github.com/QwenLM) | Chinese<br>English | 7B<br>including quantized versions | base model<br>chat model |
| Qwen-Audio | [Tongyi Qwen speech model](https://github.com/QwenLM) | Chinese<br>English | 7B | base model<br>chat model |
| YI-VL | [01AI's YI series vision models](https://github.com/01-ai) | Chinese<br>English | 6B-34B | chat model |
-| XComposer2 | [Pujiang AI Lab InternLM vision model](https://github.com/InternLM/InternLM) | Chinese<br>English | 7B | chat model |
+| XComposer2<br>XComposer2.5 | [Pujiang AI Lab InternLM vision model](https://github.com/InternLM/InternLM-XComposer) | Chinese<br>English | 7B | chat model |
| DeepSeek-VL | [DeepSeek series vision models](https://github.com/deepseek-ai) | Chinese<br>English | 1.3B-7B | chat model |
| MiniCPM-V<br>MiniCPM-V-2<br>MiniCPM-V-2_5 | [OpenBmB MiniCPM vision model](https://github.com/OpenBMB/MiniCPM) | Chinese<br>English | 3B-9B | chat model |
| CogVLM<br>CogVLM2<br>CogAgent<br>GLM4V | [Zhipu ChatGLM visual QA and Agent model](https://github.com/THUDM/) | Chinese<br>English | 9B-19B | chat model |
| Llava1.5<br>Llava1.6 | [Llava series models](https://github.com/haotian-liu/LLaVA) | English | 7B-34B | chat model |
-| Llava-Next | [Llava-Next series models](https://github.com/LLaVA-VL/LLaVA-NeXT) | Chinese<br>English | 8B-110B | chat model |
+| Llava-Next<br>Llava-Next-Video | [Llava-Next series models](https://github.com/LLaVA-VL/LLaVA-NeXT) | Chinese<br>English | 7B-110B | chat model |
| mPLUG-Owl | [mPLUG-Owl series models](https://github.com/X-PLUG/mPLUG-Owl) | English | 11B | chat model |
-| InternVL | [InternVL](https://github.com/OpenGVLab/InternVL) | Chinese<br>English | 2B-25.5B<br>including quantized version | chat model |
+| InternVL<br>Mini-Internvl<br>Internvl2 | [InternVL](https://github.com/OpenGVLab/InternVL) | Chinese<br>English | 2B-25.5B
including quantized version | chat model | | Llava-llama3 | [xtuner](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) | English | 8B | chat model | | Phi3-Vision | Microsoft | English | 4B | chat model | | PaliGemma | Google | English | 3B | chat model | @@ -619,6 +624,19 @@ The complete list of supported models and datasets can be found at [Supported Mo | Computing cards A10/A100, etc. | Support BF16 and FlashAttn | | Huawei Ascend NPU | | +### Environment variables + +- DATASET_ENABLE_CACHE: Enable cache when preprocess dataset, you can use `1/True` or `0/False`, default `False` +- WEBUI_SHARE: Share your web-ui, you can use `1/True` or `0/False`, default `False` +- SWIFT_UI_LANG: web-ui language, you can use `en` or `zh`, default `zh` +- WEBUI_SERVER: web-ui host ip,`0.0.0.0` for all routes,`127.0.0.1` for local network only. Default `127.0.0.1` +- WEBUI_PORT: web-ui port +- USE_HF: Use huggingface endpoint or ModelScope endpoint to download models and datasets. you can use `1/True` or `0/False`, default `False` +- FORCE_REDOWNLOAD: Force to re-download the dataset + +Other variables like `CUDA_VISIBLE_DEVICES` are also supported, which are not listed here. + + ## 📃 Documentation ### Documentation Compiling diff --git a/README_CN.md b/README_CN.md index 3d11cf834..512e5af1b 100644 --- a/README_CN.md +++ b/README_CN.md @@ -37,7 +37,7 @@ - [微信用户群](#-微信用户群) ## 📝 简介 -SWIFT支持**300+ LLM和40+ MLLM**(多模态大模型)的训练、推理、评测和部署。开发者可以直接将我们的框架应用到自己的Research和生产环境中,实现模型训练评测到应用的完整链路。我们除支持了[PEFT](https://github.com/huggingface/peft)提供的轻量训练方案外,也提供了一个完整的**Adapters库**以支持最新的训练技术,如NEFTune、LoRA+、LLaMA-PRO等,这个适配器库可以脱离训练脚本直接使用在自己的自定流程中。 +SWIFT支持**300+ LLM和50+ MLLM**(多模态大模型)的训练、推理、评测和部署。开发者可以直接将我们的框架应用到自己的Research和生产环境中,实现模型训练评测到应用的完整链路。我们除支持了[PEFT](https://github.com/huggingface/peft)提供的轻量训练方案外,也提供了一个完整的**Adapters库**以支持最新的训练技术,如NEFTune、LoRA+、LLaMA-PRO等,这个适配器库可以脱离训练脚本直接使用在自己的自定流程中。 为方便不熟悉深度学习的用户使用,我们提供了一个Gradio的web-ui用于控制训练和推理,并提供了配套的深度学习课程和最佳实践供新手入门。 @@ -48,6 +48,10 @@ SWIFT具有丰富的文档体系,如有使用问题请请查看[这里](https: 可以在[Huggingface space](https://huggingface.co/spaces/tastelikefeet/swift) 和 [ModelScope创空间](https://www.modelscope.cn/studios/iic/Scalable-lightWeight-Infrastructure-for-Fine-Tuning/summary) 中体验SWIFT web-ui功能了。 ## 🎉 新闻 +- 2024.07.08: 支持internlm-xcomposer2_5-7b-chat. 最佳实践可以查看[这里](docs/source/Multi-Modal/internlm-xcomposer2最佳实践.md). +- 2024.07.06: 支持llava-next-video系列模型: llava-next-video-7b-instruct, llava-next-video-7b-32k-instruct, llava-next-video-7b-dpo-instruct, llava-next-video-34b-instruct. 可以查看[llava-video最佳实践](docs/source/Multi-Modal/llava-video最佳实践.md)了解更多. +- 2024.07.06: 支持internvl-2系列: internvl2-2b, internvl2-4b, internvl2-8b, internvl2-26b. +- 2024.07.06: 支持codegeex4-9b-chat. - 2024.07.04: 支持internlm2_5-7b系列: internlm2_5-7b, internlm2_5-7b-chat, internlm2_5-7b-chat-1m. - 2024.07.02: 支持使用vllm对多模态大模型: llava系列, phi3-vision模型进行推理加速和部署. 可以查看[多模态&vLLM推理加速文档](docs/source/Multi-Modal/vLLM推理加速文档.md)获取更多信息. - 2024.07.02: 支持`llava1_6-vicuna-7b-instruct`, `llava1_6-vicuna-13b-instruct`等llava-hf模型. 最佳实践可以查看[这里](docs/source/Multi-Modal/llava最佳实践.md). @@ -62,6 +66,8 @@ SWIFT具有丰富的文档体系,如有使用问题请请查看[这里](https: - 🔥2024.06.01: 支持**SimPO**训练,使用`swift simpo`来开始训练,最佳实践可以查看[这里](https://github.com/modelscope/swift/tree/main/docs/source/LLM/SimPO算法最佳实践.md) - 🔥2024.06.01: 支持多模态大模型部署, 可以查看[多模态部署文档](docs/source/Multi-Modal/MLLM部署文档.md). - 2024.05.31: 支持Mini-Internvl多模态模型, 使用model_type `mini-internvl-chat-2b-v1_5`和`mini-internvl-chat-4b-v1_5`来训练. +
更多 + - 2024.05.24: 支持Phi3多模态模型, 使用model_type `phi3-vision-128k-instruct`来训练. - 2024.05.22: 支持DeepSeek-V2-lite系列模型, model_type为 `deepseek-v2-lite`和`deekseek-v2-lite-chat` - 2024.05.22: 支持TeleChat-12b-v2模型和量化版本, model_type为 `telechat-12b-v2`和`telechat-12b-v2-gptq-int4` @@ -78,8 +84,6 @@ SWIFT具有丰富的文档体系,如有使用问题请请查看[这里](https: - 2024.04.29: 支持InternVL-Chat-V1.5的推理与微调, 最佳实践可以查看[这里](https://github.com/modelscope/swift/tree/main/docs/source/Multi-Modal/internvl最佳实践.md). - 🔥2024.04.26: 支持**LISA** 和 **unsloth**训练!指定 `--lisa_activated_layers=2` 来开启LISA(显存使用降低至全参训练的30%),指定 `--tuner_backend unsloth` 来使用unsloth,用更少的显存(30%或更少)更快的速度(5x)训练一个超大模型! - 🔥2024.04.26: 支持Qwen1.5-110B和Qwen1.5-110B-Chat模型的推理与微调, 使用[这个脚本](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/qwen1half_110b_chat/lora_ddp_ds/sft.sh)来开始训练! -
更多 - - 2024.04.24: 支持Phi3系列模型的推理与微调. 包括: [phi3-4b-4k-instruct](examples/pytorch/llm/scripts/phi3_4b_4k_instruct/lora), phi3-4b-128k-instruct. - 2024.04.22: 支持**chinese-llama-alpaca-2**系列模型的推理与微调和部署等. 包括:chinese-llama-2-1.3b, chinese-llama-2-7b, chinese-llama-2-13b, chinese-alpaca-2-1.3b, chinese-alpaca-2-7b和chinese-alpaca-2-13b以及对应的16k和64k长文本模型. - 2024.04.22: 支持Llama3 GPTQ-Int4, GPTQ-Int8, AWQ系列模型的推理与微调. 支持chatglm3-6b-128k, Openbuddy-llama3的推理与微调. @@ -384,6 +388,7 @@ swift sft \ #### 多机多卡 ```shell +# 如果多机共用磁盘请在各机器sh中额外指定`--save_on_each_node false`. # node0 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \ NNODES=2 \ @@ -503,7 +508,7 @@ CUDA_VISIBLE_DEVICES=0 swift deploy \ | 模型类型 | 模型介绍 | 语言 | 模型大小 | 模型类型 | | --------------------------------------------------- | ------------------------------------------------------------ |----------| ------------------------- |-------------------------------------------| | Qwen
Qwen1.5<br>Qwen2 | [通义千问1.0和1.5系列模型](https://github.com/QwenLM) | 中文<br>英文 | 0.5B-110B<br>包含量化版本 | base模型<br>chat模型<br>MoE模型<br>代码模型 | |
-| ChatGLM2<br>ChatGLM3<br>Codegeex2<br>GLM4 | [智谱ChatGLM系列模型](https://github.com/THUDM/) | 中文<br>英文 | 6B-9B | base模型<br>chat模型<br>代码模型<br>长文本模型 |
+| ChatGLM2<br>ChatGLM3<br>Codegeex2<br>GLM4<br>Codegeex4 | [智谱ChatGLM系列模型](https://github.com/THUDM/) | 中文<br>英文 | 6B-9B | base模型<br>chat模型<br>代码模型<br>长文本模型 |
| Baichuan<br>Baichuan2 | [百川1和百川2](https://github.com/baichuan-inc) | 中文<br>英文 | 7B-13B<br>包含量化版本 | base模型<br>chat模型 |
| Yuan2 | [浪潮源系列模型](https://github.com/IEIT-Yuan) | 中文<br>英文 | 2B-102B | instruct模型 |
| XVerse | [元象系列模型](https://github.com/xverse-ai) | 中文<br>英文 | 7B-65B | base模型<br>chat模型<br>长文本模型<br>MoE模型 | |
@@ -547,14 +552,14 @@ CUDA_VISIBLE_DEVICES=0 swift deploy \
| Qwen-VL | [通义千问视觉模型](https://github.com/QwenLM) | 中文<br>英文 | 7B<br>包含量化版本 | base模型<br>chat模型 |
| Qwen-Audio | [通义千问语音模型](https://github.com/QwenLM) | 中文<br>英文 | 7B | base模型<br>chat模型 |
| YI-VL | [01AI的YI系列视觉模型](https://github.com/01-ai) | 中文<br>英文 | 6B-34B | chat模型 |
-| XComposer2 | [浦江实验室书生浦语视觉模型](https://github.com/InternLM/InternLM) | 中文<br>英文 | 7B | chat模型 |
+| XComposer2<br>XComposer2.5 | [浦江实验室书生浦语视觉模型](https://github.com/InternLM/InternLM-XComposer) | 中文<br>英文 | 7B | chat模型 |
| DeepSeek-VL | [幻方系列视觉模型](https://github.com/deepseek-ai) | 中文<br>英文 | 1.3B-7B | chat模型 |
| MiniCPM-V<br>MiniCPM-V-2<br>MiniCPM-V-2_5 | [OpenBmB MiniCPM视觉模型](https://github.com/OpenBMB/MiniCPM) | 中文<br>英文 | 3B-9B | chat模型 |
| CogVLM<br>CogVLM2<br>CogAgent<br>GLM4V | [智谱ChatGLM视觉问答和Agent模型](https://github.com/THUDM/) | 中文<br>英文 | 9B-19B | chat模型 |
| Llava1.5<br>Llava1.6 | [Llava系列模型](https://github.com/haotian-liu/LLaVA) | 英文 | 7B-34B | chat模型 |
-| Llava-Next | [Llava-Next系列模型](https://github.com/LLaVA-VL/LLaVA-NeXT) | 中文<br>英文 | 8B-110B | chat模型 |
+| Llava-Next<br>Llava-Next-Video | [Llava-Next系列模型](https://github.com/LLaVA-VL/LLaVA-NeXT) | 中文<br>英文 | 7B-110B | chat模型 |
| mPLUG-Owl | [mPLUG-Owl系列模型](https://github.com/X-PLUG/mPLUG-Owl) | 英文 | 11B | chat模型 |
-| InternVL | [InternVL](https://github.com/OpenGVLab/InternVL) | 中文<br>英文 | 2B-25.5B<br>包含量化版本 | chat模型 |
+| InternVL<br>Mini-Internvl<br>Internvl2 | [InternVL](https://github.com/OpenGVLab/InternVL) | 中文<br>英文 | 2B-25.5B
包含量化版本 | chat模型 | | Llava-llama3 | [xtuner](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) | 英文 | 8B | chat model | | Phi3-Vision | 微软 | 英文 | 4B | chat model | | PaliGemma | Google | 英文 | 3B | chat model | @@ -616,6 +621,19 @@ CUDA_VISIBLE_DEVICES=0 swift deploy \ | 华为昇腾NPU | | +### 环境变量 + +- DATASET_ENABLE_CACHE:在预处理数据集时启用缓存,您可以使用`1/True`或`0/False`,默认值为`False` +- WEBUI_SHARE:共享web-ui,可以使用`1/True`或`0/False`,默认值为`False` +- SWIFT_UI_LANG:web-ui语言,您可以使用`en`或`zh`,默认值为`zh` +- WEBUI_SERVER:web-ui可访问的IP`0.0.0.0`表示所有路由,`127.0.0.1`仅用于本地网络。默认值为`127.0.0.1` +- WEBUI_PORT:web-ui端口 +- USE_HF:使用huggingface endpoint或ModelScope endpoint下载模型和数据集。您可以使用`1/True`或`0/False`,默认值为`False` +- FORCE_REDOWNLOAD:强制重新下载数据集 + +其他变量如`CUDA_VISIBLE_DEVICES`也支持,但未在此列出。 + + ## 📃文档 ### 文档编译 diff --git "a/docs/source/LLM/LLM\345\276\256\350\260\203\346\226\207\346\241\243.md" "b/docs/source/LLM/LLM\345\276\256\350\260\203\346\226\207\346\241\243.md" index 3c0bd46da..59cf81c9e 100644 --- "a/docs/source/LLM/LLM\345\276\256\350\260\203\346\226\207\346\241\243.md" +++ "b/docs/source/LLM/LLM\345\276\256\350\260\203\346\226\207\346\241\243.md" @@ -100,6 +100,7 @@ swift sft \ --output_dir output \ # 多机多卡 +# 如果多机共用磁盘请在各机器sh中额外指定`--save_on_each_node false`. # node0 CUDA_VISIBLE_DEVICES=0,1,2,3 \ NNODES=2 \ @@ -246,6 +247,7 @@ print(f'history: {history}') 使用**数据集**评估: ```bash +# 如果要推理所有数据集样本, 请额外指定`--show_dataset_sample -1` # 直接推理 CUDA_VISIBLE_DEVICES=0 swift infer \ --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' \ diff --git "a/docs/source/LLM/VLLM\346\216\250\347\220\206\345\212\240\351\200\237\344\270\216\351\203\250\347\275\262.md" "b/docs/source/LLM/VLLM\346\216\250\347\220\206\345\212\240\351\200\237\344\270\216\351\203\250\347\275\262.md" index 1d5424857..ecb613f63 100644 --- "a/docs/source/LLM/VLLM\346\216\250\347\220\206\345\212\240\351\200\237\344\270\216\351\203\250\347\275\262.md" +++ "b/docs/source/LLM/VLLM\346\216\250\347\220\206\345\212\240\351\200\237\344\270\216\351\203\250\347\275\262.md" @@ -208,6 +208,7 @@ CUDA_VISIBLE_DEVICES=0 swift export \ --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true # 使用数据集评估 +# 如果要推理所有数据集样本, 请额外指定`--show_dataset_sample -1` CUDA_VISIBLE_DEVICES=0 swift infer \ --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged' \ --infer_backend vllm \ diff --git "a/docs/source/LLM/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" "b/docs/source/LLM/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" index 33b847c2d..f82b7ea71 100644 --- "a/docs/source/LLM/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" +++ "b/docs/source/LLM/\345\221\275\344\273\244\350\241\214\345\217\202\346\225\260.md" @@ -30,7 +30,7 @@ - `--add_output_dir_suffix`: 默认为`True`, 表示会在`output_dir`的目录后拼接上`model_type`和微调版本号的后缀. 如果要避免此行为, 你可以设置为`False`. - `--ddp_backend`: 表示分布式的后端支持, 默认是`None`. 你可以选择的值包括: 'nccl', 'gloo', 'mpi', 'ccl'. - `--seed`: 全局的seed, 默认使用`42`. 用于复现训练效果. -- `--resume_from_checkpoint`: 用于断点续训, 默认为`None`. 你可以将其设置为checkpoint的路径, 例如: `'output/qwen-7b-chat/vx-xxx/checkpoint-xxx'`, 来进行断点续训. 支持调节`--resume_only_model`在断点续训时只读取模型文件. +- `--resume_from_checkpoint`: 用于断点续训, 默认为`None`. 你可以将其设置为checkpoint的路径, 例如: `--resume_from_checkpoint output/qwen-7b-chat/vx-xxx/checkpoint-xxx`, 来进行断点续训. 支持调节`--resume_only_model`在断点续训时只读取模型文件. - `--resume_only_model`: 默认为`False`, 即为严格的断点续训, 这会读取模型、优化器和lr_scheduler的权重和各个设备存储的随机种子, 并将从上次训练暂停的stpes后继续计数进行训练. 如果设置为`True`, 则只读取模型的权重. 
- `--dtype`: 基模型载入时的torch_dtype, 默认为`'AUTO'`, 即智能选择dtype: 如果机器不支持bf16, 则使用fp16, 如果`MODEL_MAPPING`中对应模型有指定torch_dtype, 则使用其对应dtype, 否则使用bf16. 你可以选择的值包括: 'bf16', 'fp16', 'fp32'. - `--dataset`: 用于选择训练的数据集, 默认为`[]`. 可以选择的数据集可以查看[支持的数据集](支持的模型和数据集.md#数据集). 如果需要使用多个数据集进行训练, 你可以使用','或者' '进行分割, 例如: `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`. 支持Modelscope Hub/HuggingFace Hub/本地路径、subsets选择与数据集采样, 每个数据集指定格式如下: `[HF or MS::]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`, 最简只需要指定dataset_name、dataset_id或者dataset_path即可. 自定义数据集可以查看[数据集的自定义与拓展文档](自定义与拓展.md#自定义数据集). @@ -87,6 +87,7 @@ - `--predict_with_generate`: 评估时是否使用生成式的方式, 默认为`False`. 如果设置为False, 则使用`loss`进行评估. 如果设置为True, 则使用`ROUGE-L`等指标进行评估. 使用生成式评估耗费的时间很长, 请谨慎选择. - `--lr_scheduler_type`: 默认值为`'cosine'`, 你可以选择: 'linear', 'cosine', 'constant'等. - `--warmup_ratio`: warmup占用总的训练steps的比例, 默认为`0.05`. +- `--warmup_steps`: warmup的步数, 默认为`0`. 如果设置`warmup_steps>0`, 则覆盖warmup_ratio. - `--eval_steps`: 每训练多少steps进行评估, 默认为`50`. - `--save_steps`: 每训练多少个steps进行保存, 默认为`None`, 即设置为`eval_steps`. - `--save_only_model`: 是否只保存模型参数, 而不存储断点续训所需的中间状态, 默认为`None`, 即如果`sft_type`为'lora'并且不使用deepspeed(`deepspeed`为`None`), 设置为False, 否则设置为True(e.g. 使用了全参数微调或者使用了deepspeed). diff --git "a/docs/source/LLM/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" "b/docs/source/LLM/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" index 05b320827..9ebf7437f 100644 --- "a/docs/source/LLM/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" +++ "b/docs/source/LLM/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" @@ -110,9 +110,10 @@ |chatglm3-6b-32k|[ZhipuAI/chatglm3-6b-32k](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-32k/summary)|query_key_value|chatglm3|✘|✔|transformers<4.42|-|[THUDM/chatglm3-6b-32k](https://huggingface.co/THUDM/chatglm3-6b-32k)| |chatglm3-6b-128k|[ZhipuAI/chatglm3-6b-128k](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-128k/summary)|query_key_value|chatglm3|✘|✔|transformers<4.42|-|[THUDM/chatglm3-6b-128k](https://huggingface.co/THUDM/chatglm3-6b-128k)| |codegeex2-6b|[ZhipuAI/codegeex2-6b](https://modelscope.cn/models/ZhipuAI/codegeex2-6b/summary)|query_key_value|chatglm-generation|✘|✔|transformers<4.34|coding|[THUDM/codegeex2-6b](https://huggingface.co/THUDM/codegeex2-6b)| -|glm4-9b|[ZhipuAI/glm-4-9b](https://modelscope.cn/models/ZhipuAI/glm-4-9b/summary)|query_key_value|chatglm-generation|✘|✔|transformers<4.42|-|[THUDM/glm-4-9b](https://huggingface.co/THUDM/glm-4-9b)| -|glm4-9b-chat|[ZhipuAI/glm-4-9b-chat](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat/summary)|query_key_value|chatglm3|✘|✔|transformers<4.42|-|[THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat)| -|glm4-9b-chat-1m|[ZhipuAI/glm-4-9b-chat-1m](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat-1m/summary)|query_key_value|chatglm3|✘|✔|transformers<4.42|-|[THUDM/glm-4-9b-chat-1m](https://huggingface.co/THUDM/glm-4-9b-chat-1m)| +|glm4-9b|[ZhipuAI/glm-4-9b](https://modelscope.cn/models/ZhipuAI/glm-4-9b/summary)|query_key_value|chatglm-generation|✔|✔|transformers<4.42|-|[THUDM/glm-4-9b](https://huggingface.co/THUDM/glm-4-9b)| 
+|glm4-9b-chat|[ZhipuAI/glm-4-9b-chat](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat/summary)|query_key_value|chatglm3|✔|✔|transformers<4.42|-|[THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat)| +|glm4-9b-chat-1m|[ZhipuAI/glm-4-9b-chat-1m](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat-1m/summary)|query_key_value|chatglm3|✔|✔|transformers<4.42|-|[THUDM/glm-4-9b-chat-1m](https://huggingface.co/THUDM/glm-4-9b-chat-1m)| +|codegeex4-9b-chat|[ZhipuAI/codegeex4-all-9b](https://modelscope.cn/models/ZhipuAI/codegeex4-all-9b/summary)|query_key_value|codegeex4|✔|✔|transformers<4.42|coding|[THUDM/codegeex4-all-9b](https://huggingface.co/THUDM/codegeex4-all-9b)| |llama2-7b|[modelscope/Llama-2-7b-ms](https://modelscope.cn/models/modelscope/Llama-2-7b-ms/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔||-|[meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)| |llama2-7b-chat|[modelscope/Llama-2-7b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-7b-chat-ms/summary)|q_proj, k_proj, v_proj|llama|✔|✔||-|[meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)| |llama2-13b|[modelscope/Llama-2-13b-ms](https://modelscope.cn/models/modelscope/Llama-2-13b-ms/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔||-|[meta-llama/Llama-2-13b-hf](https://huggingface.co/meta-llama/Llama-2-13b-hf)| @@ -339,14 +340,23 @@ |llama3-llava-next-8b|[AI-Modelscope/llama3-llava-next-8b](https://modelscope.cn/models/AI-Modelscope/llama3-llava-next-8b/summary)|q_proj, k_proj, v_proj|llama-llava-next|✔|✘||vision|[lmms-lab/llama3-llava-next-8b](https://huggingface.co/lmms-lab/llama3-llava-next-8b)| |llava-next-72b|[AI-Modelscope/llava-next-72b](https://modelscope.cn/models/AI-Modelscope/llava-next-72b/summary)|q_proj, k_proj, v_proj|llava-qwen-instruct|✔|✘||vision|[lmms-lab/llava-next-72b](https://huggingface.co/lmms-lab/llava-next-72b)| |llava-next-110b|[AI-Modelscope/llava-next-110b](https://modelscope.cn/models/AI-Modelscope/llava-next-110b/summary)|q_proj, k_proj, v_proj|llava-qwen-instruct|✔|✘||vision|[lmms-lab/llava-next-110b](https://huggingface.co/lmms-lab/llava-next-110b)| +|llava-next-video-7b-instruct|[huangjintao/LLaVA-NeXT-Video-7B-hf](https://modelscope.cn/models/huangjintao/LLaVA-NeXT-Video-7B-hf/summary)|q_proj, k_proj, v_proj|llava-next-video|✔|✘|transformers>=4.42, av|video|[llava-hf/LLaVA-NeXT-Video-7B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf)| +|llava-next-video-7b-32k-instruct|[huangjintao/LLaVA-NeXT-Video-7B-32K-hf](https://modelscope.cn/models/huangjintao/LLaVA-NeXT-Video-7B-32K-hf/summary)|q_proj, k_proj, v_proj|llava-next-video|✔|✘|transformers>=4.42, av|video|[llava-hf/LLaVA-NeXT-Video-7B-32K-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-32K-hf)| +|llava-next-video-7b-dpo-instruct|[huangjintao/LLaVA-NeXT-Video-7B-DPO-hf](https://modelscope.cn/models/huangjintao/LLaVA-NeXT-Video-7B-DPO-hf/summary)|q_proj, k_proj, v_proj|llava-next-video|✔|✘|transformers>=4.42, av|video|[llava-hf/LLaVA-NeXT-Video-7B-DPO-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-DPO-hf)| +|llava-next-video-34b-instruct|[huangjintao/LLaVA-NeXT-Video-34B-hf](https://modelscope.cn/models/huangjintao/LLaVA-NeXT-Video-34B-hf/summary)|q_proj, k_proj, v_proj|llava-next-video-yi|✔|✘|transformers>=4.42, av|video|[llava-hf/LLaVA-NeXT-Video-34B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-34B-hf)| |yi-vl-6b-chat|[01ai/Yi-VL-6B](https://modelscope.cn/models/01ai/Yi-VL-6B/summary)|q_proj, k_proj, 
v_proj|yi-vl|✔|✘|transformers>=4.34|vision|[01-ai/Yi-VL-6B](https://huggingface.co/01-ai/Yi-VL-6B)| |yi-vl-34b-chat|[01ai/Yi-VL-34B](https://modelscope.cn/models/01ai/Yi-VL-34B/summary)|q_proj, k_proj, v_proj|yi-vl|✔|✘|transformers>=4.34|vision|[01-ai/Yi-VL-34B](https://huggingface.co/01-ai/Yi-VL-34B)| |llava-llama-3-8b-v1_1|[AI-ModelScope/llava-llama-3-8b-v1_1-transformers](https://modelscope.cn/models/AI-ModelScope/llava-llama-3-8b-v1_1-transformers/summary)|q_proj, k_proj, v_proj|llava-llama-instruct|✔|✘|transformers>=4.36|vision|[xtuner/llava-llama-3-8b-v1_1-transformers](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers)| |internlm-xcomposer2-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2-7b/summary)|wqkv|internlm-xcomposer2|✔|✘||vision|[internlm/internlm-xcomposer2-7b](https://huggingface.co/internlm/internlm-xcomposer2-7b)| +|internlm-xcomposer2_5-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b/summary)|wqkv|internlm-xcomposer2_5|✔|✘||vision, video|[internlm/internlm-xcomposer2d5-7b](https://huggingface.co/internlm/internlm-xcomposer2d5-7b)| |internvl-chat-v1_5|[AI-ModelScope/InternVL-Chat-V1-5](https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5/summary)|wqkv|internvl|✔|✘|transformers>=4.35, timm|vision|[OpenGVLab/InternVL-Chat-V1-5](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)| |internvl-chat-v1_5-int8|[AI-ModelScope/InternVL-Chat-V1-5-int8](https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary)|wqkv|internvl|✔|✘|transformers>=4.35, timm|vision|[OpenGVLab/InternVL-Chat-V1-5-int8](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5-int8)| |mini-internvl-chat-2b-v1_5|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5/summary)|wqkv|internvl|✔|✘|transformers>=4.35, timm|vision|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5)| |mini-internvl-chat-4b-v1_5|[OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-4B-V1-5/summary)|qkv_proj|internvl-phi3|✔|✘|transformers>=4.35, timm|vision|[OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5)| +|internvl2-2b|[OpenGVLab/InternVL2-2B](https://modelscope.cn/models/OpenGVLab/InternVL2-2B/summary)|wqkv|internvl2|✔|✘|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-2B](https://huggingface.co/OpenGVLab/InternVL2-2B)| +|internvl2-4b|[OpenGVLab/InternVL2-4B](https://modelscope.cn/models/OpenGVLab/InternVL2-4B/summary)|wqkv|internvl2|✔|✘|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-4B](https://huggingface.co/OpenGVLab/InternVL2-4B)| +|internvl2-8b|[OpenGVLab/InternVL2-8B](https://modelscope.cn/models/OpenGVLab/InternVL2-8B/summary)|wqkv|internvl2|✔|✘|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B)| +|internvl2-26b|[OpenGVLab/InternVL2-26B](https://modelscope.cn/models/OpenGVLab/InternVL2-26B/summary)|wqkv|internvl2|✔|✘|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-26B](https://huggingface.co/OpenGVLab/InternVL2-26B)| |deepseek-vl-1_3b-chat|[deepseek-ai/deepseek-vl-1.3b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-1.3b-chat/summary)|q_proj, k_proj, 
v_proj|deepseek-vl|✔|✘||vision|[deepseek-ai/deepseek-vl-1.3b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-1.3b-chat)| |deepseek-vl-7b-chat|[deepseek-ai/deepseek-vl-7b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-7b-chat/summary)|q_proj, k_proj, v_proj|deepseek-vl|✔|✘||vision|[deepseek-ai/deepseek-vl-7b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat)| |paligemma-3b-pt-224|[AI-ModelScope/paligemma-3b-pt-224](https://modelscope.cn/models/AI-ModelScope/paligemma-3b-pt-224/summary)|q_proj, k_proj, v_proj|paligemma|✔|✘|transformers>=4.41|vision|[google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224)| diff --git a/docs/source/Multi-Modal/index.md b/docs/source/Multi-Modal/index.md index cedb5575b..a9b25d2d1 100644 --- a/docs/source/Multi-Modal/index.md +++ b/docs/source/Multi-Modal/index.md @@ -16,7 +16,7 @@ 一轮对话只能包含一张图片(可能可以不含图片): -1. [Llava最佳实践](llava最佳实践.md) +1. [Llava最佳实践](llava最佳实践.md), [LLava Video最佳实践](llava-video最佳实践.md) 2. [Yi-VL最佳实践.md](yi-vl最佳实践.md) 3. [mPLUG-Owl2最佳实践](mplug-owl2最佳实践.md) 4. [florence最佳实践](florence最佳实践.md) diff --git "a/docs/source/Multi-Modal/internlm-xcomposer2\346\234\200\344\275\263\345\256\236\350\267\265.md" "b/docs/source/Multi-Modal/internlm-xcomposer2\346\234\200\344\275\263\345\256\236\350\267\265.md" index 469b5d2ba..92a65a6d1 100644 --- "a/docs/source/Multi-Modal/internlm-xcomposer2\346\234\200\344\275\263\345\256\236\350\267\265.md" +++ "b/docs/source/Multi-Modal/internlm-xcomposer2\346\234\200\344\275\263\345\256\236\350\267\265.md" @@ -1,5 +1,12 @@ -# Internlm-Xcomposer2 最佳实践 +# Internlm-Xcomposer2 & Internlm-Xcomposer2.5 最佳实践 + +本篇文档涉及的模型如下: + +- [internlm-xcomposer2-7b-chat](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2-7b/summary) +- [internlm-xcomposer2_5-7b-chat](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b/summary) + +以下实践以`internlm-xcomposer2-7b-chat`为例,你也可以通过指定`--model_type`切换为其他模型. ## 目录 - [环境准备](#环境准备) @@ -10,12 +17,14 @@ ## 环境准备 ```shell -pip install 'ms-swift[llm]' -U +git clone https://github.com/modelscope/swift.git +cd swift +pip install -e '.[llm]' ``` ## 推理 -推理[internlm-xcomposer2-7b-chat](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2-7b/summary): +推理internlm-xcomposer2-7b-chat: ```shell # Experimental environment: A10, 3090, V100, ... 
# 21GB GPU memory @@ -43,10 +52,6 @@ CUDA_VISIBLE_DEVICES=0 swift infer --model_type internlm-xcomposer2-7b-chat 湖面平静如明镜。 小舟轻荡波光里, 灯火微摇映水乡。 --------------------------------------------------- -<<< clear -<<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/ocr.png对图片进行OCR -很抱歉,我无法对您提供的图片进行OCR。如果您需要文本识别服务,您可以上传图片到其他支持OCR服务的平台,或者您可以尝试使用一些在线OCR工具。 """ ``` diff --git "a/docs/source/Multi-Modal/internvl\346\234\200\344\275\263\345\256\236\350\267\265.md" "b/docs/source/Multi-Modal/internvl\346\234\200\344\275\263\345\256\236\350\267\265.md" index 5c6715e72..cdfa5333e 100644 --- "a/docs/source/Multi-Modal/internvl\346\234\200\344\275\263\345\256\236\350\267\265.md" +++ "b/docs/source/Multi-Modal/internvl\346\234\200\344\275\263\345\256\236\350\267\265.md" @@ -1,5 +1,18 @@ # InternVL 最佳实践 +本篇文档涉及的模型如下: + +- [internvl-chat-v1_5](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5/summary) +- [internvl-chat-v1_5-int8](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary) +- [mini-internvl-chat-2b-v1_5](https://www.modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5) +- [mini-internvl-chat-4b-v1_5](https://www.modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-4B-V1-5) +- [internvl2-2b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-2B) +- [internvl2-4b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-4B) +- [internvl2-8b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-8B) +- [internvl2-26b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-26B) + + +以下实践以`internvl-chat-v1_5`为例,你也可以通过指定`--model_type`切换为其他模型. ## 目录 - [环境准备](#环境准备) @@ -18,10 +31,6 @@ pip install Pillow ## 推理 -推理[internvl-chat-v1.5](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5/summary)和[internvl-chat-v1.5-int8](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary) - -下面教程以`internvl-chat-v1.5`为例,你可以修改`--model_type internvl-chat-v1_5-int8`来选择int8版本的模型,使用`mini-internvl-chat-2b-v1_5`或 -`mini-internvl-chat-4b-v1_5`来使用Mini-Internvl **注意** - 如果要使用本地模型文件,加上参数 `--model_id_or_path /path/to/model` @@ -126,13 +135,13 @@ import os os.environ['CUDA_VISIBLE_DEVICES'] = '0' from swift.llm import ( - get_model_tokenizer, get_template, inference, ModelType, + get_model_tokenizer, get_template, inference, get_default_template_type, inference_stream ) from swift.utils import seed_everything import torch -model_type = ModelType.internvl_chat_v1_5 +model_type = "internvl-chat-v1_5" template_type = get_default_template_type(model_type) print(f'template_type: {template_type}') model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16, diff --git "a/docs/source/Multi-Modal/llava-video\346\234\200\344\275\263\345\256\236\350\267\265.md" "b/docs/source/Multi-Modal/llava-video\346\234\200\344\275\263\345\256\236\350\267\265.md" new file mode 100644 index 000000000..e47531290 --- /dev/null +++ "b/docs/source/Multi-Modal/llava-video\346\234\200\344\275\263\345\256\236\350\267\265.md" @@ -0,0 +1,145 @@ + + +# Llava Video 最佳实践 + +## 目录 +- [环境准备](#环境准备) +- [推理](#推理) +- [微调](#微调) +- [微调后推理](#微调后推理) + + +## 环境准备 +```shell +git clone https://github.com/modelscope/swift.git +cd swift +pip install -e '.[llm]' +``` + +## 推理 +```shell +# Experimental environment: A10 +# 20GB GPU memory +CUDA_VISIBLE_DEVICES=0 swift infer --model_type llava-next-video-7b-instruct +``` + +输出: (支持传入本地路径或URL) +```python +""" +<<< 你是谁 +Input a video path or URL <<< +我是 Assistant,一个大型语言模型。 我被训练来回答各种问题,包括提供信息、提供建议、提供帮助等等。 我可以回答你关于各种话题的问题,但如果你有具体问题,请告诉我,我会尽力回答。 
+-------------------------------------------------- +<<< clear +<<< 描述这段视频 +Input a video path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4 +这段视频展示了一个小孩在躺在床上,正在玩一本书。她穿着粉色的玩具裤和绿色的玩具裙,穿着眼镜。她的手在书上摸索,她的脸上带着微笑,看起来很开心。她的头发是金色的,整个场景充满了温馨和轻松的氛围。 +-------------------------------------------------- +<<< clear +<<< Describe this video. +Input a video path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/fire.mp4 +In the video, a person is seen holding a bag of chips and a lighter. The person then proceeds to light the chips on fire, creating a small fire. The fire is contained within the bag, and the person appears to be enjoying the fire as they watch it burn. The video is a simple yet intriguing display of pyromania, where the person is fascinated by the fire and enjoys watching it burn. The use of the bag as a container for the fire adds an element of danger to the scene, as it could potentially cause the fire to spread or cause injury. Overall, the video is a brief yet captivating display of pyromania and the allure of fire. +-------------------------------------------------- +<<< clear +<<< 描述这张图片 +Input a video path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png +这张图片是一张照片,显示了一只充满活力和可爱的猫咪。它的头部和脸部呈现出细腻的白色和柔和的灰色斑点,给人一种非常可爱的感觉。猫咪的眼睛非常大,充满了生机和好奇,它们的色彩是深蓝色,与猫咪的眼睛通常的颜色相反。猫咪的耳朵看起来很小,即使它们是很大的猫咪,也很常见。它的身体看起来很健康,毛发柔软而光滑,呈现出一种非常柔和的外观。 +-------------------------------------------------- +<<< clear +<<< 图中有几只羊 +Input a video path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png +在这张图中,有四只羊。 +``` + +**单样本推理** + +```python +import os +os.environ['CUDA_VISIBLE_DEVICES'] = '0' + +from swift.llm import ( + get_model_tokenizer, get_template, inference, ModelType, + get_default_template_type, inference_stream +) +from swift.utils import seed_everything +import torch + +model_type = 'llava-next-video-7b-instruct' +template_type = get_default_template_type(model_type) +print(f'template_type: {template_type}') + +model, tokenizer = get_model_tokenizer(model_type, torch.float16, + model_kwargs={'device_map': 'auto'}) +model.generation_config.max_new_tokens = 256 +template = get_template(template_type, tokenizer) +seed_everything(42) + +videos = ['https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4'] +query = '描述这段视频' +response, _ = inference(model, template, query, videos=videos) +print(f'query: {query}') +print(f'response: {response}') + +# 流式 +videos = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png'] +query = '图中有几只羊' +gen = inference_stream(model, template, query, videos=videos) +print_idx = 0 +print(f'query: {query}\nresponse: ', end='') +for response, _ in gen: + delta = response[print_idx:] + print(delta, end='', flush=True) + print_idx = len(response) +print() + +""" +query: 描述这段视频 +response: 这段视频展示了一个小孩在床上享受一本书的愉悦。她穿着一件简单的纱衣,头戴着眼镜,手轻轻地摸索着书页。她的表情充满了兴奋和惊喜,她的眼睛时不时地眨眼地看着书页,仿佛在探索一个新的世界。她的姿势和动作都充满了轻松和自然,让人感觉到她在享受这个简单而美好的时刻。 +query: 图中有几只羊 +response: 在这张图像中,有四只羊。 +""" +``` + + +## 微调 +多模态大模型微调通常使用**自定义数据集**进行微调. 这里展示可直接运行的demo: + +LoRA微调: + +(默认只对LLM部分的qkv进行lora微调. 如果你想对所有linear含vision模型部分都进行微调, 可以指定`--lora_target_modules ALL`.) +```shell +# Experimental environment: A10, 3090, V100... 
+# 21GB GPU memory +CUDA_VISIBLE_DEVICES=0 swift sft \ + --model_type llava-next-video-7b-instruct \ + --dataset video-chatgpt \ +``` + +[自定义数据集](../LLM/自定义与拓展.md#-推荐命令行参数的形式)支持json, jsonl样式, 以下是自定义数据集的例子: + +(每轮对话需包含一段视频/图片或不含视频/图片, 支持传入本地路径或URL) + +```jsonl +{"query": "55555", "response": "66666", "videos": ["video_path"]} +{"query": "eeeee", "response": "fffff", "videos": ["video_path"]} +{"query": "EEEEE", "response": "FFFFF", "videos": ["image_path"]} +``` + +## 微调后推理 +直接推理: +```shell +CUDA_VISIBLE_DEVICES=0 swift infer \ + --ckpt_dir output/llava-next-video-7b-instruct/vx-xxx/checkpoint-xxx \ + --load_dataset_config true +``` + +**merge-lora**并推理: +```shell +CUDA_VISIBLE_DEVICES=0 swift export \ + --ckpt_dir "output/llava-next-video-7b-instruct/vx-xxx/checkpoint-xxx" \ + --merge_lora true + +CUDA_VISIBLE_DEVICES=0 swift infer \ + --ckpt_dir "output/llava-next-video-7b-instruct/vx-xxx/checkpoint-xxx-merged" \ + --load_dataset_config true +``` diff --git "a/docs/source/Multi-Modal/llava\346\234\200\344\275\263\345\256\236\350\267\265.md" "b/docs/source/Multi-Modal/llava\346\234\200\344\275\263\345\256\236\350\267\265.md" index e9ec1d2fb..92ab08d10 100644 --- "a/docs/source/Multi-Modal/llava\346\234\200\344\275\263\345\256\236\350\267\265.md" +++ "b/docs/source/Multi-Modal/llava\346\234\200\344\275\263\345\256\236\350\267\265.md" @@ -200,7 +200,7 @@ LoRA微调: # Experimental environment: A10, 3090, V100... # 21GB GPU memory CUDA_VISIBLE_DEVICES=0 swift sft \ - --model_type llava1_6-mistral-7b-instruct\ + --model_type llava1_6-mistral-7b-instruct \ --dataset coco-en-2-mini \ # Experimental environment: 2*A100... @@ -215,7 +215,7 @@ CUDA_VISIBLE_DEVICES=0,1 swift sft \ # Experimental environment: 4 * A100 # 4 * 70 GPU memory NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \ - --model_type llava1_6-mistral-7b-instruct\ + --model_type llava1_6-mistral-7b-instruct \ --dataset coco-en-2-mini \ --sft_type full \ --deepspeed default-zero2 @@ -230,7 +230,7 @@ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 swift sft \ [自定义数据集](../LLM/自定义与拓展.md#-推荐命令行参数的形式)支持json, jsonl样式, 以下是自定义数据集的例子: -(只支持单轮对话, 每轮对话必须包含一张图片, 支持传入本地路径或URL) +(每轮对话须包含一张图片或不含图片, 支持传入本地路径或URL) ```jsonl {"query": "55555", "response": "66666", "images": ["image_path"]} diff --git "a/docs/source/Multi-Modal/yi-vl\346\234\200\344\275\263\345\256\236\350\267\265.md" "b/docs/source/Multi-Modal/yi-vl\346\234\200\344\275\263\345\256\236\350\267\265.md" index 0209cbb4c..d97a9f62e 100644 --- "a/docs/source/Multi-Modal/yi-vl\346\234\200\344\275\263\345\256\236\350\267\265.md" +++ "b/docs/source/Multi-Modal/yi-vl\346\234\200\344\275\263\345\256\236\350\267\265.md" @@ -167,7 +167,7 @@ CUDA_VISIBLE_DEVICES=0 swift sft \ [自定义数据集](../LLM/自定义与拓展.md#-推荐命令行参数的形式)支持json, jsonl样式, 以下是自定义数据集的例子: -(支持多轮对话, 每轮对话必须包含一张图片, 支持传入本地路径或URL) +(支持多轮对话, 每轮对话须包含一张图片或不含图片, 支持传入本地路径或URL) ```jsonl {"query": "55555", "response": "66666", "images": ["image_path"]} diff --git a/docs/source_en/LLM/Command-line-parameters.md b/docs/source_en/LLM/Command-line-parameters.md index ae0fb8ede..2489c4551 100644 --- a/docs/source_en/LLM/Command-line-parameters.md +++ b/docs/source_en/LLM/Command-line-parameters.md @@ -29,7 +29,7 @@ - `--add_output_dir_suffix`: Default is `True`, indicating that a suffix of `model_type` and fine-tuning version number will be appended to the `output_dir` directory. Set to `False` to avoid this behavior. - `--ddp_backend`: Backend support for distributed training, default is `None`. Options include: 'nccl', 'gloo', 'mpi', 'ccl'. 
- `--seed`: Global seed, default is `42`. Used to reproduce training results. -- `--resume_from_checkpoint`: Used for resuming training from a checkpoint, default is `None`. You can set it to the path of the checkpoint, for example: `'output/qwen-7b-chat/vx-xxx/checkpoint-xxx'`, to resume training from that point. Supports adjusting `--resume_only_model` to only read the model file during checkpoint continuation. +- `--resume_from_checkpoint`: Used for resuming training from a checkpoint, default is `None`. You can set it to the path of the checkpoint, for example: `--resume_from_checkpoint output/qwen-7b-chat/vx-xxx/checkpoint-xxx`, to resume training from that point. Supports adjusting `--resume_only_model` to only read the model file during checkpoint continuation. - `--resume_only_model`: Default is `False`, which means strict checkpoint continuation, this will read the weights of the model, optimizer, lr_scheduler, and the random seeds stored on each device, and continue training from the last paused steps. If set to `True`, it will only read the weights of the model. - `--dtype`: torch_dtype when loading base model, default is `'AUTO'`, i.e. intelligently select dtype: if machine does not support bf16, use fp16; if `MODEL_MAPPING` specifies torch_dtype for corresponding model, use its dtype; otherwise use bf16. Options include: 'bf16', 'fp16', 'fp32'. - `--dataset`: Used to select the training dataset, default is `[]`. You can see the list of available datasets [here](Supported-models-datasets.md#Datasets). If you need to train with multiple datasets, you can use ',' or ' ' to separate them, for example: `--dataset alpaca-en,alpaca-zh` or `--dataset alpaca-en alpaca-zh`. It supports Modelscope Hub/HuggingFace Hub/local paths, subset selection, and dataset sampling. The specified format for each dataset is as follows: `[HF or MS::]{dataset_name} or {dataset_id} or {dataset_path}[:subset1/subset2/...][#dataset_sample]`. The simplest case requires specifying only dataset_name, dataset_id, or dataset_path. Customizing datasets can be found in the [Customizing and Extending Datasets document](Customization.md#custom-dataset) @@ -88,6 +88,7 @@ - `--predict_with_generate`: Whether to use generation for evaluation, default is `False`. If set to False, evaluate using `loss`. If set to True, evaluate using `ROUGE-L` and other metrics. Generative evaluation takes a long time, choose carefully. - `--lr_scheduler_type`: Default is `'cosine'`, options: 'linear', 'cosine', 'constant', etc. - `--warmup_ratio`: Proportion of warmup in total training steps, default is `0.05`. +- `--warmup_steps`: The number of warmup steps, default is `0`. If warmup_steps > 0 is set, it overrides warmup_ratio. - `--eval_steps`: Evaluate every this many steps, default is `50`. - `--save_steps`: Save every this many steps, default is `None`, i.e. set to `eval_steps`. - `--save_only_model`: Whether to save only model parameters, without saving intermediate states needed for checkpoint resuming, default is `None`, i.e. if `sft_type` is 'lora' and not using deepspeed (`deepspeed` is `None`), set to False, otherwise set to True (e.g. using full fine-tuning or deepspeed). 
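To make the interaction of these flags concrete, here is a minimal sketch of resuming an interrupted run; the model type, dataset, and checkpoint path are placeholders modeled on the examples above, not a command taken from this PR:

```shell
# Hypothetical resume run: read back model/optimizer/lr_scheduler state from the
# checkpoint, and use an explicit 100-step warmup instead of warmup_ratio.
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen-7b-chat \
    --dataset alpaca-en \
    --resume_from_checkpoint output/qwen-7b-chat/vx-xxx/checkpoint-xxx \
    --warmup_steps 100
```

Leaving `--resume_only_model` at its default `False` keeps the strict continuation behavior described above; set it to `true` to reload only the model weights.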
diff --git a/docs/source_en/LLM/LLM-fine-tuning.md b/docs/source_en/LLM/LLM-fine-tuning.md index 282c2fa56..1060777b3 100644 --- a/docs/source_en/LLM/LLM-fine-tuning.md +++ b/docs/source_en/LLM/LLM-fine-tuning.md @@ -96,6 +96,7 @@ swift sft \ --output_dir output \ # Multi-machine multi-card +# If multiple machines share a disk, please additionally specify `--save_on_each_node false`. # node0 CUDA_VISIBLE_DEVICES=0,1,2,3 \ NNODES=2 \ @@ -241,6 +242,7 @@ print(f'history: {history}') Using **Dataset** for evaluation: ```bash +# If you want to infer all dataset samples, please additionally specify `--show_dataset_sample -1`. # Direct inference CUDA_VISIBLE_DEVICES=0 swift infer \ --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' \ diff --git a/docs/source_en/LLM/Supported-models-datasets.md b/docs/source_en/LLM/Supported-models-datasets.md index 033dc5baf..c7b7661c3 100644 --- a/docs/source_en/LLM/Supported-models-datasets.md +++ b/docs/source_en/LLM/Supported-models-datasets.md @@ -110,9 +110,10 @@ The table below introcudes all models supported by SWIFT: |chatglm3-6b-32k|[ZhipuAI/chatglm3-6b-32k](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-32k/summary)|query_key_value|chatglm3|✘|✔|transformers<4.42|-|[THUDM/chatglm3-6b-32k](https://huggingface.co/THUDM/chatglm3-6b-32k)| |chatglm3-6b-128k|[ZhipuAI/chatglm3-6b-128k](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-128k/summary)|query_key_value|chatglm3|✘|✔|transformers<4.42|-|[THUDM/chatglm3-6b-128k](https://huggingface.co/THUDM/chatglm3-6b-128k)| |codegeex2-6b|[ZhipuAI/codegeex2-6b](https://modelscope.cn/models/ZhipuAI/codegeex2-6b/summary)|query_key_value|chatglm-generation|✘|✔|transformers<4.34|coding|[THUDM/codegeex2-6b](https://huggingface.co/THUDM/codegeex2-6b)| -|glm4-9b|[ZhipuAI/glm-4-9b](https://modelscope.cn/models/ZhipuAI/glm-4-9b/summary)|query_key_value|chatglm-generation|✘|✔|transformers<4.42|-|[THUDM/glm-4-9b](https://huggingface.co/THUDM/glm-4-9b)| -|glm4-9b-chat|[ZhipuAI/glm-4-9b-chat](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat/summary)|query_key_value|chatglm3|✘|✔|transformers<4.42|-|[THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat)| -|glm4-9b-chat-1m|[ZhipuAI/glm-4-9b-chat-1m](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat-1m/summary)|query_key_value|chatglm3|✘|✔|transformers<4.42|-|[THUDM/glm-4-9b-chat-1m](https://huggingface.co/THUDM/glm-4-9b-chat-1m)| +|glm4-9b|[ZhipuAI/glm-4-9b](https://modelscope.cn/models/ZhipuAI/glm-4-9b/summary)|query_key_value|chatglm-generation|✔|✔|transformers<4.42|-|[THUDM/glm-4-9b](https://huggingface.co/THUDM/glm-4-9b)| +|glm4-9b-chat|[ZhipuAI/glm-4-9b-chat](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat/summary)|query_key_value|chatglm3|✔|✔|transformers<4.42|-|[THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat)| +|glm4-9b-chat-1m|[ZhipuAI/glm-4-9b-chat-1m](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat-1m/summary)|query_key_value|chatglm3|✔|✔|transformers<4.42|-|[THUDM/glm-4-9b-chat-1m](https://huggingface.co/THUDM/glm-4-9b-chat-1m)| +|codegeex4-9b-chat|[ZhipuAI/codegeex4-all-9b](https://modelscope.cn/models/ZhipuAI/codegeex4-all-9b/summary)|query_key_value|codegeex4|✔|✔|transformers<4.42|coding|[THUDM/codegeex4-all-9b](https://huggingface.co/THUDM/codegeex4-all-9b)| |llama2-7b|[modelscope/Llama-2-7b-ms](https://modelscope.cn/models/modelscope/Llama-2-7b-ms/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔||-|[meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf)| 
|llama2-7b-chat|[modelscope/Llama-2-7b-chat-ms](https://modelscope.cn/models/modelscope/Llama-2-7b-chat-ms/summary)|q_proj, k_proj, v_proj|llama|✔|✔||-|[meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)| |llama2-13b|[modelscope/Llama-2-13b-ms](https://modelscope.cn/models/modelscope/Llama-2-13b-ms/summary)|q_proj, k_proj, v_proj|default-generation|✔|✔||-|[meta-llama/Llama-2-13b-hf](https://huggingface.co/meta-llama/Llama-2-13b-hf)| @@ -339,14 +340,23 @@ The table below introcudes all models supported by SWIFT: |llama3-llava-next-8b|[AI-Modelscope/llama3-llava-next-8b](https://modelscope.cn/models/AI-Modelscope/llama3-llava-next-8b/summary)|q_proj, k_proj, v_proj|llama-llava-next|✔|✘||vision|[lmms-lab/llama3-llava-next-8b](https://huggingface.co/lmms-lab/llama3-llava-next-8b)| |llava-next-72b|[AI-Modelscope/llava-next-72b](https://modelscope.cn/models/AI-Modelscope/llava-next-72b/summary)|q_proj, k_proj, v_proj|llava-qwen-instruct|✔|✘||vision|[lmms-lab/llava-next-72b](https://huggingface.co/lmms-lab/llava-next-72b)| |llava-next-110b|[AI-Modelscope/llava-next-110b](https://modelscope.cn/models/AI-Modelscope/llava-next-110b/summary)|q_proj, k_proj, v_proj|llava-qwen-instruct|✔|✘||vision|[lmms-lab/llava-next-110b](https://huggingface.co/lmms-lab/llava-next-110b)| +|llava-next-video-7b-instruct|[huangjintao/LLaVA-NeXT-Video-7B-hf](https://modelscope.cn/models/huangjintao/LLaVA-NeXT-Video-7B-hf/summary)|q_proj, k_proj, v_proj|llava-next-video|✔|✘|transformers>=4.42, av|video|[llava-hf/LLaVA-NeXT-Video-7B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf)| +|llava-next-video-7b-32k-instruct|[huangjintao/LLaVA-NeXT-Video-7B-32K-hf](https://modelscope.cn/models/huangjintao/LLaVA-NeXT-Video-7B-32K-hf/summary)|q_proj, k_proj, v_proj|llava-next-video|✔|✘|transformers>=4.42, av|video|[llava-hf/LLaVA-NeXT-Video-7B-32K-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-32K-hf)| +|llava-next-video-7b-dpo-instruct|[huangjintao/LLaVA-NeXT-Video-7B-DPO-hf](https://modelscope.cn/models/huangjintao/LLaVA-NeXT-Video-7B-DPO-hf/summary)|q_proj, k_proj, v_proj|llava-next-video|✔|✘|transformers>=4.42, av|video|[llava-hf/LLaVA-NeXT-Video-7B-DPO-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-DPO-hf)| +|llava-next-video-34b-instruct|[huangjintao/LLaVA-NeXT-Video-34B-hf](https://modelscope.cn/models/huangjintao/LLaVA-NeXT-Video-34B-hf/summary)|q_proj, k_proj, v_proj|llava-next-video-yi|✔|✘|transformers>=4.42, av|video|[llava-hf/LLaVA-NeXT-Video-34B-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-34B-hf)| |yi-vl-6b-chat|[01ai/Yi-VL-6B](https://modelscope.cn/models/01ai/Yi-VL-6B/summary)|q_proj, k_proj, v_proj|yi-vl|✔|✘|transformers>=4.34|vision|[01-ai/Yi-VL-6B](https://huggingface.co/01-ai/Yi-VL-6B)| |yi-vl-34b-chat|[01ai/Yi-VL-34B](https://modelscope.cn/models/01ai/Yi-VL-34B/summary)|q_proj, k_proj, v_proj|yi-vl|✔|✘|transformers>=4.34|vision|[01-ai/Yi-VL-34B](https://huggingface.co/01-ai/Yi-VL-34B)| |llava-llama-3-8b-v1_1|[AI-ModelScope/llava-llama-3-8b-v1_1-transformers](https://modelscope.cn/models/AI-ModelScope/llava-llama-3-8b-v1_1-transformers/summary)|q_proj, k_proj, v_proj|llava-llama-instruct|✔|✘|transformers>=4.36|vision|[xtuner/llava-llama-3-8b-v1_1-transformers](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers)| 
|internlm-xcomposer2-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2-7b/summary)|wqkv|internlm-xcomposer2|✔|✘||vision|[internlm/internlm-xcomposer2-7b](https://huggingface.co/internlm/internlm-xcomposer2-7b)| +|internlm-xcomposer2_5-7b-chat|[Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b/summary)|wqkv|internlm-xcomposer2_5|✔|✘||vision, video|[internlm/internlm-xcomposer2d5-7b](https://huggingface.co/internlm/internlm-xcomposer2d5-7b)| |internvl-chat-v1_5|[AI-ModelScope/InternVL-Chat-V1-5](https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5/summary)|wqkv|internvl|✔|✘|transformers>=4.35, timm|vision|[OpenGVLab/InternVL-Chat-V1-5](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)| |internvl-chat-v1_5-int8|[AI-ModelScope/InternVL-Chat-V1-5-int8](https://modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary)|wqkv|internvl|✔|✘|transformers>=4.35, timm|vision|[OpenGVLab/InternVL-Chat-V1-5-int8](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5-int8)| |mini-internvl-chat-2b-v1_5|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5/summary)|wqkv|internvl|✔|✘|transformers>=4.35, timm|vision|[OpenGVLab/Mini-InternVL-Chat-2B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-2B-V1-5)| |mini-internvl-chat-4b-v1_5|[OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-4B-V1-5/summary)|qkv_proj|internvl-phi3|✔|✘|transformers>=4.35, timm|vision|[OpenGVLab/Mini-InternVL-Chat-4B-V1-5](https://huggingface.co/OpenGVLab/Mini-InternVL-Chat-4B-V1-5)| +|internvl2-2b|[OpenGVLab/InternVL2-2B](https://modelscope.cn/models/OpenGVLab/InternVL2-2B/summary)|wqkv|internvl2|✔|✘|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-2B](https://huggingface.co/OpenGVLab/InternVL2-2B)| +|internvl2-4b|[OpenGVLab/InternVL2-4B](https://modelscope.cn/models/OpenGVLab/InternVL2-4B/summary)|wqkv|internvl2|✔|✘|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-4B](https://huggingface.co/OpenGVLab/InternVL2-4B)| +|internvl2-8b|[OpenGVLab/InternVL2-8B](https://modelscope.cn/models/OpenGVLab/InternVL2-8B/summary)|wqkv|internvl2|✔|✘|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-8B](https://huggingface.co/OpenGVLab/InternVL2-8B)| +|internvl2-26b|[OpenGVLab/InternVL2-26B](https://modelscope.cn/models/OpenGVLab/InternVL2-26B/summary)|wqkv|internvl2|✔|✘|transformers>=4.35, timm|vision|[OpenGVLab/InternVL2-26B](https://huggingface.co/OpenGVLab/InternVL2-26B)| |deepseek-vl-1_3b-chat|[deepseek-ai/deepseek-vl-1.3b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-1.3b-chat/summary)|q_proj, k_proj, v_proj|deepseek-vl|✔|✘||vision|[deepseek-ai/deepseek-vl-1.3b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-1.3b-chat)| |deepseek-vl-7b-chat|[deepseek-ai/deepseek-vl-7b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-7b-chat/summary)|q_proj, k_proj, v_proj|deepseek-vl|✔|✘||vision|[deepseek-ai/deepseek-vl-7b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat)| |paligemma-3b-pt-224|[AI-ModelScope/paligemma-3b-pt-224](https://modelscope.cn/models/AI-ModelScope/paligemma-3b-pt-224/summary)|q_proj, k_proj, v_proj|paligemma|✔|✘|transformers>=4.41|vision|[google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224)| diff --git a/docs/source_en/LLM/VLLM-inference-acceleration-and-deployment.md 
b/docs/source_en/LLM/VLLM-inference-acceleration-and-deployment.md index 9004059df..5d2fce03a 100644 --- a/docs/source_en/LLM/VLLM-inference-acceleration-and-deployment.md +++ b/docs/source_en/LLM/VLLM-inference-acceleration-and-deployment.md @@ -205,6 +205,7 @@ CUDA_VISIBLE_DEVICES=0 swift export \ --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx' --merge_lora true # Evaluate using dataset +# If you want to infer all dataset samples, please additionally specify `--show_dataset_sample -1`. CUDA_VISIBLE_DEVICES=0 swift infer \ --ckpt_dir 'xxx/vx-xxx/checkpoint-xxx-merged' \ --infer_backend vllm \ diff --git a/docs/source_en/Multi-Modal/florence-best-pratice.md b/docs/source_en/Multi-Modal/florence-best-pratice.md index 47aa2b2b8..262d1bf88 100644 --- a/docs/source_en/Multi-Modal/florence-best-pratice.md +++ b/docs/source_en/Multi-Modal/florence-best-pratice.md @@ -146,7 +146,7 @@ CUDA_VISIBLE_DEVICES=0 swift sft \ ``` -[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json, jsonl formats. Here is an example of a custom dataset: +[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json, jsonl formats. Here is an example of a custom dataset: **Caption/VQA** task diff --git a/docs/source_en/Multi-Modal/index.md b/docs/source_en/Multi-Modal/index.md index 2d5c1abf2..618420092 100644 --- a/docs/source_en/Multi-Modal/index.md +++ b/docs/source_en/Multi-Modal/index.md @@ -16,7 +16,7 @@ A single round of dialogue can contain multiple images (or no images): A single round of dialogue can only contain one image: -1. [Llava Best Practice](llava-best-practice.md) +1. [Llava Best Practice](llava-best-practice.md), [LLava Video Best Practice](llava-video-best-practice.md) 2. [Yi-VL Best Practice.md](yi-vl-best-practice.md) 3. [Florence Best Practice.md](florence-best-pratice.md) diff --git a/docs/source_en/Multi-Modal/internlm-xcomposer2-best-practice.md b/docs/source_en/Multi-Modal/internlm-xcomposer2-best-practice.md index 16e90e603..25801918c 100644 --- a/docs/source_en/Multi-Modal/internlm-xcomposer2-best-practice.md +++ b/docs/source_en/Multi-Modal/internlm-xcomposer2-best-practice.md @@ -1,5 +1,12 @@ # Internlm-Xcomposer2 Best Practice +The document corresponds to the following models: + +- [internlm-xcomposer2-7b-chat](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2-7b/summary) +- [internlm-xcomposer2_5-7b-chat](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b/summary) + +The following practice takes `internlm-xcomposer2-7b-chat` as an example, and you can also switch to other models by specifying `--model_type`. + ## Table of Contents - [Environment Preparation](#environment-preparation) - [Inference](#inference) @@ -8,12 +15,14 @@ ## Environment Preparation ```shell -pip install 'ms-swift[llm]' -U +git clone https://github.com/modelscope/swift.git +cd swift +pip install -e '.[llm]' ``` ## Inference -Inference for [internlm-xcomposer2-7b-chat](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm-xcomposer2-7b/summary): +Inference for internlm-xcomposer2-7b-chat: ```shell # Experimental environment: A10, 3090, V100, ... # 21GB GPU memory @@ -133,7 +142,7 @@ CUDA_VISIBLE_DEVICES=0 swift sft \ --dataset coco-en-mini \ ``` -[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json and jsonl formats. Here's an example of a custom dataset: +[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json and jsonl formats. 
Here's an example of a custom dataset: (Supports multi-turn conversations, each turn can contain multiple images or no images, supports passing local paths or URLs. This model does not support merge-lora) diff --git a/docs/source_en/Multi-Modal/internvl-best-practice.md b/docs/source_en/Multi-Modal/internvl-best-practice.md index 3d61ad565..09b17f992 100644 --- a/docs/source_en/Multi-Modal/internvl-best-practice.md +++ b/docs/source_en/Multi-Modal/internvl-best-practice.md @@ -1,4 +1,16 @@ # InternVL Best Practice +The document corresponds to the following models: + +- [internvl-chat-v1_5](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5/summary) +- [internvl-chat-v1_5-int8](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary) +- [mini-internvl-chat-2b-v1_5](https://www.modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-2B-V1-5) +- [mini-internvl-chat-4b-v1_5](https://www.modelscope.cn/models/OpenGVLab/Mini-InternVL-Chat-4B-V1-5) +- [internvl2-2b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-2B) +- [internvl2-4b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-4B) +- [internvl2-8b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-8B) +- [internvl2-26b](https://www.modelscope.cn/models/OpenGVLab/InternVL2-26B) + +The following practice takes `internvl-chat-v1_5` as an example, and you can also switch to other models by specifying `--model_type`. ## Table of Contents - [Environment Setup](#environment-setup) @@ -16,13 +28,6 @@ pip install Pillow ## Inference -Inference for [internvl-chat-v1.5](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5/summary) -(To use a local model file, add the argument `--model_id_or_path /path/to/model`) - -Inference with [internvl-chat-v1.5](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5/summary) and [internvl-chat-v1.5-int8](https://www.modelscope.cn/models/AI-ModelScope/InternVL-Chat-V1-5-int8/summary). - -The tutorial below takes `internvl-chat-v1.5` as an example, and you can change to `--model_type internvl-chat-v1_5-int8` to select the INT8 version of the model. Alternatively, select the Mini-Internvl model by choosing either `mini-internvl-chat-2b-v1_5` or `mini-internvl-chat-4b-v1_5`. - **Note** - If you want to use a local model file, add the argument --model_id_or_path /path/to/model. - If your GPU does not support flash attention, use the argument --use_flash_attn false. And for int8 models, it is necessary to specify `dtype --bf16` during inference, otherwise the output may be garbled. @@ -106,13 +111,13 @@ import os os.environ['CUDA_VISIBLE_DEVICES'] = '0' from swift.llm import ( - get_model_tokenizer, get_template, inference, ModelType, + get_model_tokenizer, get_template, inference, get_default_template_type, inference_stream ) from swift.utils import seed_everything import torch -model_type = ModelType.internvl_chat_v1_5 +model_type = "internvl-chat-v1_5" template_type = get_default_template_type(model_type) print(f'template_type: {template_type}') @@ -217,7 +222,7 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \ --max_length 4096 ``` -[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json, jsonl formats. Here is an example of a custom dataset: +[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json, jsonl formats. Here is an example of a custom dataset: (Only single-turn dialogue is supported. Each turn of dialogue must contain one image. Local paths or URLs can be passed in.) 
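Since the note above restricts samples to single-turn dialogue with exactly one image, a small illustrative check can catch malformed rows before training. This is a sketch only: the key names (`query`, `response`, `history`, `images`) follow the custom-dataset convention described in the linked Customization doc, and the validator itself is not part of SWIFT.

```python
import json

def check_single_turn_one_image(jsonl_path: str) -> None:
    """Illustrative check: every row is single-turn and carries exactly one image."""
    with open(jsonl_path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            row = json.loads(line)
            assert 'query' in row and 'response' in row, f'row {i}: missing query/response'
            assert not row.get('history'), f'row {i}: multi-turn history is not supported here'
            images = row.get('images') or []
            assert len(images) == 1, f'row {i}: expected exactly one image, got {len(images)}'

check_single_turn_one_image('internvl_sft.jsonl')  # hypothetical file name
```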
diff --git a/docs/source_en/Multi-Modal/llava-best-practice.md b/docs/source_en/Multi-Modal/llava-best-practice.md index 5ca7ee7c1..116522e9c 100644 --- a/docs/source_en/Multi-Modal/llava-best-practice.md +++ b/docs/source_en/Multi-Modal/llava-best-practice.md @@ -217,7 +217,7 @@ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 swift sft \ --sft_type full \ ``` -[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json, jsonl formats. Here is an example of a custom dataset: +[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json, jsonl formats. Here is an example of a custom dataset: (Only single-turn dialogue is supported. Each turn of dialogue must contain one image. Local paths or URLs can be passed in.) diff --git a/docs/source_en/Multi-Modal/llava-video-best-practice.md b/docs/source_en/Multi-Modal/llava-video-best-practice.md new file mode 100644 index 000000000..1dcf4d147 --- /dev/null +++ b/docs/source_en/Multi-Modal/llava-video-best-practice.md @@ -0,0 +1,144 @@ +# Llava Video Best Practice + +## Table of Contents +- [Environment Setup](#environment-setup) +- [Inference](#inference) +- [Fine-tuning](#fine-tuning) +- [Inference after Fine-tuning](#inference-after-fine-tuning) + +## Environment Setup +```shell +git clone https://github.com/modelscope/swift.git +cd swift +pip install -e '.[llm]' +``` + +## Inference +```shell +# Experimental environment: A10 +# 20GB GPU memory +CUDA_VISIBLE_DEVICES=0 swift infer --model_type llava-next-video-7b-instruct +``` + +Output: (supports passing in local path or URL) +```python +""" +<<< who are you +Input a video path or URL <<< +I am Vicuna, a language model trained by researchers from Large Model Systems Organization (LMSYS). +-------------------------------------------------- +<<< clear +<<< Describe this video. +Input a video path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4 +In the video, a young child is seen sitting on a bed, engrossed in reading a book. The child is wearing glasses and appears to be enjoying the book. The bed is covered with a white blanket, and there are some toys scattered around the room. The child's focus on the book suggests that they are deeply immersed in the story. The room appears to be a comfortable and cozy space, with the child's playful demeanor adding to the overall warmth of the scene. +-------------------------------------------------- +<<< clear +<<< Describe this video. +Input a video path or URL <<< https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/fire.mp4 +In the video, we see a person's hands holding a bag of chips. The person is standing in front of a fire pit, which is surrounded by a wooden fence. The fire pit is filled with wood, and there is a small fire burning in it. The person is holding the bag of chips over the fire pit, and we can see the flames from the fire reflected on the bag. The person then opens the bag and throws the chips onto the fire, causing them to sizzle and pop as they land on the burning wood. The sound of the chips hitting the fire can be heard clearly in the video. Overall, the video captures a simple yet satisfying moment of someone enjoying a snack while surrounded by the warmth and light of a fire pit. +-------------------------------------------------- +<<< clear +<<< Describe this image. +Input a video path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png +This is a close-up photograph of a kitten with a soft, blurred background. 
The kitten has a light brown coat with darker brown stripes and patches, typical of a calico pattern. Its eyes are wide open, and its nose is pink, which is common for young kittens. The kitten's whiskers are visible, and its ears are perked up, suggesting alertness. The image has a shallow depth of field, with the kitten in focus and the background out of focus, creating a bokeh effect. +-------------------------------------------------- +<<< clear +<<< How many sheep are in the picture? +Input a video path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png +There are four sheep in the picture. +""" +``` + +**Single Sample Inference** + +```python +import os +os.environ['CUDA_VISIBLE_DEVICES'] = '0' + +from swift.llm import ( + get_model_tokenizer, get_template, inference, ModelType, + get_default_template_type, inference_stream +) +from swift.utils import seed_everything +import torch + +model_type = 'llava-next-video-7b-instruct' +template_type = get_default_template_type(model_type) +print(f'template_type: {template_type}') + +model, tokenizer = get_model_tokenizer(model_type, torch.float16, + model_kwargs={'device_map': 'auto'}) +model.generation_config.max_new_tokens = 256 +template = get_template(template_type, tokenizer) +seed_everything(42) + +videos = ['https://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/baby.mp4'] +query = 'Describe this video.' +response, _ = inference(model, template, query, videos=videos) +print(f'query: {query}') +print(f'response: {response}') + +# Streaming +videos = ['http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/animal.png'] +query = 'How many sheep are in the picture?' +gen = inference_stream(model, template, query, videos=videos) +print_idx = 0 +print(f'query: {query}\nresponse: ', end='') +for response, _ in gen: + delta = response[print_idx:] + print(delta, end='', flush=True) + print_idx = len(response) +print() + +""" +query: Describe this video. +response: In the video, a young child is seen sitting on a bed, engrossed in reading a book. The child is wearing a pair of glasses, which adds a touch of innocence to the scene. The child's focus is entirely on the book, indicating a sense of curiosity and interest in the content. The bed, covered with a white blanket, provides a cozy and comfortable setting for the child's reading session. The overall atmosphere of the video is one of tranquility and peacefulness, as the child enjoys a quiet moment of reading. +query: How many sheep are in the picture? +response: There are four sheep in the picture. +""" +``` + + +## Fine-tuning +Multimodal large model fine-tuning usually uses **custom datasets** for fine-tuning. Here is a demo that can be run directly: + +LoRA fine-tuning: + +(By default, only the qkv of the LLM part is fine-tuned using LoRA. If you want to fine-tune all linear layers including the vision model part, you can specify `--lora_target_modules ALL`.) +```shell +# Experimental environment: A10, 3090, V100... +# 21GB GPU memory +CUDA_VISIBLE_DEVICES=0 swift sft \ + --model_type llava-next-video-7b-instruct \ + --dataset video-chatgpt \ +``` + +[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json, jsonl formats. Here is an example of a custom dataset: + +(Each round of conversation needs to include a video/image or not include a video/image, supports local path or URL input.) 
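The jsonl sample below shows the expected format; as a convenience, here is a hedged Python sketch that writes such a file programmatically. The `videos` key and the row layout mirror the sample that follows; the file name, queries, responses and video paths are placeholders.

```python
import json

# Placeholder (query, response, video_path) triples; replace with real data.
samples = [
    ('Describe this video.', 'A child is reading a book on a bed.', 'videos/clip_0.mp4'),
    ('What happens in this clip?', 'Chips are tossed onto a burning fire pit.', 'videos/clip_1.mp4'),
]

with open('video_sft.jsonl', 'w', encoding='utf-8') as f:
    for query, response, video_path in samples:
        row = {'query': query, 'response': response, 'videos': [video_path]}
        f.write(json.dumps(row, ensure_ascii=False) + '\n')
```

The resulting file can then be passed through the custom-dataset arguments described in the linked Customization doc.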
+ +```jsonl +{"query": "55555", "response": "66666", "videos": ["video_path"]} +{"query": "eeeee", "response": "fffff", "videos": ["video_path"]} +{"query": "EEEEE", "response": "FFFFF", "videos": ["image_path"]} +``` + + +## Inference after Fine-tuning +Direct inference: +```shell +CUDA_VISIBLE_DEVICES=0 swift infer \ + --ckpt_dir output/llava-next-video-7b-instruct/vx-xxx/checkpoint-xxx \ + --load_dataset_config true +``` + +**merge-lora** and inference: +```shell +CUDA_VISIBLE_DEVICES=0 swift export \ + --ckpt_dir "output/llava-next-video-7b-instruct/vx-xxx/checkpoint-xxx" \ + --merge_lora true + +CUDA_VISIBLE_DEVICES=0 swift infer \ + --ckpt_dir "output/llava-next-video-7b-instruct/vx-xxx/checkpoint-xxx-merged" \ + --load_dataset_config true +``` diff --git a/docs/source_en/Multi-Modal/minicpm-v-best-practice.md b/docs/source_en/Multi-Modal/minicpm-v-best-practice.md index 979ca790e..9675f5c28 100644 --- a/docs/source_en/Multi-Modal/minicpm-v-best-practice.md +++ b/docs/source_en/Multi-Modal/minicpm-v-best-practice.md @@ -135,7 +135,7 @@ CUDA_VISIBLE_DEVICES=0 swift sft \ --dataset coco-en-2-mini \ ``` -[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json and jsonl formats. Here is an example of a custom dataset: +[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json and jsonl formats. Here is an example of a custom dataset: (Supports multi-turn conversations, but the total round of conversations can only contain one image. Supports local path or URL input.) diff --git a/docs/source_en/Multi-Modal/qwen-audio-best-practice.md b/docs/source_en/Multi-Modal/qwen-audio-best-practice.md index c20aa7737..833de350f 100644 --- a/docs/source_en/Multi-Modal/qwen-audio-best-practice.md +++ b/docs/source_en/Multi-Modal/qwen-audio-best-practice.md @@ -128,7 +128,7 @@ NPROC_PER_NODE=4 CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \ --deepspeed default-zero2 ``` -[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) supports json, jsonl styles, the following is an example of a custom dataset: +[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) supports json, jsonl styles, the following is an example of a custom dataset: (Supports multi-turn conversations, supports each turn of conversation containing multiple or no audio segments, supports passing local paths or URLs) diff --git a/docs/source_en/Multi-Modal/qwen-vl-best-practice.md b/docs/source_en/Multi-Modal/qwen-vl-best-practice.md index 7faf1b16e..fd9014fc0 100644 --- a/docs/source_en/Multi-Modal/qwen-vl-best-practice.md +++ b/docs/source_en/Multi-Modal/qwen-vl-best-practice.md @@ -149,7 +149,7 @@ CUDA_VISIBLE_DEVICES=0,1 swift sft \ --sft_type full \ ``` -[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json and jsonl formats. Here is an example of a custom dataset: +[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json and jsonl formats. 
Here is an example of a custom dataset: (Supports multi-turn dialogues, where each turn can contain multiple images or no images, and supports passing in local paths or URLs) diff --git a/docs/source_en/Multi-Modal/yi-vl-best-practice.md b/docs/source_en/Multi-Modal/yi-vl-best-practice.md index d679cdde5..88e63333e 100644 --- a/docs/source_en/Multi-Modal/yi-vl-best-practice.md +++ b/docs/source_en/Multi-Modal/yi-vl-best-practice.md @@ -150,7 +150,7 @@ CUDA_VISIBLE_DEVICES=0 swift sft \ --dataset coco-en-2-mini \ ``` -[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json, jsonl format, here is an example of a custom dataset: +[Custom datasets](../LLM/Customization.md#-Recommended-Command-line-arguments) support json, jsonl format, here is an example of a custom dataset: (Multi-turn dialogue is supported, each turn must include an image, which can be passed as a local path or URL) diff --git a/requirements/framework.txt b/requirements/framework.txt index 05a49bbcf..e777b6e11 100644 --- a/requirements/framework.txt +++ b/requirements/framework.txt @@ -2,6 +2,7 @@ accelerate aiohttp binpacking dacite +datasets<2.19 jieba matplotlib modelscope>=1.14 diff --git a/swift/llm/infer.py b/swift/llm/infer.py index acaf2faaa..ff2ad7b8b 100644 --- a/swift/llm/infer.py +++ b/swift/llm/infer.py @@ -15,7 +15,7 @@ from swift.tuners import Swift from swift.utils import (append_to_jsonl, get_logger, get_main, get_model_info, read_multi_line, seed_everything, show_layers) -from .utils import (DeployArguments, InferArguments, Template, get_additional_saved_files, get_dataset, +from .utils import (DeployArguments, InferArguments, MediaTag, Template, get_additional_saved_files, get_dataset, get_model_tokenizer, get_template, inference, inference_stream, is_adapter, is_quant_model, sample_dataset, set_generation_config) @@ -242,16 +242,18 @@ def prepare_model_template(args: InferArguments, return model, template -def read_media_file(infer_kwargs: Dict[str, Any], infer_media_type: Literal['none', 'round', 'dialogue']) -> None: - text = 'Input a media path or URL <<< ' - images = infer_kwargs.get('images') or [] +def read_media_file(infer_kwargs: Dict[str, Any], infer_media_type: Literal['none', 'round', 'dialogue'], + media_type: Literal['image', 'video', 'audio']) -> None: + media_key = MediaTag.media_keys[media_type] + a_an = 'an' if media_type[0] in {'i', 'a'} else 'a' + text = f'Input {a_an} {media_type} path or URL <<< ' + media_files = infer_kwargs.get(media_key) or [] if infer_media_type == 'none': return - if infer_media_type == 'round' or len(images) == 0: - image = input(text) - images += [image or None] - if len(images) > 0: - infer_kwargs['images'] = images + if infer_media_type == 'round' or len(media_files) == 0: + media_files += [input(text) or None] + if len(media_files) > 0: + infer_kwargs[media_key] = media_files def llm_infer(args: InferArguments) -> Dict[str, List[Dict[str, Any]]]: @@ -365,7 +367,7 @@ def llm_infer(args: InferArguments) -> Dict[str, List[Dict[str, Any]]]: history = [] infer_kwargs = {} - read_media_file(infer_kwargs, args.infer_media_type) + read_media_file(infer_kwargs, args.infer_media_type, args.media_type) infer_kwargs['truncation_strategy'] = args.truncation_strategy if system is None and template.use_default_system: system = template.default_system @@ -407,9 +409,10 @@ def llm_infer(args: InferArguments) -> Dict[str, List[Dict[str, Any]]]: 'response': response, 'history': history, } - images = infer_kwargs.get('images') - if images is 
not None: - obj['images'] = images + for media_key in MediaTag.media_keys.values(): + media_files = infer_kwargs.get(media_key) + if media_files is not None: + obj[media_key] = media_files history = new_history if jsonl_path is not None: append_to_jsonl(jsonl_path, obj) @@ -456,15 +459,16 @@ def llm_infer(args: InferArguments) -> Dict[str, List[Dict[str, Any]]]: request = {'query': data['query']} history = data.get('history') system = data.get('system') - images = data.get('images') if history is None: history = [] request['history'] = history if system is None and template.use_default_system: system = template.default_system request['system'] = system - if images is not None: - request['images'] = images + for media_key in MediaTag.media_keys.values(): + media_files = data.get(media_key) + if media_files is not None: + request[media_key] = media_files request['truncation_strategy'] = args.truncation_strategy request_list.append(request) resp_list = inference_vllm(llm_engine, template, request_list, use_tqdm=True) @@ -480,9 +484,10 @@ def llm_infer(args: InferArguments) -> Dict[str, List[Dict[str, Any]]]: 'label': request.pop('label', None), 'history': request['history'], } - images = request.get('images') - if images is not None: - obj['images'] = images + for media_key in MediaTag.media_keys.values(): + media_files = request.get(media_key) + if media_files is not None: + obj[media_key] = media_files if jsonl_path is not None: append_to_jsonl(jsonl_path, obj) result.append(obj) @@ -493,7 +498,6 @@ def llm_infer(args: InferArguments) -> Dict[str, List[Dict[str, Any]]]: kwargs = {'query': data['query']} history = data.get('history') system = data.get('system') - images = data.get('images') tools = data.get('tools') objects = data.get('objects') if args.verbose and system is not None: @@ -504,8 +508,10 @@ def llm_infer(args: InferArguments) -> Dict[str, List[Dict[str, Any]]]: if system is None and template.use_default_system: system = template.default_system kwargs['system'] = system - if images is not None: - kwargs['images'] = images + for media_key in MediaTag.media_keys.values(): + media_files = data.get(media_key) + if media_files is not None: + kwargs[media_key] = media_files if tools is not None: kwargs['tools'] = tools if objects is not None: @@ -534,17 +540,22 @@ def llm_infer(args: InferArguments) -> Dict[str, List[Dict[str, Any]]]: 'label': label, 'history': kwargs['history'], } - if images is not None: - obj['images'] = images + for media_key in MediaTag.media_keys.values(): + media_files = kwargs.get(media_key) + if media_files is not None: + obj[media_key] = media_files if jsonl_path is not None: append_to_jsonl(jsonl_path, obj) result.append(obj) if args.verbose: print() print(f'[LABELS]{label}') - if images is not None: - print(f'[IMAGES]{images}') + for media_key in MediaTag.media_keys.values(): + media_files = kwargs.get(media_key) + if media_files is not None: + print(f'[{media_key.upper()}]{media_files}') print('-' * 50, flush=True) + if jsonl_path is not None: logger.info(f'save_result_path: {jsonl_path}') if not args.eval_human and args.show_dataset_sample == 10: # is default diff --git a/swift/llm/utils/__init__.py b/swift/llm/utils/__init__.py index 4a48f334b..3a99a6faf 100644 --- a/swift/llm/utils/__init__.py +++ b/swift/llm/utils/__init__.py @@ -6,6 +6,7 @@ from .dataset import (DATASET_MAPPING, DatasetName, HfDataset, get_dataset, get_dataset_from_repo, load_dataset_from_local, load_ms_dataset, register_dataset, register_dataset_info, register_local_dataset, 
sample_dataset) +from .media import MediaCache, MediaTag from .model import (MODEL_MAPPING, GetModelTokenizerFunction, LoRATM, ModelType, get_additional_saved_files, get_default_lora_target_modules, get_default_template_type, get_model_tokenizer, get_model_tokenizer_from_repo, get_model_tokenizer_with_flash_attn, register_model) diff --git a/swift/llm/utils/argument.py b/swift/llm/utils/argument.py index 3cb933c5c..da755902b 100644 --- a/swift/llm/utils/argument.py +++ b/swift/llm/utils/argument.py @@ -5,7 +5,7 @@ import platform import sys from dataclasses import dataclass, field -from typing import Any, List, Literal, Optional, Set, Tuple, Union +from typing import Any, Dict, List, Literal, Optional, Set, Tuple, Union import json import numpy as np @@ -26,6 +26,7 @@ from .client_utils import get_model_list_client from .dataset import (DATASET_MAPPING, _dataset_name_exists, get_dataset, parse_dataset_name, register_dataset_info_file, sample_dataset) +from .media import MediaTag from .model import (MODEL_MAPPING, dtype_mapping, get_additional_saved_files, get_default_lora_target_modules, get_default_template_type) from .template import TEMPLATE_MAPPING @@ -578,7 +579,9 @@ class SftArguments(ArgumentsBase): max_grad_norm: float = 0.5 predict_with_generate: bool = False lr_scheduler_type: str = 'cosine' + lr_scheduler_kwargs: Optional[str] = None # json warmup_ratio: float = 0.05 + warmup_steps: int = 0 # Overrides any effect of `warmup_ratio` if warmup_steps > 0 eval_steps: int = 50 save_steps: Optional[int] = None @@ -730,6 +733,12 @@ def _prepare_target_modules(self, target_modules) -> List[str]: self.lora_use_all = True return target_modules + def handle_lr_scheduler_kwargs(self): + if self.lr_scheduler_kwargs is None: + self.lr_scheduler_kwargs = {} + elif isinstance(self.lr_scheduler_kwargs, str): + self.lr_scheduler_kwargs = json.loads(self.lr_scheduler_kwargs) + def _prepare_modules_to_save(self, modules_to_save) -> List[str]: if isinstance(modules_to_save, str): modules_to_save = [modules_to_save] @@ -780,6 +789,7 @@ def __post_init__(self) -> None: self.set_model_type() self.check_flash_attn() self.handle_generation_config() + self.handle_lr_scheduler_kwargs() self.is_multimodal = self._is_multimodal(self.model_type) self.lora_use_embedding = False @@ -973,7 +983,9 @@ def _init_training_args(self) -> None: num_train_epochs=self.num_train_epochs, max_steps=self.max_steps, lr_scheduler_type=self.lr_scheduler_type, + lr_scheduler_kwargs=self.lr_scheduler_kwargs, warmup_ratio=self.warmup_ratio, + warmup_steps=self.warmup_steps, logging_steps=self.logging_steps, save_strategy=self.save_strategy, save_steps=self.save_steps, @@ -1237,6 +1249,8 @@ def handle_infer_backend(self): self.stream = False logger.info('Setting self.stream: False') self.infer_media_type = template_info.get('infer_media_type', 'none') + self.media_type = template_info.get('media_type', 'image') + self.media_key = MediaTag.media_keys.get(self.media_type, 'images') if self.merge_device_map is None: self.merge_device_map = 'cpu' diff --git a/swift/llm/utils/dataset.py b/swift/llm/utils/dataset.py index 36100266a..21a103ba5 100644 --- a/swift/llm/utils/dataset.py +++ b/swift/llm/utils/dataset.py @@ -27,6 +27,8 @@ TextGenerationPreprocessor, preprocess_sharegpt) from .utils import download_dataset +dataset_enable_cache = strtobool(os.environ.get('DATASET_ENABLE_CACHE', 'False')) + def _update_fingerprint_mac(*args, **kwargs): mac = _find_local_mac().replace(':', '') @@ -146,6 +148,8 @@ class DatasetName: # for qwen-audio 
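The `read_media_file`/`llm_infer` changes above replace the hard-coded `images` key with a lookup through `MediaTag.media_keys`, so the same code path serves image, video and audio models. Below is a minimal standalone sketch of that idea; the contents of `MEDIA_KEYS` are an assumption inferred from the keys used elsewhere in this changeset (`images`, `videos`, `audios`), and the helper is illustrative rather than a copy of SWIFT's implementation.

```python
from typing import Any, Dict, List, Optional

# Assumed media_type -> kwargs-key mapping (inferred, not copied from MediaTag).
MEDIA_KEYS = {'image': 'images', 'video': 'videos', 'audio': 'audios'}

def append_media(infer_kwargs: Dict[str, Any], media_type: str, path: Optional[str]) -> None:
    """Store one media path (or None) under the key that matches media_type."""
    media_key = MEDIA_KEYS[media_type]
    media_files: List[Optional[str]] = infer_kwargs.get(media_key) or []
    media_files.append(path or None)
    infer_kwargs[media_key] = media_files

kwargs: Dict[str, Any] = {}
append_media(kwargs, 'video', 'videos/baby.mp4')
print(kwargs)  # {'videos': ['videos/baby.mp4']}
```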
aishell1_zh = 'aishell1-zh' aishell1_zh_mini = 'aishell1-zh-mini' + # for video + video_chatgpt = 'video-chatgpt' # rlhf hh_rlhf = 'hh-rlhf' @@ -375,7 +379,7 @@ def _post_preprocess( train_sample = dataset_sample - val_sample assert isinstance(val_sample, int) train_dataset, val_dataset = train_dataset.train_test_split( - test_size=val_sample, seed=get_seed(random_state), load_from_cache_file=False).values() + test_size=val_sample, seed=get_seed(random_state), load_from_cache_file=dataset_enable_cache).values() assert train_sample > 0 train_dataset = sample_dataset(train_dataset, train_sample, random_state) @@ -442,7 +446,8 @@ def preprocess_row(row): return {'image': [], 'conversations': []} return {'image': [image]} - dataset = dataset.map(preprocess_row, load_from_cache_file=False).filter(lambda row: row['conversations']) + dataset = dataset.map( + preprocess_row, load_from_cache_file=dataset_enable_cache).filter(lambda row: row['conversations']) return ConversationsPreprocessor( user_role='human', assistant_role='gpt', media_type='image', error_strategy='delete')( dataset) @@ -487,7 +492,7 @@ def preprocess_row(row): else: return {'images': []} - return dataset.map(preprocess_row, load_from_cache_file=False).filter(lambda row: row['images']) + return dataset.map(preprocess_row, load_from_cache_file=dataset_enable_cache).filter(lambda row: row['images']) def get_mantis_dataset(dataset_id: str, @@ -572,7 +577,7 @@ def preprocess_image(example): example['images'] = [] return example - dataset = dataset.map(preprocess_image, load_from_cache_file=False).filter(lambda row: row['images']) + dataset = dataset.map(preprocess_image, load_from_cache_file=dataset_enable_cache).filter(lambda row: row['images']) return ConversationsPreprocessor( user_role='user', assistant_role='assistant', @@ -663,7 +668,7 @@ def preprocess(row): 'query': np.random.choice(caption_prompt), } - return dataset.map(preprocess, load_from_cache_file=False) + return dataset.map(preprocess, load_from_cache_file=dataset_enable_cache) register_dataset( @@ -713,6 +718,37 @@ def _preprocess_aishell1_dataset(dataset: HfDataset) -> HfDataset: is_main=False) +def _preprocess_video_chatgpt(dataset: HfDataset) -> HfDataset: + url = 'https://modelscope.cn/datasets/huangjintao/VideoChatGPT/resolve/master/videos.zip' + local_dir = MediaCache.download(url, 'video_chatgpt') + local_dir = os.path.join(local_dir, 'Test_Videos') + # only `.mp4` + mp4_set = [file[:-4] for file in os.listdir(local_dir) if file.endswith('mp4')] + query = [] + response = [] + videos = [] + for d in dataset: + if d['video_name'] not in mp4_set: + continue + video_path = os.path.join(local_dir, f"{d['video_name']}.mp4") + assert os.path.exists(video_path) + question = d['question'] or d['question_1'] or d['question_2'] + assert question is not None + query.append(question) + response.append(d['answer']) + videos.append([video_path]) + return HfDataset.from_dict({'query': query, 'response': response, 'videos': videos}) + + +register_dataset( + DatasetName.video_chatgpt, + 'huangjintao/VideoChatGPT', ['Generic', 'Temporal', 'Consistency'], + _preprocess_video_chatgpt, + get_dataset_from_repo, + split=['test'], + tags=['chat', 'multi-modal', 'video', '🔥']) + + def _repair_agent_conversations(conversations: str, use_mini: bool) -> Optional[List[Dict[str, str]]]: if use_mini: pattern = r'\d\. 
{"plugin_name": "(.+?)"' @@ -758,7 +794,7 @@ def map_row(row): return response dataset = AlpacaPreprocessor()(dataset) - return dataset.map(map_row, load_from_cache_file=False) + return dataset.map(map_row, load_from_cache_file=dataset_enable_cache) register_dataset( @@ -785,7 +821,7 @@ def map_row(row): title = match.group(1) return {'response': title} - return dataset.map(map_row, load_from_cache_file=False).filter(lambda row: row['response']) + return dataset.map(map_row, load_from_cache_file=dataset_enable_cache).filter(lambda row: row['response']) register_dataset( @@ -966,7 +1002,8 @@ def reorganize_row(row): 'history': history, } - return dataset.map(reorganize_row, load_from_cache_file=False).filter(lambda row: row['query'] is not None) + return dataset.map( + reorganize_row, load_from_cache_file=dataset_enable_cache).filter(lambda row: row['query'] is not None) register_dataset( @@ -1031,7 +1068,7 @@ def row_can_be_parsed(row): return False return dataset.filter(row_can_be_parsed).map( - reorganize_row, load_from_cache_file=False).filter(lambda row: row['query']) + reorganize_row, load_from_cache_file=dataset_enable_cache).filter(lambda row: row['query']) register_dataset( @@ -1101,7 +1138,8 @@ def preprocess_image(example): return example dataset = dataset.map( - preprocess_image, load_from_cache_file=False).filter(lambda example: example['images'] is not None) + preprocess_image, + load_from_cache_file=dataset_enable_cache).filter(lambda example: example['images'] is not None) processer = ConversationsPreprocessor( user_role='human', assistant_role='gpt', media_type='image', media_key='images', error_strategy='delete') return processer(dataset) @@ -1146,8 +1184,8 @@ def preprocess(row): return {'response': '', 'image': None} return dataset.map( - preprocess, - load_from_cache_file=False).filter(lambda row: row.get('response')).rename_columns({'image': 'images'}) + preprocess, load_from_cache_file=dataset_enable_cache).filter(lambda row: row.get('response')).rename_columns( + {'image': 'images'}) def preprocess_refcoco_unofficial_caption(dataset): @@ -1173,7 +1211,7 @@ def preprocess(row): res['response'] = '' return res - return dataset.map(preprocess, load_from_cache_file=False).filter(lambda row: row.get('response')) + return dataset.map(preprocess, load_from_cache_file=dataset_enable_cache).filter(lambda row: row.get('response')) register_dataset( @@ -1218,7 +1256,7 @@ def preprocess(row): res['response'] = '' return res - return dataset.map(preprocess, load_from_cache_file=False).filter(lambda row: row.get('response')) + return dataset.map(preprocess, load_from_cache_file=dataset_enable_cache).filter(lambda row: row.get('response')) register_dataset( @@ -1287,7 +1325,8 @@ def preprocess_image(example): return example dataset = dataset.map( - preprocess_image, load_from_cache_file=False).filter(lambda example: example['images'] is not None) + preprocess_image, + load_from_cache_file=dataset_enable_cache).filter(lambda example: example['images'] is not None) processer = ConversationsPreprocessor( user_role='human', assistant_role='gpt', media_type='image', media_key='images', error_strategy='delete') return processer(dataset) @@ -1350,7 +1389,7 @@ def preprocess(row): else: return {'image': ''} - dataset = dataset.map(preprocess, load_from_cache_file=False).filter(lambda row: row['image']) + dataset = dataset.map(preprocess, load_from_cache_file=dataset_enable_cache).filter(lambda row: row['image']) return ConversationsPreprocessor( user_role='human', assistant_role='gpt', 
media_type='image', error_strategy='delete')( dataset) @@ -1376,7 +1415,7 @@ def reorganize_row(row): 'rejected_response': row['answer_en'], } - return dataset.map(reorganize_row, load_from_cache_file=False) + return dataset.map(reorganize_row, load_from_cache_file=dataset_enable_cache) def process_ultrafeedback_kto(dataset: HfDataset): @@ -1388,7 +1427,7 @@ def reorganize_row(row): 'label': row['label'], } - return dataset.map(reorganize_row, load_from_cache_file=False) + return dataset.map(reorganize_row, load_from_cache_file=dataset_enable_cache) register_dataset( @@ -1430,7 +1469,8 @@ def preprocess_row(row): 'response': output, } - return dataset.map(preprocess_row, load_from_cache_file=False).filter(lambda row: row['query'] and row['response']) + return dataset.map( + preprocess_row, load_from_cache_file=dataset_enable_cache).filter(lambda row: row['query'] and row['response']) register_dataset( @@ -1459,7 +1499,7 @@ def preprocess_row(row): 'response': response, } - return dataset.map(preprocess_row, load_from_cache_file=False) + return dataset.map(preprocess_row, load_from_cache_file=dataset_enable_cache) register_dataset( @@ -1501,7 +1541,7 @@ def preprocess(row): 'query': query, } - return dataset.map(preprocess, load_from_cache_file=False).rename_column('image', 'images') + return dataset.map(preprocess, load_from_cache_file=dataset_enable_cache).rename_column('image', 'images') register_dataset( @@ -1524,7 +1564,7 @@ def preprocess(row): 'query': query, } - return dataset.map(preprocess, load_from_cache_file=False).rename_column('image', 'images') + return dataset.map(preprocess, load_from_cache_file=dataset_enable_cache).rename_column('image', 'images') register_dataset( @@ -1548,7 +1588,7 @@ def preprocess(row): 'query': query, } - return dataset.map(preprocess, load_from_cache_file=False).rename_column('image', 'images') + return dataset.map(preprocess, load_from_cache_file=dataset_enable_cache).rename_column('image', 'images') register_dataset( @@ -1570,7 +1610,8 @@ def preprocess_row(row): return {'query': query, 'response': f'{solution}\nSo the final answer is:{response}'} return dataset.map( - preprocess_row, load_from_cache_file=False).filter(lambda row: row['image']).rename_columns({'image': 'images'}) + preprocess_row, + load_from_cache_file=dataset_enable_cache).filter(lambda row: row['image']).rename_columns({'image': 'images'}) register_dataset( @@ -1624,7 +1665,7 @@ def preprocess_row(row): return {'images': images, 'response': response, 'objects': json.dumps(objects or [], ensure_ascii=False)} - return dataset.map(preprocess_row, load_from_cache_file=False).filter(lambda row: row['objects']) + return dataset.map(preprocess_row, load_from_cache_file=dataset_enable_cache).filter(lambda row: row['objects']) register_dataset( @@ -1651,7 +1692,7 @@ def preprocess_row(row): else: return {'query': '', 'response': '', 'images': ''} - return dataset.map(preprocess_row, load_from_cache_file=False).filter(lambda row: row['query']) + return dataset.map(preprocess_row, load_from_cache_file=dataset_enable_cache).filter(lambda row: row['query']) register_dataset( @@ -1684,7 +1725,7 @@ def preprocess_row(row): return {'messages': rounds} dataset = dataset.map( - preprocess_row, load_from_cache_file=False).map( + preprocess_row, load_from_cache_file=dataset_enable_cache).map( ConversationsPreprocessor( user_role='user', assistant_role='assistant', @@ -1694,7 +1735,7 @@ def preprocess_row(row): media_key='images', media_type='image', ).preprocess, - load_from_cache_file=False) + 
load_from_cache_file=dataset_enable_cache) return dataset @@ -1751,8 +1792,8 @@ def preprocess(row): } return dataset.map( - preprocess, - load_from_cache_file=False).filter(lambda r: r['source'] != 'toxic-dpo-v0.2' and r['query'] is not None) + preprocess, load_from_cache_file=dataset_enable_cache).filter( + lambda r: r['source'] != 'toxic-dpo-v0.2' and r['query'] is not None) register_dataset( @@ -1778,7 +1819,7 @@ def preprocess(row): 'response': response, } - return dataset.map(preprocess, load_from_cache_file=False) + return dataset.map(preprocess, load_from_cache_file=dataset_enable_cache) register_dataset( @@ -2080,7 +2121,7 @@ def reorganize_row(row): 'response': convs[-1]['value'] } - return dataset.map(reorganize_row, load_from_cache_file=False) + return dataset.map(reorganize_row, load_from_cache_file=dataset_enable_cache) register_dataset( @@ -2260,6 +2301,7 @@ def _preprocess_self_cognition_dataset( if dataset is None: res_d_list.append(dataset) continue + query = [] response = [] for d in dataset: if d['tag'] == 'zh': @@ -2267,9 +2309,13 @@ def _preprocess_self_cognition_dataset( else: model_n, model_a = model_name[1], model_author[1] + q = d['query'].replace('{{NAME}}', model_n).replace('{{AUTHOR}}', model_a) r = d['response'].replace('{{NAME}}', model_n).replace('{{AUTHOR}}', model_a) + query.append(q) response.append(r) - dataset = dataset.remove_columns('response').add_column('response', response).remove_columns('tag') + dataset = dataset.remove_columns('response').add_column('response', response) + dataset = dataset.remove_columns('query').add_column('query', query) + dataset = dataset.remove_columns('tag') res_d_list.append(dataset) return tuple(res_d_list) diff --git a/swift/llm/utils/media.py b/swift/llm/utils/media.py index 3b83942bd..59321d1e1 100644 --- a/swift/llm/utils/media.py +++ b/swift/llm/utils/media.py @@ -1,10 +1,10 @@ import os import shutil -from typing import Any, Dict, List, Literal, Optional, Union +from typing import Any, Dict, Literal, Optional, Union import numpy as np +from modelscope.hub.utils.utils import get_cache_dir -from swift.hub.utils.utils import get_cache_dir from swift.utils import get_logger logger = get_logger() @@ -125,10 +125,24 @@ def get_url(media_type): return f'{MediaCache.URL_PREFIX}{media_type}.{extension}' @staticmethod - def download(media_type, media_name=None): - from swift.utils import safe_ddp_context + def download(media_type_or_url: str, local_alias: Optional[str] = None): + """Download and extract a resource from a http link. + + Args: + media_type_or_url: `str`, Either belongs to the `media_type_urls` listed in the class field, or a + remote url to download and extract. Be aware that, this media type or url + needs to contain a zip or tar file. + local_alias: `Options[str]`, The local alias name for the `media_type_or_url`. If the first arg is a + media_type listed in this class, local_alias can leave None. else please pass in a name for the url. + The local dir contains the extracted files will be: {cache_dir}/{local_alias} + + Returns: + The local dir contains the extracted files. 
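A minimal usage sketch for the download helper documented above, mirroring the call made in `_preprocess_video_chatgpt` earlier in this changeset; the URL and local alias are the ones used there. Note that running it really downloads and extracts the referenced archive.

```python
import os

from swift.llm.utils import MediaCache  # exported via swift/llm/utils/__init__.py in this changeset

url = 'https://modelscope.cn/datasets/huangjintao/VideoChatGPT/resolve/master/videos.zip'
local_dir = MediaCache.download(url, 'video_chatgpt')  # returns the extraction directory
videos_dir = os.path.join(local_dir, 'Test_Videos')
print(sorted(os.listdir(videos_dir))[:3])
```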
+ """ + from swift.utils import safe_ddp_context, FileLockContext with safe_ddp_context(): - return MediaCache._safe_download(media_type=media_type, media_name=media_name) + with FileLockContext(media_type_or_url): + return MediaCache._safe_download(media_type=media_type_or_url, media_name=local_alias) @staticmethod def _safe_download(media_type, media_name=None): diff --git a/swift/llm/utils/model.py b/swift/llm/utils/model.py index 8bfe6ea53..02619ab15 100644 --- a/swift/llm/utils/model.py +++ b/swift/llm/utils/model.py @@ -15,6 +15,7 @@ import transformers from modelscope import (AutoConfig, AutoModel, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, GenerationConfig, GPTQConfig, snapshot_download) +from modelscope.hub.utils.utils import get_cache_dir from packaging import version from torch import Tensor from torch import dtype as Dtype @@ -25,7 +26,6 @@ from transformers.utils.versions import require_version from swift import get_logger -from swift.hub.utils.utils import get_cache_dir from swift.utils import get_dist_setting, safe_ddp_context, subprocess_run, use_torchacc from .template import TemplateType from .utils import get_max_model_len, is_unsloth_available @@ -151,6 +151,7 @@ class ModelType: glm4_9b = 'glm4-9b' glm4_9b_chat = 'glm4-9b-chat' glm4_9b_chat_1m = 'glm4-9b-chat-1m' + codegeex4_9b_chat = 'codegeex4-9b-chat' # llama2 llama2_7b = 'llama2-7b' llama2_7b_chat = 'llama2-7b-chat' @@ -198,6 +199,11 @@ class ModelType: llama3_llava_next_8b = 'llama3-llava-next-8b' llava_next_72b = 'llava-next-72b' llava_next_110b = 'llava-next-110b' + # llava_next_video + llava_next_video_7b_instruct = 'llava-next-video-7b-instruct' + llava_next_video_7b_32k_instruct = 'llava-next-video-7b-32k-instruct' + llava_next_video_7b_dpo_instruct = 'llava-next-video-7b-dpo-instruct' + llava_next_video_34b_instruct = 'llava-next-video-34b-instruct' # yi yi_6b = 'yi-6b' yi_6b_200k = 'yi-6b-200k' @@ -260,11 +266,16 @@ class ModelType: internlm2_math_20b_chat = 'internlm2-math-20b-chat' # internlm-xcomposer2 internlm_xcomposer2_7b_chat = 'internlm-xcomposer2-7b-chat' + internlm_xcomposer2_5_7b_chat = 'internlm-xcomposer2_5-7b-chat' # internvl internvl_chat_v1_5 = 'internvl-chat-v1_5' internvl_chat_v1_5_int8 = 'internvl-chat-v1_5-int8' mini_internvl_chat_2b_v1_5 = 'mini-internvl-chat-2b-v1_5' mini_internvl_chat_4b_v1_5 = 'mini-internvl-chat-4b-v1_5' + internvl2_2b = 'internvl2-2b' + internvl2_4b = 'internvl2-4b' + internvl2_8b = 'internvl2-8b' + internvl2_26b = 'internvl2-26b' # deepseek deepseek_7b = 'deepseek-7b' deepseek_7b_chat = 'deepseek-7b-chat' @@ -1426,41 +1437,6 @@ def remove_property(tokenizer_cls: Type[PreTrainedTokenizerBase], tokenizer_conf setattr(tokenizer_cls, k, tokenizer_config[k]) -@register_model( - ModelType.glm4_9b, - 'ZhipuAI/glm-4-9b', - LoRATM.chatglm, - TemplateType.chatglm_generation, - support_vllm=True, - requires=['transformers<4.42'], - hf_model_id='THUDM/glm-4-9b') -@register_model( - ModelType.glm4_9b_chat, - 'ZhipuAI/glm-4-9b-chat', - LoRATM.chatglm, - TemplateType.chatglm3, - support_vllm=True, - function_kwargs={'kv_cache_patch': True}, - requires=['transformers<4.42'], - hf_model_id='THUDM/glm-4-9b-chat') -@register_model( - ModelType.glm4_9b_chat_1m, - 'ZhipuAI/glm-4-9b-chat-1m', - LoRATM.chatglm, - TemplateType.chatglm3, - support_vllm=True, - function_kwargs={'kv_cache_patch': True}, - requires=['transformers<4.42'], - hf_model_id='THUDM/glm-4-9b-chat-1m') -@register_model( - ModelType.glm4v_9b_chat, - 'ZhipuAI/glm-4v-9b', - LoRATM.glm4v, - 
TemplateType.glm4v, - eos_token='<|endoftext|>', - requires=['transformers<4.42'], - tags=['multi-modal', 'vision'], - hf_model_id='THUDM/glm-4v-9b') @register_model( ModelType.codefuse_codegeex2_6b_chat, 'codefuse-ai/CodeFuse-CodeGeeX2-6B', @@ -1532,7 +1508,6 @@ def get_model_tokenizer_chatglm(model_dir: str, model_kwargs: Dict[str, Any], load_model: bool = True, **kwargs): - kv_cache_patch = kwargs.pop('kv_cache_patch', False) if model_kwargs.get('quantization_config') is not None: model_kwargs['quantization_config'].llm_int8_skip_modules = ['output_layer'] # fix transformers>=4.34 bug @@ -1544,7 +1519,6 @@ def get_model_tokenizer_chatglm(model_dir: str, remove_property(tokenizer_cls, tokenizer_config) kwargs['tokenizer'] = tokenizer_cls.from_pretrained(model_dir, trust_remote_code=True) model, tokenizer = get_model_tokenizer_from_repo(model_dir, torch_dtype, model_kwargs, load_model, **kwargs) - tokenizer.init_kwargs['image_size'] = 1120 if model is not None: from torch.nn import CrossEntropyLoss __old_forward = CrossEntropyLoss.forward @@ -1555,17 +1529,90 @@ def cross_entropy_forward(self, inputs: Tensor, target: Tensor) -> Tensor: CrossEntropyLoss.forward = cross_entropy_forward - if kv_cache_patch: - device = next(model.parameters()).device.type + return model, tokenizer + + +@register_model( + ModelType.codegeex4_9b_chat, + 'ZhipuAI/codegeex4-all-9b', + LoRATM.chatglm, + TemplateType.codegeex4, + support_vllm=True, + support_flash_attn=True, + tags=['coding'], + requires=['transformers<4.42'], + hf_model_id='THUDM/codegeex4-all-9b') +@register_model( + ModelType.glm4_9b, + 'ZhipuAI/glm-4-9b', + LoRATM.chatglm, + TemplateType.chatglm_generation, + support_vllm=True, + support_flash_attn=True, + requires=['transformers<4.42'], + hf_model_id='THUDM/glm-4-9b') +@register_model( + ModelType.glm4_9b_chat, + 'ZhipuAI/glm-4-9b-chat', + LoRATM.chatglm, + TemplateType.chatglm3, + support_flash_attn=True, + support_vllm=True, + requires=['transformers<4.42'], + hf_model_id='THUDM/glm-4-9b-chat') +@register_model( + ModelType.glm4_9b_chat_1m, + 'ZhipuAI/glm-4-9b-chat-1m', + LoRATM.chatglm, + TemplateType.chatglm3, + support_flash_attn=True, + support_vllm=True, + requires=['transformers<4.42'], + hf_model_id='THUDM/glm-4-9b-chat-1m') +def get_model_tokenizer_glm4(model_dir: str, + torch_dtype: Dtype, + model_kwargs: Dict[str, Any], + load_model: bool = True, + **kwargs): + model_config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True) + use_flash_attn = kwargs.pop('use_flash_attn', False) + if use_flash_attn: + model_config._attn_implementation = 'flash_attention_2' + else: + model_config._attn_implementation = 'eager' + return get_model_tokenizer_chatglm( + model_dir, torch_dtype, model_kwargs, load_model, model_config=model_config, **kwargs) + - def _output_device_map_hook(module, input, output): - kv_cache = output[1] - if kv_cache is not None and isinstance(kv_cache, torch.Tensor): - kv_cache = kv_cache.to(f'{device}:0') - return output[0], kv_cache +@register_model( + ModelType.glm4v_9b_chat, + 'ZhipuAI/glm-4v-9b', + LoRATM.glm4v, + TemplateType.glm4v, + eos_token='<|endoftext|>', + requires=['transformers<4.42'], + tags=['multi-modal', 'vision'], + hf_model_id='THUDM/glm-4v-9b') +def get_model_tokenizer_glm4v(model_dir: str, + torch_dtype: Dtype, + model_kwargs: Dict[str, Any], + load_model: bool = True, + **kwargs): + model, tokenizer = get_model_tokenizer_glm4(model_dir, torch_dtype, model_kwargs, load_model, **kwargs) + # fix device_map 4 + n_gpu = 
torch.cuda.device_count() + local_world_size = get_dist_setting()[3] + + def _output_device_map_hook(module, input, output): + return output.to(input[0].device) - for layer in model.transformer.encoder.layers: - layer.register_forward_hook(_output_device_map_hook) + if n_gpu // local_world_size >= 4: + for layer in model.transformer.vision.transformer.layers: + layer.mlp.register_forward_hook(_output_device_map_hook) + layer.post_attention_layernorm.register_forward_hook(_output_device_map_hook) + device = next(model.transformer.vision.linear_proj.parameters()).device + model.transformer.vision.boi.data = model.transformer.vision.boi.to(device) + model.transformer.vision.eoi.data = model.transformer.vision.eoi.to(device) return model, tokenizer @@ -3598,6 +3645,46 @@ def _new_forward(*args, **kwargs): placeholder_tokens=[''], tags=['multi-modal', 'vision'], hf_model_id='OpenGVLab/Mini-InternVL-Chat-4B-V1-5') +@register_model( + ModelType.internvl2_2b, + 'OpenGVLab/InternVL2-2B', + LoRATM.internlm2, + TemplateType.internvl2, + requires=['transformers>=4.35', 'timm'], + support_flash_attn=True, + placeholder_tokens=[''], + tags=['multi-modal', 'vision'], + hf_model_id='OpenGVLab/InternVL2-2B') +@register_model( + ModelType.internvl2_4b, + 'OpenGVLab/InternVL2-4B', + LoRATM.internlm2, + TemplateType.internvl2, + requires=['transformers>=4.35', 'timm'], + support_flash_attn=True, + placeholder_tokens=[''], + tags=['multi-modal', 'vision'], + hf_model_id='OpenGVLab/InternVL2-4B') +@register_model( + ModelType.internvl2_8b, + 'OpenGVLab/InternVL2-8B', + LoRATM.internlm2, + TemplateType.internvl2, + requires=['transformers>=4.35', 'timm'], + support_flash_attn=True, + placeholder_tokens=[''], + tags=['multi-modal', 'vision'], + hf_model_id='OpenGVLab/InternVL2-8B') +@register_model( + ModelType.internvl2_26b, + 'OpenGVLab/InternVL2-26B', + LoRATM.internlm2, + TemplateType.internvl2, + requires=['transformers>=4.35', 'timm'], + support_flash_attn=True, + placeholder_tokens=[''], + tags=['multi-modal', 'vision'], + hf_model_id='OpenGVLab/InternVL2-26B') def get_model_tokenizer_internvl(model_dir: str, torch_dtype: Dtype, model_kwargs: Dict[str, Any], @@ -3701,6 +3788,15 @@ def new_get_rank(group=None): return model, tokenizer +@register_model( + ModelType.internlm_xcomposer2_5_7b_chat, + 'Shanghai_AI_Laboratory/internlm-xcomposer2d5-7b', + LoRATM.internlm2, + TemplateType.internlm_xcomposer2_5, + eos_token='<|im_end|>', + support_flash_attn=True, + tags=['multi-modal', 'vision'], + hf_model_id='internlm/internlm-xcomposer2d5-7b') @register_model( ModelType.internlm_xcomposer2_7b_chat, 'Shanghai_AI_Laboratory/internlm-xcomposer2-7b', @@ -5091,6 +5187,56 @@ def get_model_tokenizer_llava_next_yi(*args, **kwargs): return model, tokenizer +@register_model( + ModelType.llava_next_video_7b_dpo_instruct, + 'huangjintao/LLaVA-NeXT-Video-7B-DPO-hf', + LoRATM.llama, + TemplateType.llava_next_video, + support_flash_attn=True, + requires=['transformers>=4.42', 'av'], + tags=['multi-modal', 'video'], + hf_model_id='llava-hf/LLaVA-NeXT-Video-7B-DPO-hf') +@register_model( + ModelType.llava_next_video_7b_32k_instruct, + 'huangjintao/LLaVA-NeXT-Video-7B-32K-hf', + LoRATM.llama, + TemplateType.llava_next_video, + support_flash_attn=True, + requires=['transformers>=4.42', 'av'], + tags=['multi-modal', 'video'], + hf_model_id='llava-hf/LLaVA-NeXT-Video-7B-32K-hf') +@register_model( + ModelType.llava_next_video_7b_instruct, + 'huangjintao/LLaVA-NeXT-Video-7B-hf', + LoRATM.llama, + TemplateType.llava_next_video, + 
support_flash_attn=True, + requires=['transformers>=4.42', 'av'], + tags=['multi-modal', 'video'], + hf_model_id='llava-hf/LLaVA-NeXT-Video-7B-hf') +def get_model_tokenizer_llava_next_video(*args, **kwargs): + from transformers import LlavaNextVideoForConditionalGeneration + kwargs['automodel_class'] = LlavaNextVideoForConditionalGeneration + return get_model_tokenizer_llava_hf(*args, **kwargs) + + +@register_model( + ModelType.llava_next_video_34b_instruct, + 'huangjintao/LLaVA-NeXT-Video-34B-hf', + LoRATM.llama, + TemplateType.llava_next_video_yi, + support_flash_attn=True, + requires=['transformers>=4.42', 'av'], + tags=['multi-modal', 'video'], + hf_model_id='llava-hf/LLaVA-NeXT-Video-34B-hf') +def get_model_tokenizer_llava_next_video_yi(*args, **kwargs): + model, tokenizer = get_model_tokenizer_llava_next_video(*args, **kwargs) + if model is not None: + model.config.video_token_index = 64003 + model.config.image_token_index = 64004 + return model, tokenizer + + @register_model( ModelType.llama3_llava_next_8b, 'AI-Modelscope/llama3-llava-next-8b', diff --git a/swift/llm/utils/preprocess.py b/swift/llm/utils/preprocess.py index a0df4b4c9..f8bec1439 100644 --- a/swift/llm/utils/preprocess.py +++ b/swift/llm/utils/preprocess.py @@ -1,14 +1,17 @@ # Copyright (c) Alibaba, Inc. and its affiliates. import ast +import os from typing import Any, Callable, Dict, List, Literal, Optional, Union from datasets import Dataset as HfDataset from tqdm import tqdm +from transformers.utils import strtobool from .media import MediaTag from .template import History PreprocessFunc = Callable[[HfDataset], HfDataset] +dataset_enable_cache = strtobool(os.environ.get('DATASET_ENABLE_CACHE', 'False')) def _reduce_columns(cls: type) -> type: @@ -158,13 +161,13 @@ def preprocess(self, d: Dict[str, Any]) -> Dict[str, Any]: medias = self.parse_medias(d) self.media_replacer(row, medias) if self.media_type: - if not isinstance(self.media_key, str): - row[self.media_name] = medias + row[self.media_name] = medias return row def __call__(self, dataset: HfDataset) -> HfDataset: dataset = dataset.map( - self.preprocess, load_from_cache_file=False).filter(lambda row: row.get('response') is not None) + self.preprocess, + load_from_cache_file=dataset_enable_cache).filter(lambda row: row.get('response') is not None) if self.media_type and isinstance(self.media_key, str) and self.media_key != self.media_name: dataset = dataset.rename_columns({self.media_key: self.media_name}) return dataset @@ -248,8 +251,7 @@ def preprocess(self, d: Dict[str, Any]) -> Dict[str, Any]: medias = self.parse_medias(d) self.media_replacer(row, medias) if self.media_type: - if not isinstance(self.media_key, str): - row[self.media_name] = medias + row[self.media_name] = medias return row except (AssertionError, SyntaxError): if self.error_strategy == 'raise': @@ -259,7 +261,8 @@ def preprocess(self, d: Dict[str, Any]) -> Dict[str, Any]: def __call__(self, dataset: HfDataset) -> HfDataset: dataset = dataset.map( - self.preprocess, load_from_cache_file=False).filter(lambda row: row.get('response') is not None) + self.preprocess, + load_from_cache_file=dataset_enable_cache).filter(lambda row: row.get('response') is not None) if self.media_type and isinstance(self.media_key, str) and self.media_key != self.media_name: dataset = dataset.rename_columns({self.media_key: self.media_name}) return dataset @@ -303,8 +306,7 @@ def preprocess(self, d: Dict[str, Any]) -> Dict[str, Any]: medias = self.parse_medias(d) self.media_replacer(row, medias) if 
self.media_type: - if not isinstance(self.media_key, str): - row[self.media_name] = medias + row[self.media_name] = medias except Exception: if self.error_strategy == 'raise': raise ValueError(f'conversations: {conversations}') @@ -313,7 +315,8 @@ def preprocess(self, d: Dict[str, Any]) -> Dict[str, Any]: return row def __call__(self, dataset: HfDataset): - dataset = dataset.map(self.preprocess, load_from_cache_file=False).filter(lambda d: d.get('response')) + dataset = dataset.map( + self.preprocess, load_from_cache_file=dataset_enable_cache).filter(lambda d: d.get('response')) if self.media_type and isinstance(self.media_key, str) and self.media_key != self.media_name: dataset = dataset.rename_columns({self.media_key: self.media_name}) return dataset diff --git a/swift/llm/utils/template.py b/swift/llm/utils/template.py index 3b5045665..25dd45b2b 100644 --- a/swift/llm/utils/template.py +++ b/swift/llm/utils/template.py @@ -2,15 +2,17 @@ import re from copy import deepcopy from io import BytesIO -from typing import Any, Dict, List, Literal, Optional, Tuple, Union +from typing import Any, Callable, Dict, List, Literal, Optional, Tuple, TypeVar, Union import json +import numpy as np import requests import torch import torch.nn.functional as F from torch import Tensor from torch.nn.utils.rnn import pad_sequence from transformers import PreTrainedTokenizerBase, StoppingCriteria +from transformers.dynamic_module_utils import get_class_from_dynamic_module from swift.llm.agent.utils import calculate_loss_scale, get_tools_prompt from swift.torchacc_utils import pad_and_split_batch @@ -38,6 +40,7 @@ class TemplateType: baichuan = 'baichuan' chatglm2 = 'chatglm2' chatglm3 = 'chatglm3' + codegeex4 = 'codegeex4' llama = 'llama' # llama2 llama3 = 'llama3' llava1_5 = 'llava1_5' @@ -47,12 +50,16 @@ class TemplateType: llava_llama_instruct = 'llava-llama-instruct' llava_qwen_instruct = 'llava-qwen-instruct' llama_llava_next = 'llama-llava-next' + llava_next_video = 'llava-next-video' + llava_next_video_yi = 'llava-next-video-yi' openbuddy = 'openbuddy' openbuddy2 = 'openbuddy2' internlm = 'internlm' internlm2 = 'internlm2' internlm_xcomposer2 = 'internlm-xcomposer2' + internlm_xcomposer2_5 = 'internlm-xcomposer2_5' internvl = 'internvl' + internvl2 = 'internvl2' internvl_phi3 = 'internvl-phi3' florence = 'florence' yi = 'yi' @@ -674,6 +681,9 @@ def data_collator(self, batch: List[Dict[str, Any]], padding_to: Optional[int] = if len(image_sizes) > 0: res['image_sizes'] = torch.concat(image_sizes) + pixel_values_videos = [b['pixel_values_videos'] for b in batch if b.get('pixel_values_videos') is not None] + if len(pixel_values_videos) > 0: + res['pixel_values_videos'] = torch.concat(pixel_values_videos) if loss_scale is not None: res['loss_scale'] = loss_scale return res @@ -721,6 +731,8 @@ def generate_ids_to_response( if isinstance(self.suffix[-1], list) and (not is_finished or is_finished and generate_ids[-len(self.suffix[-1]):] == self.suffix[-1]): generate_ids = generate_ids[:-len(self.suffix[-1])] + if not is_finished or is_finished and generate_ids[-1:] == [self.tokenizer.eos_token_id]: + generate_ids = generate_ids[:-1] response = tokenizer.decode(generate_ids, **tokenizer_kwargs) if first_num_space is not None: # Avoid the occurrence of repeated words in sentence. 
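The change just above trims a trailing `eos_token_id` from `generate_ids` before decoding. Here is a small standalone illustration of that trimming rule; the token ids and the eos id below are made up for the example, and this is not SWIFT code.

```python
from typing import List

def strip_trailing_eos(generate_ids: List[int], eos_token_id: int, is_finished: bool) -> List[int]:
    # Drop the last id while the sequence is still streaming (it may change),
    # or when a finished sequence ends in eos; mirrors the condition added above.
    if not is_finished or generate_ids[-1:] == [eos_token_id]:
        return generate_ids[:-1]
    return generate_ids

print(strip_trailing_eos([5, 9, 2], eos_token_id=2, is_finished=True))   # [5, 9]
print(strip_trailing_eos([5, 9, 7], eos_token_id=2, is_finished=True))   # [5, 9, 7]
print(strip_trailing_eos([5, 9, 7], eos_token_id=2, is_finished=False))  # [5, 9]
```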
@@ -736,7 +748,11 @@ def generate_ids_to_response( response = response[cur_num_space - first_num_space:] if isinstance(self.suffix[-1], str) and (not is_finished or is_finished and response[-len(self.suffix[-1]):] == self.suffix[-1]): - response = response[:-len(self.suffix[-1])] + idx = max(len(response) - len(self.suffix[-1]), 0) + # To avoid response length being shorter than previous response length during streaming. + if print_idx is not None: + idx = max(idx, print_idx[0]) + response = response[:idx] if print_idx is not None: old_print_idx = print_idx[0] @@ -866,7 +882,7 @@ class QwenAudioGenerationTemplate(_QwenAudioTemplateMixin, DefaultGenerationTemp pass -register_template(TemplateType.qwen_audio, QwenAudioTemplate(), lazy_tokenize=True) +register_template(TemplateType.qwen_audio, QwenAudioTemplate(), lazy_tokenize=True, media_type='audio') register_template( TemplateType.qwen_audio_generation, QwenAudioGenerationTemplate(), lazy_tokenize=True, is_generation=True) @@ -913,12 +929,33 @@ def _load_image(img_path: Union[str, 'PIL.Image.Image']) -> 'PIL.Image.Image': return image -def _read_batch(path_list: List[Union[str, 'PIL.Image.Image', None]]) -> List['PIL.Image.Image']: +def _load_video(video_path: str) -> np.ndarray: + import av + container = av.open(video_path) + total_frames = container.streams.video[0].frames + indices = np.arange(0, total_frames, total_frames / 8).astype(int) + frames = [] + container.seek(0) + start_index = indices[0] + end_index = indices[-1] + for i, frame in enumerate(container.decode(video=0)): + if i > end_index: + break + if i >= start_index and i in indices: + frames.append(frame) + return np.stack([x.to_ndarray(format='rgb24') for x in frames]) + + +_T = TypeVar('_T') + + +def _read_batch(path_list: List[Union[str, 'PIL.Image.Image', None]], + load_func: Callable[[str], _T] = _load_image) -> List[_T]: res = [] for path in path_list: if path is None: # ignore None continue - res.append(_load_image(path)) + res.append(load_func(path)) return res @@ -1046,6 +1083,13 @@ def data_collator(self, batch: List[Dict[str, Any]], padding_to: Optional[int] = TemplateType.chatglm3, GLMTemplate([], ['<|user|>\n{{QUERY}}<|assistant|>\n'], [], ['<|user|>'], None, ['<|system|>\n{{SYSTEM}}'])) +codegeex4_system = '你是一位智能编程助手,你叫CodeGeeX。你会为用户回答关于编程、代码、计算机方面的任何问题,并提供格式规范、可以执行、准确安全的代码,并在必要时提供详细的解释。' + +register_template( + TemplateType.codegeex4, + GLMTemplate([], ['<|user|>\n{{QUERY}}<|assistant|>\n'], [], ['<|endoftext|>'], codegeex4_system, + ['<|system|>\n{{SYSTEM}}'])) + register_template( TemplateType.deepseek, Template([['bos_token_id']], ['User: {{QUERY}}\n\nAssistant:'], [['eos_token_id']], [['eos_token_id']], None, @@ -1141,14 +1185,15 @@ def replace_img_tag(query: str, history: History, replace_token: str) -> Tuple[s return new_query, new_history, images_path -class InternLMXComposer2(Template): - INTERNLM_XCOMPOSER2_SYSTEM = ( +class InternLMXComposer2Template(Template): + INTERNLM_XCOMPOSER_SYSTEM = ( 'You are an AI assistant whose name is InternLM-XComposer (浦语·灵笔).\n' '- InternLM-XComposer (浦语·灵笔) is a conversational language model that is developed by ' 'Shanghai AI Laboratory (上海人工智能实验室). 
@@ -1046,6 +1083,13 @@ def data_collator(self, batch: List[Dict[str, Any]], padding_to: Optional[int] =
     TemplateType.chatglm3,
     GLMTemplate([], ['<|user|>\n{{QUERY}}<|assistant|>\n'], [], ['<|user|>'], None, ['<|system|>\n{{SYSTEM}}']))
 
+codegeex4_system = '你是一位智能编程助手,你叫CodeGeeX。你会为用户回答关于编程、代码、计算机方面的任何问题,并提供格式规范、可以执行、准确安全的代码,并在必要时提供详细的解释。'
+
+register_template(
+    TemplateType.codegeex4,
+    GLMTemplate([], ['<|user|>\n{{QUERY}}<|assistant|>\n'], [], ['<|endoftext|>'], codegeex4_system,
+                ['<|system|>\n{{SYSTEM}}']))
+
 register_template(
     TemplateType.deepseek,
     Template([['bos_token_id']], ['User: {{QUERY}}\n\nAssistant:'], [['eos_token_id']], [['eos_token_id']], None,
@@ -1141,14 +1185,15 @@ def replace_img_tag(query: str, history: History, replace_token: str) -> Tuple[s
     return new_query, new_history, images_path
 
 
-class InternLMXComposer2(Template):
-    INTERNLM_XCOMPOSER2_SYSTEM = (
+class InternLMXComposer2Template(Template):
+    INTERNLM_XCOMPOSER_SYSTEM = (
         'You are an AI assistant whose name is InternLM-XComposer (浦语·灵笔).\n'
         '- InternLM-XComposer (浦语·灵笔) is a conversational language model that is developed by '
         'Shanghai AI Laboratory (上海人工智能实验室). '
         'It is designed to be helpful, honest, and harmless.\n'
         '- InternLM-XComposer (浦语·灵笔) can understand and communicate fluently in the language chosen '
         'by the user such as English and 中文.')
+    is_v2_5 = False
 
     def __init__(self):
         prefix = ['']
@@ -1156,7 +1201,7 @@ def __init__(self):
         chat_sep = ['[UNUSED_TOKEN_145]\n']
         suffix = ['[UNUSED_TOKEN_145]']
         system_prefix = ['[UNUSED_TOKEN_146]system\n{{SYSTEM}}[UNUSED_TOKEN_145]\n']
-        super().__init__(prefix, prompt, chat_sep, suffix, self.INTERNLM_XCOMPOSER2_SYSTEM, system_prefix)
+        super().__init__(prefix, prompt, chat_sep, suffix, self.INTERNLM_XCOMPOSER_SYSTEM, system_prefix)
 
     def encode(self, example: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
         example = example.copy()
@@ -1165,21 +1210,37 @@ def encode(self, example: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any
             history = []
         example['query'], example['history'], images_path = replace_img_tag(example['query'], history, '')
         inputs, _ = super().encode(example)
+        if len(inputs) == 0:
+            return inputs, {}
         dtype = self.model.dtype
         images_path.extend(example.get('images') or [])
         images = _read_batch(images_path)
-        for i, image in enumerate(images):
-            image = self.model.vis_processor(image)
-            images[i] = image.to(dtype)
-        if len(inputs) == 0:
-            return inputs, {}
+        if self.is_v2_5:
+            hd_num = 24
+            Image_transform = get_class_from_dynamic_module('ixc_utils.Image_transform', self.tokenizer.model_dir)
+            if len(images) > 1:
+                hd_num = 6
+            for i, image in enumerate(images):
+                image = Image_transform(image, hd_num=hd_num)
+                image = self.model.vis_processor(image)
+                image = image.to(dtype)
+                image = self.model.img2emb(image[None])[0]
+                assert image.shape[0] == 1
+                images[i] = image[0]
+        else:
+            for i, image in enumerate(images):
+                image = self.model.vis_processor(image)
+                images[i] = image.to(dtype)
         inputs.pop('loss_scale', None)
         input_ids = inputs['input_ids']
         labels = inputs['labels']
-        if len(images) > 0:  # # ignore
+        if len(images) > 0:  # ignore
             input_ids = input_ids[1:]
             if labels is not None:
                 labels = labels[1:]
+        if not self.is_v2_5:
+            images = torch.stack(images, dim=0)
+            images = self.model.encode_img(images)
         input_ids.append(2)  # add dummy
         if labels is not None:
             labels.append(2)
@@ -1190,11 +1251,6 @@ def encode(self, example: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any
         wrap_im_mask = []
         pre_i, i, idx = 0, 0, 0
         device = self.model.device
-        if len(images) > 0:
-            images = torch.stack(images, dim=0)
-            images = self.model.encode_img(images)
-        else:
-            images = None
         internlm2_model = self.model.model
         if not hasattr(internlm2_model, 'tok_embeddings'):
             internlm2_model = internlm2_model.model
@@ -1205,10 +1261,16 @@ def encode(self, example: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any
                 res_inputs_embeds.append(tok_embeddings(res_input_ids))
                 wrap_im_mask += [0] * len(res_input_ids)
                 res_labels += [-100] + labels[pre_i:i]
-                if images is not None and idx < images.shape[0]:
-                    res_inputs_embeds.append(images[idx])
-                    wrap_im_mask += [1] * images.shape[1]
-                    res_labels += [-100] * images.shape[1]
+                if self.is_v2_5:
+                    if len(images) > 0 and idx < len(images):
+                        res_inputs_embeds.append(images[idx])
+                        wrap_im_mask += [1] * images[idx].shape[0]
+                        res_labels += [-100] * images[idx].shape[0]
+                else:
+                    if len(images) > 0 and idx < images.shape[0]:
+                        res_inputs_embeds.append(images[idx])
+                        wrap_im_mask += [1] * images.shape[1]
+                        res_labels += [-100] * images.shape[1]
                 idx += 1
                 i += 1
                 pre_i = i
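# Note: toy, self-contained sketch (assumed shapes, not SWIFT code) of the interleaving
# pattern both branches above feed into: text spans are embedded with the LM's token
# embedding, each image's visual embedding is spliced in between them, and every image
# position gets im_mask = 1 and label -100 so it is ignored by the loss.
import torch

hidden = 8
tok_embeddings = torch.nn.Embedding(100, hidden)
text_spans = [[3, 4, 5], [9, 10]]        # token ids before and after one image placeholder
image_embeds = [torch.randn(6, hidden)]  # one image, 6 visual tokens (assumed shape)

inputs_embeds, im_mask, labels = [], [], []
for i, span in enumerate(text_spans):
    inputs_embeds.append(tok_embeddings(torch.tensor(span)))
    im_mask += [0] * len(span)
    labels += span                       # train on the text tokens
    if i < len(image_embeds):
        inputs_embeds.append(image_embeds[i])
        im_mask += [1] * image_embeds[i].shape[0]
        labels += [-100] * image_embeds[i].shape[0]

inputs_embeds = torch.concat(inputs_embeds)  # shape: (3 + 6 + 2, hidden)
assert inputs_embeds.shape[0] == len(im_mask) == len(labels) == 11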
@@ -1234,7 +1296,29 @@ def get_generate_ids(generate_ids: Tensor, input_token_len: int) -> List[int]:
 
 register_template(
     TemplateType.internlm_xcomposer2,
-    InternLMXComposer2(),
+    InternLMXComposer2Template(),
+    use_model=True,
+    lazy_tokenize=True,
+    dataloader_num_workers=0,
+    dataloader_pin_memory=False)
+
+
+class InternLMXComposer2_5Template(InternLMXComposer2Template):
+    INTERNLM_XCOMPOSER_SYSTEM = (
+        'You are an AI assistant whose name is InternLM-XComposer (浦语·灵笔).\n'
+        '- InternLM-XComposer (浦语·灵笔) is a multi-modality conversational language model '
+        'that is developed by Shanghai AI Laboratory (上海人工智能实验室). '
+        'It is designed to be helpful, honest, and harmless.\n'
+        '- InternLM-XComposer (浦语·灵笔) can understand and communicate fluently in the language chosen '
+        'by the user such as English and 中文.\n'
+        '- InternLM-XComposer (浦语·灵笔) is capable of comprehending and articulating responses effectively '
+        'based on the provided image.')
+    is_v2_5 = True
+
+
+register_template(
+    TemplateType.internlm_xcomposer2_5,
+    InternLMXComposer2_5Template(),
     use_model=True,
     lazy_tokenize=True,
     dataloader_num_workers=0,
@@ -1300,6 +1384,14 @@ def get_generate_ids(generate_ids: Tensor, input_token_len: int) -> List[int]:
         return generate_ids[0].tolist()
 
 
+class Internvl2Template(InternvlTemplate):
+
+    def __init__(self):
+        self.system = '你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。'
+        Template.__init__(self, [], ['<|im_start|>user\n{{QUERY}}<|im_end|><|im_start|>assistant\n'], ['<|im_end|>'],
+                          ['<|im_end|>'], self.system, ['<|im_start|>system\n{{SYSTEM}}<|im_end|>'])
+
+
 class InternvlPhi3Template(InternvlTemplate):
     system = 'You are an AI assistant whose name is Phi-3.'
@@ -1326,6 +1418,15 @@ def __init__(self):
     dataloader_num_workers=0,
     dataloader_pin_memory=False)
 
+register_template(
+    TemplateType.internvl2,
+    Internvl2Template(),
+    use_model=True,
+    lazy_tokenize=True,
+    infer_media_type='dialogue',
+    dataloader_num_workers=0,
+    dataloader_pin_memory=False)
+
 
 class FlorenceTemplate(Template):
@@ -1502,6 +1603,60 @@ def encode(self, example: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any
         return inputs, {}
 
 
+class LlavaVideoTemplate(Template):
+
+    def replace_tag(self, media_type: Literal['image', 'video', 'audio'], index, example) -> List[Context]:
+        assert media_type == 'video'
+        media_file = example['videos'][index]
+        if media_file.rsplit('.', 1)[-1] in {'jpg', 'png'}:
+            return ['<image>\n']
+        else:
+            return ['