diff --git "a/docs/source/LLM/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md" "b/docs/source/LLM/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md"
index e8ab9de00..b82839ff3 100644
--- "a/docs/source/LLM/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md"
+++ "b/docs/source/LLM/\346\224\257\346\214\201\347\232\204\346\250\241\345\236\213\345\222\214\346\225\260\346\215\256\351\233\206.md"
@@ -484,6 +484,7 @@
 |chart-qa|[swift/ChartQA](https://modelscope.cn/datasets/swift/ChartQA/summary)||28299|43.1±5.5, min=29, max=77|en, vqa, quality|[HuggingFaceM4/ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA)|
 |chinese-c4|[None](https://modelscope.cn/datasets/None/summary)||-|Dataset is too huge, please click the original link to view the dataset stat.|pretrain, zh, quality|[shjwudp/chinese-c4](https://huggingface.co/datasets/shjwudp/chinese-c4)|
 |cinepile|[swift/cinepile](https://modelscope.cn/datasets/swift/cinepile/summary)||-|Dataset is too huge, please click the original link to view the dataset stat.|vqa, en, youtube, video|[tomg-group-umd/cinepile](https://huggingface.co/datasets/tomg-group-umd/cinepile)|
+|classical-chinese-translate|[swift/classical_chinese_translate](https://modelscope.cn/datasets/swift/classical_chinese_translate/summary)||6655|344.0±76.4, min=61, max=815|chat, play-ground|-|
 |codealpaca-20k|[AI-ModelScope/CodeAlpaca-20k](https://modelscope.cn/datasets/AI-ModelScope/CodeAlpaca-20k/summary)||20016|100.2±60.1, min=29, max=1776|code, en|[HuggingFaceH4/CodeAlpaca_20K](https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K)|
 |cosmopedia|[None](https://modelscope.cn/datasets/None/summary)|auto_math_text
khanacademy
openstax
stanford
stories
web_samples_v1
web_samples_v2
wikihow|-|Dataset is too huge, please click the original link to view the dataset stat.|multi-domain, en, qa|[HuggingFaceTB/cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)|
 |cosmopedia-100k|[swift/cosmopedia-100k](https://modelscope.cn/datasets/swift/cosmopedia-100k/summary)||100000|1024.5±243.1, min=239, max=2981|multi-domain, en, qa|[HuggingFaceTB/cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k)|
diff --git a/docs/source_en/LLM/Supported-models-datasets.md b/docs/source_en/LLM/Supported-models-datasets.md
index d2e6f57f6..fe1d71105 100644
--- a/docs/source_en/LLM/Supported-models-datasets.md
+++ b/docs/source_en/LLM/Supported-models-datasets.md
@@ -484,6 +484,7 @@ The table below introduces the datasets supported by SWIFT:
 |chart-qa|[swift/ChartQA](https://modelscope.cn/datasets/swift/ChartQA/summary)||28299|43.1±5.5, min=29, max=77|en, vqa, quality|[HuggingFaceM4/ChartQA](https://huggingface.co/datasets/HuggingFaceM4/ChartQA)|
 |chinese-c4|[None](https://modelscope.cn/datasets/None/summary)||-|Dataset is too huge, please click the original link to view the dataset stat.|pretrain, zh, quality|[shjwudp/chinese-c4](https://huggingface.co/datasets/shjwudp/chinese-c4)|
 |cinepile|[swift/cinepile](https://modelscope.cn/datasets/swift/cinepile/summary)||-|Dataset is too huge, please click the original link to view the dataset stat.|vqa, en, youtube, video|[tomg-group-umd/cinepile](https://huggingface.co/datasets/tomg-group-umd/cinepile)|
+|classical-chinese-translate|[swift/classical_chinese_translate](https://modelscope.cn/datasets/swift/classical_chinese_translate/summary)||6655|344.0±76.4, min=61, max=815|chat, play-ground|-|
 |codealpaca-20k|[AI-ModelScope/CodeAlpaca-20k](https://modelscope.cn/datasets/AI-ModelScope/CodeAlpaca-20k/summary)||20016|100.2±60.1, min=29, max=1776|code, en|[HuggingFaceH4/CodeAlpaca_20K](https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K)|
 |cosmopedia|[None](https://modelscope.cn/datasets/None/summary)|auto_math_text
khanacademy
openstax
stanford
stories
web_samples_v1
web_samples_v2
wikihow|-|Dataset is too huge, please click the original link to view the dataset stat.|multi-domain, en, qa|[HuggingFaceTB/cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)|
 |cosmopedia-100k|[swift/cosmopedia-100k](https://modelscope.cn/datasets/swift/cosmopedia-100k/summary)||100000|1024.5±243.1, min=239, max=2981|multi-domain, en, qa|[HuggingFaceTB/cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k)|
diff --git a/swift/llm/data/dataset_info.json b/swift/llm/data/dataset_info.json
index df33f3f52..6375824c5 100644
--- a/swift/llm/data/dataset_info.json
+++ b/swift/llm/data/dataset_info.json
@@ -149,6 +149,18 @@
         "tags": ["vqa", "en", "youtube", "video"],
         "huge_dataset": true
     },
+    "classical-chinese-translate": {
+        "dataset_id": "swift/classical_chinese_translate",
+        "conversations": {
+            "user_role": "user",
+            "assistant_role": "assistant",
+            "conversations_key": "conversations",
+            "from_key": "from",
+            "value_key": "value",
+            "error_strategy": "delete"
+        },
+        "tags": ["chat", "play-ground"]
+    },
     "tagengo-gpt4": {
         "dataset_id": "swift/tagengo-gpt4",
         "hf_dataset_id": "lightblue/tagengo-gpt4",
diff --git a/swift/llm/utils/preprocess.py b/swift/llm/utils/preprocess.py
index 21a8582d4..a0df4b4c9 100644
--- a/swift/llm/utils/preprocess.py
+++ b/swift/llm/utils/preprocess.py
@@ -84,8 +84,8 @@ def parse_medias(self, d: Dict[str, Any]):
     @property
     def empty_row(self):
         empty_row = {
-            'query': '',
-            'response': '',
+            'query': None,
+            'response': None,
             'tools': None,
             'system': None,
             'history': None,
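
The `conversations` block registered for `classical-chinese-translate` names the keys and roles used to read each dialogue, and `error_strategy: "delete"` suggests that malformed rows are dropped. The following is a minimal sketch of how such a config could be applied, not SWIFT's actual preprocessor: `CONFIG`, `EMPTY_ROW`, `convert_row`, and `convert_dataset` are hypothetical names, and the strict user/assistant alternation check is an assumption. It also illustrates one plausible reason for the `empty_row` change: with `None` instead of `''`, a dropped row can be told apart from a row whose query is genuinely empty.

```python
from typing import Any, Dict, List

# Hypothetical config mirroring the "conversations" block added in dataset_info.json.
CONFIG = {
    'user_role': 'user',
    'assistant_role': 'assistant',
    'conversations_key': 'conversations',
    'from_key': 'from',
    'value_key': 'value',
    'error_strategy': 'delete',
}

# Assumed placeholder for a rejected row; None marks "no data" unambiguously.
EMPTY_ROW = {'query': None, 'response': None, 'history': None}


def convert_row(row: Dict[str, Any], config: Dict[str, str] = CONFIG) -> Dict[str, Any]:
    """Turn one multi-turn record into query/response/history fields."""
    try:
        turns = row[config['conversations_key']]
        # Assumption: a valid dialogue is a non-empty list of user/assistant pairs.
        if len(turns) < 2 or len(turns) % 2 != 0:
            raise ValueError('expected alternating user/assistant turns')
        pairs: List[List[str]] = []
        for user_turn, assistant_turn in zip(turns[0::2], turns[1::2]):
            if (user_turn[config['from_key']] != config['user_role']
                    or assistant_turn[config['from_key']] != config['assistant_role']):
                raise ValueError('unexpected role order')
            pairs.append([user_turn[config['value_key']],
                          assistant_turn[config['value_key']]])
    except (KeyError, TypeError, ValueError):
        if config['error_strategy'] == 'delete':
            return dict(EMPTY_ROW)  # flagged for removal by the caller
        raise
    # The last pair becomes query/response; earlier pairs become history.
    *history, (query, response) = pairs
    return {'query': query, 'response': response, 'history': history}


def convert_dataset(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert all rows and drop those that were flagged as empty."""
    converted = [convert_row(row) for row in rows]
    return [row for row in converted if row['query'] is not None]
```

Under these assumptions, a record with a single unanswered user turn is converted to the empty row and filtered out, while a well-formed two-round dialogue yields the final round as `query`/`response` and the first round as `history`.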