Skip to content

yishn/chinese-tokenizer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

60 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

chinese-tokenizer Build Status

Simple algorithm to tokenize Chinese texts into words using CC-CEDICT. You can try it out at the demo page. The code for the demo page can be found in the gh-pages branch of this repository.

How this works

This tokenizer uses a simple greedy algorithm: It always looks for the longest word in the CC-CEDICT dictionary that matches the input, one at a time.

Installation

Use npm to install:

npm install chinese-tokenizer --save

Usage

Make sure to provide the CC-CEDICT data.

const tokenize = require('chinese-tokenizer').loadFile('./cedict_ts.u8')

console.log(JSON.stringify(tokenize('我是中国人。'), null, '  '))
console.log(JSON.stringify(tokenize('我是中國人。'), null, '  '))

Output:

[
  {
    "text": "我",
    "traditional": "我",
    "simplified": "我",
    "position": { "offset": 0, "line": 1, "column": 1 },
    "matches": [
      {
        "pinyin": "wo3",
        "pinyinPretty": "wǒ",
        "english": "I/me/my"
      }
    ]
  },
  {
    "text": "是",
    "traditional": "是",
    "simplified": "是",
    "position": { "offset": 1, "line": 1, "column": 2 },
    "matches": [
      {
        "pinyin": "shi4",
        "pinyinPretty": "shì",
        "english": "is/are/am/yes/to be"
      }
    ]
  },
  {
    "text": "中國人",
    "traditional": "中國人",
    "simplified": "中国人",
    "position": { "offset": 2, "line": 1, "column": 3 },
    "matches": [
      {
        "pinyin": "Zhong1 guo2 ren2",
        "pinyinPretty": "Zhōng guó rén",
        "english": "Chinese person"
      }
    ]
  },
  {
    "text": "。",
    "traditional": "。",
    "simplified": "。",
    "position": { "offset": 5, "line": 1, "column": 6 },
    "matches": []
  }
]

API

chineseTokenizer.loadFile(path)

Reads the CC-CEDICT file from given path and returns a tokenize function based on the dictionary.

chineseTokenizer.load(content)

Parses CC-CEDICT string content from content and returns a tokenize function based on the dictionary.

tokenize(text)

Tokenizes the given text string and returns an array with tokens of the following form:

{
  "text": <string>,
  "traditional": <string>,
  "simplified": <string>,
  "position": { "offset": <number>, "line": <number>, "column": <number> },
  "matches": [
    {
      "pinyin": <string>,
      "pinyinPretty": <string>,
      "english": <string>
    },
    ...
  ]
}