add byte pair encoder

tojo-soraai · Jan 8, 2022
1 parent d4b8e0b
commit 065f9b8

Showing 5 changed files with 473 additions and 351 deletions.
diff --git a/README.md b/README.md
@@ -27,7 +27,7 @@ For more details on the available models, see the [project documentation](https:
 ## Available models
 <details>
   <summary>Click to expand!</summary>
-  
+
 1. **Gaussian mixture model**
     - EM training
 
@@ -168,6 +168,7 @@ For more details on the available models, see the [project documentation](https:
     - Feature standardization
     - One-hot encoding / decoding
     - Huffman coding / decoding
+    - Byte pair encoding / decoding
     - Term frequency-inverse document frequency (TF-IDF) encoding
     - MFCC encoding
 

diff --git a/docs/numpy_ml.preprocessing.nlp.rst b/docs/numpy_ml.preprocessing.nlp.rst
@@ -1,6 +1,14 @@
 Natural language processing
 ###########################
 
+``BytePairEncoder``
+-------------------
+
+.. autoclass:: numpy_ml.preprocessing.nlp.BytePairEncoder
+	:members:
+	:undoc-members:
+	:inherited-members:
+
 ``HuffmanEncoder``
 ------------------
 
@@ -48,12 +56,27 @@ Natural language processing
 
 .. autofunction:: numpy_ml.preprocessing.nlp.strip_punctuation
 
+``tokenize_words``
+-------------------
+
+.. autofunction:: numpy_ml.preprocessing.nlp.tokenize_words
+
+``tokenize_whitespace``
+------------------------
+
+.. autofunction:: numpy_ml.preprocessing.nlp.tokenize_whitespace
+
 ``tokenize_chars``
 -------------------
 
 .. autofunction:: numpy_ml.preprocessing.nlp.tokenize_chars
 
-``tokenize_words``
--------------------
+``tokenize_bytes_raw``
+-----------------------
 
-.. autofunction:: numpy_ml.preprocessing.nlp.tokenize_words
+.. autofunction:: numpy_ml.preprocessing.nlp.tokenize_bytes_raw
+
+``bytes_to_chars``
+-----------------------
+
+.. autofunction:: numpy_ml.preprocessing.nlp.bytes_to_chars
diff --git a/numpy_ml/README.md b/numpy_ml/README.md
@@ -140,6 +140,7 @@ This repo includes code for the following models:
     - Feature standardization
     - One-hot encoding / decoding
     - Huffman coding / decoding
+    - Byte pair encoding / decoding
     - Term frequency-inverse document frequency (TF-IDF) encoding
     - MFCC encoding
 

diff --git a/numpy_ml/preprocessing/README.md b/numpy_ml/preprocessing/README.md
@@ -6,6 +6,7 @@ The preprocessing module implements common data preprocessing routines.
     - Word and character tokenization
     - Punctuation and stop-word removal
     - Vocabulary / unigram count objects
+    - Byte-pair encoding ([Gage, 1994](http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM); [Sennrich, Haddow, & Birch, 2015](https://arxiv.org/pdf/1508.07909.pdf))
     - [Huffman tree](https://en.wikipedia.org/wiki/Huffman_coding) encoding / decoding
     - Term frequency-inverse document frequency ([tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)) encoding
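
For readers unfamiliar with the technique this commit adds, the following is a minimal, self-contained sketch of byte pair encoding in the spirit of Gage (1994). It is not the `BytePairEncoder` implementation from this commit, whose source is not shown in the diff above. The helper names `learn_bpe`, `apply_merge`, `encode`, and `decode` are hypothetical and chosen only for illustration. The sketch assumes UTF-8 input and repeatedly merges the most frequent adjacent pair of byte tokens.

```python
from collections import Counter


def apply_merge(tokens, left, right):
    """Replace each non-overlapping occurrence of (left, right) with left + right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from `text` (hypothetical helper)."""
    # Start from single-byte tokens of the UTF-8 encoding.
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs (overlapping pairs are counted too).
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges would not compress
        merges.append((left, right))
        tokens = apply_merge(tokens, left, right)
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the learned merges in order."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        tokens = apply_merge(tokens, left, right)
    return tokens


def decode(tokens):
    """Invert `encode` by concatenating the byte tokens."""
    return b"".join(tokens).decode("utf-8")


if __name__ == "__main__":
    corpus = "aaabdaaabac"
    rules = learn_bpe(corpus, num_merges=3)
    encoded = encode(corpus, rules)
    print(rules)
    print(encoded)
    print(decode(encoded) == corpus)
```

Because the tokens are raw byte strings, decoding is plain concatenation, and text containing bytes never seen during training still round-trips: those bytes simply remain single-byte tokens.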