add byte pair encoder
ddbourgin committed Jan 8, 2022
1 parent d4b8e0b commit 065f9b8
Showing 5 changed files with 473 additions and 351 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -27,7 +27,7 @@ For more details on the available models, see the [project documentation](https:
## Available models
<details>
<summary>Click to expand!</summary>

1. **Gaussian mixture model**
- EM training

@@ -168,6 +168,7 @@ For more details on the available models, see the [project documentation](https:
- Feature standardization
- One-hot encoding / decoding
- Huffman coding / decoding
- Byte pair encoding / decoding
- Term frequency-inverse document frequency (TF-IDF) encoding
- MFCC encoding

29 changes: 26 additions & 3 deletions docs/numpy_ml.preprocessing.nlp.rst
@@ -1,6 +1,14 @@
Natural language processing
###########################

``BytePairEncoder``
-------------------

.. autoclass:: numpy_ml.preprocessing.nlp.BytePairEncoder
    :members:
    :undoc-members:
    :inherited-members:

``HuffmanEncoder``
------------------

@@ -48,12 +56,27 @@ Natural language processing

.. autofunction:: numpy_ml.preprocessing.nlp.strip_punctuation

``tokenize_words``
-------------------

.. autofunction:: numpy_ml.preprocessing.nlp.tokenize_words

``tokenize_whitespace``
------------------------

.. autofunction:: numpy_ml.preprocessing.nlp.tokenize_whitespace

``tokenize_chars``
-------------------

.. autofunction:: numpy_ml.preprocessing.nlp.tokenize_chars

-``tokenize_words``
--------------------
+``tokenize_bytes_raw``
+-----------------------

-.. autofunction:: numpy_ml.preprocessing.nlp.tokenize_words
+.. autofunction:: numpy_ml.preprocessing.nlp.tokenize_bytes_raw

``bytes_to_chars``
-----------------------

.. autofunction:: numpy_ml.preprocessing.nlp.bytes_to_chars
1 change: 1 addition & 0 deletions numpy_ml/README.md
@@ -140,6 +140,7 @@ This repo includes code for the following models:
- Feature standardization
- One-hot encoding / decoding
- Huffman coding / decoding
- Byte pair encoding / decoding
- Term frequency-inverse document frequency (TF-IDF) encoding
- MFCC encoding

1 change: 1 addition & 0 deletions numpy_ml/preprocessing/README.md
@@ -6,6 +6,7 @@ The preprocessing module implements common data preprocessing routines.
- Word and character tokenization
- Punctuation and stop-word removal
- Vocabulary / unigram count objects
- Byte-pair encoding ([Gage, 1994](http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM); [Sennrich, Haddow, & Birch, 2015](https://arxiv.org/pdf/1508.07909.pdf))
- [Huffman tree](https://en.wikipedia.org/wiki/Huffman_coding) encoding / decoding
- Term frequency-inverse document frequency ([tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)) encoding
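
For readers unfamiliar with the byte-pair encoding entry in the list above, the core merge-learning loop can be sketched as follows. This is a minimal illustration in the spirit of Sennrich et al., not the `BytePairEncoder` class added in this commit: the function name `learn_bpe_merges` and its parameters are hypothetical, it works on characters rather than raw bytes, and the lexicographic tie-breaking rule is an assumption made only to keep the output deterministic.

```python
from collections import Counter


def learn_bpe_merges(words, n_merges):
    """Learn up to `n_merges` pair merges from a list of words.

    Hypothetical helper for illustration only; it is not part of numpy_ml.
    """
    # Represent each distinct word as a tuple of symbols with its count.
    vocab = Counter(tuple(w) for w in words)
    merges = []

    for _ in range(n_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single symbol

        # Most frequent pair; ties go to the lexicographically smallest
        # pair (an assumption, chosen so results are reproducible).
        best = min(pairs, key=lambda p: (-pairs[p], p))

        # Rewrite each word, fusing every occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += count
        vocab = new_vocab
        merges.append(best)

    return merges
```

Calling `learn_bpe_merges(["abab", "abc"], 2)` first merges the pair `("a", "b")`, which occurs three times, and then `("ab", "ab")`. An encoder would replay the learned merges in the same order to tokenize new text.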
