Simple-to-use scoring function for arbitrarily tokenized texts.
A causal intervention framework to learn robust and interpretable character representations inside subword-based language models
The concept of DAWGs is based on: Blumer, A. et al. (1985). The smallest automaton recognizing the subwords of a text. Theoretical Computer Science, 40, 31–55.
A framework for generating subword vocabulary from a tensorflow dataset and building custom BERT tokenizer models.
Tokenization is a way of separating a piece of text into smaller units called tokens. Tokens can be words, characters, or subwords. Hence, tokenization can be broadly classified into three types: word, character, and subword (n-gram character) tokenization.
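The three granularities above can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's tokenizer: the subword vocabulary and the greedy longest-match segmentation are toy assumptions chosen to make the idea concrete.

```python
def word_tokenize(text):
    # Word-level: split on whitespace.
    return text.split()

def char_tokenize(text):
    # Character-level: every character becomes a token.
    return list(text)

def subword_tokenize(word, vocab):
    # Subword-level: greedy longest-match against a fixed subword
    # vocabulary; a single unknown character falls back to itself.
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# Toy vocabulary, purely for illustration.
vocab = {"token", "ization", "un", "related"}
print(word_tokenize("tokenization is unrelated"))  # ['tokenization', 'is', 'unrelated']
print(char_tokenize("token"))                      # ['t', 'o', 'k', 'e', 'n']
print(subword_tokenize("tokenization", vocab))     # ['token', 'ization']
```

Real subword tokenizers (BPE, WordPiece, Unigram) learn the vocabulary from data rather than fixing it by hand, but the segmentation step follows the same spirit.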
johnny - a neural-network, graph-based DEPendency Parser
This repository contains source code implementation of assignments for NTU's MSAI course AI6127 on Deep Neural Networks for Natural Language Processing (2019 Sem 2).
Korean text normalization and language preparation package for LM in Kaldi-based ASR system
An implementation of subword division algorithm proposed in T. Mikolov (2012).
Keyword Search Recipe for Subword ASR
Unsupervised Word Segmentation using Minimum Description Length for Neural Machine Translation (NMT)
Subword-augmented Embedding for Cloze Reading Comprehension (COLING 2018)
Effective Subword Segmentation for Text Comprehension (TASLP 2019)
Subword Neural Machine Translation