outline of research
three premises
i build my proposal on three interrelated premises:
1. korean space-tokenization is not effective
english can be space-split rather effectively:
{Eight|out|of|ten|South|Korean|trading|companies|intend|to|take|part|in|projects|with|North|Korea}
but in korean, because of josa (particles) and agglutination, it is not so simple.
{검역본부는|전날|수입|컨테이너를|점검하다}
vs
{검역|본부|는|전날|수입|컨테이너|를|점검|하다}
if we split on spaces, the number of resulting space-delimited tokens (어절, eojeol) explodes rapidly: an “infinite vocabulary”.
also, we lose the connection between 어절 that share the same free morpheme:
사랑 != 사랑은 != 사랑을 != 사랑한다 != 사랑해요
etc. etc.
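a toy python sketch of the effect (the mini-corpus below is made up for illustration): every josa/ending combination yields a brand-new space-level token, so the stem 사랑 is never shared.

sentences = [
    "사랑 고백",          # made-up mini-corpus
    "사랑은 어렵다",
    "사랑을 말했다",
    "사랑한다 정말",
    "사랑해요 오늘도",
]
vocab = {tok for s in sentences for tok in s.split()}
print(sorted(t for t in vocab if t.startswith("사랑")))
# ['사랑', '사랑은', '사랑을', '사랑한다', '사랑해요'] <- five distinct types for one stem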
and practically, we need a trained sequence model to do morphological analysis, rather than a simple approach like space-tokenization. this model needs to be trained (either by us or by its creators), and is specific to one language (and arguably one domain).
english sub-word approaches to NER don’t translate well to korean
one common technique for english is using a character-CNN over each word to create an intra-word feature vector, concatenating it with the word-level embedding vector, and then assigning a word-level tag (see Huang et al. 2015; Ma & Hovy 2016).
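for reference, a minimal pytorch-style sketch of that recipe (the dimensions, layer choices, and names are my own assumptions, not the exact Huang et al. / Ma & Hovy architecture):

import torch
import torch.nn as nn

class CharCnnWordEncoder(nn.Module):
    def __init__(self, n_chars, n_words, char_dim=30, word_dim=100, n_filters=50, kernel=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.word_emb = nn.Embedding(n_words, word_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel_size=kernel, padding=kernel // 2)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        b, t, c = char_ids.shape
        chars = self.char_emb(char_ids).view(b * t, c, -1).transpose(1, 2)   # (b*t, char_dim, c)
        char_feat = torch.relu(self.conv(chars)).max(dim=2).values.view(b, t, -1)
        return torch.cat([self.word_emb(word_ids), char_feat], dim=-1)       # fed to a word-level tagger

encoder = CharCnnWordEncoder(n_chars=500, n_words=10000)
out = encoder(torch.zeros(2, 7, dtype=torch.long), torch.zeros(2, 7, 12, dtype=torch.long))
print(out.shape)  # torch.Size([2, 7, 150])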
however, this approach relies on having a word (or morpheme)-level segmentation (see premise #1).
if we instead use 어절-level tokens, we get what Misawa et al. term a boundary clash, where one token contains more than one entity label:
트럼프가:
트럼프 (PERSON) + 가 (NONE) <- one 어절, two labels: problem
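a tiny sketch of how such a clash can be detected from per-character gold labels (the helper and the label list are hypothetical):

def find_boundary_clashes(tokens, char_labels):
    # tokens must concatenate back to the labeled text, one label per character
    clashes, i = [], 0
    for tok in tokens:
        if len(set(char_labels[i:i + len(tok)])) > 1:    # token mixes two labels
            clashes.append(tok)
        i += len(tok)
    return clashes

labels = ["PERSON", "PERSON", "PERSON", "NONE"]           # 트 럼 프 가
print(find_boundary_clashes(["트럼프가"], labels))         # ['트럼프가'] <- 어절-level clash
print(find_boundary_clashes(["트럼프", "가"], labels))     # [] <- morpheme-level, no clash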
so assuming we have a good morpheme-level tokenization without boundary clash, we still have a problem: the character-CNN approach is not as effective, due to the relatively short length of morphemes in korean and japanese (~2 ‘syllables’, or 음절, which i call ‘characters’).
another option would be character-only input, but is that the best?
3. pure-character models are not as good as augmented models
just like pure word-embedding models do worse than models with word embeddings + other features (see Huang et al. 2015, Ma & Hovy 2016, and earlier papers with discrete engineered features), Misawa et al. 2017 show that supplementing character embeddings with additional features increases performance.
but they found that better than a CNN approach is simply adding the embedding of the morpheme that each character comes from: every character (korean: 음절) is paired with the embedding of its containing morpheme, and tags are assigned per character. so the input looks like:
char 안 녕 하 세 요
+ + + + +
morph 안녕 안녕 하세 하세 요
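a small sketch of how that character-to-morpheme pairing could be built from a morpheme segmentation (the split 안녕/하세/요 simply mirrors the example above):

def char_morph_pairs(morphemes):
    # pair every character (음절) with the morpheme it came from; tags are later assigned per character
    return [(ch, m) for m in morphemes for ch in m]

print(char_morph_pairs(["안녕", "하세", "요"]))
# [('안', '안녕'), ('녕', '안녕'), ('하', '하세'), ('세', '하세'), ('요', '요')]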
but this also means we need a good morpheme analyzer.
…or does it?
proposition
i propose adopting a data-driven tokenizer, in particular google’s sentencepiece, as a substitute. i refer to the outputs of this tokenizer as wordpieces (also called wordparts).
data-driven tokenization is popular in speech recognition, where tools like morfessor discover data-driven morpheme-like units for language modeling. but outside of google’s various systems, notably google NMT, i haven’t seen the technique applied much to text.
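as a concrete sketch, training and applying sentencepiece looks roughly like this (the corpus file, model prefix, and vocabulary size are placeholder choices, not my exact settings):

import sentencepiece as spm

# learn a fixed-size wordpiece vocabulary from raw, untagged text
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt --model_prefix=wp --vocab_size=8000 "
    "--model_type=unigram --character_coverage=0.9995"   # high character coverage is the usual choice for CJK scripts
)

# tokenize with the learned model
sp = spm.SentencePieceProcessor()
sp.Load("wp.model")
print(sp.EncodeAsPieces("2003년 6월 14일 사직 두산전 이후"))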
benefits
- data-driven! = unsupervised, easy
- greedy algorithm = fast and more consistent
- language-agnostic = can tokenize many languages at once
- fixed vocabulary = no expanding/infinite vocab
the final point is potentially the most critical. if we can create a fixed-vocabulary tokenization scheme, then we won’t have to deal with out-of-vocabulary elements (in reality a few may remain, but for all practical purposes the problem goes away). so a fixed-size model can be trained on any data.
furthermore, if it can process multiple languages at once, we reduce the number of components in a multi-language system.
oh, and unlike morfessor, google’s sentencepiece respects whitespace, so the transformation and its inverse match: detokenizing the pieces recovers the original text exactly.
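a quick sketch of that round trip, assuming the wp.model trained above:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("wp.model")

text = "검역본부는 전날 수입 컨테이너를 점검하다"
pieces = sp.EncodeAsPieces(text)           # spaces are preserved as the '▁' marker
assert sp.DecodePieces(pieces) == text     # the inverse transformation recovers the input exactly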
questions
- does it provide the same benefit over a character-only model that morphemes do?
- can it be used in a multi-language scenario (like google NMT)?
experiments
ps: these examples come from the CC open corpus
so i compare four input formats:
characters : the input split into individual 음절-level characters (digits and spaces included):
['2', '0', '0', '3', '년', ' ', '6', '월', ' ', '1', '4', '일', ' ', '사', '직', ' ', '두', '산', '전', ' ', '이', '후', ' ', '박', '명', '환', '에', '게', ' ', '당', '했', '던', ' ', '1', '0', '연', '패', ' ', '사', '슬', '을', ' ', '거', '의', ' ', '5', '년', ' ', '만', '에', ' ', '끊', '는', ' ', '의', '미', '있', '는', ' ', '승', '리', '였', '다', ' ']
then i test characters plus a secondary embedding, where the secondary tokens come from one of the following:
mecab/juman morphemes : these are tokens generated by morpheme analyzers for korean and japanese, respectively:
['2003년', '6월', '14일', '사직', '두산전', '이후', '박명환에게', '당했던', '10연패', '사슬을', '거의', '5년', '만에', '끊는', '의미있는', '승리였다', '.']
sentencepiece wordpieces : data-driven wordpiece tokens. while some align with linguistic morpheme boundaries, others don’t, so we cannot label these at the wordpiece level (boundary clash):
['▁2003', '년', '▁6', '월', '▁14', '일', '▁사직', '▁두산전', '▁이후', '▁박', '명', '환', '에게', '▁당', '했던', '▁10', '연패', '▁사', '슬', '을', '▁거의', '▁5', '년', '▁만에', '▁끊', '는', '▁의미', '있는', '▁승리', '였다', '▁', '.']
truncated-vocabulary mecab/juman morphemes : this is about network size. because the number of distinct morphemes ≫ the number of wordpieces, i truncated the morpheme vocabulary down to the size of the wordpiece vocabulary for a more equitable comparison. low-frequency morphemes were replaced with a generic out-of-vocabulary UNK token.
note: this is simulated output
for example, specific dates like 14일 and tokens containing names like 박명환 are probably low-frequency and so are removed.
['2003년', '6월', 'UNK', '사직', '두산전', '이후', 'UNK', '당했던', '10연패', '사슬을', '거의', '5년', '만에', '끊는', '의미있는', '승리였다', '.']
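a minimal sketch of that truncation (the cutoff k and the toy corpus here are illustrative assumptions):

from collections import Counter

def truncate_vocab(tokenized_corpus, k):
    # keep only the k most frequent morphemes; everything else becomes UNK
    counts = Counter(tok for sent in tokenized_corpus for tok in sent)
    keep = {tok for tok, _ in counts.most_common(k)}
    return [[tok if tok in keep else "UNK" for tok in sent] for sent in tokenized_corpus]

corpus = [['2003년', '6월', '14일', '사직', '두산전', '이후']]   # toy corpus of one sentence
print(truncate_vocab(corpus, k=4))        # everything outside the kept top-4 becomes 'UNK'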
results
for korean, the proposed wordpiece model matches the performance of the mecab morpheme model.
for japanese, the proposed wordpiece model performs worse than juman morphemes, but still outperforms character-only. however, all results here are poor because the dataset is small relative to the number of vocabulary elements.
multilingual test
for the multilingual setting, only character-only vs. character + wordpiece is tested.
this is because, if we wanted to use mecab-ko or juman, we would first need a language-detection step, and both analyzers would have to be installed and running, so the comparison would be artificial.
instead, we use a single sentencepiece model trained on both korean and japanese, and we train and evaluate on both languages.
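a rough sketch of how such a joint model could be trained (file names and vocab size are placeholders; sentencepiece accepts a comma-separated list of input files):

import sentencepiece as spm

# one shared wordpiece model over both languages
spm.SentencePieceTrainer.Train(
    "--input=korean.txt,japanese.txt --model_prefix=wp_koja "
    "--vocab_size=16000 --character_coverage=0.9995"
)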
this model also outperforms the baseline character model. and if we break the results down by language, per-language performance is essentially unchanged.
conclusion
this approach does not remove tokenization, but it provides a way to create a fixed-size token inventory with a number of benefits (no OOV, and embeddings that are trained more evenly because of reduced sparsity).
bonus: it can realistically be applied to multi-language situations without suffering a performance penalty, and without any language classification or language-dependent modules.
discussion
while my model outperforms recent papers, i attribute some of this gain to my larger dataset. conversely, i believe the poor japanese performance is due to the small dataset size.
i believe there should be free, open, easily accessible, and sufficiently large datasets in korean and japanese that can serve as a baseline for inter-study comparison, much like the CoNLL 2002 and CoNLL 2003 datasets do for english and other european languages.
future research may try applying a similar approach to languages with shared alphabets.