2
Is there anything that makes training a translation task easy?
(lemmy.dbzer0.com)
Welcome to Free Open-Source Artificial Intelligence!
We are a community dedicated to forwarding the availability and access to:
Free Open Source Artificial Intelligence (F.O.S.A.I.)
If you’re working with a well known language, then you can probably use NLTK to tokenize your words. Word2vec is also helpful if you want a word embedding approach. https://github.com/nltk/nltk
Thanks for the tips. After doing a bunch of searching, I found that what I needed was BPE, or byte-pair encoding. This allows the token set to contain sub-word sequences, which lets the tokenizer represent a unique constant like
0x0373
as['__sow', '0x', '03', '73', '__eow']
.