Many use cases of natural language processing (NLP) like text classification or sentiment analysis can be successfully implemented with today’s machine learning technology. In many NLP tutorials, the components of the processing pipeline have a batteries-included character and show few options or customizations. While this simplicity certainly makes it easier to get started, the approach is almost always limited to processing English-language content. Because English is the default, you don’t need many options. But what if your data is in a different language? I focused on German, but the methods shown here should work for other corpora or language models as well.
There is a growing list of tools that are ready to be used with non-English texts. The most obvious example among them is spaCy, a library from Explosion AI, who are based in Berlin. It is often used as a preprocessing tool for tokenization and sometimes to create POS or dependency tags. Since version 2, spaCy ships with at least basic language models for eight languages, including German. Their usage is explained in the spaCy docs and is quite simple. When spaCy is used directly from torchtext, however, the integration can be a bit tricky.
Initializing specific spaCy models from torchtext
You may have seen something like TEXT = data.Field(tokenize='spacy'). You can
use a string here to indicate that you want spaCy (or moses from NLTK, or revtok)
as the tokenizer, but you can’t provide any spaCy-specific options this way. To
get more control, you have to provide a function.
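A minimal sketch of such a function, under a few assumptions: the German model is installed under the package name de_core_news_sm, and the helper names make_tokenizer and build_german_field are made up for illustration. The imports sit inside build_german_field only so that the tokenizer factory can be tried out without spaCy or torchtext installed.

```python
def make_tokenizer(nlp):
    """Return a tokenize function bound to an already loaded spaCy pipeline."""
    def tokenize(text):
        # nlp.tokenizer runs only the tokenizer, not the tagger or parser
        return [tok.text for tok in nlp.tokenizer(text)]
    return tokenize


def build_german_field():
    """Create a torchtext Field that tokenizes with spaCy's German model."""
    import spacy
    from torchtext import data

    # Assumption: installed via `python -m spacy download de_core_news_sm`
    nlp = spacy.load('de_core_news_sm')
    # Field accepts any callable that maps a string to a list of tokens
    return data.Field(tokenize=make_tokenizer(nlp), lower=True)
```

Passing a callable instead of the string 'spacy' leaves the choice of model, and everything else about the spaCy pipeline, in your own code.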
Since torchtext 0.3.1, there seems to be a new keyword tokenizer_language to
address this type of problem. But there are other reasons to dig deeper here.
In the example above, spaCy only does tokenization. Even this is specific to
German, but it is still a fairly basic use case. With the German language model
of spaCy, you could just as easily return lemmas from such a function if you
wanted to train your model on lemmas only.
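A lemma-based variant could be sketched as follows. The name make_lemma_tokenizer is hypothetical, and nlp is assumed to be a loaded spaCy pipeline such as the German model.

```python
def make_lemma_tokenizer(nlp):
    """Return a function that maps a text to the list of its token lemmas."""
    def lemmatize(text):
        # Calling nlp(text) runs the whole pipeline and returns a Doc;
        # each token exposes its lemma as the string attribute lemma_
        return [tok.lemma_ for tok in nlp(text)]
    return lemmatize

# Possible usage (assumes spaCy, torchtext and the German model are installed):
#   nlp = spacy.load('de_core_news_sm')
#   TEXT = data.Field(tokenize=make_lemma_tokenizer(nlp))
```

Running the full pipeline is slower than tokenizing alone, which is worth keeping in mind for larger corpora.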
Loading of custom word vectors in torchtext
When dealing with pre-trained word vectors for token representation in your
neural network, a similar situation presents itself. There is a keyword-based
way to load vectors like TEXT.build_vocab(trn_ds, vectors='glove.6B.300d'). This
is simple for those pre-trained vectors that torchtext is able to handle. In case
you want to use some custom vectors or simply publicly available data not known
by torchtext (like the German word vectors from Spinning Bytes), you can create
a Vectors object from the file yourself and pass it to build_vocab instead of a
string.
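Such a custom vector initialization could be sketched like this. torchtext's Vectors class takes the file name and its directory (called cache) as separate arguments, so the sketch splits a full path first. The helper names and the example path are assumptions for illustration only.

```python
import os


def split_vector_path(path):
    """Split a file path into the (cache, name) pair that Vectors expects."""
    cache, name = os.path.split(os.path.abspath(path))
    return cache, name


def load_custom_vectors(path):
    """Load word vectors from a local text file via torchtext."""
    from torchtext import vocab  # local import keeps the helper above standalone

    cache, name = split_vector_path(path)
    # Vectors reads the file at os.path.join(cache, name) and stores a
    # serialized .pt copy next to it to speed up later loads
    return vocab.Vectors(name=name, cache=cache)

# Possible usage (hypothetical path, TEXT and trn_ds as in the text above):
#   vectors = load_custom_vectors('/data/vectors/german_vectors.vec')
#   TEXT.build_vocab(trn_ds, vectors=vectors)
```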
By the way, since the keyword-based method actually fetches the word vector data from a remote server if it is not already cached, you may want to avoid it completely on production systems, even with GloVe vectors. In any case, I honestly don’t see many advantages in passing string constants around for essential program functionality.
Whenever you’re loading word vectors, don’t forget to copy those vectors into the embedding layer of your model later.
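A minimal sketch of that copy step, assuming the vectors are available as a 2-D tensor of shape (vocabulary size, embedding dimension), as TEXT.vocab.vectors is after build_vocab. The function name and the trainable flag are made up for this example.

```python
def embedding_from_vectors(vectors, trainable=True):
    """Create an nn.Embedding whose weights are a copy of the given vectors."""
    from torch import nn  # local import so the sketch has no module-level dependency

    num_embeddings, embedding_dim = vectors.shape
    embedding = nn.Embedding(num_embeddings, embedding_dim)
    # Overwrite the randomly initialized weights with the pre-trained vectors
    embedding.weight.data.copy_(vectors)
    # Set trainable=False to keep the pre-trained vectors fixed during training
    embedding.weight.requires_grad = trainable
    return embedding

# Possible usage inside a model's __init__ (TEXT as in the text above):
#   self.embedding = embedding_from_vectors(TEXT.vocab.vectors)
```

If the model already has an embedding layer of the right shape, calling copy_ on its weight.data directly achieves the same thing.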
Further information
I assembled a Gist with a working basic script.
Especially for processing German content, the links collected by GermEval Shared Task 2018 are really comprehensive and helpful. Thanks for that.
If you are interested in spaCy and the language models, keep an eye on version 2.1, which will ship with even more and larger models. There is a nightly build of 2.1a available for download to play with. With respect to tokenization and lemmatization, I did see differences compared to 2.0. However, there was no clear indication of a huge improvement in my simple trials. I was not able to get the new 2.1a models up and running with the 2.0 code; they worked only in combination with the nightly build. Please keep in mind that the current language models in spaCy were trained on texts from Wikipedia and are explicitly considered a less than perfect fit for social media conversations.
Versions used
- spaCy 2.0.16
- pytorch 0.4.1
- torchtext 0.3.1