Is there a sensible notion of 'character embeddings'?

There are several popular word embeddings available (e.g., FastText and GloVe). In short, these embeddings are a tool for encoding words together with a sensible notion of their semantics (i.e., words with similar semantics map to nearly parallel vectors).

Question:

Is there a similar notion of character embedding?

By 'character embedding' I understand an algorithm that allows us to encode characters in order to capture some syntactic similarity (e.g., similarity of character shapes or contexts).

Tags: embeddings, word-embeddings, nlp

Category: Data Science


Yes, absolutely.

First, it's important to understand that word embeddings represent the semantics of a word well because they are trained on the word's context, i.e., the words that appear close to the target word. This is just another application of the old principle of distributional semantics.
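For intuition, here is a minimal sketch (plain Python, with a made-up toy sentence and window size) of the (target, context) pairs a skip-gram style model is trained on:

```python
# Minimal sketch: the (target, context) pairs a skip-gram model learns from.
# The sentence and window size are illustrative, not from the answer above.
sentence = "the cat sat on the mat".split()
window = 2  # how many neighbours on each side count as "context"

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

# Words that occur in similar contexts end up with similar embedding vectors.
print(pairs[:6])  # e.g. ('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ...
```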

Character embeddings are usually trained the same way, which means that the embedding vectors also represent the "usual neighbours" of a character. This can have various applications in string similarity, word tokenization, stylometry (representing an author's writing style), and probably more. For example, in languages with accented characters the embedding for é would be very similar to the one for e, and m and n would be closer to each other than x and f.
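As a rough illustration, here is a minimal sketch of training character embeddings with gensim's Word2Vec (assuming gensim 4.x; the word list and hyperparameters are made up for the example), simply treating each word as a sequence of character tokens:

```python
# Minimal sketch: character embeddings via gensim's Word2Vec (gensim 4.x API).
from gensim.models import Word2Vec

# Toy word list, for illustration only; a real corpus would be much larger.
words = ["ete", "été", "éte", "ménage", "menage", "maximum", "minimum",
         "fenetre", "fenêtre", "nef", "nez", "taxe", "fixe"]

# Treat each word as a "sentence" whose tokens are its characters,
# so a character's context is simply its neighbouring characters.
char_sentences = [list(word) for word in words]

model = Word2Vec(char_sentences, vector_size=16, window=2, min_count=1,
                 sg=1, epochs=200, seed=0)

# Characters appearing in similar contexts get similar vectors; on a real
# corpus 'e' and 'é' would typically score higher than an unrelated pair.
print(model.wv.similarity("e", "é"))
print(model.wv.similarity("x", "f"))
```

On a tiny toy corpus like this the similarity scores are noisy; with a realistic corpus the neighbourhood structure (accents, vowels vs. consonants, etc.) becomes much more stable.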
