Is there a sensible notion of 'character embeddings'?

There are several popular word embeddings available (e.g., FastText and GloVe). In short, these embeddings are a tool for encoding words together with a sensible notion of their semantics (i.e., words with similar semantics map to nearly parallel vectors).

Question:

Is there a similar notion of character embedding?

By 'character embedding' I understand an algorithm that allows us to encode characters in order to capture some syntactic similarity (e.g., similarity of character shapes or contexts).

Tags: embeddings, word-embeddings, nlp

Category: Data Science


Yes, absolutely.

First, it's important to understand that word embeddings represent the semantics of a word well because they are trained on the word's context, i.e., the words that appear close to the target word. This is just another application of the old principle of distributional semantics.
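For intuition, here is a minimal sketch (plain Python, with a made-up toy sentence and window size) of the (target, context) pairs a skip-gram style model is trained on:

```python
# Minimal sketch: the (target, context) pairs a skip-gram model learns from.
# The sentence and window size are illustrative, not from the answer above.
sentence = "the cat sat on the mat".split()
window = 2  # how many neighbours on each side count as "context"

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

# Words that occur in similar contexts end up with similar embedding vectors.
print(pairs[:6])  # e.g. ('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ...
```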

Character embeddings are usually trained the same way, which means that the embedding vectors also represent the "usual neighbours" of a character. This can have various applications in string similarity, word tokenization, stylometry (representing an author's writing style), and probably more. For example, in languages with accented characters the embedding for é would be very similar to the one for e, and m and n would be closer to each other than x and f.
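As a rough illustration, here is a minimal sketch of training character embeddings with gensim's Word2Vec (assuming gensim 4.x; the word list and hyperparameters are made up for the example), simply treating each word as a sequence of character tokens:

```python
# Minimal sketch: character embeddings via gensim's Word2Vec (gensim 4.x API).
from gensim.models import Word2Vec

# Toy word list, for illustration only; a real corpus would be much larger.
words = ["ete", "été", "éte", "ménage", "menage", "maximum", "minimum",
         "fenetre", "fenêtre", "nef", "nez", "taxe", "fixe"]

# Treat each word as a "sentence" whose tokens are its characters,
# so a character's context is simply its neighbouring characters.
char_sentences = [list(word) for word in words]

model = Word2Vec(char_sentences, vector_size=16, window=2, min_count=1,
                 sg=1, epochs=200, seed=0)

# Characters appearing in similar contexts get similar vectors; on a real
# corpus 'e' and 'é' would typically score higher than an unrelated pair.
print(model.wv.similarity("e", "é"))
print(model.wv.similarity("x", "f"))
```

On a tiny toy corpus like this the similarity scores are noisy; with a realistic corpus the neighbourhood structure (accents, vowels vs. consonants, etc.) becomes much more stable.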
