Why are n-grams language independent?

Question

Bharathi

April 29, 2020, 13:26

I don't understand how n-grams are language independent. I've read that by using the character n-grams of a word, rather than the word itself, as dimensions of a vector space model, we can skip language-dependent pre-processing such as stemming and stop word removal.

Can someone please provide reasoning for this?

Topic vector-space-models stanford-nlp ngrams nlp

Category Data Science

S van Balen · Accepted Answer · April 29, 2020, 13:26

The way I read this, you are actually asking two questions:

How do character n-grams help to encode knowledge that is usually encoded with techniques such as stemming?
Why are n-grams language independent? I'm not entirely sure what you mean by this one, but I'll take a stab at it.

Character N-grams

Some languages (most languages that I know of, but some more than others) have grammatical rules that change the morphology of a word: one house, two houses. In a vanilla vector space model, house and houses are not identical, so they form two separate dimensions of your model, as if they were as different as house and apple.
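To make the problem concrete, here is a minimal sketch of a word-level vector space model in plain Python. The helper names (word_vector, cosine) and the two example texts are assumptions made for illustration only; they do not come from any particular library.

```python
from collections import Counter
import math


def word_vector(text: str) -> Counter:
    # Each distinct lowercased, whitespace-separated token is one dimension.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


singular = word_vector("one house")
plural = word_vector("two houses")

# "house" and "houses" occupy different dimensions, so nothing overlaps.
similarity = cosine(singular, plural)
print(similarity)
```

Because the two texts share no token, the similarity is zero, exactly as it would be for house and apple.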

We know that English applies these morphological operations to its words, and we can counter that by bending the words back into their 'stem'.

For instance, the Snowball stemmer would bend them back to a token that does collide:

House -> Hous

Houses -> Hous

Note: Hous is not what we would think of as a stem, hence the quotes around 'stem'.

N-gramming is basically splitting text into all contiguous subsequences of length N. If we apply that (N=3, ignoring start and stop markers for simplicity) to our example strings, we get something like:

House -> Hou, ous, use

Houses -> Hou, ous, use, ses

Note that we end up with 4 new dimensions, 3 of which collide. This reduces our previously identified need for stemming. We could argue that the two words still don't collide fully, but then again, we might not want them to.
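The trigram split above can be reproduced with a few lines of Python. This is only a sketch: char_ngrams is a hypothetical helper name, and like the example it ignores start and stop markers.

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    # Slide a window of width n over the word, one character at a time.
    return [word[i:i + n] for i in range(len(word) - n + 1)]


house = char_ngrams("House")
houses = char_ngrams("Houses")

shared = set(house) & set(houses)    # dimensions that collide
all_dims = set(house) | set(houses)  # all dimensions created by both words

print(house)   # ['Hou', 'ous', 'use']
print(houses)  # ['Hou', 'ous', 'use', 'ses']
print(len(shared), "of", len(all_dims), "dimensions collide")
```

The only parameter is N; no knowledge of English plural rules is needed to get the overlap.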

Language Independence

I don't think n-grams are language independent. Certainly not at the word level: some languages are more regular and some languages are more context-free. Some languages follow a pattern that looks more like adding words to the end as you go, and those languages are probably more suited to modelling with n-grams. This probably holds for character-based n-grams too.

However, as argued above, they add flexibility to the vector space model when it comes to handling morphology. Handling that morphology would normally require 'manually' encoding knowledge about the language at hand. With that more or less out of the way, your system needs less configuration for each language, which makes it less dependent on the language.
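As a rough illustration of both points, the sketch below runs one unconfigured trigram function over singular/plural pairs from three languages. The word pairs (English house/houses, Spanish casa/casas, Dutch huis/huizen) are my own picks for illustration, not part of the original argument, and shared_trigrams is a hypothetical helper.

```python
def char_ngrams(word: str, n: int = 3) -> set[str]:
    # Same windowing for every language: no stemmer, no stop word list.
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def shared_trigrams(a: str, b: str) -> set[str]:
    # Trigram dimensions on which the two word forms collide.
    return char_ngrams(a) & char_ngrams(b)


# Singular/plural pairs chosen for illustration (assumption, see lead-in).
pairs = [("house", "houses"), ("casa", "casas"), ("huis", "huizen")]

overlap = {a: len(shared_trigrams(a, b)) for a, b in pairs}
for a, b in pairs:
    print(a, b, sorted(shared_trigrams(a, b)))
```

The same code works on all three pairs without any language-specific setup, but the amount of overlap differs from pair to pair, which fits the view that n-grams reduce language dependence rather than remove it.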
