The way I read this, you are actually asking two questions:
- How do character n-grams help to encode knowledge that is often encoded with techniques such as stemming?
- Why are n-grams language independent? I'm not totally sure what you mean by this one, but I'll take a stab at it.
Character N-grams
Some languages (most languages that I know of, but some more than others) have grammatical rules that change the morphology of a word: one house, two houses. In a vanilla Vector Space Model, house and houses are not identical and form two separate dimensions of your model, as if they were as different as house and apple.
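To make that concrete, here is a minimal sketch of a word-level bag-of-words model, assuming plain whitespace tokenization and raw counts (the helper names `word_vector` and `cosine` are mine, purely for illustration). With one dimension per distinct token, house and houses share nothing:

```python
import math
from collections import Counter


def word_vector(text: str) -> Counter:
    # One dimension per distinct lowercase token (whitespace tokenization is an assumption).
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


sim_house_houses = cosine(word_vector("house"), word_vector("houses"))
sim_house_apple = cosine(word_vector("house"), word_vector("apple"))
print(sim_house_houses, sim_house_apple)  # both 0.0: no shared dimension
```

The model has no way of knowing that the first pair is related and the second is not.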
We know that English applies these morphological operations on the words and we can counter that by bending the words back into their 'stem'.
For instance, the Snowball stemmer would bend them back to a token that does collide:
- House -> Hous
- Houses -> Hous
Note: Hous is not what we think of as a stem, hence my use of the quotes around 'stem'
N-gramming is basically splitting text into all contiguous subsequences of length N. If we apply that (N=3, forgetting about start and stop markers for simplicity) to our example strings, we get something like:
- House -> Hou, ous, use
- Houses -> Hou, ous, use, ses
Note that we end up with 4 new dimensions, 3 of which collide. This reduces the need for stemming that we found earlier. We could argue that the two words still don't collide fully, but then again, we might not want them to.
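The trigram split is easy to reproduce. This is a rough sketch, assuming no padding or start/stop markers and no case folding; `char_ngrams` is a hypothetical helper, not a function from any particular library:

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    # Slide a window of width n over the string, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


house = char_ngrams("House")
houses = char_ngrams("Houses")

shared = set(house) & set(houses)      # dimensions that collide
all_dims = set(house) | set(houses)    # every dimension the two words create

print(house)   # ['Hou', 'ous', 'use']
print(houses)  # ['Hou', 'ous', 'use', 'ses']
print(len(shared), "of", len(all_dims), "dimensions collide")  # 3 of 4
```

In practice you would probably lowercase first and try a few values of N; which setting works best depends on your data.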
Language Independence
I don't think n-grams are language independent. Certainly not on the word level: some languages are more regular and some languages are more context-free. Some languages follow a pattern that looks more like adding words to the end as you go, and those languages are probably better suited to modelling with n-grams. This probably holds for character-based n-grams too.
However, as argued above, they add flexibility to the V.S.M. when it comes to handling morphology. Handling that morphology would normally require 'manually' encoding knowledge about the language at hand. With this more or less out of the way, your system needs less configuration for the language at hand, which makes it less dependent on the language.