NLP text representation techniques that preserve word order in a sentence?

I see people mostly talking about bag-of-words, tf-idf and word embeddings. But these operate at the word level: BoW and tf-idf fail to represent word order, and word embeddings are not meant to encode order at all. What's the best practice / most popular way of representing word order for texts of varying lengths? Simply concatenating the embeddings of individual words into one long vector apparently doesn't work for texts of varying lengths...
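To illustrate what I mean, here is a toy sketch (vectors made up) of the usual fixed-length fallback, mean-pooling word embeddings, which discards order entirely:

```python
import numpy as np

# Toy 3-d embeddings; in practice these would come from word2vec, GloVe, etc.
emb = {
    "dog":   np.array([0.9, 0.1, 0.0]),
    "bites": np.array([0.2, 0.8, 0.3]),
    "man":   np.array([0.4, 0.5, 0.7]),
}

def mean_pool(tokens):
    # Averaging gives a fixed-length vector for any text length...
    return np.mean([emb[t] for t in tokens], axis=0)

a = mean_pool(["dog", "bites", "man"])
b = mean_pool(["man", "bites", "dog"])
print(np.allclose(a, b))  # True: ...but word order is completely lost
```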

Or does no method exist for this, short of relying on network architectures like the positional encoding in the transformer family?
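(For reference, the sinusoidal positional encoding I'm referring to, sketched in NumPy:)

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal encoding from "Attention Is All You Need" (d_model even):
    #   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]     # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]  # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe  # added to the word embeddings, position by position

print(positional_encoding(seq_len=4, d_model=8).round(2))
```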

By the way, n-grams are not a solution for me, as they still fail to solve the problem of representing texts of varying lengths. (Or can they, and how? It seems to me n-grams are more suited to next-word prediction than to representing variable-length texts.)
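For context, here is the kind of n-gram featurization I have in mind (scikit-learn). It does yield a fixed-length vector once the vocabulary is fixed, but it only captures order inside the n-gram window, not across the whole text:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-bigrams: one dimension per vocabulary bigram, so every text maps
# to a vector of the same length, and "dog bites man" != "man bites dog".
vec = CountVectorizer(ngram_range=(2, 2))
X = vec.fit_transform(["dog bites man", "man bites dog", "dog bites man hard"])
print(vec.get_feature_names_out())
print(X.toarray())  # only order *within* each bigram window is captured
```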

TIA :)



I recommend working with parts of speech (POS), or more specifically with RDF triples of Subject, Predicate and Object.

A triple both captures the core structure of the sentence and preserves the order (i.e. the Subject acts, via the Predicate, on the Object).
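A minimal sketch of pulling such triples out of spaCy's dependency parse (assumes the en_core_web_sm model is installed; the extraction rule is deliberately simplified and ignores passives, clauses, conjunctions, etc.):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def svo_triples(text):
    # For each verb, pair its nominal subject with its direct object.
    doc = nlp(text)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.lefts if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.rights if c.dep_ in ("dobj", "obj")]
            if subjects and objects:
                triples.append((subjects[0].text, token.lemma_, objects[0].text))
    return triples

print(svo_triples("The dog bites the man."))  # [('dog', 'bite', 'man')]
```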

See if you can go with that alone. If not, you can supplement it with the techniques you mentioned (bag-of-words, tf-idf, etc.).

See my answer here for a suggested tf-idf score computed over an RDF triple, to check whether the triple itself is "unique enough".
