Pre-trained models for finding similar word n-grams

Are there any pre-trained models for finding similar word n-grams, where n > 1?

FastText, for instance, seems to work only on unigrams:

from pyfasttext import FastText
model = FastText('cc.en.300.bin')
model.nearest_neighbors('dog', k=2000)

[('dogs', 0.8463464975357056),
 ('puppy', 0.7873005270957947),
 ('pup', 0.7692237496376038),
 ('canine', 0.7435278296470642),
 ...

but it fails on longer n-grams:

model.nearest_neighbors('Gone with the Wind', k=2000)

[('DEky4M0BSpUOTPnSpkuL5I0GTSnRI4jMepcaFAoxIoFnX5kmJQk1aYvr2odGBAAIfkECQoABAAsCQAAABAAEgAACGcAARAYSLCgQQEABBokkFAhAQEQHQ4EMKCiQogRCVKsOOAiRocbLQ7EmJEhR4cfEWoUOTFhRIUNE44kGZOjSIQfG9rsyDCnzp0AaMYMyfNjS6JFZWpEKlDiUqALJ0KNatKmU4NDBwYEACH5BAUKAAQALAkAAAAQABIAAAhpAAEQGEiQIICDBAUgLEgAwICHAgkImBhxoMOHAyJOpGgQY8aBGxV2hJgwZMWLFTcCUIjwoEuLBym69PgxJMuDNAUqVDkz50qZLi',
  0.71047443151474),

or

model.nearest_neighbors('Star Wars', k=2000)
[('clockHauser', 0.5432934761047363),
 ('CrônicasEsdrasNeemiasEsterJóSalmosProvérbiosEclesiastesCânticosIsaíasJeremiasLamentaçõesEzequielDanielOséiasJoelAmósObadiasJonasMiquéiasNaumHabacuqueSofoniasAgeuZacariasMalaquiasNovo',
  0.5197194218635559),



First off, there aren't, to my knowledge, any models trained specifically to generate n-gram embeddings, although it would be fairly easy to modify the word2vec algorithm to accommodate n-grams.

Now, what can you do?

You could compute the n-gram embedding by summing the individual word embeddings, optionally weighting each word (for instance by TF-IDF, though that is not required). Once you have a single embedding, simply find nearest neighbors using cosine distance, as in the sketch below.
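Here is a minimal sketch of that idea, assuming a fastText model loaded with pyfasttext (which exposes word vectors through get_numpy_vector) and a hypothetical list of candidate n-grams to rank:

from pyfasttext import FastText
import numpy as np

model = FastText('cc.en.300.bin')

def ngram_vector(ngram, weights=None):
    """Sum the word vectors of an n-gram, optionally weighted (e.g. by TF-IDF)."""
    words = ngram.split()
    if weights is None:
        weights = [1.0] * len(words)
    return np.sum([w * model.get_numpy_vector(word)
                   for word, w in zip(words, weights)], axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def nearest_ngrams(query, candidates, k=5):
    """Rank candidate n-grams by cosine similarity to the query n-gram."""
    q = ngram_vector(query)
    scored = [(c, cosine(q, ngram_vector(c))) for c in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

print(nearest_ngrams('Star Wars',
                     ['Star Trek', 'Gone with the Wind', 'space opera film']))

Note that you need your own pool of candidate n-grams to search over; unlike nearest_neighbors on unigrams, the model's vocabulary does not contain multi-word entries.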

Another approach, though more computationally expensive, would be to compute the Earth Mover's Distance (also called the Wasserstein distance) between n-grams and find nearest neighbors that way.
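As a sketch of that second approach, gensim's wmdistance implements Word Mover's Distance (an Earth Mover's Distance over word embeddings); it requires the POT or pyemd package, and load_facebook_vectors reads the same fastText .bin file. The candidate list is again an assumption:

from gensim.models.fasttext import load_facebook_vectors

wv = load_facebook_vectors('cc.en.300.bin')

def nearest_by_wmd(query, candidates, k=5):
    """Rank candidate n-grams by Word Mover's Distance (smaller = more similar)."""
    scored = [(c, wv.wmdistance(query.lower().split(), c.lower().split()))
              for c in candidates]
    return sorted(scored, key=lambda x: x[1])[:k]

print(nearest_by_wmd('Star Wars',
                     ['Star Trek', 'Gone with the Wind', 'space opera film']))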
