Looking for a 'CITY, STATE' within a body of text (from a CITY-STATE database)

I'm looking for an optimal way to search a large body of text for a combination of words that resemble any CITY, STATE combination I have in a separate CITY-STATE database.

My only idea is to run a separate search against the body of text for each CITY, STATE in the database, but that would take a lot of time given the number of CITY, STATE combinations the database holds. The desired result is a single CITY, STATE for each body of text I am analyzing, to tell the geographical side of the story for this data subset.

Does anyone know of an efficient way/process to do such a search?

Topic parsing data-indexing-techniques search databases

Category Data Science


The only approach I can see is to split the data into separate city and state lists and treat the problem as an automaton: parse your text and run through its n-grams. Whenever you detect a CITY token (an n-gram that is in your list of cities, or close to one in a similarity sense, since there might be misspellings), look for a STATE token in its neighbourhood (again by checking a list of states, using an edit-distance metric to allow for misspellings). If you find one, you can tag your text with that geographical location.
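As a rough illustration of that scan, here is a minimal Python sketch. Everything named in it is an assumption made for the example: the tiny `CITIES`, `STATES` and `VALID_PAIRS` samples stand in for the real database, the function names are hypothetical, and `difflib`'s similarity ratio is swapped in for a true edit-distance metric. It also only looks for the state in the few tokens after the city, which is just one way to read "neighbourhood".

```python
import re
from difflib import get_close_matches

# Hypothetical sample data; in practice these come from the CITY-STATE database.
CITIES = sorted({"salem", "portland", "new york", "springfield"})
STATES = sorted({"oregon", "new york", "illinois", "massachusetts"})
VALID_PAIRS = {
    ("salem", "oregon"),
    ("portland", "oregon"),
    ("new york", "new york"),
    ("springfield", "illinois"),
}

def fuzzy_lookup(candidate, vocabulary, cutoff=0.85):
    """Return the vocabulary entry matching `candidate`, allowing small misspellings."""
    if candidate in vocabulary:
        return candidate
    # difflib similarity ratio, used here as a stand-in for an edit-distance metric.
    matches = get_close_matches(candidate, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

def find_city_state(text, max_ngram=3, window=3):
    """Return the first (city, state) pair found in `text`, or None."""
    tokens = re.findall(r"[a-z]+", text.lower())
    for i in range(len(tokens)):
        # Try the longest n-gram first so multi-word cities are not cut short.
        for n in range(max_ngram, 0, -1):
            if i + n > len(tokens):
                continue
            city = fuzzy_lookup(" ".join(tokens[i:i + n]), CITIES)
            if city is None:
                continue
            # Look for a STATE n-gram starting within `window` tokens after the city.
            start = i + n
            for j in range(start, min(start + window, len(tokens))):
                for m in range(max_ngram, 0, -1):
                    if j + m > len(tokens):
                        continue
                    state = fuzzy_lookup(" ".join(tokens[j:j + m]), STATES)
                    if state and (city, state) in VALID_PAIRS:
                        return (city, state)
    return None
```

Calling `find_city_state("The fair in Salem, Oregn was fun")` would tag the text with the canonical pair despite the misspelled state. With a real database, the fuzzy lookup over the full city list becomes the bottleneck, so an index (for example a trie or a dictionary keyed on the first token) would likely be needed.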

Of course, allowing for misspellings will introduce some false positives, but you could filter most of them out with a quick lookup through your corpus to check that "SALAMI, OREGANO" is different from "SALEM, OREGON" (the frequency of the latter will hopefully be higher than that of the former).
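One possible reading of that frequency check, sketched in Python under stated assumptions: count every "WORD, WORD" surface form in the corpus, then accept a fuzzy match only when the canonical spelling is clearly more common than the form actually seen. The function names, the `min_ratio` threshold and the single-word regex are all hypothetical choices for the example, not something the approach prescribes.

```python
import re
from collections import Counter

def count_surface_pairs(corpus):
    """Count each 'WORD, WORD' surface form across all documents (case-insensitive)."""
    counts = Counter()
    for doc in corpus:
        # Assumption: single-word city and state names only, to keep the sketch short.
        for left, right in re.findall(r"\b([A-Za-z]+),\s+([A-Za-z]+)\b", doc):
            counts[(left.lower(), right.lower())] += 1
    return counts

def accept_fuzzy_match(surface, canonical, counts, min_ratio=2.0):
    """Accept a fuzzy match only if the canonical pair is clearly more frequent.

    `surface` is the pair as it appears in the text, `canonical` is the
    database pair it was fuzzily matched to. `min_ratio` is an arbitrary
    threshold chosen for illustration.
    """
    if surface == canonical:
        return True
    return counts[canonical] >= min_ratio * counts[surface]

# Hypothetical mini-corpus for illustration.
corpus = [
    "Visited Salem, Oregon today.",
    "Salem, Oregon is rainy.",
    "Back in Salem, Oregon again.",
    "Trip to Salem, Oregn.",
    "Add salami, oregano and cheese.",
    "More salami, oregano please.",
]
counts = count_surface_pairs(corpus)
typo_ok = accept_fuzzy_match(("salem", "oregn"), ("salem", "oregon"), counts)
food_ok = accept_fuzzy_match(("salami", "oregano"), ("salem", "oregon"), counts)
```

Here the rare misspelling is accepted while the recurring food phrase is rejected. A misspelling whose canonical form never appears in the corpus would also be rejected under this rule, so the threshold needs tuning against real data.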
