How to calculate lexical cohesion and semantic informativeness for a given dataset?

In 'Automatic construction of lexicons, taxonomies, ontologies, and other knowledge structures', the authors state:

There are two slightly different classes of measure: lexical cohesion (sometimes called ‘unithood’ or ‘phraseness’), which quantifies the expectation of co-occurrence of words in a phrase (e.g., back-of-the-book index is significantly more cohesive than term name); and semantic informativeness (sometimes called ‘termhood’), which highlights phrases that are representative of a given document or domain.

However, the review does not explain how to calculate or derive these measures. Can someone please specify how to compute these two measurements for a given set of text documents?

Tags: text-mining, nlp, statistics, data-mining

Category: Data Science


Lexical cohesion is closely related to collocation extraction: finding n-grams whose words co-occur more often than chance. One example is "San Francisco", which occurs far more often than the independent frequencies of "San" and "Francisco" would predict. One simple method for collocation extraction is to rank-order all n-grams by frequency of occurrence and pick a threshold for inclusion.
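Raw frequency ranking favors pairs of individually common words, so a standard refinement (my suggestion here, not something the review prescribes) is to score adjacent pairs by pointwise mutual information, PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ), which is high exactly when the pair co-occurs more than its parts' independent frequencies would predict. Here is a minimal, self-contained sketch; the toy token list and the `min_count` cutoff are illustrative assumptions:

```python
# Sketch: collocation extraction via pointwise mutual information (PMI).
# The corpus and min_count threshold are hypothetical choices.
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score each adjacent word pair by PMI = log2(p(x,y) / (p(x) * p(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs yield unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    # Highest-PMI pairs first; threshold or take the top k for inclusion.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = ("we flew to san francisco and then drove from "
          "san francisco to san jose").split()
for (x, y), score in pmi_bigrams(tokens)[:3]:
    print(f"{x} {y}: {score:.2f}")
```

On this toy input, "san francisco" scores above pairs whose words also appear apart, which is the cohesion effect the measure is meant to capture.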

Semantic informativeness is closer to tf-idf computed over n-grams: instead of using raw frequency counts alone, each n-gram's frequency is weighted by how unique it is to a particular document or domain.
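One way to get this in practice (an assumption on my part, not the review's recipe) is scikit-learn's TfidfVectorizer with an `ngram_range` covering phrases, then reading off the highest-weighted n-grams per document. A minimal sketch with a hypothetical toy corpus:

```python
# Sketch: tf-idf weighting of unigrams and bigrams with scikit-learn.
# The three documents below are a hypothetical toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the back of the book index lists every term name",
    "we compiled the back of the book index by hand",
    "a term name alone carries little domain information",
]

vec = TfidfVectorizer(ngram_range=(1, 2))  # score unigrams and bigrams
X = vec.fit_transform(docs)                # rows: documents, cols: n-grams

# Highest-weighted n-grams in the first document; these are the phrases
# most representative of that document relative to the rest of the corpus.
row = X[0].toarray().ravel()
terms = vec.get_feature_names_out()        # requires scikit-learn >= 1.0
top = sorted(zip(terms, row), key=lambda t: t[1], reverse=True)[:5]
for term, weight in top:
    print(f"{term}: {weight:.3f}")
```

N-grams that appear in only one document get a high idf factor and rise to the top, which is the "weighted by uniqueness" behavior described above.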
