How to calculate lexical cohesion and semantic informativeness for a given dataset?

In 'Automatic construction of lexicons, taxonomies, ontologies, and other knowledge structures', the authors state:

There are two slightly different classes of measure: lexical cohesion (sometimes called ‘unithood’ or ‘phraseness’), which quantifies the expectation of co-occurrence of words in a phrase (e.g., back-of-the-book index is significantly more cohesive than term name); and semantic informativeness (sometimes called ‘termhood’), which highlights phrases that are representative of a given document or domain.

However, the review does not explain how to calculate or derive these measures. Can someone please specify how to compute these two measurements for a given set of text documents?

Tags: text-mining, nlp, statistics, data-mining

Category: Data Science


Lexical cohesion is closely related to collocation extraction: finding n-grams whose words co-occur more often than chance. One example is "San Francisco", which occurs far more often than the independent frequencies of "San" and "Francisco" would predict. One simple method for collocation extraction is to rank-order all n-grams by frequency of occurrence and pick a threshold for inclusion.
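Raw frequency ranking favors pairs of individually common words, so a standard refinement (my suggestion here, not something the review prescribes) is to score adjacent pairs by pointwise mutual information, PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ), which is high exactly when the pair co-occurs more than its parts' independent frequencies would predict. Here is a minimal, self-contained sketch; the toy token list and the `min_count` cutoff are illustrative assumptions:

```python
# Sketch: collocation extraction via pointwise mutual information (PMI).
# The corpus and min_count threshold are hypothetical choices.
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2):
    """Score each adjacent word pair by PMI = log2(p(x,y) / (p(x) * p(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs yield unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    # Highest-PMI pairs first; threshold or take the top k for inclusion.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = ("we flew to san francisco and then drove from "
          "san francisco to san jose").split()
for (x, y), score in pmi_bigrams(tokens)[:3]:
    print(f"{x} {y}: {score:.2f}")
```

On this toy input, "san francisco" scores above pairs whose words also appear apart, which is the cohesion effect the measure is meant to capture.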

Semantic informativeness is closer to tf-idf computed over n-grams: instead of using raw frequency counts alone, each n-gram's frequency is weighted by how unique it is to a particular document or domain.
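One way to get this in practice (an assumption on my part, not the review's recipe) is scikit-learn's TfidfVectorizer with an `ngram_range` covering phrases, then reading off the highest-weighted n-grams per document. A minimal sketch with a hypothetical toy corpus:

```python
# Sketch: tf-idf weighting of unigrams and bigrams with scikit-learn.
# The three documents below are a hypothetical toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the back of the book index lists every term name",
    "we compiled the back of the book index by hand",
    "a term name alone carries little domain information",
]

vec = TfidfVectorizer(ngram_range=(1, 2))  # score unigrams and bigrams
X = vec.fit_transform(docs)                # rows: documents, cols: n-grams

# Highest-weighted n-grams in the first document; these are the phrases
# most representative of that document relative to the rest of the corpus.
row = X[0].toarray().ravel()
terms = vec.get_feature_names_out()        # requires scikit-learn >= 1.0
top = sorted(zip(terms, row), key=lambda t: t[1], reverse=True)[:5]
for term, weight in top:
    print(f"{term}: {weight:.3f}")
```

N-grams that appear in only one document get a high idf factor and rise to the top, which is the "weighted by uniqueness" behavior described above.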
