Word representation that gives more weight to terms frequent in corpus?
Tf-idf discounts words that appear in many documents in the corpus. I am building an anomaly-detection text classifier that is trained only on valid documents; a One-class SVM then flags outliers. Interestingly enough, tf-idf performs worse than a simple count vectorizer. At first I was confused, but then it made sense: tf-idf discounts exactly the attributes that are most indicative of a valid document. So I am thinking of the opposite approach: weight words that appear in (almost) all training documents more heavily, or rather assign a negative weight for the absence of such words. I have a preset dictionary of words, so there is no worry that irrelevant words such as "is" or "that" will be weighted.
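For context, here is a minimal sketch of the pipeline described above, assuming scikit-learn; the document strings and the `vocabulary` list are made-up placeholders for the preset dictionary:

```python
# Hedged sketch: train a One-class SVM on count features built from
# valid documents only, then predict on unseen documents.
# +1 = inlier (looks like a valid document), -1 = outlier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import OneClassSVM

valid_docs = ["invoice total amount due", "invoice amount paid",
              "total due invoice paid"]
new_docs = ["invoice amount due", "completely unrelated text"]

# A fixed vocabulary stands in for the preset dictionary of words.
vec = CountVectorizer(vocabulary=["invoice", "total", "amount", "due", "paid"])
X_train = vec.fit_transform(valid_docs)

clf = OneClassSVM(kernel="linear", nu=0.1).fit(X_train)
pred = clf.predict(vec.transform(new_docs))
```

Swapping `CountVectorizer` for `TfidfVectorizer` reproduces the comparison in the question.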
Do you have any ideas for such a representation? The only thing I can think of is subtracting the document frequency from the attributes that are zero in a given document.
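The subtraction idea can be sketched directly on a dense count matrix; this is a minimal illustration, not an established weighting scheme, and the helper name `df_weighted_matrix` is made up:

```python
import numpy as np

def df_weighted_matrix(X):
    """Weight present terms by their document frequency and
    penalize absent terms by that same frequency, so missing a
    word that occurs in most training documents pulls the
    representation strongly away from the "valid" region."""
    X = np.asarray(X, dtype=float)
    n_docs = X.shape[0]
    df = (X > 0).sum(axis=0) / n_docs      # fraction of docs containing each term
    present = (X > 0).astype(float)
    # +df where the term occurs, -df where it is missing
    return present * df - (1.0 - present) * df

# Toy document-term count matrix (3 documents, 3 terms).
docs = np.array([
    [2, 1, 0],
    [1, 1, 0],
    [3, 0, 1],
])
W = df_weighted_matrix(docs)
```

Here a term present in every document contributes +1 when it appears, and a term that appears in 2/3 of the corpus contributes -2/3 when it is missing, which matches the "negative weight for absence" intuition.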
Topic bag-of-words tfidf anomaly-detection outlier nlp
Category Data Science