Word representation that gives more weight to terms frequent in corpus?
Tf-idf discounts words that appear in many documents in the corpus. I am building an anomaly-detection text classifier that is trained only on valid documents; a One-class SVM then flags outliers. Interestingly enough, tf-idf performs worse than a simple count vectorizer. At first I was confused, but then it made sense: tf-idf discounts exactly the attributes that are most indicative of a valid document. So I am thinking of the opposite approach: weight words that appear in (almost) all training documents more heavily, or rather assign a negative weight for the absence of such words. I have a preset dictionary of words, so there is no worry that irrelevant words such as "is" or "that" will be weighted.
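For context, here is a minimal sketch of the pipeline described above, assuming scikit-learn; the document strings and the `vocabulary` list are made-up placeholders for the preset dictionary:

```python
# Hedged sketch: train a One-class SVM on count features built from
# valid documents only, then predict on unseen documents.
# +1 = inlier (looks like a valid document), -1 = outlier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import OneClassSVM

valid_docs = ["invoice total amount due", "invoice amount paid",
              "total due invoice paid"]
new_docs = ["invoice amount due", "completely unrelated text"]

# A fixed vocabulary stands in for the preset dictionary of words.
vec = CountVectorizer(vocabulary=["invoice", "total", "amount", "due", "paid"])
X_train = vec.fit_transform(valid_docs)

clf = OneClassSVM(kernel="linear", nu=0.1).fit(X_train)
pred = clf.predict(vec.transform(new_docs))
```

Swapping `CountVectorizer` for `TfidfVectorizer` reproduces the comparison in the question.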
Do you have any ideas for such a representation? The only thing I can think of is subtracting the document frequency from the attributes that are zero in a given document.
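The subtraction idea can be sketched directly on a dense count matrix; this is a minimal illustration, not an established weighting scheme, and the helper name `df_weighted_matrix` is made up:

```python
import numpy as np

def df_weighted_matrix(X):
    """Weight present terms by their document frequency and
    penalize absent terms by that same frequency, so missing a
    word that occurs in most training documents pulls the
    representation strongly away from the "valid" region."""
    X = np.asarray(X, dtype=float)
    n_docs = X.shape[0]
    df = (X > 0).sum(axis=0) / n_docs      # fraction of docs containing each term
    present = (X > 0).astype(float)
    # +df where the term occurs, -df where it is missing
    return present * df - (1.0 - present) * df

# Toy document-term count matrix (3 documents, 3 terms).
docs = np.array([
    [2, 1, 0],
    [1, 1, 0],
    [3, 0, 1],
])
W = df_weighted_matrix(docs)
```

Here a term present in every document contributes +1 when it appears, and a term that appears in 2/3 of the corpus contributes -2/3 when it is missing, which matches the "negative weight for absence" intuition.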
Topic bag-of-words tfidf anomaly-detection outlier nlp
Category Data Science