Python Code to find the number of hapax legomena in a Text or Words_List

In corpus linguistics, a hapax legomenon is a word that occurs only once within a context, either in the written record of an entire language, in the works of an author, or in a single text. The term is sometimes incorrectly used to describe a word that occurs in just one of an author's works, but more than once in that particular work. Hapax legomenon is a transliteration of Greek ἅπαξ λεγόμενον, meaning "(something) being said (only) once".

Source: Wikipedia, Hapax legomenon



You can use NLTK's FreqDist, which has a built-in hapaxes() method for exactly this.

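from nltk.probability import FreqDist
from nltk import Text

# tokens: a list of word strings (a plain split is used here as a placeholder tokenizer)
tokens = 'your text ... '.split()
text = Text(tokens)
fdist = FreqDist(text)
hapaxes = fdist.hapaxes()  # words that occur exactly once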

If for some reason you cannot use the library, you can calculate it manually:

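from collections import Counter
text = 'your text ... '
words = text.split()
counts = Counter(words)
local_hapax = [word for word, count in counts.items() if count == 1]  # words seen exactly once
weighted_local_hapax = len(local_hapax) / len(words)  # hapax count relative to the total number of words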

If you want the distinct words in your Words_list (each word listed once, however often it occurs), you can get them in one line using a set:

Words_list = list(set(Words_list))
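Note that a set keeps every distinct word, including words that occur many times, so its contents are not the same as the hapaxes. A minimal sketch of the difference, assuming a small made-up Words_list (the sample words are only for illustration):

```python
from collections import Counter

# Hypothetical sample data, only for illustration
Words_list = ["the", "cat", "sat", "on", "the", "mat"]

# Distinct words: every word once, no matter how often it occurs
unique_words = set(Words_list)

# Hapaxes: only the words whose count is exactly 1
counts = Counter(Words_list)
hapaxes = [word for word, count in counts.items() if count == 1]

print(sorted(unique_words))  # ['cat', 'mat', 'on', 'sat', 'the']
print(hapaxes)               # ['cat', 'sat', 'on', 'mat']
```

Here "the" appears in the set of distinct words but not among the hapaxes, because it occurs twice.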

If you only want the hapaxes, that is, the words that occur exactly once, count how often each word occurs and print the words whose frequency is 1:

import re

path = r"C:\Users\something\something\yourfile.txt".replace('\\', '/')

def hapax_function(give_your_path):
    # Read the file and split it into lowercase word tokens
    with open(give_your_path) as file:
        list_of_words = re.findall(r'\w+', file.read().lower())
    # Count how many times each word occurs
    freqs = {key: 0 for key in list_of_words}
    for word in list_of_words:
        freqs[word] += 1
    # Print the words that occur exactly once
    for word in freqs:
        if freqs[word] == 1:
            print(word)

hapax_function(path)
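If you would rather get the hapaxes back as a list than print them, the same counting can be written with collections.Counter. This is only a sketch: the function name find_hapaxes and the sample sentence are made up for illustration, and it takes a string instead of a file path.

```python
import re
from collections import Counter

def find_hapaxes(text):
    """Return the words that occur exactly once in text, in order of first appearance."""
    # Same tokenization as above: lowercase, then runs of word characters
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    return [word for word, count in counts.items() if count == 1]

sample = "The cat sat on the mat. The dog sat too."
print(find_hapaxes(sample))  # ['cat', 'on', 'mat', 'dog', 'too']
```

The number of hapaxes asked for in the title is then just len(find_hapaxes(sample)).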
