How to get the number of syllables in a word?

I have already gone through this post which uses nltk's cmudict for counting the number of syllables in a word:

from nltk.corpus import cmudict
d = cmudict.dict()
def nsyl(word):
  return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]] 

However, for words outside the CMU dictionary, such as names like Rohit, it raises a KeyError and doesn't give a result.

So, is there any other/better way to count syllables for a word?



https://github.com/repp/big-phoney seems to work great if you apply the patch from the pull request at https://github.com/repp/big-phoney/pull/8/commits/580d5a582e445510d28a6270aa16453ed868151e

It looks words up in a dictionary first and, for out-of-dictionary words, falls back to a TensorFlow model that predicts phones and syllables; I'd guess that with more training it could become even more accurate. It only covers English for now and the project seems a bit stagnant, but it could use some love...
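A minimal usage sketch, based on the example in the project's README (the exact API may differ slightly depending on which version or patch you install):

from big_phoney import BigPhoney

phoney = BigPhoney()

# dictionary words are looked up directly; unknown words like "Rohit"
# fall back to the trained phonetic prediction model
print(phoney.count_syllables('Rohit'))
print(phoney.phonize('Rohit'))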


I've tried a lot of automatic and data-based methods, and none of them gets every word right. If the word is in the dictionary, it's a sure thing to get the right number of syllables; failing that, we can fall back to an automatic method. In your case, Rohit correctly comes out as 2 syllables. Comes gives 1, karate 3, readier 3, Siberia 4, insouciance 4, pineapple 3, strawberries 3, snozzberries 3. So it seems to be fairly comprehensive. If this function gets something wrong, leave a comment.

from nltk.corpus import cmudict
import syllapy

d = cmudict.dict()

def syllable_count(word):
    try:
        # count the stress digits in the word's first CMU pronunciation
        return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]][0]
    except KeyError:
        # word not found in cmudict, fall back to syllapy
        return syllapy.count(word)

Like you, I wasn't thrilled with the quality of syllable counting functions I could find online, so here's my take:

import re

VOWEL_RUNS = re.compile("[aeiouy]+", flags=re.I)
EXCEPTIONS = re.compile(
    # fixes trailing e issues:
    # smite, scared
    "[^aeiou]e[sd]?$|"
    # fixes adverbs:
    # nicely
    + "[^e]ely$",
    flags=re.I
)
ADDITIONAL = re.compile(
    # fixes incorrect subtractions from exceptions:
    # smile, scarred, raises, fated
    "[^aeioulr][lr]e[sd]?$|[csgz]es$|[td]ed$|"
    # fixes miscellaneous issues:
    # flying, piano, video, prism, fire, evaluate
    + ".y[aeiou]|ia(?!n$)|eo|ism$|[^aeiou]ire$|[^gq]ua",
    flags=re.I
)

def count_syllables(word):
    vowel_runs = len(VOWEL_RUNS.findall(word))
    exceptions = len(EXCEPTIONS.findall(word))
    additional = len(ADDITIONAL.findall(word))
    return max(1, vowel_runs - exceptions + additional)

We avoid looping in pure Python; at the same time, these regexes should be easy to understand.

This performs better than the various snippets floating around online that I've found (including Pyphen and Syllapy's fallback). It gets over 90% of cmudict correct (and I find its mistakes quite understandable).

import nltk

cd = nltk.corpus.cmudict.dict()
sum(
    1 for word, pron in cd.items()
    if count_syllables(word) in (sum(1 for p in x if p[-1].isdigit()) for x in pron)
) / len(cd)
# 0.9073751569397757

For comparison, Pyphen is at 53.8% and the syllables function in the other answer is at 83.7%.
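If you want to reproduce those figures, here is a rough sketch of the evaluation harness I have in mind, reusing cd and count_syllables from above; it assumes Pyphen's count is taken as the number of hyphenation chunks, and I can't guarantee this is exactly how the original numbers were produced:

import pyphen

pyphen_dic = pyphen.Pyphen(lang='en')

def accuracy(counter):
    # fraction of cmudict words whose predicted count matches any listed pronunciation
    return sum(
        1 for word, pron in cd.items()
        if counter(word) in (sum(1 for p in x if p[-1].isdigit()) for x in pron)
    ) / len(cd)

accuracy(count_syllables)                                   # the regex approach above
accuracy(lambda w: len(pyphen_dic.inserted(w).split('-')))  # Pyphen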

Here are some common words it gets wrong:

from collections import Counter
for word, _ in Counter(nltk.corpus.brown.words()).most_common(1000):
    word = word.lower()
    if word in cd and count_syllables(word) not in (sum(1 for p in x if p[-1].isdigit()) for x in cd[word]):
        print(word)

Below is how I did it:

from nltk.corpus import cmudict
from nltk.corpus import brown
from nltk.corpus import stopwords

def countsyllables(pron):
    # a pronunciation is a list of phones; vowel phones end in a stress digit
    return len([phone for phone in pron if phone[-1].isdigit()])

cmudict_dict = cmudict.dict()
sw = stopwords.words('english')
bwns = [w.lower() for w in brown.words() if w.lower() not in sw]
missingw = []
syllablecnt = []
for w in bwns:
    try:
        # use the first pronunciation listed for the word
        syllablecnt.append(countsyllables(cmudict_dict[w][0]))
    except KeyError:
        missingw.append(w)
        continue

# approximate total syllable count for the Brown corpus; many words are still missing
sum(syllablecnt)

I was facing the exact same issue; this is what I did.
Catch the KeyError you get when the word is not found in the CMU dictionary, as below:

from nltk.corpus import cmudict
d = cmudict.dict()

def nsyl(word):
    try:
        return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]]
    except KeyError:
        #if word not found in cmudict
        return syllables(word)

Then call the syllables function below, which falls back to a simple vowel-group heuristic:

def syllables(word):
    #referred from stackoverflow.com/questions/14541303/count-the-number-of-syllables-in-a-word
    count = 0
    vowels = 'aeiouy'
    word = word.lower()
    if word[0] in vowels:
        count +=1
    for index in range(1,len(word)):
        if word[index] in vowels and word[index-1] not in vowels:
            count +=1
    if word.endswith('e'):
        count -= 1
    if word.endswith('le'):
        count += 1
    if count == 0:
        count += 1
    return count

You can try another Python library called Pyphen. It's easy to use and supports a lot of languages.

import pyphen

dic = pyphen.Pyphen(lang='en')
print(dic.inserted('Rohit'))
# 'Ro-hit'
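Pyphen only gives you the hyphenation, so to answer the original question you still need a count. One simple option (my own assumption, not something Pyphen provides directly) is to count the hyphen-separated chunks:

def count_syllables_pyphen(word):
    # treat each hyphenation chunk as one syllable
    return len(dic.inserted(word).split('-'))

print(count_syllables_pyphen('Rohit'))
# 2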
