Like you, I wasn't thrilled with the quality of syllable counting functions I could find online, so here's my take:
import re
VOWEL_RUNS = re.compile("[aeiouy]+", flags=re.I)
EXCEPTIONS = re.compile(
    # fixes trailing e issues:
    # smite, scared
    "[^aeiou]e[sd]?$|"
    # fixes adverbs:
    # nicely
    + "[^e]ely$",
    flags=re.I
)
ADDITIONAL = re.compile(
    # fixes incorrect subtractions from exceptions:
    # smile, scarred, raises, fated
    "[^aeioulr][lr]e[sd]?$|[csgz]es$|[td]ed$|"
    # fixes miscellaneous issues:
    # flying, piano, video, prism, fire, evaluate
    + ".y[aeiou]|ia(?!n$)|eo|ism$|[^aeiou]ire$|[^gq]ua",
    flags=re.I
)
def count_syllables(word):
    vowel_runs = len(VOWEL_RUNS.findall(word))
    exceptions = len(EXCEPTIONS.findall(word))
    additional = len(ADDITIONAL.findall(word))
    # every word has at least one syllable
    return max(1, vowel_runs - exceptions + additional)
This avoids looping in pure Python, and the regexes should still be easy to understand.
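As a quick sanity check, here's what it returns on a few everyday words (results shown as comments):
count_syllables("hello")     # 2  (hel-lo)
count_syllables("piano")     # 3  (pi-a-no)
count_syllables("nicely")    # 2  (nice-ly)
count_syllables("evaluate")  # 4  (e-val-u-ate)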
This performs better than the various snippets I've found floating around online (including Pyphen and Syllapy's fallback). It gets over 90% of cmudict right, and I find its mistakes quite understandable:
import nltk  # requires: nltk.download("cmudict")

cd = nltk.corpus.cmudict.dict()
sum(
    1 for word, pron in cd.items()
    # a pronunciation's syllable count is the number of phonemes
    # carrying a stress digit, i.e. the vowel phonemes
    if count_syllables(word) in (sum(1 for p in x if p[-1].isdigit()) for x in pron)
) / len(cd)
# 0.9073751569397757
For comparison, Pyphen is at 53.8% and the syllables function in the other answer is at 83.7%.
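If you want to reproduce the Pyphen figure, something along these lines should be close. Treating hyphenation points as syllable boundaries is my assumption about the methodology, so the exact percentage may differ slightly (this reuses cd from above):
import pyphen

dic = pyphen.Pyphen(lang="en_US")

def pyphen_syllables(word):
    # hyphenation points + 1 as a rough syllable estimate
    return dic.inserted(word).count("-") + 1

sum(
    1 for word, pron in cd.items()
    if pyphen_syllables(word) in (sum(1 for p in x if p[-1].isdigit()) for x in pron)
) / len(cd)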
Here are some common words it gets wrong:
from collections import Counter
# check the 1,000 most frequent words in the Brown corpus against cmudict
for word, _ in Counter(nltk.corpus.brown.words()).most_common(1000):
    word = word.lower()
    if word in cd and count_syllables(word) not in (sum(1 for p in x if p[-1].isdigit()) for x in cd[word]):
        print(word)