Language Detection using pycld2

I am trying to use the pycld2 package to detect multiple languages in a single text. This package provides Python bindings for the Compact Language Detector 2 (CLD2) library.

This is the example I am testing out:

import pycld2 as cld2

text = '''The universal connection with an additional advantage: Push-in connection. Terminate solid and stranded (Class B 7 strands or less), as well as ferruled conductors, by simply pushing them in – no tools required. La connessione universale con un ulteriore vantaggio: Connessione push-in. Terminare solido e incagliato (trefoli di classe B 7 o meno), così come i conduttori a puntale, semplicemente spingendoli in – nessun attrezzo richiesto. Der universelle Anschluss mit zusätzlichem Vorteil: Push-in-Anschluss Vollständig und verseilt abschließen (Klasse B 7 Stränge oder weniger), sowie Aderendhülsen durch einfaches Aufschieben in – kein Werkzeug erforderlich.'''

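# detect() returns: reliability flag, number of text bytes found, per-language details, and (optionally) vectors
reliable, text_bytes_found, top_3_choices, vecs = cld2.detect(text, returnVectors=True)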
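Since the call above requests vectors, here is a minimal sketch of how they might be used to cut the text into per-language chunks. It assumes each vector has the layout (byte offset, byte length, language name, language code) and that the offsets refer to the UTF-8 encoded text. The helper name `split_by_vectors` and the sample vectors are hypothetical, written by hand for illustration rather than produced by pycld2.

```python
def split_by_vectors(text, vectors):
    """Slice text into (language_code, chunk) pairs.

    Assumes each vector is (byte_offset, byte_length, language_name,
    language_code) and that offsets index into the UTF-8 bytes of text.
    """
    raw = text.encode("utf-8")
    chunks = []
    for offset, length, _name, code in vectors:
        # Decode defensively in case a slice cuts through a multi-byte character.
        chunk = raw[offset:offset + length].decode("utf-8", errors="replace")
        chunks.append((code, chunk))
    return chunks


# Hand-written sample input (hypothetical, not real pycld2 output).
sample = "Hello world. Ciao mondo."
sample_vectors = ((0, 13, "ENGLISH", "en"), (13, 11, "ITALIAN", "it"))

chunks = split_by_vectors(sample, sample_vectors)
for code, chunk in chunks:
    print(code, repr(chunk))
```

With the real result, the same helper would be called as `split_by_vectors(text, vecs)`.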

The top 3 detected languages are the following:

print(top_3_choices)

(('GERMAN', 'de', 34, 1089.0), ('ITALIAN', 'it', 33, 355.0), ('ENGLISH', 'en', 32, 953.0))

According to the documentation, the fourth element of each tuple is the confidence score and the third element is the percentage of the original text detected as that language. However, I am struggling to interpret the score in a way that lets me flag how confident the detection is. Does anyone know whether the score can be normalized into some form of interpretable probability? Thanks!
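To make the tuple layout concrete, here is a minimal sketch that names the four fields and rescales the percentage field so that the shares sum to 1. It uses the output printed above as a literal, so it runs without pycld2. The `Detection` type and the `summarize` helper are hypothetical names introduced for illustration. The resulting shares describe how much of the text was attributed to each language; they are not calibrated probabilities, and the raw score is left untouched because its scale is unclear to me.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    name: str      # language name, e.g. 'GERMAN'
    code: str      # language code, e.g. 'de'
    percent: int   # percentage of the text attributed to this language
    score: float   # raw score; not a probability


# Output copied from print(top_3_choices) above.
top_3_choices = (
    ("GERMAN", "de", 34, 1089.0),
    ("ITALIAN", "it", 33, 355.0),
    ("ENGLISH", "en", 32, 953.0),
)


def summarize(details):
    """Return {language_code: share}, with the percent fields rescaled to sum to 1."""
    detections = [Detection(*d) for d in details]
    total = sum(d.percent for d in detections)
    if total == 0:
        # Nothing was attributed to any language.
        return {}
    return {d.code: d.percent / total for d in detections}


shares = summarize(top_3_choices)
for code, share in shares.items():
    print(f"{code}: {share:.3f}")
```

The percentages here sum to 99 rather than 100, which is why the sketch divides by the observed total instead of by 100.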

