4 Python Libraries to Detect English and Non-English Language

Python has a bunch of great libraries and tools for NLP, which give us some cool ways to detect languages. In this guide, we'll check out four Python libraries that can tell English from non-English:

  1. langdetect
  2. langid
  3. pycld2
  4. fastText

Let's take a closer look at each of these libraries.

The langdetect Library

The langdetect library is a well-known Python library for spotting languages. It's a Python port of Google's language-detection library, originally written in Java. This library can recognize 55 languages and works well with longer pieces of text.

Installation

In order to install, you can use the pip installer as shown below:

Syntax:
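```shell
# Install langdetect from PyPI
pip install langdetect
```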

Basic usage:

Here's an easy way to use langdetect:

Output:

 
Language: de, Probability: 0.5714285714285714
Language: en, Probability: 0.42857142857142855   

To handle text with more than one language:

Output:

 
Language: en, Probability: 0.5714285714285714
Language: es, Probability: 0.2857142857142857
Language: de, Probability: 0.1428571428571428   

Pros about langdetect:

  1. It can figure out lots of languages (55)
  2. It does well with longer pieces of text
  3. It tells you how sure it is about each language it spots

Cons about langdetect:

  1. It can be a bit all over the place with short text
  2. Some other tools are quicker

The langid Library

The langid library is another tool people like to use to figure out languages. It's made to be quick and spot on, and it can handle 97 different languages.

Installation

In order to install this library, you can use the pip installer as shown below:

Syntax:
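```shell
# Install langid from PyPI
pip install langid
```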

Basic usage:

Output:

 
Language found: fr, How sure: -54.41310358047485   

Remember, langid gives a score, not a probability. The raw scores are unnormalized log-probabilities, so values closer to zero (less negative) mean it's surer.

Handling Multiple Languages:

langid doesn't have a built-in way to spot multiple languages in one text. But you can split the text and check each part on its own:

Output:

 
Sentence: This is English
Detected language: en, Confidence: -54.41310358047485
Sentence: Das ist Deutsch
Detected language: de, Confidence: -40.72214221954346
Sentence: Esto es español
Detected language: es, Confidence: -44.98177528381348   

To set up the languages langid checks, you can use the following code:

This can make it more accurate if you know beforehand what languages might show up.

Pros about langid:

  1. It's super quick
  2. Handles a lot of languages (97)
  3. Can beat langdetect when it comes to short texts

Cons of langid:

  1. Doesn't give odds for different language options
  2. Confidence scores are hard to read (they're unnormalized log-probabilities, not probabilities)

The pycld2 Library

pycld2 wraps Google's Compact Language Detector 2 (CLD2) for Python. It's fast and on point with longer texts.

Installation

Getting pycld2 to work can be a pain since you need to compile it. In order to install this library, you can use the pip installer as shown below:

Syntax:
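```shell
# Install pycld2 from PyPI
pip install pycld2
```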

If this doesn't work, you might have to install it from the source or use a pre-made wheel.

Basic Usage:

Here's an easy example of how to use pycld2:

Output:

 
Is reliable: True
Text bytes found: 30
Details: (('JAPANESE', 'ja', 100, 1024.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0))   

The details tuple describes the top three detected languages. Each entry gives the language name, its code, how much of the text it covers (as a percentage), and a CLD2 score.

Handling Multiple Languages:

The pycld2 library can spot different languages in one piece of text:

Output:

 
Can trust: True
Bytes of text found: 54
Language: en, Name: ENGLISH, How sure: 33
Language: de, Name: GERMAN, How sure: 33
Language: es, Name: SPANISH, How sure: 33   

Using Different Modes:

The pycld2 library lets you spot languages in different ways:

This will show the language for each part of the text.

Pros about pycld2:

  1. It's super quick
  2. It can spot multiple languages in one chunk of text
  3. It gives you lots of details about the languages it finds

Cons about pycld2:

  1. Setting it up can be a pain on some computers
  2. It's limited to the languages CLD2 supports (around 160)

The fastText Library

The fastText library is a tool for learning word representations and classifying text. While people mostly use it for text classification and word embeddings, it also ships a pretrained model for figuring out what language something's in.

Installation

In order to install this library, you can use the pip installer as shown below:

Syntax:
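```shell
# Install fastText's Python bindings from PyPI
pip install fasttext
```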

Basic Usage:

First, you need to get the pretrained model to use fastText for figuring out languages:

Output:

 
We figured out the language: en
Confidence: 0.9999998807907104   

Handling Multiple Languages:

fastText can't spot multiple languages in one chunk of text right off the bat, but you can break it up and look at each bit:

Output:

 
Sentence: This is English
Language found: en, How sure: 0.9999998807907104
Sentence: Das ist Deutsch
Language found: de, How sure: 0.9999822378158569
Sentence: Esto es español
Language found: es, How sure: 0.9999998807907104   

To get guesses for many possible languages:

Output:

 
…
Language: en, Confidence: 0.9999998807907104
Language: de, Confidence: 1.0426505367578566e-07
Language: nl, Confidence: 4.515463705506921e-08
…   

Pros about fastText:

  1. It's super precise when it comes to longer bits of writing
  2. It's fast once you've loaded the model
  3. It can give you confidence scores for different language options

Cons about fastText:

  1. You need to download a big model file
  2. It takes a while to load the model at first
  3. It's not made just for figuring out languages (it's more of a general tool for sorting text)

Comparison and Conclusion:

These libraries all have strengths and weaknesses:

  • langdetect is flexible and simple to use, but it's not always right with short bits of text.
  • langid is quick and handles short text well, but its confidence scores aren't easy to understand.
  • pycld2 is super-fast and gets it right a lot with longer text, but it can be a pain to set up.
  • fastText is super accurate but you need to download a big model and it takes longer to start up.

The library you pick depends on what you need:

  • If you want to figure out languages in short texts, langid might be your best bet.
  • For longer stuff where you need speed, pycld2 could be just right.
  • If you want something easy to use and pretty accurate, langdetect is a good all-around choice.
  • If you need the most accurate results and don't care about the setup time, fastText might be the way to go.