4 Python Libraries to Detect English and Non-English Language

Python has a bunch of great libraries and tools for NLP, which give us some cool ways to detect languages. In this guide, we'll check out four Python libraries that can tell English from non-English:

  1. langdetect
  2. langid
  3. pycld2
  4. fastText

Let's take a closer look at each of these libraries.

The langdetect Library

The langdetect library is a well-known Python library for spotting languages. It's a Python port of Google's language-detection library, originally written in Java. This library can recognize 55 languages and works well with longer pieces of text.

Installation

In order to install, you can use the pip installer as shown below:

Syntax:
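```shell
# Install langdetect from PyPI
pip install langdetect
```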

Basic usage:

Here's an easy way to use langdetect:

Output:

 
Language: de, Probability: 0.5714285714285714
Language: en, Probability: 0.42857142857142855   

To handle text with more than one language:

Output:

 
Language: en, Probability: 0.5714285714285714
Language: es, Probability: 0.2857142857142857
Language: de, Probability: 0.1428571428571428   

Pros about langdetect:

  1. It can figure out lots of languages (55)
  2. It does well with longer pieces of text
  3. It tells you how sure it is about each language it spots

Cons about langdetect:

  1. It can be a bit all over the place with short text
  2. Some other tools are quicker

The langid Library

The langid library is another tool people like to use to figure out languages. It's made to be quick and spot on, and it can handle 97 different languages.

Installation

In order to install this library, you can use the pip installer as shown below:

Syntax:
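```shell
# Install langid from PyPI
pip install langid
```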

Basic usage:

Output:

 
Language found: fr, How sure: -54.41310358047485   

Remember, langid gives a score, not a probability. The raw scores are unnormalized log-probabilities, so values closer to zero (less negative) mean it's surer.

Handling Multiple Languages:

langid doesn't have a built-in way to spot multiple languages in one text. But you can split the text and check each part on its own:

Output:

 
Sentence: This is English
Detected language: en, Confidence: -54.41310358047485
Sentence: Das ist Deutsch
Detected language: de, Confidence: -40.72214221954346
Sentence: Esto es español
Detected language: es, Confidence: -44.98177528381348   

To set up the languages langid checks, you can use the following code:

This can make it more accurate if you know beforehand what languages might show up.

Pros about langid:

  1. It's super quick
  2. Handles a lot of languages (97)
  3. Can beat langdetect when it comes to short texts

Cons of langid:

  1. Doesn't give odds for different language options
  2. Confidence scores are hard to read (they're unnormalized log-probabilities, not probabilities)

The pycld2 Library

pycld2 wraps Google's Compact Language Detector 2 (CLD2) for Python. It's fast and on point with longer texts.

Installation

Getting pycld2 to work can be a pain since you need to compile it. In order to install this library, you can use the pip installer as shown below:

Syntax:
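```shell
# Install pycld2 from PyPI
pip install pycld2
```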

If this doesn't work, you might have to install it from the source or use a pre-made wheel.

Basic Usage:

Here's an easy example of how to use pycld2:

Output:

 
Is reliable: True
Text bytes found: 30
Details: (('JAPANESE', 'ja', 100, 1024.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0))   

The details tuple describes the top three detected languages. Each entry gives the language name, its code, how much of the text it covers (as a percentage), and a CLD2 score.

Handling Multiple Languages:

The pycld2 library can spot different languages in one piece of text:

Output:

 
Can trust: True
Bytes of text found: 54
Language: en, Name: ENGLISH, How sure: 33
Language: de, Name: GERMAN, How sure: 33
Language: es, Name: SPANISH, How sure: 33   

Using Different Modes:

The pycld2 library lets you spot languages in different ways:

This will show the language for each part of the text.

Pros about pycld2:

  1. It's super quick
  2. It can spot multiple languages in one chunk of text
  3. It gives you lots of details about the languages it finds

Cons about pycld2:

  1. Setting it up can be a pain on some computers
  2. It's limited to the languages CLD2 supports (around 160)

The fastText Library

The fastText library is a tool for learning word representations and classifying text. While people mostly use it for text classification and word embeddings, it also ships a pretrained model for figuring out what language something's in.

Installation

In order to install this library, you can use the pip installer as shown below:

Syntax:
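```shell
# Install fastText's Python bindings from PyPI
pip install fasttext
```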

Basic Usage:

First, you need to get the pretrained model to use fastText for figuring out languages:

Output:

 
We figured out the language: en
Confidence: 0.9999998807907104   

Handling Multiple Languages:

fastText can't spot multiple languages in one chunk of text right off the bat, but you can break it up and look at each bit:

Output:

 
Sentence: This is English
Language found: en, How sure: 0.9999998807907104
Sentence: Das ist Deutsch
Language found: de, How sure: 0.9999822378158569
Sentence: Esto es español
Language found: es, How sure: 0.9999998807907104   

To get guesses for many possible languages:

Output:

 
…
Language: en, Confidence: 0.9999998807907104
Language: de, Confidence: 1.0426505367578566e-07
Language: nl, Confidence: 4.515463705506921e-08
…   

Pros about fastText:

  1. It's super precise when it comes to longer bits of writing
  2. It's fast once you've loaded the model
  3. It can give you confidence scores for different language options

Cons about fastText:

  1. You need to download a big model file
  2. It takes a while to load the model at first
  3. It's not made just for figuring out languages (it's more of a general tool for sorting text)

Comparison and Conclusion:

These libraries all have strengths and weaknesses:

  • langdetect is flexible and simple to use, but it's not always right with short bits of text.
  • langid is quick and handles short text well, but its confidence scores aren't easy to understand.
  • pycld2 is super-fast and gets it right a lot with longer text, but it can be a pain to set up.
  • fastText is super accurate but you need to download a big model and it takes longer to start up.

The library you pick depends on what you need:

  • If you want to figure out languages in short texts, langid might be your best bet.
  • For longer stuff where you need speed, pycld2 could be just right.
  • If you want something easy to use and pretty accurate, langdetect is a good all-around choice.
  • If you need the most accurate results and don't care about the setup time, fastText might be the way to go.