4 Python Libraries to Detect English and Non-English Language

Python has a rich set of libraries and tools for NLP, and several of them can detect the language of a piece of text. In this guide, we'll look at four Python libraries that can tell English from non-English text: langdetect, langid, pycld2, and fastText.
Let's take a closer look at each of these libraries.

The langdetect Library

The langdetect library is a well-known Python library for language detection. It is a Python port of Google's language-detection library, which was written in Java. It recognizes 55 languages and works best on longer pieces of text.

Installation

You can install it with pip:

Syntax: pip install langdetect

Basic usage:

A basic detection run lists each candidate language with its probability:

Output:
Language: de Probability: 0.5714285714285714
Language: en Probability: 0.42857142857142855

For text that contains more than one language, the result has one entry per candidate language:

Output:
Language: en, Probability: 0.5714285714285714
Language: es, Probability: 0.2857142857142857
Language: de, Probability: 0.1428571428571428

Pros about langdetect: it supports 55 languages, returns a probability for each candidate language, and works well with longer text.
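Cons about langdetect: results are less reliable on short or ambiguous text, and detection is non-deterministic unless you fix the seed (DetectorFactory.seed), so the same text can give different results between runs.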
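The original code samples for this section are not shown, so here is a minimal sketch of how an English check with langdetect might look. It assumes langdetect is installed; the helper names (langdetect_candidates, is_english) and the 0.9 threshold are made up for illustration. detect_langs and DetectorFactory.seed are part of langdetect's public API.

```python
from typing import Callable, List, Tuple

Candidates = List[Tuple[str, float]]


def langdetect_candidates(text: str) -> Candidates:
    """Return (language code, probability) pairs from langdetect."""
    # Imported lazily so that is_english() can be exercised with a stub
    # detector when langdetect is not installed.
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect is non-deterministic by default; a fixed seed makes
    # repeated runs on the same text return the same result.
    DetectorFactory.seed = 0
    try:
        return [(c.lang, c.prob) for c in detect_langs(text)]
    except LangDetectException:
        # Raised when the text has no usable features (e.g. empty input).
        return []


def is_english(
    text: str,
    candidates_fn: Callable[[str], Candidates] = langdetect_candidates,
    threshold: float = 0.9,  # hypothetical cut-off, tune for your data
) -> bool:
    """True when the most probable language is English above the threshold."""
    candidates = candidates_fn(text)
    if not candidates:
        return False
    lang, prob = max(candidates, key=lambda pair: pair[1])
    return lang == "en" and prob >= threshold
```

With langdetect installed, is_english("some text") uses the real detector. Passing a different candidates_fn lets you swap in another library or a stub without touching the decision logic.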
The langid Library

The langid library is another popular tool for language identification. It is designed to be fast and accurate, and it supports 97 languages.

Installation

You can install it with pip:

Syntax: pip install langid

Basic usage:

A basic classification returns a language code and a score:

Output:
Language found: fr
How sure: -54.41310358047485

Remember that the score langid returns by default is not a probability. It is an unnormalized log-probability, so it is usually negative, and a more negative value does not mean higher confidence. It is only useful for comparing candidate languages for the same text. If you want a confidence between 0 and 1, build a LanguageIdentifier with norm_probs=True.

Handling Multiple Languages:

langid doesn't have a built-in way to detect multiple languages in one text, but you can split the text and classify each part on its own:

Output:
Sentence: This is English
Detected language: en, Confidence: -54.41310358047485
Sentence: Das ist Deutsch
Detected language: de, Confidence: -40.72214221954346
Sentence: Esto es español
Detected language: es, Confidence: -44.98177528381348

To restrict the languages langid considers, call langid.set_languages with a list of language codes. This can make detection more accurate if you know beforehand which languages might show up.

Pros about langid: it is fast, supports 97 languages, and lets you restrict the set of candidate languages.
Cons of langid: it has no built-in support for mixed-language text, and its default score is not a probability.
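As a rough sketch of the ideas in this section, the code below builds a langid identifier with normalized probabilities, optionally restricts its languages, and classifies a text sentence by sentence. It assumes langid is installed; the helper names and the very simple regex-based sentence splitter are illustrative choices, not part of langid.

```python
import re
from typing import Callable, List, Optional, Tuple


def make_langid_classifier(languages: Optional[List[str]] = None):
    """Return a classify(text) -> (language code, probability) function."""
    # Imported lazily so the pure-Python helpers below work without langid.
    from langid.langid import LanguageIdentifier, model

    # norm_probs=True turns the raw log-probability into a value in [0, 1].
    identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
    if languages:
        # Only consider these language codes, e.g. ["en", "de", "es"].
        identifier.set_languages(languages)
    return identifier.classify


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def classify_sentences(
    text: str, classify: Callable[[str], Tuple[str, float]]
) -> List[Tuple[str, str, float]]:
    """Classify each sentence separately as (sentence, language, score)."""
    return [(sentence, *classify(sentence)) for sentence in split_sentences(text)]
```

For example, classify_sentences(text, make_langid_classifier(["en", "de", "es"])) gives one result per sentence. The splitter is deliberately simple and will mis-split abbreviations, so a proper sentence tokenizer may be a better fit for real text.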
The pycld2 Library

pycld2 wraps Google's Compact Language Detector 2 (CLD2) for Python. It is fast and accurate on longer texts.

Installation

Getting pycld2 to work can be a pain because it has to be compiled. Try the pip installer first:

Syntax: pip install pycld2

If this doesn't work, you might have to install it from source or use a prebuilt wheel.

Basic Usage:

A basic call to detect returns a reliability flag, the number of text bytes found, and a details tuple:

Output:
Is reliable: True
Text bytes found: 30
Details: (('JAPANESE', 'ja', 100, 1024.0), ('Unknown', 'un', 0, 0.0), ('Unknown', 'un', 0, 0.0))

The details tuple has info about the top three detected languages. Each entry gives the language name, the language code, the percentage of the text attributed to that language, and a CLD2 score.

Handling Multiple Languages:

The pycld2 library can detect different languages in one piece of text:

Output:
Can trust: True
Bytes of text found: 54
Language: en, Name: ENGLISH, How sure: 33
Language: de, Name: GERMAN, How sure: 33
Language: es, Name: SPANISH, How sure: 33

Using Different Modes:

pycld2 can also report results per span. With returnVectors=True, detect additionally returns the language for each part of the text.

Pros about pycld2: it is fast, accurate on longer texts, and can detect several languages in a single text.
Cons about pycld2: installation can be difficult because the package has to be compiled.
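Here is a minimal sketch of working with pycld2's result. It assumes pycld2 is installed and that each entry in the details tuple is laid out as (language name, language code, percent, score), as in the pycld2 README. The helper names and the 50 percent cut-off are hypothetical.

```python
from typing import Iterable, List, Tuple

# One entry of pycld2's details tuple: (name, code, percent, score).
Detail = Tuple[str, str, int, float]


def detect_with_cld2(text: str):
    """Run CLD2 and return (is_reliable, bytes_found, details)."""
    # Imported lazily so the helpers below work without pycld2 installed.
    import pycld2 as cld2

    is_reliable, bytes_found, details = cld2.detect(text)
    return is_reliable, bytes_found, details


def known_languages(details: Iterable[Detail]) -> List[Tuple[str, int]]:
    """Keep (code, percent) pairs, dropping the 'un' (Unknown) placeholders."""
    return [
        (code, percent)
        for _name, code, percent, _score in details
        if code != "un"
    ]


def is_mostly_english(details: Iterable[Detail], min_percent: int = 50) -> bool:
    """True when English covers at least min_percent of the text."""
    return any(
        code == "en" and percent >= min_percent
        for code, percent in known_languages(details)
    )
```

A typical call would be is_reliable, _, details = detect_with_cld2(text), followed by is_mostly_english(details). Checking is_reliable before trusting the result is a sensible extra step for short inputs.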
The fastText Library

fastText is a library for learning word representations and classifying sentences. People mostly use it for text classification and word embeddings, but it also offers a pretrained model for language identification.

Installation

You can install it with pip:

Syntax: pip install fasttext

Basic Usage:

First, you need to download the pretrained language identification model. Once the model is loaded, a prediction returns a language and a confidence:

Output:
Detected language: en
Confidence: 0.9999998807907104

Handling Multiple Languages:

fastText can't detect multiple languages in one chunk of text out of the box, but you can split the text and classify each part:

Output:
Sentence: This is English
Language found: en
How sure: 0.9999998807907104
Sentence: Das ist Deutsch
Language found: de
How sure: 0.9999822378158569
Sentence: Esto es español
Language found: es
How sure: 0.9999998807907104

To get predictions for several candidate languages, ask the model for more than one result:

Output:
…
Language: en, Confidence: 0.9999998807907104
Language: de, Confidence: 1.0426505367578566e-07
Language: nl, Confidence: 4.515463705506921e-08
…

Pros about fastText: it gives high-confidence predictions even on the short sentences shown above, and it can return several ranked candidates.
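Cons about fastText: you have to download the pretrained model separately, and there is no built-in support for mixed-language text.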
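The following is a minimal sketch of language identification with fastText. It assumes the fasttext package is installed and that a pretrained language identification model has already been downloaded. The default path lid.176.bin is an assumption about where that file lives, and the helper names and the 0.9 threshold are illustrative.

```python
from typing import List, Tuple

LABEL_PREFIX = "__label__"


def strip_label(label: str) -> str:
    """Turn a fastText label such as '__label__en' into 'en'."""
    if label.startswith(LABEL_PREFIX):
        return label[len(LABEL_PREFIX):]
    return label


def load_predictor(model_path: str = "lid.176.bin"):
    """Load the model and return predict(text, k) -> [(language, probability)]."""
    # Imported lazily so the helpers in this sketch work without fasttext.
    import fasttext

    model = fasttext.load_model(model_path)

    def predict(text: str, k: int = 1) -> List[Tuple[str, float]]:
        # fastText predicts one line at a time, so newlines are replaced.
        labels, probs = model.predict(text.replace("\n", " "), k=k)
        return [(strip_label(label), float(prob)) for label, prob in zip(labels, probs)]

    return predict


def top_is_english(predictions: List[Tuple[str, float]], threshold: float = 0.9) -> bool:
    """True when the first (most probable) prediction is English above threshold."""
    if not predictions:
        return False
    lang, prob = predictions[0]
    return lang == "en" and prob >= threshold
```

A typical use would be predict = load_predictor() followed by top_is_english(predict("some text", k=3)). Loading the model once and reusing the returned function avoids reloading the file for every call.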
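Comparison and Conclusion:

These libraries all have strengths and weaknesses, and the library you pick depends on what you need. langdetect and pycld2 work best on longer text. langid is fast and covers 97 languages. pycld2 can detect several languages in a single text but can be harder to install. fastText needs a separately downloaded pretrained model.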