Rosette Language Identifier


Instantly identify languages within large volumes of text to prepare for further analysis

Overview

What is language detection?

Global-minded companies must be able to work with data and content in dozens of languages. Whether you are searching text or deciding up front which language experts to hire for an eDiscovery project, Rosette identifies the language automatically. Only by detecting the language of a search query can you search the correct database or return results in the right language.

Why do I need it?

Text analysis is most accurate when it works natively within each language, so language identification is a prerequisite for applying the correct language-specific analyzer.

A critical two-line email may be written in French but carry a 12-line legal footer in English. That email might fool most language identifiers into tagging the whole message as English. Only by correctly tagging the language of each section can you unlock the information inside.

Short texts are also challenging, but ubiquitous: in social media, image captions, news headlines, email subject lines, tweets, metadata, keywords, queries, files, logs, and more.

Basis Technology leads the pack

Our language identifier outperforms most on the market in detecting the language of:

  • Short texts: such as tweets and queries (from as few as 1-3 words up to a full sentence).
  • Multilingual documents: Rosette recognizes the dominant language in a body of text, as well as smaller sections of text in different languages.
  • Transliterated texts: Languages that are not normally written in Latin script (such as Arabic) may appear either in their native script or in Latin characters. Rosette recognizes both.
  • Language boundary detection: Rosette flags the language regions within multilingual text.

Product Highlights

  • 55 languages
  • 18 language scripts (e.g. Latin, Cyrillic)
  • 188 language/encoding pairs
  • Identifies the dominant language of a document
  • Identifies different language regions within multilingual documents
  • Delivers high accuracy based on as few as 1-3 words

How It Works

Superior coverage of language, encodings, and scripts

Our language identifier achieves its incredible accuracy through the use of proprietary algorithms with information-rich language profiles derived from statistical analysis, in addition to language-specific methods for short text language detection.

Input

The input data may be in any of 364 language-encoding-script combinations, involving 55 languages, 48 encodings, and 18 writing scripts. The language identifier uses an n-gram algorithm to detect language. Each of the 155 built-in profiles contains the quad-grams (i.e., four consecutive bytes) that are most frequently encountered in documents in a given language, encoding, and script. The default number of n-grams is 10,000 for double-byte encodings and 5,000 for single-byte encodings.
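As a rough illustration of what such a profile might look like, the sketch below counts every four-byte sequence in a byte string and keeps only the most frequent ones. The function name `build_profile` and its parameters are hypothetical and chosen for this example; the actual profile format used by the product is not described here.

```python
from collections import Counter


def build_profile(data: bytes, n: int = 4, max_ngrams: int = 5000) -> dict:
    """Return the `max_ngrams` most frequent n-byte sequences in `data`.

    Hypothetical helper for illustration only. The default of 5,000 mirrors
    the single-byte default mentioned above; 10,000 would be the analogous
    value for double-byte encodings.
    """
    # Slide a window of n bytes across the input and count each sequence.
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    # Keep only the most frequent n-grams, mapped to their counts.
    return dict(counts.most_common(max_ngrams))


# The profile depends on the encoding as well as the language, because the
# n-grams are taken over bytes rather than characters.
sample = "le chat et le chien".encode("latin-1")
profile = build_profile(sample)
```

Because the n-grams are byte sequences, the same text in a different encoding generally yields a different profile, which is consistent with profiles being defined per language, encoding, and script.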

Technology

When input text is submitted for detection, a similar n-gram profile is built from that data. The input profile is then compared with all the built-in profiles by calculating a vector distance between the input profile and each built-in profile. The built-in language profiles are then returned in ascending order of distance, starting with the most likely language (i.e., the built-in profile with the shortest distance from the input text’s profile).
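The comparison step can be sketched as follows. The text above does not say which vector distance is used, so this example assumes cosine distance purely for illustration; the tiny `BUILTIN` profiles and all function names are hypothetical stand-ins, not the product's data or API.

```python
import math
from collections import Counter


def profile(text: str, n: int = 4) -> Counter:
    """Count the n-byte sequences in the UTF-8 encoding of `text`."""
    data = text.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def cosine_distance(a: Counter, b: Counter) -> float:
    """Assumed distance measure: 1 minus the cosine similarity of the counts."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty profile is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def rank_languages(text: str, builtin: dict) -> list:
    """Return (language, distance) pairs, shortest distance first."""
    query = profile(text)
    scored = [(lang, cosine_distance(query, p)) for lang, p in builtin.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy "built-in" profiles made from one sentence each (illustrative only).
BUILTIN = {
    "eng": profile("the quick brown fox jumps over the lazy dog and then the cat"),
    "fra": profile("le renard brun saute par-dessus le chien paresseux et le chat"),
}

ranking = rank_languages("the dog and the cat", BUILTIN)
```

Here `ranking[0]` holds the closest profile. Real profiles would be built from large corpora rather than a single sentence.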

Confidence

Our language identifier returns a confidence score with each language result, ranging from 0 to 1. It is a measurement that you can use as a threshold for flagging results that are “too close to be sure.”
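One way to apply such a threshold is sketched below. The function name `is_ambiguous` and the cutoff values are assumptions made for this example, not recommended settings. The first two confidence values are taken from the sample response shown on this page.

```python
def is_ambiguous(detections, min_confidence=0.5, min_margin=0.1):
    """Flag a result as "too close to be sure".

    `detections` is a list of {"language": ..., "confidence": ...} dicts.
    Both thresholds are illustrative assumptions; tune them on your own data.
    """
    ranked = sorted(detections, key=lambda d: d["confidence"], reverse=True)
    if not ranked:
        return True  # nothing detected, so nothing to trust
    top = ranked[0]["confidence"]
    runner_up = ranked[1]["confidence"] if len(ranked) > 1 else 0.0
    # Ambiguous if the best score is weak or barely ahead of the second best.
    return top < min_confidence or (top - runner_up) < min_margin


close_call = [
    {"language": "spa", "confidence": 0.38719602327387076},
    {"language": "eng", "confidence": 0.32699986625091865},
]
clear_case = [
    {"language": "eng", "confidence": 0.95},
    {"language": "fra", "confidence": 0.02},
]

flag_close = is_ambiguous(close_call)
flag_clear = is_ambiguous(clear_case)
```

Checking the margin between the top two candidates, in addition to the top score itself, catches cases where two related languages score almost equally.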

Language Boundary Locator

Digital text is often composed of multiple languages within the same document, presenting a challenge to computers and humans alike. Rosette Language Identifier (RLI) enriches the text with start and end markers for each language region within multilingual documents, even if all the languages are written in the same script, as with English, French, German, and Italian. Boundaries between writing systems, such as Latin, Cyrillic, Japanese kana, and Chinese hanzi, are also detected.
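To make the idea of start and end markers concrete, the sketch below merges consecutive segments that share a language into regions with character offsets. The `Region` type and `merge_regions` function are hypothetical and do not represent the product's output format; the language labels are supplied by hand rather than detected.

```python
from dataclasses import dataclass


@dataclass
class Region:
    start: int     # offset of the first character in the region
    end: int       # offset one past the last character in the region
    language: str  # language code for the region


def merge_regions(segments):
    """Merge adjacent same-language segments into regions.

    `segments` is a list of (text, language) pairs in document order, where
    the document is the concatenation of the segment texts.
    """
    regions = []
    offset = 0
    for text, language in segments:
        end = offset + len(text)
        if regions and regions[-1].language == language:
            regions[-1].end = end  # extend the current region
        else:
            regions.append(Region(offset, end, language))
        offset = end
    return regions


# Hand-labeled segments standing in for detector output.
segments = [
    ("Bonjour à tous. ", "fra"),
    ("Merci de votre réponse. ", "fra"),
    ("This email is confidential.", "eng"),
]
regions = merge_regions(segments)
```

In this example the two French sentences collapse into one region and the English footer becomes a second region that starts where the French region ends.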

Short string language detection

For a number of languages, the language identifier uses additional proprietary algorithms for detecting the language of short strings (140 characters or fewer).

Tech Specs

Availability and Platform Support

Deployment availability:
Plugins:
Bindings:

Supported Languages

Albanian                  German                    Macedonian                Somali
Arabic                    Greek                     Malay                     Spanish
Arabic (transliterated)   Gujarati                  Malayalam                 Swedish
Bengali                   Hebrew                    Norwegian                 Tagalog
Bulgarian                 Hindi                     Pashto                    Tamil
Catalan                   Hungarian                 Pashto (transliterated)   Telugu
Chinese, Simp.            Icelandic                 Persian                   Thai
Chinese, Trad.            Indonesian                Persian (transliterated)  Turkish
Croatian                  Italian                   Polish                    Ukrainian
Czech                     Japanese                  Portuguese                Urdu
Danish                    Kannada                   Romanian                  Urdu (transliterated)
Dutch                     Korean                    Russian                   Uzbek
English                   Kurdish                   Serbian                   Uzbek (transliterated)
Estonian                  Kurdish (transliterated)  Serbian (transliterated)  Vietnamese
Finnish                   Latvian                   Slovak
French                    Lithuanian                Slovenian

Short String Languages

Arabic            Finnish           Japanese          Russian
Chinese, Simp.    French            Korean            Spanish
Chinese, Trad.    German            Norwegian         Swedish
Czech             Greek             Pashto            Thai
Danish            Hebrew            Persian           Turkish
Dutch             Hungarian         Portuguese
English           Italian           Romanian

Try the Demo

Cloud API

Easy to use API

Ideal for product evaluation, academic research, and smaller, cost-conscious businesses, our fast and powerful API is instantly accessible and free to get started. The language ID endpoint identifies the dominant language within a document. For multilingual documents, send text through the sentence tagger endpoint and then feed a sentence at a time to the language ID endpoint. Or, ask about our on-premise deployments.
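The multilingual workflow described above (split into sentences, then identify each sentence) can be sketched offline as follows. Both `naive_split` and `toy_detect` are hypothetical stand-ins for the sentence tagger and language ID endpoints. They make no network calls and are far cruder than the real services; in practice you would pass in functions that call the API through the binding of your choice.

```python
import re


def detect_per_sentence(text, split_sentences, detect_language):
    """Run `detect_language` on each sentence from `split_sentences`.

    Both callables are injected so that real API calls can replace the
    stand-ins below without changing this function.
    """
    return [(sentence, detect_language(sentence))
            for sentence in split_sentences(text)]


def naive_split(text):
    """Stand-in sentence splitter: break after ., ! or ? followed by spaces."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def toy_detect(sentence):
    """Stand-in detector: count a few common words per language.

    Ties (including no matches at all) resolve to the first key, "fra".
    """
    words = set(re.findall(r"\w+", sentence.lower()))
    scores = {
        "fra": len(words & {"le", "la", "est", "et", "merci"}),
        "eng": len(words & {"the", "is", "and", "this", "thanks"}),
    }
    return max(scores, key=scores.get)


email = "Merci et bonjour. This is the footer."
tagged = detect_per_sentence(email, naive_split, toy_detect)
```

Passing the splitter and detector in as arguments keeps the loop independent of any particular client library, so the same code can be exercised in tests without an API key.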

Try the language identifier and the rest of Rosette API’s endpoints for free, up to 10,000 calls/month!

Get an API Key

Quality documentation and support

Customers love our thorough and responsive support team. We also provide in-depth documentation that lists all the features and functions of the various API endpoints alongside examples in the binding of your choice.

Visit our GitHub for the bindings and documentation.

Enterprise ready

Evaluate Rosette’s functional fit with your business and data needs on our cloud API, knowing that scalable, customizable, on-premise deployments are available if you need them.

{
  "languageDetections": [
    {
      "language": "spa",
      "confidence": 0.38719602327387076
    },
    {
      "language": "eng",
      "confidence": 0.32699986625091865
    },
    {
      "language": "por",
      "confidence": 0.05569054210624943
    },
    {
      "language": "deu",
      "confidence": 0.030069489878380328
    },
    {
      "language": "swe",
      "confidence": 0.027734757034048835
    }
  ]
}

On Premise

Customize and scale your text analytics on premise

For organizations with vast data quantities, unique integration needs, and data security restrictions, we provide on-premise API deployment and SDKs to be hosted on your internal servers.

On-premise language identification can both identify the dominant language of an entire document and detect the language regions in multilingual documents.

Request product evaluation

If your organization requires an on-premise solution, we’re happy to work with you to meet your business’s unique needs. For more in-depth evaluations, please complete the form below and our Customer Engineering team will provide you with an on-premise evaluation package.

Drop us a line

EMAIL:
info@basistech.com

PHONE:
+1-617-386-2000

No coding required

RapidMiner is the industry’s #1 predictive analytics platform. The client platform, RapidMiner Studio, empowers organizations to easily prep data, create models and operationalize predictive analytics within any business process.

Try RapidMiner