Rosette Base Linguistics


Text analytics fundamentals to prepare your data for analysis. Language-specific tools for tokenization, part-of-speech tagging, lemmatization, decompounding, and Chinese and Japanese readings for your input.

Overview

Search many languages with high accuracy

Every language, including English, presents unique and difficult challenges for search applications that aim to deliver relevant and precise results. Rosette® Base Linguistics (RBL) enables enterprise applications to effectively search or process text in many languages by providing a complete set of linguistic services. RBL enriches the original text in its native language for best-in-class natural language processing, improving both speed and accuracy.

What is base linguistics?

Base linguistics refers to the core morphological building blocks that prepare your text for further analysis, and allow you to effectively search or process text in many languages, including tokens, lemmas, parts of speech and more. Our base linguistics tools enrich your original text in its native language for best-in-class natural language processing, improving speed and accuracy.

The leaders in multilingual search

Intelligent, successful search is about semantics. People want to enter a real query in human language and get an answer. A word like ‘spoke’, meaning part of a bicycle wheel, is easily confused with the past tense of the verb ‘to speak’. While open source platforms now provide the basic framework for inverted full-text search engines, the challenges of accurate search are compounded as you add more languages to the queries and results. Rosette provides the tools you need to search across 40 languages.

Product Highlights

  • 28 supported languages
  • Sentence tagging
  • Tokenization
  • Lemmatization
  • Part-of-speech tagging
  • Decompounding
  • Chinese/Japanese readings
  • Intuitive cloud API
  • Customizable SDK
  • Fast and scalable
  • Industrial-strength support
  • Constantly stress-tested and improved

Language-Specific Features

How It Works

Part-of-speech tagging

As part of the lemmatization process, statistical modeling is used to determine the correct part of speech, even for ambiguous words. Each token is then tagged for enhanced comprehension and search relevancy. Because different languages have different grammars, the set of part-of-speech tags differs from language to language.

Our base linguistics tools support the Universal POS tag standard, from which developers can map to Penn Treebank or other POS tag systems.
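As a rough illustration of such a mapping, the sketch below lists candidate Penn Treebank tags for a few universal tags. The tag names (NOUN, VERB, and so on) and the table itself are assumptions made for this example, not Rosette output or an official conversion; a universal tag is coarser than a Penn tag, so the lookup can only narrow the choice to a set of candidates.

```python
# Illustrative, partial mapping from coarse universal tags to Penn Treebank
# candidates. This table is an assumption for the example, not a Rosette API.
UPOS_TO_PENN = {
    "NOUN": ("NN", "NNS"),
    "PROPN": ("NNP", "NNPS"),
    "VERB": ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ"),
    "ADJ": ("JJ", "JJR", "JJS"),
    "ADV": ("RB", "RBR", "RBS"),
    "NUM": ("CD",),
    "CCONJ": ("CC",),
    "INTJ": ("UH",),
}


def penn_candidates(upos_tag: str) -> tuple:
    """Return the Penn Treebank tags a universal tag could correspond to.

    An empty tuple means the tag is not covered by this small table.
    """
    return UPOS_TO_PENN.get(upos_tag.upper(), ())


if __name__ == "__main__":
    for tag in ("NOUN", "VERB", "XYZ"):
        print(tag, "->", penn_candidates(tag))
```

Picking a single Penn tag from the candidates needs extra evidence, such as the surface form or the lemma (for example, a NOUN whose token differs from its lemma is likely a plural).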

Decompounding

Decompounding breaks down compound words into sub-components and delivers each individual element to be indexed. This is especially useful for increasing search relevancy in languages such as German and Korean.

Example: German

Samstagmorgen is a compound word formed with Samstag (Saturday) and morgen (morning). Decompounding allows for an appropriate match when searching for “Samstag”.
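To make the idea concrete, here is a minimal dictionary-based sketch of decompounding. It is not how Rosette works internally; the tiny lexicon, the function name `decompound`, and the `min_len` parameter are all hypothetical, and the sketch ignores German linking elements such as the connecting "s".

```python
# Toy lexicon of known component words (an assumption for this example).
LEXICON = {"samstag", "morgen"}


def decompound(word, lexicon, min_len=3):
    """Split a compound into known components, longest prefix first.

    Returns the lowercased components, or the lowercased word itself
    when no full split into lexicon entries exists.
    """
    word = word.lower()

    def split(rest):
        if not rest:
            return []  # consumed the whole word: successful split
        # Try the longest candidate prefix first, then backtrack.
        for end in range(len(rest), min_len - 1, -1):
            head = rest[:end]
            if head in lexicon:
                tail = split(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None  # no lexicon entry fits here

    parts = split(word)
    return parts if parts else [word]


if __name__ == "__main__":
    parts = decompound("Samstagmorgen", LEXICON)
    # Index the full compound and each component so a query for
    # "Samstag" can match the document.
    index_terms = ["samstagmorgen"] + parts
    print(index_terms)
```

Indexing both the compound and its components is one possible design; it keeps exact-match queries working while still allowing component matches.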

Chinese & Japanese readings

Our base linguistics functionality understands the difference between Chinese and Japanese text when they are written in Han script, and accurately returns the pronunciation information. For Chinese text in hanzi, Rosette returns the pronunciation information in pinyin transcriptions. For Japanese content, Rosette returns furigana transcriptions in katakana. For example, if you call Rosette with “医療番組”, it will return this reading: “イリョウ”, “バングミ”.
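The short sketch below shows one way an application might consume such readings. It assumes the input was segmented into the two tokens 医療 and 番組 (the two readings above suggest this, but the token list is an assumption), and it adds a hypothetical helper, `katakana_to_hiragana`, for applications that prefer to display furigana in hiragana. The helper relies only on the fixed offset between the Unicode katakana and hiragana blocks.

```python
# Katakana letters U+30A1..U+30F6 sit exactly 0x60 above their
# hiragana counterparts U+3041..U+3096.
KATAKANA_START, KATAKANA_END = 0x30A1, 0x30F6
OFFSET = 0x60


def katakana_to_hiragana(text: str) -> str:
    """Convert katakana letters to hiragana; leave other characters as-is."""
    return "".join(
        chr(ord(ch) - OFFSET) if KATAKANA_START <= ord(ch) <= KATAKANA_END else ch
        for ch in text
    )


# Assumed segmentation of the example input, paired with the readings
# quoted in the text above.
tokens = ["医療", "番組"]
readings = ["イリョウ", "バングミ"]

annotated = [(tok, rd, katakana_to_hiragana(rd)) for tok, rd in zip(tokens, readings)]

if __name__ == "__main__":
    for token, katakana, hiragana in annotated:
        print(token, katakana, hiragana)
```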

Lemmatization

Most search engines use a crude method of chopping off characters at the end of a word in the hope of finding the root form. This method, called stemming, often increases recall but reduces precision, because it associates unrelated words such as arsenic and arsenal, which share a stem (arsen). Instead, our base linguistics tools find the true dictionary form of each word, known as a lemma, by using vocabulary, context, and advanced morphological analysis. Indexing the lemma increases search precision and recall, and slims the search index by not indexing all inflected forms. Alternative lemmas are also made available to supplement indexing.

Example: English

Linguistic analysis is useful for every language; lemmatization for English improves recall and precision.

Challenge                                      Query      Stem    Lemma
Two unrelated words may share a stem.          animals    anim    animal
                                               animated   anim    animate
Stemming may deliver unintended results.       several    sever   several
Irregular verbs and nouns stump the stemmer.   spoke      spoke   speak (v.)
                                                                  spoke (n.)
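The contrast in the table can be reproduced with a toy example. The suffix stripper below is a deliberately crude stand-in for a real stemmer, and the lemma dictionary is a hand-written assumption covering only the words in the table; neither represents Rosette's implementation, which relies on vocabulary, context, and morphological analysis.

```python
# Suffixes are checked in order, longest first (toy rule set, assumed).
SUFFIXES = ("ated", "als", "al")

# Hand-written lemma table keyed by (word, part of speech); an assumption
# that covers only the examples from the table above.
LEMMAS = {
    ("animals", "NOUN"): "animal",
    ("animated", "VERB"): "animate",
    ("several", "ADJ"): "several",
    ("spoke", "VERB"): "speak",
    ("spoke", "NOUN"): "spoke",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word: str, pos: str) -> str:
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


if __name__ == "__main__":
    for word, pos in LEMMAS:
        print(f"{word:10} {pos:5} stem={toy_stem(word):7} lemma={toy_lemma(word, pos)}")
```

Note how the stemmer collapses "animals" and "animated" into the same index term, while the lemma lookup keeps them apart and also separates the two senses of "spoke" once the part of speech is known.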

Tokenization

Many search tools use n-grams to break up text into overlapping strings of characters to create a search index in languages written without spaces between words. N-grams result in a larger index and reduced precision. Our tools, in contrast, accurately identify and separate each word through advanced statistical modeling. The resulting token output (also known as segmentation) minimizes index size, enhances search accuracy, and increases relevancy.
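A small sketch makes the difference visible. It generates character bigrams for a short Japanese string and compares them with a word-level segmentation; the segmentation shown is written by hand as an assumption for the example, not produced by Rosette.

```python
def char_ngrams(text: str, n: int = 2) -> list:
    """Return all overlapping character n-grams of the text, in order."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]


text = "医療番組"

bigrams = char_ngrams(text, 2)
# Hand-written word segmentation (assumed for illustration).
words = ["医療", "番組"]

# Bigrams that straddle a word boundary become index terms that do not
# correspond to any real word, which is where precision is lost.
spurious = [gram for gram in bigrams if gram not in words]

if __name__ == "__main__":
    print("bigrams: ", bigrams)
    print("words:   ", words)
    print("spurious:", spurious)
```

Even for this four-character string the bigram index holds three terms against two real words, and the extra term can match queries it should not.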

Tech Specs

Availability and Platform Support

Deployment Availability:
Plugins:
Bindings:

Supported Languages

Arabic                 English   Hungarian   Persian      Swedish
Chinese, Simplified    Finnish   Italian     Polish       Thai
Chinese, Traditional   French    Japanese    Portuguese   Turkish
Czech                  German    Korean      Romanian     Urdu
Danish                 Greek     Norwegian   Russian
Dutch                  Hebrew    Pashto      Spanish

Cloud API

Easy to use API

Ideal for product evaluation, academic research, and smaller, cost-conscious businesses, our fast and powerful API is instantly accessible and free to get started. The tokenization and sentences endpoints break your text into word components and sentences, and the morphological analysis endpoint provides POS tagging, lemmatization, decompounding, and Chinese/Japanese readings.

Try base linguistics and the rest of Rosette’s endpoints, free up to 10,000 calls/month!

Get an API Key

Quality documentation and support

Customers love our thorough and responsive support team. We also provide in-depth documentation that lists all the features and functions of the various API endpoints alongside examples in the binding of your choice.

Visit our GitHub for the binding and documentation.

Enterprise ready

Evaluate Rosette’s functional fit with your business and data needs on our cloud API knowing that scalable, customizable, on-premise deployments are available if you need them.

{
  "tokens": [
    "The",
    "fact",
    "is",
    "that",
    "the",
    "geese",
    "just",
    "went",
    "back",
    "to",
    "get",
    "a",
    "rest",
    "and",
    "I",
    "'m",
    "not",
    "banking",
    "on",
    "their",
    "return",
    "soon"
  ],
  "lemmas": [
    "the",
    "fact",
    "be",
    "that",
    "the",
    "goose",
    "just",
    "go",
    "back",
    "to",
    "get",
    "a",
    "rest",
    "and",
    "I",
    "be",
    "not",
    "bank",
    "on",
    "they",
    "return",
    "soon"
  ]
}
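A response shaped like the one above can be consumed with a few lines of code. The sketch below parses a shortened excerpt of that response (five consecutive tokens, lowercased "the" onward) and pairs each token with its lemma; the function name `token_lemma_pairs` is hypothetical, and the code assumes only the two parallel arrays shown, not any other field of the API.

```python
import json

# Shortened excerpt of the sample response above.
RESPONSE_EXCERPT = """
{
  "tokens": ["the", "geese", "just", "went", "back"],
  "lemmas": ["the", "goose", "just", "go", "back"]
}
"""


def token_lemma_pairs(payload: str) -> list:
    """Pair each token with the lemma at the same position."""
    data = json.loads(payload)
    tokens, lemmas = data["tokens"], data["lemmas"]
    if len(tokens) != len(lemmas):
        raise ValueError("tokens and lemmas must be parallel arrays")
    return list(zip(tokens, lemmas))


pairs = token_lemma_pairs(RESPONSE_EXCERPT)
# Keep only the tokens whose lemma differs from the surface form.
changed = [(tok, lem) for tok, lem in pairs if tok.lower() != lem.lower()]

if __name__ == "__main__":
    for token, lemma in changed:
        print(f"{token} -> {lemma}")
```

Indexing the lemma alongside the original token is what lets a query for "goose" find a document that only mentions "geese".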

On Premise

Customize and scale your text analytics on premise

For organizations with vast data quantities, unique integration needs, and data security restrictions, we provide on-premise API deployment and SDKs to be hosted on your internal servers.

Request product evaluation

If your organization requires an on-premise solution, we’re happy to work with you to meet your business’s unique needs. For more in-depth evaluations, please complete the form below and our Customer Engineering team will provide you with an on-premise evaluation package.

Drop us a line

EMAIL:
info@basistech.com

PHONE:
+1-617-386-2000

Select Customers Include

No coding required

RapidMiner is the industry’s #1 predictive analytics platform. The client platform, RapidMiner Studio, empowers organizations to easily prep data, create models and operationalize predictive analytics within any business process.

Try RapidMiner