Speaking in tongues: Foreign language KM Technologies
Recently, I was listening to a show on NPR about the current U.S. reconstruction efforts in Iraq and all the problems they’ve had, ranging from understanding the language to finding locations on a map. That got me thinking about the great foreign language support tools that I’ve worked with over the past six years. They include not only international language features that provide multilingual versions of knowledge management (KM) applications in dozens of languages, but also KM tools that can help extract meaning and improve understanding of the unstructured text found in foreign language Web sites, databases, enterprise documents, repositories and, yes, even on maps.
This article is the first part in a two-part series in which I will introduce a number of different foreign language KM technologies. In this segment, I focus on unstructured text mining tools that provide users with natural language processing, language identification, transliteration and name normalization capabilities. In a follow-on article later this year, I will focus on speech-to-text and machine translation systems.
In previous pieces, I’ve probably sounded like a broken record when talking about metadata, and I am not going to stop now. Most of the metadata extraction tools I’ve discussed in the past are available in a variety of foreign language formats. As you might expect, those tools include document categorizers, clustering engines, classifiers, named entity extractors, summarizers and indexers, just to mention a few.
Many companies produce foreign language metadata extraction and generation technologies. Their products are the cornerstones of many modern KM systems, and many readers are probably already familiar with them. The technologies are important to this discussion because they lay the foundation for building more complex and sophisticated foreign language technologies. Table 1 on page 9 (KMWorld, Vol 16 Issue 7) lists some of those KM technology companies and the types of products they sell.
When considering natural language processing, language identification, name normalization and transliteration, Basis Technology tends to stand out in the crowd. Basis’ linguistic support products provide excellent examples of the commercial state of the art in foreign language support and KM.
Basis Technology is well known in the linguistics and foreign language support community for its Rosette Linguistics Platform (RLP). RLP provides a multilanguage platform for large-scale text management and exploitation systems that identify, analyze, index, search and transliterate unstructured text in Asian, European and Middle-Eastern languages.
RLP provides a multifaceted toolkit that can add internationalization services to existing software applications. It also provides a variety of analytic functions to build comprehensive and sophisticated foreign language text mining solutions. Basis’ products help other companies that need multilingual software support for unstructured text processing by providing the specific services shown in Figure 1 on page 9 (KMWorld, Vol 16 Issue 7).
For the sake of this discussion and as shown in Figure 1, I’m going to group the capabilities of RLP into two categories: 1.) basic services and 2.) advanced services. Those are my own groupings and have nothing to do with Basis or its product names or marketing conventions per se. Rather, they are a way to help you view RLP in the context of more or less common capabilities in the marketplace.
RLP’s language processing capabilities are built on top of a variety of basic linguistic services including 1.) core language support via Unicode, 2.) base linguistics and 3.) entity extraction.
Rosette Core Library for Unicode (RCLU)
Rosette Core Library for Unicode (RCLU) helps organizations that have multiple language support requirements for their information systems easily implement standard Unicode encoding for a variety of global languages. RCLU is a set of programming libraries written in C that allow software developers to easily add Unicode support to their software, rather than having to develop it all themselves. RCLU supports multiple computer platforms including Windows, Linux, Mac OS and Unix, among others.
Unicode is a standards-based character encoding scheme for the internationalization of software and computer systems. Every character in a Unicode-enabled system is assigned a unique numeric code point, which is then stored in an encoding form such as UTF-8 or UTF-16 (UTF stands for Unicode Transformation Format). UTF-8 is a variable-length encoding built on 8-bit units: it uses one byte for basic Roman (ASCII) letters, two bytes for most Cyrillic, Hebrew and Arabic letters, and three bytes for most Chinese, Japanese and Korean characters. UTF-16 is built on 16-bit units: most characters in common use, including the bulk of Chinese, Japanese and Korean text, take a single 16-bit unit, and less common characters take two. Either way, Unicode allows software manufacturers to implement support for dozens of different language character sets within a single scheme.
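For readers who want to see the encodings at work, here is a minimal sketch using Python's built-in string encoding. It has nothing to do with RCLU itself; the sample characters and the helper name encoded_sizes are my own choices for illustration. It prints the code point of each character and the number of bytes that character occupies in UTF-8 and in UTF-16.

```python
# Illustrative only: uses Python's standard codecs, not RCLU.
# The sample characters below are arbitrary picks from several scripts.
samples = {
    "Latin A": "A",
    "Cyrillic Zhe": "Ж",
    "Hebrew Alef": "א",
    "Arabic Beh": "ب",
    "CJK ideograph": "日",
    "Emoji (outside the BMP)": "😀",
}


def encoded_sizes(ch):
    """Return (code point, UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return (ord(ch), len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))


for label, ch in samples.items():
    code_point, n_utf8, n_utf16 = encoded_sizes(ch)
    print(f"{label:24} U+{code_point:04X}  UTF-8: {n_utf8} bytes  UTF-16: {n_utf16} bytes")
```

The Latin letter needs one byte in UTF-8 and two in UTF-16; the Cyrillic, Hebrew and Arabic letters need two bytes in both; the CJK ideograph needs three bytes in UTF-8 and two in UTF-16; and the emoji needs four bytes in either form.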
Unicode also provides a standardized means for operating system software vendors to present font- and language-based information to your computer and peripheral devices (e.g., on your screen and from your printer), and to accept input from keyboards and other language-dependent devices in a standardized, language-independent fashion.
Rosette Base Linguistics (RBL)
There are multiple technology approaches for building language support tools for the wide variety of languages currently spoken throughout the world. Those can be broken down into two principal approaches: 1.) statistical methods and 2.) natural language methods. My experience has shown that a deep understanding of natural language rules and heuristics more accurately identifies specific characteristics and detail within a language set than do statistical approaches, which are more generalized. Basis uses specific natural language approaches rather than statistical approaches for its RBL language processing capabilities. However, other support tools within Basis’ portfolio of foreign language technologies do include statistical methods for foreign language processing.
Underlying RBL, Basis uses natural language-based morphological techniques in developing its core technology platform. That is, RBL understands the specific parts of a given source language in fine detail. That includes grammar, spelling, punctuation, parts of speech, semantic word roots and variants, male/female components and other detailed rules that are often extremely nuanced for a given language. RBL supports other linguistic methods for language analysis including normalization of parts of speech, segmentation, decompounding, support for lexical stemming (i.e., reducing inflected or derived words to their stem) and support for words with compound meaning.
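To make two of those terms concrete, here is a deliberately tiny sketch of stemming and decompounding. It is not how RBL works: a morphological analyzer relies on full lexicons and grammar rules for each language, whereas this toy uses a four-item English suffix list and a five-word German lexicon that I made up for the example. The function names toy_stem and toy_decompound are likewise hypothetical.

```python
# Toy illustration only; the suffix list and lexicon are invented for this example.
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical mini-lexicon of German word parts.
LEXICON = {"daten", "bank", "system", "sprach", "erkennung"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower


def toy_decompound(word, lexicon=LEXICON):
    """Split a compound into lexicon entries; return None if no complete split exists."""
    lower = word.lower()
    if not lower:
        return []
    # Try the longest possible first part, then recurse on the remainder.
    for end in range(len(lower), 0, -1):
        head = lower[:end]
        if head in lexicon:
            rest = toy_decompound(lower[end:], lexicon)
            if rest is not None:
                return [head] + rest
    return None


print(toy_stem("searching"))              # search
print(toy_stem("documents"))              # document
print(toy_decompound("Datenbanksystem"))  # ['daten', 'bank', 'system']
```

Even this toy shows why the rule-based route demands so much linguistic detail: the suffix list mishandles any word whose stem changes spelling, and the decompounder is only as good as its lexicon.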
Software that is based on the rules of natural language tends to improve in accuracy only gradually, and it requires a long-term investment in research, development and product enhancement. Basis has demonstrated that kind of commitment with its RBL system. Figure 2 shows a screen shot of RBL identifying language, word, part of speech (POS) and stem.
Rosette Entity Extractor
The Rosette Entity Extractor (REX) understands noun phrases within sentences in multiple languages. It specifically extracts names, places, dates and other text components. Entity extraction is an important part of providing structure to unstructured text and is critical in