Semantic Analysis of Unstructured Data

Big data is a much-hyped subject. In fact, the term is rather misleading because it seems to relate only to the volume of data, while ignoring its heterogeneous nature. Accordingly, the general definition of "big data," as suggested by Gartner, also considers the complexity and variety of data, thus focusing on the relevance of highly heterogeneous and unstructured information and how it can be analyzed in various forms.

In particular, apart from the growing network of machines and equipment, social interactions are a further source of big data. With the exchange of messages in a wide variety of formats, on various platforms, and through multiple channels, large amounts of data are generated that demand analysis and interpretation. Emails and documents, together with audio and video content as well as social media, remain important drivers for data growth.

This data is, by nature, highly unstructured and based largely on natural language, either directly or, as in the case of audio and video content, after transcription or similar pre-processing steps. Traditional data mining or business intelligence methods cannot be applied to analyze this type of data. Instead, content-based indexing and analysis are required.

Semantics and language processing methods are used to extract relevant information from unstructured data streams and texts, identify structures, and create links both within the data and to other data sources. In a sense, the primary goal is to implement business intelligence for text, and that requires innovative technologies.

Linguistics and Semantics Working Together

Gaining structured, usable information from big data sources requires the use of linguistics, or language technology. Information is extracted and structures are identified through the in-depth analysis of texts and data flows. This involves several processing steps that are initially carried out at the linguistic level only, independent of the specific application domain. These processing steps include, for example, language recognition, sentence segmentation, tokenization, lemmatization, part-of-speech analysis and noun-phrase recognition.

Carrying out all of these steps parses the relevant text, creating an annotation structure similar to adding hand-written notes containing detailed comments to a printed copy.
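The annotation idea can be illustrated with a minimal sketch. It covers only two of the steps named above, sentence segmentation and tokenization, and uses simple regular expressions rather than a real language technology stack. The `Annotation` class and the `annotate` function are hypothetical names chosen for this illustration; they are not part of any product mentioned in this article.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    kind: str   # "sentence" or "token"
    start: int  # character offset where the span begins
    end: int    # character offset just past the span
    text: str   # the covered text


# Deliberately naive patterns (assumption: plain prose without
# abbreviations such as "e.g.", which would split sentences wrongly).
SENTENCE_RE = re.compile(r"[^.!?]+[.!?]?")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def annotate(text: str) -> list[Annotation]:
    """Return stand-off annotations; the input text itself is not modified."""
    annotations = []
    for match in SENTENCE_RE.finditer(text):
        raw = match.group()
        sentence = raw.strip()
        if not sentence:
            continue
        # Skip leading whitespace so offsets point at the first real character.
        start = match.start() + (len(raw) - len(raw.lstrip()))
        annotations.append(
            Annotation("sentence", start, start + len(sentence), sentence)
        )
        for tok in TOKEN_RE.finditer(sentence):
            annotations.append(
                Annotation("token", start + tok.start(), start + tok.end(), tok.group())
            )
    return annotations


if __name__ == "__main__":
    for a in annotate("The pump failed. Replace the seal!"):
        print(a)
```

Because each annotation stores character offsets instead of altering the text, later steps such as lemmatization or part-of-speech analysis could add further layers to the same document, much like additional notes in the margin.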

Semantics makes content accessible. While linguistics makes it possible to identify a document's linguistic structures, that process does not link it to a specific application domain because it is not yet possible to recognize technical concepts or extract information about relationships.

To do that, we need a specific knowledge model, which can often be created from existing data (more below). In the simplest case, the knowledge model consists of relevant concepts taken from the area of application, for example, the names of products drawn from a product catalog, suppliers and customers from CRM, or relevant materials and components as identified by the R&D department.

Based on this information, the annotated text is semantically analyzed, yielding the following possibilities:

  • Recognizing domain-specific nouns and phrases (such as products, replacement parts, suppliers or customers) in the texts, so that it is immediately clear which of these entities are being referred to;
  • Identifying the relationships between these entities, for example, when assigning products mentioned in the text to product categories, or parts to products;
  • Extracting events and facts by recognizing, based on the linguistic pattern, who has interacted with whom, when and in what form. For example, the user can automatically recognize which competitors have entered into partnerships, or which individuals have expressed an opinion and in what context.
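The first two possibilities can be sketched in a few lines, assuming the simplest form of knowledge model: a lookup table of known concepts plus a table of known part-to-product relationships. All names and identifiers below (`CONCEPTS`, `PART_OF`, "HP-200", "Acme GmbH" and so on) are invented for illustration. Event and fact extraction depends on linguistic patterns and is beyond what such a lookup can show.

```python
import re

# Hypothetical knowledge model: surface form -> (entity type, canonical id).
CONCEPTS = {
    "hydraulic pump hp-200": ("product", "HP-200"),
    "seal kit": ("part", "SK-17"),
    "acme gmbh": ("supplier", "ACME"),
}

# Hypothetical relationships: part id -> id of the product it belongs to.
PART_OF = {"SK-17": "HP-200"}


def recognize(text):
    """Find known concepts in the text, ignoring case.

    Returns (start, end, entity_type, concept_id) tuples sorted by position.
    """
    hits = []
    for surface, (entity_type, concept_id) in CONCEPTS.items():
        pattern = r"\b" + re.escape(surface) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), entity_type, concept_id))
    return sorted(hits)


def related_parts(hits):
    """Return (part, product) pairs where both ends were mentioned in the text."""
    found = {concept_id for _, _, _, concept_id in hits}
    return [
        (part, product)
        for part, product in PART_OF.items()
        if part in found and product in found
    ]


if __name__ == "__main__":
    text = "Acme GmbH delivered the seal kit for the Hydraulic Pump HP-200."
    hits = recognize(text)
    print(hits)
    print(related_parts(hits))
```

A production system would match against lemmatized tokens and noun phrases from the linguistic annotation rather than raw strings, so that inflected forms and spelling variants are also recognized.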

Existing Systems Provide Input

Using semantic processes requires knowledge models (so-called ontologies) in which the area of application is described in technical terms. Today, sophisticated systems for managing these knowledge models are available as part of the semantic web. Most importantly, however, companies themselves own numerous sources of information that can be used to populate these ontologies: Information about components and products can be extracted from a spare parts catalog; the CRM system contains information about customers and partners; R&D departments certainly have systems from which relevant technologies can be extracted; and there may even be online sources (so-called linked open data) that can be used to create a knowledge model, so that only minimal revisions are required after automated import processes have been completed.
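As a rough sketch of such an automated import, the following example builds a very small knowledge model from two exports. The CSV contents, column names and the `load_concepts` helper are assumptions made for this illustration; real catalogs and CRM systems have their own formats, and a full ontology would also carry relationships and hierarchies.

```python
import csv
import io

# Hypothetical exports; real systems would supply their own columns.
CATALOG_CSV = "sku,name\nHP-200,Hydraulic Pump HP-200\nSK-17,Seal Kit\n"
CRM_CSV = "account_id,company\nC-001,Acme GmbH\n"


def load_concepts(csv_text, id_column, label_column, concept_type):
    """Turn each row of an export into one label -> concept entry."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row[label_column].strip().lower(): {"type": concept_type, "id": row[id_column]}
        for row in reader
    }


def build_knowledge_model():
    """Merge concepts from both sources into one lookup table."""
    model = {}
    model.update(load_concepts(CATALOG_CSV, "sku", "name", "product"))
    model.update(load_concepts(CRM_CSV, "account_id", "company", "customer"))
    return model


if __name__ == "__main__":
    for label, concept in build_knowledge_model().items():
        print(label, concept)
```

The manual revision mentioned above would then focus on cases an import cannot resolve by itself, such as duplicate labels or synonyms for the same concept.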

By using the above-mentioned technologies, information inside an organization can be found faster and more accurately, drastically reducing redundancy. Moreover, when searching an organization's internal data sources, a domain-specific knowledge model more than compensates for the lack of the links that web search engines use to determine relevance.

Semantic Platform for Value-Added Knowledge Management

The Empolis Information Access System (IAS) is a highly scalable semantic platform for value-adding knowledge management solutions, which integrates the above-mentioned technologies and linguistic processes. IAS has the ability to "understand" unstructured data, e.g. text, and transforms the data into so-called "smart information" with semantic annotations. IAS allows for massively parallel processing, utilizing linguistic methods for information extraction. These, in turn, form the basis for Empolis' Smart Information Management solutions, which transform unstructured content into structured information that can be automatically processed with the help of content analysis.

Empolis Smart Information Management (SIM) combines component content management and knowledge management. SIM covers the comprehensive creation, management, analysis, intelligent processing and provision of all information relevant to a company's business processes, regardless of source, format, user, location or device. Content created and managed in a component content management system is combined with mined and generated knowledge about products, customers, their profiles, suppliers and much more in a knowledge management system to deliver smart information and added value. This enables organizations to optimize business-critical processes, to make well-founded decisions in extremely flexible and dynamic markets, and to better understand and recognize emerging developments and issues, so that they can react correctly and in time.

For more information, contact Empolis Information Management GmbH: info@empolis.com or www.empolis.com
