Enriching text analytics with graph databases
Components of graph database systems
Property graph databases and RDF graph databases are two of the major categories of graph databases. “Property graphs are well-suited for ecommerce,” said Andreas Blumauer, CEO of the Semantic Web Company. “They are frequently used in recommendation engines to find products of potential interest given that an individual has made a particular purchase.” However, for complex scenarios such as knowledge discovery, Blumauer suggests RDF graphs are a better choice. “For pharmaceutical research or decision support about whether to invest in a new oil well, either of which could have major safety and cost implications, a semantic layer is essential,” he said, “and that is provided by RDF.”
The layer that users interact with is typically a knowledge graph that sits on top of the graph database and other data sources, which may also include relational databases. Between the knowledge graph and the underlying databases, middleware provides connectors and a semantic layer. PoolParty Semantic Suite is a software product from the Semantic Web Company, which specializes in knowledge engineering and knowledge graph management. Among the components of the platform is PoolParty Thesaurus Manager, which supports the management of taxonomies and ontologies as well as the maintenance of knowledge graphs. Another component is PoolParty Extractor, which lets applications integrate semantic tagging and NLP as a service based on knowledge graphs and machine learning.
PoolParty uses the Simple Knowledge Organization System (SKOS) standard developed by the W3C to support knowledge organization systems such as thesauri, classification schemes, subject heading systems, and taxonomies. “SKOS provides a means for machines to understand the concepts without regard to the specific labels,” noted Blumauer. The standard allows for interoperability among the different information sources.
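The idea that a concept is separate from its labels can be shown with a small sketch. It is not PoolParty's implementation or a SKOS library; it only mimics the SKOS notions of a preferred label and alternative labels attached to one stable identifier. The class name, the example.org URIs, and the labels are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Concept:
    """A SKOS-style concept: one stable identifier, many labels."""
    uri: str
    pref_label: str
    alt_labels: set[str] = field(default_factory=set)

    def labels(self) -> set[str]:
        """Return the preferred label together with all alternative labels."""
        return {self.pref_label, *self.alt_labels}


def build_label_index(concepts: list[Concept]) -> dict[str, set[str]]:
    """Map each lowercased label to the URIs of the concepts that carry it."""
    index: dict[str, set[str]] = {}
    for concept in concepts:
        for label in concept.labels():
            index.setdefault(label.lower(), set()).add(concept.uri)
    return index


# Hypothetical URIs and labels, invented for this example only.
CONCEPTS = [
    Concept("http://example.org/concept/car", "car", {"automobile", "motor car"}),
    Concept("http://example.org/concept/java-island", "Java", {"Jawa"}),
    Concept("http://example.org/concept/java-language", "Java",
            {"Java programming language"}),
]

LABEL_INDEX = build_label_index(CONCEPTS)
```

In this sketch, "automobile" and "motor car" resolve to the same concept identifier, while the label "java" resolves to two different identifiers. A system that exchanges identifiers rather than label strings can therefore interoperate across sources that name things differently.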
The next step in the text analytics process is to disambiguate labels that are spelled the same but mean different things, such as “Java.” “We compare what is found on one side of the text in the graph database and what is found on the other,” explained Blumauer. “If the words are ‘Indonesia,’ or ‘tourism,’ then it’s easy to figure out that the intended meaning is the island; if the words on the other side of the link are ‘program’ or ‘computer,’ then it is likely the computer language that is the topic.” Another option is a statistical approach in which the system is fed millions of documents for supervised learning, but in some industry categories the required training set is not available, and the process is less efficient.
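The comparison Blumauer describes can be approximated by counting how many words in the text match the neighbors of each candidate concept in the graph. The sketch below is a deliberately simple illustration, not the method any particular product uses. The candidate names, the neighbor word lists, and the function names are assumptions; a real system would read the neighbors from the graph database.

```python
import re

# Hypothetical graph neighborhoods: for each candidate meaning of "Java",
# the (lowercased) labels of concepts linked to it in the knowledge graph.
CANDIDATES = {
    "Java (island)": {"indonesia", "tourism", "island", "jakarta"},
    "Java (programming language)": {"program", "computer", "software", "compiler"},
}


def tokenize(text):
    """Split text into a set of lowercase alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(text, candidates):
    """Pick the candidate whose graph neighbors overlap most with the text.

    Returns None when no candidate matches any word or when the best
    score is tied, since the context is then not decisive.
    """
    words = tokenize(text)
    scores = {name: len(words & neighbors) for name, neighbors in candidates.items()}
    best = max(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]
```

Called on a sentence that mentions tourism and Indonesia, `disambiguate` selects the island; on a sentence about a computer program, it selects the language. No training documents are involved, which is the practical appeal of the graph-based approach when a large labeled corpus is not available.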
With the use of graph databases, text analytics is no longer a matter of extracting simple terms; a great deal more information about the entity can be inferred. “When a user searches for an article about Switzerland after the metadata has been enriched by content and relationships in the graph database,” observed Blumauer, “the system will know quite a bit—for example, that Switzerland is part of Central Europe and that all the factors affecting that geographic area also apply to Switzerland. It will be able to connect with additional related information, which can be dynamic as new facts are acquired. Or, through the use of an ontology, the system can be aware that a certain document is a legal text, and that it is related to a contract of a certain size.” Any incoming data object can be linked to an existing part of the knowledge graph, which results in the growth of the graph, which can, in turn, be used for improved text analytics.
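The Switzerland example amounts to following "part of" links transitively and collecting whatever is known about each containing region. The following sketch illustrates that kind of inference with plain dictionaries. The edge data, the placeholder factor strings, and the function names are assumptions for illustration; an RDF store would normally express this with a transitive property and a query instead.

```python
# Hypothetical "part of" edges in a small knowledge graph.
PART_OF = {
    "Switzerland": {"Central Europe"},
    "Central Europe": {"Europe"},
}

# Placeholder factors attached to regions; the strings carry no real data.
FACTORS = {
    "Central Europe": {"factor affecting Central Europe"},
    "Europe": {"factor affecting Europe"},
}


def containing_regions(entity, part_of):
    """Return every region reachable from entity by following part-of edges."""
    seen = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for parent in part_of.get(current, ()):
            if parent not in seen:  # guards against cycles in the data
                seen.add(parent)
                stack.append(parent)
    return seen


def inherited_factors(entity, part_of, factors):
    """Collect factors of the entity itself and of all regions containing it."""
    result = set(factors.get(entity, ()))
    for region in containing_regions(entity, part_of):
        result |= factors.get(region, set())
    return result


def link_entity(entity, parent, part_of):
    """Attach a new entity to an existing node, growing the graph in place."""
    part_of.setdefault(entity, set()).add(parent)
```

Linking a new node, for instance `link_entity("Zurich", "Switzerland", PART_OF)`, immediately gives it everything already known about Switzerland's containing regions. That is the feedback loop described above: each linked object enlarges the graph, and the larger graph supports richer inference on the next document.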