It's a messy endeavor: Automated text processing

This article appears in the issue January 2014, [Vol 23, Issue 1]
Page 1 of 2 next >>

Björn Höhrmann, a German engineer who contributes to open source projects, posted the question, "How much does it cost to archive all human audio visual experiences?" He then proceeded to answer his own question. According to his estimate: about $1 trillion per year. (For more details, see

My back-of-the-envelope calculation suggests that his estimate is likely to be conservative. It's no secret that nano devices equipped with CPUs, software and wireless connectivity will start becoming more widely available. Höhrmann's analysis does not appear to include the data that new technologies will generate in ever-increasing volumes.

Now consider the monumental problem of converting the audio to text and then indexing that text. On top of that, keywords and metadata are needed because nobody wants to watch or listen to a video to find the item of information needed. There are not enough hours in the day to keep pace with the textual information available. Toss in millions of hour-long podcasts or one day's uploads to YouTube and time becomes an insurmountable barrier.

Global media attention directed at content processing is partly a reaction to this simple fact of digital life: These days the only way a company, a government entity or a researcher can figure out what is happening is to use next-generation tools. The tools, as most knowledge management professionals know, are somewhat crude. Even the most advanced products have limitations that are difficult to work around. Budgets are tight, so systems that can filter or trim the content are essential. Computing power continues to increase, but the volume of data and the mathematical recipes themselves can bring supercomputers to a halt. Most concerning, the plumbing required to move large volumes of data from Point A to Point B has capacity limitations. Increasing available bandwidth in a computing infrastructure is not quite the walk in the park some marketers picture in their HDR-colored PowerPoints.

When I was in Australia in 2009, I learned about Leximancer, a text processing system that had its roots at the University of Queensland. I spoke with Andrew Smith, a physicist and the founder of Leximancer, about what makes his system different from other systems. Leximancer, unlike other content processing systems, is designed to show users the information landscape to raise awareness of the space of available knowledge. The idea is to enable a user to generate and explore hypotheses. "My goal from the start," Smith said, "was to create a practical system for doing a kind of spectrum analysis on large collections of unstructured data, in a language-independent and emergent manner."

The company, conceived in 1999, was blessed with foresight because it immediately embraced the idea that the amount of digital information would grow exponentially. The core concept for the Leximancer system is that patterns of meaning are latent in that data. As Smith told me, "Humans have limited memory, time and cognition, so these critical patterns of meaning might be missed by the people who need to know." Leximancer's system is designed to fill in those gaps in human brainpower.

Smith said, "Leximancer is used most of the time for analyzing surveys, interviews, stakeholder submissions, social media extracts, research articles and patent corpora, engineering documentation, policy documents, inquiry transcripts, etc. It is not primarily a search engine, and is certainly not an enterprise search solution, though it is used as a component of such."

The Leximancer system is almost entirely data-driven, so that the "ontology" emerges from the data and is faithful to that data. Smith said, "My sense was that the gulf between the quantity of available information versus the actual human awareness, integration and understanding of this information is a serious and insidious threat. Certainly we address the problem of not knowing the best search terms to use in a given context, but we also address the problem of not even knowing what questions can or should be asked." 

Valid representation of data

In the global data explosion, users are in a difficult position. Smith explained, "What we have seen is that many users are not prepared to think hard enough to understand complex or statistical truths. Users are looking for a plausible story or anecdote from the data, even if it is not representative. I think this is a danger in some interfaces for any user who is doing serious search/research/analysis. With our new product under development, we are designing to achieve both. We take care of statistical validity and present the user with an attractive mash-up that is nevertheless a valid representation of the data."

To help organizations and professionals who must analyze information, Leximancer offers software as a service and on-premises options. Smith positions Leximancer as an enhancement to existing retrieval systems, not a replacement.

He explained, "I do believe that most if not all current search technologies are not suitable for social media, or most fire hoses of weakly structured complex data such as system or transaction logs. The points that support my reasons for this are, first, that each data record is only a fragment of some unfolding story and cannot be understood in isolation, and contains few if any of the obvious keywords for its thread. Second, multiple stories are being played out simultaneously in different time scales, and the fragments of each story are intermixed in the fire hose. Third, terms that make up the data items can mean different things in different contexts, or different terms can mean the same things in some contexts. And, lastly, new data terms can appear at any time." 

Four challenges

If we ignore for the moment the problem of processing "all" content, four interesting challenges are testing organizations that want to manage their knowledge in a more effective way.

The first is the shortage of mathematicians. Earlier this year, Dr. Roslyn Prinsley told The Conversation: "The fact that the demand in Australia for math graduates, at the minute, is outstripping supply is a major issue for this country. From 1998 to 2005, the demand for mathematicians increased by 52 percent. From 2001 to 2007, the number of enrollments in a mathematics major in Australian universities declined by 15 percent. On the global scale, we are falling behind too. In 2003, the percentage of students graduating with a major in mathematics or statistics in Australia was 0.4 percent. The Organization for Economic Co-operation and Development's (OECD) average was 1 percent." (See

Prinsley's comments have global implications. In the United States, the problem is not just a decline in the mathematics major. There is a critical shortage of mathematics teachers. States from Alaska to Wyoming are being severely affected. See, for instance, the Commonwealth of Virginia's "Critical Shortage Teaching Endorsement Areas for 2013-2014 School Year" and the U.S. Department of Education's Nationwide Listing of teacher shortage areas. Without individuals skilled in mathematics, systems that rely on numeric recipes will be unfathomable. How can an organization or an individual with a subpar education in numbers determine whether a system's outputs are valid?
