

Unstructured content and GenAI: Bridging gaps and transforming access

Tapping into the value of unstructured content is an ongoing goal for many organizations navigating the expanse of their data estates. Through technologies such as natural language processing and generative AI (GenAI), transforming unstructured content into structured, actionable insights can supercharge enterprise decision making and competitive presence in the market.

Grant Spradlin, VP of product, Vertesia, and Jan Štihec, director, data and GenAI, Shelf, discussed the significance of leveraging unstructured content in KMWorld’s webinar, From Unstructured Content to Actionable Insights: Accelerating Access to Accurate, Relevant Information, offering their expertise and guidance on the utility and value of unstructured information.

GenAI plays a major role in unlocking the potential of unstructured data, according to Spradlin. Effectively preparing unstructured content for large language models (LLMs) and retrieval-augmented generation (RAG) means tackling GenAI’s biggest challenges: preparing content and data, building the necessary infrastructure, and delivering on GenAI outcomes.

Pre-processing unstructured content is critical to preventing GenAI hallucinations and limiting unexpected or incorrect results, noted Spradlin. Yet many of today’s pre-processing tools are ill-equipped to handle unstructured content and miss crucial information, which ultimately invites hallucinations and errors.

Innovating in this space, Vertesia’s platform acts as a semantic layer for LLMs, transforming unstructured content—such as PDFs—into structured XML by intelligently processing each page.

“We understand how to best augment the original content,” said Spradlin. “We’re never going to rewrite or modify any of the original content, but we’ll pull out information that’s in tables, we’ll appropriately identify what’s in images, and we’ll preserve the full content hierarchy so that we have the original fidelity of the source document.”
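As a rough illustration of what page-by-page structured XML might look like, the sketch below uses Python's standard `xml.etree.ElementTree` module. It is not Vertesia's implementation: the input format, the block kinds, and the element names (`document`, `page`, `heading`, `table`, `image`) are all assumptions made for the example, and the page content is assumed to have been extracted already by some PDF parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-extracted page content. A real pipeline would get
# this from a PDF parser; the block kinds and field names are assumptions.
pages = [
    [
        {"kind": "heading", "level": 1, "text": "Annual Report"},
        {"kind": "paragraph", "text": "Revenue grew in every region."},
        {"kind": "table", "rows": [["Region", "Revenue"], ["EMEA", "12"]]},
    ],
    [
        {"kind": "heading", "level": 2, "text": "Outlook"},
        {"kind": "image", "description": "Bar chart of revenue by region"},
    ],
]


def pages_to_xml(pages):
    """Build one <document> element with a <page> child per input page."""
    root = ET.Element("document")
    for number, blocks in enumerate(pages, start=1):
        page_el = ET.SubElement(root, "page", number=str(number))
        for block in blocks:
            kind = block["kind"]
            if kind == "heading":
                # Keep the heading level so the content hierarchy survives.
                el = ET.SubElement(page_el, "heading", level=str(block["level"]))
                el.text = block["text"]
            elif kind == "paragraph":
                # Original text is copied as-is, never rewritten.
                ET.SubElement(page_el, "paragraph").text = block["text"]
            elif kind == "table":
                table_el = ET.SubElement(page_el, "table")
                for row in block["rows"]:
                    row_el = ET.SubElement(table_el, "row")
                    for cell in row:
                        ET.SubElement(row_el, "cell").text = cell
            elif kind == "image":
                # Images are represented by a textual description.
                ET.SubElement(page_el, "image", description=block["description"])
    return root


root = pages_to_xml(pages)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The point of the sketch is only that tables, image descriptions, and heading levels each get an explicit place in the output, so a downstream LLM does not have to infer them from flattened text.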

However, this is just one step in preparing unstructured content for GenAI, according to Spradlin. Organizations must also classify the content and provide a schema that assigns metadata to it. Generating that metadata, chunking content, and creating embeddings are equally important, as each of these actions helps make the content easier to search and find.
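The chunking, metadata, and embedding steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's pipeline: the function names, the metadata fields (`doc_type`, `department`), and the chunk sizes are hypothetical, and `toy_embedding` is a hashed bag-of-words stand-in for a real embedding model.

```python
import hashlib
import math


def chunk_words(text, size=40, overlap=10):
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embedding(text, dims=16):
    """Hashed bag-of-words vector, L2-normalized.

    A placeholder only: a real system would call an embedding model here.
    """
    vec = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def prepare(doc_id, text, metadata):
    """Return one record per chunk, each carrying the document's metadata."""
    records = []
    for i, chunk in enumerate(chunk_words(text)):
        records.append({
            "id": f"{doc_id}-{i}",
            "text": chunk,
            "metadata": dict(metadata),  # copy so records stay independent
            "embedding": toy_embedding(chunk),
        })
    return records


# Hypothetical document: 100 placeholder words and an assumed metadata schema.
sample_text = " ".join(f"word{i}" for i in range(100))
records = prepare("policy-001", sample_text, {"doc_type": "policy", "department": "HR"})
print(len(records), records[0]["id"], records[0]["metadata"])
```

Attaching the same metadata to every chunk is what lets a retrieval step filter by document type or department before comparing embeddings.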

The role of unstructured data has changed massively over the past few years due to the popularity of GenAI, explained Štihec.

GenAI “puts unstructured data on center stage…[because] it’s trained on vast amounts of unstructured data,” said Štihec. Furthermore, enterprise GenAI use cases often require LLMs to pull from an organization’s repository of unstructured data. Because most enterprise data is unstructured, and typically not actively managed or quality assured, managing it effectively becomes even more critical.

This is highlighted by research from Shelf—which involved the analysis of millions of documents—where data quality was identified as a leading obstacle preventing the successful usage of unstructured data. The research found that:

  • 94% of files contained at least one major inaccuracy
  • 26% of files contained outdated information
  • 33% of files contained duplicate and redundant information
  • 12% of company files had at least one compliance risk
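To make two of these categories concrete, the sketch below flags duplicate and outdated files in a small collection. It does not reflect how Shelf performed its analysis; the file records, the `last_reviewed` field, the two-year staleness threshold, and the function names are all assumptions chosen for illustration, and duplicates are detected only when the text matches after case and whitespace normalization.

```python
import hashlib
from datetime import date

# Hypothetical file records; names, fields, and dates are invented for the example.
files = [
    {"name": "pricing_v1.txt", "text": "Plan A costs $10.", "last_reviewed": date(2021, 3, 1)},
    {"name": "pricing_copy.txt", "text": "plan a  costs $10.", "last_reviewed": date(2024, 6, 1)},
    {"name": "handbook.txt", "text": "Welcome to the team.", "last_reviewed": date(2024, 9, 15)},
]


def fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(files):
    """Group file names whose normalized text is identical."""
    groups = {}
    for f in files:
        groups.setdefault(fingerprint(f["text"]), []).append(f["name"])
    return [names for names in groups.values() if len(names) > 1]


def find_stale(files, today, max_age_days=730):
    """Return names of files not reviewed within `max_age_days` (assumed threshold)."""
    return [f["name"] for f in files if (today - f["last_reviewed"]).days > max_age_days]


duplicates = find_duplicates(files)
stale = find_stale(files, today=date(2025, 1, 1))
print(duplicates)  # [['pricing_v1.txt', 'pricing_copy.txt']]
print(stale)       # ['pricing_v1.txt']
```

Exact-match hashing misses near-duplicates such as lightly edited copies, which would call for a similarity measure rather than a fingerprint.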

Štihec then introduced the Shelf platform, which helps companies automatically identify, fix, and monitor unstructured data issues so they can deploy GenAI with confidence. The platform helps enterprises:

  • Eliminate bad unstructured data before it becomes bad answers.
  • Gain control over unstructured data.
  • Continuously improve GenAI answers.
  • Get answer transparency, enabling responsible AI.

This is only a snippet of the From Unstructured Content to Actionable Insights webinar. An archived version of the full webinar, featuring detailed explanations, survey results, a Q&A, and more, is available here.
