September 6, 2019
By Daniel Vasicek, Senior Data Scientist, Access Innovations, Inc.
Article

Data Uncertainty, Model Uncertainty, and the Perils of Overfitting

Why should you be interested in artificial intelligence (AI) and machine learning? Any classification problem for which you have a good source of classified examples is a candidate for AI. Historically, optical character recognition (OCR) was a difficult problem. OCR performance has recently improved enormously, at least in part because we now have very large collections of already classified examples. Similarly, automatic translation between languages has made tremendous advances because we have access to enormous collections of translated documents that can be used to train the system. Other tasks that lend themselves to machine learning and AI are concept identification in texts, entity extraction, assigning peer reviewers to submitted documents, sentiment analysis, quality evaluation, and priority assignment.

Data Uncertainty

Real data contains measurement errors or noise, so recorded values deviate from the correct, intended, or original values. Data veracity has been acknowledged since at least 2012 as an issue in using AI to support business decisions. Some examples of uncertain data include:

♦ Rooms are often not square even though they were designed to be

♦ A person’s address in my contact management system from 5 years ago

♦ The official temperature reading in my city and my backyard thermometer reading

In the first example, the uncertainty can be caused by any number of factors: the carpenters measured wrong or misread a specification, the ground beneath the building has shifted, or an earthquake broke a supporting structure. There are just as many possibilities for the other examples.
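To make the idea of noisy measurements concrete, here is a minimal sketch in Python. It assumes the simplest possible case: purely random, Gaussian measurement error around a known true value. The function name `simulate_readings` and all of the numbers (a 4.00 m wall, 1 cm of tape-measure error, 50 readings) are hypothetical and chosen only for illustration. Systematic errors, such as a shifted foundation or an out-of-date address, are not captured by this kind of model.

```python
import random
import statistics


def simulate_readings(true_value, noise_sd, n, seed=0):
    """Return n readings of true_value, each perturbed by Gaussian noise.

    Assumption: errors are independent and normally distributed with
    mean 0 and standard deviation noise_sd. Real measurement errors
    are often messier than this.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [true_value + rng.gauss(0.0, noise_sd) for _ in range(n)]


# Hypothetical numbers: a wall designed to be 4.00 m long, measured
# 50 times with a tape measure whose error has a 1 cm standard deviation.
readings = simulate_readings(true_value=4.00, noise_sd=0.01, n=50)

mean = statistics.mean(readings)      # best single estimate of the length
spread = statistics.stdev(readings)   # how much individual readings vary
# Standard error of the mean: how uncertain the average itself is.
std_error = spread / len(readings) ** 0.5

print(f"mean = {mean:.4f} m")
print(f"spread of readings = {spread:.4f} m")
print(f"standard error of the mean = {std_error:.4f} m")
```

No single reading equals the designed value, yet the average of many readings comes close, and the standard error gives a rough sense of how far to trust that average. Averaging helps only with random noise of this kind; it does nothing for an error that pushes every reading in the same direction.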

Model Uncertainty

Our models are never perfect; rather, they are useful approximations. Consider geocentrism, the model of the universe in which the Earth is the center around which other celestial bodies orbit. This model dates from the ancient Greeks and was further developed by Ptolemy in Egypt around the 2nd century AD. It was the accepted model until 1543 AD, when Copernicus advocated Aristarchus’ concept of heliocentrism, the model in which the sun is the center of our planetary system. Debates raged for centuries as more and more information was collected, and finally, around the late 18th and early 19th centuries, a confluence of empirical evidence won over the scientific community. What is important to note here is that the geocentric model was used for somewhere between 22 and 24 centuries until a heliocentric model was shown to be “better.” And now we have better models still, in which the sun travels in an orbit around the center of our galaxy and the universe is expanding.
