Data Uncertainty, Model Uncertainty, and the Perils of Overfitting
Why should you be interested in artificial intelligence (AI) and machine learning? Any classification problem where you have a good source of classified examples is a candidate for AI. Historically, optical character recognition (OCR) was a difficult problem. OCR performance has recently improved enormously, at least in part because we now have very large collections of already classified examples. Similarly, automatic translation between languages has made tremendous advances because we have access to enormous collections of translated documents that can be used to train the translation models. Other tasks that lend themselves to AI and machine learning include concept identification in texts, entity extraction, assigning peer reviewers to submitted documents, sentiment analysis, quality evaluation, and priority assignment.
Real data contains measurement errors or noise that make it deviate from the correct, intended, or original values. Data veracity has been acknowledged since at least 2012 as an issue in using AI to support business decisions. Some examples of uncertain data include:
♦ Rooms are often not square even though they were designed to be
♦ A person’s address that was entered in my contact management system 5 years ago
♦ The official temperature reading in my city versus my backyard thermometer reading
In these examples, the uncertainty can have any number of causes. For the room, the carpenters may have measured incorrectly or misread a specification, the ground beneath the building may have shifted, or an earthquake may have broken a supporting structure. There are just as many possibilities for the other examples.
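One way to make measurement uncertainty concrete is a small simulation. The sketch below is illustrative only: the wall length, the noise level, the Gaussian noise model, and the function name `simulate_readings` are all assumptions chosen for the example, not values from any real measurement.

```python
import random
import statistics


def simulate_readings(true_value, noise_sd, n, seed=0):
    """Return n noisy readings of true_value.

    Assumption: measurement error is Gaussian with mean 0 and
    standard deviation noise_sd. Real errors may be biased or skewed.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [true_value + rng.gauss(0.0, noise_sd) for _ in range(n)]


# Hypothetical scenario: a wall designed to be 4.000 m long, measured
# 50 times with a tape whose error has a standard deviation of 5 mm.
DESIGNED_LENGTH_M = 4.000
readings = simulate_readings(DESIGNED_LENGTH_M, noise_sd=0.005, n=50)

mean_reading = statistics.mean(readings)  # best single estimate of the length
spread = statistics.stdev(readings)       # how much individual readings disagree

print(f"mean = {mean_reading:.4f} m, std dev = {spread:.4f} m")
```

No single reading is likely to equal the designed length exactly, yet the average of many readings lands close to it. The spread of the readings is a simple summary of how uncertain any one of them is.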
Our models are never perfect; rather, they are useful approximations. Consider geocentrism, the model of the universe in which the Earth is the center around which other celestial bodies orbit. This model dates from the ancient Greeks and was further developed by Ptolemy in Egypt around the 2nd century AD. It was the accepted model until 1543 AD, when Copernicus advocated Aristarchus’ concept of heliocentrism, the model in which the sun is the center of our planetary system. Debates raged for centuries as more and more information was collected, and finally, around the late 18th and early 19th centuries, a confluence of empirical evidence overwhelmed the scientific community. What is important to note here is that the geocentric model was used for somewhere between 22 and 24 centuries until a heliocentric model was shown to be “better.” And now we have better models in which the sun travels in an orbit around the center of our galaxy and the universe is expanding.