Data Uncertainty, Model Uncertainty, and the Perils of Overfitting
Fitting the Data to the Model
Balancing measurement errors with model errors enhances model predictions.
Carl Friedrich Gauss, a German mathematician and physicist, made two major changes to the model his predecessors had used in trying to rediscover the dwarf planet Ceres. First, he adjusted the orbit parameters to minimize the sum of the squared errors between the observed measurements and the model’s elliptical orbit predictions, which improved his estimate of those parameters. Second, he required the model to be an elliptical orbit. By using only elliptical orbits, he eliminated some of the wilder variations that straight-line and circular motion models allowed. Ceres would move in an exactly elliptical orbit if it were moving under the gravitational influence of the sun alone. However, the planets also pull on Ceres, perturbing it so that it wobbles about an approximately elliptical solar orbit. An elliptical orbit model is better than either a straight-line or a circular model, but it remains imperfect.
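As a rough illustration of the least-squares idea (not Gauss’s actual orbit computation), the sketch below fits an ellipse to noisy points by minimizing the summed squared error of a linearized ellipse equation. The helper name fit_ellipse_axes, the synthetic observations, the noise level, and the simplifying assumption that the ellipse is centered at the origin and aligned with the axes are all made up for the example.

```python
import numpy as np

def fit_ellipse_axes(x, y):
    """Least-squares fit of x^2/a^2 + y^2/b^2 = 1; returns (a, b).

    Hypothetical helper: assumes an origin-centered, axis-aligned ellipse,
    which a real orbit determination would not assume.
    """
    # The equation is linear in p = (1/a^2, 1/b^2): [x^2, y^2] @ p = 1.
    design = np.column_stack([x**2, y**2])
    target = np.ones_like(x)
    # lstsq minimizes the sum of squared residuals of design @ p - target.
    p, *_ = np.linalg.lstsq(design, target, rcond=None)
    return 1.0 / np.sqrt(p[0]), 1.0 / np.sqrt(p[1])

# Synthetic "observations": points on a known ellipse plus measurement noise.
rng = np.random.default_rng(0)
true_a, true_b = 3.0, 2.0
angles = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
x_obs = true_a * np.cos(angles) + rng.normal(0.0, 0.02, angles.size)
y_obs = true_b * np.sin(angles) + rng.normal(0.0, 0.02, angles.size)

a_hat, b_hat = fit_ellipse_axes(x_obs, y_obs)
print(f"estimated semi-axes: a={a_hat:.3f}, b={b_hat:.3f}")
```

No single observation lies exactly on the fitted ellipse. The fit trades small errors at every point for parameters that describe all of the points reasonably well.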
Once we allow for noise in measurements, we must accept that exact fits are no longer possible or even desirable. An exact fit to noisy data means that the model is fitting the noise, and each new data point then requires a more complex model to keep the fit exact. Noise means that repeated measurements of the same thing will not produce the same values, so a model that tracks the noise gives uncertain predictions and can behave erratically between and beyond the data points. We want to reduce the effect of noise on the model because we want to reduce the uncertainty in our predictions. Models can still predict from noisy data, but they should not fit it perfectly.
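A small numerical sketch can make this concrete. It assumes a straight-line ground truth and Gaussian noise, both invented for the example, and compares a polynomial with enough terms to pass through every noisy point against a plain straight-line fit. The variable names are illustrative only.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_fn(t):
    # Assumed underlying relationship (illustrative only).
    return 2.0 * t + 1.0

def rms(values):
    return float(np.sqrt(np.mean(values**2)))

rng = np.random.default_rng(1)
n_points = 10
x = np.linspace(0.0, 1.0, n_points)
y = true_fn(x) + rng.normal(0.0, 0.1, n_points)   # noisy measurements

# A degree n-1 polynomial has enough freedom to hit every noisy point.
exact_fit = Polynomial.fit(x, y, deg=n_points - 1)
# A degree 1 polynomial cannot follow the noise.
line_fit = Polynomial.fit(x, y, deg=1)

x_dense = np.linspace(0.0, 1.0, 500)
train_exact = rms(exact_fit(x) - y)               # error against the data
train_line = rms(line_fit(x) - y)
true_exact = rms(exact_fit(x_dense) - true_fn(x_dense))  # error against truth
true_line = rms(line_fit(x_dense) - true_fn(x_dense))

print(f"exact fit: data error {train_exact:.2e}, true error {true_exact:.3f}")
print(f"line fit:  data error {train_line:.2e}, true error {true_line:.3f}")
```

The exact fit reproduces the data almost perfectly, yet it is the straight line, which leaves visible residuals, that is expected to track the underlying relationship more closely.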
When a model’s apparent prediction error falls below the uncertainty in the data, overfitting should be considered as a possibility. Overfitting is adjusting the model to fit the data exactly, even though we know the data to be uncertain. In an overfitted model, noise has a large effect on the model’s predictions.
For example, suppose machine learning is applied to a corpus of physics journals to develop author profiles.
Overfitting leads the AI to assign extremely granular concepts to authors (e.g., “Quantum entanglement” rather than just “Physics”). The highly specific sub-topic leads the machine not to suggest this author’s submission for review because, in its view, they are not an expert in “Physics” but only an expert in a highly specialized area of physics.
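The warning sign described earlier, a fit that is tighter than the data’s own uncertainty, can be checked numerically when the measurement noise is known. The sketch below is one simple way to do that, not a standard named test: it divides the RMS residual by an assumed noise level sigma. The function name residual_to_noise_ratio and the synthetic data are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

def residual_to_noise_ratio(y_obs, y_fit, sigma):
    """RMS residual divided by the (assumed known) measurement noise sigma.

    A ratio near 1 means the fit misses the data by about the noise level;
    a ratio well below 1 suggests the model is reproducing the noise.
    """
    residuals = np.asarray(y_obs) - np.asarray(y_fit)
    return float(np.sqrt(np.mean(residuals**2)) / sigma)

rng = np.random.default_rng(2)
sigma = 0.1                                   # assumed measurement noise
x = np.linspace(0.0, 1.0, 12)
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma, x.size)

simple_fit = Polynomial.fit(x, y, deg=1)      # 2 adjustable parameters
complex_fit = Polynomial.fit(x, y, deg=10)    # 11 parameters for 12 points

ratio_simple = residual_to_noise_ratio(y, simple_fit(x), sigma)
ratio_complex = residual_to_noise_ratio(y, complex_fit(x), sigma)
print(f"degree 1 ratio:  {ratio_simple:.2f}")
print(f"degree 10 ratio: {ratio_complex:.2f}")
```

The check only works if sigma is a trustworthy estimate. An overestimated noise level would make a perfectly reasonable fit look suspicious.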
Overfitting is a constant challenge in any machine learning task. Because many machine learning models, neural networks among them, are highly flexible, and because an overly complex model will often fit the same data “better,” we must constantly guard against overfitting, balancing the errors associated with measurements against the errors associated with overfitting.
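One common way to strike this balance, offered here as an illustration rather than as the method the discussion above has in mind, is to hold back some data and choose the model complexity that predicts the held-back data best. The sketch uses polynomial degree as a stand-in for model complexity. The ground-truth curve, noise level, and sample sizes are assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

def noisy_samples(n):
    # Assumed ground truth: a gentle curve plus Gaussian measurement noise.
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2.0 * x) + rng.normal(0.0, 0.1, n)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

x_train, y_train = noisy_samples(20)     # data the models are fitted to
x_valid, y_valid = noisy_samples(200)    # held-out data, never used in fitting

degrees = range(1, 13)
train_err, valid_err = [], []
for d in degrees:
    model = Polynomial.fit(x_train, y_train, deg=d, domain=[0.0, 1.0])
    train_err.append(mse(model, x_train, y_train))
    valid_err.append(mse(model, x_valid, y_valid))

# Pick the complexity with the lowest error on the held-out data.
best_degree = degrees[int(np.argmin(valid_err))]
print(f"degree with lowest held-out error: {best_degree}")
```

Training error keeps shrinking as the degree grows, which is the sense in which a more complex model fits the same data “better.” The held-out error is what reveals when the extra complexity has started to fit noise.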