The fundamental problem in learning from data is proper Model Selection. As we have seen in the previous lectures, a model that is too complex could overfit the training data (causing an estimation error) and a model that is too simple could be a bad approximation of the function that we are trying to estimate (causing an approximation error). The estimation error arises because of the fact that we do not know the true joint distribution of data in the input and output space, and therefore we minimize the empirical risk (which, for each candidate model, is a random number depending on the data) and estimate the average risk again from the limited number of training samples we have. The approximation error measures how well the functions in the chosen model space can approximate the underlying relationship between the output space on the input space, and in general improves as the “size” of our model space increases.
In the preceding lectures, we looked at some solutions to deal with the overfitting problem. The basic approach followed was the Method of Sieves, in which the complexity of the model space was chosen as a function of the number of training samples. In particular, both the denoising and classification problems we looked at consider estimators based on histogram partitions. The size of the partition was an increasing function of the number of training samples. In this lecture, we will refine our learning methods further introduce model selection procedures that automatically adapt to the distribution of the training data, rather than basing the model class solely on the number of samples. This sort of adaptivity will play a major role in the design of more effective classifiers and denoising methods. The key to designing data-adaptive model selection procedures is obtaining useful upper bounds on the estimation error. To this end, we will introduce the idea of “Probably Approximately Correct” learning methods.
The method of Sieves underpinned our approaches in the denoising
problem and in the histogram classification problem. Recall that the
basic idea is to define a sequence of model spaces
The choice of the model space
In general, learning basically comprises of two things:
Sieves basically force us to deal with (2) a priori (before we analyze the training data). This will lead to suboptimal classifiers and estimators, in general. Indeed deciding where/how to average is the really interesting and fundamental aspect of learning; once this is decided we have effectively solved the learing problem. There are at least two possibilities for breaking the rigidity of the method of sieves, as we shall see in the following section.
The basic idea is to select
Let
be a function from
for some small
A typical example could be the
use of VC dimension to characterize the complexity of the collection
of model spaces i.e.,
Consider a very large class of candidate models
This probability bound also implies an upper bound on the estimation error and complexity regularization is based on the criterion
Complexity Regularization and SRM are very similar and equivalent in certain instances. A distinguishing feature of SRM and complexity reqularization techniques is that the complexity and structure of the model is not fixed prior to examining the data; the data aid in the selection of the best complexity. In fact, the key difference compared to the Method of Sieves is that these techniques can allow the data to play an integral role in deciding where and how to average the data.
Probability bounds of the forms in Equation 4 and Equation 6 are the foundation for SRM and complexity regularization techniques. The simplest of these bounds are known as PAC bounds in the machine learning community.
In order to develop complexity regularization schemes we will need to
revisit the estimation error / approximation error trade-off. Let
The approximation error depends on how close
![]() |
Probability bounds of the forms in Equation 4 and Equation 6
guarantee that the empirical risk is uniformly close to the true risk,
and using Equation 4 and Equation 6 it is possible to show that
with high probability the selected model
or
The estimation error will be small if
This says that the difference between
To introduce PAC bounds, let us consider a simple case. Let
Assume
Since
The last inequality follows from the fact that if
Note that for
Assume that
Recall that for any non-negative random variable
Minimizing with respect to
"A complete course in modern statistical learning theory at the graduate student level."