CS 189 Intro to Learning Theory
A range space is a pair $(P, H)$:
$P$ is the set of all possible test/training points (can be infinite)
$H$ is a hypothesis class, a set of hypotheses (classifiers); it sets the boundary of which classifiers are allowed
- e.g. all the linear classifiers
Suppose all training pts & test pts are drawn independently from the same prob. distribution $\mathcal D$ defined on the domain $P$.
Let $h \in H$ be a hypothesis (a classifier); $h$ predicts that a pt $x$ is in class $C$ if $x \in h$.
Risk (generalization error) $R(h)$ of $h$ is the probability that $h$ misclassifies a random pt $x$ drawn from $\mathcal D$ (i.e. the prob. that $x\in C$ but $x\notin h$, or vice versa) ← essentially test error
- risk is the average test error over test points drawn randomly from $\mathcal D$ (but in practice we only ever see a finite sample from $\mathcal D$, so the observed test error is sometimes higher and sometimes lower than the risk).
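To make the definitions concrete, here is a minimal sketch (not from the notes) that estimates $R(h)$ by Monte Carlo. The toy setup is assumed: points are 1-D, $\mathcal D$ is a standard normal, the true class is $C = \{x : x > 0\}$, and each hypothesis is a threshold set $h_t = \{x : x > t\}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: P = the real line, D = standard normal,
# true class C = {x : x > 0}, hypothesis h_t = {x : x > t}.
def in_C(x):
    return x > 0.0

def in_h(x, t):
    return x > t

def risk(t, n_samples=1_000_000):
    """Monte Carlo estimate of R(h_t): the prob. that h_t misclassifies
    a random point x ~ D (x in C but not in h_t, or vice versa)."""
    x = rng.standard_normal(n_samples)
    return float(np.mean(in_C(x) != in_h(x, t)))

print(risk(0.0))  # ~0: h_0 agrees with C everywhere
print(risk(0.5))  # ~0.19: points in (0, 0.5] are misclassified
```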
Notation:
Let $X\subseteq P$ be a set of $n$ training pts drawn from $\mathcal D$
Empirical risk (training error) $\hat R(h)$ is % of $X$ misclassified by $h$
$h$ misclassifies each training pt w/ prob. $R(h)$, so the number of misclassified training pts has a binomial distribution. By Hoeffding's inequality,
$P(|\hat R(h)-R(h)|>\epsilon) \le 2e^{-2\epsilon^2 n}$
- as $n$ grows, the probability that the training error deviates from the true risk by more than $\epsilon$ shrinks exponentially.
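A quick simulation (a sketch, with an assumed true risk $R(h) = 0.3$) of how this bound behaves for a single fixed hypothesis: the training error is a Binomial$(n, R(h))$ count divided by $n$, and we compare the observed deviation frequency to $2e^{-2\epsilon^2 n}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed: one fixed hypothesis with true risk p = 0.3; each training point is
# misclassified independently with prob. p, so n * R_hat ~ Binomial(n, p).
p, eps, trials = 0.3, 0.05, 20_000
for n in [100, 500, 2000]:
    R_hat = rng.binomial(n, p, size=trials) / n      # training errors over many training sets
    observed = np.mean(np.abs(R_hat - p) > eps)      # estimate of P(|R_hat - R| > eps)
    bound = 2 * np.exp(-2 * eps**2 * n)              # Hoeffding bound
    print(f"n={n:5d}  observed={observed:.3f}  bound={min(bound, 1.0):.3f}")
```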
Idea for learning algorithm: choose $\hat h\in H$ that minimizes $\hat R(\hat h)$ ← empirical risk minimization
Problem: if there are too many hypotheses, some $h$ with high $R(h)$ will get lucky and have very low $\hat R(h)$
- we can have so many hypotheses that some of them just get lucky and score far lower training error than their actual risk.
→ This is overfitting
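Here is a hedged illustration of this "lucky hypothesis" problem. In this assumed toy setting every hypothesis has true risk 0.5 (no better than a coin flip), yet with a large enough $|H|$ the empirical-risk minimizer still looks nearly perfect on a small training set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setting: every hypothesis in H misclassifies any given point
# independently with prob. 0.5, i.e. every h has true risk R(h) = 0.5.
n_points, true_risk = 50, 0.5
for n_hypotheses in [10, 1_000, 100_000]:
    # row i = misclassification indicators of hypothesis i on the n training points
    mistakes = rng.random((n_hypotheses, n_points)) < true_risk
    emp_risk = mistakes.mean(axis=1)
    print(f"|H|={n_hypotheses:6d}  best training error={emp_risk.min():.2f}  true risk=0.50")
```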
Dichotomies
- a dichotomy of $X$ is $X \cap h$, where $h\in H$
- picks out the training points that $h$ predicts are in class $C$
- e.g. for $n$ training points, there could be up to $2^n$ dichotomies (though simple hypothesis classes produce far fewer; see the sketch after this list).
- with that many dichotomies available, some $\hat h$ can reach $\hat R(\hat h) = 0$ even if every $h\in H$ has high risk
- we simply overfit
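As a sketch of why the number of dichotomies can be far below $2^n$, count the dichotomies that 1-D threshold hypotheses $h_t = \{x : x \le t\}$ (an assumed simple class, not one from the notes) induce on $n$ points: only $n+1$ of the $2^n$ possible labelings are achievable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dichotomies of X induced by threshold hypotheses h_t = {x : x <= t}.
# One representative threshold per achievable dichotomy: below all points,
# between consecutive points, and above all points.
n = 10
X = np.sort(rng.standard_normal(n))
thresholds = np.concatenate(([X[0] - 1.0], (X[:-1] + X[1:]) / 2, [X[-1] + 1.0]))
dichotomies = {tuple((X <= t).astype(int)) for t in thresholds}
print(len(dichotomies), "dichotomies vs", 2**n, "possible labelings")  # 11 vs 1024
```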
Suppose we limit $H$ so that it produces at most $\Pi$ dichotomies of $X$. Then, by the union bound over those dichotomies,
$P(\text{at least one dichotomy has } |\hat R - R|\ge\epsilon)\le\delta$
- $\delta = 2\Pi e^{-2\epsilon^2 n}$
- Hence, with prob. $\ge 1-\delta$, for every $h\in H$
- $\downarrow$ complement of the event above
- $|\hat R(h) - R(h)| \le \epsilon = \sqrt{\frac{1}{2n}\ln\frac{2\Pi}{\delta}}$
Smaller $\Pi$ means we're less likely to overfit: less variance, but more bias. It doesn't mean that the risk is small, though.
- Goal: we want an $H$ that fits the data well and doesn't produce many dichotomies.
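A minimal helper (assumed name and example numbers) that evaluates the $\epsilon$ from the bound above, to see the trade-off numerically: more dichotomies widen the gap, more data shrinks it.

```python
import math

def generalization_gap(n, Pi, delta):
    """epsilon from the bound: with prob. >= 1 - delta, every h in H
    satisfies |R_hat(h) - R(h)| <= sqrt(ln(2*Pi/delta) / (2*n))."""
    return math.sqrt(math.log(2 * Pi / delta) / (2 * n))

print(generalization_gap(n=1_000, Pi=100, delta=0.05))      # ~0.064
print(generalization_gap(n=1_000, Pi=10**6, delta=0.05))    # ~0.094
print(generalization_gap(n=100_000, Pi=10**6, delta=0.05))  # ~0.009
```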
Let $h^* \in H$ be the hypothesis that minimizes $R(h^*)$: the “best” classifier.
Let $\hat h\in H$ be the hypothesis that minimizes $\hat R(\hat h)$: the classifier we actually learn. With prob. $\ge 1-\delta$, our chosen $\hat h$ has nearly optimal risk:
Note: we still choose the classifier $\hat h$ that minimizes the empirical risk; we have no way of knowing the actual $h^*$. But if $\Pi$ is small and $n$ is large, the hypothesis $\hat h$ we have chosen is probably nearly as good as $h^{*}$.
We have the following inequalities:
$R(\hat h) \le \hat R(\hat h) +\epsilon$
- follows from the bound $|\hat R(h) - R(h)| \le \epsilon$ above, applied to $\hat h$
$\hat R(\hat h) + \epsilon \le \hat R(h^{*}) + \epsilon$: since $\hat h$ minimizes the empirical risk over $H$, its training error is no larger than $h^{*}$'s, even though $h^{*}$ is the hypothesis with the lowest true risk.
Then, using the bound $|\hat R(h) - R(h)| \le \epsilon$ once more, this time applied to $h^{*}$, we have
$\hat R (h^{*})+ \epsilon \le R(h^*) + 2\epsilon$
To sum up, with prob. $\ge 1-\delta$, $R(\hat h) \le R(h^*) + 2\epsilon$: the classifier we learn has risk within $2\epsilon$ of the best classifier in $H$.
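Chaining the three inequalities above gives the guarantee in one line (nothing new, just the steps combined), holding with prob. $\ge 1-\delta$:

$$R(\hat h) \;\le\; \hat R(\hat h) + \epsilon \;\le\; \hat R(h^{*}) + \epsilon \;\le\; R(h^{*}) + 2\epsilon$$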
Sample complexity: the # of training pts needed to achieve a given $\epsilon$ with prob. $\ge 1-\delta$. Solving the expression for $\epsilon$ above for $n$ gives $n \ge \frac{1}{2\epsilon^2}\ln\frac{2\Pi}{\delta}$.
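A tiny sketch (assumed name and example numbers) of that sample-complexity formula:

```python
import math

def sample_complexity(eps, Pi, delta):
    """Smallest n from the bound above guaranteeing |R_hat(h) - R(h)| <= eps
    for every h in H with prob. >= 1 - delta."""
    return math.ceil(math.log(2 * Pi / delta) / (2 * eps**2))

print(sample_complexity(eps=0.05, Pi=10**6, delta=0.05))  # 3501 training points
```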