# Outline

• An Overview of Statistical Learning
• What is Statistical Learning?
• Assessing Model Accuracy

# An Overview of Statistical Learning

## Introduction to Statistical Learning

Statistical learning refers to a vast set of tools for understanding data.

These tools can be classified as supervised or unsupervised.

Supervised learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs.

Unsupervised learning has no supervising output; the goal is to find relationships and structure in the data.

## Statistical Learning Problems

• Identify the numbers in a handwritten zip code

• Establish the relationship between salary and demographic variables in population survey data

• Predict whether the index will increase or decrease on a given day using the past 5 days' percentage changes in the index

• Understand which types of customers are similar to each other by grouping individuals according to their observed characteristics

## Supervised Learning Problem

• Outcome measurement $Y$ (also called dependent variable, response, target).
• Vector of $p$ predictor measurements $X$ (also called independent variables, inputs, attributes, features).
• In the regression problem, $Y$ is quantitative (e.g. price, blood pressure).
• In the classification problem, $Y$ takes values in a finite, unordered set (survived/died, digit 0-9, cancer class of tissue sample).
• We have training data $(x_1, y_1), \ldots, (x_N, y_N)$: these are observations (examples, instances) of these measurements.

On the basis of the training data we would like to:

• Accurately predict unseen test cases.
• Understand which inputs affect the outcome, and how.
• Assess the quality of our predictions and inferences.

### A Linear Model and Least Squares Example

An example of the linear model in a classification context.

$\hat{G}=\left\{\begin{array}{ll} \text { Orange } & \hat{y}=x^{T} \hat{\beta}>0.5 \\ \text { Blue } & \hat{y}=x^{T} \hat{\beta} \leq 0.5 \end{array}\right.$

Two predicted classes are separated by the decision boundary $\{x: \left.x^{T} \hat{\beta}=0.5\right\}$.
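As a sketch of how such a classifier can be computed, the following fits the linear model by least squares on simulated two-class data (the class means and sample sizes here are illustrative assumptions, not the book's figure) and classifies by thresholding $\hat{y}$ at 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated two-class data: Blue coded as 0, Orange coded as 1
# (class means and sizes are illustrative assumptions).
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(100, 2)),
               rng.normal([2.0, 2.0], 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Least squares: beta_hat = argmin ||y - A beta||^2, with an intercept column.
A = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# Classify as Orange when x^T beta_hat > 0.5, Blue otherwise.
y_hat = A @ beta_hat
pred = np.where(y_hat > 0.5, "Orange", "Blue")
```

The decision boundary is exactly the set $\{x: x^T\hat{\beta} = 0.5\}$, a line in this two-predictor setting.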

### K-Nearest Neighbor Example

The same classification example, now fit by 15-nearest-neighbor averaging.
The predicted class is chosen by majority vote amongst the 15 nearest neighbors.
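A minimal sketch of the majority-vote rule, on simulated data (the two clusters and their means are illustrative assumptions):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=15):
    """Predict the class of point x by majority vote amongst its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]              # the most common class among them

# Two simulated clusters (means are illustrative assumptions).
rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
                     rng.normal(3.0, 1.0, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
```

For example, `knn_predict(X_train, y_train, np.array([3.0, 3.0]))` returns the class of the second cluster, since its points dominate that neighborhood.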

## Unsupervised Learning Problem

• No outcome variable, just a set of predictors (features) measured on a set of samples.
• The objective is fuzzier: find groups of samples that behave similarly, find features that behave similarly, find linear combinations of features with the most variation.
• It is difficult to know how well you are doing.
• Different from supervised learning, but can be useful as a pre-processing step for supervised learning.

### Clustering Example with Three Groups

Left: The three groups are well-separated. In this setting, a clustering approach should successfully identify the three groups.

Right: There is some overlap among the groups.

Now the clustering task is more challenging.

## Real Problems

• Spam filtering
• Malware classification
• Anomaly detection problems such as fraud detection
• Recommendation systems
• Identifying fake news

One example with the Iris data: clustering for Iris.

## Statistical Learning vs. Machine Learning

• Machine learning arose as a subfield of Artificial Intelligence.
• Statistical learning arose as a subfield of Statistics.
• There is much overlap: both fields focus on supervised and unsupervised problems.
• Machine learning has a greater emphasis on large-scale applications and prediction accuracy.
• Statistical learning emphasizes models and their interpretability, and precision and uncertainty.

But the distinction has become more and more blurred, and there is a great deal of cross-fertilization.

# What is Statistical Learning?

Shown are Sales vs TV, Radio and Newspaper, with a blue linear-regression line fit separately to each.

Can we predict Sales using these three?

Perhaps we can do better using a model $\text { Sales } \approx f(\text { TV, Radio, Newspaper })$

Here Sales is a response or target that we wish to predict. We generically refer to the response as $Y$.

TV is a feature, or input, or predictor; we name it $X_1$. Likewise we name Radio $X_2$, and so on. We can refer to the input vector collectively as

$X=\left(\begin{array}{l} X_{1} \\ X_{2} \\ X_{3} \end{array}\right)$

Now we write our model as $Y = f(X)+\varepsilon$ ; where $\varepsilon$ captures measurement errors and other discrepancies.

More generally, suppose that we observe a quantitative response $Y$ and $p$ different predictors, $X_{1}, X_{2}, \ldots, X_{p}$ : Let $X = \left(X_{1}, X_{2}, \ldots, X_{p}\right)^{T}$ , which can be written in the very general form $Y=f(X)+\varepsilon$.

$\varepsilon$ is a random error term, which is independent of $X$ and has mean zero.

In this formulation, $f$ represents the systematic information that $X$ provides about $Y$.

Another Example

Figure: Left: The red dots are the observed values of income (in tens of thousands of dollars) and years of education for 30 individuals.

Right: The blue curve represents the true underlying relationship between income and years of education, which is generally unknown (but is known in this case because the data were simulated).

## What is $f(X)$ good for?

• With a good $f$ we can make predictions of $Y$ at new points $X = x$.
• We can understand which components of $X = (X_{1}, X_{2}, \ldots, X_{p})$ are important in explaining $Y$, and which are irrelevant. E.g. Seniority and Years of Education have a big impact on Income, but Marital Status does not.
• Depending on the complexity of $f$, we may be able to understand how each component $X_j$ of $X$ affects $Y$.

## Is There an Ideal $f$?

In particular, what is a good value for $f(X)$ at any selected value of $X$, say $X = 4$?

There can be many $Y$ values at $X = 4$. A good value is $f(4) = E(Y \mid X = 4)$.

$E(Y \mid X = 4)$ means the expected value (average) of $Y$ given $X = 4$.

This ideal $f(x) = E(Y \mid X = x)$ is called the regression function.

## How to Estimate $f$?

Typically we have few if any data points with $X = 4$ exactly.

So we cannot compute $E(Y \mid X = x)$.

Relax the definition and let $\hat{f}(x)=\operatorname{Ave}(Y \mid X \in \mathcal{N}(x))$, where $\mathcal{N}(x)$ is some neighborhood of $x$.

• Nearest neighbor averaging can be pretty good for small $p$, i.e. $p \leq 4$, and large-ish $N$.
• Nearest neighbor methods can be lousy when $p$ is large.
Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.
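A minimal sketch of nearest-neighbor averaging in one dimension, assuming a simulated true $f(x) = \sin(x)$ (an illustrative choice):

```python
import numpy as np

def nn_average(x0, X, Y, k=10):
    """Estimate f(x0) = E(Y | X = x0) by averaging Y over the k nearest x values."""
    idx = np.argsort(np.abs(X - x0))[:k]   # neighborhood N(x0): the k closest observations
    return Y[idx].mean()

# Simulated data with true f(x) = sin(x) (an illustrative assumption).
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, 500)
Y = np.sin(X) + rng.normal(0, 0.1, 500)

f_hat = nn_average(4.0, X, Y)   # estimate of E(Y | X = 4)
```

With $p = 1$ and $N = 500$, the neighborhood is narrow and the average tracks $\sin(4)$ closely; in high dimensions the same $k$ neighbors would be spread far from $x_0$.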

## Parametric and Structured Models

The linear model is an important example of a parametric model:

$f_{L}(X)=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\cdots+\beta_{p} X_{p}$

• A linear model is specified in terms of $p + 1$ parameters $\beta_{0}, \beta_{1}, \ldots, \beta_{p}$.
• We estimate the parameters by fitting the model to training data.
• Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function $f(X)$.
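A sketch of the fitting step: estimating the $p + 1$ parameters by least squares on simulated training data (the true coefficients below are illustrative assumptions):

```python
import numpy as np

# Simulated training data from a true linear model (coefficients are illustrative).
rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])   # beta_0, beta_1, ..., beta_p
Y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 0.5, n)

# Estimate the p + 1 parameters by least squares on the training data.
A = np.column_stack([np.ones(n), X])          # prepend the intercept column
beta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
```

With enough training data, `beta_hat` recovers the true coefficients up to noise, which is what makes the parameters directly interpretable.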

We can also estimate $f$ with non-parametric methods.

• Non-parametric methods do not make explicit assumptions about the functional form of $f$.
• They seek an estimate of $f$ that gets as close to the data points as possible without being too rough or wiggly.

However, we will focus on parametric models.

A linear model $\hat{f}_{L}(X)=\hat{\beta}_{0}+\hat{\beta}_{1} X$ gives a reasonable fit here.

A quadratic model $\hat{f}_{Q}(X)=\hat{\beta}_{0}+\hat{\beta}_{1} X+\hat{\beta}_{2} X^{2}$ fits slightly better.

Another example: revisit the Income example.

Red points are simulated values for income from the model $\text { income }=f \text { (education, seniority) }+\varepsilon$. $f$ is the blue surface.

Linear regression model fit to the simulated data.

$\hat{f}_{L} \text { (education, seniority) }=\hat{\beta}_{0}+\hat{\beta}_{1} \times \text { education }+\hat{\beta}_{2} \times \text { seniority }$

More flexible regression model $\hat{f}_{S}(\text{education, seniority})$ fit to the simulated data.

Here we use a technique called a thin-plate spline to fit a flexible surface.

Even more flexible spline regression model $\hat{f}_{S}(\text{education, seniority})$ fit to the simulated data.

Here the fitted model makes no errors on the training data! This is also known as overfitting.
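The exact-interpolation case can be sketched with SciPy's thin-plate-spline interpolator on simulated stand-ins for education and seniority (the data-generating model below is an illustrative assumption): with zero smoothing the surface passes through every training point, so the training error is essentially zero.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Simulated stand-ins for (education, seniority) and income
# (the data-generating model is an illustrative assumption).
rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(50, 2))
income = 5 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 1, 50)

# smoothing=0 forces exact interpolation: the surface passes through
# every training point, i.e. the overfitting case described above.
spline = RBFInterpolator(X, income, kernel='thin_plate_spline', smoothing=0.0)
train_mse = np.mean((spline(X) - income) ** 2)   # essentially zero
```

A positive `smoothing` value trades some training error for a smoother, less wiggly surface.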

### Prediction Accuracy vs. Interpretability

Less flexible methods are more restrictive, producing a relatively small range of shapes for $\hat{f}$.

E.g. linear regression: easy to understand the relationship between $Y$ and $x_{1}, \ldots, x_{p}$.

More flexible methods can generate a wider range of possible shapes to estimate $f$.

E.g. thin-plate splines can lead to such complicated estimates of $f$ that it is difficult to understand how any individual predictor is associated with the response.

Should we use more flexible models to get better accuracy?

Often more accurate prediction using a less flexible method. Why?

# Assessing Model Accuracy

There is no free lunch in statistics!

No one method dominates all others over all possible data sets.

• Important task: decide, for any given set of data, which method produces the best results.

• Note: selecting the best approach can be one of the most challenging parts of performing statistical learning in practice.

• Need: measure how well predictions match observed data, i.e. quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation.

## Measuring Quality of Fit

### Regression Setting

Mean Squared Error (MSE)

$MSE = \frac{1}{N} \sum_{i=1}^{N}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2}$

where $\hat{f}\left(x_{i}\right)$ is the prediction that $\hat{f}$ gives for the $i$th observation.

Suppose we fit a model $\hat{f}(x)$ to some training data $\operatorname{Tr}=\left\{x_{i}, y_{i}\right\}_{1}^{N}$

$MSE_{Tr} = \frac{1}{N} \sum_{i \in Tr}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2}$

However, this may be biased toward more overfit models.

What is the accuracy of the predictions that we obtain when we apply our method to previously unseen test data?

### Test MSE

Instead we should, if possible, compute it using fresh test data $Te=\left\{x_{i}, y_{i}\right\}_{1}^{M}$:

$MSE_{Te}=\frac{1}{M} \sum_{i \in Te}\left[y_{i}-\hat{f}\left(x_{i}\right)\right]^{2}$

• Scenario: test data available; minimize the test MSE $MSE_{Te}$ on that set.
• Scenario: no test observations available; minimize the training MSE $MSE_{Tr}$.
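The gap between the two quantities can be sketched by fitting polynomials of increasing degree (a stand-in for increasing flexibility) to simulated data; the true function and noise level below are illustrative assumptions:

```python
import numpy as np

# Simulated data; the true f and noise level are illustrative assumptions.
rng = np.random.default_rng(5)
f = lambda x: np.sin(2 * x)
x_tr = rng.uniform(0, 3, 50);  y_tr = f(x_tr) + rng.normal(0, 0.3, 50)
x_te = rng.uniform(0, 3, 200); y_te = f(x_te) + rng.normal(0, 0.3, 200)

def poly_mse(deg):
    """Fit a degree-`deg` polynomial on the training set; return (MSE_Tr, MSE_Te)."""
    coefs = np.polyfit(x_tr, y_tr, deg)
    mse = lambda x, y: np.mean((np.polyval(coefs, x) - y) ** 2)
    return mse(x_tr, y_tr), mse(x_te, y_te)

# Flexibility increases with the polynomial degree.
results = {deg: poly_mse(deg) for deg in (1, 3, 8)}
```

Training MSE can only decrease as the degree grows, while test MSE falls at first and eventually turns back up as the fit starts chasing noise.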

Left: Data simulated from $f$ , shown in black.

Three estimates of $f$ are shown: the linear regression line (orange curve), and two smoothing spline fits (blue and green curves).

Right: Red curve on right is $MSE_{Te}$, grey curve is $MSE_{Tr}$.

Orange, blue and green curves/squares correspond to fits of different flexibility.

Here the truth is smoother, so the smoother fit and linear model do really well.

Here the truth is wiggly and the noise is low, so the more flexible fits do the best.

### A Fundamental Conclusion

• An increase in model flexibility leads to a decrease in $MSE_{Tr}$.
• $MSE_{Te}$ depends on the truth; in most cases you will see a U-shape.
• Overfitting: small $MSE_{Tr}$ + large $MSE_{Te}$.
• What to do if no test data are available?
• Cross-validation: a method for estimating test MSE using training data. (We will learn this later.)
• Partition the data.
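The partition-and-hold-out idea behind cross-validation can be sketched as follows (the linear fit in the usage example is an illustrative choice, not a prescribed method):

```python
import numpy as np

def kfold_mse(X, Y, fit, predict, K=5, seed=0):
    """Estimate the test MSE by K-fold cross-validation on the training data alone."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, K)                 # partition the data into K folds
    errs = []
    for k in range(K):
        held_out = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], Y[train])            # fit on the other K - 1 folds
        errs.append(np.mean((predict(model, X[held_out]) - Y[held_out]) ** 2))
    return np.mean(errs)                           # average held-out MSE

# Usage with a simple linear fit (the data-generating model is illustrative).
rng = np.random.default_rng(6)
X = rng.uniform(0, 1, 100)
Y = 2 * X + 1 + rng.normal(0, 0.1, 100)
cv_mse = kfold_mse(X, Y,
                   fit=lambda x, y: np.polyfit(x, y, 1),
                   predict=lambda coefs, x: np.polyval(coefs, x))
```

Each observation is held out exactly once, so `cv_mse` approximates the error on previously unseen data without needing a separate test set.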

## Classification Setting

Here the response variable $Y$ is qualitative.

e.g. email is one of $\mathcal{C}=\{\text{spam}, \text{ham}\}$ (ham = good email); the digit class is one of $\mathcal{C}=\{0, 1, \ldots, 9\}$.

• Typically we measure the performance of $\hat{\mathcal{C}}(x)$ using the misclassification error rate:

$Err_{Te}=\operatorname{Ave}_{i \in Te} I\left[y_{i} \neq \hat{\mathcal{C}}\left(x_{i}\right)\right]$
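The error rate above is straightforward to compute; a minimal sketch (the spam/ham labels are illustrative):

```python
import numpy as np

def misclassification_rate(y_true, y_pred):
    """Average of the indicator I[y_i != C_hat(x_i)] over the test set."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

# One mistake in four predictions gives an error rate of 0.25.
err = misclassification_rate(["spam", "ham", "ham", "spam"],
                             ["spam", "ham", "spam", "spam"])
```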

• There are several classifiers: the Bayes classifier, support vector machines, logistic regression, k-nearest neighbors, decision trees, etc.

• Some other performance measures are needed. (Why?)