# # Decision Trees 决策树

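Excerpt from "Machine Learning in Action": Decision Trees

The k-nearest neighbors algorithm can handle many classification tasks, but its biggest drawback is that it cannot reveal the underlying meaning of the data. The main advantage of decision trees is that the form of the result is very easy to understand.

• Pros: low computational complexity, output that is easy to interpret, insensitivity to missing intermediate values, and the ability to handle irrelevant features.
• Cons: prone to overfitting.
• Works with: numeric and nominal values.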

## # Basic

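A decision tree is a flowchart-like tree structure: each internal node tests an attribute, each branch corresponds to an outcome of that test, and each leaf holds a class label.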

### # How it works?

• We learn and build a tree structure based on the training set
我们基于训练集学习并构建一个树结构
• After that, we are able to make predictions based on the tree
然后，我们能够基于树进行预测
• Example: is it a good day to play golf, given {sunny, windy, high humidity}?
适合打高尔夫吗？ 晴朗、多风、高湿度
• Question: How to learn such a tree? There could be many possible trees
问题：如何学习这样的树？可能有许多可能的树

Example: “is it a good day to play golf?”

• a set of attributes and their possible values

• outlook → sunny, overcast, rain
• temperature → cool, mild, hot
• humidity → high, normal
• windy → true, false

In this case, the target class is a binary attribute, so each instance represents a positive or a negative example.
在本例中，目标类是一个二进制属性，因此每个实例表示一个正示例或一个负示例。

• So a new instance:
<rainy ,hot, normal, true> : ?
will be classified as "noplay"
• Root node: the top of the tree, e.g. the node Outlook
根节点
• Parent and Children nodes: outlook as the parent, humidity and windy are children nodes
outlook 作为父节点，“湿度” 和 “风象” 是子节点
• Leaf node: P , N are the leaf nodes which do not have children and we can reach a leaf node to get the predictions
P , N 是没有子节点的叶节点，我们可以到达叶节点来获得预测

• If attributes are continuous, it should be converted into a nominal variable
如果属性是连续型变量，则应将其转换为标称型变量
• Each path in the tree represents a decision rule:
1. Rule1:
If (outlook="sunny") AND (humidity <= 0.75)
Then (play="yes")

2. Rule2:
If (outlook="rainy") AND (wind > 20)
Then (play="no")

3. Rule3:
If (outlook="overcast")
Then (play="yes")
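The three rules above can be written directly as code. This is a minimal sketch: the function name `classify` and the choice to return `None` when no rule fires are assumptions made for illustration, and only the three paths listed above are covered, not the full tree.

```python
def classify(outlook, humidity=None, wind=None):
    """Apply the three example rules; return None when no listed rule fires."""
    if outlook == "sunny" and humidity is not None and humidity <= 0.75:
        return "yes"  # Rule 1
    if outlook == "rainy" and wind is not None and wind > 20:
        return "no"   # Rule 2
    if outlook == "overcast":
        return "yes"  # Rule 3
    return None       # path not covered by the three rules listed above


print(classify("sunny", humidity=0.70))
print(classify("rainy", wind=25))
print(classify("overcast"))
```

Each `if` corresponds to one root-to-leaf path, which is why a tree is easy to read as a rule set.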

ID3
or Iterative Dichotomiser 3, was the first of these three decision tree implementations, developed by Ross Quinlan (Quinlan, J. R. 1986. Induction of Decision Trees. Mach. Learn. 1, 1 (Mar. 1986), 81-106.)

C4.5
Quinlan's next iteration. 昆兰的下一个迭代。
The new features (versus ID3) are:

1. accepts both continuous and discrete features;
同时接受连续和离散的特征。
2. handles incomplete data points;
处理不完整的数据点。
3. solves over-fitting problem by (very clever) bottom-up technique usually known as "__pruning__"; and
通过（非常聪明的）自下而上的技术解决过度拟合问题，通常被称为 "剪枝"；以及
4. different weights can be applied to the features that comprise the training data.
可以对组成训练数据的特征使用不同的权重。

CART
or Classification And Regression Trees. 或分类和回归树。
The CART implementation is very similar to C4.5; the one notable difference is that CART constructs the tree based on a numerical splitting criterion recursively applied to the data.
CART 的实现与 C4.5 非常相似；一个明显的区别是，CART 是根据一个递归应用于数据的数字分割标准来构建树。

### # Top-Down Decision Tree Generation 自上而下的决策树生成

• The basic approach usually consists of two phases:
基本方法通常包括两个阶段：

1. Tree construction 树的构建
• At the start, all the training examples are at the root
开始时，所有的训练实例都在根部
• Examples are partitioned recursively based on selected attributes
根据选定的属性，递归地分割实例
2. Tree pruning 树的修剪
• remove tree branches that may reflect noise in the training data and lead to errors when classifying test data
删除可能反映训练数据中的噪声并导致测试数据分类时出现错误的树枝
• improve classification accuracy
提高分类精度
• Basic Steps in Decision Tree Construction 决策树构建的基本步骤

• The tree starts as a single node representing all the data
树开始是一个代表所有数据的单一节点
• If the samples all belong to the same class, the node becomes a leaf labeled with that class label
如果样本都是同一类别，那么节点就会成为标有类别标签的叶子。
• Otherwise, select feature that best separates sample into individual classes.
否则，选择最能将样本分成各个类别的特征。
• Recursion stops when:
递归在以下情况下停止：
• Samples in node belong to the same class (majority)
节点中的样本属于同一类别（多数）。
• There are no remaining attributes on which to split
没有剩余的属性可供分割

## # Feature Selection 特征筛选

Choosing the “Best” Feature

Feature selection 特征筛选
is the key component in decision trees: deciding what features of the data are relevant to the target class we want to predict.

• Popular impurity measures in decision tree learning
决策树学习中流行的杂质度量
• Information Gain: Used in ID3.
信息增益。在 ID3 中使用。
• Gain Ratio: improvement over information gain. It is used in C4.5
增益比：对信息增益的改进。它在 C4.5 中使用
• Gini Index: Used in CART. It is a measure of how often a randomly chosen element from the set would be incorrectly labelled if it was randomly labelled according to the distribution of labels in the subset.
吉尼指数。在 CART 中使用。它是衡量从集合中随机选择的元素，如果按照子集合中的标签分布随机贴上标签，那么它被错误贴上标签的频率。
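The verbal definition of the Gini index above corresponds to the usual formula $1-\sum_{i} p_{i}^{2}$, where $p_{i}$ is the fraction of items with label $i$. A minimal sketch follows; the function name `gini_index` and the toy label lists are assumptions for illustration.

```python
from collections import Counter


def gini_index(labels):
    """Probability that a random item is mislabeled when labels are drawn
    at random from the label distribution of the same set."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini_index(["yes", "yes", "yes", "yes"]))  # pure node
print(gini_index(["yes", "no", "yes", "no"]))    # evenly mixed node
```

A pure node scores 0, and a 50/50 binary node scores 0.5, the maximum for two classes.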

### # Understand Entropy & Information Gain 理解熵和信息增益

• The decision tree is built in a top down fashion, but the question is how do you choose which attribute to split at each node?
决策树是自顶向下构建的，但问题是如何选择在每个节点上分割哪个属性？
The answer is to find the feature that best splits the target class into the purest possible child nodes (i.e., nodes that contain only one class rather than a mix, such as both male and female).
答案就是找到能够最好地将目标类分解为尽可能纯的子节点的特性 (例如：不包含男性和女性混合的节点，而是只包含一个类的纯节点)。

• For instance, in our previous example on recognition of tigers and lions, we have features like stripes, weights, size, color , etc. But, you may notice that, “stripes” is the key feature to distinguish tigers and lions!
例如，在我们上一个关于老虎和狮子识别的例子中，我们有条纹、重量、大小、颜色等特征，但是，你可能会注意到，“条纹” 是区分老虎和狮子的关键特征！

• This measure of information is called purity.
这种对信息的衡量被称为纯净度。

• Entropy is a measure of impurity. Information gain is the difference between the entropy before the split and the entropy after it.
熵是对不纯度的一种衡量。信息增益，是前后熵的区别。

• Information_Gain = Entropy_before split - Entropy_after split
信息增益 = 分割前的熵 - 分割后的熵

• Entropy_before split = Impurity before the split
分割前的熵 = 分割前的不纯度

• Entropy_after split = Impurity after the split
分割后的熵 = 分割后的不纯度

• The larger the information gain of a feature, the better that feature is as the “best” node for splitting the instances.
一个信息增益值（由一个特征决定）越大，该特征应该被用来作为 "最佳" 节点来分割实例。
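To make "entropy before minus entropy after" concrete, here is a small sketch. The tiger/lion rows are hypothetical toy data made up for illustration, and the entropy after the split is taken as the size-weighted average over the child nodes, which is the usual convention.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature_index):
    """Entropy before the split minus the weighted entropy after it."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - after


# Hypothetical toy data: (has_stripes, size)
rows = [("yes", "large"), ("yes", "small"), ("no", "large"), ("no", "small")]
labels = ["tiger", "tiger", "lion", "lion"]

print(information_gain(rows, labels, 0))  # split on stripes
print(information_gain(rows, labels, 1))  # split on size
```

In this toy set, stripes separate the classes perfectly and size does not separate them at all, so stripes would be chosen.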

### # Trees Construction Algorithm (ID3)

#### # Decision Tree Learning Method (ID3)

• Input: a set of training examples $S$, a set of features $F$

1. If every element of $S$ has class value “yes”, return “yes”; if every element of $S$ has class value “no”, return “no”
2. Otherwise, choose the best feature $f$ from $F$ (if there are no features remaining, then return failure);
3. Extend the tree from $f$ by adding a new branch for each attribute value of $f$
1. Set $F' = F - \{f\}$
4. Distribute training examples to leaf nodes (so each leaf node $n$ represents the subset of examples $S_{n}$ of $S$ with the corresponding attribute value)
5. Repeat steps 1-5 for each leaf node $n$ with $S_{n}$ as the new set of training examples and $F'$ as the set of attributes, until all the leaf nodes are labeled
• Main Question:

• how do we choose the best feature at each step?

Note: ID3 algorithm only deals with categorical attributes, but can be extended (as in C4.5) to handle continuous attributes
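The recursive procedure above can be sketched in a few lines. This is an illustrative sketch, not Quinlan's implementation: rows are plain dicts, the five training rows are hypothetical, and when no features remain the sketch returns the majority label instead of "failure" (an assumption that is common in practice).

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split(rows, labels, feature):
    """Group rows and labels by the value each row takes for `feature`."""
    groups = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = groups.setdefault(row[feature], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    return groups


def information_gain(rows, labels, feature):
    after = sum(len(sub_labels) / len(labels) * entropy(sub_labels)
                for _, sub_labels in split(rows, labels, feature).values())
    return entropy(labels) - after


def id3(rows, labels, features):
    if len(set(labels)) == 1:        # step 1: all examples share one class
        return labels[0]
    if not features:                 # assumption: majority label, not "failure"
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, labels, f))  # step 2
    remaining = [f for f in features if f != best]                         # F' = F - {f}
    return {best: {value: id3(sub_rows, sub_labels, remaining)             # steps 3-5
                   for value, (sub_rows, sub_labels)
                   in split(rows, labels, best).items()}}


# Hypothetical training data for illustration only
rows = [
    {"outlook": "sunny", "windy": "false"},
    {"outlook": "sunny", "windy": "true"},
    {"outlook": "overcast", "windy": "false"},
    {"outlook": "rain", "windy": "false"},
    {"outlook": "rain", "windy": "true"},
]
labels = ["no", "no", "yes", "yes", "no"]
tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
```

The result is a nested dict: each key is a feature, each sub-key is one of its values, and each leaf is a class label.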

#### # Choosing the “Best” Feature

Use Information Gain to find the “best” (most discriminating) feature

• Assume there are two classes, $P$ and $N$ (e.g, $P$ = “yes” and $N$ = “no”)
• Let the set of instances $S$ (training data) contain $p$ elements of class $P$ and $n$ elements of class $N$
• The amount of information, needed to decide if an arbitrary example in $S$ belongs to $P$ or $N$ is defined in terms of entropy, $I(p,n)$:

$I(p, n)=-\operatorname{Pr}(P) \log _{2} \operatorname{Pr}(P)-\operatorname{Pr}(N) \log _{2} \operatorname{Pr}(N)$

• Note that $\operatorname{Pr}(P)=p /(p+n) \text { and } \operatorname{Pr}(N)=n /(p+n)$
• More generally, if we have $m$ classes, and $s_{1}, s_{2}, \dots , s_{m}$ are the number of instances of $S$ in each class, then the entropy is:

$I\left(s_{1}, s_{2}, \cdots, s_{m}\right)=-\sum_{i=1}^{m} p_{i} \log _{2} p_{i}$

where $p_{i}$ is the probability that an arbitrary instance belongs to the class $i$
• Now, assume that using attribute $A$, a set $S$ of instances will be partitioned into sets $S_{1}, S_{2}, \dots , S_{v}$, each corresponding to a distinct value of attribute $A$.
• If $S_{i}$ contains $p_{i}$ cases of $P$ and $n_{i}$ cases of $N$, the entropy, or the expected information needed to classify objects in all subtrees $S_{i}$ is

$E(A)=\sum_{i=1}^{v} \operatorname{Pr}\left(S_{i}\right) I\left(p_{i}, n_{i}\right)$

where,
$$\operatorname{Pr}\left(S_{i}\right)=\frac{\left|S_{i}\right|}{|S|}=\frac{p_{i}+n_{i}}{p+n}$$

This is the probability that an arbitrary instance in $S$ belongs to the partition $S_{i}$.

• The encoding information that would be gained by branching on $A$

$\operatorname{Gain}(S, A)=E(S)-E(A)$

Here $E(S)=I(p, n)$ is the entropy of node $S$ before splitting, and $E(A)$ is the entropy after splitting $S$ on feature $A$.

• At any point we want to branch using an attribute that provides the highest information gain.
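As a worked illustration of these formulas, assume (hypothetically, since the notes give no counts) 14 training examples with $p=9$ and $n=5$, and an attribute outlook that splits them into sunny $(p_{1}, n_{1})=(2,3)$, overcast $(4,0)$ and rain $(3,2)$. Then, taking $0 \log_{2} 0 = 0$:

$I(9,5)=-\frac{9}{14} \log _{2} \frac{9}{14}-\frac{5}{14} \log _{2} \frac{5}{14} \approx 0.940$

$I(2,3)=I(3,2) \approx 0.971, \quad I(4,0)=0$

$E(\text{outlook})=\frac{5}{14} I(2,3)+\frac{4}{14} I(4,0)+\frac{5}{14} I(3,2) \approx 0.694$

$\operatorname{Gain}(S, \text{outlook})=I(9,5)-E(\text{outlook}) \approx 0.247$

The same calculation would be repeated for every other attribute, and the attribute with the largest gain becomes the split at this node.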

### # Other Criteria

• Information Entropy and Information Gain
it is used in ID3

$Gain(S, A) = E(S) - E(A)$

• Drawbacks of Information Gain

• If an attribute has a large number of values, splitting on it is more likely to produce pure partitions
• IG is therefore biased toward attributes that have a large number of values
• This bias can in turn lead to overfitting
• Gain Ratio

• It was first introduced in C4.5

$\text{GainRatio}(S, A) = \operatorname{Gain}(S, A) / \operatorname{Info}(A)$

$\operatorname{Info}(A)=-\sum_{i=1}^{v} \operatorname{Pr}\left(S_{i}\right) \log _{2} \operatorname{Pr}\left(S_{i}\right)$

• Gini Index

• It was used in the CART algorithm
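The gain ratio can be sketched as follows, using the usual split-information form with a leading minus sign so that the value is non-negative. The function names and the example partition sizes are assumptions for illustration; returning 0 when the split information is 0 is a convention chosen here to avoid dividing by zero.

```python
import math


def split_info(sizes):
    """Info(A): entropy of the partition sizes produced by attribute A."""
    total = sum(sizes)
    return -sum((s / total) * math.log2(s / total) for s in sizes if s > 0)


def gain_ratio(gain, sizes):
    """Information gain normalized by the split information."""
    info = split_info(sizes)
    return gain / info if info > 0 else 0.0


# Same hypothetical gain, but the second attribute has many more values
print(gain_ratio(0.5, [5, 5]))
print(gain_ratio(0.5, [1] * 10))
```

With the same raw gain, the attribute that shatters the data into ten singleton partitions gets a much smaller ratio, which is how the gain ratio counteracts the bias of information gain.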

Same questions in Decision Trees:

1. Any data requirements?
Categorical data can be used directly; numeric data may be transformed to categorical ones. Numeric data can be automatically utilized or transformed in C4.5
2. Is there a learning/optimization process
Yes, we are going to learn a tree structure.
3. Overfitting in DT?
Yes; see the Overfitting and Pruning section.

## # Overfitting and Pruning

### # Overfitting Problem

• Problem: The model is over-trained on the training set; it may show high accuracy on the training set but significantly worse performance on the test set.

• Let’s see an example
• You ran a classification task on a data set. The data set is big, so you used hold-out evaluation.
You built a model on the training set and evaluated it on the testing set.
Finally, you got 99% classification accuracy on the testing set. Is this an example of overfitting?
• How about N-fold cross-validation?

• Overfitting in DT: A tree may obtain good results on training, but bad on testing
• The tree may be too specific with many branches & leaf nodes

### # Solution: Tree Pruning

• A tree generated may over-fit the training examples due to noise or too small a set of training data
• Two approaches to alleviate over-fitting:
• Stop earlier: Stop growing the tree earlier
• Post-prune: Allow over-fit and then post-prune the tree
• Example of the Stop-Earlier:
• Examine the classification metric (such as accuracy) at each node. Stop the splitting process if the metric meets a pre-defined value
• Use the Minimum Description Length (MDL) principle: halt growth of the tree when the encoding is minimized.

#### # Post-Pruning the Tree

• A decision tree based on the training data may need to be pruned
• over-fitting may result in branches or leaves based on too few examples
• pruning is the process of removing branches and subtrees that are generated due to noise; this improves classification accuracy
• Subtree Replacement: merge a subtree into a leaf node
• At a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node; label it using the majority class

Suppose that with the test set we find 3 red “no” examples and 2 blue “yes” examples. We can replace the tree with a single “no” node. After replacement there will be only 2 errors instead of 5.

Summary

| | KNN Classifier | Naive Bayes Classifier | Decision Trees |
|---|---|---|---|
| Principle | Find the k nearest neighbors by distance; assign the majority class label | Calculate conditional probabilities; compare the probability of each label given an example | Build a tree in a top-down fashion; each node is selected as the "best" feature |
| Assumptions | None | Features are conditionally independent given the label | None |
| Feature Types | Categorical data should be converted to binary values | Numeric values may be converted to categorical ones | Numeric values may be converted to categorical ones |
| Feature Normalization | Yes | No | No |
| Overfitting | Lazy learner; parameters | Imbalanced classes | Tree pruning |

## # Regression Tree

• Decision tree, by default, was developed for classifications where the target variable is a nominal variable
• The tree-based method can also be extended to predict or estimate a numerical variable – it is the technique of Regression Tree
• To further understand regression tree, let’s compare decision classification tree vs regression tree
| | Classification Tree | Regression Tree |
|---|---|---|
| Target variable | Nominal | Numerical |
| Output in leaf | Nominal label | Numerical value (mean of set) |
| Impurity | Information gain or Gini index | MSE (mean squared error) |
| Branches | Could be more than two | Binary |

Comparison between classification trees and regression trees

### # How it works

• How to split the space to create branches
At each step, it iterates over all possible values or categories of each feature to create a binary split
The impurity is the MSE. We want the split that gives the lowest MSE
We continue splitting until a stopping criterion is met

• How to output a numerical value in a leaf node
The value is the mean of the target values in the split group
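The split search described above can be sketched for a single numeric feature. The function names and toy data are assumptions for illustration, and the score of a split is taken as the size-weighted average of the MSE on each side, which is one common way to combine the two groups.

```python
def mse(values):
    """Mean squared error of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Try every threshold t (x <= t vs x > t); return (t, weighted MSE)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(ys)
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Hypothetical data: two clearly separated groups
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, score = best_split(xs, ys)
left_leaf = sum(y for x, y in zip(xs, ys) if x <= threshold) / 3   # mean of left group
right_leaf = sum(y for x, y in zip(xs, ys) if x > threshold) / 3   # mean of right group
print(threshold, score, left_leaf, right_leaf)
```

A full regression tree would apply `best_split` across all features and recurse on each side until a stopping criterion is met.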

Tree Based Learning

• More complicated but much more effective sometimes
更复杂，但有时更有效
• Tree based learning: a machine learning method
基于树的学习：一种机器学习方法
• Require feature selection
需要特征选择
• Require to handle overfitting problems (Stop Earlier or Post Pruning)
要处理过拟合问题 (提前停止或后期剪枝)

# # Logistic Regression Logistic 回归

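Excerpt from "Machine Learning in Action": Logistic Regression

• Pros: computationally inexpensive, easy to understand and implement.
• Cons: prone to underfitting; classification accuracy may be low.
• Works with: numeric and nominal values.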
• Both Logistic regression and Linear SVM model can be considered as linear classification models.
Logistic 回归模型和线性 SVM 模型都可以视为线性分类模型。
They use linear models to solve classification problems
试图利用线性模型来解决分类问题
• We discuss logistic regression and SVM by using a binary classification as an example
以二分类为例讨论逻辑回归和 SVM
• Note that both of them can be applied to multi class classifications too
这两种方法也可以应用于多类分类

## # Simple Logistic regression model 简单 Logistic 回归模型

Relationship between qualitative binary variable Y and one x-variable:

Model for probability $p=Pr(Y=1)$ for each value $x$.

$\log \left(\frac{p}{1-p}\right)=\beta_{0}+\beta_{1} x$

$\text { Odds }=\frac{p}{1-p}=\frac{P(Y=1)}{P(Y=0)}$

measures the odds that event $Y = 1$ occurs

In logistic regression, we use 1 and 0 to denote binary labels
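Solving the log-odds equation above for $p$ gives $p = 1 / (1 + e^{-(\beta_{0}+\beta_{1} x)})$. The sketch below shows the conversions between probability, odds and log-odds; the function names are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Map a log-odds value z = b0 + b1*x to a probability p."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Odds that Y = 1 occurs."""
    return p / (1.0 - p)


def log_odds(p):
    """Left-hand side of the logistic regression model."""
    return math.log(odds(p))


print(sigmoid(0.0))            # log-odds of 0 means even odds
print(odds(0.8))               # about 4 to 1
print(log_odds(sigmoid(1.3)))  # round trip recovers about 1.3
```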

### # Interpreting

Let $p=Pr(Y=1)$ be the probability of “success”
$p=Pr(Y=1)$ 成功的概率

• If odds > 1 then $Pr(Y=1) > Pr(Y=0)$, i.e. $Pr(Y=1) > 0.5$
• If odds = 1 then $Pr(Y=1) = Pr(Y=0)$, i.e. $Pr(Y=1) = 0.5$
• If odds < 1 then $Pr(Y=1) < Pr(Y=0)$, i.e. $Pr(Y=1) < 0.5$

### # General Logistic Regression

• We may have several x variables in the model
模型中可能有几个 x 变量

$\log \left(\frac{p}{1-p}\right)=\beta_{0}+\beta_{1} x_{1}+\beta_{2} x_{2}+\beta_{3} x_{3}+\beta_{4} x_{4}+\beta_{5} x_{1}^{2}+\beta_{6} x_{1} x_{2}$

• 0.5 is the default cut off value, but we may improve the model by using other cut off values
0.5 是默认的 cut off 值，但可以使用其他的 cut off 值来改进模型
• $P(Y=1) \ge \alpha$ ➔ predicted as 1
• $P(Y=1) < \alpha$ ➔ predicted as 0
• Try different alpha values to see which one is the best
尝试不同的 alpha 值，看看哪个是最优的
• The model is interpretable. $P(Y=1)$ can be considered as a confidence value
模型是可解释的。$P(Y=1)$ 可以认为是一个置信度值
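Trying different cut-off values can be sketched as a small search. The probabilities and labels below are hypothetical, the function names are assumptions, and accuracy is used as the selection metric only for simplicity; in practice the cut-off would normally be chosen on validation data rather than on the test set.

```python
def predict(probs, alpha=0.5):
    """Predict 1 when P(Y=1) >= alpha, otherwise 0."""
    return [1 if p >= alpha else 0 for p in probs]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def best_cutoff(y_true, probs, candidates):
    """Return the candidate alpha with the highest accuracy."""
    return max(candidates, key=lambda a: accuracy(y_true, predict(probs, a)))


# Hypothetical predicted probabilities and true labels
probs = [0.20, 0.40, 0.45, 0.70, 0.90]
y_true = [0, 0, 1, 1, 1]

for alpha in (0.30, 0.45, 0.50):
    print(alpha, accuracy(y_true, predict(probs, alpha)))
print(best_cutoff(y_true, probs, [0.30, 0.45, 0.50]))
```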

### # Model fitting or building 模型拟合或建造

• The process is similar to the linear regression models
过程类似于线性回归模型
• The X variables must be numerical. Transformation is required if there are nominal variables
X 必须是数值变量，如果有标称变量，则需要进行转换。
• Feature selection methods, such as backward elimination, forward or stepwise selection, can also be applied
特征选择方法，如向后消除，向前或逐步选择，也可以应用
• Residual analysis needs to be performed
需要进行残留分析
• The model is evaluated by classification metrics, such as accuracy, precision, recall, ROC curve, etc
通过准确率、精密度、召回率、ROC 曲线等分类指标对模型进行评价

## # Example: Logistic Regression

admit,gre,gpa,rank
0,380,3.61,3
1,660,3.67,3
1,800,4,1
1,640,3.19,4
0,520,2.93,4
1,760,3,2
1,560,2.98,1
0,400,3.08,2


    admit	gre	gpa	rank
265	1	520	3.90	3
140	1	600	3.58	1
120	0	340	2.92	3
172	0	540	2.81	3
247	0	680	3.34	2
168	0	720	3.77	3

2. build model by FS

Call: glm(formula = admit ~ gpa + rank + gre, family = binomial(), data=train.data)

Coefficients:
(Intercept)	gpa	rank	gre
-2.861669	0.683853	-0.594686	0.002019

Degrees of Freedom: 319 Total (i.e. Null); 316 Residual
Null Deviance:	402.1
Residual Deviance: 370.7	AIC:378.7

3. produce probabilities

4. choose cut off value to calculate accuracy

 0.6875

 0.7
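Step 3 ("produce probabilities") can be reproduced by hand from the fitted coefficients shown in the `glm` output. The sketch below applies them to the first row of the data listing (gre = 380, gpa = 3.61, rank = 3). It assumes `rank` enters the model as a plain number, as the single `rank` coefficient suggests; the dictionary and function names are made up for illustration.

```python
import math

# Coefficients copied from the glm output above
COEF = {"intercept": -2.861669, "gpa": 0.683853, "rank": -0.594686, "gre": 0.002019}


def admit_probability(gre, gpa, rank):
    """P(admit = 1) from the fitted log-odds model."""
    z = (COEF["intercept"] + COEF["gpa"] * gpa
         + COEF["rank"] * rank + COEF["gre"] * gre)
    return 1.0 / (1.0 + math.exp(-z))


p = admit_probability(gre=380, gpa=3.61, rank=3)
prediction = 1 if p >= 0.5 else 0  # default cut-off of 0.5
print(round(p, 3), prediction)
```

The probability comes out at roughly 0.2, so with the default cut-off this applicant is predicted as not admitted, which matches the `admit` value of 0 in that row.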

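Excerpt from "Machine Learning in Action": General approach to logistic regression
1. Collect: any method.
2. Prepare: numeric values are needed because a distance calculation is required. A structured data format is best.
3. Analyze: any method.
4. Train: most of the time is spent training, where the goal is to find the optimal regression coefficients for classification.
5. Test: classification is quick once the training step is done.
6. Use: first, take some input data and convert it into the corresponding structured numeric values; then, using the trained regression coefficients, run a simple regression calculation on these values to decide which class they belong to; after that, further analysis can be done on the predicted classes.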

# # Support Vector Machines (SVM) 支持向量机

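Excerpt from "Machine Learning in Action": Support Vector Machines

• Pros: low generalization error, computationally inexpensive, results that are easy to interpret.
• Cons: sensitive to parameter tuning and to the choice of kernel; without modification, the basic classifier handles only binary classification.
• Works with: numeric and nominal values.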

## # Linear SVM

• Draw a linear model to separate two classes
画一个线性模型来区分两个类
• The model could be a straight line model (such as regression line in 2D space)
模型可以是直线模型 (如二维空间中的回归线)
• The model could be a hyperplane model in multi dimensional space
模型可以是多维空间中的超平面模型

We use straight line model as an example in the class. But, you should also keep in mind that the hyperplane model is still linear SVM

$f(x, w, b)=\operatorname{sign}(w x+b)$

PINK denotes → +1
BLUE denotes → -1

In logistic regression, we use 0 and 1 for binary labels.

In SVM, we use +1 and -1 as binary labels.

How would you classify this data?
Any of these would be fine..

Excerpt from "Machine Learning in Action": hyperplane

..but which is best?

### # Definition: Margin 间隔

Define the hyperplane $H$ such that:

$x_{i}w+b \geq +1 \text { when } y_{i} = +1 \\ x_{i}w+b \leq -1 \text { when } y_{i} = -1$

$H_{1}$ and $H_{2}$ are the planes:

$H_{1}: x_{i}w+b = +1$

$H_{2}: x_{i}w+b = -1$

The points on the planes $H_{1}$ and $H_{2}$ are the points in two classes (+1, -1) on the boundary.

They are also called the Support Vectors.

$d^{+}$ = the shortest distance to the closest positive point

$d^{-}$ = the shortest distance to the closest negative point

The margin of a separating hyperplane is $d^{+} + d^{-}$

set $M$ = Margin Width

Objective: Maximal Margin in SVM Classification

$M=\frac{\left(x^{+}-x^{-}\right)w}{|w|}=\frac{2}{|w|}$

What we know:

• $wx^{+} + b = +1$
• $wx^{-} + b = -1$
• $w(x^{+}-x^{-}) = 2$
Excerpt from "Machine Learning in Action": margin

### # Method: Maximizing the Margin 寻找最大间隔

We want a classifier with as big margin as possible.

$M=\frac{\left(x^{+}-x^{-}\right)w}{|w|}=\frac{2}{|w|}$

Maximize $M$ ➔ Minimize $|w|$ ➔ Minimize $\frac{1}{2} w^{\mathrm{T}} w$ = objective function 目标函数
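The relation $M = 2/|w|$ can be checked numerically. In the sketch below, the weight vector and the two support points are hypothetical: with $b = 0$, the points $x^{+} = w/|w|^{2}$ and $x^{-} = -w/|w|^{2}$ lie exactly on $H_{1}$ and $H_{2}$.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def margin_width(w):
    """M = 2 / |w|."""
    return 2.0 / math.sqrt(dot(w, w))


w = [3.0, 4.0]                       # hypothetical weight vector, |w| = 5
norm_sq = dot(w, w)
x_pos = [wi / norm_sq for wi in w]   # satisfies w.x + b = +1 with b = 0
x_neg = [-wi / norm_sq for wi in w]  # satisfies w.x + b = -1 with b = 0

# Project (x+ - x-) onto the unit normal w / |w|
diff = [p - n for p, n in zip(x_pos, x_neg)]
projected = dot(diff, w) / math.sqrt(norm_sq)

print(margin_width(w), projected)
```

Both values agree, and shrinking $|w|$ widens the margin, which is why maximizing $M$ becomes minimizing $\frac{1}{2} w^{\mathrm{T}} w$.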

Excerpt from "Machine Learning in Action": finding the maximum margin

### # Solving the Optimization Problem 解决分类器求解的优化问题

Find $w$ and $b$ such that
$\Phi(w)=\frac{1}{2} w^{\mathrm{T}}w$ is minimized;
and for all $\left\{\left(x_{i}, y_{i}\right)\right\}: y_{i}\left(w^{\mathrm{T}} x_{i}+b\right) \geq 1$

Need to optimize a quadratic function subject to linear constraints.

Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them.

The solution involves constructing a dual problem in which a Lagrange multiplier $\alpha_{i}$ is associated with every constraint in the primal problem:

Find $\alpha_{1} \dots \alpha_{N}$ such that
$Q(\alpha)=\sum \alpha_{i}-\frac{1}{2} \sum \sum \alpha_{i} \alpha_{j} y_{i} y_{j} x_{i}{ }^{\mathrm{T}} x_{j}$ is maximized and

1. $\sum \alpha_{i}y_{i} = 0$
2. $\alpha_{i} \ge 0$ for all $\alpha_{i}$.
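Excerpt from "Machine Learning in Action": why do SVM class labels use -1 and +1 rather than 0 and 1?

$label * (w^{\mathrm{T}}x+b)$ is called the functional margin of a point with respect to the separating hyperplane, and $label * (w^{\mathrm{T}}x+b) \cdot \frac{1}{\|w\|}$ is called its geometric margin.

Excerpt from "Machine Learning in Action": finding the w and b in the classifier definition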

$\arg \max _{w, b}\left\{\min _{n}\left(\text { label } \cdot\left(w^{\mathrm{T}} x+b\right)\right) \cdot \frac{1}{\|w\|}\right\}$

$\max _{\alpha}\left[\sum_{i=1}^{m} \alpha_{i}-\frac{1}{2} \sum_{i, j=1}^{m} \mathrm{label}^{(i)} \cdot \mathrm{label}^{(j)} \cdot \alpha_{i} \cdot \alpha_{j}\left\langle x^{(i)}, x^{(j)}\right\rangle\right]$

subject to $\alpha_{i} \ge 0$ and $\sum_{i=1}^{m} \alpha_{i} \cdot \text { label }^{(i)}=0$

### # Dataset with noise

Hard Margin
So far we require all data points be classified correctly

No training errors are allowed

Hard margin will build models without errors, which may introduce overfitting

Soft Margin
we allow errors but we want to minimize the errors

Soft margin allows errors in the model, which may help build a more general model

Slack variables $\xi_{i}$ can be added to allow misclassification of difficult or noisy examples.

New objective function Minimize 新目标函数最小化

$\frac{1}{2} w^{\mathrm{T}} w+C \sum_{k=1}^{R} \xi_{k}$

• The old formulation: Hard Margin
Find $w$ and $b$ such that
$\Phi(w)=\frac{1}{2} w^{\mathrm{T}}w$ is minimized;
and for all $\left\{\left(x_{i}, y_{i}\right)\right\}$:
$y_{i}\left(w^{\mathrm{T}} x_{i}+b\right) \geq 1$

• The new formulation incorporating slack variables: Soft Margin
Find $w$ and $b$ such that
$\Phi(w)=\frac{1}{2} w^{\mathrm{T}}w+C \sum \xi_{i}$ is minimized;
and for all $\left\{\left(x_{i}, y_{i}\right)\right\}$:
$y_{i}\left(w^{\mathrm{T}} x_{i}+b\right) \geq 1-\xi_{i}$ and $\xi_{i} \geq 0$ for all $i$

• Parameter $C$ can be viewed as a way to control overfitting.
常数 $C$ 可以被看作是一种控制过拟合的方法。
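For a fixed $w$ and $b$, the smallest slack that satisfies the soft-margin constraints is $\xi_{i}=\max \left(0,1-y_{i}\left(w^{\mathrm{T}} x_{i}+b\right)\right)$, so the objective can be evaluated directly. The sketch below uses hypothetical points and weights to show how $C$ scales the penalty for margin violations; it evaluates the objective only and does not solve the optimization problem.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def soft_margin_objective(w, b, X, y, C):
    """Return (objective value, slack per example) for fixed w and b."""
    slacks = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks


# Hypothetical data: two clean points, one inside the margin, one misclassified
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0], [0.5, 0.0]]
y = [1, -1, 1, -1]
w, b = [1.0, 0.0], 0.0

value, slacks = soft_margin_objective(w, b, X, y, C=1.0)
print(slacks, value)
print(soft_margin_objective(w, b, X, y, C=10.0)[0])
```

A slack between 0 and 1 means the point is inside the margin but on the correct side; a slack above 1 means it is misclassified. A larger $C$ makes such violations more expensive, pushing the solution toward the hard-margin behaviour.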

Excerpt from "Machine Learning in Action": slack variable

subject to $C \ge \alpha_{i} \ge 0$ and $\sum_{i=1}^{m} \alpha_{i} \cdot \text { label }^{(i)}=0$

## # Non-Linear SVM

• Datasets that are linearly separable with some noise work out great:
带有一些噪声的线性可分数据集效果很好：

• But what are we going to do if the dataset is just too hard?

• How about mapping data to a higher dimensional space:
把数据映射到更高维的空间如何：

### # Feature spaces 特征空间

• We can map the original data to higher dimensional space

• General idea: 总体思路:
the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
原始输入空间总是可以映射到某个训练集可分离的高维特征空间:

• In the 2D space, our linear SVM model is $f(x) = wx + b$
在 2D 空间中，我们的线性 SVM 模型是

• In the multi-dimensional (MD) feature space, our model becomes $f(x)=w \varphi(x)+b$

• $\varphi(x)$ are the new vectors mapped to a higher dimensional space
$\varphi(x)$ 是映射到更高维度空间的新向量

### # The Kernel Function 核函数

• A Kernel Function is some function that corresponds to an inner product in some expanded feature space.
核函数是对应于某个扩展特征空间中的内积的函数。

• It helps us convert the input space from lower dimension to higher dimension by using the inner product.
它帮助我们利用内积将输入空间从低维转换到高维。

$K\left(x_{i}, x_{j}\right)=\varphi\left(x_{i}\right)^{\mathrm{T}} \varphi\left(x_{j}\right)$

• Examples of Popular Kernel Functions
• Linear Kernel

$K\left(x_{i}, x_{j}\right)=x_{i}^{\mathrm{T}} x_{j}$

• Polynomial Kernel of power p

$K\left(x_{i}, x_{j}\right)=\left(1+x_{i}^{\mathrm{T}} x_{j}\right)^{p}$

• Gaussian (radial basis function, RBF) Kernel

$K\left(x_{i}, x_{j}\right)=\exp \left(-\frac{\left\|x_{i}-x_{j}\right\|^{2}}{2 \sigma^{2}}\right)$

• Sigmoid Kernel

$K\left(x_{i}, x_{j}\right)=\tanh \left(\beta_{0} x_{i}^{\mathrm{T}} x_{j}+\beta_{1}\right)$
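The four kernels listed above translate directly into code. This is a plain-Python sketch; the function names and default parameter values (`p=2`, `sigma=1.0`, `beta0=1.0`, `beta1=0.0`) are assumptions for illustration.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, p=2):
    return (1 + dot(x, z)) ** p


def gaussian_kernel(x, z, sigma=1.0):
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def sigmoid_kernel(x, z, beta0=1.0, beta1=0.0):
    return math.tanh(beta0 * dot(x, z) + beta1)


x, z = [1, 2], [3, 4]
print(linear_kernel(x, z), polynomial_kernel(x, z))
print(gaussian_kernel(x, z), sigmoid_kernel(x, z))
```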

### # Non-linear SVMs Mathematically

• Dual problem formulation:
Find $\alpha_{1} \dots \alpha_{N}$ such that
$Q(\alpha) = \Sigma \alpha_{i}-\frac{1}{2} \Sigma \Sigma \alpha_{i} \alpha_{j} y_{i} y_{j} K\left(x_{i}, x_{j}\right)$ is maximized and

1. $\sum \alpha_{i}y_{i} = 0$
2. $\alpha_{i} \ge 0$ for all $\alpha_{i}$.
• The solution is still a linear formula in the kernel values:

$f(x)=\Sigma \alpha_{i} y_{i} K\left(x_{i}, x\right)+b$

• Optimization techniques for finding $\alpha_{i}$ remain the same!

### # Overview

• SVM finds a separating hyperplane in the feature space and classifies points in that space.
SVM 在特征空间中找到一个分离超平面，并对该空间中的点进行分类。
• It does not need to represent the space explicitly; defining a kernel function is enough.
它不需要显式地表示该空间，只需定义一个核函数。
• The kernel function plays the role of the dot product in the feature space.
核函数在特征空间中起点积的作用。
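To see the kernel acting as a dot product in the feature space, take the polynomial kernel with $p = 2$ on 2D inputs. One explicit feature map that matches it is $\varphi(x)=(1, \sqrt{2} x_{1}, \sqrt{2} x_{2}, x_{1}^{2}, x_{2}^{2}, \sqrt{2} x_{1} x_{2})$. The sketch below checks this numerically; the test vectors are arbitrary choices for illustration.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def poly2_kernel(x, z):
    """K(x, z) = (1 + x.z)^2, computed in the original 2D space."""
    return (1 + dot(x, z)) ** 2


def phi(x):
    """Explicit 6D feature map whose inner product equals poly2_kernel."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1.0, r2 * x1, r2 * x2, x1 ** 2, x2 ** 2, r2 * x1 * x2]


x, z = [1.0, 2.0], [3.0, -1.0]
print(poly2_kernel(x, z), dot(phi(x), phi(z)))
```

Both values agree, yet the kernel never builds the 6D vectors. For the Gaussian kernel the corresponding feature space is infinite-dimensional, so working through the kernel is the only practical route.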

## # Weakness of SVM

### # It is sensitive to noise 它对噪音很敏感

A relatively small number of mislabeled examples can dramatically decrease the performance

### # It only considers two classes 它只考虑两类

how to do multi-class classification (MCC) with SVM?

There are many methods to convert MCC to binary classification.

Below is one of these methods:

1. with output arity m, learn m SVM’s

• SVM 1 learns Output == 1 vs Output != 1
• SVM 2 learns Output == 2 vs Output != 2
• ...
• SVM m learns Output == m vs Output != m
2. To predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.
要预测新输入的输出，只需对每个 SVM 进行预测，并找出哪个预测最接近正区域。
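The prediction step can be sketched as taking the class whose SVM gives the largest decision value $w x + b$. The three linear classifiers below are hypothetical hand-picked weights, not trained models, and the function name is an assumption for illustration.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def one_vs_rest_predict(x, classifiers):
    """classifiers maps each output value to the (w, b) of its 'value vs rest' SVM."""
    scores = {label: dot(w, x) + b for label, (w, b) in classifiers.items()}
    # The winner is the SVM that puts x furthest into its positive region
    return max(scores, key=scores.get), scores


# Hypothetical SVMs for outputs 1, 2 and 3
classifiers = {
    1: ([1.0, 0.0], 0.0),
    2: ([0.0, 1.0], 0.0),
    3: ([-1.0, -1.0], 0.0),
}
label, scores = one_vs_rest_predict([0.2, 2.0], classifiers)
print(label, scores)
```

Comparing raw decision values assumes the SVMs produce scores on comparable scales, which is a simplification.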

## # Support Vector Regression (SVR)

• In SVM, we have two lines on the boundary; they are the lines closest to the hyperplane (i.e., the SVM classifier line or plane). Our model is the hyperplane that maximizes the margin (the distances to these lines)
在 SVM 中，我们在边界上有两条线 —— 它们是最接近超平面的线 (即 SVM 分类器线或平面)。我们的模型是超平面，它想要最大化余量 / 距离

• In SVR, we have two lines on the boundary; they are the lines farthest from the hyperplane (i.e., the regression line). Our model is the hyperplane or regression line that minimizes the distances
在 SVR 中，我们在边界上有两条线 —— 它们是离超平面最远的线 (即回归线)。我们的模型是超平面或回归线，它使距离最小化

• Same characteristic: in both cases the hyperplane is the model that lies between the boundary lines
相同的特征：超平面是边界线之间的模型

# # Multi-Class Classification by Binary Classification 二元分类法的多类分类法

• All the classification techniques we discussed can be applied to multi class classifications
所有分类技术都可以应用于多类分类
• Multi Class classification can be solved by multiple binary classifications
多类分类可以通过多个二元分类来解决
• One vs. One
• One vs. Rest
• Many vs. Many

## # Strategy 1: One vs. One

• Assume we have $N$ labels
假设我们有$N$ 个标签
• We will take each unique pair of these labels and perform $\frac{N(N-1)}{2}$ binary classifications
我们将选择这些标签的唯一对，并执行$\frac{N(N-1)}{2}$ 二进制分类
• We will get $\frac{N(N-1)}{2}$ classification results
我们将得到$\frac{N(N-1)}{2}$ 分类结果
• Finally, we use voting to get the final prediction results
最后，利用投票的方式得到最终的预测结果
• Notes: one label as positive, another as negative
注：一个标签是正的，另一个是负的

Example: Assume we have 4 labels: c1, c2, c3, c4
We will get $\frac{N(N-1)}{2} = 6$ unique pairs

• c1, c2: Binary Classification
• c1, c3: Binary Classification
• c1, c4: Binary Classification
• c2, c3: Binary Classification
• c2, c4: Binary Classification
• c3, c4: Binary Classification

Together, these six binary classifications produce the predictions.
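The pairing and voting can be sketched with `itertools.combinations`. The pairwise classifier below is a hypothetical stand-in (it picks whichever label has the nearer 1D center) so that the example runs without a trained model; the centers and the input value are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, labels, pairwise_classifier):
    """Run one binary classifier per unique pair and take a majority vote."""
    votes = Counter(pairwise_classifier(a, b, x) for a, b in combinations(labels, 2))
    return votes.most_common(1)[0][0], votes


# Hypothetical stand-in for trained binary classifiers
CENTERS = {"c1": 0.0, "c2": 1.0, "c3": 2.0, "c4": 3.0}


def nearest_center(a, b, x):
    return a if abs(x - CENTERS[a]) <= abs(x - CENTERS[b]) else b


labels = ["c1", "c2", "c3", "c4"]
pairs = list(combinations(labels, 2))
prediction, votes = one_vs_one_predict(1.9, labels, nearest_center)
print(len(pairs), prediction, dict(votes))
```

Ties in the vote are possible; this sketch leaves tie-breaking to `Counter.most_common`, and a real system would need an explicit rule.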

## # Strategy 2: One vs. Rest

• Assume we have $N$ labels
假设我们有$N$ 个标签
• We will perform $N$ binary classifications
我们将执行$N$ 二进制分类
• In each classification, we predict $C$ vs. Not-$C$
在每个分类中，我们预测$C$ vs. Not-$C$
• Finally, we use voting to get the final prediction results
最后，利用投票的方式得到最终的预测结果
• Notes: one label as positive, others as negative
注：一个标签是正的，另一个是负的

Example: Assume we have 4 labels: c1, c2, c3, c4
We will perform $N = 4$ binary classifications

• c1, ¬c1: Binary Classification
• c2, ¬c2: Binary Classification
• c3, ¬c3: Binary Classification
• c4, ¬c4: Binary Classification

Together, these four binary classifications produce the predictions.

## # Strategy 3: Many vs. Many

• Assume we have $N$ labels
假设我们有$N$ 个标签
• We will perform several binary classifications, one per bit of the code assigned to the labels
• They encode labels into new ones
它们将标签编码成新的标签
• Example: Error Correcting Output Codes, ECOC
• Notes: one set as positive, another set as negative
注：一组为正，另一组为负
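ECOC decoding can be sketched as follows. Each label gets a binary codeword, one binary classifier is trained per code bit (labels with bit 1 form the positive set, labels with bit 0 the negative set), and a new input is assigned to the label whose codeword is closest in Hamming distance to the predicted bits. The 7-bit codebook below is a hypothetical example for 4 labels, so this sketch uses 7 binary classifiers, and the predicted bits are made up rather than produced by real classifiers.

```python
def hamming(a, b):
    """Number of positions where two bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(bits, codebook):
    """Return the label whose codeword is nearest to the predicted bits."""
    return min(codebook, key=lambda label: hamming(bits, codebook[label]))


# Hypothetical codebook: every pair of codewords differs in 4 positions
CODEBOOK = {
    "c1": [1, 1, 1, 1, 1, 1, 1],
    "c2": [0, 0, 0, 0, 1, 1, 1],
    "c3": [0, 0, 1, 1, 0, 0, 1],
    "c4": [0, 1, 0, 1, 0, 1, 0],
}

# Outputs of the 7 binary classifiers for one input; the first bit is wrong for c3
predicted_bits = [1, 0, 1, 1, 0, 0, 1]
print(ecoc_decode(predicted_bits, CODEBOOK))
```

Because the codewords are far apart, a single wrong binary classifier still decodes to the intended label, which is the "error correcting" part of the name.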