# # Neural Networks

## # Artificial Neural Networks (ANN)

• Researchers took inspiration from biological neural systems to build the ANN

• An ANN contains many neuron units
• The units are connected within a structure
• They work as threshold switching units
• The interconnections among units are weighted
• The weights can be learned and tuned automatically through a training process
• A neuron in an ANN looks like this:

input signals → input function (linear) → activation function (nonlinear) → output signal

## # Perceptron

• One of the first neural network learning models, from the late 1950s and 1960s
• Simple and limited (single-layer model)
• Still used in current applications (modems, etc.)

Input
Different $x$ variables, each with a weight on its edge

Input function
Aggregates the inputs, usually as a weighted sum of the inputs

Activation function
• It is a threshold function
• It serves binary classification
• With a threshold, the output is only 1 or 0
• The sign function (output $-1$ or $+1$) can be used as the activation function
• The sigmoid (output between 0 and 1) can also be used as the activation function

Sigmoid function (S-shaped function)
It is popular for classification because it is smooth and differentiable, which makes the weights easy to update during training.

• Neural networks can be used for both classification and regression.
• Which task a network performs is controlled by the activation function applied.
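The pipeline above (weighted sum, then activation) can be sketched in a few lines. This is a minimal illustration, not code from the course; the function names are hypothetical, and the choice `sign(0) = +1` is an assumed convention.

```python
import math

def weighted_sum(weights, inputs):
    """Linear input function: S = sum of w_k * x_k."""
    return sum(w * x for w, x in zip(weights, inputs))

def step(s):
    """Threshold activation: output is 1 or 0."""
    return 1 if s > 0 else 0

def sign(s):
    """Sign activation: output is +1 or -1 (sign(0) = +1 is an assumption)."""
    return 1 if s >= 0 else -1

def sigmoid(s):
    """Smooth S-shaped activation with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))

def neuron(weights, inputs, activation):
    """One neuron: input function followed by activation function."""
    return activation(weighted_sum(weights, inputs))

# Same inputs and weights, different activation functions.
fired = neuron([1, 0, 1], [1, 1, 0], step)
score = neuron([0, 0, 0], [1, 0, 1], sigmoid)
```

Swapping the `activation` argument is the only change needed to move between a hard 0/1 decision and a graded score.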

### # Perceptron Training

• It is the simplest ANN model

• We need to train the model to learn the weights $w$, where

$w_{i} \leftarrow w_{i}+\Delta w_{i}$

$\Delta w_{i}=\eta(t-o) x_{i}$

• $t$ is the real (target) value
• $o$ is the output value (the model's prediction)
• $\eta$ is a constant in $[0, 1]$, the learning rate
• Training is an iterative process:

• At the beginning, give random values to $w$
• Get the output $o$ through the perceptron
• Update $w$ using the update rule
• Stop the learning process when a stopping criterion is met:
  • the classification error is smaller than a threshold
  • or the maximum number of learning iterations has been reached

### # Perceptron: Example

• Consider learning the logical OR function

| Sample | $x_0$ | $x_1$ | $x_2$ | label |
|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 1 |
| 3 | 1 | 1 | 0 | 1 |
| 4 | 1 | 1 | 1 | 1 |

Here $x_0$ is always 1, so it acts as a constant (bias) input.

• Activation function

$S=\sum_{k=0}^{n} w_{k} x_{k}$; if $S>0$ then $O=1$, else $O=0$

• We'll use a single perceptron with three inputs.

• We'll start with all weights 0: W = <0,0,0>

• Example 1: I = <1,0,0>, label = 0, W = <0,0,0>

• Perceptron ($1 \times 0 + 0 \times 0 + 0 \times 0 = 0$, so $S=0$): output = 0
• It classifies it as 0, which is correct, so do nothing
• Example 2: I = <1,0,1>, label = 1, W = <0,0,0>

• Perceptron ($1 \times 0 + 0 \times 0 + 1 \times 0 = 0$): output = 0
• It classifies it as 0 while it should be 1, so we add the input to the weights (the update rule with $\eta=1$ and $t-o=1$): W = <0,0,0> + <1,0,1> = <1,0,1>
• Example 3: I = <1,1,0>, label = 1, W = <1,0,1>

• Perceptron ($1 \times 1 + 1 \times 0 + 0 \times 1 = 1 \gt 0$): output = 1
• It classifies it as 1, which is correct, so do nothing: W = <1,0,1>
• Example 4: I = <1,1,1>, label = 1, W = <1,0,1>

• Perceptron ($1 \times 1 + 1 \times 0 + 1 \times 1 = 2 \gt 0$): output = 1
• It classifies it as 1, which is correct, so do nothing: W = <1,0,1>

The 1st iteration is completed.
Repeat until there are no errors.
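The remaining iterations can be traced with a short script. This is a sketch under stated assumptions: learning rate $\eta = 1$, weights starting at zero as in the example, and samples presented in table order; the function names are illustrative. Under those assumptions the run settles at W = <0,1,1> on the fourth pass, which is the first pass with no errors.

```python
def predict(w, x):
    """Threshold unit: output 1 if the weighted sum S is > 0, else 0."""
    s = sum(wk * xk for wk, xk in zip(w, x))
    return 1 if s > 0 else 0

def train_perceptron(samples, eta=1.0, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i until a pass has no errors."""
    w = [0.0] * len(samples[0][0])
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, t in samples:
            o = predict(w, x)
            if o != t:
                errors += 1
                w = [wk + eta * (t - o) * xk for wk, xk in zip(w, x)]
        if errors == 0:          # stopping criterion: an error-free pass
            return w, epoch
    return w, max_epochs         # stopping criterion: iteration limit

# Logical OR with a constant first input x0 = 1.
OR_SAMPLES = [
    ([1, 0, 0], 0),
    ([1, 0, 1], 1),
    ([1, 1, 0], 1),
    ([1, 1, 1], 1),
]

weights, epochs = train_perceptron(OR_SAMPLES)
print(weights, epochs)
```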

### # Limitations of Perceptron

• It is too simple to learn complex, effective models
• It assumes the data are linearly separable in binary classification, but real data may not be
• In SVM, we use a kernel function to map the data to a higher dimension
• In ANN, we can add more layers

## # Multi-layer Feed-forward Networks

• A multi-layer feed-forward network is an extension of the perceptron model. It adds hidden layers to the original perceptron.
• Input layer: accepts inputs only
• Hidden layers: neurons that apply input and activation functions
• Output layer: produces outputs

### # Training Phase

• The training phase is a typical process of machine learning and optimization
• We need to
  • Set up a learning objective as a loss function
  • Use an appropriate optimizer to learn the parameters
• It is usually an iterative learning process

### # Loss Function

• The loss function $L\left(x, y, y^{\prime}\right)$ is defined as the amount of utility lost by predicting $h(x)=y^{\prime}$ when the correct answer is $f(x)=y$
• Often a simplified version is used, $L\left(y, y^{\prime}\right)$, that is independent of $x$
• Three commonly used loss functions:
• Absolute value loss: $L_{1}\left(y, y^{\prime}\right)=\left|y-y^{\prime}\right|$
• Squared error loss: $L_{2}\left(y, y^{\prime}\right)=\left(y-y^{\prime}\right)^{2}$
• 0/1 loss: $L_{0 / 1}\left(y, y^{\prime}\right)=0$ if $y=y^{\prime}$, else $1$
• Let $E$ be the set of examples. Total loss $L(E)=\sum_{e \in E} L(e)$
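The three losses and the total loss translate directly into code. A small sketch follows; the example pairs `(y, y_pred)` are made-up values chosen only to make the arithmetic easy to follow.

```python
def absolute_loss(y, y_pred):
    """L1(y, y') = |y - y'|"""
    return abs(y - y_pred)

def squared_loss(y, y_pred):
    """L2(y, y') = (y - y')^2"""
    return (y - y_pred) ** 2

def zero_one_loss(y, y_pred):
    """L0/1(y, y') = 0 if y == y', else 1"""
    return 0 if y == y_pred else 1

def total_loss(examples, loss):
    """L(E): sum of the per-example losses over the set E."""
    return sum(loss(y, y_pred) for y, y_pred in examples)

# Hypothetical (true value, prediction) pairs.
examples = [(1.0, 0.5), (0.0, 0.0), (2.0, 4.0)]

print(total_loss(examples, absolute_loss))   # 0.5 + 0 + 2
print(total_loss(examples, squared_loss))    # 0.25 + 0 + 4
print(total_loss(examples, zero_one_loss))   # 1 + 0 + 1
```

Note how the squared error punishes the one large miss much more heavily than the absolute loss does.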

### # Optimizer: Gradient Descent

• Gradient descent is one of the most widely used optimizers in machine learning, especially in ANN learning

#### # Optimization In Linear Regression

• How to apply gradient descent to minimize the cost function for regression
1. A closer look at the cost function
2. Applying gradient descent to find the minimum of the cost function
##### # a closer look at the cost function
• Hypothesis:

$h_{\theta}(x)=\theta_{0}+\theta_{1}x$

• Parameters:

$\theta_{0}, \theta_{1}$

• Cost function: sum of squared errors, scaled by $\frac{1}{2m}$

$J\left(\theta_{0}, \theta_{1}\right)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$

• Goal:

$\underset{\theta_{0}, \theta_{1}}{\operatorname{minimize}} J\left(\theta_{0}, \theta_{1}\right)$

• Optimization

• There are at least two optimization methods
  • Least squares optimization
  • Optimization based on gradient descent
• Least Squares Optimization

• Find the optimal point by setting $\frac{\partial}{\partial \theta_{j}} J=0$, where $J$ is the objective function with parameters $\theta_{0}, \theta_{1}, \theta_{2}, \ldots$
• $j=0,1,2, \ldots, N$, assuming you have $N$ $x$ variables
• Therefore, you will have $N+1$ equations to solve
• Drawback: it is complicated if you have many $x$ variables
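For the one-variable hypothesis $h_{\theta}(x)=\theta_{0}+\theta_{1}x$, the two least squares conditions can be written out and solved by hand. This is the standard derivation, included as a sketch; $\bar{x}$ and $\bar{y}$ (the sample means of $x$ and $y$) are notation introduced here, and the result assumes the $x^{(i)}$ are not all identical.

$\frac{\partial J}{\partial \theta_{0}}=\frac{1}{m} \sum_{i=1}^{m}\left(\theta_{0}+\theta_{1} x^{(i)}-y^{(i)}\right)=0$

$\frac{\partial J}{\partial \theta_{1}}=\frac{1}{m} \sum_{i=1}^{m}\left(\theta_{0}+\theta_{1} x^{(i)}-y^{(i)}\right) x^{(i)}=0$

Solving the two equations together gives

$\theta_{1}=\frac{\sum_{i=1}^{m}\left(x^{(i)}-\bar{x}\right)\left(y^{(i)}-\bar{y}\right)}{\sum_{i=1}^{m}\left(x^{(i)}-\bar{x}\right)^{2}}, \quad \theta_{0}=\bar{y}-\theta_{1} \bar{x}$

With many $x$ variables the same procedure yields a system of simultaneous linear equations, which is where the drawback noted above comes from.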
##### # applying gradient descent to find the minimum of the cost function
• Have some function $J\left(\theta_{0}, \theta_{1}\right)$

• Want $\min _{\theta_{0}, \theta_{1}} J\left(\theta_{0}, \theta_{1}\right)$

• Gradient descent algorithm outline:

• Start with some $\theta_{0}, \theta_{1}$
• Keep changing $\theta_{0}, \theta_{1}$ to reduce $J\left(\theta_{0}, \theta_{1}\right)$ until we hopefully end up at a minimum
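The outline can be made concrete for the linear regression cost $J$. The sketch below assumes the usual update $\theta_{j} \leftarrow \theta_{j}-\alpha\, \partial J / \partial \theta_{j}$ with a fixed step size; the step size `alpha`, the iteration count, and the toy data (generated from $y = 1 + 2x$) are all illustrative choices, not values from the notes.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum of squared errors."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Repeatedly move (theta0, theta1) against the gradient of J."""
    theta0, theta1 = 0.0, 0.0            # start with some theta0, theta1
    m = len(xs)
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                               # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m    # dJ/dtheta1
        # Update both parameters simultaneously.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Toy data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1, cost(theta0, theta1, xs, ys))
```

If `alpha` is too large the updates overshoot and $J$ grows instead of shrinking, so the step size usually needs tuning for each data set.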

## # Backpropagation Training

• There are several network structures in neural networks, such as feed-forward neural networks and recurrent neural networks
• Multi-layer feed-forward networks use a forward procedure for predictions, but they are trained with a backward propagation approach
• These ANNs are also called BP (Backpropagation) neural networks

### # ANN needs a process of weight training

• A set of examples, each with input vector $x$ and output vector $y$
• Squared error loss: $\text{Loss}=\sum_{k} \text{Loss}_{k}$, $\text{Loss}_{k}=\left(y_{k}-a_{k}\right)^{2}$, where $a_{k}$ is the $k$-th output of the neural net
• The weights are adjusted as follows:

$w_{i j} \leftarrow w_{i j}-\alpha \, \partial \text{Loss} / \partial w_{i j}$

• How can we compute the gradient efficiently given an arbitrary network structure?

### # Forward vs Backward in ANN

Forward phase:
• Propagate inputs forward to compute the output of each unit
• Output $a_{j}$ at unit $j$: $a_{j}=g\left(\mathit{in}_{j}\right)$ where $\mathit{in}_{j}=\sum_{i} w_{i j} a_{i}$
Backward phase:
• Propagate errors backward
• For an output unit $j$: $\Delta_{j}=g^{\prime}\left(\mathit{in}_{j}\right)\left(y_{j}-a_{j}\right)$
• For a hidden unit $i$: $\Delta_{i}=g^{\prime}\left(\mathit{in}_{i}\right) \sum_{j} w_{i j} \Delta_{j}$
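The two phases can be sketched for a tiny network with one hidden layer. Several things here are assumptions made for illustration: a sigmoid $g$, no bias units, the starting weights, and the names `W1`, `W2`, `backprop`, `sgd_step`. For the squared error loss, differentiating gives $\partial \text{Loss} / \partial w_{ij} = -2\, a_{i} \Delta_{j}$, which is how the $\Delta$ values are turned into gradients below.

```python
import math

def g(z):
    """Sigmoid activation; its derivative is g(z) * (1 - g(z))."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    """Forward phase. W1[i][h]: input i -> hidden h; W2[h][k]: hidden h -> output k."""
    hidden = [g(sum(W1[i][h] * x[i] for i in range(len(x))))
              for h in range(len(W1[0]))]
    output = [g(sum(W2[h][k] * hidden[h] for h in range(len(hidden))))
              for k in range(len(W2[0]))]
    return hidden, output

def loss(x, y, W1, W2):
    """Squared error loss summed over the output units."""
    _, output = forward(x, W1, W2)
    return sum((yk - ak) ** 2 for yk, ak in zip(y, output))

def backprop(x, y, W1, W2):
    """Backward phase: compute the deltas, then dLoss/dw = -2 * a_i * delta_j."""
    hidden, output = forward(x, W1, W2)
    delta_out = [a * (1 - a) * (yk - a) for a, yk in zip(output, y)]
    delta_hid = [hidden[h] * (1 - hidden[h]) *
                 sum(W2[h][k] * delta_out[k] for k in range(len(delta_out)))
                 for h in range(len(hidden))]
    grad_W1 = [[-2 * x[i] * delta_hid[h] for h in range(len(hidden))]
               for i in range(len(x))]
    grad_W2 = [[-2 * hidden[h] * delta_out[k] for k in range(len(output))]
               for h in range(len(hidden))]
    return grad_W1, grad_W2

def sgd_step(W, grad, alpha):
    """w_ij <- w_ij - alpha * dLoss/dw_ij for every weight in one layer."""
    return [[w - alpha * gw for w, gw in zip(row, grow)]
            for row, grow in zip(W, grad)]

# Hypothetical example: 2 inputs, 2 hidden units, 1 output.
x, y = [0.5, -1.0], [1.0]
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [[0.7], [-0.5]]
gW1, gW2 = backprop(x, y, W1, W2)
new_W1, new_W2 = sgd_step(W1, gW1, 0.1), sgd_step(W2, gW2, 0.1)
```

The backward pass reuses the activations from the forward pass, so all gradients come out of one sweep over the network rather than one loss evaluation per weight.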

## # Neural Networks and Deep Learning

• To make ANN more powerful, there are two solutions

• Add more neurons in the hidden layer
• Add more hidden layers
• Deep Learning

• A traditional ANN has only 3 layers (input, hidden, output). Deep learning uses neural networks with many layers
• Deep learning offers more network structures, such as ANN, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and so forth
• Deep learning is not only about neural networks. It also depends on computing hardware, such as GPUs
• ANN vs Deep Learning

# # Ensembles of Classifiers

• The basic idea is to learn a set of classifiers (experts) and to allow them to vote.
• Advantage: improvement in predictive accuracy.
• Disadvantage: an ensemble of classifiers is difficult to interpret.

## # Ensemble Methods

### # Bagging

• Process in bagging

• Sample several training sets of size $n$ (instead of just having one training set of size $n$)
• Build a classifier for each training set
• Combine the classifiers' predictions by voting or averaging
• Bagging classifiers

• Classifier generation
Let n be the size of the training set.
For each of t iterations:
Sample n instances with replacement from the training set.
Apply the learning algorithm to the sample.
Store the resulting classifier.

• classification
For each of the t classifiers:
Predict class of instance using classifier.
Return class that was predicted most often.
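The two procedures above can be sketched as follows. The base learner is an assumption: a 1-nearest-neighbour rule on a single numeric feature stands in for "the learning algorithm", and the data set, seed, and `t = 25` are illustrative only.

```python
import random
from collections import Counter

def learn_1nn(sample):
    """Hypothetical base learner: returns a 1-nearest-neighbour classifier."""
    def classify(x):
        nearest = min(sample, key=lambda point: abs(point[0] - x))
        return nearest[1]
    return classify

def bagging_fit(train, learn, t, seed=0):
    """Classifier generation: t bootstrap samples of size n, one model each."""
    rng = random.Random(seed)
    n = len(train)
    classifiers = []
    for _ in range(t):
        sample = [rng.choice(train) for _ in range(n)]   # with replacement
        classifiers.append(learn(sample))
    return classifiers

def bagging_predict(classifiers, instance):
    """Classification: return the class predicted most often."""
    votes = Counter(clf(instance) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Toy one-feature training set of (x, label) pairs.
train = [(0.0, "neg"), (1.0, "neg"), (9.0, "pos"), (10.0, "pos")]
ensemble = bagging_fit(train, learn_1nn, t=25)
print(bagging_predict(ensemble, 0.5))
```

A single bootstrap sample may miss one class entirely and produce a poor model; the vote across many samples is what smooths such cases out.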

• Voting and Averaging

• Voting is used for classification, and averaging is used for regression
• Voting: hard and soft voting
• Hard voting
Predictions:

Classifier 1 predicts class A
Classifier 2 predicts class B
Classifier 3 predicts class B
2/3 classifiers predict class B, so class B is the ensemble decision.

• Soft voting
Predictions (identical to the earlier example, but now in terms of probabilities; shown only for class A here because the problem is binary):

Classifier 1 predicts class A with probability 99%
Classifier 2 predicts class A with probability 49%
Classifier 3 predicts class A with probability 49%
The average probability of belonging to class A across the classifiers is (99+49+49)/3 = 65.67% .
Therefore, class A is the ensemble decision.
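The two examples show that hard and soft voting can disagree on the same three classifiers. A small sketch reproducing both; the function names and the 0.5 decision threshold for the binary case are assumptions.

```python
from collections import Counter

def hard_vote(predicted_classes):
    """Hard voting: the class predicted by the most classifiers wins."""
    return Counter(predicted_classes).most_common(1)[0][0]

def soft_vote(probs_class_a, threshold=0.5):
    """Soft voting (binary): average P(class A), then compare to a threshold."""
    average = sum(probs_class_a) / len(probs_class_a)
    decision = "A" if average > threshold else "B"
    return decision, average

hard_decision = hard_vote(["A", "B", "B"])
soft_decision, avg_prob_a = soft_vote([0.99, 0.49, 0.49])
print(hard_decision, soft_decision, round(100 * avg_prob_a, 2))
```

Soft voting lets one very confident classifier outweigh two barely decided ones, which is exactly what happens here.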

• Why does bagging work?

• Bagging reduces variance by voting / averaging, thus reducing the overall expected error
• In the case of classification there are pathological situations where the overall error might increase
• Usually, the more classifiers the better

### # Boosting

• Also uses voting/averaging but models are weighted according to their performance

• Iterative procedure: new models are influenced by the performance of previously built ones

• New model is encouraged to become expert for instances classified incorrectly by earlier models
• Assign more weights to the misclassified instances to improve the classification iteratively
• There are several variants of this algorithm

• classifier generation
Assign equal weight to each training instance.
For each of t iterations:
Learn a classifier from weighted dataset.
Compute error e of classifier on weighted dataset.
If e equal to zero, or e greater or equal to 0.5:
Terminate classifier generation.
For each instance in dataset:
If instance classified correctly by classifier:
Multiply weight of instance by e / (1 - e)
Normalize weight of all instances.

• classification
Assign weight of zero to all classes.
For each of the t classifiers:
Add -log(e / (1 - e)) to weight of class predicted by the classifier.
Return class with highest weight.
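One round of the weight update, and the vote weight used at classification time, can be sketched as below. It assumes that the error $e$ is the weighted fraction of misclassified instances; the function names and the four-instance example are illustrative.

```python
import math

def reweight(weights, correct):
    """One boosting round: returns (new_weights, e), or (None, e) to terminate.

    weights: current instance weights
    correct: True where the current classifier got the instance right
    """
    # Weighted error of the classifier (assumed definition of e).
    e = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
    if e == 0 or e >= 0.5:
        return None, e                      # terminate classifier generation
    factor = e / (1 - e)
    # Shrink the weights of correctly classified instances only.
    updated = [w * factor if ok else w for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], e  # normalize

def vote_weight(e):
    """Weight added to the class predicted by a classifier with error e."""
    return -math.log(e / (1 - e))

# Four equally weighted instances, the last one misclassified.
new_weights, e = reweight([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
print(new_weights, e, vote_weight(e))
```

After normalization the single misclassified instance carries half of the total weight, so the next classifier is pushed to get it right.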


### # Random Forest

• Random forest is a bagging method which uses decision trees as the classifiers

• The workflow in random forest is the same as in bagging, except that each tree node picks its split variable from $m$ randomly selected variables

• In bagging, we can use any classifier. In random forest, we use decision trees

• Classifier generation

Let n be the size of the training set.
For each of t iterations:
(1) Sample n instances with replacement from the training set
(2) Learn a decision tree s.t. the variable for any new node is the best variable among m randomly selected variables.
(3) Store the resulting decision tree.

• Classification

For each of the t decision trees:
Predict class of instance.
Return class that was predicted most often.


# # Semi-Supervised Classification

• Classification requires labeled data

• Data labeling is a complicated and expensive process. There is no guarantee of having enough high-quality labels

• Labels may be hard to get

• Human labeling is slow and boring
• It may require expert knowledge
• It may require special or expensive devices
• Goal:
Using both labeled and unlabeled data to build better classifiers (than using labeled data alone).

• Notation:

• input $x$, label $y$
• classifier $f: \mathcal{X} \mapsto \mathcal{Y}$
• labeled data $\left(X_{l}, Y_{l}\right)=\left\{\left(x_{1}, y_{1}\right), \ldots,\left(x_{l}, y_{l}\right)\right\}$
• unlabeled data $X_{u}=\left\{x_{l+1}, \ldots, x_{n}\right\}$
• usually $n \gg l$

## # Solutions: Self-training

• Algorithm: Self-training
1. Pick your favorite classification method. Train a classifier $f$ from $\left(X_{l}, Y_{l}\right)$.
2. Use $f$ to classify all unlabeled items $x \in X_{u}$.
3. Pick $x^{*}$ with the highest confidence, add $\left(x^{*}, f\left(x^{*}\right)\right)$ to labeled data.
4. Repeat.

The simplest semi-supervised learning method.
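A sketch of the loop on one-dimensional data is given below. The "favorite classification method" here is an assumed nearest-centroid classifier, confidence is taken to be the gap between the distances to the two closest centroids, and "Repeat" is read as "until no unlabeled items remain"; all three are illustrative choices.

```python
def train_centroids(labeled):
    """Step 1: train f. Here f is one centroid (mean x) per class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def classify(centroids, x):
    """Return (label, confidence). Needs at least two classes."""
    dists = sorted((abs(x - c), y) for y, c in centroids.items())
    (d1, label), (d2, _) = dists[0], dists[1]
    return label, d2 - d1        # large gap = confident prediction

def self_train(labeled, unlabeled):
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:                                   # step 4: repeat
        f = train_centroids(labeled)                   # step 1
        # Steps 2-3: classify everything, keep the most confident item.
        best = max(unlabeled, key=lambda x: classify(f, x)[1])
        labeled.append((best, classify(f, best)[0]))
        unlabeled.remove(best)
    return labeled

# Two labeled points and four unlabeled ones (made-up values).
result = self_train([(0.0, "neg"), (10.0, "pos")], [1.0, 9.0, 4.0, 6.5])
labels = dict(result)
print(labels)
```

The borderline points 4.0 and 6.5 are labeled last, after the easy points have already shifted the centroids, which is also how an early mistake would propagate.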

• Pros

• Simple
• Applies to almost all existing classifiers
• Cons

• Mistakes reinforce themselves. Heuristics against this pitfall:
• 'Un-label' a training point if its classification confidence drops below a threshold
• Randomly perturb learning parameters

## # Solutions: Co-training

• Your data can be split into different views

• Each view can be defined by a different set of features

• Each item is represented by two kinds of features $x=\left[x^{(1)} ; x^{(2)}\right]$

• $x^{(1)}$ = image features
• $x^{(2)}$ = web page text
• This is a natural feature split (or multiple views)
• Co-training idea:

• Train an image classifier and a text classifier
• The two classifiers teach each other
• Algorithm: Co-training

1. Train two classifiers: $f^{(1)}$ from $\left(X_{l}^{(1)}, Y_{l}\right), f^{(2)}$ from $\left(X_{l}^{(2)}, Y_{l}\right)$
2. Classify $X_{u}$ with $f^{(1)}$ and $f^{(2)}$ separately.
3. Add $f^{(1)}$'s $k$-most-confident $\left(x, f^{(1)}(x)\right)$ to $f^{(2)}$'s labeled data.
4. Add $f^{(2)}$'s $k$-most-confident $\left(x, f^{(2)}(x)\right)$ to $f^{(1)}$'s labeled data.
5. Repeat.
• Pros

• Simple. Applies to almost all existing classifiers
• Less sensitive to mistakes
• Cons

• Feature split may not exist
• Models using BOTH features should do better

# # Multi-Label Classifications

• Binary classification: Is this a picture of the sea?

$\in\{ yes, no \}$

• Multi-class classification: What is this a picture of?

$\in\{ sea, sunset, trees, people, mountain, urban \}$

• Multi-label classification: Which labels are relevant to this picture?

$\subseteq\{ sea, sunset, trees, people, mountain, urban \}$

i.e., multiple labels per instance instead of a single label!

## # Applications

• Images are labelled to indicate

• multiple concepts
• multiple objects
• multiple people
e.g., Scene data with concept labels

$\subseteq\{ beach, sunset, foliage, field, mountain, urban \}$

• Labelling music/tracks with genres / voices, concepts, etc.

• e.g., Music dataset, audio tracks labelled with different moods, among:
• amazed-surprised,
• relaxing-calm,
• quiet-still,
• angry-aggressive

## # Example

• Difference in data sets

• Table: Single-label $Y \in \{0,1\}$.

• Table: Multi-label $Y \subseteq\left\{\lambda_{1}, \ldots, \lambda_{L}\right\}$

• We usually convert labels to binary labels

## # Solutions

### # Transformation Based Methods

Transform the task to binary/multi-class classifications

#### # Binary Relevance

• If there are $N$ labels, we have $N$ binary classifications

• Drawback: it ignores the dependence among labels
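The transformation itself is simple: one 0/1 target column per label. A sketch, with hypothetical label names and data:

```python
def binary_relevance_targets(Y, labels):
    """Turn label sets into one binary target list per label.

    Y: list of label sets, one per instance
    labels: the N possible labels
    """
    return {label: [1 if label in y else 0 for y in Y] for label in labels}

labels = ["sea", "sunset", "trees"]
Y = [{"sea", "sunset"}, {"trees"}, set()]

targets = binary_relevance_targets(Y, labels)
print(targets)
```

Each list in `targets` would then be paired with the same feature matrix and handed to an ordinary binary classifier, independently of the other lists.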

#### # Classifier Chains

• Classifier Chains build the model in a chain by taking label correlations into consideration
• It uses the feature to perform binary classification on 1st label, the prediction on 1st label will be reused as the features into the 2nd step to predict the 2nd label
• Repeat the process above until all of the labels are predicted

• Use previous prediction results as new features
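The feature augmentation along the chain can be sketched as follows. One assumption is made explicit: when building the training sets, the true values of the earlier labels are appended as features, while at prediction time the earlier predictions would take their place. The function name and data are illustrative.

```python
def chain_training_sets(X, Y):
    """Build one (features, targets) pair per label, in chain order.

    X: list of feature lists
    Y: list of binary label lists, ordered as in the chain
    """
    n_labels = len(Y[0])
    training_sets = []
    for j in range(n_labels):
        # Original features plus the values of labels 0 .. j-1.
        features = [x + y[:j] for x, y in zip(X, Y)]
        targets = [y[j] for y in Y]
        training_sets.append((features, targets))
    return training_sets

X = [[0.1], [0.9]]
Y = [[1, 0], [0, 1]]

sets = chain_training_sets(X, Y)
print(sets[0])
print(sets[1])
```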

• Drawbacks in Classifier Chains

• It is difficult to define the order of labels in the chain, though there are some methods (e.g., info gain)
• If earlier predictions are incorrect, the following predictions may be wrong too.

#### # Label Powerset

• Each subset of the label set will be a single label

• Assign binary classification or multi-class classification to them

• Find a way to aggregate the results

1. Transform dataset

...into a multi-class problem, taking $2^{L}$ possible values:
2. ...and train any off-the-shelf multi-class classifier
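The dataset transformation in step 1 can be sketched as a mapping from each distinct label subset to one class id. Names and data below are hypothetical.

```python
def label_powerset_targets(Y):
    """Map every distinct label subset in Y to a single multi-class id."""
    classes = {}      # frozenset of labels -> class id
    targets = []
    for y in Y:
        key = frozenset(y)
        if key not in classes:
            classes[key] = len(classes)
        targets.append(classes[key])
    return targets, classes

Y = [{"sea", "sunset"}, {"trees"}, {"sea", "sunset"}, set()]

targets, classes = label_powerset_targets(Y)
print(targets)
print(len(classes))
```

Only subsets that occur in the training data receive a class id, so a subset never seen in training cannot be predicted later.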
• Drawbacks in Label Powerset

• Too many subsets if there are several labels
• Highly likely to have an imbalance issue
• Overfitting: how to predict new values/labels?

### # Adaptation Based Methods

Develop new algorithms to solve the problem

• MLkNN. For each test instance:
• Retrieve the top-k nearest neighbors of the instance
• Compute the frequency of occurrence of each label
• Assign a probability to each label and select the labels by using a probability cut-off value
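The three steps can be sketched as below. This follows the simplified description given here (label frequency among the neighbours, then a cut-off); the published ML-kNN algorithm estimates the probabilities in a more involved way, so treat this as an illustration only. The data, `k`, and `cutoff` are made up.

```python
def predict_labels(train, query, k=3, cutoff=0.5):
    """Frequency-based multi-label kNN.

    train: list of (feature_list, label_set) pairs
    query: feature list of the test instance
    """
    def squared_distance(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], query))

    # 1. Retrieve the k nearest neighbours.
    neighbours = sorted(train, key=squared_distance)[:k]
    # 2. Count how often each label occurs among them.
    counts = {}
    for _, label_set in neighbours:
        for label in label_set:
            counts[label] = counts.get(label, 0) + 1
    # 3. Treat frequency / k as a probability and apply the cut-off.
    return {label for label, c in counts.items() if c / k >= cutoff}

train = [
    ([0, 0], {"sea"}),
    ([0, 1], {"sea", "sunset"}),
    ([1, 0], {"sea"}),
    ([5, 5], {"urban"}),
    ([5, 6], {"urban", "people"}),
]

predicted = predict_labels(train, [0.2, 0.2])
print(predicted)
```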

## # Evaluation of multilabel learning

Notes

• Both transformation and adaptation methods are ways to solve the MLC (multi-label classification) problem
• They are not classification algorithms
• For each method, you can use any traditional binary/multi-class classification algorithm to produce the predictions
• There are multiple labels in the MLC problem
• Traditional classification metrics may not work for MLC
• We need evaluation metrics designed for MLC

### # Hamming Loss

Consider the misclassification in each bit

$\text{HAMMING LOSS}=\frac{1}{N L} \sum_{i=1}^{N} \sum_{j=1}^{L} \mathbb{I}\left[\hat{y}_{j}^{(i)} \neq y_{j}^{(i)}\right]=4 /(5 \times 4)=0.20$

N = # of data rows
L = # of labels

### # 0/1 Loss

Consider the misclassification in the whole label set

$0 / 1 \text{ LOSS}=\frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(\hat{\mathbf{y}}^{(i)} \neq \mathbf{y}^{(i)}\right)=3 / 5=0.60$
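Both losses are easy to compute from binary label matrices. The matrices below are hypothetical: they are constructed with 5 rows and 4 labels so that 4 individual bits are wrong, spread over 3 rows, which matches the two results above.

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of individual label bits that are predicted wrongly."""
    n_rows, n_labels = len(Y_true), len(Y_true[0])
    wrong_bits = sum(t != p
                     for row_t, row_p in zip(Y_true, Y_pred)
                     for t, p in zip(row_t, row_p))
    return wrong_bits / (n_rows * n_labels)

def subset_zero_one_loss(Y_true, Y_pred):
    """Fraction of rows whose whole label set is not predicted exactly."""
    wrong_rows = sum(row_t != row_p for row_t, row_p in zip(Y_true, Y_pred))
    return wrong_rows / len(Y_true)

Y_true = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0], [1, 0, 0, 0]]
Y_pred = [[1, 0, 0, 1], [0, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 0], [0, 0, 0, 0]]

print(hamming_loss(Y_true, Y_pred))
print(subset_zero_one_loss(Y_true, Y_pred))
```

The 0/1 loss is much stricter: a row with a single wrong bit counts as fully wrong, while the Hamming loss gives credit for the bits that were right.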

### # Other Metrics

• JACCARD INDEX: often called multi-label ACCURACY
• RANK LOSS: average fraction of pairs not correctly ordered
• ONE ERROR: if top ranked label is not in set of true labels
• COVERAGE: average "depth" to cover all true labels
• LOG LOSS: i.e., cross entropy
• PRECISION: predicted positive labels that are relevant
• RECALL: relevant labels which were predicted
• PRECISION VS. RECALL curves
• F-MEASURE
  • micro-averaged ('global' view)
  • macro-averaged by label (ordinary averaging of a binary measure, changes in infrequent labels have a big impact)
  • macro-averaged by example (one example at a time, average across examples)

## # Tools

Mulan
• Java Based
• Reuse Weka library
• No UI
• http://mulan.sourceforge.net/
Meka
• Similar to Weka
• Java Based
• With UI
• http://meka.sourceforge.net/

# # Classification: Summary

• We learned different algorithms

• No iterative learning process: KNN and Naïve Bayes
• Learning based: Logistic regression, Decision tree, SVM, Neural Networks
• Ensemble methods: bagging, boosting, Random Forest
• For each algorithm

• Understand how it works
• Know the requirements on the data; know how to prepare a preprocessed data set
• Know which parameters need to be tuned
• Know the solutions for overfitting
• Which algorithm is the best?
• It varies from data set to data set
• We need to tune the parameters to improve the model
• We need to compare different classification models
• General issue: imbalance in labels