Emphasis: understand the techniques

• What it is
• What problems it can solve
• In which situations we should use them
• Any limitations or requirements to use them
• How to evaluate them

# # Supervised v.s. Unsupervised Learning 监督学习和无监督学习

## # Supervised Learning 监督学习

• infer a (predictive) function from data associated with pre defined targets/classes/labels
从与预定义的目标 / 类 / 标签相关的数据中推断一个 (预测) 函数

• Example: group objects by predefined labels
根据预定义的标签，对对象进行分组

• Goal: Learn a model from labelled data (with multiple features) for future predictions
从标记的数据 (具有多种特征) 中学习一个模型，用于未来的预测

• Outcomes: We know outcomes: the predefined labels
我们知道结果：预先定义的标签

• Evaluation: error/accuracy, and other more metrics
错误 / 准确性，以及其他更多的指标

数据挖掘任务：分类

## # Unsupervised Learning 无监督学习

• discover or describe underlying structure from unlabelled data
从未标记的数据中发现或描述底层结构

• Example: group objects by multiple features
通过多种特征，对对象进行分组

• Goal: Learn the structure from unlabelled data (with multiple features)
从没有标记的数据 (具有多种特性) 中学习结构

• Outcomes: We do not know the outcomes
我们不知道结果

• Evaluation: No clear performance or evaluation methods
没有明确的表现或评估方法

数据挖掘任务：归并

Machine Learning Algorithms 机器学习算法
Unsupervised 无监督学习Supervised 监督学习

Continuous 数值型

• Clustering & Dimensionality Reduction 聚类与降维
• SVD 奇异值分解
• PCA
• K-means
• Regression 回归
• Linear 线性
• Polynomial 多项式
• Decision Trees 决策树
• Random Forests 随机森林

Categorical 标称型

• Association Analysis 关联分析
• Apriori 先验推测
• FP-Growth
• Hidden Markov Model 隐马尔可夫模型
• Classification 分类法
• KNN K - 近邻算法 k-Nearest Neighbor
• Trees
• Logistic Regression 逻辑回归
• Naive-Bayes 朴素贝叶斯
• SVM

# # Linear Regression

• We have knowledge: values in y
已知值取自 y
• We have factors or features: x variables
有因素或特征：x 个变量
• We need to split data into training and testing
将数据分为培训和测试
• We learned the model from training, and evaluate it on the testing set
从训练中学习模型，并在测试集上对其进行评估
• We do have truth in testing test and predictions for test set, as well as evaluation metrics: RMSE, MAE
在检测测试集和对测试集的预测，以及评估指标上 RMSE，MAE
• Have a general problem in supervised learning: overfitting
监督学习有一个普遍的问题：过度适应

# # Classification

Classification
a supervised way to group objects 一个监督的方式来分组对象
• We must have predefined labels 必须有预定义的标签
• We must have knowledge: we know some instances are labeled by predefined classes/labels/categories
必须有知识：我们知道一些实例是由预定义的类 / 标签 / 类别来标记的
• For a Purpose of Prediction 为了预测的目的

• To forecast or deduce the label/class based on values of features
根据特征值预测或推断标签 / 类别
• Let the machines/computers think as humans
让机器 / 计算机像人一样思考
• There are many real world applications
现实世界中有很多应用程序

• Financial Decision Making, e.g., credit card application
财务决策 - 信用卡申请
• Image Processing, e.g., face recognition in cameras
图像处理 - 摄像头中的人脸识别
• Computer/Network Security, e.g., virus or attack detection
计算机 / 网络安全 - 病毒或攻击检测
• Information Retrieval, e.g., relevance of a document to a query
信息检索 - 文档与查询的相关性
• Recommender Systems, e.g., rating prediction for Amazon
推荐系统 - 亚马逊的评级预测

Example - Classification App: Credit Card Application

Terminologies in Classification

• Features 特征
• Each row with features values is named as example or instance 带有特征值的每一行都被命名为示例或实例
• Classes 分类
• Knowledge 知识
• Unseen data
Classification
• Learn from the knowledge (examples with unknown labels)
从 knowledge 中学习（标签未知的示例）
• build predictive models to predict the unknown examples
建立预测模型来预测未知的例子

There are usually three types of classification:

1. Binary Classification 二元分类
Question: Is this an apple? Yes or No.
问题：这是苹果吗？是或否。

2. Multi-class Classification 多类分类
Question: Is this an apple, banana or orange?
问题：这是苹果、香蕉还是橘子？

3. Multi-label Classification 多标签分类
Use appropriate words to describe it: Red, Apple, Fruit , Tech, Mac, iPhone
用适当的词来描述它：红色、苹果、水果、科技、Mac、iPhone

We use binary classification as examples to introduce classification techniques.

But most of these classification methods can handle multi class classifications too.

There are different strategies to handle multi class classifications.

## # Standard Classification Process 标准分类流程

1. Training/Learning: Learn a model using the training data
训练 / 学习：使用训练数据学习模型
2. Validation/Test: Test using test data to assess accuracy
验证 / 测试：使用测试数据评估准确性
3. Application/Predictions: Apply the selected model to unseen data
应用 / 预测：将所选模型应用于看不见的数据

$\text { Accuracy }=\frac{\text { Number of correct classifications }}{\text { Total number of test cases }}$

## # Evaluation 评估

How could we know it is good or bad?

Data Splits for Evaluations 将数据拆分来进行评估

• There are several ways to split your data for evaluations
有几种方法可以分割数据进行评估
• Hold-out evaluation
• N-fold cross validation N 倍交叉验证
• Leave one out evaluation
• Stratified N fold cross validation
• ...

### # Hold-out evaluation

If your data is large enough

### # N-folds Cross Evaluation N 倍交叉验证

If your data is relatively small

### # Summary

• We always suggest you to use N-fold cross validation, as long as you have enough computational power it doesn’t matter your data is large or small
建议使用 N 倍交叉验证，只要有足够的计算能力，数据大小都无关紧要
因为 Hold-out 方法总归会有 bias 偏倚，数据量大，bias 会小些
• If your computer is not powerful
• Data is large => you can use hold-out
• Data is small => you can use N-fold cross validation
• No fixed rule to say data is large or small. Usually, a data set with less than 500K rows can be considered as small data
没有固定的规则来说明数据的大小。通常，行数小于 500K 的数据集可视为小数据
• Common mistakes: some students run both hold-out
and N-fold cross validation, and report best results.

• How it works

## # General Problem: overfitting 过拟合

Overfitting Problem
The model is over trained by the training set;

the performance on the testing set (such as accuracy) is significantly worse than the performance on training set

• Example of over trained:
过度训练的例子：
students can work on questions on the assignment well, but they may not work well on the questions in the exams.
学生可以很好地解决作业中的问题，但他们可能无法很好地解决考试中的问题。

• Is there an overfitting problem?

1. Linear Regression Models 线性回归模型
• M1: $Adj-R^{2}$ = 96%, MAE = 0.36
• M2: $Adj-R^{2}$ = 98%, MAE = 0.6

Adjust R Square tells you how the model performs on the train data set
MAE tells you how the model performs on the test data set
in the train data set, M2 works better than M1.
in the test data set, M2 works worse than M1
M2 has overfitting problem.
Because theoretically M2 works better than M1 on train data set, so M2 should also works better than M1 on test data set. But unfortunately, the MAE on the test data set is increased.

2. Classification Models 分类模型
• M1: Accuracy on training = 90%, testing = 85%
• M2: Accuracy on training = 80%, testing = 85%
• M3: Accuracy on training = 85%, testing = 60%

M3 有严重的问题，M1 的情况比较常见，不算严重
不管是哪种模型，都要基于 testing data 出报告，而不应该基于 training data

## # Classification Algorithms

Classification algorithm is the key component in the process

They are able to learn from training and build models…

There are many (supervised) classification algorithms:

• K-nearest neighbor classifier K 近邻分类器
• Naïve Bayes classifier 朴素贝叶斯分类器
• Decision tress 决策树
• Linear/Logistic regression 线性 / 逻辑回归
• Support Vector Machines 支持向量机
• Ensemble classifiers (e.g., random forest) 集成分类器（如 随机森林）
• Neural Networks 神经网络