Emphasis: understand the techniques

  • What it is
  • What problems it can solve
  • In which situations we should use them
  • Any limitations or requirements to use them
  • How to evaluate them

# Supervised v.s. Unsupervised Learning 监督学习和无监督学习

# Supervised Learning 监督学习

  • infer a (predictive) function from data associated with pre defined targets/classes/labels
    从与预定义的目标 / 类 / 标签相关的数据中推断一个 (预测) 函数

  • Example: group objects by predefined labels

  • Goal: Learn a model from labelled data (with multiple features) for future predictions
    从标记的数据 (具有多种特征) 中学习一个模型,用于未来的预测

  • Outcomes: We know outcomes: the predefined labels

  • Evaluation: error/accuracy, and other more metrics
    错误 / 准确性,以及其他更多的指标

  • Data Mining Task: Classification

# Unsupervised Learning 无监督学习

  • discover or describe underlying structure from unlabelled data

  • Example: group objects by multiple features

  • Goal: Learn the structure from unlabelled data (with multiple features)
    从没有标记的数据 (具有多种特性) 中学习结构

  • Outcomes: We do not know the outcomes

  • Evaluation: No clear performance or evaluation methods

  • Data Mining Task: Clustering

Machine Learning Algorithms 机器学习算法
Unsupervised 无监督学习Supervised 监督学习

Continuous 数值型

  • Clustering & Dimensionality Reduction 聚类与降维
    • SVD 奇异值分解
    • PCA
    • K-means
  • Regression 回归
    • Linear 线性
    • Polynomial 多项式
  • Decision Trees 决策树
  • Random Forests 随机森林

Categorical 标称型

  • Association Analysis 关联分析
    • Apriori 先验推测
    • FP-Growth
  • Hidden Markov Model 隐马尔可夫模型
  • Classification 分类法
    • KNN K - 近邻算法 k-Nearest Neighbor
    • Trees
      • Logistic Regression 逻辑回归
    • Naive-Bayes 朴素贝叶斯
    • SVM

# Linear Regression

  • We have knowledge: values in y
    已知值取自 y
  • We have factors or features: x variables
    有因素或特征:x 个变量
  • We need to split data into training and testing
  • We learned the model from training, and evaluate it on the testing set
  • We do have truth in testing test and predictions for test set, as well as evaluation metrics: RMSE, MAE
    在检测测试集和对测试集的预测,以及评估指标上 RMSE,MAE
  • Have a general problem in supervised learning: overfitting

# Classification

a supervised way to group objects 一个监督的方式来分组对象
  • We must have predefined labels 必须有预定义的标签
  • We must have knowledge: we know some instances are labeled by predefined classes/labels/categories
    必须有知识:我们知道一些实例是由预定义的类 / 标签 / 类别来标记的
  • For a Purpose of Prediction 为了预测的目的

    • To forecast or deduce the label/class based on values of features
      根据特征值预测或推断标签 / 类别
    • Let the machines/computers think as humans
      让机器 / 计算机像人一样思考
  • There are many real world applications

    • Financial Decision Making, e.g., credit card application
      财务决策 - 信用卡申请
    • Image Processing, e.g., face recognition in cameras
      图像处理 - 摄像头中的人脸识别
    • Computer/Network Security, e.g., virus or attack detection
      计算机 / 网络安全 - 病毒或攻击检测
    • Information Retrieval, e.g., relevance of a document to a query
      信息检索 - 文档与查询的相关性
    • Recommender Systems, e.g., rating prediction for Amazon
      推荐系统 - 亚马逊的评级预测

Example - Classification App: Credit Card Application

Terminologies in Classification

  • Features 特征
    • Each row with features values is named as example or instance 带有特征值的每一行都被命名为示例或实例
  • Classes 分类
  • Knowledge 知识
  • Unseen data
  • Learn from the knowledge (examples with unknown labels)
    从 knowledge 中学习(标签未知的示例)
  • build predictive models to predict the unknown examples

# Classification Task 任务

There are usually three types of classification:

  1. Binary Classification 二元分类
    Question: Is this an apple? Yes or No.

  2. Multi-class Classification 多类分类
    Question: Is this an apple, banana or orange?

  3. Multi-label Classification 多标签分类
    Use appropriate words to describe it: Red, Apple, Fruit , Tech, Mac, iPhone

We use binary classification as examples to introduce classification techniques.

But most of these classification methods can handle multi class classifications too.

There are different strategies to handle multi class classifications.

# Standard Classification Process 标准分类流程

  1. Training/Learning: Learn a model using the training data
    训练 / 学习:使用训练数据学习模型
  2. Validation/Test: Test using test data to assess accuracy
    验证 / 测试:使用测试数据评估准确性
  3. Application/Predictions: Apply the selected model to unseen data
    应用 / 预测:将所选模型应用于看不见的数据

Accuracy=Number of correct classificationsTotal number of test cases\text { Accuracy }=\frac{\text { Number of correct classifications }}{\text { Total number of test cases }}

# Evaluation 评估

How could we know it is good or bad?

Data Splits for Evaluations 将数据拆分来进行评估

  • There are several ways to split your data for evaluations
    • Hold-out evaluation
    • N-fold cross validation N 倍交叉验证
    • Leave one out evaluation
    • Stratified N fold cross validation
    • ...

# Hold-out evaluation

If your data is large enough

方法一:将 Knowledge 随机分成两部分,一组为 Training Data,一组为 Test Data,通常训练集要大一些 70-80%。

方法二:将 Knowledge 分成三部分,Training Data,Validation Data,用来 Training, evaluating and tunning model parameters,模型参数的训练、评估和调整,以及 Testing Data,Report results on testing set only 仅在测试集上报告结果。这种情况下可以用 Validation 和 Test 两组数据进行评估,最终报告仅基于 Testing data,结果更可靠。

# N-folds Cross Evaluation N 倍交叉验证

If your data is relatively small

需要定一个 N ,可以选择任何一个大于 2 的整数,一般可以是 5 或者 10。
将数据随机平均分成 N 个 fold。
第一轮,选择第一个 fold 作为 valitation,其他组作为 training,建立模型,并评估,获得了一个 accuracy。
第二轮,选择第二个 fold 作为 valitation,其他组作为 training,建立模型,并评估,获得了另一个 accuracy。
以此类推,共做 N 个 Round。

每轮都选择一个 fold 作为 testing,其余作为 training。
最终生成 evaluation matrix,并且报告 average matrix 作为最终 accuracy。

# Summary

  • We always suggest you to use N-fold cross validation, as long as you have enough computational power it doesn’t matter your data is large or small
    建议使用 N 倍交叉验证,只要有足够的计算能力,数据大小都无关紧要
    因为 Hold-out 方法总归会有 bias 偏倚,数据量大,bias 会小些
  • If your computer is not powerful
    • Data is large => you can use hold-out
    • Data is small => you can use N-fold cross validation
    • No fixed rule to say data is large or small. Usually, a data set with less than 500K rows can be considered as small data
      没有固定的规则来说明数据的大小。通常,行数小于 500K 的数据集可视为小数据
  • Common mistakes: some students run both hold-out
    and N-fold cross validation, and report best results.

如果数据量小于 500k,直接选择 N-fold cross;数据量大于 500k,但是有大 cpu 或内存,仍然建议 N-fold cross。因为 N-fold cross 更可靠。

  • How it works

# General Problem: overfitting 过拟合

Overfitting Problem
The model is over trained by the training set;
the performance on the testing set (such as accuracy) is significantly worse than the performance on training set

  • Example of over trained:
    students can work on questions on the assignment well, but they may not work well on the questions in the exams.

  • Is there an overfitting problem?

    1. Linear Regression Models 线性回归模型
      • M1: AdjR2Adj-R^{2} = 96%, MAE = 0.36
      • M2: AdjR2Adj-R^{2} = 98%, MAE = 0.6

      Adjust R Square tells you how the model performs on the train data set
      MAE tells you how the model performs on the test data set
      in the train data set, M2 works better than M1.
      in the test data set, M2 works worse than M1
      M2 has overfitting problem.
      Because theoretically M2 works better than M1 on train data set, so M2 should also works better than M1 on test data set. But unfortunately, the MAE on the test data set is increased.

    2. Classification Models 分类模型
      • M1: Accuracy on training = 90%, testing = 85%
      • M2: Accuracy on training = 80%, testing = 85%
      • M3: Accuracy on training = 85%, testing = 60%

      M3 有严重的问题,M1 的情况比较常见,不算严重
      不管是哪种模型,都要基于 testing data 出报告,而不应该基于 training data

# Classification Algorithms

How to perform classification tasks?

Classification algorithm is the key component in the process

They are able to learn from training and build models…

There are many (supervised) classification algorithms:

  • K-nearest neighbor classifier K 近邻分类器
  • Naïve Bayes classifier 朴素贝叶斯分类器
  • Decision tress 决策树
  • Linear/Logistic regression 线性 / 逻辑回归
  • Support Vector Machines 支持向量机
  • Ensemble classifiers (e.g., random forest) 集成分类器(如 随机森林)
  • Neural Networks 神经网络

请我喝[茶]~( ̄▽ ̄)~*

Ruri Shimotsuki 微信支付


Ruri Shimotsuki 支付宝


Ruri Shimotsuki 贝宝