# Unsupervised learning 无监督学习

Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data.
无监督学习是从未标记的数据中推断出描述隐藏结构的函数的机器学习任务。
Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution.
因为给学习者的例子是未标记的,所以没有错误或奖励信号来评估潜在的解决方案。
This distinguishes unsupervised learning from supervised learning and reinforcement learning.
这将无监督学习与监督学习和强化学习区分开来。

  • Clustering
  • Association Rule Mining
  • Principal Component Analysis
  • etc...

# How to evaluate unsupervised learning 如何评价无监督学习

  • Usually, we do not have a metric for evaluations
    通常没有评估的标准
  • But there are two ways 但是有两种方法
    1. We can manually look at the outputs, analyze and interpret them, to see whether there are significant differences and whether they are useful
      可以手动查看输出,对其进行分析和解释,以查看是否存在显著差异以及它们是否有用
    2. The outputs of unsupervised learning can be used as inputs to a supervised learning process, to see whether the supervised learning can be improved
      非监督学习的输出可以用作监督学习过程的输入,以查看是否可以改进监督学习

# Clustering 聚类

Excerpt from "Machine Learning in Action" (机器学习实战) – Clustering

Clustering is a type of unsupervised learning that groups similar objects into the same cluster.
It is somewhat like fully automatic classification.
Clustering methods can be applied to almost any kind of object; the more similar the objects within a cluster, the better the clustering result.

Cluster identification gives meaning to the clustering results.
Suppose we have some data and group the similar items together; cluster identification tells us what each of these clusters actually is.

The biggest difference between clustering and classification is that the targets of classification are known in advance, while those of clustering are not.
Because clustering produces the same kind of result as classification, only without predefined categories, clustering is sometimes called unsupervised classification.

Cluster analysis tries to put similar objects into the same cluster and dissimilar objects into different clusters.
What counts as "similar" depends on the chosen similarity measure.
Partitional Clustering 分区聚类
groups objects so as to minimize intra-cluster distances and maximize inter-cluster distances
只对对象进行分组,以最小化簇内距离和最大化簇间距离
Example: Document Clustering
示例:文档聚类
Density Based Clustering 基于密度的聚类
cluster objects based on the local connectivity and density functions
基于局部连通性和密度函数对对象进行聚类
Each cluster has a considerably higher density of points than the area outside of the cluster
每个簇都具有比簇外部高得多的点密度
Hierarchical Clustering 分层聚类
a clustering process in order to discover the hierarchical structure, like a hierarchical tree
一个聚类过程,以发现分层结构,如分层树
Example: categories and subcategories; taxonomies
示例:类别和子类别;分类学

# Partitional Clustering 分区聚类

Partitional Clustering
an unsupervised way to group objects
一种无监督的对象分组方法
  • Goal:
    Finding groups of objects in data such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
    在数据中寻找对象组,使得一个组中的对象彼此相似 (或相关),而与其他组中的对象不同 (或无关)

  • Notion of a Cluster can be Ambiguous
    簇的概念可能是模糊的

  • Basic idea

    • Measure similarity or distance between each two objects
      测量每两个对象之间的相似性或距离
    • Group the objects based on these similarities
      根据这些相似性对对象进行分组

  • Distance or Similarity Measures 距离或相似性度量
    Common Distance Measures:
    常见的距离度量方法
    1. Manhattan distance:

      $X=\left\langle x_{1}, x_{2}, \cdots, x_{n}\right\rangle$

      $Y=\left\langle y_{1}, y_{2}, \cdots, y_{n}\right\rangle$

      $\operatorname{dist}(X, Y)=\left|x_{1}-y_{1}\right|+\left|x_{2}-y_{2}\right|+\cdots+\left|x_{n}-y_{n}\right|$

    2. Euclidean distance:

      $\operatorname{dist}(X, Y)=\sqrt{\left(x_{1}-y_{1}\right)^{2}+\cdots+\left(x_{n}-y_{n}\right)^{2}}$

    3. Cosine distance:

      $\operatorname{dist}(X, Y)=1-\operatorname{sim}(X, Y)$

      $\operatorname{sim}(X, Y)=\frac{\sum_{i}\left(x_{i} \times y_{i}\right)}{\sqrt{\sum_{i} x_{i}^{2} \times \sum_{i} y_{i}^{2}}}$
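
To make these three measures concrete, here is a minimal Python/NumPy sketch (the function names are mine, not from the notes) that computes the Manhattan, Euclidean, and cosine distances between two feature vectors:

```python
import numpy as np

def manhattan(x, y):
    return np.abs(x - y).sum()

def euclidean(x, y):
    return np.sqrt(((x - y) ** 2).sum())

def cosine_distance(x, y):
    sim = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
    return 1.0 - sim

x = np.array([0, 3, 3, 0, 2], dtype=float)   # e.g., a document's term counts
y = np.array([4, 1, 0, 1, 2], dtype=float)
print(manhattan(x, y), euclidean(x, y), cosine_distance(x, y))
```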

# K-Means Clustering Algorithm K - 均值聚类算法

Excerpt from "Machine Learning in Action" (机器学习实战) – the K-means clustering algorithm

It is called K-means because it can discover $k$ distinct clusters, and the center of each cluster is computed as the mean of the values contained in that cluster.
K-means is an algorithm that finds the $k$ clusters of a given dataset.
The number of clusters $k$ is specified by the user, and every cluster is described by its centroid, i.e., the center of all the points in the cluster.

  • Pros: easy to implement.
  • Cons: may converge to a local minimum; converges slowly on large datasets.
  • Works with: numeric values.
  • Assume we have many examples/instances, each example can be represented by a vector of features, where the features must be numerical ones, e.g., weight, size, price, profits, etc
    假设我们有许多例子 / 实例,每个例子可以由特征向量表示,其中特征必须是数字的,例如重量、尺寸、价格、利润等

  • So that, we can use the distance measures to calculate the similarity or the dissimilarity (i.e., distance) between each two examples.
    因此,我们可以使用距离度量来计算每两个示例之间的相似性或不相似性 (即,距离)。

  • With this setting, we are able to apply the K-Means clustering algorithm to perform the standard clustering task.
    通过这样的设置,我们能够应用 K-Means 聚类算法来执行正常的聚类任务。

# Steps

  1. Init: initialize K and K clusters
    初始化 K 和 K 个簇
    There are multiple ways to define the initial cluster:
    定义初始簇有多种方式:
    • You can randomly choose K instances and each one of them is an individual cluster;
      可以随机选择 K 个实例,每个实例都是一个独立的簇;
    • Or, you can randomly assign all or parts of your instances into K groups.
      或者,可以将所有或部分实例随机分配到 K 个组中。
      Each group is an individual cluster.
      每个组都是一个单独的簇。
  2. Step 1. Calculate centroids for K clusters
    计算 K 个簇的质心
  3. Step 2. Assign data points to each cluster based on the distance between data and centroids
    根据数据点和质心之间的距离,将数据点分配给每个簇
  4. Step 3. get new K clusters, compare them with previous clusters
    获得 K 个新的簇,将它们与先前的簇进行比较
  5. Step 4. Repeat 1,2,3 until convergence (i.e., no points move between clusters)
    重复 1、2、3 直到收敛 (即没有点在簇之间移动)
Excerpt from "Machine Learning in Action" (机器学习实战) – how the K-means algorithm works

First, $k$ initial points are randomly chosen as centroids.
Then every point in the dataset is assigned to a cluster; specifically, each point is assigned to the cluster whose centroid is closest to it.
After this step, the centroid of each cluster is updated to the mean of all the points in that cluster.

Pseudocode for the process above:

Create k points as starting centroids (often chosen at random)
While the cluster assignment of any point has changed:
    For every data point in the dataset:
        For every centroid:
            Calculate the distance between the centroid and the data point
        Assign the data point to the cluster of the closest centroid
    For every cluster, compute the mean of all its points and use that mean as the new centroid

The phrase "closest" centroid implies some kind of distance calculation; any distance measure you prefer can be used.
# Stopping Criterion in Iterative learning 迭代学习中的停止准则

  • We need to stop the learning iterations when it is converged
    收敛时需要停止学习迭代
  • How to determine it is converged?
    如何确定它是收敛的?
    • Criterion 1: new clusters = old clusters
      准则 1: 新簇 = 旧簇
      stop learning when no changes on clusters
      当簇没有变化时停止学习
    • Criterion 2: setup a maximal learning iterations
      准则 2: 设置最大的学习迭代次数
      stop learning when it got to maximal learning iterations
      当它达到最大的学习迭代次数时,停止学习
    • In practice, we usually use the 2nd criterion, since clustering may take an unexpectedly large number of iterations to converge, especially when the data set is large
      在实践中,我们通常使用第二个准则,因为聚类可能需要经过意想不到的多次迭代才能收敛,特别是当数据集很大的时候

# Example

There are 8 documents (D1-D8); each document is represented by a vector over the terms T1-T5, where each value is the number of times that term occurs in the document.

| | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| D1 | 0 | 3 | 3 | 0 | 2 |
| D2 | 4 | 1 | 0 | 1 | 2 |
| D3 | 0 | 4 | 0 | 0 | 2 |
| D4 | 0 | 3 | 0 | 3 | 3 |
| D5 | 0 | 1 | 3 | 0 | 1 |
| D6 | 2 | 2 | 0 | 0 | 4 |
| D7 | 1 | 0 | 3 | 2 | 0 |
| D8 | 3 | 1 | 0 | 0 | 2 |
  1. Init: initialize K and K clusters

    • We plan to create 3 groups, i.e., set K = 3; this does not imply that K = 3 is the best choice, and other values of K should still be tried
    • create initial clusters 创建初始簇
      Split the samples into three groups (C1-C3) completely at random
      Initial (arbitrary)
      assignment:
      C1 = {D1,D2},
      C2 = {D3,D4},
      C3 = {D5,D6}
      
  2. Step 1. Calculate centroids for K clusters
    计算 K 个簇的质心

    Compute the mean of each column, i.e., the center of all the points in the cluster: this is the centroid.

    | | T1 | T2 | T3 | T4 | T5 |
    |---|---|---|---|---|---|
    | D1 | 0 | 3 | 3 | 0 | 2 |
    | D2 | 4 | 1 | 0 | 1 | 2 |
    | D3 | 0 | 4 | 0 | 0 | 2 |
    | D4 | 0 | 3 | 0 | 3 | 3 |
    | D5 | 0 | 1 | 3 | 0 | 1 |
    | D6 | 2 | 2 | 0 | 0 | 4 |
    | D7 | 1 | 0 | 3 | 2 | 0 |
    | D8 | 3 | 1 | 0 | 0 | 2 |
    | C1 | 4/2 | 4/2 | 3/2 | 1/2 | 4/2 |
    | C2 | 0/2 | 7/2 | 0/2 | 3/2 | 5/2 |
    | C3 | 2/2 | 3/2 | 3/2 | 0/2 | 5/2 |
  3. Step 2. Assign data points to each cluster based on the distance between data and centroids
    根据数据点和质心之间的距离,将数据点分配给每个簇

    Now compute the similarity (or distance) of each item with each cluster, resulting in a cluster-document similarity matrix.
    现在计算每个 item 与每个簇的相似性 (或距离),得到一个聚类 - 文档相似性矩阵
    Here we use dot product as the similarity measure for simplicity
    为了简单起见,这里使用点积作为相似性度量。

    Compute the distance (or similarity) between each centroid and each data point.
    If a distance measure is used, assign each data point to the cluster with the smallest distance;
    if a similarity measure is used, assign it to the cluster with the largest similarity.
    This example uses a similarity measure.

    dot product & Cosine similarity

    Recall that the Cosine similarity of two vectors is their dot product divided by the product of their norms.
    回想一下,两个向量的余弦相似度是它们的点积除以它们的范数的乘积。

    For example, consider the two vectors $X$ and $Y$:

    $X = \langle 3, 0, 1, 2, 0, 3 \rangle$

    $Y = \langle 2, 0, 0, 3, 8, 4 \rangle$

    The dot product is given by the sum of the coordinate-wise products:
    点积由各坐标乘积的总和给出:

    $$\begin{aligned} \operatorname{dot-product}(X, Y) & = 3 \times 2+0 \times 0+1 \times 0+2 \times 3+0 \times 8+3 \times 4 \\ & = 6+0+0+6+0+12 \\ & = 24 \end{aligned}$$

    The norm of each vector is the square-root of the sum of the squares of its dimension values.
    每个向量的范数是其尺寸值平方和的平方根。
    So, the norms of X and Y are:
    所以,X 和 Y 的范数是:

    $\operatorname{norm}(X)=\sqrt{3^{2}+1^{2}+2^{2}+3^{2}}=4.8$

    $\operatorname{norm}(Y)=\sqrt{2^{2}+3^{2}+8^{2}+4^{2}}=9.6$

    and the Cosine similarity of X and Y is given by:
    X 和 Y 的余弦相似性由下式给出:

    $\operatorname{Sim}(X,Y)=\frac{\operatorname{dot-product}(X, Y)}{\operatorname{norm}(X) \times \operatorname{norm}(Y)}=\frac{24}{4.8 \times 9.6}=0.52$

    | | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
    |---|---|---|---|---|---|---|---|---|
    | C1 | 29/2 | **29/2** | 24/2 | 27/2 | 17/2 | 32/2 | **15/2** | **24/2** |
    | C2 | **31/2** | 20/2 | **38/2** | **45/2** | 12/2 | **34/2** | 6/2 | 17/2 |
    | C3 | 28/2 | 21/2 | 22/2 | 24/2 | **17/2** | 30/2 | 11/2 | 19/2 |

    For each document, reallocate the document to the cluster to which it has the highest similarity (shown in bold in the table above).
    对于每个文档,将该文档重新分配到与其具有最高相似性的簇中 (在上表中以粗体显示)。
    After the reallocation we have the following new clusters.
    重新分配后,我们有以下新的簇。

    New assignment:
    C1 = {D2,D7,D8}, 
    C2 = {D1,D3,D4,D6}, 
    C3 = {D5}
    

    Note that the previously unassigned D7 and D8 have been assigned, and that D1 and D6 have been reallocated from their original assignment.
    请注意,之前未分配的 D7 和 D8 已被分配,D1 和 D6 已从其原始分配中重新分配。

    This is the end of first iteration (i.e., the first reallocation).
    这是第一次迭代 (即第一次重新分配) 的结束。

    Next, we repeat the process for another reallocation…
    接下来,我们重复该过程进行另一次重新分配…

  4. Step 3. get new K clusters, compare them with previous clusters
    获得 K 个新的簇,将它们与先前的簇进行比较

    Now compute new cluster centroids using the original document-term matrix
    现在使用原始的文档 - 项矩阵计算新的簇质心

    | | T1 | T2 | T3 | T4 | T5 |
    |---|---|---|---|---|---|
    | D1 | 0 | 3 | 3 | 0 | 2 |
    | D2 | 4 | 1 | 0 | 1 | 2 |
    | D3 | 0 | 4 | 0 | 0 | 2 |
    | D4 | 0 | 3 | 0 | 3 | 3 |
    | D5 | 0 | 1 | 3 | 0 | 1 |
    | D6 | 2 | 2 | 0 | 0 | 4 |
    | D7 | 1 | 0 | 3 | 2 | 0 |
    | D8 | 3 | 1 | 0 | 0 | 2 |
    | C1 | 8/3 | 2/3 | 3/3 | 3/3 | 4/3 |
    | C2 | 2/4 | 12/4 | 3/4 | 3/4 | 11/4 |
    | C3 | 0/1 | 1/1 | 3/1 | 0/1 | 1/1 |

    This will lead to a new cluster-document similarity matrix, similar to the previous one.
    这将产生一个新的簇 - 文档相似性矩阵,与前类似。

    | | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
    |---|---|---|---|---|---|---|---|---|
    | C1 | 7.67 | **15.01** | 5.34 | 9.00 | 5.00 | **12.00** | 7.67 | **11.34** |
    | C2 | **16.75** | 11.25 | **17.50** | **19.50** | 8.00 | 6.68 | 4.25 | 10.00 |
    | C3 | 14.00 | 3.00 | 6.00 | 6.00 | **11.00** | 9.34 | **9.00** | 3.00 |

    Again, the items are reallocated to clusters with highest similarity.
    同样,文档被重新分配到具有最高相似性的簇中。

    New assignment:
    C1 = {D2,D6,D8},
    C2 = {D1,D3,D4},
    C3 = {D5,D7}
    

Note: This process is now repeated with new clusters.
注意:现在对新的簇重复这一过程。
However, the next iteration in this example will show no change to the clusters, thus terminating the algorithm.
然而,本例中的下一次迭代将不会有簇的变化,从而终止该算法。
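
To make the arithmetic above concrete, here is a small Python/NumPy sketch (names are mine, not from the notes) that recomputes the dot product and cosine similarity of the sample vectors X and Y, and then the first-iteration centroids and the cluster-document dot-product matrix of the example:

```python
import numpy as np

def cosine_sim(x, y):
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

X = np.array([3, 0, 1, 2, 0, 3], dtype=float)
Y = np.array([2, 0, 0, 3, 8, 4], dtype=float)
print(X @ Y)             # dot product: 24.0
print(cosine_sim(X, Y))  # ~0.52

# Document-term matrix D1..D8 over terms T1..T5 (from the example above)
docs = np.array([
    [0, 3, 3, 0, 2],   # D1
    [4, 1, 0, 1, 2],   # D2
    [0, 4, 0, 0, 2],   # D3
    [0, 3, 0, 3, 3],   # D4
    [0, 1, 3, 0, 1],   # D5
    [2, 2, 0, 0, 4],   # D6
    [1, 0, 3, 2, 0],   # D7
    [3, 1, 0, 0, 2],   # D8
], dtype=float)

# Initial clusters C1 = {D1, D2}, C2 = {D3, D4}, C3 = {D5, D6}
clusters = [[0, 1], [2, 3], [4, 5]]
centroids = np.array([docs[idx].mean(axis=0) for idx in clusters])

# Cluster-document dot-product similarity matrix (rows C1..C3, columns D1..D8);
# each document is then moved to the cluster with the largest entry in its column
sim = centroids @ docs.T
print(sim)   # e.g., sim[0, 0] == 14.5 == 29/2, as in the first matrix above
```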

# Evaluations 评价

  • There are no clear evaluations:
    没有明确的评价:
    clustering is good as long as it can serve your usage or applications
    只要能为您的使用或应用服务,聚类就是好的

  • Most common measure is the Sum of Squared Error (SSE)
    最常见的衡量标准是误差平方和 SSE

    • For each point, the error is the distance to the nearest cluster
      对于每个点,误差是到最近簇的距离

    • To get SSE , we square these errors and sum them.
      为了得到 SSE ,我们将这些误差平方并求和。

      $$SSE=\sum_{i=1}^{K} \sum_{x \in C_{i}} \operatorname{dist}^{2}\left(m_{i}, x\right)$$

      Note: SSE by itself is not a metric for evaluating the quality of the clustering results
      注意:SSE 本身并不是评估聚类结果好坏的指标

    • $x$ is a data point in cluster $C_{i}$ and $m_{i}$ is the representative point for cluster $C_{i}$
      $x$ 是簇 $C_{i}$ 中的数据点,而 $m_{i}$ 是簇 $C_{i}$ 的代表点
      One can show that $m_{i}$ corresponds to the center (mean) of the cluster
      可以证明 $m_{i}$ 对应于簇的中心 (均值)

    • Drawback: if K is increased, SSE can be decreased
      缺点:如果 K 增加, SSE 可以减少

    • It is used to measure how well the clustering process converges; it cannot tell how good the clustering results are
      它用于衡量聚类过程收敛得如何,而不能说明聚类结果的好坏

    • SSE can also be used to find the best K value
      还可用于查找最佳 K 值
      Try K = 3, 5, 7, 10, 13, 20, etc…
      Observe which K value lowers the SSE

Excerpt from "Machine Learning in Action" (机器学习实战) – measuring clustering quality with SSE

The reason K-means can converge yet produce a poor clustering is that it has converged to a local minimum rather than the global minimum (a local minimum is a result that is acceptable but not the best possible; the global minimum is the best possible result).

The matrix that stores the cluster assignments also keeps the error of each point, i.e., the squared distance from the point to its cluster centroid.

One metric for measuring clustering quality is the SSE (Sum of Squared Error).
A smaller SSE means the data points are closer to their centroids, and the clustering is therefore better.
Because the errors are squared, points that are far from the center are weighted more heavily.
One way that is guaranteed to lower the SSE is to increase the number of clusters, but that defeats the purpose of clustering;
the goal of clustering is to improve cluster quality while keeping the number of clusters unchanged.
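
A small sketch of computing the SSE and scanning several K values (it reuses the illustrative `kmeans` function from the earlier sketch; any clustering that returns centroids and labels would do). Watching where the SSE curve flattens is one rough way to pick K; remember that SSE always decreases as K grows, so the smallest SSE alone does not identify the best K:

```python
import numpy as np

def sse(X, centroids, labels):
    """Sum of squared distances of every point to its cluster centroid m_i."""
    return sum(((X[labels == i] - c) ** 2).sum() for i, c in enumerate(centroids))

X = np.random.rand(200, 5)
for k in (3, 5, 7, 10, 13, 20):
    centroids, labels = kmeans(X, k)     # kmeans as sketched earlier
    print(k, sse(X, centroids, labels))
```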

# How to evaluate the clustering results?

Solution 1
compare the clusters by their centroids and identify the significant differences among them, to better understand why their members were put together
通过使用质心来比较簇,并指出不同簇之间的显著差异,以更好地理解为什么将它们放在一起
| Centroid | Gender | GPA | Study Hours | Course Completed |
|---|---|---|---|---|
| C1 | 1 | 2.5 | 20 | 10 |
| C2 | 0.6 | 4.0 | 40 | 3 |
| C3 | 0 | 3.0 | 25 | 11 |
Solution 2
add the clustering results into a supervised learning process to learn whether they are able to improve supervised learning
将聚类结果添加到监督学习过程中,以了解它们是否能够改进监督学习
| Student | Gender | GPA | Study Hours | Course Completed | TA? |
|---|---|---|---|---|---|
| S1 | 1 | 2.5 | 20 | 10 | N |
| S2 | 0 | 4.0 | 40 | 3 | Y |
| S3 | 0 | 3.0 | 25 | 11 | Y |

| Student | Gender | GPA | Study Hours | Course Completed | TA? | Cluster |
|---|---|---|---|---|---|---|
| S1 | 1 | 2.5 | 20 | 10 | N | c1 |
| S2 | 0 | 4.0 | 40 | 3 | Y | c2 |
| S3 | 0 | 3.0 | 25 | 11 | Y | c2 |
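
A hedged sketch of Solution 2 using scikit-learn (the feature values follow the toy tables above; whether the extra Cluster column actually helps must be judged by the supervised model's validation score, not assumed):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Toy data following the tables above: Gender, GPA, Study Hours, Courses Completed
X = np.array([[1, 2.5, 20, 10],
              [0, 4.0, 40, 3],
              [0, 3.0, 25, 11]], dtype=float)
y = np.array([0, 1, 1])   # TA? column: N = 0, Y = 1

# Solution 2: append the unsupervised cluster id as an extra feature column
cluster_id = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
X_aug = np.column_stack([X, cluster_id])

# Train the same supervised model with and without the extra column and compare
# their scores (with real data, e.g. via cross-validation).
DecisionTreeClassifier(random_state=0).fit(X, y)
DecisionTreeClassifier(random_state=0).fit(X_aug, y)
```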

# Pros and Cons

Strength of the K-means K 均值的强处
  • Relatively efficient: $O(tkn)$, where $n$ is the number of objects, $k$ is the number of clusters, and $t$ is the number of iterations.
    Normally, $k, t \ll n$
    相对高效
  • Often terminates at a local optimum
    经常终止于局部最优
Weakness of the K-means K - 均值的弱处
  • What about categorical data? 分类数据呢
  • Performance is sensitive to initializations, e.g., K, initial clusters, and the definition of centroids
    性能对初始化的值很敏感,例如 K、初始簇和质心的定义
  • Need to specify K, the number of clusters, in advance
    需要提前指定 K,也就是簇的数量
  • Unable to handle noisy data and outliers
    无法处理嘈杂的数据和异常值
Variations of K-Means usually differ in: K - 均值的变化通常由于以下的不同
  • Selection of the initial K Means
    初始 K 均值的选择
  • Dissimilarity calculations
    相异度计算
  • Strategies to calculate cluster means
    计算簇平均值的策略

# Improve Your Clustering

Excerpt from "Machine Learning in Action" (机器学习实战) – improving the clustering results

So how can the results be improved? You can post-process the generated clusters; one approach is to split the cluster with the largest SSE into two clusters.
In practice, this can be done by filtering out the points of that largest-SSE cluster and running K-means on just those points with k set to 2.

To keep the total number of clusters unchanged, two clusters can then be merged, and there are two quantifiable ways to do so: merge the two closest centroids, or merge the two centroids whose merge yields the smallest increase in total SSE.
The first approach is implemented by computing the distances between all pairs of centroids and merging the closest pair.
The second requires merging two clusters and computing the total SSE; this has to be repeated for every possible pair of clusters until the best pair to merge is found.

# Pre-processing 预处理

  • Normalize the data 标准化数据
  • Eliminate outliers 消除异常值

# Post-processing 后处理

  • Eliminate small clusters that may represent outliers
    消除可能代表异常值的小簇
  • Split 'loose' clusters, i.e., clusters with relatively high SSE
    分割 “松散” 簇,即具有相对较高 SSE 的簇
  • Merge clusters that are ‘close’ and that have relatively low SSE
    合并 “接近” 且 SSE 相对较低的簇
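
A minimal sketch (function names are mine) of the post-processing ideas above: find the "loosest" cluster (largest SSE) as the candidate to split, and the closest pair of centroids as the candidates to merge:

```python
import numpy as np

def per_cluster_sse(X, labels, centroids):
    """SSE of each cluster: squared distances of its points to its centroid."""
    return np.array([((X[labels == i] - c) ** 2).sum()
                     for i, c in enumerate(centroids)])

def postprocess_targets(X, labels, centroids):
    sse = per_cluster_sse(X, labels, centroids)
    loosest = int(np.argmax(sse))   # candidate to split (run k-means with k = 2 on its points)
    # candidate pair to merge: the two closest centroids
    k = len(centroids)
    pairs = [(np.linalg.norm(centroids[i] - centroids[j]), i, j)
             for i in range(k) for j in range(i + 1, k)]
    _, a, b = min(pairs)
    return loosest, (a, b)
```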

# Variations of K-Means Clustering K - 均值聚类的变异

  • K-Means Clustering: centroid is defined as means
    k - 均值聚类:质心被定义为均值
  • K-Median Clustering: centroid is defined as medians
    k - 中位数聚类:质心被定义为中位数
  • K-Medoids Clustering: medoids as centroid
    K - 中心点聚类:以中心点为质心
  • X-Means Clustering: figure out a way to find best K
    X - 均值聚类:找出找到最佳 K 的方法
  • Fuzzy C-Means Clustering: fuzzy degree as confidence
    模糊 C 均值聚类:模糊度作为置信度
  • Many more…

# K-Medoids Clustering

K-Medoids Clustering
is one of the partitional clustering approaches
它是作为一种分区聚类方法建立的
Medoids as centroids
以中心点为质心
A medoid
is defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal.
medoid 被定义为一个簇的对象,其与该簇中所有对象的平均相异度是最小的
In other words, a medoid is the most centrally located points in the cluster.
换句话说,一个 medoid 是簇中位于最中心的点

PAM (Kaufman and Rousseeuw, 1987), built into S-Plus

  • Use real object to represent the cluster

    1. Select $k$ representative objects arbitrarily
      任意选择 $k$ 个代表对象
    2. For each pair of a non-selected object $h$ and a selected object $i$, calculate the total swapping cost $TC_{ih}$
      对于每对非选定对象 $h$ 和选定对象 $i$,计算总交换成本 $TC_{ih}$
    3. For each pair of $i$ and $h$,
      • If $TC_{ih} < 0$, $i$ is replaced by $h$
      • Then assign each non-selected object to the most similar representative object
    4. Repeat steps 2-3 until there is no change
  • Pros and Cons 利弊

    • The centroid is defined as the medoid which is the most centrally located object in one cluster
      质心被定义为中心点,也就是簇中位于最中心的对象
    • To some extent, it helps alleviate the situation of outliers
      在某种程度上,它有助于缓解离群值的情况
    • But this approach is not scalable and is time-consuming on large data sets
      但是这种方法不可扩展,对于大规模的数据集来说很耗时
    • Still sensitive to K, initialization, etc
      仍然对 K 值、初始化等敏感
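
A simplified PAM-style sketch in Python/NumPy (illustrative only, not the original PAM implementation): the swapping cost of replacing medoid i with object h is evaluated as the change in the total distance of all points to their nearest medoid, and the swap is kept only when it lowers that cost:

```python
import numpy as np

def total_cost(X, medoid_idx):
    # cost = sum over all points of the distance to the nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        best_cost, improved = total_cost(X, medoids), False
        for i in range(k):                 # each selected medoid i
            for h in range(len(X)):        # each non-selected object h
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = h
                c = total_cost(X, trial)
                if c < best_cost:          # i.e., the swapping cost TC_ih < 0
                    best_cost, medoids, improved = c, trial, True
        if not improved:
            break
    # final assignment: each object goes to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return medoids, d.argmin(axis=1)
```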

# Density-Based Clustering

  • Major features: 主要特点

    • Discover clusters of arbitrary shape
      发现任意形状的簇
    • Handle noise
      处理噪声
  • Several interesting studies:

    • DBSCAN: Ester, et al. (KDD’96)
    • GDBSCAN: Sander, et al. (KDD’98)
    • OPTICS: Ankerst, et al (SIGMOD’99).
    • DENCLUE: Hinneburg& D. Keim(KDD’98)
    • CLIQUE: Agrawal, et al. (SIGMOD’98)

# Concepts

Two global parameters
两个全局参数
  • $Eps$: Maximum radius or distance of the neighborhood
    邻域的最大半径或距离
  • $MinPts$: Minimum number of points in the neighborhood of that point
    这个点的邻域内的最小点数
Core Object 核心对象
its neighborhood has at least MinPts objects
它的邻域内至少有 MinPts 个对象
Border Object 边缘对象
an object that lies on the border of a cluster
该对象位于簇的边界上
Eps-Neighborhood Eps 邻域
$N_{Eps}(p) = \{\, q \in D \mid \operatorname{dist}(p,q) \le Eps \,\}$
圆圈内和圆圈边界上所有的点,称为该数据点的 Eps 邻域,用 $N_{Eps}(p)$ 表示

Eps is the maximum neighborhood distance, i.e., the radius.
Given a value for Eps, draw a circle of radius Eps centered at p; this circle is the neighborhood.
Then count how many points fall inside the circle or on its boundary.
If that count is at least the preset MinPts, the point p is a Core Object.
Whether a point is a Core Object is therefore determined by the two parameters Eps and MinPts.
In the figure above, assuming MinPts = 5, the Eps-neighborhood of p contains only 3 points, so p is not a Core Object.

Directly density-reachable 直接密度可达
A point $q$ is directly density-reachable from a point $p$ w.r.t. $Eps$, $MinPts$ if
$q$ 从点 $p$ 直接密度可达,需满足以下两个条件
  1. $q$ belongs to $N_{Eps}(p)$
    $q$ 在 $p$ 的 Eps 邻域中
  2. $p$ is a core object
    $p$ 是核心对象
Density-reachable 密度可达
A point $p$ is density-reachable from a point $q$ w.r.t. $Eps$, $MinPts$ if there is a chain of points $p_{1}, \dots, p_{n}$ with $p_{1} = q$, $p_{n} = p$ such that $p_{i+1}$ is directly density-reachable from $p_{i}$
如果存在一条链 $p_{1}, \dots, p_{n}$,其中 $p_{1} = q$、$p_{n} = p$,且每个 $p_{i+1}$ 都从 $p_{i}$ 直接密度可达,则点 $p$ 从点 $q$ (关于 $Eps$、$MinPts$) 密度可达

In other words, if we can find such a chain and break it into pieces, and every piece is a direct density-reachability step, then the starting point $q$ and the ending point $p$ are density-reachable.
Density-connected 密度相连
A point $p$ is density-connected to a point $q$ w.r.t. $Eps$, $MinPts$ if there is a point $o$ such that both $p$ and $q$ are density-reachable from $o$ w.r.t. $Eps$ and $MinPts$
假设有一个点 $o$ 在中间,点 $p$ 和 $q$ 分别从点 $o$ 密度可达,此时可以说点 $p$ 和 $q$ 密度相连

# Example (Eps, MinPts as parameters)

  • There are 5 points: $o$, $m$, $n$, $p$, $q$
  • Assume $o$ is the core object
  • Directly density-reachable: $m$ and $o$, $n$ and $o$,
    $m$ and $n$ are in $N_{Eps}(o)$, and $o$ is a core object
  • Density-reachable: $p$ and $o$, $q$ and $o$
    $o$ -> $m$ -> $p$: ($o$, $m$) and ($m$, $p$) are directly density-reachable
    $o$ -> $n$ -> $q$: ($o$, $n$) and ($n$, $q$) are directly density-reachable
  • Density-connected: $p$ and $q$
    There is a route: ($p$, $o$) and ($o$, $q$) are density-reachable

# DBSCAN

  • DBSCAN is a popular density based clustering method
    DBSCAN 是一种流行的基于密度的聚类方法

  • It relies on a density based notion of cluster:
    它依赖于以密度为基础的簇概念:
    A cluster is defined as a maximal set of density connected points
    簇被定义为最大的密度连接点集合

  • It can discover clusters of arbitrary shape
    它可以发现任意形状的簇

# Steps

  • Randomly select a point $p$
    随机选择一个点 $p$
  • Retrieve all points density-reachable from $p$ w.r.t. $Eps$ and $MinPts$
    从 $p$ 出发,检索所有关于 $Eps$ 和 $MinPts$ 密度可达的点
    • If $p$ is a core point, a cluster is formed.
      如果 $p$ 是一个核心点,那么就形成了一个簇。
    • If $p$ is a border point, no points are density-reachable from $p$ and DBSCAN visits the next point of the database.
      如果 $p$ 是一个边界点,那么从 $p$ 出发没有密度可达的点,DBSCAN 将访问数据库的下一个点。
  • Continue the process until all of the points have been processed.
    继续此过程,直到处理完所有点。
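
Below is a compact Python/NumPy sketch of this procedure (illustrative only; real implementations such as scikit-learn's DBSCAN use spatial indexes for the neighborhood queries rather than a full distance matrix):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)                 # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    # Eps-neighborhood of every point: N_Eps(p) = {q : dist(p, q) <= eps}
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    cluster_id = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:     # p is not a core object
            continue                        # stays noise unless reached later
        # p is a core object: collect every point density-reachable from p
        labels[p] = cluster_id
        seeds = list(neighbors[p])
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:             # unassigned or border point joins the cluster
                labels[q] = cluster_id
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:   # q is also a core object: expand further
                    seeds.extend(neighbors[q])
        cluster_id += 1
    return labels                           # cluster id per point, -1 for noise
```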

# DBSCAN vs CLARANS

CLARANS is an efficient medoid based clustering algorithm
CLARANS 是一种高效的基于 medoid 的聚类算法

The two parameters, $Eps$ and $MinPts$, need to be carefully tuned.
$Eps$ 和 $MinPts$ 这两个参数需要仔细调整。
Otherwise, results may be significantly different
否则,结果可能会显著不同

# K-Means vs DBSCAN

| K-Means | DBSCAN |
|---|---|
| Partitional Clustering 分区聚类 | Density Based Clustering 基于密度聚类 |
| Need to pre-define the value of K 需要预先定义 K 的值 | Do not need to pre-define the number of clusters, but need to pre-define $Eps$ and $MinPts$ 不需要预先定义簇的数量,但需要预先定义 $Eps$ 和 $MinPts$ |
| Sensitive to initial settings 对初始设置敏感 | Sensitive to initial settings 对初始设置敏感 |
| Sensitive to noise data 对噪音数据敏感 | Not sensitive to noise data 对噪音数据不敏感 |

# Hierarchical Algorithms 分层算法

  • Use distance matrix as clustering criteria
    使用距离矩阵作为聚类标准
    • does not require the No. of clusters as input, but needs a termination condition
      不需要簇个数作为输入,但需要一个终止条件

  • In K-Means, we need to use similarity or distance metrics to measure the distance between two objects
    在 K-Means 中,我们需要使用相似性或距离度量来度量两个对象之间的距离
  • In hierarchical clustering, we need to measure the distance between two clusters
    在层次聚类中,我们需要测量两个聚类之间的距离
  • It is more complicated, since there are multiple objects within a cluster
    它更复杂,因为一个簇中有多个对象

# Agglomerative Method 凝聚法

  • Each individual object is considered as a single cluster at the beginning
    每个单独的对象在开始时都被视为一个簇
  • Choose a way to represent the cluster, such as the mean (centroid)
    选择一种表示簇的方式,例如 “平均质心”
  • Iterate all clusters, find the two clusters with smallest distance, and merge them to a new cluster
    迭代所有簇,找到距离最小的两个簇,并将它们合并为一个新簇
  • Repeat the step above until all objects are grouped to a single cluster
    重复上述步骤,直到所有对象都分组到一个簇中

# Distance Between Clusters 簇间距离

Single-Linkage 单连接
Distance = distance between the two closest objects from the two clusters
= 距离两个簇最近的两个对象之间的距离
Complete-Linkage 完全连接
Distance = distance between two farthest objects from two clusters
= 距离两个簇最远的两个对象之间的距离
Ward’s Linkage 沃德连接
Distance = how much the sum of squares (i.e., within cluster distance) will increase when we merge them
= 当我们合并它们时,平方和(即簇内距离)将增加多少
UPGMA
Distance = average of the distances between every pair of objects from the two clusters
= 两个簇中每两个对象距离的平均距离
Centroid Method 质心法
Distance = distance between the centroids of the two clusters
= 两个簇的质心之间的距离
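
A small Python/NumPy sketch (names mine; Ward's linkage omitted for brevity) of four of these cluster-to-cluster distances, given two clusters as arrays of row vectors:

```python
import numpy as np

def cluster_distances(A, B):
    """Distances between clusters A and B (rows are objects), Euclidean inside."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # all pairwise distances
    return {
        "single-linkage":   d.min(),     # closest pair of objects
        "complete-linkage": d.max(),     # farthest pair of objects
        "UPGMA":            d.mean(),    # average over all pairs
        "centroid":         float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))),
    }

A = np.array([[3.0], [4.0]])
B = np.array([[7.0], [9.0]])
print(cluster_distances(A, B))   # centroid distance is 4.5, as in the example below
```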

# Example

  • HAC starts with unclustered data and performs successive pairwise joins among items (or previous clusters) to form larger ones
    HAC 从未聚集的数据开始,在项(或之前的簇)之间执行连续的成对连接,以形成更大的项
    • this results in a hierarchy of clusters which can be viewed as a dendrogram
      这就形成了一个簇的层次结构,可以看作是一个树状图
    • useful in pruning search in a clustered item set, or in browsing clustering results
      用于修剪聚集项集中的搜索,或浏览聚集结果

  • Given a list of numbers: 9, 13, 7, 3, 4
    给出一个数字列表

    • Build hierarchical clustering tree structure from bottom to the up
      建立自下而上的层次聚类树结构
    • Use the mean as the representative of each cluster, Use centroid method to merge clusters
      用平均值作为每个簇的代表,用质心法合并簇
    • Use Manhattan distance as metric
      使用曼哈顿距离作为度量方法
  • At the beginning, each number is an individual cluster
    一开始,每个数字都是一个单独的簇

    [3], [4], [7], [9], [13]

  • We calculate the distance between every two centroids
    计算每两个质心之间的距离

    | | [3] | [4] | [7] | [9] | [13] |
    |---|---|---|---|---|---|
    | [3] | 0 | | | | |
    | [4] | 1 | 0 | | | |
    | [7] | 4 | 3 | 0 | | |
    | [9] | 6 | 5 | 2 | 0 | |
    | [13] | 10 | 9 | 6 | 4 | 0 |
  • And merge the two clusters with smallest distance
    以最小距离合并两个簇

    [3, 4] = 3.5, [7], [9], [13]

  • Right now, we only have 4 clusters, re-calculate centroids
    现在,我们只有 4 个簇,重新计算质心

  • Next: calculate the distance between remaining clusters
    下一步:计算剩余簇之间的距离

    | | [3, 4] = 3.5 | [7] | [9] | [13] |
    |---|---|---|---|---|
    | [3, 4] = 3.5 | 0 | | | |
    | [7] | 3.5 | 0 | | |
    | [9] | 5.5 | 2 | 0 | |
    | [13] | 9.5 | 6 | 4 | 0 |
  • Merge the two clusters with smallest distance
    以最小距离合并两个簇

    [3, 4] = 3.5, [7, 9] = 8, [13]

  • Right now, we only have 3 clusters, re-calculate centroids
    现在,我们只有 3 个簇,重新计算质心

  • Next: calculate the distance between remaining clusters
    下一步:计算剩余簇之间的距离

    | | [3, 4] = 3.5 | [7, 9] = 8 | [13] |
    |---|---|---|---|
    | [3, 4] = 3.5 | 0 | | |
    | [7, 9] = 8 | 4.5 | 0 | |
    | [13] | 9.5 | 5 | 0 |
  • Merge the two clusters with smallest distance
    以最小距离合并两个簇

  • Right now, we only have 2 clusters, re-calculate centroids
    现在,我们只有 2 个簇,重新计算质心

  • Next: calculate the distance between remaining clusters
    下一步:计算剩余簇之间的距离

  • Repeat until all the members are put into a single cluster
    重复此操作,直到所有成员都放入一个簇
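
A short Python sketch (illustrative) that replays this bottom-up process with the centroid method on the same list of numbers; in one dimension the Manhattan distance between centroids is just the absolute difference:

```python
import numpy as np

# Each cluster is (members, centroid); start with every number as its own cluster
points = [9, 13, 7, 3, 4]
clusters = [([p], float(p)) for p in points]

while len(clusters) > 1:
    # find the pair of clusters whose centroids are closest (1-D Manhattan distance)
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = abs(clusters[i][1] - clusters[j][1])
            if best is None or d < best[0]:
                best = (d, i, j)
    d, i, j = best
    merged = clusters[i][0] + clusters[j][0]
    print(f"merge {clusters[i][0]} + {clusters[j][0]} at distance {d}")
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
    clusters.append((merged, float(np.mean(merged))))   # centroid = mean of members
```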

# Extensions from the Example

  • Each object is a one-dimension data point in our previous example
    在前面的示例中,每个对象都是一维数据点
  • In general, each object is a multi-dimension vector
    一般来说,每个对象都是一个多维向量
  • The hierarchical clustering process is still the same
    分层聚类过程仍然是一样的

# How useful is it?

  • The hierarchical clustering tree can tell the inner structure or relationships, such as parent/children, category/subcategory.
    层次聚类树可以判断内部结构或关系,例如父 / 子、类别 / 子类别。
    You need to look into the objects after constructing such a hierarchical clustering tree.
    在构建这样的层次聚类树之后,您需要查看对象。

  • Hierarchical clustering results can also be used to create partitional clusters.
    层次聚类结果也可用于创建分区聚类。
    You just need to find the appropriate number of the clusters from the top to the bottom levels
    您只需要从上到下找到适当数量的簇

    • You can still use SSE to find the best number of clusters
      您仍然可以使用 SSE 来找到最佳数量的集群
    • But again, the evaluation process is still the same
      但评估过程仍然是一样的

# From Hierarchical to Partitional Clustering 从层次聚类到分区聚类

  • The dendrogram tells us the underlying structure of the data
    树状图告诉我们数据的基本结构
  • We can utilize dendrogram to produce partitional clusters
    我们可以利用树状图产生分区簇
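
For example, with SciPy one can build the dendrogram and then cut it into a chosen number of flat (partitional) clusters; this is a hedged sketch, not part of the original notes:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([9, 13, 7, 3, 4], dtype=float).reshape(-1, 1)
Z = linkage(data, method="centroid")              # bottom-up merges (the dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 flat clusters
print(labels)   # 9 and 7 share a label, 3 and 4 share a label, 13 is its own cluster
```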