# Unsupervised learning 无监督学习

Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data.

Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution.

This distinguishes unsupervised learning from supervised learning and reinforcement learning.

• Clustering
• Association Rule Mining
• Principal Component Analysis
• etc...

# How to evaluate unsupervised learning 如何评价无监督学习

• Usually, there is no standard metric for evaluation
• But there are two practical approaches
1. Inspect the outputs manually, then analyze and interpret them to see whether the groups differ significantly and whether those differences are useful
2. Use the outputs of unsupervised learning as inputs to a supervised learning process, and check whether the supervised learning results improve

# Clustering 聚类

Excerpt from "Machine Learning in Action" (机器学习实战): clustering

Partitional Clustering
groups objects so as to minimize intra-cluster distances and maximize inter-cluster distances

Example: Document Clustering

Density-Based Clustering
clusters objects based on local connectivity and density functions

Each cluster has a considerably higher density of points than the area outside the cluster

Hierarchical Clustering
a clustering process that discovers hierarchical structure, such as a hierarchical tree

Example: categories and subcategories; taxonomies

# Partitional Clustering 分区聚类

Partitional Clustering
an unsupervised way to group objects

• Goal:
Find groups of objects in the data such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups

• The notion of a cluster can be ambiguous

• Basic idea

• Measure the similarity or distance between every pair of objects
• Group the objects based on these similarities

• Distance or Similarity Measures 距离或相似性度量
Common Distance Measures:
常见的距离度量方法
1. Manhattan distance:

$X=\left\langle x_{1}, x_{2}, \cdots, x_{n}\right\rangle$

$Y=\left\langle y_{1}, y_{2}, \cdots, y_{n}\right\rangle$

$\operatorname{dist}(X, Y)=\left|x_{1}-y_{1}\right|+\left|x_{2}-y_{2}\right|+\cdots+\left|x_{n}-y_{n}\right|$

2. Euclidean distance:

$\operatorname{dist}(X, Y)=\sqrt{\left(x_{1}-y_{1}\right)^{2}+\cdots+\left(x_{n}-y_{n}\right)^{2}}$

3. Cosine distance:

$\operatorname{dist}(X, Y)=1-\operatorname{sim}(X, Y)$

$\operatorname{sim}(X, Y)=\frac{\sum_{i}\left(x_{i} \times y_{i}\right)}{\sqrt{\sum_{i} x_{i}^{2} \times \sum_{i} y_{i}^{2}}}$
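The three distance measures above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names are hypothetical, and the vectors are assumed to be equal-length lists of numbers with non-zero norms.

```python
import math

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    # Dot product divided by the product of the two norms.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def cosine_distance(x, y):
    # dist(X, Y) = 1 - sim(X, Y)
    return 1.0 - cosine_similarity(x, y)

X = [1, 2]
Y = [4, 6]
print(manhattan(X, Y), euclidean(X, Y), cosine_distance(X, Y))
```

Note that the cosine distance ignores vector length: two vectors pointing in the same direction have distance 0 even if their magnitudes differ.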

# K-Means Clustering Algorithm K - 均值聚类算法

Excerpt from "Machine Learning in Action" (机器学习实战): the K-Means clustering algorithm

K-Means is an algorithm that finds $k$ clusters in a given data set.

• Pros: easy to implement.
• Cons: may converge to a local minimum; converges slowly on large data sets.
• Works with: numerical data.
• Assume we have many examples (instances), each represented by a vector of features, where the features must be numerical, e.g., weight, size, price, profit, etc.

• We can then use a distance measure to calculate the similarity or dissimilarity (i.e., distance) between every pair of examples.

• With this setting, we can apply a K-Means clustering algorithm to perform the usual clustering task.

# Steps

1. Init: choose K and create K initial clusters
There are multiple ways to define the initial clusters:
• You can randomly choose K instances, each of which forms an individual cluster;
• Or you can randomly assign all or some of your instances to K groups, each of which forms an individual cluster.
2. Step 1. Calculate the centroids of the K clusters
3. Step 2. Assign each data point to a cluster based on the distance between the point and the centroids
4. Step 3. Obtain the K new clusters and compare them with the previous clusters
5. Step 4. Repeat Steps 1, 2, and 3 until convergence (i.e., no points move between clusters)
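Excerpt from "Machine Learning in Action" (机器学习实战): the K-Means workflow

Create k points as the initial centroids (often chosen at random)

For each data point in the data set
    For each centroid
        Compute the distance between the centroid and the data point
    Assign the data point to its nearest cluster
For each cluster, compute the mean of all points in the cluster and use that mean as the centroid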
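The steps above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: it uses Euclidean distance, picks K random instances as the initial centroids, and stops either when no point changes cluster or after a hypothetical `max_iter` iterations. The names `k_means`, `max_iter`, and `seed` are illustrative only.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_point(points):
    # Column-wise mean of a non-empty list of points.
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Init: K randomly chosen instances act as the initial centroids.
    centroids = [list(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):              # stopping criterion 2
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in data
        ]
        if new_labels == labels:           # stopping criterion 1
            break
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:                    # assumption: keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return labels, centroids

data = [[1.0], [2.0], [3.0], [101.0], [102.0], [103.0]]
labels, centroids = k_means(data, k=2)
print(labels, centroids)
```

Different seeds may number the clusters differently, and on less well-separated data they may also lead to different final clusters, which is the sensitivity to initialization that K-Means is known for.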


# Stopping Criterion in Iterative learning 迭代学习中的停止准则

• We need to stop the learning iterations once the process has converged
• How do we determine that it has converged?
• Criterion 1: new clusters = old clusters
stop learning when the clusters no longer change
• Criterion 2: set a maximum number of learning iterations
stop learning when the maximum number of iterations is reached
• In practice, we usually use the second criterion, since clustering may need many iterations, or an unpredictable number of them, to converge, especially when the data set is large

# Example

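| | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| D1 | 0 | 3 | 3 | 0 | 2 |
| D2 | 4 | 1 | 0 | 1 | 2 |
| D3 | 0 | 4 | 0 | 0 | 2 |
| D4 | 0 | 3 | 0 | 3 | 3 |
| D5 | 0 | 1 | 3 | 0 | 1 |
| D6 | 2 | 2 | 0 | 0 | 4 |
| D7 | 1 | 0 | 3 | 2 | 0 |
| D8 | 3 | 1 | 0 | 0 | 2 |
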
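1. Init: initialize K and K clusters

• We plan to create 3 groups, i.e., set K = 3. This does not mean that K = 3 is the best number of clusters; other values still need to be tried.
• Create the initial clusters
The split is completely random: samples are placed into three groups (C1 to C3), and D7 and D8 start out unassigned
Initial (arbitrary) assignment:
C1 = {D1,D2},
C2 = {D3,D4},
C3 = {D5,D6}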

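2. Step 1. Calculate centroids for K clusters

Take the mean of each column over the members of a cluster; this center of all points in the cluster is the centroid

| | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| D1 | 0 | 3 | 3 | 0 | 2 |
| D2 | 4 | 1 | 0 | 1 | 2 |
| D3 | 0 | 4 | 0 | 0 | 2 |
| D4 | 0 | 3 | 0 | 3 | 3 |
| D5 | 0 | 1 | 3 | 0 | 1 |
| D6 | 2 | 2 | 0 | 0 | 4 |
| D7 | 1 | 0 | 3 | 2 | 0 |
| D8 | 3 | 1 | 0 | 0 | 2 |
| C1 | 4/2 | 4/2 | 3/2 | 1/2 | 4/2 |
| C2 | 0/2 | 7/2 | 0/2 | 3/2 | 5/2 |
| C3 | 2/2 | 3/2 | 3/2 | 0/2 | 5/2 |
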
3. Step 2. Assign data points to each cluster based on the distance between data and centroids

Now compute the similarity (or distance) of each item to each cluster, resulting in a cluster-document similarity matrix.
Here we use the dot product as the similarity measure for simplicity.

Compute the distance (or similarity) between each centroid and each data point:
• with a distance, put each data point into the cluster at the smallest distance
• with a similarity, put each data point into the cluster with the greatest similarity
This example uses similarity.

dot product & Cosine similarity

Recall that the Cosine similarity of two vectors is their dot product divided by the product of their norms.

For example, consider the two vectors $X$ and $Y$:

$X=\langle 3, 0, 1, 2, 0, 3\rangle$

$Y=\langle 2, 0, 0, 3, 8, 4\rangle$

The dot product is the sum of the coordinate-wise products:

\begin{align} \operatorname{dot-product}(X, Y) & = 3 \times 2+0 \times 0+1 \times 0+2 \times 3+0 \times 8+3 \times 4 \\ & = 6+0+0+6+0+12 \\ & = 24 \end{align}

The norm of each vector is the square root of the sum of the squares of its components.
So, the norms of $X$ and $Y$ are (zero components omitted):

$\operatorname{norm}(X)=\sqrt{3^{2}+1^{2}+2^{2}+3^{2}}=\sqrt{23} \approx 4.8$

$\operatorname{norm}(Y)=\sqrt{2^{2}+3^{2}+8^{2}+4^{2}}=\sqrt{93} \approx 9.6$

and the Cosine similarity of $X$ and $Y$ is given by:

$\operatorname{sim}(X, Y)=\frac{\operatorname{dot-product}(X, Y)}{\operatorname{norm}(X) \times \operatorname{norm}(Y)}=\frac{24}{4.8 \times 9.6} \approx 0.52$

| | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
|---|---|---|---|---|---|---|---|---|
| C1 | 29/2 | **29/2** | 24/2 | 27/2 | 17/2 | 32/2 | **15/2** | **24/2** |
| C2 | **31/2** | 20/2 | **38/2** | **45/2** | 12/2 | **34/2** | 6/2 | 17/2 |
| C3 | 28/2 | 21/2 | 22/2 | 24/2 | **17/2** | 30/2 | 11/2 | 19/2 |

For each document, reallocate the document to the cluster to which it has the highest similarity (shown in bold in the table above). D5 ties between C1 and C3 (17/2 each); here it stays in C3.
After the reallocation we have the following new clusters.

New assignment:
C1 = {D2,D7,D8},
C2 = {D1,D3,D4,D6},
C3 = {D5}

Note that the previously unassigned D7 and D8 have now been assigned, and that D1 and D6 have been moved out of their original clusters.

This is the end of the first iteration (i.e., the first reallocation).

Next, we repeat the process for another reallocation…
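The first iteration of this example can be checked with a short Python sketch. It recomputes the initial centroids and the dot-product similarities, then reallocates each document. The tie-breaking rule (a document keeps its current cluster when that cluster is among the best) is an assumption made here so that D5, which ties between C1 and C3, stays in C3 as in the example; the variable names are illustrative.

```python
docs = {
    "D1": [0, 3, 3, 0, 2], "D2": [4, 1, 0, 1, 2],
    "D3": [0, 4, 0, 0, 2], "D4": [0, 3, 0, 3, 3],
    "D5": [0, 1, 3, 0, 1], "D6": [2, 2, 0, 0, 4],
    "D7": [1, 0, 3, 2, 0], "D8": [3, 1, 0, 0, 2],
}
# Initial (arbitrary) assignment; D7 and D8 start unassigned.
clusters = {"C1": ["D1", "D2"], "C2": ["D3", "D4"], "C3": ["D5", "D6"]}

def centroid(members):
    rows = [docs[d] for d in members]
    return [sum(col) / len(rows) for col in zip(*rows)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

centroids = {c: centroid(m) for c, m in clusters.items()}
current = {d: c for c, members in clusters.items() for d in members}

new_clusters = {c: [] for c in clusters}
for d, vec in docs.items():
    sims = {c: dot(vec, centroids[c]) for c in centroids}
    best = max(sims.values())
    tied = [c for c in sims if sims[c] == best]
    # Assumed tie-break: stay in the current cluster if it is among the best.
    choice = current[d] if current.get(d) in tied else tied[0]
    new_clusters[choice].append(d)

print(centroids)
print(new_clusters)
```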

4. Step 3. Get the new K clusters and compare them with the previous clusters

Now compute the new cluster centroids using the original document-term matrix

| | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| D1 | 0 | 3 | 3 | 0 | 2 |
| D2 | 4 | 1 | 0 | 1 | 2 |
| D3 | 0 | 4 | 0 | 0 | 2 |
| D4 | 0 | 3 | 0 | 3 | 3 |
| D5 | 0 | 1 | 3 | 0 | 1 |
| D6 | 2 | 2 | 0 | 0 | 4 |
| D7 | 1 | 0 | 3 | 2 | 0 |
| D8 | 3 | 1 | 0 | 0 | 2 |
| C1 | 8/3 | 2/3 | 3/3 | 3/3 | 4/3 |
| C2 | 2/4 | 12/4 | 3/4 | 3/4 | 11/4 |
| C3 | 0/1 | 1/1 | 3/1 | 0/1 | 1/1 |

This leads to a new cluster-document similarity matrix, computed in the same way as before.

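| | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
|---|---|---|---|---|---|---|---|---|
| C1 | 7.67 | 15.01 | 5.34 | 9.00 | 5.00 | 12.00 | 7.67 | 11.34 |
| C2 | 16.75 | 11.25 | 17.50 | 19.50 | 8.00 | 6.68 | 4.25 | 10.00 |
| C3 | 14.00 | 3.00 | 6.00 | 6.00 | 11.00 | 9.34 | 9.00 | 3.00 |

Again, the items are reallocated to the clusters with the highest similarity.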

New assignment:
C1 = {D2,D6,D8},
C2 = {D1,D3,D4},
C3 = {D5,D7}


Note: This process is now repeated with new clusters.

However, the next iteration in this example will show no change to the clusters, thus terminating the algorithm.

# Evaluations 评价

• There is no single clear-cut evaluation:
a clustering is good as long as it serves your usage or application

• The most common measure is the Sum of Squared Errors (SSE)

• For each point, the error is its distance to the nearest cluster centroid

• To get the SSE, we square these errors and sum them.

$SSE=\sum_{i=1}^{K} \sum_{x \in C_{i}} \operatorname{dist}^{2}\left(m_{i}, x\right)$

• $x$ is a data point in cluster $C_{i}$ and $m_{i}$ is the representative point of cluster $C_{i}$
It can be shown that $m_{i}$ corresponds to the center (mean) of the cluster

• Drawback: SSE can always be lowered by increasing K

• SSE measures how well the clustering process has done (how compact the clusters are). It cannot tell how good the clustering results are, so it is not a metric for evaluating the results themselves

• SSE can also be used to find the best K value
Try K = 3, 5, 7, 10, 13, 20, etc.
Because SSE tends to fall as K grows, look for the K value beyond which SSE stops dropping noticeably

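Excerpt from "Machine Learning in Action" (机器学习实战): measuring clustering quality with SSE

K-Means may converge and still cluster poorly because it has converged to a local minimum rather than the global minimum (a local minimum is a result that is acceptable but not the best; the global minimum is the best possible result).

A smaller SSE means the data points are closer to their centroids, and the clustering is better.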
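As a rough illustration of the SSE definition, the sketch below computes SSE for a clustering given as a list of clusters, each a list of points. It assumes squared Euclidean distance and uses the cluster mean as the representative point $m_i$; the function name `sse` is hypothetical.

```python
def sse(clusters):
    total = 0.0
    for points in clusters:
        # Representative point m_i: the mean of the cluster.
        center = [sum(col) / len(points) for col in zip(*points)]
        for p in points:
            # Squared Euclidean distance from the point to its cluster mean.
            total += sum((a - b) ** 2 for a, b in zip(p, center))
    return total

two_clusters = [[[1, 1], [3, 1]], [[10, 10]]]
three_clusters = [[[1, 1]], [[3, 1]], [[10, 10]]]
print(sse(two_clusters))    # each of the two nearby points is 1 away from (2, 1)
print(sse(three_clusters))  # every point is its own cluster
```

The second call shows the drawback of SSE: giving every point its own cluster drives SSE to zero without producing a useful clustering.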

# How to evaluate the clustering results?

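Solution 1
compare the clusters by their centroids and identify the significant differences among them, to better understand why the members of each cluster were put together

| Centroid | Gender | GPA | Study Hours | Courses Completed |
|---|---|---|---|---|
| C1 | 1 | 2.5 | 20 | 10 |
| C2 | 0.6 | 4.0 | 40 | 3 |
| C3 | 0 | 3.0 | 25 | 11 |

Solution 2
add the clustering results to a supervised learning process to see whether they improve supervised learning

| Student | Gender | GPA | Study Hours | Courses Completed | TA? |
|---|---|---|---|---|---|
| S1 | 1 | 2.5 | 20 | 10 | N |
| S2 | 0 | 4.0 | 40 | 3 | Y |
| S3 | 0 | 3.0 | 25 | 11 | Y |

| Student | Gender | GPA | Study Hours | Courses Completed | TA? | Cluster |
|---|---|---|---|---|---|---|
| S1 | 1 | 2.5 | 20 | 10 | N | c1 |
| S2 | 0 | 4.0 | 40 | 3 | Y | c2 |
| S3 | 0 | 3.0 | 25 | 11 | Y | c2 |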

# Pros and Cons

Strengths of K-Means
• Relatively efficient: $O(tkn)$, where $n$ is the number of objects, $k$ is the number of clusters, and $t$ is the number of iterations.
Normally, $k, t \ll n$
Weaknesses of K-Means
• Often terminates at a local optimum rather than the global one
• Applies only when a mean is defined; what about categorical data?
• Performance is sensitive to initialization, e.g., K, the initial clusters, and the definition of centroids
• Need to specify K, the number of clusters, in advance
• Unable to handle noisy data and outliers
Variations of K-Means usually differ in:
• Selection of the initial K means
• Dissimilarity calculations
• Strategies to calculate cluster means

「机器学习实战」摘录 - 对聚类结果进行改进

# Pre-processing 预处理

• Normalize the data 标准化数据
• Eliminate outliers 消除异常值

# Post-processing 后处理

• Eliminate small clusters that may represent outliers
消除可能代表异常值的小簇
• Split 'loose' clusters, i.e., clusters with relatively high SSE
分割 “松散” 簇，即具有相对较高 SSE 的簇
• Merge clusters that are ‘close’ and that have relatively low SSE
合并 “接近” 且 SSE 相对较低的簇

# Variations of K-Means Clustering K - 均值聚类的变异

• K-Means Clustering: centroid is defined as means
k - 均值聚类：质心被定义为均值
• K-Median Clustering: centroid is defined as medians
k - 中位数聚类：质心被定义为中位数
• K-Medoids Clustering: medoids as centroid
K - 中心点聚类：以中心点为质心
• X-Means Clustering: figure out a way to find best K
X - 均值聚类：找出找到最佳 K 的方法
• Fuzzy C-Means Clustering: fuzzy degree as confidence
模糊 C 均值聚类：模糊度作为置信度
• Many more…

# K-Medoids Clustering

K-Medoids Clustering
is one of the partitional clustering approaches

Medoids serve as centroids

A medoid
is defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal.
In other words, a medoid is the most centrally located point in the cluster.
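The medoid definition translates almost directly into code. The sketch below is illustrative only: `medoid` is a hypothetical helper that takes a list of objects and a dissimilarity function, and it minimizes the total dissimilarity, which is equivalent to minimizing the average.

```python
def medoid(points, dist):
    # The medoid is the member with the smallest total (hence average)
    # dissimilarity to all members of the cluster.
    return min(points, key=lambda c: sum(dist(c, p) for p in points))

values = [1, 2, 3, 4, 100]
m = medoid(values, lambda a, b: abs(a - b))
mean = sum(values) / len(values)
print(m, mean)
```

With the outlier 100 in the cluster, the mean is pulled to 22 while the medoid stays at 3, an actual member of the cluster. This is why medoids are less affected by outliers than means.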

PAM (Partitioning Around Medoids; Kaufman and Rousseeuw, 1987), built into S-Plus

• Uses real objects to represent the clusters

1. Select $k$ representative objects arbitrarily
2. For each pair of non-selected object $h$ and selected object $i$, calculate the total swapping cost $TC_{ih}$
3. For each pair of $i$ and $h$,
• If $TC_{ih} < 0$, $i$ is replaced by $h$
• Then assign each non-selected object to the most similar representative object
4. Repeat steps 2-3 until there is no change
• Pros and Cons

• The centroid is defined as the medoid, the most centrally located object in a cluster
• To some extent, this reduces the influence of outliers
• But the approach does not scale: it is time-consuming on large data sets
• Still sensitive to K, initialization, etc.

# Density-Based Clustering

• Major features: 主要特点

• Discover clusters of arbitrary shape
发现任意形状的簇
• Handle noise
处理噪声
• Several interesting studies:

• DBSCAN: Ester, et al. (KDD’96)
• GDBSCAN: Sander, et al. (KDD’98)
• OPTICS: Ankerst, et al (SIGMOD’99).
• DENCLUE: Hinneburg& D. Keim(KDD’98)
• CLIQUE: Agrawal, et al. (SIGMOD’98)

# Concepts

Two global parameters

• $Eps$: maximum radius (distance) of the neighborhood
• $MinPts$: minimum number of points in the $Eps$-neighborhood of a point
Core Object
an object whose $Eps$-neighborhood contains at least $MinPts$ objects

Border Object
an object on the border of a cluster: not a core object itself, but inside the $Eps$-neighborhood of a core object

Eps-Neighborhood
$N_{Eps}(p)=\{q \in D \mid \operatorname{dist}(p, q) \le Eps\}$

$Eps$ is the maximum neighborhood distance, i.e., the radius

Directly density-reachable
A point $q$ is directly density-reachable from a point $p$ w.r.t. $Eps$, $MinPts$ if
1. $q$ belongs to $N_{Eps}(p)$
2. $p$ is a core object
Density-reachable
A point $p$ is density-reachable from a point $q$ w.r.t. $Eps$, $MinPts$ if there is a chain of points $p_{1}, \dots , p_{n}$ with $p_{1} = q$ and $p_{n} = p$ such that each $p_{i+1}$ is directly density-reachable from $p_{i}$

Density-connected
A point $p$ is density-connected to a point $q$ w.r.t. $Eps$, $MinPts$ if there is a point $o$ such that both $p$ and $q$ are density-reachable from $o$ w.r.t. $Eps$ and $MinPts$
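To make these definitions concrete, here is a small sketch on one-dimensional data, assuming absolute difference as the distance. The helper names are hypothetical. It also shows that direct density-reachability is not symmetric: a border point is reachable from a core point, but not the other way around.

```python
def eps_neighborhood(data, p, eps):
    # All points within distance eps of p (p itself included).
    return [q for q in data if abs(p - q) <= eps]

def is_core(data, p, eps, min_pts):
    return len(eps_neighborhood(data, p, eps)) >= min_pts

def directly_density_reachable(data, q, p, eps, min_pts):
    # q is directly density-reachable from p if q lies in the
    # Eps-neighborhood of p and p is a core object.
    return q in eps_neighborhood(data, p, eps) and is_core(data, p, eps, min_pts)

data = [1, 2, 3, 6]
eps, min_pts = 1, 3

print(is_core(data, 2, eps, min_pts))                        # neighborhood of 2 is {1, 2, 3}
print(directly_density_reachable(data, 1, 2, eps, min_pts))  # 1 from core point 2
print(directly_density_reachable(data, 2, 1, eps, min_pts))  # 2 from non-core point 1
```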

# Example (Eps, MinPts as parameters)

• There are 5 points: $o$, $m$, $n$, $p$, $q$
• Assume $o$ is a core object
• Directly density-reachable: $m$ from $o$, and $n$ from $o$,
since $m$ and $n$ are in $N_{Eps}(o)$ and $o$ is a core object
• Density-reachable: $p$ from $o$, and $q$ from $o$
$o$ -> $m$ -> $p$: $m$ is directly density-reachable from $o$, and $p$ from $m$
$o$ -> $n$ -> $q$: $n$ is directly density-reachable from $o$, and $q$ from $n$
• Density-connected: $p$ and $q$
Both $p$ and $q$ are density-reachable from $o$

# DBSCAN

• DBSCAN is a popular density based clustering method
DBSCAN 是一种流行的基于密度的聚类方法

• It relies on a density based notion of cluster:
它依赖于以密度为基础的簇概念:
A cluster is defined as a maximal set of density connected points
簇被定义为最大的密度连接点集合

• It can discover clusters of arbitrary shape
它可以发现任意形状的簇

# Steps

• Randomly select a point $p$
• Retrieve all points density-reachable from $p$ w.r.t. $Eps$ and $MinPts$
• If $p$ is a core point, a cluster is formed.
• If $p$ is a border point, no points are density-reachable from $p$, and DBSCAN visits the next point of the database.
• Continue the process until all of the points have been processed.
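A compact sketch of this procedure is shown below. It is a simplified illustration, not the original DBSCAN implementation: points are visited in index order instead of randomly, neighborhoods are found by a brute-force scan, and the names `dbscan`, `NOISE`, and `region` are hypothetical. Points that are not density-reachable from any core point end up labeled as noise.

```python
NOISE = -1

def dbscan(data, eps, min_pts, dist):
    labels = [None] * len(data)          # None means "not visited yet"

    def region(i):
        # Indices of all points in the Eps-neighborhood of point i.
        return [j for j in range(len(data)) if dist(data[i], data[j]) <= eps]

    cluster_id = 0
    for i in range(len(data)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:     # not a core point: noise for now
            labels[i] = NOISE
            continue
        labels[i] = cluster_id           # core point: start a new cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:       # previously noise: becomes a border point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:   # expand only through core points
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels

points = [1, 2, 3, 10, 11, 12, 50]
print(dbscan(points, eps=1.5, min_pts=3, dist=lambda a, b: abs(a - b)))
```

In this run the isolated point 50 is reported as noise, and the number of clusters is discovered rather than specified in advance.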

# DBSCAN vs CLARANS

CLARANS is an efficient medoid-based clustering algorithm

DBSCAN's two parameters, $Eps$ and $MinPts$, need to be tuned carefully.
Otherwise, the results may differ significantly

# K-Means vs DBSCAN

| K-Means | DBSCAN |
|---|---|
| Partitional clustering | Density-based clustering |
| Need to pre-define the value of K | No need to pre-define the number of clusters, but need to pre-define $Eps$ and $MinPts$ |
| Sensitive to initial settings | Sensitive to initial settings |
| Sensitive to noisy data | Not sensitive to noisy data |

# Hierarchical Algorithms 分层算法

• Use distance matrix as clustering criteria
使用距离矩阵作为聚类标准
• does not require the No. of clusters as input, but needs a termination condition
不需要簇个数作为输入，但需要一个终止条件

• In K-Means, we need to use similarity or distance metrics to measure the distance between two objects
在 K-Means 中，我们需要使用相似性或距离度量来度量两个对象之间的距离
• In hierarchical clustering, we need to measure the distance between two clusters
在层次聚类中，我们需要测量两个聚类之间的距离
• It is more complicated, since there are multiple objects within a cluster
它更复杂，因为一个簇中有多个对象

# Agglomerative Method 凝聚法

• At the beginning, each individual object is considered a cluster of its own
• Choose a way to represent a cluster, such as its mean (centroid)
• Iterate over all clusters, find the two clusters with the smallest distance, and merge them into a new cluster
• Repeat the step above until all objects are grouped into a single cluster

# Distance Between Clusters 簇间距离

Single Linkage
Distance = distance between the two closest objects from the two clusters
Complete Linkage
Distance = distance between the two farthest objects from the two clusters
Ward's Method
Distance = how much the sum of squares (i.e., within-cluster distance) will increase when we merge them
UPGMA (Average Linkage)
Distance = average of the distances between every pair of objects, one from each cluster
Centroid Method
Distance = distance between the centroids of the two clusters
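The cluster-to-cluster distances listed here can be sketched as follows, assuming Euclidean distance between objects and clusters given as lists of numeric vectors. The function names are illustrative. For the sum-of-squares increase, the sketch uses the standard closed form $\frac{|A||B|}{|A|+|B|}\lVert \bar{A}-\bar{B}\rVert^{2}$ instead of recomputing the sums of squares before and after the merge.

```python
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def single_link(a, b):
    # Distance between the two closest objects.
    return min(euclidean(x, y) for x, y in product(a, b))

def complete_link(a, b):
    # Distance between the two farthest objects.
    return max(euclidean(x, y) for x, y in product(a, b))

def average_link(a, b):
    # Average over all cross-cluster pairs.
    return sum(euclidean(x, y) for x, y in product(a, b)) / (len(a) * len(b))

def centroid_distance(a, b):
    return euclidean(centroid(a), centroid(b))

def sum_of_squares_increase(a, b):
    # Increase in within-cluster sum of squares caused by merging a and b.
    return len(a) * len(b) / (len(a) + len(b)) * centroid_distance(a, b) ** 2

A = [[3.0], [4.0]]
B = [[7.0], [9.0]]
for f in (single_link, complete_link, average_link,
          centroid_distance, sum_of_squares_increase):
    print(f.__name__, f(A, B))
```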

# Example

• HAC starts with unclustered data and performs successive pairwise joins among items (or previous clusters) to form larger ones
HAC 从未聚集的数据开始，在项（或之前的簇）之间执行连续的成对连接，以形成更大的项
• this results in a hierarchy of clusters which can be viewed as a dendrogram
这就形成了一个簇的层次结构，可以看作是一个树状图
• useful in pruning search in a clustered item set, or in browsing clustering results
用于修剪聚集项集中的搜索，或浏览聚集结果

• Given a list of numbers: 9, 13, 7, 3, 4
给出一个数字列表

• Build hierarchical clustering tree structure from bottom to the up
建立自下而上的层次聚类树结构
• Use the mean as the representative of each cluster, Use centroid method to merge clusters
用平均值作为每个簇的代表，用质心法合并簇
• Use Manhattan distance as metric
使用曼哈顿距离作为度量方法
• At the beginning, each number is an individual cluster
一开始，每个数字都是一个单独的簇

[3][4][7][9][13]

• We calculate the distance between every two centroids
计算每两个质心之间的距离

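| | [3] | [4] | [7] | [9] | [13] |
|---|---|---|---|---|---|
| [3] | 0 | | | | |
| [4] | 1 | 0 | | | |
| [7] | 4 | 3 | 0 | | |
| [9] | 6 | 5 | 2 | 0 | |
| [13] | 10 | 9 | 6 | 4 | 0 |
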
• And merge the two clusters with smallest distance
以最小距离合并两个簇

[3, 4] = 3.5, [7], [9], [13]

• Right now, we only have 4 clusters, re-calculate centroids
现在，我们只有 4 个簇，重新计算质心

• Next: calculate the distance between remaining clusters
下一步：计算剩余簇之间的距离

| | [3, 4] = 3.5 | [7] | [9] | [13] |
|---|---|---|---|---|
| [3, 4] = 3.5 | 0 | | | |
| [7] | 3.5 | 0 | | |
| [9] | 5.5 | 2 | 0 | |
| [13] | 9.5 | 6 | 4 | 0 |

• Merge the two clusters with smallest distance
以最小距离合并两个簇

[3, 4] = 3.5, [7, 9] = 8, [13]

• Right now, we only have 3 clusters, re-calculate centroids
现在，我们只有 3 个簇，重新计算质心

• Next: calculate the distance between remaining clusters
下一步：计算剩余簇之间的距离

| | [3, 4] = 3.5 | [7, 9] = 8 | [13] |
|---|---|---|---|
| [3, 4] = 3.5 | 0 | | |
| [7, 9] = 8 | 4.5 | 0 | |
| [13] | 9.5 | 5 | 0 |

• Merge the two clusters with smallest distance
以最小距离合并两个簇

• Right now, we only have 2 clusters, re-calculate centroids
现在，我们只有 2 个簇，重新计算质心

• Next: calculate the distance between remaining clusters
下一步：计算剩余簇之间的距离

• Repeat until all the members are put into a single cluster
重复此操作，直到所有成员都放入一个簇
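The whole bottom-up process for this list of numbers can be traced with a short sketch. It assumes the centroid method with the mean as representative and absolute difference as the (one-dimensional Manhattan) distance, as in the example; `agglomerate` is a hypothetical name, and ties are broken by taking the first closest pair found.

```python
def agglomerate(values):
    # Start with every number as its own cluster.
    clusters = [[v] for v in sorted(values)]
    merges = []
    while len(clusters) > 1:
        centroids = [sum(c) / len(c) for c in clusters]
        best = None
        # Find the pair of clusters whose centroids are closest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(centroids[i] - centroids[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

for left, right, d in agglomerate([9, 13, 7, 3, 4]):
    print(left, "+", right, "at distance", d)
```

The merge order matches the tables above: [3] with [4], then [7] with [9], then [3, 4] with [7, 9] at centroid distance 4.5, and finally [13] joins the rest. Reading the merges from last to first gives the dendrogram from the root down.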

# Extensions from the Example

• Each object is a one-dimension data point in our previous example
在前面的示例中，每个对象都是一维数据点
• In general, each object is a multi-dimension vector
一般来说，每个对象都是一个多维向量
• The hierarchical clustering process is still the same
分层聚类过程仍然是一样的

# How useful is it?

• The hierarchical clustering tree can tell the inner structure or relationships, such as parent/children, category/subcategory.
层次聚类树可以判断内部结构或关系，例如父 / 子、类别 / 子类别。
You need to look into the objects after constructing such a hierarchical clustering tree.
在构建这样的层次聚类树之后，您需要查看对象。

• Hierarchical clustering results can also be used to create partitional clusters.
层次聚类结果也可用于创建分区聚类。
You just need to find the appropriate number of the clusters from the top to the bottom levels
您只需要从上到下找到适当数量的簇

• You can still use SSE to find the best number of clusters
您仍然可以使用 SSE 来找到最佳数量的集群
• But again, the evaluation process is still the same
但评估过程仍然是一样的

# From Hierarchical to Partitional Clustering 从层次聚类到分区聚类

• The dendrogram shows us the underlying structure of the data
• We can use the dendrogram to produce partitional clusters, by cutting it at the level that yields the desired number of clusters