A common clustering algorithm: k-means

Concept

K-means is an unsupervised learning algorithm and a classic clustering algorithm: the dataset carries no labels.

By contrast, KNN is a supervised classification algorithm, so its dataset does carry labels (there is a well-known knn code repository out there). A minimal sketch of the contrast follows.
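
To make the contrast concrete, here is a tiny KNN sketch (the toy data and the knn_predict helper are mine for illustration, not from any particular repository): KNN has nothing to vote on without the labels, while the k-means code later in this post never sees a label at all.

from collections import Counter

# toy labeled training set: KNN needs the labels, k-means does not
train = [([1, 1], "a"), ([1, 2], "a"), ([5, 5], "b"), ([5, 6], "b")]

def knn_predict(x, train, k=3):
    # vote among the k nearest labeled neighbors (squared Euclidean distance)
    nearest = sorted(train, key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict([4.5, 5.2], train))  # -> b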

The K-means procedure itself is very simple:

1. Randomly pick k points as the initial cluster centers;

2. In each iteration, for every sample, compute its distance to each center and assign it to the cluster of the nearest center;

3. Update each cluster's center to the mean of its members;

4. Repeat steps 2 and 3; once the k cluster centers stop changing (or an agreed stopping condition is reached), end the iteration. The objective below makes precise why this terminates.
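
Formally, step 2 (reassignment) and step 3 (recomputing means) each never increase the within-cluster sum of squares, and since there are only finitely many ways to partition the points, the iteration must terminate:

$$J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2, \qquad \mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$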

The principle is very simple, but writing it correctly is not that easy; I particularly like asking candidates this question in interviews 😂

Implementation

Here is an implementation in Python.

Assume the point-distance and cluster-center helpers are already given, something like this:

import math

input_data = [[1, 1], [1, 1.5], [5, 5], [5, 5.5]]
k = 2

# distance between two points
def distance(point_a, point_b):
    x = abs(point_a[0] - point_b[0])
    y = abs(point_a[1] - point_b[1])
    return math.sqrt(x * x + y * y)


# new center of the current cluster
def point_mean(point_list):
    x = sum([point[0] for point in point_list]) / len(point_list)
    y = sum([point[1] for point in point_list]) / len(point_list)
    return (x, y)
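
A quick sanity check of the two helpers:

print(distance([1, 1], [1, 1.5]))      # 0.5
print(point_mean([[1, 1], [1, 1.5]]))  # (1.0, 1.25)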

The corresponding k-means code is:

# 1. Take the first k points as the initial cluster centers
# (a truly random choice would use random.sample(input_data, k))
k_cluster = {}
for i in range(k):
    k_cluster[tuple(input_data[i])] = []

# 2. Assign each point to its nearest center
for point in input_data:

    min_distance = math.inf
    target_kernel = None

    for kernel in k_cluster:
        if distance(kernel, point) < min_distance:
            min_distance = distance(kernel, point)
            target_kernel = kernel

    k_cluster[target_kernel].append(point)
print("now cluster", k_cluster)

# 3. Iterate until the centers stop moving
k_cluster_old = k_cluster.copy()

while True:
    # a new round: each cluster's mean becomes its new center
    k_cluster = {}
    for old_kernel in k_cluster_old:
        new_kernel = point_mean(k_cluster_old[old_kernel])
        k_cluster[new_kernel] = []

    # step 2 again: assign each point to its nearest (new) center
    for point in input_data:

        min_distance = math.inf
        target_kernel = None

        for kernel in k_cluster:
            if distance(kernel, point) < min_distance:
                min_distance = distance(kernel, point)
                target_kernel = kernel

        k_cluster[target_kernel].append(point)

    # same centers and same assignment as last round: converged
    if k_cluster_old == k_cluster:
        print("no change")
        break

    k_cluster_old = k_cluster.copy()
    print("now cluster", k_cluster)

We arrive at a stable result.
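
On the toy input_data above, the loop converges in three rounds; the last state printed before "no change" is:

now cluster {(1.0, 1.25): [[1, 1], [1, 1.5]], (5.0, 5.25): [[5, 5], [5, 5.5]]}

That is, the two visually obvious groups, with each center sitting at the mean of its members.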

Spark application
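
At production scale you would not hand-roll the loop above. The job below clusters image embeddings with Spark ML's BisectingKMeans; the table name, output paths, and the DocEmbedding schema are redacted placeholders.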

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.slf4j.{Logger, LoggerFactory}

// assumed schema of the source table (the id/raw types are placeholders)
case class DocEmbedding(id: String, raw: String, embedding: Array[Float])

object ImageClustering {
  val logger: Logger = LoggerFactory.getLogger(getClass)

  def main(args: Array[String]): Unit = {
    val objectName = getClass.getSimpleName
    val spark = SparkSession.builder
      .enableHiveSupport()
      .appName(objectName)
      .getOrCreate

    import spark.implicits._

    val newImageDf = spark
      .sql(
        s"""
           |select
           |  id,
           |  raw,
           |  embedding
           |from
           |  database.table
           |where
           |  p_date = '2021-10-23'
        """.stripMargin
      )
      .as[DocEmbedding]
      .map { doc =>
        (doc.id, doc.raw, Vectors.dense(doc.embedding.map(_.toDouble)))
      }
      .toDF("id", "raw", "embedding")
      .cache()

    // bisecting k-means over the embedding column; each row gets a cluster id in "label"
    val bkm = new BisectingKMeans()
      .setK(5000)
      .setSeed(1)
      .setMinDivisibleClusterSize(100)
      .setFeaturesCol("embedding")
      .setPredictionCol("label")

    val pipeline = new Pipeline()
      .setStages(Array(bkm))

    val bisectingKmeansModel = pipeline.fit(newImageDf)

    val predictionResult = bisectingKmeansModel
      .transform(newImageDf)
      .select("id", "raw", "label")
      .cache()

    // persist the fitted pipeline and the labeled rows
    bisectingKmeansModel.write
      .overwrite()
      .save("/some_path/save_model")

    predictionResult
      .orderBy($"label".desc)
      .repartition(1)
      .write
      .mode(SaveMode.Overwrite)
      .parquet("/some_path/save_data")
  }
}
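
BisectingKMeans is a divisive, hierarchical variant of k-means: it starts from a single cluster containing all points and repeatedly bisects the divisible clusters, which tends to behave much better than plain k-means when k is large (k = 5000 here). setMinDivisibleClusterSize(100) keeps clusters with fewer than 100 points from being split further.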

