K-Means Algorithm: Basic Concepts

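k-Means: Pick k initial centroids at random, where k is chosen by the user. Assign each point to its nearest centroid. Recompute each centroid as the mean of the points assigned to it and move the centroid there. Repeat the assignment and update steps until the assignments stop changing.
k-Means++ (default in scikit-learn's KMeans): Pick one point from the dataset uniformly at random as the first centroid. Pick each subsequent centroid from the dataset with probability proportional to its squared distance from the nearest centroid already chosen, so distant points are favored but the farthest point is not guaranteed to be picked. Repeat until k centroids are selected, then run the standard k-Means iterations.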
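The two procedures above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the function names `kmeans_pp_init` and `lloyd` and the toy data are made up for this example.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Pick k initial centers with k-means++ style D^2-weighted sampling."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first center: uniform at random
    for _ in range(1, k):
        # squared distance from every point to its nearest already-chosen center
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # sample the next center with probability proportional to d2
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)

def lloyd(X, centers, n_iter=100):
    """Alternate assignment and update steps until labels stop changing."""
    labels = None
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)  # assignment step
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center)
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
    return centers, labels

rng = np.random.default_rng(0)
# toy data: three blobs, invented for illustration only
X_toy = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                   for c in ([0, 0], [5, 5], [0, 5])])
init_centers = kmeans_pp_init(X_toy, 3, rng)
final_centers, toy_labels = lloyd(X_toy, init_centers)
```

scikit-learn's own seeding adds refinements, such as trying several candidates per step, so results will not match this sketch exactly.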

Python Coding

Basic Import Statements

import os
os.environ['OMP_NUM_THREADS'] = '1'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Reading Data and Preprocessing (Normalizing)

dataset = pd.read_csv('./input/KMeansData.csv')
x = dataset.iloc[:, :].values
# normalization / feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = sc.fit_transform(x)
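As a quick check of what the scaling step does, the sketch below compares StandardScaler against the formula (x - mean) / std computed by hand, column by column. The small `demo` array is made-up data, not the KMeansData.csv file.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up values with very different column scales
demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 600.0]])

scaled = StandardScaler().fit_transform(demo)

# StandardScaler uses the population standard deviation (ddof=0)
manual = (demo - demo.mean(axis=0)) / demo.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances that k-Means relies on.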

Finding Optimal K using Elbow Method

from sklearn.cluster import KMeans
inertia_list = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=1, random_state=5)
    kmeans.fit(x)
    inertia_list.append(kmeans.inertia_)  # inertia_ = sum of squared distances from each sample to its nearest centroid
plt.plot(range(1,11), inertia_list)
plt.xlabel('num of k')
plt.ylabel('inertia')
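To make the quantity on the y-axis concrete, the sketch below recomputes inertia by hand as the sum of squared distances from each sample to its nearest centroid and compares it with `kmeans.inertia_`. The random `pts` array is placeholder data assumed for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 2))  # placeholder data

km = KMeans(n_clusters=3, init='k-means++', n_init=1, random_state=5).fit(pts)

# squared distance from every sample to every centroid, shape (200, 3)
d2 = ((pts[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)

# keep only the distance to the nearest centroid, then sum over samples
manual_inertia = d2.min(axis=1).sum()

print(manual_inertia, km.inertia_)
```

Because inertia generally keeps shrinking as k grows, the elbow method looks for the k after which the decrease flattens out rather than for the minimum.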

Clustering with the Optimal K and Visualization

# Set the optimal K to 4
opt_k = 4
kmeans = KMeans(n_clusters=opt_k, init='k-means++', n_init=1, random_state=0)
kmeans.fit(x)
plt.scatter(x[:, 0], x[:, 1], c=kmeans.labels_, cmap=plt.cm.Set3)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], color = 'black')
# A for loop is used here to add a number label to each cluster; is there another way?
for num_c in range(opt_k):
    plt.text(kmeans.cluster_centers_[num_c, 0], kmeans.cluster_centers_[num_c, 1], num_c)

# Visualization: an alternative way to write the code
for cluster in range(opt_k):
    plt.scatter(x[kmeans.labels_ == cluster, 0], x[kmeans.labels_ == cluster, 1])
    plt.scatter(kmeans.cluster_centers_[cluster, 0], kmeans.cluster_centers_[cluster, 1], color = 'black', marker='*')
    plt.text(kmeans.cluster_centers_[cluster, 0], kmeans.cluster_centers_[cluster, 1], cluster)
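On the question of labeling centroids without an index-based loop: as far as I know, matplotlib has no single call that places many text labels at once, so some iteration is still needed. One possible tidier form is to iterate with `enumerate` and use `annotate`, which also lets the label be offset from the marker. The `centers_demo` array below is made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up centroid coordinates standing in for kmeans.cluster_centers_
centers_demo = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])

fig, ax = plt.subplots()
ax.scatter(centers_demo[:, 0], centers_demo[:, 1], color='black')

# one annotation per centroid, shifted 5 points up and right of the marker
texts = [
    ax.annotate(str(i), (cx, cy), textcoords='offset points', xytext=(5, 5))
    for i, (cx, cy) in enumerate(centers_demo)
]
```

The offset keeps the number from being drawn directly on top of the black centroid marker.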

Restoring the Scaled Data to the Original Scale and Visualizing

# Undo the scaling so the data can be plotted on its original scale
x_org = sc.inverse_transform(x)
cluster_centers_org = sc.inverse_transform(kmeans.cluster_centers_)
# Visualize with the restored data
plt.scatter(x_org[:, 0], x_org[:, 1], c=kmeans.labels_, cmap=plt.cm.Set3)
plt.scatter(cluster_centers_org[:, 0], cluster_centers_org[:, 1], color = 'black')
for num_c in range(opt_k):
    plt.text(cluster_centers_org[num_c, 0], cluster_centers_org[num_c, 1], num_c)
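The restore step relies on `inverse_transform` being the exact inverse of the scaling. A minimal check, using a made-up `raw` array, is shown below; it also reproduces the inverse by hand from the fitted `mean_` and `scale_` attributes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

raw = np.array([[10.0, 1.0], [20.0, 3.0], [60.0, 5.0]])  # made-up values

scaler = StandardScaler()
z = scaler.fit_transform(raw)

# round trip through the scaler
restored = scaler.inverse_transform(z)

# the same inverse written out: multiply by the std, add back the mean
manual_restored = z * scaler.scale_ + scaler.mean_

print(np.allclose(restored, raw), np.allclose(manual_restored, raw))
```

The same transformation applies to the cluster centers, because they live in the same scaled feature space as the data.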

[Reference] Other Ways to Find the Optimal K Besides the Elbow Method

1) Silhouette scores

The silhouette method measures how similar a point is to its own cluster compared to other clusters. A high silhouette score indicates that the point is well matched to its own cluster and poorly matched to neighboring clusters. The average silhouette score is computed for each candidate k, and the k with the highest average is selected as the number of clusters. The following example uses the silhouette method to find the optimal k:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np
# load data (random placeholder data here)
data = np.random.rand(1000, 2)
# list of k values to test (the silhouette score needs at least 2 clusters)
k_values = range(2, 20)
# calculate silhouette score for each k value
silhouette_scores = []
for k in k_values:
    kmeans = KMeans(n_clusters=k, n_init=1)
    labels = kmeans.fit_predict(data)
    score = silhouette_score(data, labels)
    print(score)
    silhouette_scores.append(score)
# find k value with highest silhouette score
best_k = k_values[np.argmax(silhouette_scores)]
print("Best k value:", best_k)

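The silhouette score ranges from -1 to 1. A high value means that the points within a cluster are tightly grouped and well separated from other clusters. The score accounts for both the distance between clusters and the cohesion within each cluster. It can be used to choose the number of clusters or to evaluate the quality of a clustering result. Because it yields a single number per k, it gives a more objective basis for the choice than reading an elbow off a plot.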
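For a single point, the score is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes this by hand and compares it with scikit-learn's `silhouette_samples`. The helper name `silhouette_of` and the five toy points are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# toy points with hand-assigned cluster labels
pts = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0], [10.0, 0.0]])
lab = np.array([0, 0, 1, 1, 2])

def silhouette_of(i, X, labels):
    """Silhouette value of point i, using Euclidean distance."""
    d = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    if same.sum() == 1:
        return 0.0  # convention for a cluster with a single member
    # a: mean distance to the other members of the same cluster
    a = d[same].sum() / (same.sum() - 1)
    # b: smallest mean distance to the members of any other cluster
    b = min(d[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual_s = np.array([silhouette_of(i, pts, lab) for i in range(len(pts))])
reference_s = silhouette_samples(pts, lab)

print(np.allclose(manual_s, reference_s))
```

`silhouette_score` is the mean of these per-point values over the whole dataset.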

2) Gap-statistic

The gap statistic method compares the log of the within-cluster dispersion of the data with the value expected for a reference dataset that has no obvious cluster structure. The gap statistic is computed for each candidate k, and the k with the highest gap statistic is selected as the number of clusters. The following example uses the gap statistic to find the optimal k:

Note: this requires the gap-stat package, which can be installed from the terminal with 'pip install gap-stat'.

from gap_statistic import OptimalK
# find optimal k value using gap statistic (reuses np and data from the silhouette example)
optimal_k = OptimalK(parallel_backend='multiprocessing')
n_clusters = optimal_k(data, cluster_array=np.arange(1, 20))
print("Optimal k value:", n_clusters)

The gap statistic generates random reference data and compares it with the actual data to compute a 'gap'. The number of clusters that maximizes this gap is selected as the optimal number. This gives a more objective criterion for choosing k than visual inspection, and it can be applied to many types of data.
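The idea can also be sketched without the gap-stat package. The version below is a simplified illustration, not the package's implementation. It assumes `inertia_` as the within-cluster dispersion, draws reference data uniformly within the bounding box of the data, and picks the k with the largest gap. The original paper by Tibshirani et al. instead uses a standard-error rule, choosing the smallest k whose gap is within one standard error of the next one. The function name `gap_values` and the toy `blobs` data are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_values(X, k_range, n_refs=5, seed=0):
    """Gap(k) = mean(log W_k on reference data) - log W_k on X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)  # bounding box of the data
    gaps = []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
        log_wk = np.log(km.inertia_)
        ref_logs = []
        for _ in range(n_refs):
            # reference set with no cluster structure: uniform in the box
            ref = rng.uniform(lo, hi, size=X.shape)
            ref_km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(ref)
            ref_logs.append(np.log(ref_km.inertia_))
        gaps.append(np.mean(ref_logs) - log_wk)
    return np.array(gaps)

rng = np.random.default_rng(1)
# toy data: three tight blobs, invented for illustration
blobs = np.vstack([rng.normal(loc=c, scale=0.2, size=(60, 2))
                   for c in ([0, 0], [4, 4], [0, 4])])

k_range = range(1, 7)
gaps = gap_values(blobs, k_range)
best_k_gap = list(k_range)[int(np.argmax(gaps))]
print("Gap values:", gaps)
print("k with largest gap:", best_k_gap)
```

Unlike the silhouette score, the gap statistic is defined for k = 1, so it can also suggest that the data has no cluster structure at all.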