Description: K-means clustering Python implementation on the Iris flower data set
File list:
clustering.py (4194, 2023-10-01)
# clustering
K-means clustering Python implementation on the Iris flower data set: https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html
## Context
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
Typically, unsupervised algorithms make inferences from datasets using only input vectors without referring to known, or labelled, outcomes. A cluster refers to a collection of data points aggregated together because of certain similarities.
A target number k is defined, which is the number of centroids, and therefore clusters, to find in the dataset. A centroid is the imaginary or real location representing the center of a cluster. Every data point is allocated to exactly one cluster so as to reduce the within-cluster sum of squares. In other words, the K-means algorithm identifies k centroids and allocates every data point to the nearest one, while keeping the clusters as compact as possible. The ‘means’ in K-means refers to averaging the data, that is, computing each centroid as the mean of the points assigned to it.
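To make the assignment rule and the within-cluster sum of squares concrete, here is a minimal NumPy sketch. The function names `assign_to_nearest` and `within_cluster_ss` and the toy points are illustrative assumptions, not taken from `clustering.py`, and Euclidean distance is assumed.

```python
import numpy as np

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its nearest centroid (Euclidean)."""
    # distances[i, j] is the distance from point i to centroid j
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def within_cluster_ss(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((points - centroids[labels]) ** 2).sum())

# Toy 2-D data (made-up values, not the iris measurements)
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

labels = assign_to_nearest(points, centroids)
wcss = within_cluster_ss(points, centroids, labels)
print(labels, wcss)
```

Each point here lies at distance 1 from its nearest centroid, so the within-cluster sum of squares is the sum of four squared distances of 1.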
**Algorithm**
To process the learning data, the K-means algorithm in data mining starts with a first group of randomly selected centroids, which are used as the starting points for the clusters, and then performs iterative (repetitive) calculations to optimize the positions of the centroids.
It halts creating and optimizing clusters when either:
- The centroids have stabilized: their values no longer change because the clustering has converged.
- The defined maximum number of iterations has been reached.
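The iteration and the two stopping conditions above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the code in `clustering.py`: the function name `kmeans`, the parameters `max_iter` and `seed`, the use of `np.allclose` to detect stabilized centroids, and the toy data are all hypothetical choices.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means: returns (centroids, labels, iterations_run)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()

    def nearest(cents):
        # Index of the nearest centroid for every point
        d = np.linalg.norm(points[:, None, :] - cents[None, :, :], axis=2)
        return d.argmin(axis=1)

    iterations = 0
    for iterations in range(1, max_iter + 1):
        labels = nearest(centroids)
        # Update step: each centroid becomes the mean of its assigned points;
        # an empty cluster keeps its previous centroid
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Stopping condition 1: the centroids have stabilized
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Stopping condition 2 is the loop bound itself (max_iter reached)
    return centroids, nearest(centroids), iterations

# Toy data: two well-separated groups (made-up values, not iris)
toy = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
centroids, labels, n_iter = kmeans(toy, k=2)
print(centroids, labels, n_iter)
```

On real data, the result depends on the random starting centroids, so a common precaution is to run the algorithm with several seeds and keep the run with the smallest within-cluster sum of squares.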
## Program
The program first reads in the four input features of the iris dataset and selects k random starting centroids from the dataset. Each centroid is stored in a dictionary object consisting of