rph_kmeans
Category: Clustering algorithms
Development tool: Python
File size: 2469KB
Downloads: 0
Upload date: 2023-03-08 03:21:04
Uploader: sh-1993
Description: KMeans with initial centers produced by Random Projection (RP)
File list:
LICENSE (11357, 2023-03-08)
MANIFEST.in (72, 2023-03-08)
Makefile (401, 2023-03-08)
examples (0, 2023-03-08)
examples\k_selection (0, 2023-03-08)
examples\k_selection\K-BIC.png (26020, 2023-03-08)
examples\performance_test (0, 2023-03-08)
examples\performance_test\log.txt (37796, 2023-03-08)
examples\pipeline_draw (0, 2023-03-08)
examples\pipeline_draw\kmeans(kmeans++)_cluster_centers.png (335339, 2023-03-08)
examples\pipeline_draw\kmeans(kmeans++)_y_pred.png (332485, 2023-03-08)
examples\pipeline_draw\kmeans(random)_cluster_centers.png (335775, 2023-03-08)
examples\pipeline_draw\kmeans(random)_y_pred.png (331997, 2023-03-08)
examples\pipeline_draw\rph_kmeans_cluster_centers.png (336076, 2023-03-08)
examples\pipeline_draw\rph_kmeans_reduced_points.png (88442, 2023-03-08)
examples\pipeline_draw\rph_kmeans_y_pred.png (330419, 2023-03-08)
examples\pipeline_draw\y_true.png (326827, 2023-03-08)
examples\simulate.py (9138, 2023-03-08)
requirements.txt (89, 2023-03-08)
rph_kmeans (0, 2023-03-08)
rph_kmeans\__init__.py (207, 2023-03-08)
rph_kmeans\_point_reducer_cy.by_cython.cpp (1248556, 2023-03-08)
rph_kmeans\_point_reducer_cy.pyx (3748, 2023-03-08)
rph_kmeans\_point_reducer_cy_lib.cpp (876, 2023-03-08)
rph_kmeans\_point_reducer_cy_lib.h (4591, 2023-03-08)
rph_kmeans\k_selection.py (6804, 2023-03-08)
rph_kmeans\point_reducer_base.py (2373, 2023-03-08)
rph_kmeans\point_reducer_cy.py (6104, 2023-03-08)
rph_kmeans\point_reducer_py.py (7096, 2023-03-08)
rph_kmeans\rph_kmeans_.py (7599, 2023-03-08)
rph_kmeans\utils.py (798, 2023-03-08)
setup.py (1760, 2023-03-08)
# RPH-KMeans
**RPH-KMeans** is a variant of the k-means algorithm in which the initial centers are produced by a point-reduction process based on **random projection (RP)**, a **locality-sensitive hashing (LSH)** technique.
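The point-reduction idea can be sketched as follows. This is a minimal illustration of RP-based hashing, not the package's actual implementation: points are bucketed by the sign pattern of a few random projections, and each bucket is collapsed to its centroid, so k-means can be initialized on far fewer, better-spread points.

```python
import numpy as np

def rp_reduce(X, n_projections=8, random_state=0):
    """Sketch of RP-based point reduction: bucket points by the sign
    pattern of random projections, then replace each bucket with its
    centroid. (Illustrative only; the package's reducer is more involved.)"""
    rng = np.random.default_rng(random_state)
    # Random projection matrix: one hash bit per projection direction
    W = rng.standard_normal((X.shape[1], n_projections))
    codes = (X @ W > 0).astype(np.uint8)          # (n_samples, n_projections)
    # Points sharing a hash code fall into the same bucket
    _, inverse = np.unique(codes, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    centroids = np.vstack([X[inverse == b].mean(axis=0)
                           for b in range(inverse.max() + 1)])
    return centroids
```

Nearby points tend to land on the same side of most random hyperplanes, so buckets approximate local neighborhoods and their centroids make robust initial centers.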
# Installation
To install RPH-KMeans, simply run the setup script:
```
python3 setup.py install
```
Or use `pip`:
```
pip3 install rph_kmeans
```
Or install from git repo directly:
```
pip3 install git+https://github.com/tinglabs/rph_kmeans.git
```
To run basic clustering:
```python
# Note:
# type(X) = np.ndarray or scipy.sparse.csr_matrix;
# X.shape = (n_samples, n_features)
from rph_kmeans import RPHKMeans
clt = RPHKMeans()
labels = clt.fit_predict(X)
```
To estimate the number of clusters:
```python
# Note:
# type(X) = np.ndarray or scipy.sparse.csr_matrix;
# X.shape = (n_samples, n_features)
# type(kmax) = int;
# Let k_ be the expected cluster number; it is recommended to set kmax = k_ * 3
from rph_kmeans import select_k_with_bic
optimal_k, _, _ = select_k_with_bic(X, kmax=kmax)
```
## Note
For **macOS**, it may help to run the following command before running `setup.py`:
```
export MACOSX_DEPLOYMENT_TARGET=10.9
```
which can fix the errors:
- `ld: library not found for -lstdc++`
- `C++ STL headers not found`

If the installation fails when compiling the C++ extension, you can add the path of `rph_kmeans` to `PYTHONPATH` and use the **Python** version (`point_reducer_version = "py"`) instead of the **Cython** version (`point_reducer_version = "cy"`).
# Demo
The experiments show that **RPH-KMeans** handles imbalanced data much better than **KMeans (k-means++ initialization)** and **KMeans (random initialization)**.
## Simulation
Run the script `examples/simulate.py`:
```
python3 simulate.py
```
### Simulated Data
2-D simulated data is generated with:
- clusters number: 5
- gaussian distribution
- label `0`: `mean=(0, 0); cov=1.0`
- label `1`: `mean=(5, 5); cov=1.0`
- label `2`: `mean=(-5, -5); cov=1.0`
- label `3`: `mean=(5, -5); cov=1.0`
- label `4`: `mean=(-5, 5); cov=1.0`
- samples number:
- label `0`: 5000
- label `1`: 100
- label `2`: 100
- label `3`: 100
- label `4`: 100
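The data described above can be reproduced along these lines (a hypothetical reconstruction from the listed means, unit covariance, and per-cluster sample counts; `examples/simulate.py` is the authoritative script):

```python
import numpy as np

rng = np.random.default_rng(0)
# Cluster means and per-cluster sample counts from the description above
means = [(0, 0), (5, 5), (-5, -5), (5, -5), (-5, 5)]
counts = [5000, 100, 100, 100, 100]
# Draw each cluster from an isotropic 2-D Gaussian (cov = identity)
X = np.vstack([rng.multivariate_normal(m, np.eye(2), size=n)
               for m, n in zip(means, counts)])
y = np.repeat(np.arange(5), counts)
```

Note the heavy imbalance: label `0` has 50x the samples of each other cluster, which is exactly the regime where random or k-means++ initialization tends to place several centers inside the large cluster.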
It looks like:
![y_true](https://github.com/tinglabs/rph_kmeans/blob/master/examples/pipeline_draw/y_true.png)
### Performance of RPH-KMeans
Running rph-kmeans with the default config, we get
- **ARI**: 0.99
- **NMI**: 0.***
The final cluster centers:
![RPH-KMeans](https://github.com/tinglabs/rph_kmeans/blob/master/examples/pipeline_draw/rph_kmeans_cluster_centers.png)
The predicted labels:
![RPH-KMeans](https://github.com/tinglabs/rph_kmeans/blob/master/examples/pipeline_draw/rph_kmeans_y_pred.png)
The reduced points generated by random projection and the initial centers:
![RPH-KMeans](https://github.com/tinglabs/rph_kmeans/blob/master/examples/pipeline_draw/rph_kmeans_reduced_points.png)
### Performance of KMeans (kmeans++)
Running KMeans (init='k-means++'; n_init=10), we get
- **ARI**: 0.13
- **NMI**: 0.37
The final cluster centers:
![KMeans (kmeans++)](https://github.com/tinglabs/rph_kmeans/blob/master/examples/pipeline_draw/kmeans(kmeans++)_cluster_centers.png)
The predicted labels:
![KMeans (kmeans++)](https://github.com/tinglabs/rph_kmeans/blob/master/examples/pipeline_draw/kmeans(kmeans++)_y_pred.png)
### Performance of KMeans (random)
Running KMeans (init='random'; n_init=10), we get
- **ARI**: 0.23
- **NMI**: 0.51
The final cluster centers:
![KMeans (random)](https://github.com/tinglabs/rph_kmeans/blob/master/examples/pipeline_draw/kmeans(random)_cluster_centers.png)
The predicted labels:
![KMeans (random)](https://github.com/tinglabs/rph_kmeans/blob/master/examples/pipeline_draw/kmeans(random)_y_pred.png)
### Summary
More detailed results are shown below (metric mean and standard deviation over 10 repeats). The running log is in `examples/performance_test/log.txt`.
| Method | Real Time (s) | CPU Time (s) | ARI | NMI |
| ---- | ---- | ---- | ---- | ---- |
| RPH-KMeans (n_init=1) | 0.208 (0.105) | 0.206 (0.105) | 0.997 (0.000) | 0.992 (0.000) |
| KMeans (kmeans++; n_init=1) | 0.074 (0.042) | 0.074 (0.042) | 0.175 (0.062) | 0.441 (0.079) |
| KMeans (kmeans++; n_init=5) | 0.325 (0.168) | 0.325 (0.168) | 0.316 (0.227) | 0.570 (0.141) |
| KMeans (kmeans++; n_init=10) | 0.636 (0.318) | 0.636 (0.318) | 0.693 (0.373) | 0.805 (0.230) |
| KMeans (random; n_init=1) | 0.089 (0.052) | 0.089 (0.052) | 0.202 (0.039) | 0.470 (0.067) |
| KMeans (random; n_init=5) | 0.390 (0.207) | 0.390 (0.207) | 0.200 (0.055) | 0.463 (0.073) |
| KMeans (random; n_init=10) | 0.800 (0.399) | 0.800 (0.399) | 0.215 (0.047) | 0.491 (0.063) |
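The ARI and NMI numbers in the table can be computed with scikit-learn's standard clustering metrics; a minimal scoring helper (the exact evaluation code in `examples/performance_test` may differ) might look like:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def evaluate(y_true, y_pred):
    """Score a clustering against ground-truth labels with the two
    metrics reported above: ARI and NMI. Both are 1.0 for a perfect
    clustering (up to label permutation) and near 0 for random labels."""
    return (adjusted_rand_score(y_true, y_pred),
            normalized_mutual_info_score(y_true, y_pred))
```

Both metrics are invariant to label permutation, so a prediction that swaps cluster IDs still scores 1.0.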
### Estimation of the cluster number
The cluster number can be correctly estimated by finding the knee point of the **BIC** curve:
![RPH-KMeans](https://github.com/tinglabs/rph_kmeans/blob/master/examples/k_selection/K-BIC.png)
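One simple way to locate such a knee (a sketch only, assuming a curve that rises steeply and then flattens; `select_k_with_bic` may use a different criterion) is to pick the point farthest from the straight line joining the curve's endpoints:

```python
import numpy as np

def knee_point(ks, bics):
    """Return the k at the knee of a curve: the point with maximum
    distance to the chord between the first and last points, after
    normalizing both axes to [0, 1]."""
    ks = np.asarray(ks, dtype=float)
    bics = np.asarray(bics, dtype=float)
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (bics - bics[0]) / (bics[-1] - bics[0] + 1e-12)
    # After normalization the chord is the line y = x; the perpendicular
    # distance of each point to it is |y - x| / sqrt(2)
    d = np.abs(y - x) / np.sqrt(2.0)
    return ks[int(np.argmax(d))]
```

For example, on a curve that jumps sharply between k=1 and k=2 and then flattens, the detector returns k=2.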