gmeans

所属分类:人工智能/神经网络/深度学习
开发工具:C++
文件大小:70KB
下载次数:101
上传日期:2005-09-15 18:54:29
上 传 者simplesky
说明:  gmeans-- Clustering with first variation and splitting 文本聚类算法Gmeans ,使用了3种相似度函数,cosine,euclidean ,KL.文本数据使用的是稀疏矩阵形式.
(gmeans clustering with first variation and splitting Gmeans,a text clustering algorithm, uses 3 functions,cosine,euclidean and KL in similarity measuring.Text data are described by sparse matrix.)

文件列表:
gmeans (0, 2005-09-01)
gmeans\Constants.h (1605, 2003-04-06)
gmeans\DenseMatrix.cc (14955, 2003-04-06)
gmeans\DenseMatrix.h (2326, 2003-04-06)
gmeans\DenseMatrixDouble.cc (5324, 2003-04-06)
gmeans\DenseMatrixDouble.h (1540, 2003-04-06)
gmeans\Diametric_k_means.cc (27219, 2004-08-19)
gmeans\Diametric_k_means.h (1572, 2004-08-19)
gmeans\Euclidean_k_means.cc (29309, 2004-08-19)
gmeans\Euclidean_k_means.h (1355, 2004-08-19)
gmeans\Gmeans.cc (12835, 2003-11-12)
gmeans\Gmeans.h (5335, 2003-04-06)
gmeans\IB.cc (7794, 2004-08-19)
gmeans\IB.h (1386, 2004-08-19)
gmeans\Kullback_leibler_k_means.cc (31405, 2004-08-19)
gmeans\Kullback_leibler_k_means.h (1630, 2004-08-19)
gmeans\Makefile (1088, 2004-08-19)
gmeans\Matrix.cc (8845, 2003-04-06)
gmeans\Matrix.h (3655, 2003-04-06)
gmeans\mat_vec.cc (7441, 2003-04-06)
gmeans\mat_vec.h (1942, 2003-04-06)
gmeans\powermethod.cc (7585, 2003-04-06)
gmeans\RandomGenerator.cc (9022, 2003-04-06)
gmeans\RandomGenerator.h (6127, 2003-04-06)
gmeans\SparseMatrix.cc (18076, 2003-11-12)
gmeans\SparseMatrix.h (2443, 2003-04-06)
gmeans\SparseMatrixDouble.cc (6065, 2003-04-06)
gmeans\SparseMatrixDouble.h (1776, 2003-04-06)
gmeans\Spherical_k_means.cc (26937, 2004-08-19)
gmeans\Spherical_k_means.h (1484, 2004-08-19)
gmeans\timerUtil.h (3519, 2003-04-06)
gmeans\tools.cc (9629, 2003-11-25)
gmeans\tools.h (1375, 2003-04-06)

/* gmeans-- Clustering with first variation and splitting Copyright(c) 2002 Yuqiang Guan (version 1.0) */ To compile: make To remove .o and executable files: make clean To run: gmeans [switches] word-doc-file -a algorithm s: spherical k-means algorithm (default) e: euclidean k-means algorithm b: information bottleneck algorithm k: kullback_leibler k-means algorithm d: diametric k-means algorithm -c number-of-clusters -D delta, this float number is used as threshold for first-variations (default is 10E-6) -d dump the clustering process (default is false) -E number-of-kmeans-iterations, this integer sets when to use similarity estimation (default is 5) -e epsilon, this float number is used as threshold for k-means (default is 10E-3) -F matrix format s: sparse matrix in CCS (default) d: dense matrix i.e. each column represents a data point t: dense matrix transpose i.e. each row represents a data point -f number-of-first variations (default is 0) -i [S|s|p|r|f|c] initialization method: p -- random perturbation r -- random cluster ID f -- read from cluster ID seeding file S -- choose the 1st centroid the farthest point from the center of the whole data set, well separate all centroids (defualt) s -- same as 'S' but get the 1st centroid randomly c -- randomly pick vectors as centroids -L use Laplacian for Kullback_leibler distance measure (default is false) c alpha -- use prior n -- no prior -O the name of output matrix (default is the prefix of the matrix file name) -P set prior values\n"); u -- uniform priors (default) f filename -- read priors from a file -p perturb magnitude -S skip spherical K-Means, i.e. no k-means just first-variations (default is false) -s perturbation with a positive integer seed -T true label file -t scaling scheme (default is tfn) -V verify obj. func. (default is false) -v version-number -U cluster_file (to valuate a given clustering, default is false) -u upper bound of cluster number -W if you want word clusters too -w Omiga value is between 0 and -1 If you use sparse matrix, then 'word-doc-file' the prefix of 5 matrix file names (_dim, _col_ccs, _row_ccs, _tfn_nz and _docs); Otherwise if you use dense matrix (#word-by-#docs) then you should use '-F d' option or if you use dense matrix (#docs-by-#word), e.x. gene expression data, you should use '-F t' option; In either of these 2 cases the matrix file is #row #column ............. ............. . ............. first line gives the size of this rectangular matrix, then the matrix itself. --------------------------------------------------------- The format of initial seeding file is 3893 1 0 0 . . . 2 where '3893' is the number of documents, '1' is for format purpose. Then the numbers in the column below are initial cluster ID for each document. If option 'c' is 3, for example, the cluster IDs should range from 0 to 2. --------------------------------------------------------- The format of true label file is the same as initial seeding file. --------------------------------------------------------- The role of 'w' is to penalize on the number of clusters. Its value is 0 or negative. You may play with different 'w', say on Classic3 , setting a big upper bound on cluster number, to see when 'spmeans' stops what cluster number you get eventually. What 'w' gives you 3 clusters? Or if you get 5 clusters do they look good? So far we don't know what can be a good 'w' for a given data set yet. But maybe different cluster numbers are all OK if the clustering itself is good. --------------------------------------------------------- Example of execution: gmeans-linux -c 3 -T classic3_truelabel classic3 gmeans-linux -c 3 -a e -T classic3_truelabel -O classic3_output -f 20 classic3 gmeans-linux -c 3 -a b -T classic3_truelabel -i r classic3 gmeans-linux -c 3 -a k -L c -P 10000 -T classic3_truelabel -O classic3_output classic3

近期下载者

相关文件


收藏者