LDAForTextClassification 联合开发网

Pudn.com > 下载中心 > matlab编程 > LDAForTextClassification

LDAForTextClassification

lda EM text classification LDA classification 变分EM

所属分类：matlab编程
开发工具：matlab
文件大小：24KB
下载次数：23
上传日期：2009-04-25 10:50:08
上传者：zhangyif

说明：使用变分EM方法实现的LDA文本模型输入为文本即可分解实现，含多个小函数可单独使用
(LDA For Text Classification)

文件列表:

LDA For Text Classification\accum_beta.m (396, 2004-10-26)
LDA For Text Classification\assert.m (195, 2004-10-20)
LDA For Text Classification\AUTHOR (164, 2004-11-12)
LDA For Text Classification\converged.m (341, 2004-05-27)
LDA For Text Classification\diff_vec.m (178, 2004-11-12)
LDA For Text Classification\features.m (292, 2004-11-08)
LDA For Text Classification\fmatrix.m (959, 2004-10-26)
LDA For Text Classification\isint.m (150, 2004-10-20)
LDA For Text Classification\LDA For Text Classification.txt (81, 2009-04-25)
LDA For Text Classification\lda.m (1609, 2004-11-10)
LDA For Text Classification\ldamain.m (534, 2004-11-08)
LDA For Text Classification\lda_lik.m (380, 2004-10-26)
LDA For Text Classification\mnormalize.m (386, 2004-10-21)
LDA For Text Classification\newton_alpha.m (1054, 2004-10-12)
LDA For Text Classification\normalize.m (157, 2004-09-14)
LDA For Text Classification\rtime.m (267, 2004-10-25)
LDA For Text Classification\train (50452, 2004-11-10)
LDA For Text Classification\vbem.m (773, 2004-11-08)
LDA For Text Classification (0, 2009-04-25)

lda, a Latent Dirichlet Allocation package. Daichi Mochihashi ATR Spoken Language Communication Research Laboratories, Kyoto, Japan $Id: index.html,v 1.3 2004/12/04 12:47:35 daiti-m Exp $ Overview lda is a Latent Dirichlet Allocation (Blei et al., 2001) package written both in MATLAB and C (command line interface). This package provides only a standard variational Bayes estimation that is first proposed, but has a simple textual data format that is almost the same as SVMlight or TinySVM. This package can be used as an aid to understand LDA, or simply as a regularized alternative to PLSI, which has a severe overfitting problem due to its maximum likelihood structure. For advanced users who wish to benefit from the latest result, consider using npbayes or MPCA: though, they have non-trivial data structures. Requirements C version: * ANSI C compiler. Systems below are confirmed to compile. - Linux 2.4.20, Redhat 9, gcc 3.2.2 - Linux 2.6.5, Fedora core release 2, gcc 3.3.3 - FreeBSD 4.8-STABLE, gcc 2.95.4 (GNU make) - SunOS 5.8, gcc 2.95.3 (GNU make) MATLAB version: * A MATLAB environment. Statistical Toolbox may be needed for psi() function (but in case it is not installed, consider using Minka's Lightspeed MATLAB toolbox). * Octave is not supported. Install C version 1. Take a glance at Makefile; and type make. 2. C version is not intended to be used by those who are not familiar with C. Makefile and source files are very simple, so you can modify it as needed if it does not compile (If severe problems are found, please contact to the author.) MATLAB version simply add a directory where you have unpacked *.m into MATLAB path. For example: % cd ~/work % tar xvfz lda-0.1-matlab.tar.gz % cd lda-0.1-matlab/ % matlab >> addpath ~/work/lda-0.1-matlab Download * C version: lda-0.1.tar.gz * MATLAB version: lda-0.1-matlab.tar.gz Performance * C version runs about 8 times or more faster than MATLAB (while MATLAB codes are fully vectorized). * However, MATLAB version is closed under MATLAB environment; so it is easy to investigate and manipulate the parameters (especially graphically using plot or surf). Moreover, MATLAB codes are so simple and easy to understand. * To estimate the parameters of 50 class LDA decomposition of the standard Cranfield collection (1397 documents, 5177 unique terms), - C version took 1 minute 32 seconds, - MATLAB version took 38 minutes 55 seconds, on a Xeon 2.8GHz. It runs in low memory efficiently: in the experiment above, it uses only 6.8MB (C) and 29MB (MATLAB) of memory. Getting Started This package contains a sample data file "train" which was compiled from the first 100 documents of the Cranfield collection. Each feature id corresponds to the respective line of file "train.lex"; that is, feature 20 means a word "accuracy", feature 21 means "accurate", and so on. After compilation, you can test it using "train" data as follows. C version: % lda -N 20 train model MATLAB version: % matlab >> [alpha,beta] = ldamain('train',20); C version creates two files "model.alpha" and "model.beta"; MATLAB version creates 1x20-dimensional vector alpha and 1324x20-dimensional matrix beta. Parameters of the resulting models are explained in the sections below. Data Format Data format is common to both C and MATLAB version and almost the same as widely-used SVMlight, except that there is no label in Latent Dirichlet Allocation since LDA is an unsupervised method. A data file is a ASCII text file, where each line represents a document (NB. document is simply a synonym for "a group of data"; So you can interpret it as you like whenever it means a group of data.) Typical data file is as follows: 1:1 2:4 5:2 1:2 3:3 5:1 6:1 7:1 2:4 5:1 7:1 * Each line can be maximum 65535 bytes (about 820 lines in 80-column text) by default. For a standard document this value is sufficient, but if you wish to increase this limit, modify BUFSIZE in feature.c as you like. * Each line consists of pairs of :. Here, feature_id is an integer from 1 (this is the same as SVMlight); count can be an integer or a real number that must be positive. * : pairs are separated by (possibly multiple) white spaces. The program is coded to work even if there are any empty lines, but it is preferable that there are no such unnecessary lines. * For a complete specification, please refer to SVMlight's page. Command Line Syntax C version lda is typically invoked simply as: % lda -N 100 train model where % is a command prompt. train is a data file that has a format described above, and model is a basename of output files of model parameters. Specifically, lda uses two outputs: model.alpha and model.beta that represent \alpha and \beta in the LDA described in (Blei et al., 2001); that is, \alpha is the parameter of prior Dirichlet distribution over the latent classes, and \beta is the set of class unigrams for each latent class. -N 100 is the number of latent classes to assume in the data. For the standard model of LDA, this is the only parameter we must provide in advance. In this case, 100 latent classes are assumed. Besides, there are several rarely-used options: % lda -h lda, a Latent Dirichlet Allocation package. Copyright (c) 2004 Daichi Mochihashi, All rights reserved. usage: lda -N classes [-I emmax -D demmax -E epsilon] train model -I emmax Maximum # of iteration of the outer VB-EM algorithm, which is exited when converged. (default 100) -D demmax Maximum # of iteration of the inner VB-EM algorithm for each document, which is exited when converged. (default 20) -E epsilon A threshold to determine the whole convergence of the estimation. It is a lower threshold of the relative increase in the total data likelihood. (default 0.0001) -h displays help. MATLAB version First, you must load a data file into MATLAB data structure: % matlab >> d = fmatrix('train'); And run a function lda to estimate the parameters. The second argument is the number of latent classes that you assume. (in the example below, 20) >> help lda Latent Dirichlet Allocation, standard model. Copyright (c) 2004 Daichi Mochihashi, all rights reserved. $Id: index.html,v 1.3 2004/12/04 12:47:35 daiti-m Exp $ [alpha,beta] = lda(d,k,[emmax,demmax]) d : data of documents k : # of classes to assume emmax : # of maximum VB-EM iteration (default 100) demmax : # of maximum VB-EM iteration for a document (default 20) >> [alpha,beta] = lda(d,20); Optional two parameters emmax and demmax can be fed into lda, which has the same meaning as the C version. If you find that loading text data into MATLAB structure in advance is troublesome, there is a wrapper function ldamain that works exactly the same as the C version: >> [alpha,beta] = ldamain('train.dat',20); Output Format MATLAB version In the example above, alpha is a N-dimensional row vector of \alpha for corresponding latent topics, and beta is a [V,N]-dimensional matrix of \beta where beta(v,n) = p(v|n) (n = 1 .. N, v = 1 .. V; V is the size of the lexicon). You can save them to file using standard MATLAB function save, for example, as: >> [alpha,beta] = ldamain('train.dat',20); number of latent classes = 20 number of documents = 100 number of words = 1324 iteration 26/100.. likelihood = 339.167 ETA: 0:01:03 (1 sec/step) converged. >> save('alpha.dat', 'alpha', '-ascii'); >> save('beta.dat', 'beta', '-ascii'); C version If you invoke lda as the following, % lda -N 100 train model two files "model.alpha" and "model.beta" are created. These two files are exactly of the same format as those which are saved from MATLAB: "model.alpha" is a space-separated N-dimensional vector of \alpha, and "model.beta" is a space-separated V x N matrix of \beta. These parameters can be loaded into MATLAB using standard ways: >> beta = load('model.beta'); And you can manipulate these parameters within MATLAB. References 1. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. Neural Information Processing Systems 14, 2001. [citeseer] 2. David M. Blei, Andrew Y. Ng and Michael I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, vol. 3, pp.993--1022, 2003. [citeseer] 3. Daichi Mochihashi. A note on a Variational Bayes derivation of full Bayesian Latent Dirichlet Allocation. unpublished manuscript, 2004. [PDF] Acknowledgements Digamma and trigamma codes are by Thomas Minka. (url) Thanks for Taku Kudo for his plsi tool that has been a good reference for the development of lda. Note Almost at the same time(!), Blei himself made LDA-C package written in C to the public. According to some experiments, our package runs about 4 to 10 times as fast as his (this may be due to reusing preallocated buffers in VB-EM), and also provides a MATLAB version for easy experimentation. Of course, the package by the original author is more desirable, so I wish both ones are equally useful. ------------------------------------------------------------------------------ daichi.mochihashi atr.jp Last modified: Thu Jun 9 16:05:39 2005

近期下载者：

相关文件：

评论：[我要评论] [举报此文件]

收藏者：