BoosTexter2_1_windows

Category: Windows programming
Development tool: Windows_Unix
File size: 223KB
Downloads: 8
Upload date: 2010-06-26 04:28:20
Uploader: bagulho51
Description: BoosTexter2_1 - boosting-based multi-label text categorization

File list:
boostexter.exe (90624, 2001-04-16)
sample.data (483, 2001-04-16)
sample.names (222, 2001-04-16)
sample.test (144, 2001-04-16)
SchapireSi98b.ps (736973, 2010-04-04)

============================================================================
======                                                                ======
======                          BoosTexter                            ======
======                                                                ======
======        Created by Erin Allwein (eallwein@swri.edu)             ======
======                   Rob Schapire (schapire@research.att.com)     ======
======                   Yoram Singer (singer@cs.huji.ac.il)          ======
======                                                                ======
======                    Version 2.1 (4/12/2001)                     ======
======                                                                ======
======           Copyright 2001 AT&T. All rights reserved.            ======
======                                                                ======
============================================================================

============
INSTALLATION
============

Download the binary archive:

    BoosTexter2_1.<arch>.tar.gz

where <arch> is one of the supported binary releases. Then unpack the
archive. On unix systems, you can use gunzip and tar, e.g.,

    gunzip -c BoosTexter2_1.<arch>.tar.gz | tar xf -

This will create a root directory BoosTexter2_1 containing the
distribution.

Note to Windows users: Although a Windows distribution is available, the
instructions given below are intended for unix systems. To run on a
Windows machine using these instructions, BoosTexter should be run in a
unix-like shell (for instance, using Cygwin (cygwin.com) or U/Win
(www.research.att.com/sw/tools/uwin)).

==============
INCLUDED FILES
==============

The archive includes the following files:

    boostexter  - binary executable for the selected architecture
    README      - this README file
    sample.names, sample.data, sample.test
                - sample data and specification files (see detailed
                  description below)

================
USING BOOSTEXTER
================

General:
========

Here are instructions on how to use the program "BoosTexter" to build a
classifier for text and/or attribute-value classification, and how to
classify new instances. Further background on BoosTexter can be found in
the paper:

    "BoosTexter: A boosting-based system for text categorization" by
    Robert E. Schapire and Yoram Singer, Machine Learning,
    39(2/3):135-168, 2000.

BoosTexter works with data which may be of various forms. In general,
each instance is broken into multiple fields. These fields may be of
four types: a continuous-valued attribute (such as "age"), a
discrete-valued attribute (such as "eye color"), a text string (such as
the body of an email message), or a "scored text" string (in which
scores are associated with each word of the text, such as the "tf-idf"
scores used in information retrieval).

BoosTexter works by combining many simple rules. These rules are
constructed in a sequence of rounds. Each rule consists of a simple
binary test, and predictions for each outcome of this test. This test
has one of the following forms, depending on the type of input field
associated with the test (a sketch of how such rules combine follows
this list):

* For discrete attributes, the test asks if the attribute has a
  particular value.

* For continuous attributes, the test asks if the attribute value is
  above or below some threshold.

* For textual input fields, the test asks whether or not a particular
  sequence of words is present in the given text. This sequence may be a
  simple sequence or "ngram" (in ngram or fgram mode), or a "sparse
  ngram" (in sgram mode). (An example of a sparse ngram is the pattern
  "the * boy", which matches any three-word sequence beginning with the
  word "the" and ending with the word "boy".) The style of ngram that is
  used is determined by the "-N" flag, and the length by the "-W" flag.

* For scored text, the test asks whether the score associated with a
  particular word is above or below some threshold.
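To make this concrete, here is a minimal Python sketch (an illustration
of the scheme just described, not BoosTexter's actual code) of how the
per-outcome prediction weights of such rules are summed into final label
weights. The rule shown is the round-1 rule from the detailed example
below:

    # Illustration only (not BoosTexter's code): weak rules with
    # per-label prediction weights combine by summing, as described
    # above.  The rule below is the round-1 rule from the example below.
    labels = ["rich", "smart", "happy"]
    rules = [
        {"field": "goal-in-life", "ngram": "be",
         "C0": [-1.199, 0.168, 0.168],    # weights if the ngram is absent
         "C1": [0.549, -0.549, -0.549]},  # weights if the ngram is present
    ]

    def label_weights(example):
        """Sum the prediction vectors of all rules for one example."""
        totals = [0.0] * len(labels)
        for rule in rules:
            words = example.get(rule["field"], "").split()
            outcome = "C1" if rule["ngram"] in words else "C0"
            totals = [t + w for t, w in zip(totals, rule[outcome])]
        return dict(zip(labels, totals))

    print(label_weights({"goal-in-life": "to be a millionaire"}))
    # {'rich': 0.549, 'smart': -0.549, 'happy': -0.549} -> predict "rich"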
The "predictions" associated with each outcome of a given rule are described by a set of weights over the possible labels. See the detailed example below. Some of the possible run-time parameters are given below. Contact us for details on some of the other, more advanced options. Options are: -h : list all options -f : frequency of displaying hypotheses (0 = never; def. = 1) -l : use "discrete AdaBoost.MR" (rather than "real AdaBoost.MH") -z : use "discrete AdaBoost.MH" (rather than "real AdaBoost.MH") (see the BoosTexter paper for details) -n : specify number of rounds (usually in the range 300-1000, unless running with -l or -z turned on) -o : output predictions in long form on train/test data -p : load the model stored currently in .shyp - this mode is used to continue training beginning with a model that was built on a previous run of BoosTexter (unimplemented with -l and -z options) -r : specify random seed -W : window length for creating word-grams -N : type of word-grams (maximal length specified by -W parm): sgram - all sparse word-grams up to a maximal length ngram - all full word-grams up to a maximal length fgram - full word-grams of maximal length only -S : input/output/strong-hypothesis file stem -C : turn on classification mode - this mode is used when applying a classifier that has already been built to new data -V : Verbose mode - prints more detailed information about each weak hypothesis during training Example runs are given below. Input/Output files and their formats: ===================================== BoosTexter receives several files as input and may produce output files. In training mode the program gets a names files, a data file used for training, and an optional test file. As a result, it produces a strong hypothesis. The stem of all the files is the same and given via the run-time parameter -S. For example, if BoosTexter is called with "-S sample", then the following files will be used or created: sample.names : names (description) file sample.data : input training file sample.test : (optional) input test file sample.shyp : resulting strong hypothesis In classification mode (-C parameter turned on) the description file (.names) and strong hypothesis file (.shyp) are read, and the test data is read from the standard input (stdin). A summary of the error rate is printed to the standard error, and the per-example prediction is printed to the standard output. "names" file (stem.names): -------------------------- The names file defines the format of the data to be read in. White space is ignored throughout this file. The first line of the names file specifies the possible class labels of the data instances. It has the form: , , ... , . where each is any string of letters or numbers. Certain punctuation marks, including comma and period, may be problematic. Case is significant. (Likewise for all other string names below.) The remaining lines specify the form and type of the input fields. For continuous-valued input fields, the form is: : continuous. where is any string naming the field. For discrete-valued input fields, the form is: : , , ... , . where is any string naming the field, and , ... , are strings representing the possible values which can be assumed by this attribute. For text input fields, the form is: : text. where is any string naming the field. For the sake of backwards compatibility, these fields may also have the equivalent form: : set. For scored-text intput fields, the form is: : scoredtext. where is any string naming the field. 
A detailed example is given below.

data and test files (stem.data and stem.test):
----------------------------------------------

These files each describe a sequence of labeled examples. Each example
has the following form:

    <value1>, <value2>, ... , <valuen>, <label1> <label2> ... <labelk>.

Here, <valuei> specifies the value of the i-th field, where the ordering
of the fields is as specified in the names file. If this field is
continuous, then a real number should appear. If this field is discrete,
then the value must be one of the values (strings) specified in the
names file. If the field is a text string, then any string of words may
appear. If the field is a scored text string, then a real number must
follow each word in the string.

Each <labeli> is one of the labels specified in the first line of the
names file. These labels are the "correct" or desired labels associated
with this example.

Fields for which values are unknown may contain the single symbol "?".
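As a small illustration of this layout, the following Python sketch (not
part of the distribution; the helper name is hypothetical) formats one
example as a data-file line, using "?" for unknown fields:

    # Sketch: format one example as a .data/.test line in the layout
    # described above.  field_values are pre-formatted strings; None
    # marks an unknown field, written as "?".  An empty label list
    # yields a line ending in ", ." (an example with no labels).
    def format_example(field_values, labels):
        parts = ["?" if v is None else v for v in field_values]
        return ", ".join(parts) + ", " + " ".join(labels) + "."

    print(format_example(
        ["34", "40000", "female", "bachelors",
         "to find inner peace", "pottery 3 photography 1"],
        ["smart", "happy"]))
    # 34, 40000, female, bachelors, to find inner peace,
    #   pottery 3 photography 1, smart happy.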
Strong hypothesis file (stem.shyp):
-----------------------------------

This is an ASCII file. For more information on its format, contact one
of us.

Detailed Example:
=================

Here is a toy example in which the goal is to predict if a person is
rich, smart, happy (or some combination of these). Instances describe
individual people.

Here is an example "names" file, with the three classes and a
description of the fields. It is included in this distribution as the
file "sample.names", as are the other files mentioned below.

    rich, smart, happy.
    age: continuous.
    income: continuous.
    sex: male, female.
    highest-degree: none, high-school, bachelors, masters, phd.
    goal-in-life: text.
    hobbies: scoredtext.

The scores appearing in the "hobbies" field encode the number of hours
per week spent on each of a list of hobbies. The interpretation of the
other fields should be obvious.

Here is an example "data" (training) file, called "sample.data":

    34, 40000, female, bachelors, to find inner peace, pottery 3 photography 1, smart happy.
    40, 100000, male, high-school, to be a millionaire, movies 7, rich.
    29, 80000, male, phd, win turing prize, reading 8 stamp-collecting 2, smart.
    59, 50000, female, phd, win pulitzer prize, reading 40, smart happy.
    16, 1000, female, ?, sleep as much as i want, tv 14 homework 7, .
    21, 25000, male, high-school, have a big family, tv 8 fishing 2, happy.

Here is an example "test" file (optional for training), called
"sample.test":

    51, 52000, male, phd, be with my family, fishing 4, happy smart.
    24, 1000000, male, bachelors, retire at 25, movies 8 tv 8, rich.

We can train BoosTexter on this data using the command:

    boostexter -n 10 -W 1 -N ngram -S sample -V

"-n 10" specifies 10 rounds of boosting. "-S sample" specifies the file
stem. "-W 1 -N ngram" says to use ngrams of length at most 1. "-V"
specifies verbose printing of output.

Executing this command gives the (slightly abbreviated) output:

    Weak Learner parameters:
    ------------------------
    Window  = 1
    Classes = all
    Expert  = NGRAM

    goal-in-life:be
    C0: -1.199 0.168 0.168
    C1: 0.549 -0.549 -0.549
    rnd 1: wh-err= 0.724633 th-err= 0.724633 test= 1.0000000 train= 0.3333333

    hobbies:tv
    Threshold: 11.000
    C0: -0.149 0.563 -0.017
    C1: -0.303 -0.725 0.602
    C2: -0.303 -0.725 -0.725
    rnd 2: wh-err= 0.765474 th-err= 0.554688 test= 1.0000000 train= 0.1666667

    . . .

    income:
    Threshold: 90000.000
    C0: 0.000 0.000 0.000
    C1: -0.761 0.266 0.044
    C2: 0.659 -0.932 -0.509
    rnd 8: wh-err= 0.785509 th-err= 0.125341 test= 0.5000000 train= 0.1666667

    hobbies:reading
    Threshold: 24.000
    C0: 0.187 -0.382 0.074
    C1: -0.315 0.506 -0.862
    C2: -0.141 0.372 0.615
    rnd 9: wh-err= 0.824456 th-err= 0.103339 test= 1.0000000 train= 0.1666667

    highest-degree:bachelors
    C0: -0.003 -0.443 -0.450
    C1: -0.250 0.959 0.912
    rnd 10: wh-err= 0.735***4 th-err= 0.076020 test= 1.0000000 train= 0.1666667

On each round, BoosTexter prints the rule which was found, with the
associated predictions. For text input fields (e.g., round 1), this is
the chosen ngram ("be") and the vector of predictions if the word is
present (C1) or absent (C0). Thus, the rule found on round 1 would read
in English:

    IF the word "be" is present in the field "goal-in-life"
    THEN predict "rich"  with weight +0.549
                 "smart" with weight -0.549
                 "happy" with weight -0.549
    ELSE predict "rich"  with weight -1.199
                 "smart" with weight +0.168
                 "happy" with weight +0.168

The magnitude of these weights indicates the strength or confidence of
the prediction. A positive or negative value corresponds to a prediction
that the label should or should not be applied to this example,
respectively.

For scored-text input fields (e.g., round 2), the chosen ngram is shown
("tv" in this case) together with a threshold value (11). The prediction
vectors are to be used if the word is absent (C0), present with score
below threshold (C1), or present with score above threshold (C2).

For discrete input fields (e.g., round 10), a selected attribute value
is given ("bachelors" in this case). The prediction vectors are to be
used if the attribute equals (C1) or does not equal (C0) the given value
(i.e., if "highest-degree" is or is not "bachelors"). If the attribute
is unknown, C0 is used.

For continuous input fields (e.g., round 8), a threshold value is given
(90000 in this case). The prediction vectors are to be used if the
attribute value is above threshold (C2), below threshold (C1), or if the
attribute value is unknown (C0).

BoosTexter also prints out various raw error rates, the most relevant
being the error on the training data and on the test data. For examples
with multiple labels, BoosTexter counts an error for the example if and
only if the highest weighted label is not among the labels assigned to
the example.

After training, BoosTexter creates a hypothesis file "sample.shyp". We
can now use this hypothesis to classify test examples in any file. For
instance, the command

    boostexter -C -S sample -o < sample.test

will evaluate the stored hypothesis "sample.shyp" on the data in
"sample.test". "-C" specifies that we are using classification mode
rather than training mode. "-S sample" specifies the file stem. "-o"
says to produce "verbose" output which is more easily read by a human.

This will print the raw classification error to stderr, and print the
following to standard output:

    age: 51.0000000
    income: 52000.0000000
    sex: male
    highest-degree: phd
    goal-in-life: be with my family
    hobbies: fishing 4
    correct label = smart happy
    >   -0.004819 : rich
    **  -0.093118 : smart
    **  -0.069115 : happy

    age: 24.0000000
    income: 1000000.0000000
    sex: male
    highest-degree: bachelors
    goal-in-life: retire at 25
    hobbies: movies 8 tv 8
    correct label = rich
    **  -0.183517 : rich
        -0.285135 : smart
    >    0.13***53 : happy

For each example, the correct labels are printed, followed by the sum of
weights associated with each label. The prediction of the hypothesis is
the label receiving the greatest weight (indicated by a ">" sign). The
correct labels are indicated with a "**" sign.

Note that the weights associated with the labels may tend to be close to
0. These weights certainly should NOT be interpreted as probabilities
(although they can be converted into reasonable estimates of
probabilities -- contact us for details). A reasonable measure of
confidence for a given example is the difference between the largest
weight of any label and the second largest weight. There are other
reasonable measures of confidence that can be used.
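To make the error rule and the margin-based confidence concrete, here is
a minimal Python sketch (a hypothetical helper, not part of the
distribution), using the per-label weights printed for the first test
example above:

    # Sketch: the multi-label error rule and margin confidence described
    # above.  An example counts as an error iff the top-weighted label
    # is not among its correct labels; confidence is the gap between the
    # largest and second-largest label weights.
    def evaluate(weights, true_labels):
        ranked = sorted(weights, key=weights.get, reverse=True)
        error = ranked[0] not in true_labels
        margin = weights[ranked[0]] - weights[ranked[1]]
        return error, margin

    # Weights printed for the first test example above.
    w = {"rich": -0.004819, "smart": -0.093118, "happy": -0.069115}
    print(evaluate(w, {"smart", "happy"}))  # (True, 0.064...) -> an error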
When the output will be read by another program, the "-o" option may be
dropped, as in:

    boostexter -C -S sample < sample.test

This produces the same information as above, but in the following terse
format:

    0 1 1 -0.004818921580 -0.093117718340 -0.069114***7280
    1 0 0 -0.183517493590 -0.285134738480 0.13***53186240

One line of output is printed for each example. Each line begins with a
sequence of bits which indicate the correct labels, followed by the
total weight associated with each label. (A parsing sketch is given at
the end of this README.)

Beginning with Version 2.1, training may be continued using the -p
option. For instance, in the example above, the command

    boostexter -n 5 -W 1 -N ngram -S sample -V -p

will cause boostexter to read in the hypothesis stored in sample.shyp
and to continue training, beginning with this pre-loaded model, for an
additional 5 rounds. Thus, the result will be essentially equivalent to
running boostexter from scratch for 15 rounds. Note that options such as
"-W 1 -N ngram" are not remembered from one run to the next, and need
not be the same on each run. Also note that the old .shyp file will be
overwritten. When printing out round-by-round information, rounds
numbered ..., -2, -1, 0 are rounds on which the old hypothesis is being
rebuilt, while the rounds in which new rules are being added are
numbered 1, 2, 3, ....
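Finally, returning to the terse classification output above: here is a
small Python sketch (hypothetical, assuming the k labels appear in
.names order) that parses one output line into correct-label bits and
label weights:

    # Sketch: parse one line of the terse "-C" output described above.
    # Each line holds k 0/1 bits marking the correct labels, then k
    # label weights, in the order the labels appear in the .names file.
    def parse_terse_line(line, k):
        tokens = line.split()
        bits = [int(b) for b in tokens[:k]]
        weights = [float(w) for w in tokens[k:2 * k]]
        return bits, weights

    # A line shaped like the output above (weights abbreviated).
    bits, weights = parse_terse_line("0 1 1 -0.0048 -0.0931 -0.0691", 3)
    predicted = max(range(3), key=lambda i: weights[i])
    print(bits, predicted)  # [0, 1, 1] 0 -> label 0 ("rich") predicted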
