-*- mode: text -*-

+----------------------------------------------------------------------+
| This archive contains a simple implementation of the Conditional     |
| Mutual Information Maximization for feature selection.               |
+----------------------------------------------------------------------+
| Written by François Fleuret                                          |
| Contact for comments & bug reports                                   |
| Copyright (C) 2004 EPFL                                              |
+----------------------------------------------------------------------+

$Id: README,v 1.3 2007-08-23 08:36:50 fleuret Exp $

0/ INTRODUCTION

  The CMIM feature selection scheme is designed to select a small
  number of binary features among a very large set, in a context of
  two-class classification. It consists of picking features one
  after another, so as to maximize the conditional mutual information
  between the candidate feature and the class to predict, given any
  one of the features already picked. Such a criterion picks features
  which are both individually informative and pairwise weakly
  dependent. CMIM stands for Conditional Mutual Information
  Maximization. A sketch of the corresponding greedy loop is given in
  the appendix at the end of this file. See

    Fast Binary Feature Selection with Conditional Mutual Information
    François Fleuret
    JMLR 5 (Nov): 1531--1555, 2004
    http://www.jmlr.org/papers/volume5/fleuret04a/fleuret04a.pdf

1/ INSTALLATION

  To compile and test, just type 'make test'.

  This small test consists of generating a sample set for a toy
  problem and testing CMIM, MIM and a random feature selection with
  the naive Bayesian learner. The two populations of the toy problem
  live in the [0, 1]^2 square: the positive population satisfies
  x^2 + y^2 < 1/4, and the negative population is everything else.
  Look at create_samples.cc for more details. The features are the
  responses of linear classifiers generated at random.

2/ DATA FILE FORMAT

  Each data file, for either training or testing, starts with the
  number of samples and the number of features. Then follow, for
  every single sample, two lines: one with the values of the features
  (0/1) and one with the value of the class to predict (0/1). Check
  the train.dat and test.dat generated by create_samples to get an
  example, or the sketch below.

  The test file has the same format, and the real class is used to
  estimate the error rates. During testing, the response of the naive
  Bayes before thresholding is saved in a result file (third
  parameter of the --test option).

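  To make the format concrete, here is a minimal sketch that writes a
  data set in this layout. It is an illustration, not code from this
  archive: the function name is ours, and space-separated 0/1 values
  on each line are an assumption, consistent with files produced by
  create_samples.

    #include <cstdio>
    #include <cstdlib>

    // Writes nb_samples samples of nb_features binary features each,
    // in the format described above. Hypothetical helper, not part
    // of the archive.
    void write_data_file(const char *name,
                         int nb_samples, int nb_features,
                         int **features, int *labels) {
      FILE *file = fopen(name, "w");
      if(!file) {
        fprintf(stderr, "Can not open %s for writing.\n", name);
        exit(1);
      }
      // First line: number of samples and number of features
      fprintf(file, "%d %d\n", nb_samples, nb_features);
      // Then two lines per sample: the features, then the class
      for(int s = 0; s < nb_samples; s++) {
        for(int f = 0; f < nb_features; f++)
          fprintf(file, "%d%c", features[s][f],
                  f < nb_features - 1 ? ' ' : '\n');
        fprintf(file, "%d\n", labels[s]);
      }
      fclose(file);
    }
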
3/ OPTIONS

  --silent

  Switches off all output to stdout.

  --feature-selection

  Selects the feature selection method.

  --classifier

  Selects the classifier type.

  --error

  Chooses which error to minimize during the bias estimation for the
  CMIM + naive Bayesian, where f(X) is the predicted class and Y the
  true one:

    standard = P(f(X) = 0, Y = 1) + P(f(X) = 1, Y = 0)

    ber = (P(f(X) = 0 | Y = 1) + P(f(X) = 1 | Y = 0)) / 2

  The first is the standard error rate, the second the balanced
  error rate, which gives the same weight to both classes.

  --nb-features

  Sets the number of features to select.

  --cross-validation

  Performs cross-validation.

  --train

  Builds a classifier and saves it to disk.

  --test

  Loads a classifier and tests it on a data set.

4/ LICENCE

  This program is free software; you can redistribute it and/or
  modify it under the terms of the GNU General Public License
  version 3 as published by the Free Software Foundation.

  This program is distributed in the hope that it will be useful, but
  WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
  General Public License for more details.

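5/ APPENDIX: A SKETCH OF THE SELECTION LOOP

  The following sketch shows the greedy loop described in the
  introduction. It is an illustration of the principle, not code from
  this archive: mutual_info(f), which should return an estimate of
  I(Y ; X_f) on the training set, and cond_mutual_info(f, g), which
  should return an estimate of I(Y ; X_f | X_g), are hypothetical
  helpers, and this is the plain version, without the
  lazy-evaluation speed-up described in the JMLR paper.

    double mutual_info(int f);             // estimate of I(Y ; X_f)
    double cond_mutual_info(int f, int g); // estimate of I(Y ; X_f | X_g)

    // Picks nb_selected features among nb_features and stores their
    // indexes in selected[]. Every feature keeps a score equal to
    // the minimum, over the features already picked, of its
    // conditional mutual information with the class, and the feature
    // with the highest score is picked at every step.
    void select_cmim(int nb_features, int nb_selected, int *selected) {
      double *score = new double[nb_features];

      // Before anything is picked, the score is I(Y ; X_f)
      for(int f = 0; f < nb_features; f++) score[f] = mutual_info(f);

      for(int k = 0; k < nb_selected; k++) {
        // Pick the feature with the highest current score
        int best = 0;
        for(int f = 1; f < nb_features; f++)
          if(score[f] > score[best]) best = f;
        selected[k] = best;
        score[best] = -1.0; // so that it is never picked again

        // Lower the remaining scores according to the new pick
        for(int f = 0; f < nb_features; f++) if(score[f] >= 0.0) {
          double s = cond_mutual_info(f, best);
          if(s < score[f]) score[f] = s;
        }
      }

      delete[] score;
    }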