-*- mode: text -*-

+----------------------------------------------------------------------+
| This archive contains a simple implementation of the Conditional     |
| Mutual Information Maximization for feature selection.               |
+----------------------------------------------------------------------+
| Written by François Fleuret                                          |
| Contact the author for comments & bug reports                        |
| Copyright (C) 2004 EPFL                                              |
+----------------------------------------------------------------------+

$Id: README,v 1.3 2007-08-23 08:36:50 fleuret Exp $

0/ INTRODUCTION

The CMIM feature selection scheme is designed to select a small number
of binary features among a very large set, in a context of two-class
classification. It consists of picking features one after another:
each new feature is the one which maximizes the minimum, taken over
the features already picked, of its mutual information with the class
to predict conditioned on that already-picked feature. Such a
criterion picks features which are both individually informative and
pairwise weakly dependent. CMIM stands for Conditional Mutual
Information Maximization. A naive sketch of the selection loop is
given in section 5/ below.

See

  Fast Binary Feature Selection with Conditional Mutual Information
  François Fleuret
  JMLR 5 (Nov): 1531--1555, 2004
  http://www.jmlr.org/papers/volume5/fleuret04a/fleuret04a.pdf

1/ INSTALLATION

To compile and test, just type 'make test'. This small test generates
a sample set for a toy problem and compares CMIM, MIM and a random
feature selection, each combined with the naive Bayesian learner.

The two populations of the toy problem live in the [0, 1]^2 square.
The positive population lies in the quarter-disc x^2 + y^2 < 1/4 and
the negative population is everything else. The features are the
responses of linear classifiers generated at random. Look at
create_samples.cc for the details (a simplified sketch is given in
section 5/).

2/ DATA FILE FORMAT

Each data file, either for training or testing, starts with the number
of samples and the number of features. Then, for every single sample,
follow two lines: one with the values of the features (0/1) and one
with the value of the class to predict (0/1). Check the train.dat and
test.dat generated by create_samples for an example, or see the
snippet in section 5/.

The test file has the same format, and the true class is used to
estimate the error rates. During testing, the response of the naive
Bayesian classifier before thresholding is saved in a result file (the
third parameter of the --test option).

3/ OPTIONS

--silent
  Switches off all output to stdout

--feature-selection
  Selects the feature selection method

--classifier
  Selects the classifier type

--error
  Chooses which error to minimize during the bias estimation for the
  CMIM + naive Bayesian classifier:
    standard = P(f(X) = 0, Y = 1) + P(f(X) = 1, Y = 0)
    ber      = (P(f(X) = 0 | Y = 1) + P(f(X) = 1 | Y = 0)) / 2

--nb-features
  Sets the number of features to select

--cross-validation
  Runs a cross-validation

--train
  Builds a classifier and saves it to disk

--test
  Loads a classifier and tests it on a data set

Tentative example invocations are given in section 5/.

4/ LICENCE

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License version 3 as
published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.
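
5/ EXAMPLES

The sketches below are illustrative only; none of the names in them
come from the code of this archive.

First, a naive C++ version of the greedy selection loop described in
section 0/. The real implementation in cmim.cc is much faster thanks
to a lazy-evaluation trick (see the JMLR paper); this version simply
keeps, for every candidate feature, the running minimum of its mutual
information with the class conditioned on each feature picked so far.

  // Naive CMIM selection, for illustration only. x[n][t] is the value
  // of feature n on sample t, y[t] the class of sample t, all in {0, 1}.

  #include <algorithm>
  #include <cmath>
  #include <vector>

  // Empirical conditional mutual information I(Y; Xn | Xm) in nats,
  // estimated from the joint counts of (Y, Xn, Xm)
  double cond_mi(const std::vector<int> &xn, const std::vector<int> &xm,
                 const std::vector<int> &y) {
    const double s = y.size();
    double n[2][2][2] = {};                      // n[y][xn][xm]
    for (size_t t = 0; t < y.size(); t++) n[y[t]][xn[t]][xm[t]]++;
    double result = 0.0;
    for (int m = 0; m < 2; m++) {                // condition on Xm = m
      double nm = 0, ny[2] = {0, 0}, nx[2] = {0, 0};
      for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++) {
          nm += n[a][b][m];
          ny[a] += n[a][b][m];
          nx[b] += n[a][b][m];
        }
      for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
          if (n[a][b][m] > 0)
            result +=
              n[a][b][m] / s * std::log(n[a][b][m] * nm / (ny[a] * nx[b]));
    }
    return result;
  }

  // Greedy CMIM: the k-th feature picked maximizes the minimum, over
  // the features already picked, of I(Y; candidate | picked)
  std::vector<int> select_cmim(const std::vector<std::vector<int>> &x,
                               const std::vector<int> &y, int k) {
    const int nb = x.size();
    std::vector<int> zeros(y.size(), 0);         // constant conditioning
    std::vector<double> score(nb);               // running min of cond. MI
    for (int n = 0; n < nb; n++)
      score[n] = cond_mi(x[n], zeros, y);        // equals plain I(Y; Xn)
    std::vector<int> picked;
    while (int(picked.size()) < k) {
      int best = 0;
      for (int n = 1; n < nb; n++)
        if (score[n] > score[best]) best = n;
      picked.push_back(best);
      score[best] = -1.0;                        // never pick it again
      for (int n = 0; n < nb; n++)
        if (score[n] >= 0.0)
          score[n] = std::min(score[n], cond_mi(x[n], x[best], y));
    }
    return picked;
  }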
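
Next, a simplified version of the toy-problem generator of section 1/.
The authoritative code is create_samples.cc; the constants, the use of
the POSIX drand48 generator and the exact separators in the output are
assumptions made for the sake of the example.

  // Simplified toy-problem generator, for illustration only. Positive
  // samples satisfy x^2 + y^2 < 1/4 inside the unit square; features
  // are thresholded random linear functions of (x, y).

  #include <cstdio>
  #include <cstdlib>

  int main() {
    const int nb_samples = 1000, nb_features = 100;
    double a[nb_features], b[nb_features], c[nb_features];
    for (int f = 0; f < nb_features; f++) {      // random linear classifiers
      a[f] = 2 * drand48() - 1;
      b[f] = 2 * drand48() - 1;
      c[f] = 2 * drand48() - 1;
    }
    printf("%d %d\n", nb_samples, nb_features);
    for (int s = 0; s < nb_samples; s++) {
      double x = drand48(), y = drand48();       // uniform in [0, 1]^2
      for (int f = 0; f < nb_features; f++)      // feature line (0/1)
        printf("%d%c", a[f] * x + b[f] * y >= c[f] ? 1 : 0,
               f < nb_features - 1 ? ' ' : '\n');
      printf("%d\n", x * x + y * y < 0.25 ? 1 : 0); // class line (0/1)
    }
    return 0;
  }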
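
The data file format of section 2/ in miniature: a hypothetical file
with 3 samples and 4 features could look as follows (whether the
values are separated by spaces should be checked against a generated
train.dat):

  3 4
  0 1 1 0
  1
  1 0 0 1
  0
  0 0 1 1
  1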
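
Finally, tentative invocations of the options of section 3/. The
binary name, the argument order and the option values below are all
guesses; check the Makefile and the source code for the authoritative
syntax. The only detail stated above is that the result file is the
third parameter of --test.

  # hypothetical: select 100 features with CMIM, train a naive
  # Bayesian classifier and save it
  ./cmim --feature-selection cmim --classifier bayesian \
         --nb-features 100 --train train.dat classifier.clf

  # hypothetical: reload the classifier, test it, and save the
  # pre-thresholding responses in result.dat
  ./cmim --test test.dat classifier.clf result.dat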