README

   1 -*- mode: text -*-
   2
   3 +----------------------------------------------------------------------+
   4 | This archive contains a simple implementation of the Conditional     |
   5 | Mutual Information Maximization for feature selection.               |
   6 +----------------------------------------------------------------------+
   7 | Written by François Fleuret                                          |
   8 | Contact <francois.fleuret@epfl.ch> for comments & bug reports        |
   9 | Copyright (C) 2004 EPFL                                              |
  10 +----------------------------------------------------------------------+
  11
  12 $Id: README,v 1.3 2007-08-23 08:36:50 fleuret Exp $
  13
  14 0/ INTRODUCTION
  15
  16   The CMIM feature selection scheme is designed to select a small
  17   number of binary features from a very large set in the context of
  18   two-class classification. It picks features one after another, each
  19   time maximizing the conditional mutual information between the
  20   candidate feature and the class to predict, given any one of the
  21   features already picked. Such a criterion picks features that are
  22   individually informative yet pairwise weakly dependent. CMIM
  23   stands for Conditional Mutual Information Maximization. See
  24
  25   Fast Binary Feature Selection with Conditional Mutual Information
  26   Francois Fleuret
  27   JMLR 5 (Nov): 1531--1555, 2004
  28   http://www.jmlr.org/papers/volume5/fleuret04a/fleuret04a.pdf
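
The selection rule described above can be sketched as follows. This is a
minimal, straightforward version written for illustration, not the fast
implementation from the paper or the code shipped in this archive. The
names Column, entropy, cond_mutual_info and cmim_select are hypothetical.
Each candidate keeps a score equal to the smallest conditional mutual
information seen so far over the features already picked, and the
candidate with the highest score is picked next.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One binary variable observed on every sample: values are 0 or 1.
using Column = std::vector<int>;

// Entropy in bits of the empirical distribution given by a table of counts.
template <std::size_t N>
double entropy(const std::array<double, N> &counts, double total) {
  double h = 0.0;
  for (double c : counts)
    if (c > 0.0) h -= (c / total) * std::log2(c / total);
  return h;
}

// Empirical I(Y; X | Z) in bits: H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z).
// With a constant z this reduces to the plain mutual information I(Y; X).
double cond_mutual_info(const Column &y, const Column &x, const Column &z) {
  assert(x.size() == y.size() && z.size() == y.size() && !y.empty());
  std::array<double, 8> xyz{};
  std::array<double, 4> xz{}, yz{};
  std::array<double, 2> zc{};
  for (std::size_t s = 0; s < y.size(); ++s) {
    xyz[4 * x[s] + 2 * y[s] + z[s]] += 1.0;
    xz[2 * x[s] + z[s]] += 1.0;
    yz[2 * y[s] + z[s]] += 1.0;
    zc[z[s]] += 1.0;
  }
  const double n = static_cast<double>(y.size());
  return entropy(xz, n) + entropy(yz, n) - entropy(zc, n) - entropy(xyz, n);
}

// Greedy selection. score[n] holds the minimum, over the features picked so
// far, of I(Y; X_n | X_picked); it starts at I(Y; X_n).
std::vector<std::size_t> cmim_select(const std::vector<Column> &features,
                                     const Column &y,
                                     std::size_t nb_selected) {
  assert(nb_selected <= features.size());
  const Column constant(y.size(), 0);
  std::vector<double> score(features.size());
  std::vector<bool> used(features.size(), false);
  for (std::size_t n = 0; n < features.size(); ++n)
    score[n] = cond_mutual_info(y, features[n], constant);

  std::vector<std::size_t> picked;
  for (std::size_t k = 0; k < nb_selected; ++k) {
    // Pick the unused feature with the highest current score.
    std::size_t best = features.size();
    for (std::size_t n = 0; n < features.size(); ++n)
      if (!used[n] && (best == features.size() || score[n] > score[best]))
        best = n;
    used[best] = true;
    picked.push_back(best);
    // Lower every remaining score with the information left given the pick.
    for (std::size_t n = 0; n < features.size(); ++n)
      if (!used[n])
        score[n] = std::min(
            score[n], cond_mutual_info(y, features[n], features[best]));
  }
  return picked;
}
```

In this sketch an exact copy of an already picked feature gets a score
close to zero, so a weaker but non-redundant feature is preferred over
it, whereas a ranking by plain mutual information would pick the copy.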
  29
  30 1/ INSTALLATION
  31
  32   To compile and test, just type 'make test'.
  33
  34   This small test generates a sample set for a toy problem and
  35   evaluates CMIM, MIM and random feature selection with the naive
  36   Bayesian classifier.  The two populations of the toy problem
  37   live in the [0, 1]^2 square. The positive population is in x^2+y^2 <
  38   1/4 and the negative population is everything else.  Look at
  39   create_samples.cc for more details.  The features are responses of
  40   linear classifiers generated at random.
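
As a rough illustration of the toy problem, the sketch below draws
points in the unit square, labels them with the quarter-disc rule and
computes binary features from random linear classifiers. It is not the
content of create_samples.cc. The uniform sampling of the points, the
way each random line is drawn, and the names LinearFeature, ToySample,
toy_label, random_features and make_samples are all assumptions made
for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical linear classifier: responds 1 when a*x + b*y >= c.
struct LinearFeature {
  double a, b, c;
};

struct ToySample {
  std::vector<int> features;  // binary responses, one per linear classifier
  int label;                  // 1 for the positive population, 0 otherwise
};

// Positive population: x^2 + y^2 < 1/4; everything else is negative.
int toy_label(double x, double y) { return x * x + y * y < 0.25 ? 1 : 0; }

// Assumption: each line has a uniformly random direction and passes through
// a uniformly random point of the square.
std::vector<LinearFeature> random_features(std::size_t nb, std::mt19937 &rng) {
  std::uniform_real_distribution<double> unit(0.0, 1.0);
  const double two_pi = 6.283185307179586;
  std::vector<LinearFeature> out;
  for (std::size_t i = 0; i < nb; ++i) {
    const double angle = two_pi * unit(rng);
    const double px = unit(rng);
    const double py = unit(rng);
    const double a = std::cos(angle), b = std::sin(angle);
    out.push_back(LinearFeature{a, b, a * px + b * py});
  }
  return out;
}

// Assumption: sample points are drawn uniformly in [0, 1]^2.
std::vector<ToySample> make_samples(std::size_t nb_samples,
                                    const std::vector<LinearFeature> &features,
                                    std::mt19937 &rng) {
  assert(!features.empty());
  std::uniform_real_distribution<double> unit(0.0, 1.0);
  std::vector<ToySample> samples;
  for (std::size_t s = 0; s < nb_samples; ++s) {
    const double x = unit(rng);
    const double y = unit(rng);
    ToySample sample;
    sample.label = toy_label(x, y);
    for (const LinearFeature &f : features)
      sample.features.push_back(f.a * x + f.b * y >= f.c ? 1 : 0);
    samples.push_back(sample);
  }
  return samples;
}
```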
  41
  42 2/ DATA FILE FORMAT
  43
  44   Each data file, either for training or testing, starts with the
  45   number of samples and the number of features. Then, for every
  46   sample, come two lines: one with the values of the features (0/1)
  47   and one with the value of the class to predict (0/1).  See the
  48   train.dat and test.dat files generated by create_samples for an
  49   example.
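
The layout described above could be written and read back along these
lines. This is only a sketch: it assumes that the two header counts and
the feature values are separated by whitespace, which should be checked
against a generated train.dat. The names DataSet, format_data and
parse_data are hypothetical and are not taken from the archive.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct DataSet {
  std::vector<std::vector<int>> features;  // features[s][f] is 0 or 1
  std::vector<int> labels;                 // labels[s] is 0 or 1
};

// Header with both counts, then two lines per sample: features, then class.
// Assumption: values on a line are separated by single spaces.
std::string format_data(const DataSet &d) {
  assert(d.features.size() == d.labels.size());
  const std::size_t nb_features = d.features.empty() ? 0 : d.features[0].size();
  std::ostringstream out;
  out << d.labels.size() << " " << nb_features << "\n";
  for (std::size_t s = 0; s < d.labels.size(); ++s) {
    for (std::size_t f = 0; f < nb_features; ++f)
      out << (f > 0 ? " " : "") << d.features[s][f];
    out << "\n" << d.labels[s] << "\n";
  }
  return out.str();
}

// Reads the same layout; returns false if the text ends too early or
// contains something that is not an integer.
bool parse_data(const std::string &text, DataSet &d) {
  std::istringstream in(text);
  std::size_t nb_samples = 0, nb_features = 0;
  if (!(in >> nb_samples >> nb_features)) return false;
  d.features.assign(nb_samples, std::vector<int>(nb_features, 0));
  d.labels.assign(nb_samples, 0);
  for (std::size_t s = 0; s < nb_samples; ++s) {
    for (std::size_t f = 0; f < nb_features; ++f)
      if (!(in >> d.features[s][f])) return false;
    if (!(in >> d.labels[s])) return false;
  }
  return true;
}
```

Under these assumptions, a set of two samples with three features each
is written as five lines: "2 3", then "0 1 1" and "1" for the first
sample, then "1 0 0" and "0" for the second.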
  50
  51   The test file has the same format, and the true class is used to
  52   estimate the error rates.  During testing, the response of the naive
  53   Bayes classifier before thresholding is saved in a result file (3rd
  54   parameter of the --test option).
  55
  56 3/ OPTIONS
  57
  58   --silent
  59
  60     Switches off all output to stdout
  61
  62   --feature-selection <random|mim|cmim>
  63
  64     Selects the feature selection method
  65
  66   --classifier <bayesian|perceptron>
  67
  68     Selects the classifier type
  69
  70   --error <standard|ber>
  71
  72     Chooses which error to minimize during bias estimation for the
  73     CMIM + naive Bayesian classifier.
  74
  75     standard = P(f(X) = 0, Y = 1) + P(f(X) = 1, Y = 0)
  76
  77     ber      = (P(f(X) = 0 | Y = 1) + P(f(X) = 1 | Y = 0))/2
  78
  79   --nb-features <int: nb of features>
  80
  81     Sets the number of features to select
  82
  83   --cross-validation <file: data set> <int: nb test samples> <int: nb loops>
  84
  85     Runs cross-validation on the given data set
  86
  87   --train <file: data set> <file: classifier>
  88
  89     Builds a classifier and saves it to disk
  90
  91   --test <file: classifier> <file: data set> <file: result>
  92
  93     Loads a classifier and tests it on a data set
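
The two error definitions given for the --error option can be computed
from predicted and true classes as in the sketch below. It only
illustrates the formulas and is not the program's own code. The names
ErrorRates and error_rates are hypothetical, and treating a missing
class as contributing zero to the balanced error is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ErrorRates {
  double standard;  // P(f(X) = 0, Y = 1) + P(f(X) = 1, Y = 0)
  double ber;       // (P(f(X) = 0 | Y = 1) + P(f(X) = 1 | Y = 0)) / 2
};

ErrorRates error_rates(const std::vector<int> &predicted,
                       const std::vector<int> &truth) {
  assert(predicted.size() == truth.size() && !truth.empty());
  double false_neg = 0, false_pos = 0, nb_pos = 0, nb_neg = 0;
  for (std::size_t s = 0; s < truth.size(); ++s) {
    if (truth[s] == 1) {
      nb_pos += 1;
      if (predicted[s] == 0) false_neg += 1;
    } else {
      nb_neg += 1;
      if (predicted[s] == 1) false_pos += 1;
    }
  }
  ErrorRates r;
  r.standard = (false_neg + false_pos) / (nb_pos + nb_neg);
  // Assumption: a class with no sample contributes 0 to the balanced error.
  const double miss_rate = nb_pos > 0 ? false_neg / nb_pos : 0.0;
  const double false_alarm_rate = nb_neg > 0 ? false_pos / nb_neg : 0.0;
  r.ber = 0.5 * (miss_rate + false_alarm_rate);
  return r;
}
```

With six positive and two negative samples, a classifier that always
answers 1 gets a standard error of 0.25 but a balanced error of 0.5,
which shows why the balanced error is the more telling measure when the
classes have very different sizes.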
  94
  95 4/ LICENCE
  96
  97   This program is free software; you can redistribute it and/or modify
  98   it under the terms of the GNU General Public License version 3 as
  99   published by the Free Software Foundation.
 100
 101   This program is distributed in the hope that it will be useful, but
 102   WITHOUT ANY WARRANTY; without even the implied warranty of
 103   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 104   General Public License for more details.