bonzaiboost documentation

Christian Raymond

January 26, 2017


Download bonzaiboost:

  1. bonzaiboost v1.6.9 binary for linux 64

    1. added flag "--info" to output information about models
    2. fixed bug in error estimation for multilabel tasks
    3. added "-prune" to prune the tree w.r.t. the data in .dev in tree mode, or with out-of-bag data in bagging mode
    4. corrected the "exclude_level" option with sgram mode
    5. fixed some bugs (and probably added some new ones)
    6. code refactoring
  2. bonzaiboost v1.6.4 binary for linux 64, for linux 32, for Mac OSX.

    1. fixed a bug introduced in v1.6.2 when using text features (no problem for AdaBoost.MH)
    2. added bagging generalisation error estimation (beta)
    3. the "-draw" flag now works in cross-validation
    4. fixed other minor bugs
  3. bonzaiboost v1.6.3 binary

    1. fixed a bug where the output in test mode printed the text index instead of the text itself
    2. fixed a rare bug when loading a model in boosting mode
    3. added the -fp option to format the prediction output
    4. fixed the "icsi" output
  4. bonzaiboost v1.6.2 binary

    1. fixed a big bug in text feature extraction present in v1.6
    2. added the "exclude_level" option for text attributes to ignore one or several levels of information
    3. added attribute-based cross-validation to control the folds manually
    4. all gains (entropy, gini, etc.) are now expressed as percentages (as are stopping criteria with the -g flag)
  5. bonzaiboost v1.6 binary

    1. speed and memory optimizations
    2. implementation of bagging (random forests)
    3. simple k-nearest neighbor implementation
    4. added option "-m" to bypass the .XXX output filenames
    5. models are saved in xml format
    6. added option "-format" to keep compatibility with old models
    7. added option "--discretize" to discretize numeric attributes according to MDLPC
    8. reorganized the help message
  6. bonzaiboost v1.4 binary for linux 64, for linux 32, for Mac OSX.

    1. part of the code has been rewritten with templates to facilitate maintenance
    2. N-fold cross-validation implemented (and parallelized)
    3. cutoff is available for numeric features
    4. simple feature selection methods are implemented, based on entropy, idf and mutual information
  7. bonzaiboost v1.2 binary for linux 64, for linux 32.

    1. all parallel mechanisms have been rewritten with OpenMP
    2. some code has been modified to significantly reduce memory consumption
    3. boosting weights are saved at the end of the boosting procedure (AdaBoost.MH only) for fast resuming
    4. some file format conversion processes are introduced (arff (weka) and libsvm-like)
    5. some bugs are fixed
      1. the "-leaf" option now works in test mode
      2. weights assigned to examples in the data file now work for trees and AdaBoost.MH
      3. incorrect feature selection for non-pertinent attributes
  8. bonzaiboost v1.1 binary for linux 64.

    1. some bugs fixed
  9. bonzaiboost v1.0 binary for linux 64.

Contents

1 What is bonzaiboost ?
2 Command line and options
3 Using bonzaiboost
4 Input/Output files and their formats
 4.1 “names” file (stem.names)
 4.2 Data and test file (stem.data and stem.test)
 4.3 Detailed Example
 4.4 Particular options
  4.4.1 Specific ngram extraction on multiple levels text input
5 Use cases
 5.1 Tree mode
  5.1.1 Training
  5.1.2 Testing
 5.2 Boosting mode
  5.2.1 Training
  5.2.2 Testing
  5.2.3 Feedback from the model
 5.3 Bagging mode
  5.3.1 Training
  5.3.2 Testing
 5.4 Knn mode
 5.5 Output flags
 5.6 Cross-validation experiments
6 F.A.Q
 6.1 Why should I use bonzaiboost instead of boostexter/icsiboost?
 6.2 What can I do to increase speed when processing huge amounts of data?
 6.3 I have my data in a file format different from the bonzaiboost format
 6.4 How can I do a clean interruption?
7 Citing bonzaiboost
8 Publications where bonzaiboost is applied

1 What is bonzaiboost ?

bonzaiboost is a general purpose machine-learning program based on decision trees for building a classifier from text and/or attribute-value data.
Currently one configuration of bonzaiboost is ranked first on http://mlcomp.org. bonzaiboost can handle:

Note that most existing implementations of boosting use it on top of “decision stumps”; bonzaiboost can learn general decision trees and apply boosting on top of them.

2 Command line and options

Usage: bonzaiboost -S stem [options]:

*** Training options:
  
--- Tree options:
-g <gain> stop criterion:min gain
-d <depth> stop criterion:max depth
-p <proba> stop criterion: stop if the prediction is >=proba (default:1)
-cr <criterion> gain criterion: gini,entropie or m5 (for ordered numeric output labels only)
-leaf <nbexemples> stop criterion:minimum number of examples in a leaf
-mdlpc use maximum description length as stopping criterion
-ep <file> when -cr is set to error, you can specify a weight for each error in "file"
--- Meta learning options:
-boost <algo> run boosting with algo: ada AdaBoost adam1 AdaBoost.M1 adamh AdaBoost.MH
-bagging run bagging algorithm
-n <nblearners> number of learners for boosting
-rr <mode> random reduction of features (no,sqrt,log,fixed number)
-bs <]0,100[> % of data kept for the bag (default:62)
-resume continue learning
--- Kppv options:
-kppv <k> output majority vote with the k-nearest neighbors
-knppv <k> output median vote with the k-nearest neighbors (for ordered numeric target labels)
-dist <distance> distance among (scalar,euclidienne,cosinus,deft)
-pond <weighting> among (okapi,tfidf,deft)
*** Testing options:
-C turn on classification mode: reads examples from <stdin>
-o <mode> output classif mode among (normal,all,[0,1],backoff) (see 5.5)
-oracle <nbest> compute oracle error with nbest size
-c <single/multi> output single class (default for TREE) or multi class (default for boosting)
-fp <mode> format prediction (classical:default or icsi)
-draw <measure> output a gnuplot file according to measure (recall or precision,fmesure,cer) and number of rounds
*** Threads options:
-jobs <nbthreadsmax> max running threads
-split <nbexamples> process input data in parallel to extract features
-pt <nbfeatures> parallelize textual attributes if features number is > to nbfeatures (boosting only)
-pa process attributes in parallel (boosting only)
*** Others options:
-v <[0 - 6]> verbose mode
-discretize produce <stem>.discret files with all numeric attributes discretized according to the MDLPC criterion
-cross <nbtest> perform cross validation on the .data file using nbtest test samples for each fold (1=leave-one-out)
-convert <mode> convert to/from other data format
--version display version information
--examples display command line examples
-format <format> model format among (raw,xml:default)
--info output an html file summarizing what bonzaiboost has learnt
-m bypass <stem>.XXX filenames for files created by bonzaiboost
-h this help

3 Using bonzaiboost

Here are instructions on how to use the program “bonzaiboost” to build a classifier for text and/or attribute-value classification, and how to classify new instances.

bonzaiboost works with data which may be of various forms. In general, each instance is broken into multiple fields. These fields may be of four types: a continuous-valued attribute (such as “age”), a discrete-valued attribute (such as “eye color”), a text string (such as “body of an email message”), or a “scored text” string (in which scores are associated with each word of the text, such as “tf-idf” scores used in information retrieval).

bonzaiboost learns decision trees and can combine them with boosting or bagging. A binary feature is attached to each node of the tree. Features are tests built from the training data and have one of the following forms, depending on the type of the input field associated with them:

4 Input/Output files and their formats

bonzaiboost receives several files as input and produces output files. In training mode the program reads a names file and a data file used for training. The stem of all the files is the same and is given via the run-time parameter -S. For example, if bonzaiboost is called with “-S sample”, then the following files will be used or created:

  1. used
    1. sample.names : names (description) file
    2. sample.data : input training file
  2. created depending on the algorithm invoked
    1. sample.X.xml : model dump with X among {bagging,boost,tree} depending on the invoked algorithm
    2. sample.X.eval.html : html report of classification error statistics (in test mode, when “-C” is invoked)

In classification mode (-C parameter turned on) the description file (<stem>.names) and the model file (<stem>.X.xml) are read, depending on the selected algorithm, and the test data is read from the standard input (stdin). Detailed per-label error rate statistics are printed to the standard error, and the per-example predictions are printed to the standard output.

4.1 “names” file (stem.names)

The names file defines the format of the data to be read in. White space is ignored throughout this file. The first line of the names file specifies the possible class labels of the data instances. It has the form:

<label_1>, <label_2>, ... , <label_n>.

where each <label_i> is any string of letters or numbers. Certain punctuation marks, including comma and period, may be problematic. Case is significant. (Likewise for all other string names below.)
The remaining lines specify the form and type of the input fields. For continuous-valued input fields, the form is:

<name>: continuous.

where <name> is any string naming the field.
For discrete-valued input fields, the form is:

<name>: <value_1>, <value_2>, ... , <value_k>.

where <name> is any string naming the field, and <value_1> , ... , <value_k> are strings representing the possible values which can be assumed by this attribute. For text input fields, the form is:

<name>: text.

where <name> is any string naming the field.
For scored-text input fields, the form is:

<name>: scoredtext.

where <name> is any string naming the field.

A detailed example is given below.

4.2 Data and test file (stem.data and stem.test)

These files each describe a sequence of labeled examples. Each example has the following form:
<field_1>, <field_2>, ... , <field_n>, <label_1> <label_2> ... <label_x>.
Here, <field_i> specifies the value of the i-th field, where the ordering of the fields is as specified in the names file. If this field is continuous, then a real number should appear. If this field is discrete, then the value must be one of the values (strings) specified in the names file. If the field is a text string, then any string of words may appear. If the field is a scored text string, then a real number must follow each word in the string.
Each <label_i> is one of the labels specified in the first line of the names file. These labels are the “correct”or desired labels associated with this example.

4.3 Detailed Example

Here is a toy example in which the goal is to predict if a person is rich, smart or happy. Instances describe individual people.
Here is an example “names” file (table 1), with the three classes and a description of the fields. It is included in this distribution as the file “sample.names”, as are the other files mentioned below.




rich, smart, happy.
age: continuous.
income: continuous.
sex: male, female.
highest-degree: none, high-school, bachelors, masters, phd.
goal-in-life: text: expert_length=1 expert_level=1 expert_type=ngram cutoff=0.
hobbies: scoredtext.



Table 1: Example of a .name file

The scores appearing in the “hobbies” field would encode the number of hours per week spent on each of a list of hobbies. The interpretation of the other fields should be obvious.
Here is an example “data” (training) file, called “sample.data” (table 2).



34, 40000, female, bachelors, to find inner peace, pottery 3 photography 1, smart happy.
40, 100000, male, high-school, to be a millionaire, movies 7, rich.
29, 80000, male, phd, win turing prize, reading 8 stamp-collecting 2, smart.
59, 50000, female, phd, win pulitzer prize, reading 40, smart happy.
21, 25000, male, high-school, have a big family, tv 8 fishing 2, happy.


Table 2: Example of a .data file

4.4 Particular options

For each field you may want to set specific options. These options must be given in the names file at the end of the field description. They are the following:

  1. Feature selection:
    1. cutoff=x : features not appearing at least x times in the training data are not considered (default=0); for numeric fields the behavior is different: the numeric value is rounded to x digits after the decimal point.
    2. cutentropy=x : features with an entropy gain under x are eliminated (in general set > 0)
    3. cutidf=x : features with an idf over x are eliminated (in general set in [0 - 1])
    4. cutim=x : features with a mutual information over x are eliminated (in general set < 0.001)
  2. for text fields only:
    1. expert_type=x: type of the generated features for learning, where x is one of:
      • sgram - all sparse word-grams up to a maximal length
      • ngram - all full word-grams up to a maximal length
      • fgram - full word-grams of maximal length only
    2. expert_length=x: x is the length of the gram (default=1)
    3. expert_length_min=x: x is the minimum length of the gram (default=1)
    4. expert_level=x: number of levels in the text field (default=1), see the next section:
    5. exclude_level=x: don’t consider level x (repeat this option to exclude several levels):

4.4.1 Specific ngram extraction on multiple levels text input

When working with a text field you may want to provide several levels of word information in addition to the word itself. To do so, you have to alternate the words with the other levels and tell the names file that the field must be read accordingly. For example, word ngram patterns may not be general enough, while POS patterns may be too unspecific. You can give both levels and all possible combinations will be evaluated by the algorithm, which selects the best one:
“have a big family” would be enriched with POS information as “have VERB a DET big ADJ family CN”.
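For illustration, a minimal sketch of the corresponding names file entry, assuming a two-level field (word + POS) and bigram extraction; the exact option values depend on your data:

  goal-in-life: text: expert_length=2 expert_level=2 expert_type=ngram cutoff=0.

The corresponding field in the data file would then contain the enriched string, e.g. “have VERB a DET big ADJ family CN”.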

5 Use cases

5.1 Tree mode

5.1.1 Training

Using the example “sample”, run:
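  bonzaiboost -S sample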

Figure 1 shows a screen copy of the run result. The construction of each tree node prints information (you can adjust the verbosity level with the parameter “-v”, up to 5):

  1. the first line gives two pieces of information: how many examples of the database are used in the current node, and the depth
  2. the second line indicates, for each field in the .names file, the number of features extracted.
  3. the other lines print information about individual features: the gain (according to the selected measure, entropy or gini), the error made by the split according to this feature, and the feature name
    1. - - -: indicates the feature currently being evaluated (only with verbosity level > 4)
    2. - x -: indicates the currently selected feature (only with verbosity level > 3)
    3. x - x: indicates the best feature selected for the field (only with verbosity level > 2)
    4. x x x: indicates the best feature selected among all fields.



Figure 1: Information printed on screen after invoking the command line “bonzaiboost -S sample”


Several options can be set to control the tree construction and when to stop it:

5.1.2 Testing

Using the example “sample”, run:
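(a sketch; the redirection of stdout to tmpfile is assumed here, matching the description below)

  bonzaiboost -S sample -C < sample.data > tmpfile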

In tmpfile you will find sample.data with the prediction made by the tree at the end of each line, see figure 2. Detailed error statistics for all labels are printed on stderr, as in figure 3.



Figure 2: stdout Output of “bonzaiboost -S sample -C < sample.data”




Figure 3: stderr Output of “bonzaiboost -S sample -C < sample.data”


When the classification mode is invoked, two files are built:

  1. stem.tree.eval.html: this file contains evaluation statistics, see figure 4:
    1. A summary of the options used for the evaluation
    2. The per-label error statistics
    3. A label confusion matrix indicating, through shades of red, the main confusions made by the classifier
  2. stem.tree.dot: information to produce a graphical representation of the tree using the Graphviz library (http://www.graphviz.org/). You can see an example in figure 5, generated with the command line “cat sample.dot | dot -Tpng > sample.png”.



Figure 4: Html evaluation statistic report




Figure 5: Tree visualization using the dot library; each node indicates the selected feature, each leaf indicates the population in the node (#examples) with the majority label and its probability.


As you saw, this example is a multi-label classification problem where we want to assign several labels to an instance. Even if decision trees are not really designed to solve this kind of problem, you may want to assign all labels which have a high probability (at least 0.5). You can simply switch bonzaiboost to multi-class behavior by setting the flag -c to multi, and put a threshold on the probability by setting the flag -o to 0.5:
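  bonzaiboost -S sample -C -o 0.5 -c multi < sample.data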

The new output is presented in figure 6.


Figure 6: stdout and stderr output of “bonzaiboost -S sample -C -o 0.5 -c multi < sample.data”


The same options can be set in classification mode as in training mode (together with the parameter “-C”); if the tree has been learned with less constrained options, the constraints are enforced for testing. This is also useful to control the dot file produced to visualize the tree. The tree can be very big; if you want to see only the first nodes, you can invoke the command:
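(a sketch; the depth limit of 2 matches the description below)

  bonzaiboost -S sample -C -d 2 < sample.data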

A new sample.tree.dot file will then be produced, but all nodes deeper than 2 will be removed (the other parameters have the same kind of effect).

5.2 Boosting mode

5.2.1 Training

A typical training command is:
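  bonzaiboost -S sample -boost adamh -n 2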

Figure 7 shows the stderr output.


Figure 7: stdout output of “bonzaiboost -S sample -boost adamh -n 2”


5.2.2 Testing

Testing is similar to the tree algorithm: just use the “-C” parameter. You can also adjust the “-n” parameter to control the number of rounds you want to use for testing.
Typical command:
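  bonzaiboost -S sample -boost adamh -C < sample.data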

Figure 8 shows the stderr output.



Figure 8: stdout+stderr outputs of “bonzaiboost -S sample -boost adamh -C < sample.data”


5.2.3 Feedback from the model

With the following command:
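(a sketch; assuming the same boosting options used for training, combined with the “--info” flag documented in section 2)

  bonzaiboost -S sample -boost adamh --info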

bonzaiboost will generate the file sample.boost.log.html, which summarizes what it has learnt.


Figure 9: tables that summarize what bonzaiboost has learnt: the first table shows the selection frequency of each attribute during the boosting phase; the second table shows, on each line, the rule learned and the vote it casts for each target class (small votes between -0.5 and 0.5 have been omitted)


5.3 Bagging mode

5.3.1 Training

A typical training command is:
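(a sketch; the -bagging and -n flags are documented above, the number of trees is only an example)

  bonzaiboost -S sample -bagging -n 100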

5.3.2 Testing

Testing is similar to the tree algorithm: just use the “-C” parameter. You can also adjust the “-n” parameter to control the number of trees you want to use for testing.
Typical command:
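(a sketch, mirroring the boosting example above)

  bonzaiboost -S sample -bagging -C < sample.data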

5.4 Knn mode

Knn is a model-less algorithm, so you have to invoke it using the “-C” flag with the test data directly:
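(a sketch; the -kppv flag is documented above, k=3 is only an example)

  bonzaiboost -S sample -kppv 3 -C < sample.test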

5.5 Output flags

In classification mode you can also use the “-o” parameter to control the kind of decision output you want; printing AND error computation are done according to this parameter:

This summarizes what happens in the default classification output mode for each algorithm: the decision tree outputs at most one label, while the boosting algorithm is multi-class and may output more than one label. You can modify this default behavior and make the boosting algorithm mono-class or the decision tree multi-class by setting the flag “-c” to “single” or “multi” in classification mode “-C”. Table 3 summarizes the behavior of bonzaiboost according to the parameters -o, -c and the mode (boost or tree).







  1. -o normal (default)
    1. Tree or bagging, -c single (default): select the label with the highest score
    2. Tree or bagging, -c multi: default behavior
    3. Boosting, -c single: select the label whose best score is >= 0
    4. Boosting, -c multi (default): select all labels whose score is >= 0
  2. -o backoff
    1. -c single (both modes): select the label with the best score
    2. -c multi (both modes): select all labels whose score is >= 0; if no label is selected, choose the label with the best score
  3. -o all
    1. both modes, both -c values: print all labels with their scores; error statistics are computed according to the default behavior
  4. -o set to a real value in [0,1]
    1. -c single (both modes): select the label with the best score if that score is >= the given value
    2. -c multi (both modes): select all labels whose score is >= the given value


Table 3: Processing according to the -o and -c output parameters and the mode (tree/bagging or boosting); error statistics are computed according to the considered output. Empty cells mean the default behavior.

5.6 Cross-validation experiments

You can run cross-validation experiments by adding the flag “-cross” followed by the number of samples wanted for the test set of each fold:
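(a sketch; 1 test sample per fold corresponds to leave-one-out, as documented for the -cross option)

  bonzaiboost -S sample -cross 1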

You may want to choose your own folds by using an attribute encoding the fold, then run:

6 F.A.Q

6.1 Why should I use bonzaiboost instead of boostexter/icsiboost?

Using shallow trees as weak learners for boosting works noticeably better than using decision stumps. Compare these two configurations:
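(a sketch; assuming -d 1 corresponds to decision stumps, the number of rounds is only an example)

  bonzaiboost -S stem -boost adamh -d 1 -n 1000
  bonzaiboost -S stem -boost adamh -d 2 -n 1000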

6.2 What can I do to increase speed when processing huge amounts of data?

The bonzaiboost code has been parallelized; threads can be invoked using the “-jobs” switch. Different settings can be adjusted to match your specific conditions.

  1. Extracting features: before running any algorithm, bonzaiboost extracts features from the data. When extracting Ngram features with a large N, this first operation can be slow.
    → recommended settings:
    1. if you have few attributes: option -split
      Take a look at the following command: bonzaiboost -S stem -jobs 2 -split 50000
      2 threads will be invoked and the data will be split into packets of 50000 examples. These settings are pertinent if you have 100000 input examples in your data file.
    2. if you have a lot of attributes: option -pa
      Attributes will be processed in parallel by as many threads as specified by the -jobs option
  2. Tree algorithm: tree induction is parallelized; use "-jobs" to control the max number of threads
  3. Bagging algorithm: the induction of each tree is not parallelized any more, but the trees are induced in parallel: "-jobs" controls the max number of simultaneous trees
  4. Boosting algorithm: tree induction is not parallelized any more; since boosting is an iterative algorithm, rounds can’t be parallelized. However, if each round is slow because of a high number of features and/or attributes, you can proceed as follows (see also the example command after this list):
    1. High number of attributes: when running any algorithm, bonzaiboost looks for the best feature at each tree node and/or boosting round. Selection is done by examining every attribute. In the presence of a high number of attributes, or of several attributes from which a lot of features have been extracted, you can ask for parallel processing of the attributes by invoking the flag “-pa”.
    2. High number of text features: sometimes a huge amount of textual features (several million) is extracted from textual attributes when the Ngram size is high, and finding the best feature among them can be slow. You can parallelize this processing by setting the flag “-pt”.
      Take a look at the following command: bonzaiboost -S stem -jobs 2 -pt 10000000
      to find the best feature among the features of a textual attribute, 2 threads will be invoked each time a textual attribute counts more than 10000000 features
    3. High number of text features: another solution is the following: instead of extracting N-grams with a large N from one attribute only, duplicate this attribute N times and extract fgrams of length 1 to N in each attribute with the -pa option. This solution is probably faster than the previous one because feature extraction is parallelized too.
  5. Cross validation: only folds are processed in parallel: use "-jobs" to control the max number of simultaneous folds
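For example, a combined threading invocation might look like the following (a sketch; the flags are documented above, the values are only examples):

  bonzaiboost -S stem -boost adamh -n 500 -jobs 4 -pa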

6.3 I have my data in a file format different from the bonzaiboost format

Simple format conversion processes have been implemented. You can convert two widely used formats:

  1. arff format (used by weka)
  2. libsvm sparse like format

If you have training and testing data files, you can convert them with command lines like the following:
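(a sketch; the exact mode names accepted by -convert are an assumption here — list them with “bonzaiboost -S sample -convert ?” as explained below)

  bonzaiboost -S sample -convert arff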

Inverse conversions are also possible: run “bonzaiboost -S sample -convert ?” to list all possible conversions.

6.4 How can I do a clean interruption?

  1. You can properly stop the bonzaiboost execution (in tree, boosting and bagging modes) with a single Control-C; you will have to wait for the end of the currently running node (tree mode), the currently running weak learner (boosting mode) or the currently running trees (bagging mode), and a model will be saved.
  2. A second Control-C will stop immediately, but nothing is saved.
  3. If you want to continue the execution, execute the same command line as before and add the flag -resume.

This is useful if you want to change some parameters:
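(a sketch; here the number of boosting rounds is increased before resuming, the values are only examples)

  bonzaiboost -S sample -boost adamh -n 200      (interrupted with one Control-C)
  bonzaiboost -S sample -boost adamh -n 500 -resume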

7 Citing bonzaiboost

If you use bonzaiboost for a scientific work, please refer to [Laurent et al., 2014].

8 Publications where bonzaiboost is applied

bonzaiboost itself:

Publications where bonzaiboost is applied

References

[Breiman, 2001]   Breiman, L. (2001). Random forests. Machine Learning, 45:5–32.

[Fayyad and Irani, 1993]   Fayyad, U. M. and Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Thirteenth International Joint Conference on Artificial Intelligence, volume 2, pages 1022–1027. Morgan Kaufmann Publishers.

[Laurent et al., 2014]   Laurent, A., Camelin, N., and Raymond, C. (2014). Boosting bonsai trees for efficient features combination : application to speaker role identification. In InterSpeech, Singapour.

[Quinlan, 1992]   Quinlan, R. J. (1992). Learning with continuous classes. In 5th Australian Joint Conference on Artificial Intelligence, pages 343–348, Singapore. World Scientific.

[Schapire and Singer, 2000]   Schapire, R. E. and Singer, Y. (2000). BoosTexter: A boosting-based system for text categorization. Machine Learning, 39:135–168. http://www.cs.princeton.edu/~schapire/boostexter.html.