C4.5(1)

NAME

c4.5 - form a decision tree from a file of examples

SYNOPSIS

c4.5 [ -f filestem ] [ -u ] [ -s ] [ -p ] [ -v verb ] [ -t trials ] [ -w wsize ] [ -i incr ] [ -g ] [ -m minobjs ] [ -c cf ]

DESCRIPTION

C4.5 is a program for inducing classification rules in the form of decision trees from a set of given examples.

All files read and written by C4.5 are of the form filestem.ext where filestem is a file name stem that identifies the induction task and ext is an extension that defines the type of file. The program expects to find at least two files: a names file filestem.names defining class, attribute and attribute value names, and a data file filestem.data containing a set of objects, each of which is described by its values of each of the attributes and its class.

The program can generate trees in two ways. In batch mode (the default), the program generates a single tree using all the available data. In iterative mode, the program starts with a randomly-selected subset of the data (the window), generates a trial decision tree, adds some misclassified objects, and continues until the trial decision tree correctly classifies all objects not in the window or until it appears that no progress is being made. Since iterative mode starts with a randomly-selected subset, multiple trials with the same data can be used to generate more than one tree.
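The iterative mode described above can be sketched as a loop. This is an illustrative outline only, not the program's actual code: the names windowing_trial, build_tree and memorize are hypothetical, and the max_rounds cap stands in for the "no progress" check, whose exact criterion is not specified here.

```python
import random


def windowing_trial(cases, build_tree, window_size, increment,
                    seed=0, max_rounds=100):
    """Run one iterative-mode trial over (attributes, label) pairs.

    build_tree(window) is a hypothetical learner that returns a
    function mapping attributes to a predicted label.
    """
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)  # the window starts as a random subset
    window, outside = shuffled[:window_size], shuffled[window_size:]

    tree = build_tree(window)
    for _ in range(max_rounds):  # assumed stand-in for the no-progress test
        right, wrong = [], []
        for case in outside:
            (wrong if tree(case[0]) != case[1] else right).append(case)
        if not wrong:
            break  # every object outside the window is classified correctly
        window.extend(wrong[:increment])  # add some misclassified objects
        outside = right + wrong[increment:]
        tree = build_tree(window)
    return tree, window


def memorize(window):
    """Toy learner: remember the window, otherwise predict class 0."""
    table = dict(window)
    return lambda attrs: table.get(attrs, 0)


cases = [(i, i % 3) for i in range(30)]
tree, window = windowing_trial(cases, memorize, window_size=6, increment=2)
```

Because the initial window is drawn at random, running the sketch with different seed values yields different windows, which mirrors why several trials on the same data can produce more than one tree.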

All trees generated in the process are saved in filestem.unpruned. After each tree is generated, it is pruned in an attempt to simplify it. The "best" pruned tree (selected by the program if there is more than one trial) is saved in machine-readable form in filestem.tree.

All trees produced, both pre- and post-simplification, are evaluated on the training data. If required, they can also be evaluated on unseen data in the file filestem.test.

FILE FORMATS

The names file filestem.names is a series of entries defining names of attributes, attribute values and classes. The file is free-format with the exception that the vertical bar "|" causes the remainder of that line to be ignored. Each entry is terminated by a period which may be omitted if it is the last character of a line.

The file commences with the names of the classes, separated by commas and terminated with a period. Each name consists of a string of characters that does not include comma, question mark or colon (unless preceded by a backslash). A period may be embedded in a name provided it is not followed by a space. Embedded spaces are also permitted but multiple whitespace is replaced by a single space.

The rest of the file consists of a single entry for each attribute. An attribute entry begins with the attribute name followed by a colon, and then either the word "ignore" (indicating that this attribute should not be used), the word "continuous" (indicating that the attribute has real values), the word "discrete" followed by an integer n (indicating that the program should assemble a list of up to n possible values), or a list of all possible discrete values separated by commas. (The latter form for discrete attributes is recommended as it enables input to be checked.) Each entry is terminated with a period (but see above).

The data file filestem.data contains one line per object. Each line contains the values of the attributes in order followed by the object's class, with all entries separated by commas. The rules for valid names in the names file also hold for the names in the data file. An unknown value of an attribute is indicated by a question mark "?". If a test file filestem.test is used, it has the same format as the data file.
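As an illustration of the formats above, the following sketch holds a small names file and data file as strings and splits one data line into attribute values and a class. The weather task, its attribute names and the helper parse_data_line are all hypothetical; the parser is a simplified reading of the rules stated here, not the program's own input routine.

```python
import re

# Hypothetical names file: classes first, then one entry per attribute.
NAMES_TEXT = """\
| Text after a vertical bar is ignored.
yes, no.

outlook: sunny, overcast, rain.
temperature: continuous.
windy: true, false.
"""

# Hypothetical data file: attribute values in order, then the class.
DATA_TEXT = """\
sunny, 85, false, no
overcast, ?, true, yes
"""


def parse_data_line(line):
    """Return (attribute_values, class_name) for one data line.

    Unknown values ("?") become None. A comma preceded by a
    backslash is kept as part of the name.
    """
    fields = re.split(r"(?<!\\),", line.strip())
    values = []
    for field in fields:
        text = " ".join(field.split())  # collapse runs of whitespace
        if text == "?":
            values.append(None)  # unknown attribute value
        else:
            values.append(re.sub(r"\\(.)", r"\1", text))  # drop the escape
    return values[:-1], values[-1]


objects = [parse_data_line(line) for line in DATA_TEXT.splitlines()]
```

Here objects holds one (values, class) pair per line of DATA_TEXT, with the unknown temperature in the second line represented as None.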

OPTIONS

Options and their meanings are:

-f filestem Specify the filename stem (default DF).
-u Evaluate trees produced on unseen cases in file filestem.test.
-s Force "subsetting" of all tests based on discrete attributes with more than two values. C4.5 will construct a test with a subset of values associated with each branch.
-p Probabilistic thresholds used for continuous attributes (see Quinlan, 1987a).
-t trials Set iterative mode with specified number of trials.
-v verb Set the verbosity level [0-3] (default 0). This option generates more voluminous output that may help to explain what the program is doing (but don't count on it); see the manual entry for verbose.
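The options above can be combined on one command line. The sketch below only assembles an argument list from the flags documented here; the helper name c45_command and the stem "golf" are hypothetical, and actually running the result assumes a c4.5 executable is installed and on the search path.

```python
def c45_command(filestem, unseen=False, subsetting=False,
                probabilistic=False, trials=None, verbosity=None):
    """Build an argument list for c4.5 from the documented options."""
    args = ["c4.5", "-f", filestem]
    if unseen:
        args.append("-u")  # also evaluate on filestem.test
    if subsetting:
        args.append("-s")  # force subsetting of discrete tests
    if probabilistic:
        args.append("-p")  # probabilistic thresholds
    if trials is not None:
        args += ["-t", str(trials)]  # iterative mode with this many trials
    if verbosity is not None:
        args += ["-v", str(verbosity)]  # verbosity level 0-3
    return args


command = c45_command("golf", unseen=True, trials=10)
```

The resulting list corresponds to the command line c4.5 -f golf -u -t 10 and could be handed to subprocess.run if the program is available.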

The following options are also available but need not be used except for experimentation with tree construction:

-w wsize Set the size of the initial window (default is the maximum of 20 percent of the number of data objects and twice the square root of the number of data objects).
-i incr Set the maximum number of objects that can be added to the window at each iteration (default is 20 percent of the initial window size).
-g Use the gain criterion to select tests. The default uses the gain ratio criterion.
-m minobjs In all tests, at least two branches must contain a minimum number of objects (default 2). This option allows the minimum number to be altered.
-c cf Set the pruning confidence level (default 25%).
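The defaults for -w and -i given above amount to simple arithmetic, sketched below. The function names are hypothetical, and the results are left as real numbers because the way the program rounds them is not stated here.

```python
import math


def default_window_size(n_objects):
    """Larger of 20 percent of the objects and twice their square root."""
    return max(n_objects / 5, 2 * math.sqrt(n_objects))


def default_increment(window_size):
    """20 percent of the initial window size."""
    return window_size / 5


small = default_window_size(16)    # square-root term dominates
large = default_window_size(400)   # 20 percent term dominates
step = default_increment(large)
```

The two terms are equal at 100 data objects; below that the square-root term is the larger one, above it the 20 percent term is.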

FILES

c4.5
filestem.data
filestem.names
filestem.unpruned (unpruned trees)
filestem.tree (final decision tree)
filestem.test (unseen data)

SEE ALSO

consult(1)

BUGS