Factors Affecting Golf

Purpose: to illustrate, with a simple golf example, how the C4.5 and C4.5rules programs function.

Problem: given the training instances below, use C4.5 and C4.5rules to generate rules for when to play, and when not to play, a game of golf.

Training Data

Outlook    Temperature  Humidity  Windy  Class (Play = positive, Don't Play = negative)
sunny      85           85        false  Don't Play
sunny      80           90        true   Don't Play
overcast   83           78        false  Play
rain       70           96        false  Play
rain       68           80        false  Play
rain       65           70        true   Don't Play
overcast   64           65        true   Play
sunny      72           95        false  Don't Play
sunny      69           70        false  Play
rain       75           80        false  Play
sunny      75           70        true   Play
overcast   72           90        true   Play
overcast   81           75        false  Play
rain       71           80        true   Don't Play

The column headers - the attribute names - become part of "golf.names", the filestem.names file.
The subsequent rows - the training instances - are entered into "golf.data", the filestem.data file.
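
As a rough illustration of how the table maps onto the two input files, the Python sketch below builds the text of "golf.names" and "golf.data". The layout is an assumption based on the usual C4.5 conventions (class names on the first line, then one "attribute: values." line per attribute, with "continuous" for numeric attributes, and one comma-separated instance per line in the data file); it is not a copy of the files used for this example. The helper names names_text and data_text are hypothetical.

```python
# Training instances copied from the table above:
# (outlook, temperature, humidity, windy, class)
ROWS = [
    ("sunny", 85, 85, "false", "Don't Play"),
    ("sunny", 80, 90, "true", "Don't Play"),
    ("overcast", 83, 78, "false", "Play"),
    ("rain", 70, 96, "false", "Play"),
    ("rain", 68, 80, "false", "Play"),
    ("rain", 65, 70, "true", "Don't Play"),
    ("overcast", 64, 65, "true", "Play"),
    ("sunny", 72, 95, "false", "Don't Play"),
    ("sunny", 69, 70, "false", "Play"),
    ("rain", 75, 80, "false", "Play"),
    ("sunny", 75, 70, "true", "Play"),
    ("overcast", 72, 90, "true", "Play"),
    ("overcast", 81, 75, "false", "Play"),
    ("rain", 71, 80, "true", "Don't Play"),
]


def names_text():
    """Return an assumed golf.names layout: classes first, then attributes."""
    lines = [
        "Play, Don't Play.",
        "",
        "outlook: sunny, overcast, rain.",
        "temperature: continuous.",
        "humidity: continuous.",
        "windy: true, false.",
    ]
    return "\n".join(lines) + "\n"


def data_text(rows):
    """Return one comma-separated line per training instance."""
    return "".join(", ".join(str(v) for v in row) + "\n" for row in rows)


if __name__ == "__main__":
    # Write both files next to the script so they share the file stem "golf".
    with open("golf.names", "w") as f:
        f.write(names_text())
    with open("golf.data", "w") as f:
        f.write(data_text(ROWS))
```

Running the script writes both files to the current directory, where they can be referred to by the file stem "golf".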

Command-Line Syntax

The following commands were executed to produce results at the default verbosity level
(% represents the prompt; it is not typed):

"-f [filestem]" specifies the file stem to be used.

It is often convenient to save the output of these programs for later review by redirecting it to a file with the "greater than" operator (">") on the command line, as follows:

Additionally, higher verbosity levels may be specified to obtain statistical data calculated at runtime:

"-v [1-3]" specifies the verbosity level to be used from a scale of 1 to 3. Omitting this switch forces C4.5 and C4.5rules to use the default verbosity level.

Finally, both C4.5 and C4.5rules can be run with a command-line switch that incorporates test instances into the analysis, if available (see Example 3):

"-u" tells C4.5 and C4.5rules to use the unseen test instances in the "filestem.test" file following evaluation on the training data.

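To make the switch combinations concrete, the following Python sketch assembles command lines from the "-f", "-v" and "-u" switches and the ">" redirection described above. The executable names "c4.5" and "c4.5rules", the helper build_command, and the output file name are assumptions made for illustration; the sketch only builds strings and does not run the programs.

```python
def build_command(program, stem, verbosity=None, unseen=False, output=None):
    """Assemble a command line from the switches described in this section.

    program   -- assumed executable name, e.g. "c4.5" or "c4.5rules"
    stem      -- file stem passed with -f
    verbosity -- optional level 1-3 passed with -v (omit for the default)
    unseen    -- add -u to evaluate on filestem.test as well
    output    -- optional file name to redirect output to with ">"
    """
    parts = [program, "-f", stem]
    if verbosity is not None:
        if verbosity not in (1, 2, 3):
            raise ValueError("verbosity must be 1, 2 or 3")
        parts += ["-v", str(verbosity)]
    if unseen:
        parts.append("-u")
    command = " ".join(parts)
    if output:
        command += " > " + output
    return command


if __name__ == "__main__":
    print(build_command("c4.5", "golf"))
    print(build_command("c4.5rules", "golf"))
    # Hypothetical output file name, chosen only for this example.
    print(build_command("c4.5", "golf", verbosity=2, output="golf.tree.v2"))
    print(build_command("c4.5", "golf", unseen=True))
```

Keeping the switches as separate arguments makes it easy to generate one command per verbosity level when comparing outputs.
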
Downloadable Files

The following files were generated using the above commands for the purpose of illustrating the differences between the different verbosity levels:

As you can see, the output at higher verbosity levels contains much more quantitative data than the output at lower levels.
Since we are primarily concerned with qualitative results, however, only the output generated at the default verbosity level is examined in detail below.

Diagram

[Figure: decision tree for the golf data]
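
Every rule in the Summary tests outlook, which is consistent with outlook being the most informative attribute in the training data. The Python sketch below illustrates this with plain information gain for the two discrete attributes, outlook and windy. It is a simplification: C4.5 bases its choice on an entropy-derived criterion (gain ratio by default) and also evaluates thresholds on the continuous attributes, neither of which is reproduced here. The function names are hypothetical.

```python
from collections import Counter
from math import log2

# (outlook, windy, class) for the 14 training instances.
ROWS = [
    ("sunny", "false", "Don't Play"),
    ("sunny", "true", "Don't Play"),
    ("overcast", "false", "Play"),
    ("rain", "false", "Play"),
    ("rain", "false", "Play"),
    ("rain", "true", "Don't Play"),
    ("overcast", "true", "Play"),
    ("sunny", "false", "Don't Play"),
    ("sunny", "false", "Play"),
    ("rain", "false", "Play"),
    ("sunny", "true", "Play"),
    ("overcast", "true", "Play"),
    ("overcast", "false", "Play"),
    ("rain", "true", "Don't Play"),
]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of the class minus the weighted entropy after splitting on column."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in {r[column] for r in rows}:
        subset = [r[-1] for r in rows if r[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


if __name__ == "__main__":
    print("gain(outlook) = %.3f bits" % information_gain(ROWS, 0))
    print("gain(windy)   = %.3f bits" % information_gain(ROWS, 1))
```

On this data the gain for outlook is several times larger than the gain for windy, largely because every overcast instance belongs to the Play class.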

Interpreting Output at the Default Verbosity Level

Preliminaries

[Figure: contents of golf.names]
[Figure: contents of golf.data]

C4.5 Results

[Figure: C4.5 output showing the resulting decision tree]

The output generated by C4.5 at the default verbosity level is interpreted as follows:

C4.5rules Results

[Figure: C4.5rules output showing the resulting decision rules]

The output generated by C4.5rules at the default verbosity level is interpreted as follows:

Summary

Rule 1 suggests that if "outlook = sunny" and "humidity > 75" then "Don't Play".
Rule 2 suggests that if "outlook = overcast" then "Play".
Rule 3 suggests that if "outlook = rain" and "windy = true" then "Don't Play".
Rule 4 suggests that if "outlook = rain" and "windy = false" then "Play".
Otherwise, "Play" is the default class.
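
The four rules and the default class can be checked against the training data with a short Python sketch. The function classify is a hypothetical hand translation of the rules as stated above, not output from C4.5rules. Because the four rule conditions are mutually exclusive, the order in which they are tested does not affect the result.

```python
# (outlook, temperature, humidity, windy, class) for the 14 training instances.
TRAINING = [
    ("sunny", 85, 85, False, "Don't Play"),
    ("sunny", 80, 90, True, "Don't Play"),
    ("overcast", 83, 78, False, "Play"),
    ("rain", 70, 96, False, "Play"),
    ("rain", 68, 80, False, "Play"),
    ("rain", 65, 70, True, "Don't Play"),
    ("overcast", 64, 65, True, "Play"),
    ("sunny", 72, 95, False, "Don't Play"),
    ("sunny", 69, 70, False, "Play"),
    ("rain", 75, 80, False, "Play"),
    ("sunny", 75, 70, True, "Play"),
    ("overcast", 72, 90, True, "Play"),
    ("overcast", 81, 75, False, "Play"),
    ("rain", 71, 80, True, "Don't Play"),
]


def classify(outlook, humidity, windy):
    """Apply Rules 1-4 from the Summary, falling back to the default class."""
    if outlook == "sunny" and humidity > 75:   # Rule 1
        return "Don't Play"
    if outlook == "overcast":                  # Rule 2
        return "Play"
    if outlook == "rain" and windy:            # Rule 3
        return "Don't Play"
    if outlook == "rain" and not windy:        # Rule 4
        return "Play"
    return "Play"                              # default class


def count_errors(instances):
    """Number of instances whose predicted class differs from the recorded class."""
    return sum(
        1
        for outlook, _temperature, humidity, windy, label in instances
        if classify(outlook, humidity, windy) != label
    )


if __name__ == "__main__":
    print("misclassified training instances:", count_errors(TRAINING))
```

Note that temperature is not used by any rule, and that the two sunny instances with humidity of 75 or less are the only ones covered by the default class.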