Factors Affecting Sunburn

Purpose: to revisit the sunburn example with C4.5 and C4.5rules.

Problem: given the training instances below, use C4.5 and C4.5rules to generate rules that indicate which of the four given attributes may affect sunburn.

Training Data

Name    Hair    Height   Weight   Lotion  Result
Sarah   blonde  average  light    no      sunburned (positive)
Dana    blonde  tall     average  yes     none (negative)
Alex    brown   short    average  yes     none
Annie   blonde  short    average  no      sunburned
Emily   red     average  heavy    no      sunburned
Pete    brown   tall     heavy    no      none
John    brown   average  heavy    no      none
Katie   blonde  short    light    yes     none

Files for Downloading


sunburn tree

Screenshots at the Default Verbosity Level


sunburn 1
sunburn 2

C4.5 Results

sunburn 3
The resulting decision tree.

Here we can clearly see how errors are encountered in C4.5.

Apparently, 2 out of the 5 instances belonging to the path "if lotion = no, then sunburned" have been misclassified.

Looking at the training data, we may presume that these instances are Pete and John, the only two people who did not use lotion and were not sunburned.

Name    Hair    Height   Weight   Lotion  Result
Pete    brown   tall     heavy    no      none
John    brown   average  heavy    no      none
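
The count of 2 errors out of 5 can be checked by hand, or with a short sketch like the one below. It assumes the tree has only two leaves ("lotion = yes" gives none, "lotion = no" gives sunburned); the first leaf is inferred from the rule output rather than read from the screenshot. The names TRAINING and predict are illustrative and are not part of C4.5.

```python
# Training instances from the table: (name, hair, height, weight, lotion, result).
TRAINING = [
    ("Sarah", "blonde", "average", "light", "no", "sunburned"),
    ("Dana", "blonde", "tall", "average", "yes", "none"),
    ("Alex", "brown", "short", "average", "yes", "none"),
    ("Annie", "blonde", "short", "average", "no", "sunburned"),
    ("Emily", "red", "average", "heavy", "no", "sunburned"),
    ("Pete", "brown", "tall", "heavy", "no", "none"),
    ("John", "brown", "average", "heavy", "no", "none"),
    ("Katie", "blonde", "short", "light", "yes", "none"),
]


def predict(lotion):
    """Assumed two-leaf tree: lotion = yes -> none, lotion = no -> sunburned."""
    return "none" if lotion == "yes" else "sunburned"


# Instances that follow the path "lotion = no".
no_lotion = [row for row in TRAINING if row[4] == "no"]

# Names of the instances on that path whose actual result differs from the leaf.
misclassified = [row[0] for row in no_lotion if predict(row[4]) != row[5]]

print(len(no_lotion), misclassified)  # 5 ['Pete', 'John']
```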

It may come to your attention that both Pete and John have brown hair and are heavyweights. Has C4.5 made an error? Did it overlook the obvious when constructing the decision tree? Shouldn't there be additional attribute tests to correctly classify these two instances?

The answer is no. Keep Occam's Razor in mind and remember that overfitting the data is not desirable. Remember that we are trying to minimize the entropy, not the problems associated with it.
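
To make the entropy remark concrete, here is a minimal sketch that computes the information gain and the gain ratio of each attribute on the eight training instances. It covers only the textbook formulas; it does not model C4.5's pruning, its minimum-instances setting, or any other option, so it should not be read as a reproduction of what the program does internally. The names ROWS, ATTRIBUTES, entropy and gain_and_ratio are made up for this example.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ("hair", "height", "weight", "lotion")

# (hair, height, weight, lotion, result) for each training instance.
ROWS = [
    ("blonde", "average", "light", "no", "sunburned"),   # Sarah
    ("blonde", "tall", "average", "yes", "none"),        # Dana
    ("brown", "short", "average", "yes", "none"),        # Alex
    ("blonde", "short", "average", "no", "sunburned"),   # Annie
    ("red", "average", "heavy", "no", "sunburned"),      # Emily
    ("brown", "tall", "heavy", "no", "none"),            # Pete
    ("brown", "average", "heavy", "no", "none"),         # John
    ("blonde", "short", "light", "yes", "none"),         # Katie
]


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_and_ratio(rows, index):
    """Information gain and gain ratio for the attribute at column `index`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[index], []).append(row[-1])
    # Weighted entropy of the class labels after splitting on the attribute.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy([row[-1] for row in rows]) - remainder
    # Split information: entropy of the attribute's own value distribution.
    split_info = entropy([row[index] for row in rows])
    return gain, gain / split_info


for i, name in enumerate(ATTRIBUTES):
    gain, ratio = gain_and_ratio(ROWS, i)
    print(f"{name:<7} gain={gain:.3f}  gain_ratio={ratio:.3f}")
```

On this data, hair has the largest information gain (about 0.45) while lotion has the largest gain ratio (about 0.36), which is at least consistent with a tree that tests lotion first.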

To reduce the number of errors, however, be aware that an error-prone decision tree is the direct result of supplying C4.5 with error-prone training data.

C4.5rules Results

sunburn 4
The resulting decision rules.


Rule 1 suggests that if "lotion = yes" then the person will not be sunburned.
Otherwise, by default, the person will be sunburned.
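
A rule set of this kind (one or more rules followed by a default class) can be applied as in the sketch below. The data structure and the classify function are hypothetical illustrations, not the format that C4.5rules writes to disk, and the two test instances are invented.

```python
# Rule 1 as a (conditions, class) pair, followed by the default class.
RULES = [({"lotion": "yes"}, "none")]
DEFAULT_CLASS = "sunburned"


def classify(instance, rules=RULES, default=DEFAULT_CLASS):
    """Return the class of the first rule whose conditions all match, else the default."""
    for conditions, label in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return label
    return default


# Two invented instances, not taken from the training data.
print(classify({"hair": "red", "lotion": "yes"}))   # none
print(classify({"hair": "brown", "lotion": "no"}))  # sunburned
```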