## Factors Affecting Sunburn

Purpose: to revisit the sunburn example with C4.5 and C4.5rules.

Problem: given the training instances below, use C4.5 and C4.5rules to generate rules indicating which of the four given attributes may affect sunburn.

Training Data

| Name  | Hair   | Height  | Weight  | Lotion | Result               |
|-------|--------|---------|---------|--------|----------------------|
| Sarah | blonde | average | light   | no     | sunburned (positive) |
| Dana  | blonde | tall    | average | yes    | none (negative)      |
| Alex  | brown  | short   | average | yes    | none                 |
| Annie | blonde | short   | average | no     | sunburned            |
| Emily | red    | average | heavy   | no     | sunburned            |
| Pete  | brown  | tall    | heavy   | no     | none                 |
| John  | brown  | average | heavy   | no     | none                 |
| Katie | blonde | short   | light   | yes    | none                 |
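Before running the tool, it can help to recompute C4.5's splitting criteria by hand. The sketch below is a minimal plain-Python implementation of the standard entropy, information-gain, and gain-ratio formulas (the function and variable names are our own, not C4.5's); note that C4.5 selects splits by gain ratio, not raw gain:

```python
from collections import Counter
from math import log2

# Training instances: (hair, height, weight, lotion, result)
data = [
    ("blonde", "average", "light",   "no",  "sunburned"),
    ("blonde", "tall",    "average", "yes", "none"),
    ("brown",  "short",   "average", "yes", "none"),
    ("blonde", "short",   "average", "no",  "sunburned"),
    ("red",    "average", "heavy",   "no",  "sunburned"),
    ("brown",  "tall",    "heavy",   "no",  "none"),
    ("brown",  "average", "heavy",   "no",  "none"),
    ("blonde", "short",   "light",   "yes", "none"),
]
ATTRS = ["hair", "height", "weight", "lotion"]

def entropy(values):
    """Shannon entropy of a list of discrete values, in bits."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def info_gain(idx):
    """Reduction in class entropy from splitting on attribute `idx`."""
    labels = [row[-1] for row in data]
    remainder = sum(
        len(subset) / len(data) * entropy(subset)
        for value in {row[idx] for row in data}
        for subset in [[row[-1] for row in data if row[idx] == value]]
    )
    return entropy(labels) - remainder

def gain_ratio(idx):
    """C4.5's criterion: information gain normalized by split information."""
    return info_gain(idx) / entropy([row[idx] for row in data])

for i, name in enumerate(ATTRS):
    print(f"{name:8s} gain={info_gain(i):.3f}  gain ratio={gain_ratio(i):.3f}")
```

On this data, hair has the highest raw information gain, but lotion has the highest gain ratio, so lotion is a natural first split for C4.5.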

## Diagram

Screenshots at the Default Verbosity Level

Preliminaries

 sunburn.names
 sunburn.data
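For reference, the two input files might look as follows. This is a hypothetical encoding in the standard C4.5 file format: the `.names` file lists the classes on its first line, then one `attribute: values.` line per attribute, and each `.data` row gives the attribute values followed by the class:

```
sunburn.names:

    none, sunburned.

    hair:   blonde, brown, red.
    height: short, average, tall.
    weight: light, average, heavy.
    lotion: no, yes.

sunburn.data:

    blonde, average, light,   no,  sunburned
    blonde, tall,    average, yes, none
    brown,  short,   average, yes, none
    blonde, short,   average, no,  sunburned
    red,    average, heavy,   no,  sunburned
    brown,  tall,    heavy,   no,  none
    brown,  average, heavy,   no,  none
    blonde, short,   light,   yes, none
```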

C4.5 Results

 The resulting decision tree.

Here we can clearly see how errors are encountered in C4.5.

Apparently, two of the five instances belonging to the path "if lotion = no, then sunburned" have been misclassified.

Looking at the training data, we may presume that these instances were Pete and John, the only two people who did not use lotion and yet were not sunburned.

| Name | Hair  | Height  | Weight | Lotion | Result |
|------|-------|---------|--------|--------|--------|
| Pete | brown | tall    | heavy  | no     | none   |
| John | brown | average | heavy  | no     | none   |

It may come to your attention that Pete and John both have brown hair and heavy weight. Has C4.5 made an error? Did it overlook the obvious when constructing the decision tree? Shouldn't there be additional attribute tests to correctly classify these two instances?

The answer is no. Keep Occam's Razor in mind and remember that overfitting the data is not desirable. We are trying to minimize entropy with a small, general tree, not to eliminate every training error at the cost of overfitting.

Be aware, however, that an error-prone decision tree is often the direct result of supplying C4.5 with noisy or inconsistent training data; the most effective way to reduce errors is to clean up the data itself.

C4.5rules Results

 The resulting decision rules.

Summary

Rule 1 suggests that if "lotion = yes" then the person will not be sunburned.
Otherwise, by default, the person will be sunburned.
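The rule set above can be applied by hand to verify the error count. The sketch below is a minimal Python rendering of the two rules (the `classify` helper and the tuple layout are our own, for illustration); running it over the training data identifies exactly the two misclassified instances discussed earlier:

```python
# (name, lotion, actual result) for the eight training instances
data = [
    ("Sarah", "no",  "sunburned"), ("Dana",  "yes", "none"),
    ("Alex",  "yes", "none"),      ("Annie", "no",  "sunburned"),
    ("Emily", "no",  "sunburned"), ("Pete",  "no",  "none"),
    ("John",  "no",  "none"),      ("Katie", "yes", "none"),
]

def classify(lotion):
    """Rule 1: lotion = yes -> none. Default class: sunburned."""
    return "none" if lotion == "yes" else "sunburned"

misclassified = [name for name, lotion, actual in data
                 if classify(lotion) != actual]
print(misclassified)  # the two no-lotion, non-sunburned instances
```

This reproduces the two errors on the "lotion = no" path: Pete and John.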