Factors Affecting Sunburn
Purpose: to revisit the sunburn example with C4.5 and C4.5rules.
Problem: given the training instances below, use C4.5 and C4.5rules to generate rules indicating which of the four given attributes may affect sunburn.
Training Data
Name | Hair | Height | Weight | Lotion | Result |
Sarah | blonde | average | light | no | sunburned (positive) |
Dana | blonde | tall | average | yes | none (negative) |
Alex | brown | short | average | yes | none |
Annie | blonde | short | average | no | sunburned |
Emily | red | average | heavy | no | sunburned |
Pete | brown | tall | heavy | no | none |
John | brown | average | heavy | no | none |
Katie | blonde | short | light | yes | none |
Files for Downloading
- sunburn.names
  - Class, attribute, and value names.
- sunburn.data
  - Training data.
- The resulting decision tree.
  - sunburn.dt: default verbosity level.
- The resulting decision rules.
  - sunburn.r: default verbosity level.
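For readers without the downloads, the two input files can be approximated from the training table. The sketch below is a hypothetical reconstruction, not the original files: it assumes the standard C4.5 layout (class names on the first line of the .names file, one `attribute: values.` line per attribute, and comma-separated cases with the class last in the .data file), and it assumes the Name column is left out because it identifies a person rather than describing one. The attribute names and the helper names `data_text` and `write_files` are illustrative.

```python
# Hypothetical reconstruction of the C4.5 input files from the training table.
# The real sunburn.names / sunburn.data may differ in naming and ordering.

NAMES = """\
sunburned, none.

hair: blonde, brown, red.
height: short, average, tall.
weight: light, average, heavy.
lotion: yes, no.
"""

# One tuple per person: hair, height, weight, lotion, class.
ROWS = [
    ("blonde", "average", "light", "no", "sunburned"),   # Sarah
    ("blonde", "tall", "average", "yes", "none"),        # Dana
    ("brown", "short", "average", "yes", "none"),        # Alex
    ("blonde", "short", "average", "no", "sunburned"),   # Annie
    ("red", "average", "heavy", "no", "sunburned"),      # Emily
    ("brown", "tall", "heavy", "no", "none"),            # Pete
    ("brown", "average", "heavy", "no", "none"),         # John
    ("blonde", "short", "light", "yes", "none"),         # Katie
]


def data_text(rows):
    """Return the .data file body: one comma-separated case per line."""
    return "".join(",".join(row) + "\n" for row in rows)


def write_files(stem="sunburn"):
    """Write <stem>.names and <stem>.data in the current directory."""
    with open(stem + ".names", "w") as f:
        f.write(NAMES)
    with open(stem + ".data", "w") as f:
        f.write(data_text(ROWS))


if __name__ == "__main__":
    write_files()
```

With files like these in place, the programs are normally run with the file stem, as in `c4.5 -f sunburn` followed by `c4.5rules -f sunburn`.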
Diagram
Screenshots at the Default Verbosity Level
Preliminaries
sunburn.names
sunburn.data
C4.5 Results
The resulting decision tree.
Here we can clearly see how errors are encountered in C4.5.
Apparently, 2 out of the 5 instances belonging to the path "if lotion = no, then sunburned" have been misclassified.
Looking at the training data, we may presume that these instances were Pete and John, the only two people to not use lotion and not be sunburned.
Name | Hair | Height | Weight | Lotion | Result |
Pete | brown | tall | heavy | no | none |
John | brown | average | heavy | no | none |
It may come to your attention that both Pete and John have brown hair and are heavyweights. Has C4.5 made an error? Did it overlook the obvious when constructing the decision tree? Shouldn't there be additional attribute tests to correctly classify these two instances?
The answer is no. Keep Occam's Razor in mind and remember that overfitting the data is not desirable: a simpler tree that tolerates a few training errors is preferable to a larger one built to fit every instance. Remember that C4.5 chooses its tests to reduce entropy; it does not set out to classify every training instance correctly.
If you do want fewer errors, be aware that an error-prone decision tree is the direct result of supplying C4.5 with error-prone training data.
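To make the entropy argument concrete, the sketch below computes information gain and gain ratio (C4.5's default splitting criterion) for each attribute at the root, using the eight training instances. It is a simplified illustration only: the real program applies further heuristics and pruning that are omitted here, and the function and variable names are assumptions made for this example.

```python
from collections import Counter
from math import log2

ATTRS = ("hair", "height", "weight", "lotion")

# hair, height, weight, lotion, class (same order as the training table).
ROWS = [
    ("blonde", "average", "light", "no", "sunburned"),   # Sarah
    ("blonde", "tall", "average", "yes", "none"),        # Dana
    ("brown", "short", "average", "yes", "none"),        # Alex
    ("blonde", "short", "average", "no", "sunburned"),   # Annie
    ("red", "average", "heavy", "no", "sunburned"),      # Emily
    ("brown", "tall", "heavy", "no", "none"),            # Pete
    ("brown", "average", "heavy", "no", "none"),         # John
    ("blonde", "short", "light", "yes", "none"),         # Katie
]


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def gain_and_ratio(rows, col):
    """Return (information gain, gain ratio) for splitting on column col."""
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    # Expected entropy of the class after the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the attribute's own value distribution.
    split_info = entropy([r[col] for r in rows])
    return gain, gain / split_info


scores = {a: gain_and_ratio(ROWS, i) for i, a in enumerate(ATTRS)}
for attr, (gain, ratio) in scores.items():
    print(f"{attr:7s} gain={gain:.3f}  gain_ratio={ratio:.3f}")
```

On this data, hair has the largest plain information gain (about 0.45 bits against about 0.35 for lotion), while lotion has the largest gain ratio (about 0.36 against about 0.32 for hair). That is consistent with a tree that tests lotion first.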
C4.5rules Results
The resulting decision rules.
Summary
Rule 1 suggests that if "lotion = yes" then the person will not be sunburned.
Otherwise, by default, the person will be sunburned.
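The rule set above is small enough to check by hand against the training data. The following sketch encodes Rule 1 plus the default class as a plain function (the name `classify` is illustrative, and only the Lotion and Result columns are carried over from the table) and lists the training instances it gets wrong.

```python
# (name, lotion, actual result) taken from the training table.
TRAINING = [
    ("Sarah", "no", "sunburned"),
    ("Dana", "yes", "none"),
    ("Alex", "yes", "none"),
    ("Annie", "no", "sunburned"),
    ("Emily", "no", "sunburned"),
    ("Pete", "no", "none"),
    ("John", "no", "none"),
    ("Katie", "yes", "none"),
]


def classify(lotion):
    """Rule 1: lotion = yes -> none; otherwise fall back to the default class."""
    if lotion == "yes":
        return "none"
    return "sunburned"


errors = [name for name, lotion, result in TRAINING if classify(lotion) != result]
no_lotion = [name for name, lotion, _ in TRAINING if lotion == "no"]

print(f"{len(errors)} of {len(no_lotion)} no-lotion instances misclassified: {errors}")
```

The two misclassified instances are Pete and John, matching the 2 errors out of 5 reported for the "lotion = no" path of the decision tree.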