Machine Learning/Inductive Inference/Rough Sets

References:

Z. Pawlak, "Rough Sets: Theoretical Aspects of Reasoning about Data," in Theory and Decision Library: Series D, vol. 9, W. Leinfellner and G. Eberlein eds, Kluwer Academic Publishers, 1991.
Xiaohua Hu and Nick Cercone, "Learning in Relational Databases: A Rough Set Approach," in Computational Intelligence, v.11, no.2 (1995), pp. 323-338.

Terminology from Rough Set Theory

Information Systems

Formally, an information system is a triple S = (U, A, V) where U = {x₁,...,x_n} is a (finite) set of objects (called the universe), and A is a set of attributes for which each object of the universe has a value (from the domain set V). The set of attributes is partitioned into the condition attributes (C), and the decision attributes (D). In the context of machine learning, there is typically only one decision attribute, which takes a value of true/false to indicate positive examples and negative examples.

Indiscernibility

A pair of objects x_i, x_j are indiscernible over a set of attributes B iff they have the same value for every attribute of B. Obviously, objects that are indiscernible over B will also be indiscernible over any subset of B. However, objects that were indiscernible over B may become discernible over a superset of B. In the context of machine learning, where we want to create simple rules, we will be interested in eliminating "redundant" attributes, which have little or no effect on the discernibility of objects.

Approximation Spaces

Indiscernibility over a set of attributes B defines an equivalence relation on the universe of objects. The lower approximation of a target set X (denoted B_*(X)) is the union of all the equivalence classes that are fully contained in X. The upper approximation of X (denoted B^*(X)) is the union of all equivalence classes that intersect X. Clearly, B_*(X) ⊆ X ⊆ B^*(X).

Consider the information system presented in the following table:

Condition Attributes					Decision Attributes
Name	Hair	Height	Weight	Lotion	Result
Sarah	blonde	average	light	no	sunburned (positive)
Dana	blonde	tall	average	yes	none (negative)
Alex	brown	short	average	yes	none
Annie	blonde	short	average	no	sunburned
Emily	red	average	heavy	no	sunburned
Pete	brown	tall	heavy	no	none
John	brown	average	heavy	no	none
Katie	blonde	short	light	yes	none

In this case, the objects of the universe are people (Sarah, Dana, Alex, Annie, Emily, Pete, John, Katie). Suppose we wish to describe the set X of people who did get a sunburn (X = {Sarah, Annie, Emily}). When using the set of attributes B = C = {Hair, Height, Weight, Lotion}, we find that each equivalence class has exactly one element, so we can define X precisely as B_*(X) = X = B^*(X) = [Sarah] ∪ [Annie] ∪ [Emily]. If, however, we consider only the attribute Lotion, then there are only two equivalence classes:

[Sarah] = {Sarah, Annie, Emily, Pete, John}
[Dana] = {Dana, Alex, Katie}

Observe that we now cannot define X precisely, because [Sarah] contains X. We can, however, make the observation that [Dana] is disjoint from X, so we have the following:
∅ = B_*(X) ⊂ X ⊂ B^*(X) = [Sarah] ⊂ U.
In terms of rules, the above indicates that not wearing lotion may (but not necessarily) lead to sunburn, and that wearing lotion prevents sunburn.

Reduct of Attributes

In order to simplify the learned rules, we will be interested in eliminating condition attributes. First, we will look for absolute reducts, which are subsets B of the set C of condition attributes such that B preserves the indiscernibility equivalence classes of C. In the "sunburn" example above, we observe that taking B = {Height, Weight, Lotion} (i.e. eliminating Hair) leaves every element of the universe distinct. Other absolute reducts include the following:

{Hair, Height, Lotion}
{Hair, Height, Weight}

Consider, however, that for our purposes, absolute reducts are too strict. That is, we do not care if we make formerly discernible objects to become indiscernible, as long as they have the same value for the decision attribute(s). For example, eliminating the attribute Height makes Pete and John indiscernible, but since neither of them got a sunburn, we have not lost any ability to predict a sunburn. Thus, B = {Hair, Weight, Lotion} is a relative reduct of C with respect to D. Notice that Height is in every absolute reduct, but it need not be in a relative reduct. Next, we will discuss these concepts formally.

Positive Regions and Dependency

Let S = (U, A, V) be an information system, with A = C ∪ D and B ⊆ C.
Let IND(B) denote the set of equivalence classes of U with respect to B.
Let IND(D) denote the set of equivalence classes of U with respect to D.

Define the positive region of B in IND(D) as

POS_B(D) = ∪{ B_*(X) : X ∈ IND(D) }.

That is, POS_B(D) includes all objects that can be sorted into classes of IND(D) without error, based solely on the classification information in IND(B).

Furthermore, we say that the set of attributes D depends in degree k on the subset R of C if

k(R, D) = card(POS_R(D)) ÷ card(POS_C(D))

Clearly, k(R, D) ≤ k(C, D) ≤ 1. Also, the following are equivalent to each other:

k(R, D) = k(C, D).
R is a reduct of C with respect to D.
POS_R(D) = POS_C(D).

Also, since we have a numerical description of dependency, we can speak of near-perfect reducts (i.e. k(R, D) ≈ k(C, D)), so that we may discard even more attributes if they have little effect on the dependency.

Computing the Best Reduct

In general, the problem of finding all reducts is NP-hard. Therefore, we will consider a greedy algorithm for finding the "best" reduct.

Now, it is known that there is a (possibly empty) subset of attributes, called the core, which is common to all reducts. Furthermore, it can be shown that CO = CORE(C, D) = {a ∈ C : POS_C(D) ≠ POS_C-{a}(D)}. So, we test every attribute to see if it belongs to the core, then we pass the core to the following algorithm:

REDU = CO.
AR = C - REDU.
While k(REDU, D) < k(C, D) - margin, do the following:
1. Find attribute a ∈ AR such that k(REDU ∪ {a}, D) is maximized.
2. Set REDU = REDU ∪ {a}; AR = AR - {a}.
Output REDU.

This algorithm is greedy because it hopes to minimize the number of attributes by choosing the attribute that creates the greatest increase in k at each step.

Example

Consider the "sunburn" data above. First, let us compute the core. Every object is discernible, so POS_C(D) = U.

Attribute (a)	IND(C - {a})	POS_{C - {a}}(D)	In Core?
Hair	{Sarah}, {Dana}, {Alex}, {Annie}, {Emily, John}, {Pete}, {Katie}	U - {Emily, John}	Yes
Height	{Sarah}, {Dana}, {Alex}, {Annie}, {Emily}, {Pete, John}, {Katie}	U	No
Weight	{Sarah}, {Dana}, {Alex}, {Annie}, {Emily}, {Pete}, {John}, {Katie}	U	No
Lotion	{Sarah}, {Dana}, {Alex}, {Annie}, {Emily}, {Pete}, {John}, {Katie}	U	No

So, CORE(C, D) = CO = {Hair}, and POS_CO(D) = {Alex, Emily, Pete, John}, and k(CO, D) = 4/8 = 0.5. Now we proceed with the algorithm.

Step	REDU	AR	a ∈ AR	k(REDU ∪ {a}, D)	Choice
1	{Hair}	{Height, Weight, Lotion}	Height	0.5	Lotion
			Weight	0.5
			Lotion	1.0

The algorithm terminates after one iteration, with the answer REDU = {Hair, Lotion}. The reduced table is as follows:

	C		D
Vote	Hair	Lotion	Sunburn
2	Blonde	No	Yes
2	Blonde	Yes	No
1	Brown	Yes	No
2	Brown	No	No
1	Red	No	Yes

Learning Rules

Before generating rules, we want to combine similar tuples. Tuples are similar if they differ in only one attribute (call it a). Ideally, the similar tuples, taken together, would have all possible values for a (known as saturation), but this is not necessary. In the example, notice that Brown/Yes/No and Brown/No/No differ only in the Lotion attribute, and that Lotion is saturated. Therefore, we can combine these into one tuple (Brown/?/No with vote = 3). Similarly, if we assume that Blonde, Brown, and Red are the only hair colours, then we can combine Blonde/No/Yes with Red/No/Yes to get ¬Brown/No/Yes. So the simplified table is as follows:

	C		D
Vote	Hair	Lotion	Sunburn
3	¬Brown	No	Yes
2	Blonde	Yes	No
3	Brown	?	No

So, finally, we are ready to read the rules out of the table:

(Hair ≠ Brown ∧ Lotion = No) ⇒ Sunburn = Yes
(Hair = Blonde ∧ Lotion = Yes) ∨ (Hair = Brown) ⇒ Sunburn = No

Dealing with noisy data.
Comparison with Decision Trees.