Linear regression techniques attempt to model data using a straight line.
Given a set of data points (x1, y1), (x2, y2), …, (xn, yn), where yi is some response corresponding to xi, linear regression is a method for determining the straight line that best fits the observed data points.
The first step in fitting a straight line to the data points is to construct a scatter plot.
If the points appear to approximate a straight line, linear regression may be an appropriate analysis technique.
DIAGRAM = Classification.C.1.b1
If they don’t, some other technique is required.
DIAGRAM = Classification.C.1.b2
The method of least squares assumes the best-fit line is the one that minimizes the sum of the squared deviations of the observed data points from the line.
The general regression equation can be written as

ŷ = α + βx

where α and β are called the regression coefficients.
The regression coefficients can be estimated from the following two equations:

β = Σ(xi − x̅)(yi − y̅) / Σ(xi − x̅)²

α = y̅ − βx̅

where x̅ is the mean of the x values in the sample, y̅ is the mean of the y values, β represents the slope of the line through the points, and α represents the y-intercept.
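To make the estimation concrete, the following Python sketch implements the two equations above. The function and variable names (fit_least_squares, xs, ys) are illustrative only and do not come from the original text.

def fit_least_squares(xs, ys):
    """Estimate the regression coefficients alpha (intercept) and beta (slope)
    by the method of least squares."""
    n = len(xs)
    x_bar = sum(xs) / n                      # x̅, the mean of the x values
    y_bar = sum(ys) / n                      # y̅, the mean of the y values
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta = sxy / sxx                         # slope of the fitted line
    alpha = y_bar - beta * x_bar             # y-intercept of the fitted line
    return alpha, beta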
Example – Linear regression
Consider the table shown below, where Salary is shown for various values of Years of Service. The objective is to use the data in this table to predict Salary based upon Years of Service. Years of Service is called the explanatory variable and Salary is called the response variable.
Salary | Years of Service
30 | 3
57 | 8
64 | 9
72 | 13
36 | 3
43 | 6
59 | 11
90 | 21
20 | 1
83 | 16
A scatter plot corresponding to the values in the table is shown below.
DIAGRAM = Classification.C.1.d
Based upon the values in the table, x̅ = 9.1, y̅ = 55.4, β = 3.54, and α = 23.19. Therefore ŷ = 23.19 + 3.54x.
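As a quick check, the hypothetical fit_least_squares sketch above reproduces these values from the table:

years  = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
alpha, beta = fit_least_squares(years, salary)
print(round(beta, 2), round(alpha, 2))            # 3.54 23.21
predicted = [round(alpha + beta * x, 2) for x in years]

(The intercept computed from the unrounded slope is about 23.21; the text rounds β to 3.54 before computing α, which gives 23.19.)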
Salary can now be predicted for any value of Years of Service. However, keep in mind that it is just a prediction. For example, the actual versus predicted Salary for the Years of Service values from the original table is shown below.
Salary | Years of Service | Predicted Salary (ŷ)
30 | 3 | 33.81
57 | 8 | 51.51
64 | 9 | 55.05
72 | 13 | 69.21
36 | 3 | 33.81
43 | 6 | 44.43
59 | 11 | 62.13
90 | 21 | 97.53
20 | 1 | 26.73
83 | 16 | 79.83
When interpreting the regression coefficients:
The estimated slope β = 3.54 implies that each additional year of service corresponds to an estimated increase in salary of 3.54 units (i.e., $3,540 if Salary is expressed in thousands of dollars).
The regression line should not be used to predict the response ŷ when x lies outside the range of the initial values.
Example
DIAGRAM = Classification.C.1.e
The coefficient of determination represents the proportion of the total variability that is explained by the model.
The coefficient of determination is represented by

R² = Σ(ŷi − y̅)² / Σ(yi − y̅)²

where the numerator is the measure of the total variability of the fitted values ŷi and the denominator is the measure of the total variability of the original values yi.
A value close to 1 implies that most of the variability is explained by the model.
A value close to 0 implies that the model is not appropriate.
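A minimal sketch of this computation in Python, reusing the hypothetical fit_least_squares helper and the years, salary, alpha, and beta values from the earlier sketch:

def coefficient_of_determination(xs, ys, alpha, beta):
    """R² = total variability of the fitted values / total variability of the original values."""
    y_bar = sum(ys) / len(ys)
    fitted = [alpha + beta * x for x in xs]
    explained = sum((f - y_bar) ** 2 for f in fitted)   # numerator
    total = sum((y - y_bar) ** 2 for y in ys)           # denominator
    return explained / total

print(round(coefficient_of_determination(years, salary, alpha, beta), 2))   # ≈ 0.95

For the Salary example this comes out close to 1, consistent with the roughly linear scatter plot.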
The Naïve Bayes classifier is a well-known and highly effective classifier based upon Bayes’ Rule; it estimates the likelihood of class membership of an unseen instance given a set of labeled instances.
The prior (or unconditional) probability, P(a), associated with a proposition a (i.e., an assertion that a is true) is the degree of belief accorded to it in the absence of any other information.
Example – Prior probability
P(rain = true) = 0.25 or P(rain) = 0.25
The posterior (or conditional) probability, P(a | b), associated with a proposition a is the degree of belief accorded to it given that all we know is b.
Example – Posterior probability
P(rain | thunder) = 0.8
A prior probability, such as P(rain), can be thought of as a special case of a posterior probability in which the probability is conditioned on no evidence.
Posterior probabilities can be defined in terms of prior probabilities. Specifically,

P(a | b) = P(a ∧ b) / P(b)

for P(b) > 0, which can also be written as

P(a ∧ b) = P(a | b) P(b)

In a nutshell: for a and b to be true, we need b to be true, and we need a to be true given b.

Since conjunction is commutative, P(a ∧ b) = P(b ∧ a), so

P(a ∧ b) = P(a | b) P(b)

can be written equivalently as

P(a ∧ b) = P(b | a) P(a)

Then, since P(a | b) P(b) = P(b | a) P(a), we have Bayes’ Rule

P(a | b) = P(b | a) P(a) / P(b)

which can be written equivalently, exchanging the roles of a and b, as

P(b | a) = P(a | b) P(b) / P(a)
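As a numerical illustration in Python, assume hypothetical values P(thunder | rain) = 0.9 and P(thunder) = 0.28 together with the prior P(rain) = 0.25 from above (the first two figures do not appear in the text; they are chosen only so the arithmetic reproduces the posterior quoted earlier):

p_thunder_given_rain = 0.9            # hypothetical likelihood
p_rain = 0.25                         # prior probability of rain
p_thunder = 0.28                      # hypothetical probability of thunder
p_rain_given_thunder = p_thunder_given_rain * p_rain / p_thunder   # Bayes' Rule
print(round(p_rain_given_thunder, 2)) # 0.8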
A Naïve Bayes classifier applies to classification tasks where each instance x is described by a conjunction of attribute values (i.e., a tuple ⟨a1, a2, ..., an⟩) and where the target class can take on any value from some finite set C (i.e., the set of possible class values).
A set of labeled instances is provided from which the prior and posterior probabilities can be derived.
When a new instance is presented, the classifier is asked to predict the class label.
The Bayesian approach considers a set of candidate hypotheses (i.e., the various possible class labels) and determines the hypothesis (i.e., the class label) that is most probable given the labeled instances, known as the maximum a posteriori (MAP) hypothesis.
Given a new instance with attribute values ⟨a1, a2, ..., an⟩, the most probable class label is given by

CMAP = argmax_{Cj ∈ C} P(Cj | a1, a2, ..., an)

Using Bayes’ Rule, the above expression can be written as

CMAP = argmax_{Cj ∈ C} P(a1, a2, ..., an | Cj) P(Cj) / P(a1, a2, ..., an)

or

CMAP = argmax_{Cj ∈ C} P(a1, a2, ..., an | Cj) P(Cj)

That is, the denominator can be dropped because it is a constant term independent of Cj.
Since a Naïve Bayes classifier assumes the effect of an attribute value on a given class is independent of the values of the other attributes (called the class conditional independence assumption), the probability of observing the conjunction a1, a2, ..., an given a class Cj is just the product of the probabilities of the individual attributes. That is,

P(a1, a2, ..., an | Cj) = Πi P(ai | Cj)

Substituting for P(a1, a2, ..., an | Cj) in the equation for CMAP yields

CNB = argmax_{Cj ∈ C} P(Cj) Πi P(ai | Cj)
where CNB denotes the assigned class label output by the Naïve Bayes classifier.
Algorithm: Naïve Bayes Learner
Input:
D = a set of labeled instances of the form ⟨a1, a2, ..., an⟩, where each ai corresponds to a value from the domain of attributes A1, A2, ..., An, respectively, and an is the assigned class label
Output: classProbability = an array of prior probabilities
attributeProbability = an array of posterior probabilities
v = an array of the number of unique values in the domain of each attribute
Method:
1.  totalCount = 0
2.  m = the number of unique classes in the domain of An
3.  n = the number of attributes in the instances of D
4.  for j = 1 to m
5.      classCount[j] = 0
6.      for i = 1 to n - 1
7.          v[i] = the number of unique values in the domain of Ai
8.          for k = 1 to v[i]
9.              attributeCount[j, i, k] = 0
10. for each instance of D
11.     totalCount++
12.     j = an integer corresponding to the class of the current instance
13.     classCount[j]++
14.     for i = 1 to n - 1
15.         k = an integer corresponding to the value of attribute Ai in the current instance
16.         attributeCount[j, i, k]++
17. for j = 1 to m
18.     classProbability[j] = classCount[j] / totalCount
19.     for i = 1 to n - 1
20.         for k = 1 to v[i]
21.             attributeProbability[j, i, k] = attributeCount[j, i, k] / classCount[j]
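A minimal Python sketch of this learner follows, assuming each instance is a tuple whose last element is the class label. The names train_naive_bayes, class_probability, and attribute_probability are my own, not from the original pseudocode.

from collections import defaultdict

def train_naive_bayes(instances):
    """Learn prior (class) and posterior (attribute) probabilities from labeled instances.
    Each instance is a tuple (a1, ..., an-1, class_label)."""
    class_count = defaultdict(int)
    attribute_count = defaultdict(int)   # keyed by (class label, attribute index, attribute value)
    total = 0
    for *attributes, label in instances:
        total += 1
        class_count[label] += 1
        for i, value in enumerate(attributes):
            attribute_count[(label, i, value)] += 1
    class_probability = {c: n / total for c, n in class_count.items()}
    attribute_probability = {key: n / class_count[key[0]]
                             for key, n in attribute_count.items()}
    return class_probability, attribute_probability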
Algorithm: NaiveBayesClassifier
Input: classProbability = an array of prior probabilities
attributeProbability = an array of posterior probabilities
m = the number of unique classes in the domain of An
n = the number of attributes in the instances of D
v = an array of the number of unique values in the domain of each attribute
⟨a1, a2, ..., an-1⟩ = an unlabeled instance
Output: CNB = the class label
Method:
1.  CNB = 0
2.  maxProb = 0
3.  for j = 1 to m
4.      CTemp = classProbability[j]
5.      for i = 1 to n - 1
6.          for k = 1 to v[i]
7.              if ai == the k-th value in the domain of Ai
8.                  CTemp = CTemp * attributeProbability[j, i, k]
9.                  break
10.     if CTemp > maxProb
11.         maxProb = CTemp
12.         CNB = the class label corresponding to j
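The corresponding classification step, as a Python sketch that pairs with the hypothetical train_naive_bayes helper above:

def classify_naive_bayes(class_probability, attribute_probability, instance):
    """Return the class label that maximizes P(Cj) * product of P(ai | Cj)."""
    best_label, best_score = None, 0.0
    for label, prior in class_probability.items():
        score = prior
        for i, value in enumerate(instance):
            # an attribute value never seen with this class contributes probability 0 here
            score *= attribute_probability.get((label, i, value), 0.0)
        if score > best_score:
            best_label, best_score = label, score
    return best_label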
Example – Predicting a class label using a Naïve Bayes classifier
Tuple | Age | Income | Student | Credit Rating | Buys Computer
t1 | <=30 | high | no | fair | no
t2 | <=30 | high | no | excellent | no
t3 | 31..40 | high | no | fair | yes
t4 | >40 | medium | no | fair | yes
t5 | >40 | low | yes | fair | yes
t6 | >40 | low | yes | excellent | no
t7 | 31..40 | low | yes | excellent | yes
t8 | <=30 | medium | no | fair | no
t9 | <=30 | low | yes | fair | yes
t10 | >40 | medium | yes | fair | yes
t11 | <=30 | medium | yes | excellent | yes
t12 | 31..40 | medium | no | excellent | yes
t13 | 31..40 | high | yes | fair | yes
t14 | >40 | medium | no | excellent | no
The class label attribute is Buys Computer and it has two unique values: yes and no. The unlabeled instance to be classified is
⟨Age = “<=30”, Income = medium, Student = yes, Credit Rating = fair⟩.
Let a1 = “<=30”, a2 = medium, a3 = yes, and a4 = fair. So, the problem is to determine P(Cj) Πi P(ai | Cj) for all j. Now,
P(C1) = P(Buys Computer = yes) = 9/14
and
P(C2) = P(Buys Computer = no) = 5/14.
To determine CNB, we only need to concern ourselves with the conditional probabilities associated with the attribute values on the unlabeled instance. So,
P(C1) P(a1 | C1) P(a2 | C1) P(a3 | C1) P(a4 | C1)
    = (9/14)(2/9)(4/9)(6/9)(6/9)
    = (0.643)(0.222)(0.444)(0.667)(0.667)
    = 0.028

and

P(C2) P(a1 | C2) P(a2 | C2) P(a3 | C2) P(a4 | C2)
    = (5/14)(3/5)(2/5)(1/5)(2/5)
    = (0.357)(0.6)(0.4)(0.2)(0.4)
    = 0.007
We need to maximize P(Cj) Πi P(ai | Cj), and 0.028 > 0.007. Therefore, CNB = C1 = (Buys Computer = yes).
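As a check, the hypothetical train_naive_bayes and classify_naive_bayes sketches from earlier reproduce this prediction when applied to the table:

data = [
    ("<=30",   "high",   "no",  "fair",      "no"),   # t1
    ("<=30",   "high",   "no",  "excellent", "no"),   # t2
    ("31..40", "high",   "no",  "fair",      "yes"),  # t3
    (">40",    "medium", "no",  "fair",      "yes"),  # t4
    (">40",    "low",    "yes", "fair",      "yes"),  # t5
    (">40",    "low",    "yes", "excellent", "no"),   # t6
    ("31..40", "low",    "yes", "excellent", "yes"),  # t7
    ("<=30",   "medium", "no",  "fair",      "no"),   # t8
    ("<=30",   "low",    "yes", "fair",      "yes"),  # t9
    (">40",    "medium", "yes", "fair",      "yes"),  # t10
    ("<=30",   "medium", "yes", "excellent", "yes"),  # t11
    ("31..40", "medium", "no",  "excellent", "yes"),  # t12
    ("31..40", "high",   "yes", "fair",      "yes"),  # t13
    (">40",    "medium", "no",  "excellent", "no"),   # t14
]
class_p, attr_p = train_naive_bayes(data)
print(classify_naive_bayes(class_p, attr_p, ("<=30", "medium", "yes", "fair")))   # yes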