**Reference**:

- D. Schuurmans, Machine Learning course notes, University of Waterloo, 1999.
- T. Mitchell, Machine Learning, McGraw-Hill, 1997, pp. 20-50.

**Machine learning** is a process by which systems improve with experience.

__Elements of a Learning Task__

Representation

- Items of experience: i ∈ I
- Space of available actions: a ∈ A
- Evaluation: v(a, i)
- Base performance system: b: I → A
- Learning system: L: (i_{1}, a_{1}, v_{1})...(i_{n}, a_{n}, v_{n}) → b
- Maps a training experience sequence to a base performance system.
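These elements can be made concrete in code. The sketch below is a hypothetical instantiation (parity labeling of integers, invented here for illustration) just to show the types: items i ∈ I, actions a ∈ A, an evaluation v(a, i), a base performance system b: I → A, and a learner L mapping an experience sequence to a base system.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical instantiation: items are integers, actions are parity labels.
Item = int
Action = str

def v(a: Action, i: Item) -> float:
    """Evaluation v(a, i): reward 1.0 if action a correctly labels i's parity."""
    return 1.0 if a == ("even" if i % 2 == 0 else "odd") else 0.0

Experience = Tuple[Item, Action, float]   # one (i, a, v) triple
BaseSystem = Callable[[Item], Action]     # b: I -> A

def L(training: Iterable[Experience]) -> BaseSystem:
    """Learning system L: maps a training experience sequence to a base
    performance system. Here: remember the best-evaluated action per item,
    falling back to 'even' for unseen items."""
    best: Dict[Item, Tuple[float, Action]] = {}
    for i, a, value in training:
        if i not in best or value > best[i][0]:
            best[i] = (value, a)
    return lambda i: best.get(i, (0.0, "even"))[1]

b = L([(2, "even", 1.0), (3, "odd", 1.0), (3, "even", 0.0)])
```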

__Types of Learning Problems__

Fundamental Distinctions

- Between problems rather than methods.
- Influences choice of method.

- Batch versus Online Learning:
  - **Batch Learning** - Training phase (no responsibility for performance during learning), then testing.
    - Synthesizing a base performance system.
    - Pure exploration.
  - **Online Learning** - No separate training phase; learning while doing.
    - Must decide between choosing actions to perform well __now__ versus gaining knowledge to perform well later.

- Learning from Complete versus Partial Feedback:
  - **Complete Feedback** - Each item of experience evaluates __every__ action (or base system).
    - Tells what the best action would have been.
  - **Partial Feedback** - Each item of experience evaluates only a __subset__ of possible actions (or base systems).
    - Only tells you how you did.
    - Creates an exploration/exploitation tradeoff in the online setting: should the reward be optimized with what is already known, or should learning be attempted?
    - In batch learning, exploration would be chosen since there is no responsibility for performance during training.
  - **Pointwise Feedback** - Evaluates a single action.
    - Optimization problem: (Controller, State) → Action.
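The exploration/exploitation tradeoff created by partial feedback in the online setting can be illustrated with a multi-armed bandit. The ε-greedy strategy below is a minimal sketch (the arm means and parameters are made up): pulling an arm reveals only that arm's noisy reward, never what the other arms would have paid, so the learner must occasionally explore.

```python
import random

def epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=0):
    """Online learning under partial feedback: with probability epsilon pick a
    random arm (explore), otherwise pick the arm with the best estimated mean
    (exploit). Only the chosen arm's reward is observed."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n
    estimates = [0.0] * n
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                            # explore
        else:
            arm = max(range(n), key=lambda a: estimates[a])   # exploit
        # Partial feedback: a noisy reward for the chosen arm only.
        reward = true_means[arm] + rng.gauss(0.0, 0.1)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

estimates, counts = epsilon_greedy([0.2, 0.5, 0.8])
```

With ε = 0, the learner can lock onto the first arm that looks good and never discover the best one; with ε = 1, it never cashes in on what it has learned.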

- Passive versus Active Learning:
  - **Passive:** observation.
  - **Active:** experimentation, exploitation.

- Learning in Acausal versus Causal Situations:
  - **Acausal:** actions __do not__ affect future experience (e.g., rain prediction).
  - **Causal:** actions __do__ affect future experience.

- Learning in Stationary versus Nonstationary Environments:
  - **Stationary:** evaluation doesn't change.
  - **Nonstationary:** evaluation changes with time.

__Concept Learning__

The problem of inducing general functions from specific training examples is central to learning.

**Concept learning** is the task of acquiring the definition of a general category from a sample of positive and negative training examples of that category. The method is to search a hypothesis space for the hypothesis that best fits the given training examples.

A **hypothesis space**, in turn, is a predefined space of potential hypotheses, often implicitly defined by the hypothesis representation.

__Learning a Function from Examples__

- An example of concept learning where the concepts are mathematical functions.

Given:

- Domain X: descriptions
- Domain Y: predictions
- H: hypothesis space; the set of all possible hypotheses
- h: target hypothesis

__Idea:__ to extrapolate observed y's over all X.

__Hope:__ to predict well on future y's given x's.

__Require:__ there must be regularities to be found!

(Note type: batch, complete, passive (we are not choosing which x), acausal, stationary).

__Many Research Communities__

- Representation choice differs.
- Prefer maximum level of abstraction.

Traditional Statistics

- h: R^{n} → R
- Squared prediction error.
- h is a linear function.

Traditional Pattern Recognition

- h: R^{n} → {0, 1}
- Right/wrong prediction error.
- h is a linear discriminant boundary.

"Symbolic" Machine Learning

- h: {attribute-value vectors} → {0, 1}
- h is a simple boolean function (e.g., a decision tree).

Neural Networks

- h: R^{n} → R
- h is a feedforward neural net.

Inductive Logic Programming

- h: {term structures} → {0, 1}
- h is a simple logic program.

Learning Theory

- Postulate mathematical model of learning situations.
- Rigorously establish achievable levels of learning performance.
- Characterize/quantify value of prior knowledge and constraints.
- Identify limits/possibilities of learning algorithms.

__Standard Learning Algorithm__

Given a batch set of training data S = {(x_{1}, y_{1})...(x_{t}, y_{t})}, consider a fixed hypothesis class H and compute a function h ∈ H that minimizes empirical error.
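For a finite hypothesis class, this algorithm is literally a minimum over H. The sketch below is a toy illustration (the hypothesis class of four candidate slopes and the sample are invented here), using squared error as the empirical error:

```python
def empirical_error(h, S):
    """Average squared error of hypothesis h on training sample S = [(x, y), ...]."""
    return sum((h(x) - y) ** 2 for x, y in S) / len(S)

def minimize_empirical_error(H, S):
    """Standard learning algorithm: return the h in H with least empirical error."""
    return min(H, key=lambda h: empirical_error(h, S))

# Toy finite hypothesis class: lines y = w * x for a few candidate slopes.
H = [lambda x, w=w: w * x for w in (0.0, 0.5, 1.0, 2.0)]
S = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
h_best = minimize_empirical_error(H, S)
```

For infinite classes (e.g., all linear functions), the same minimization is carried out analytically or by numerical optimization rather than by enumeration.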

__Example:__ Least-Squares Linear Regression

Here, the standard learning algorithm corresponds to least squares linear regression.

X = R^{n}, Y = R

H = {linear functions R^{n} → R}
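In the one-dimensional case (n = 1, with an intercept term), minimizing empirical squared error over linear functions has a well-known closed form; a minimal sketch:

```python
def least_squares_line(S):
    """Least-squares fit of h(x) = w*x + b on S = [(x, y), ...] (the n = 1 case).
    Minimizes empirical squared error in closed form via the sample means."""
    t = len(S)
    mx = sum(x for x, _ in S) / t
    my = sum(y for _, y in S) / t
    sxx = sum((x - mx) ** 2 for x, _ in S)
    sxy = sum((x - mx) * (y - my) for x, y in S)
    w = sxy / sxx          # slope: covariance / variance
    b = my - w * mx        # intercept: line passes through the means
    return w, b

# Three points lying exactly on y = 2x + 1.
w, b = least_squares_line([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
```

For general n, the same idea becomes the normal equations solved with matrix algebra.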

__Choosing a Hypothesis Space__

In practice, for a given problem X → Y, which hypothesis space H do we choose?

Question: since we don't know the true function f: X → Y, should we make H as "expressive" as possible and let the training data do the rest?

- e.g., quadratic
- e.g., high degree polynomial
- e.g., complex neural network.

Answer: no!

Reason: overfitting!

**Underfit** - Hypothesis space too restrictive: unable to express a good predictor.

**Overfit** - Hypothesis space too expressive: hard to identify good candidates from bad ones.

__Basic Overfitting Phenomenon__

Overfitting is __bad__.

__Example:__ polynomial regression

Suppose we use a nested hypothesis space with many classes of functions, H_{0} ⊆ H_{1} ⊆ H_{2} ⊆ ..., where H_{d} contains the polynomials of degree at most d.

Given n = 10 training points (x_{1}, y_{1})...(x_{10}, y_{10}), there is a function in

H_{n - 1} = H_{9}

that gives an exact fit.

Which hypothesis would predict well on future examples?

- Although hypothesis h_{9} fits these 10 data points exactly, it is unlikely to fit the next data point well.
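A sketch of the phenomenon: the data below come from the line y = x with small alternating "noise", and the exact-fit degree-9 interpolant (built here via Lagrange's formula, a hypothetical choice for illustration) reproduces all 10 training points yet swings far from the line between them.

```python
def exact_fit_polynomial(points):
    """Lagrange interpolant: the unique degree-(n-1) polynomial through n points."""
    def h(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            basis = 1.0
            for j, (xj, _) in enumerate(points):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return h

# True function is the line y = x; labels carry small alternating noise +/-0.1.
train = [(k / 9, k / 9 + 0.1 * (-1) ** k) for k in range(10)]
h9 = exact_fit_polynomial(train)  # degree-9 hypothesis: zero training error
```

The interpolant's training error is (numerically) zero, but its error at a held-out x between the last two training points is more than ten times the noise level, while the humble line y = x would be off by at most 0.1 everywhere.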

When does empirical error minimization overfit?

Depends on the relationship between

- hypothesis class, H
- training sample size, t
- data generating process