Machine learning is the process by which systems improve their performance with experience.

Elements of a Learning Task


  1. Items of Experience
    • i ∈ I
  2. Space of Available Actions
    • a ∈ A
  3. Evaluation
    • v (a, i)
  4. Base Performance System
    • b: I → A
  5. Learning System
    • L: (i1, a1, v1), ..., (in, an, vn) → b
    • Maps training experience sequence to base performance system.
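The five elements above can be made concrete in code. Below is a minimal Python sketch of a toy task; everything concrete here (sign-label actions, the threshold rule, the grid of thresholds) is an illustrative assumption, not part of the notes.

```python
from typing import Callable, Sequence, Tuple

# Items of experience I are real numbers; actions A are sign labels.
Item = float
Action = str
ACTIONS = ["pos", "neg"]  # the space of available actions A

def evaluate(action: Action, item: Item) -> float:
    """Evaluation v(a, i): 1 if the action matches the item's sign."""
    correct = "pos" if item >= 0 else "neg"
    return 1.0 if action == correct else 0.0

def make_base_system(threshold: float) -> Callable[[Item], Action]:
    """A base performance system b: I -> A (here, a threshold rule)."""
    return lambda i: "pos" if i >= threshold else "neg"

def learn(training: Sequence[Tuple[Item, Action, float]]) -> Callable[[Item], Action]:
    """Learning system L: maps a training sequence (i, a, v), ... to a
    base performance system -- here, a grid search over thresholds,
    re-using the evaluation function for scoring (a simplification)."""
    scores = []
    for t in [-1.0, 0.0, 1.0]:  # tiny hypothetical grid of thresholds
        b = make_base_system(t)
        scores.append((sum(evaluate(b(i), i) for i, _, _ in training), t))
    best_t = max(scores)[1]
    return make_base_system(best_t)
```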

Types of Learning Problems

Fundamental Distinctions

  1. Batch versus Online Learning:
    • Batch Learning
      • Training phase (no responsibility for performance during learning), then testing.
      • Synthesizing a base performance system.
      • Pure exploration.
    • Online Learning
      • No separate phase.
      • Learning while doing.
      • Have to decide between choosing actions to perform well now versus gaining knowledge to perform well later.
  2. Learning from Complete versus Partial Feedback:
    • Complete Feedback
      • Each item of experience evaluates every action (or base system).
      • Tells what best action would have been.
    • Partial Feedback
      • Each item of experience evaluates only a subset of possible actions (or base systems).
      • Only tells you how you did.
      • Creates exploration/exploitation tradeoff in online setting.
        • Should the reward be optimized with what is already known or should learning be attempted?
        • In batch learning, exploration would be chosen since there is no responsibility for performance.
    • Pointwise Feedback
      • Evaluates single action.
        • Optimization problem: (Controller, State) → Action.
  3. Passive versus Active Learning
    • Passive: the learner only observes the experience it is given.
    • Active: the learner experiments, choosing which experiences to gather.
  4. Learning in Acausal versus Causal Situations
    • Acausal: actions do not affect future experience.
      • e.g., rain prediction
    • Causal: actions do affect future experience.
  5. Learning in Stationary versus Nonstationary Environments
    • Stationary: evaluation doesn't change.
    • Nonstationary: evaluation changes with time.
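The exploration/exploitation tradeoff from distinction 2 (online learning with partial feedback) is commonly illustrated with a multi-armed bandit. The epsilon-greedy strategy below is one standard heuristic, sketched here as an assumed example rather than anything prescribed by the notes; only the chosen action's reward is ever observed.

```python
import random

def epsilon_greedy_bandit(true_rewards, n_steps=1000, epsilon=0.1, seed=0):
    """Online learning with partial feedback: each step, either explore
    (random action) or exploit (best estimated action), then observe a
    noisy reward for the chosen action only."""
    rng = random.Random(seed)
    n_arms = len(true_rewards)
    counts = [0] * n_arms     # pulls per arm
    means = [0.0] * n_arms    # running reward estimates
    total = 0.0
    for _ in range(n_steps):
        if rng.random() < epsilon:                # explore
            arm = rng.randrange(n_arms)
        else:                                     # exploit
            arm = max(range(n_arms), key=lambda a: means[a])
        r = true_rewards[arm] + rng.gauss(0, 0.1)  # partial feedback
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
        total += r
    return total / n_steps
```

With epsilon = 0, the learner would exploit its initial (possibly wrong) estimates forever; with epsilon = 1, it would explore forever and never perform well -- the tradeoff in the notes.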

Concept Learning

The problem of inducing general functions from specific training examples is central to learning.

Concept learning is the task of acquiring the definition of a general category from a sample of positive and negative training examples of that category. The method is to search through a hypothesis space for the hypothesis that best fits the given training examples.

A hypothesis space, in turn, is a predefined space of potential hypotheses, often implicitly defined by the hypothesis representation.
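As an illustration of searching a hypothesis space, here is a sketch with a deliberately tiny, assumed space: interval concepts over a grid of integers. The space, grid, and examples are all hypothetical; the point is only that concept learning reduces to finding a hypothesis consistent with the positive and negative examples.

```python
from itertools import product

def concept_learn(positives, negatives, grid=range(0, 11)):
    """Search a predefined hypothesis space -- interval concepts
    [lo, hi] over `grid` -- for a hypothesis consistent with the
    positive and negative training examples."""
    for lo, hi in product(grid, grid):
        if lo > hi:
            continue
        h = lambda x, lo=lo, hi=hi: lo <= x <= hi
        if all(h(x) for x in positives) and not any(h(x) for x in negatives):
            return (lo, hi)
    return None  # no consistent hypothesis exists in this space
```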

Learning a Function from Examples



Idea: to extrapolate observed y's over all X.

Hope: to predict well on future y's given x's.

Require: there must be regularities to be found!

(Note type: batch, complete, passive (we are not choosing which x), acausal, stationary).

Many Research Communities

Traditional Statistics

Traditional Pattern Recognition

"Symbolic" Machine Learning

Neural Networks

Inductive Logic Programming

Learning Theory

Standard Learning Algorithm

Given a batch set of training data S = {(x1, y1), ..., (xt, yt)}, consider a fixed hypothesis class H and compute a function h ∈ H that minimizes the empirical error on S, i.e., the average loss (1/t) Σ loss(h(xi), yi) over the training examples.
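For a finite (or enumerable) hypothesis class, this standard algorithm can be sketched in a few lines. Here H is passed in as an explicit list of callables and squared loss is assumed by default -- both illustrative simplifications.

```python
def erm(hypotheses, S, loss=lambda y_hat, y: (y_hat - y) ** 2):
    """Empirical risk minimization over a fixed class H: return the
    hypothesis with the smallest average loss on the training set S."""
    def empirical_error(h):
        return sum(loss(h(x), y) for x, y in S) / len(S)
    return min(hypotheses, key=empirical_error)
```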


Example: Least-Squares Linear Regression

Here, the standard learning algorithm corresponds to least squares linear regression.

X = R^n, Y = R


H = {linear functions R^n → R}
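Under squared loss, this instance of empirical error minimization has a closed-form solution. A minimal sketch using NumPy's built-in least-squares solver follows; the data are illustrative.

```python
import numpy as np

def least_squares(X, y):
    """Least-squares linear regression: find the weight vector w
    minimizing the empirical squared error sum_t (w . x_t - y_t)^2."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Illustrative data: y = x1 + 2*x2 exactly, so the recovered
# weight vector should be [1, 2].
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = least_squares(X, y)
```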

Choosing a Hypothesis Space

In practice, for a given problem X → Y, which hypothesis space H do we choose?


Question: since we do not know the true function f: X → Y, should we make H as "expressive" as possible and let the training data do the rest?

Answer: no!
Reason: overfitting!

Basic Overfitting Phenomenon

Overfitting: a hypothesis fits the training data very well yet predicts poorly on new examples.

Example: polynomial regression

Suppose we use a nested family of hypothesis spaces Hd = {polynomials of degree at most d}, for d = 0, 1, 2, ...


Given n = 10 training points (x1, y1), ..., (x10, y10).



Since n = 10, the space Hn-1 = H9 of degree-9 polynomials has 10 free coefficients, so it contains a function that fits the 10 training points exactly.

Which hypothesis would predict well on future examples?
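A quick numerical illustration of this phenomenon (the target curve, noise level, and degrees are assumed for the demo): fitting 10 noisy samples of a sine curve, the degree-9 polynomial drives the training error to essentially zero, while a degree-3 fit leaves residual training error but tends to track the underlying curve better on fresh points.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 10)  # 10 noisy samples

# Degree 9: 10 coefficients for 10 points -- an exact fit (H9).
p9 = np.polynomial.Polynomial.fit(x, y, deg=9)
# Degree 3: cannot fit the noise exactly (H3).
p3 = np.polynomial.Polynomial.fit(x, y, deg=3)

train_err_9 = np.mean((p9(x) - y) ** 2)   # essentially zero
train_err_3 = np.mean((p3(x) - y) ** 2)   # small but nonzero

# Error on fresh points from the true (noise-free) curve.
xt = np.linspace(0.05, 0.95, 50)
yt = np.sin(2 * np.pi * xt)
test_err_9 = np.mean((p9(xt) - yt) ** 2)
test_err_3 = np.mean((p3(xt) - yt) ** 2)
```

The degree-9 fit minimizes empirical error perfectly, yet it interpolates the noise; its test error reflects that, which is exactly the overfitting question raised above.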

When does empirical error minimization overfit?

Depends on the relationship between