# Perceptron Learning

**Perceptron Learning**: Learning a perceptron means finding appropriate values for the weight vector $W$. The hypothesis space of a perceptron is therefore the set of all possible weight vectors. The perceptron learning algorithm can be stated as follows.

1. Assign random values to the weight vector.
2. Apply the weight update rule to every training example.
3. If all training examples are correctly classified, quit; otherwise, go back to Step 2.

**There are two popular weight update rules:**

i) the perceptron rule, and

ii) the delta rule.

**The Perceptron Rule**

For a new training example $X = (x_1, x_2, \ldots, x_n)$, update each weight according to this rule:

$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta\,(t - o)\,x_i$$

where

- $t$: the target output
- $o$: the output generated by the perceptron
- $\eta$: a constant called the learning rate (e.g., 0.1)

Comments about the perceptron training rule:

- If the example is correctly classified, the term $(t - o)$ equals zero, and no weight update is necessary.
- If the perceptron outputs $-1$ and the real answer is $1$, then $(t - o) = 2$, so each weight changes by $2\eta x_i$, pushing the output upward.
- If the perceptron outputs $1$ and the real answer is $-1$, then $(t - o) = -2$, so each weight changes by $-2\eta x_i$, pushing the output downward.
- Provided the examples are linearly separable and a sufficiently small value of $\eta$ is used, the rule is guaranteed to converge to a weight vector that classifies all training examples correctly (i.e., is consistent with the training data).
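The following is a minimal NumPy sketch of the full training loop using the perceptron rule; the $\pm 1$ label encoding, the appended bias input, the random seed, and the AND-function data are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, max_epochs=100):
    """Perceptron rule: w_i <- w_i + eta * (t - o) * x_i, per example.

    X: (n_samples, n_features) inputs; t: targets in {-1, +1}.
    A bias is folded in as an extra input fixed at 1 (an assumption
    for this sketch; the notes leave the bias implicit).
    """
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias input
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])    # step 1: random weights

    for _ in range(max_epochs):
        converged = True
        for x, target in zip(X, t):               # step 2: visit each example
            o = 1 if w @ x > 0 else -1            # thresholded output sgn(w.x)
            if o != target:                       # (t - o) != 0: update needed
                w += eta * (target - o) * x
                converged = False
        if converged:                             # step 3: all correct -> quit
            break
    return w

# Example: learning the linearly separable AND function on {-1, +1} inputs
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
t = np.array([-1, -1, -1, 1])
w = train_perceptron(X, t)
```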

**The Delta Rule**

What happens if the examples are not linearly separable?

In that case the perceptron rule fails to converge. To address this situation, the delta rule instead approximates the target concept as closely as possible.

The key idea is to use a gradient descent search. We will try to minimize the following error:

$$E = \frac{1}{2} \sum_{d} (t_d - o_d)^2$$

where the sum ranges over all training examples $d$. Here $o_d$ is the linear output $W \cdot X_d$, not $\mathrm{sgn}(W \cdot X_d)$ as with the perceptron rule. The idea is to search the space of weight vectors for a minimum of the error function $E$.
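To make the objective concrete, here is a minimal NumPy sketch that evaluates $E$ for a linear unit; the toy inputs, targets, and zero-initialized weights are illustrative assumptions.

```python
import numpy as np

# Assumed toy data: three examples with two features each
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([1.0, -1.0, 1.0])
w = np.zeros(2)                  # assumed starting weight vector

o = X @ w                        # linear outputs W.X_d (no sgn threshold)
E = 0.5 * np.sum((t - o) ** 2)   # E = 1/2 * sum_d (t_d - o_d)^2
print(E)                         # 1.5 for the all-zero weights above
```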

**The delta rule is as follows:**

For a new training example $X = (x_1, x_2, \ldots, x_n)$, update each weight according to this rule:

$$w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = -\eta\, \frac{\partial E}{\partial w_i}$$

where $\eta$ is the learning rate (e.g., 0.1).

Since $o_d = W \cdot X_d$, we have $\partial o_d / \partial w_i = x_{id}$, where $x_{id}$ is the $i$-th component of example $X_d$. Applying the chain rule gives

$$\frac{\partial E}{\partial w_i} = \sum_{d} (t_d - o_d)\,(-x_{id})$$

**So that gives us the following update:**

$$\Delta w_i = \eta \sum_{d} (t_d - o_d)\, x_{id}$$
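Below is a minimal batch gradient-descent sketch of the delta rule in NumPy; the learning rate, epoch count, and XOR data are illustrative assumptions.

```python
import numpy as np

def train_delta(X, t, eta=0.05, epochs=200):
    """Delta rule: w_i <- w_i + eta * sum_d (t_d - o_d) * x_id."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                # linear outputs, one per training example
        grad = -(t - o) @ X      # dE/dw_i = sum_d (t_d - o_d) * (-x_id)
        w -= eta * grad          # gradient descent: w <- w - eta * dE/dw
    return w

# Runs even on data that is not linearly separable (XOR on {-1, +1}):
X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
t = np.array([-1.0, 1.0, 1.0, -1.0])
w = train_delta(X, t)            # converges to the least-squares weights
```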

There are two key differences between the perceptron rule and the delta rule. First, the perceptron rule updates weights based on the thresholded output of the step function, whereas the delta rule uses the linear combination of inputs directly. Second, the perceptron rule is guaranteed to converge to a consistent hypothesis provided the data are linearly separable, while the delta rule converges only asymptotically toward the minimum-error hypothesis, but it does so whether or not the data are linearly separable.

**There are two main difficulties with the gradient descent method:**

- Convergence to a minimum may take a long time.
- There is no guarantee we will find the global minimum.

These difficulties are commonly addressed by adding a momentum term to the weight updates and by restarting the search from randomly perturbed weight vectors.
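As a sketch of the first remedy, a momentum term carries a fraction of the previous update into the current one, smoothing and accelerating the descent; the coefficient $\alpha = 0.9$ is an assumed typical value, and the function name is hypothetical.

```python
import numpy as np

def train_delta_momentum(X, t, eta=0.05, alpha=0.9, epochs=200):
    """Delta rule with momentum:
    dw(n) = eta * sum_d (t_d - o_d) * x_d + alpha * dw(n-1)."""
    w = np.zeros(X.shape[1])
    dw = np.zeros_like(w)                    # previous update, dw(n-1)
    for _ in range(epochs):
        o = X @ w
        dw = eta * (t - o) @ X + alpha * dw  # gradient step plus momentum
        w += dw
    return w
```

Random perturbations can be layered on top by running the trainer from several randomly perturbed initial weight vectors and keeping the result with the lowest error.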