Machine Learning in Action (5) —— logistic regression

 

1. Outline of the regression problem:

We have a bunch of data, and from that data we try to build an equation that does classification for us. The equation is a function of some parameters and of the features of the dataset. When we classify a new data point, we already have its feature values to plug into the equation, so if we also know the values of the parameters, we can evaluate the equation and the result directly tells us the class label of the new data point.

The name regression refers to the fact that we try to find the best-fit parameters, the ones that classify most accurately.

How do we measure accuracy? A natural way is to compute the error between the predicted value and the true value; if we can minimize that error, we can hope to obtain the best-fit parameters. The function of the error is called the cost function, so our goal is to find the parameters that minimize the cost function.

 

How do we calculate the predicted value? We make a hypothesis about the equation used to classify. For example, we can assume the equation is linear:

z = θ0 + θ1·x1 + θ2·x2

In this equation, x1 and x2 are features and θ0, θ1, θ2 are parameters.

 

How do we measure the error, i.e. how do we define the cost function? One way is the least-squares cost function: J(θ) = ½ Σᵢ (hθ(x(i)) − y(i))², where hθ(x(i)) is the predicted value for the i-th example and y(i) is its true value. And there are many other cost functions.

When we have the cost function, we should choose an optimization algorithm to minimize the cost function and get the best-fit parameters.

 

 

2. Logistic regression

(1) Our hypothesis:

In the two-class case, we want a function that spits out a 0 or a 1, that is, a Heaviside step function. In the plot of this function there is a point where its value jumps from 0 to 1, so the key is to find that point.

Our hypothesis is:

hθ(x) = g(θᵀx), where g(z) = 1 / (1 + e^(−z))

Here θᵀx represents the inner product θ0·x0 + θ1·x1 + θ2·x2 of the parameter vector and the feature vector. The function g(z) is called the sigmoid.

On a large enough scale the sigmoid looks like a step function: anything above 0.5 we classify as a 1, and anything below 0.5 we classify as a 0. Since the value of this function lies between 0 and 1, it can also be read as a probability estimate.

 

(2) Optimization algorithm: gradient descent

Some points:

# We write the gradient with the symbol ∇; the gradient of a function f(x, y) is given by:

∇f(x, y) = ( ∂f/∂x , ∂f/∂y )

# At each point, the gradient points in the direction of greatest increase of the function, so it tells us which direction to move.

 

# The gradient gives the direction; we also need a step size along that direction. The magnitude of the step is controlled by the learning rate.

 

# In vector notation, the learning rule of this algorithm is (θ is the vector of parameters):

θ := θ − α ∇θ J(θ)

In this rule, α is the learning rate and J(θ) is our cost function. We choose the least-mean-squares function as the cost function:

J(θ) = ½ Σᵢ (hθ(x(i)) − y(i))²

where hθ(x(i)) is our hypothesis and y(i) is the true value.

Thus we can calculate the gradient of the cost function: for a single example, ∂J/∂θj = (hθ(x) − y) · xj.

For a single training example, this gives the update rule (the LMS update rule, also known as the Widrow-Hoff learning rule):

θj := θj + α (y(i) − hθ(x(i))) xj(i)

And for the whole training dataset, the update rule for each parameter is:

θj := θj + α Σᵢ (y(i) − hθ(x(i))) xj(i)

# We keep applying the learning rule until we reach a stopping condition: either a specified number of steps has been taken, or the algorithm is within a certain tolerance margin (i.e. it has converged).

 

Batch gradient descent: looks at every example in the entire training set on every step.

 

Stochastic gradient descent (incremental gradient descent): updates every parameter according to the gradient of the error with respect to a single training example only.

 

Notice

First, according to the learning rule above, we must give all the parameters initial values in order to start the algorithm.

Second, looking at the hypothesis we made, we have two features x1 and x2 but three parameters θ0, θ1, θ2. To represent the hypothesis in vector notation, we add a constant column of 1s to X. Now the feature vector x is [1, x1, x2] and the parameter vector θ is [θ0, θ1, θ2], so the product θᵀx makes sense.

 

 

3. Using logistic regression to find the best parameters for a dataset.

(1) Load dataset:

Notice that in this step we add a column x0 and set its value to 1 for all the training examples.
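A minimal sketch of such a loader, assuming a whitespace-separated testSet.txt with two feature columns and one label column per line (the file name and layout are assumptions):

```python
def loadDataSet(fileName='testSet.txt'):
    """Read one example per line: two feature columns and a label column."""
    dataMat = []
    labelMat = []
    with open(fileName) as fr:
        for line in fr:
            lineArr = line.strip().split()
            # prepend the constant column x0 = 1.0 for every example
            dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
            labelMat.append(int(lineArr[2]))
    return dataMat, labelMat
```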

(2) Make hypothesis:

This function is used to calculate the predicted value.
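A sketch of the hypothesis function; it simply applies g(z) = 1 / (1 + e^(−z)) elementwise:

```python
import numpy as np

def sigmoid(inX):
    # g(z) = 1 / (1 + exp(-z)); works on scalars, arrays, and matrices alike
    return 1.0 / (1.0 + np.exp(-inX))
```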

(3) Use the batch gradient ascent algorithm to find the best-fit parameters:

Notice that gradient ascent is the same thing as gradient descent except that the minus sign is changed to a plus sign:

θ := θ + α ∇θ f(θ)

where f(θ) is now a function to be maximized rather than minimized.

According to the update rule, we can update all the parameters at once through matrix operations.

This function returns the best-fit parameters.
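A sketch of the batch routine consistent with this description; alpha = 0.001 and maxCycles = 500 are illustrative defaults, and sigmoid() is the function sketched above:

```python
import numpy as np

def gradAscent(dataMatIn, classLabels, alpha=0.001, maxCycles=500):
    dataMatrix = np.mat(dataMatIn)                # m x n matrix of inputs
    labelMat = np.mat(classLabels).transpose()    # m x 1 column of labels
    m, n = np.shape(dataMatrix)
    weights = np.ones((n, 1))                     # initial parameter values
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)         # m x 1 vector of predictions
        error = labelMat - h                      # y - h(x), also m x 1
        # batch update: every example contributes on every step
        weights = weights + alpha * dataMatrix.transpose() * error
    return weights
```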

Annotation about functions in Python:

# numpy.mat(list) builds a matrix from a (nested) list

# numpy.ones((m, n)) returns an m×n array of ones

An array can be added to a matrix as long as they have the same shape, and the result is of type matrix.
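A small demonstration of these three points:

```python
import numpy as np

m = np.mat([[1, 2], [3, 4]])   # numpy.mat builds a matrix from a nested list
a = np.ones((2, 2))            # numpy.ones returns a plain ndarray

s = a + m                      # same shape, so the addition is allowed
print(type(s))                 # <class 'numpy.matrix'>: the result is a matrix
```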

 

(4) Use the stochastic gradient ascent algorithm to find the best-fit parameters:

Reason:

Batch gradient ascent uses the whole dataset on each update; if the dataset is very large, this is unnecessarily expensive in terms of computational resources. An alternative is to update the weights using only one instance at a time. This is known as the stochastic gradient ascent algorithm, an example of an online learning algorithm, because we can incrementally update the classifier as new data comes in rather than retraining on everything at once.

Implementation:

Notice that stochastic gradient ascent is similar to batch gradient ascent except that the hypothesis and the error value are now single numbers rather than vectors.

Also notice that we don’t need any matrix operations such as matrix conversion, so all of the variables are NumPy arrays.
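A sketch of one pass under these assumptions (the fixed alpha = 0.01 is an illustrative choice):

```python
import numpy as np

def stocGradAscent0(dataMatrix, classLabels, alpha=0.01):
    # dataMatrix is a plain NumPy array here; no matrix conversions needed
    m, n = np.shape(dataMatrix)
    weights = np.ones(n)
    for i in range(m):
        h = sigmoid(np.sum(dataMatrix[i] * weights))  # a single number
        error = classLabels[i] - h                    # also a single number
        weights = weights + alpha * error * dataMatrix[i]
    return weights
```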

 

(5) Use the modified stochastic gradient ascent algorithm to find the best-fit parameters:

Reason:

If the dataset is very small, there are few updates to the parameter vector, so the parameters may not have reached a steady value by the time the algorithm terminates.

Three aspects of improvement :

First, increase the number of iterations of the stochastic gradient ascent algorithm.

Second, during each iteration, change the learning rate α according to a rule designed so that, as the number of iterations increases, misclassified data has less and less impact on the parameters.

Third, during each iteration, choose the training examples one at a time in random order, ignoring their original order, while guaranteeing that every training example is used to update the parameters during each pass.

Implementation:

Two points deserve attention:

First, on each update alpha = 4/(1 + i + j) + 0.01, where j is the index of the pass through the dataset and i is the index of the example within the pass. This gives an alpha that isn’t strictly decreasing when j << max(i), which avoids a strictly decreasing schedule and damps the oscillations in the parameters. Alpha decreases as the number of iterations increases, but it never reaches 0 because of the constant term, so after a large number of cycles new data still has some impact. If we’re dealing with something that changes over time, we can make the constant term larger to give more weight to new values.

Second, an optional argument has been added to the function; its default value is 150, which means 150 passes over the dataset will be made.
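A sketch incorporating all three improvements; the name modifiedStoAsc matches the function referenced later in this post, and the deletion from dataIndex guarantees each example is used exactly once per pass:

```python
import random
import numpy as np

def modifiedStoAsc(dataMatrix, classLabels, numIter=150):
    m, n = np.shape(dataMatrix)
    weights = np.ones(n)
    for j in range(numIter):                  # j: pass through the dataset
        dataIndex = list(range(m))
        for i in range(m):                    # i: example within the pass
            # alpha decreases over time but never reaches 0 (constant 0.01)
            alpha = 4 / (1.0 + j + i) + 0.01
            # pick one of the remaining examples at random
            randIndex = int(random.uniform(0, len(dataIndex)))
            idx = dataIndex[randIndex]
            h = sigmoid(np.sum(dataMatrix[idx] * weights))
            error = classLabels[idx] - h
            weights = weights + alpha * error * dataMatrix[idx]
            del dataIndex[randIndex]          # each example used once per pass
    return weights
```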

 

4. Visualize the training dataset and the decision boundary

Tasks:

  1. Plot a scatter of the training data points
  2. Plot the line given by the best-fit parameters as the decision boundary
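A plotting sketch covering both tasks, assuming dataMat and labelMat come from the loader above and weights is a flat NumPy array (for the matrix returned by batch gradient ascent, convert it first with weights.getA().ravel()):

```python
import numpy as np
import matplotlib.pyplot as plt

def plotBestFit(weights, dataMat, labelMat):
    dataArr = np.array(dataMat)
    xcord1, ycord1, xcord0, ycord0 = [], [], [], []
    for i in range(dataArr.shape[0]):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
        else:
            xcord0.append(dataArr[i, 1]); ycord0.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')  # class 1
    ax.scatter(xcord0, ycord0, s=30, c='green')            # class 0
    # boundary: 0 = w0 + w1*x1 + w2*x2, i.e. x2 = -(w0 + w1*x1) / w2
    x = np.arange(-3.0, 3.0, 0.1)
    y = (-weights[0] - weights[1] * x) / weights[2]
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()
```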

 

Result (using the batch gradient ascent algorithm to get the weights vector):

 

Result (using the stochastic gradient ascent algorithm to get the weights vector):

 

Result (using the modified stochastic gradient ascent algorithm to get the weights vector):

 

Annotation about functions in Python

(1) Difference between scatter() and plot()

When we have some points and want to see how they are distributed, we use scatter(); if we use plot() instead, we get a broken line connecting the points in order.
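A small illustration:

```python
import matplotlib.pyplot as plt

xs = [1, 3, 2, 5, 4]
ys = [2, 1, 4, 3, 5]

plt.scatter(xs, ys)   # isolated markers: shows how the points are distributed
plt.plot(xs, ys)      # connects the points in order: a broken line
plt.show()
```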

(2) Difference between range() and arange()

The built-in range() accepts only integer arguments, while numpy.arange() also accepts floating-point start, stop, and step values and returns a NumPy array that supports elementwise arithmetic. The decision boundary is drawn with a float step, so here we use the function numpy.arange() rather than the built-in function range().
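A quick comparison:

```python
import numpy as np

list(range(0, 5))              # [0, 1, 2, 3, 4]; integer arguments only
# range(0, 5, 0.1)             # TypeError: float steps are not allowed

x = np.arange(-3.0, 3.0, 0.1)  # float step is fine; returns an ndarray
y = -x / 2.0                   # ndarrays support elementwise arithmetic
```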

 

5. Example: estimating horse fatalities from colic

Description :

We have a dataset (a training set and a test set) containing measurements of horses seen by a hospital for colic, including a number of features and the survival state, but the dataset contains missing values.

We’ll use logistic regression to try to predict if a horse with colic will live or die.

 

Main steps:

# parse a text file in Python, and fill in missing values

# use the training set and the modified stochastic gradient ascent algorithm to get the weights vector

# use the test set to test the algorithm and calculate the error rate

 

(1) Parse a text file in Python and fill in missing values

Some options for handling the missing values:

■ Use the feature’s mean value from all the available data.
■ Fill in the unknown with a special value like -1.
■ Ignore the instance.
■ Use a mean value from similar items.
■ Use another machine learning algorithm to predict the value.

In our logistic regression algorithm:

For missing feature values, we replace them with the special value 0; we can’t throw the examples out, because NumPy arrays can’t contain missing values. The reason we choose 0 is that the update to each weight is proportional to the corresponding feature value (α · error · xj), so a feature value of 0 leaves that weight unchanged and has no impact on it.

For examples whose class label is missing, we simply throw them out.
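A hypothetical parsing helper along these lines (the function name parseLine and the empty-field convention for missing entries are illustrative assumptions, not the exact preprocessing used for the dataset):

```python
def parseLine(line):
    """Split a tab-separated record, replace missing feature values with 0.0,
    and return None when the class label itself is missing."""
    fields = line.strip().split('\t')
    if fields[-1] == '':                 # missing class label: drop the example
        return None
    features = [float(f) if f != '' else 0.0 for f in fields[:-1]]
    return features, float(fields[-1])
```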

 

(2) We use the modified stochastic gradient ascent algorithm to get the weights vector.

(3) Test the algorithm:

 

 

To get a more accurate error rate, we run the function testColic() several times and take the average. Each run gives slightly different results because of the random component in the function modifiedStoAsc().
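A sketch of the classifier and the test routine, assuming the data files are named horseColicTraining.txt and horseColicTest.txt (tab-separated, label in the last column) and using the modifiedStoAsc() and sigmoid() functions sketched earlier:

```python
import numpy as np

def classifyVector(inX, weights):
    # label 1 if the sigmoid output is above 0.5, otherwise label 0
    prob = sigmoid(np.sum(inX * weights))
    return 1.0 if prob > 0.5 else 0.0

def testColic(numIter=500):
    trainSet, trainLabels = [], []
    with open('horseColicTraining.txt') as fr:
        for line in fr:
            fields = line.strip().split('\t')
            trainSet.append([float(f) for f in fields[:-1]])
            trainLabels.append(float(fields[-1]))
    weights = modifiedStoAsc(np.array(trainSet), trainLabels, numIter)
    errorCount, numTest = 0, 0
    with open('horseColicTest.txt') as fr:
        for line in fr:
            fields = line.strip().split('\t')
            numTest += 1
            inX = np.array([float(f) for f in fields[:-1]])
            if classifyVector(inX, weights) != float(fields[-1]):
                errorCount += 1
    errorRate = errorCount / numTest
    print('the error rate of this test is: %f' % errorRate)
    return errorRate
```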

 

Finally, I get the average error rate: 0.308955223880597

 

6.Summary

Logistic regression is the task of finding best-fit parameters for a nonlinear function called the sigmoid. Methods of optimization can be used to find those parameters; among them, one of the most common is gradient ascent. Gradient ascent can be simplified with stochastic gradient ascent, which can do as well as batch gradient ascent while using far fewer computing resources. In addition, stochastic gradient ascent is an online algorithm: it can update what it has learned as new data comes in, rather than reloading all of the data as in batch processing. One major problem in machine learning is how to deal with missing values in the data. There’s no blanket answer to this question; it really depends on what you’re doing with the data. There are a number of solutions, and each has its own advantages and disadvantages.


Reposted from blog.csdn.net/qq_39464562/article/details/81078437