Neural Networks and Deep Learning Week 2: Logistic Regression

1. Binary Classification

1.1 Logistic Regression


Suppose you have an image and want to know whether it is a cat picture. If yes, the label is y = 1; otherwise y = 0.

So if your input image is 64 pixels by 64 pixels, then you have three 64-by-64 matrices corresponding to the red, green, and blue pixel intensity values of the image. To turn these pixel intensities into a feature vector, we unroll all of the pixel values into a single input feature vector x, and we use n_x = 64 × 64 × 3 = 12288 to represent the dimension of the input features x. So the dimension of x depends on the number of features.
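A minimal sketch of this unrolling step, assuming an image already loaded into a NumPy array of shape (64, 64, 3):

import numpy as np

# hypothetical 64 x 64 RGB image with random pixel intensities
image = np.random.rand(64, 64, 3)

# unroll (flatten) all pixel values into a single column feature vector x
x = image.reshape(-1, 1)

print(x.shape)   # (12288, 1), i.e. n_x = 64 * 64 * 3 = 12288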

A single training example is a pair (x, y), where x is an n_x-dimensional feature vector and y ∈ {0, 1}.


An entire training set consists of m such examples:

{(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}

m: the number of samples; m_train: the number of training samples; m_test: the number of test samples.


Y belongs to a 1 × m space, i.e. the labels are stacked into a row vector Y = [y^(1), y^(2), ..., y^(m)]. (This differs from what I learned before; isn't the data matrix usually m × n, with one sample per row?)
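A minimal sketch of this column-stacking convention, with hypothetical numbers (n_x = 2 features, m = 3 examples):

import numpy as np

# three hypothetical examples with n_x = 2 features each, stacked as columns
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # shape (n_x, m) = (2, 3)

# labels stacked as a 1 x m row vector
Y = np.array([[1, 0, 1]])         # shape (1, m) = (1, 3)

print(X.shape, Y.shape)           # (2, 3) (1, 3)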


So, let's briefly discuss logistic regression below:


Given an input x (a vector of n_x dimensions, the input data), we want an output ŷ that labels x, so that each example forms a pair (x, y).

But what if you want ŷ to be a probability, i.e. a label that gives the probability that the result is 1? For example, you want to know the probability that tomorrow is sunny (not rainy). How do we do that?

(First, ŷ marks the probability that y is 1, not the probability that it is 0. Second, we want to output a probability rather than an arbitrary number: a probability lies in (0, 1), while an arbitrary number has no definite range. So the question is how to squash a number with infinite range into a probability in (0, 1).)

We know that the range of w^T·x + b is (-∞, +∞). The problem is therefore transformed into how to narrow that result down into the (0, 1) region. Notice the sigmoid function in the picture above: σ(z) = 1/(1 + e^(-z)) has exactly this ability. Taking z = w^T·x + b: when z gets close to +∞ (a large positive number), e^(-z) → 0, so σ(z) → 1; by comparison, σ(z) → 0 when z gets close to -∞ (a large negative number); and when z = 0, σ(z) = 0.5. So we can set ŷ = 1 when σ(z) > 0.5 and ŷ = 0 when σ(z) < 0.5.
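A minimal NumPy sketch of this squashing behavior (the sample z values are just illustrative):

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)), maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-10))  # close to 0
print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # close to 1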


Cost function
Loss function definition:

One thing you could do is define the loss, when your algorithm outputs ŷ and the true label is y, as L(ŷ, y) = -(y·log ŷ + (1 - y)·log(1 - ŷ)). (That is, it measures the gap between the predicted value and the true value, and of course it should be as small as possible.)


Some important points from the above:

  1. Given a set of examples {(x^(i), y^(i))}, you want your prediction ŷ to be close to y. (Only then does the prediction count as accurate.)
  2. Find a loss function which measures the difference between ŷ and the real y.
  3. When y = 1, the loss reduces to -log ŷ; for it to be as small as possible, log ŷ should be as large as possible, so ŷ should be as large as possible, i.e. ŷ should be close to 1. (Symmetrically, when y = 0 the loss is -log(1 - ŷ), which pushes ŷ toward 0.)
  4. The loss function deals with a single training example; the cost function J(w, b) = (1/m)·Σ L(ŷ^(i), y^(i)) applies to the whole sample set, as sketched in the code below.
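A minimal sketch of these two quantities, assuming predictions A and labels Y are NumPy row vectors of shape (1, m) (the names and values are illustrative):

import numpy as np

def loss(y_hat, y):
    # cross-entropy loss for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(A, Y):
    # cost = average of the per-example losses over all m samples
    m = Y.shape[1]
    return np.sum(loss(A, Y)) / m

A = np.array([[0.9, 0.2, 0.7]])   # hypothetical predictions
Y = np.array([[1, 0, 1]])         # true labels
print(cost(A, Y))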


Gradient Descent



  1. When w and b change, the cost function changes. So the question becomes how to find a good w and b that make the cost function as small as possible.
  2. α is the learning rate, and it controls how big a step we take on each iteration of gradient descent.
  3. In Python code, dw is used to represent the derivative dJ(w)/dw.

So, the gradient descent updates for w and b can be written as: repeat { w := w - α·dJ(w, b)/dw, b := b - α·dJ(w, b)/db }.


In fact, we just take the partial derivative with respect to w and with respect to b separately, and then step by step subtract the (scaled) partial derivatives.
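To make the update rule concrete, here is a minimal self-contained sketch on a toy one-dimensional cost J(w) = (w - 3)^2 (not the logistic cost, just an illustration of the update):

# toy example: minimize J(w) = (w - 3)^2, whose derivative is dJ/dw = 2 * (w - 3)
alpha = 0.1          # learning rate
w = 0.0              # initial parameter
for i in range(100):
    dw = 2 * (w - 3)         # derivative of the cost at the current w
    w = w - alpha * dw       # gradient descent update: step against the gradient
print(w)             # converges to 3, the minimizer of J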



Computation Graph

When you want to compute a function J(a, b, c) = 3(a + bc), you can first introduce intermediate variables, for example u = bc and v = a + u, so that J = 3v. Then you can draw a computation graph: the blue arrows represent forward propagation, and the red arrows represent backward propagation. Through backward propagation we can compute the derivative of J with respect to each input and hidden node. Why do we want these derivatives? A derivative measures how much the function J is affected when one factor changes. Below, a worked numeric example introduces this calculation method.


OK, let's work from right to left. When v increases by 1, J increases by 3, so dJ/dv = 3. When a increases by 1, v increases by 1 and J also increases by 3, so dJ/da is also 3, because dJ/da = (dJ/dv)·(dv/da). This is exactly the chain rule. In the same way we can compute dJ/du, dJ/db, and dJ/dc. So, multiplying the derivatives of each factor from right to left tells us how a change in each input is related to a change in the function.

In Python variable naming, since the numerator of every derivative here is dJ, we omit it: dJ/dv is written simply as dv, and the rest (da, du, db, dc) follow the same pattern.
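A minimal sketch of the forward and backward passes through this graph, using that naming convention (the values a = 5, b = 3, c = 2 are just an example):

# forward propagation through J(a, b, c) = 3 * (a + b * c)
a, b, c = 5.0, 3.0, 2.0
u = b * c            # u = 6
v = a + u            # v = 11
J = 3 * v            # J = 33

# backward propagation (chain rule, right to left)
dv = 3.0             # dJ/dv
da = dv * 1.0        # dJ/da = dJ/dv * dv/da = 3
du = dv * 1.0        # dJ/du = dJ/dv * dv/du = 3
db = du * c          # dJ/db = dJ/du * du/db = 3 * c = 6
dc = du * b          # dJ/dc = dJ/du * du/dc = 3 * b = 9
print(da, db, dc)    # 3.0 6.0 9.0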

Let's use logistic regression as an example.


So, first we list all the formulas needed to calculate the loss function of logistic regression: z = w1·x1 + w2·x2 + b, a = σ(z), and L(a, y) = -(y·log a + (1 - y)·log(1 - a)). Then we can draw the computation graph. w1, w2, and b are the parameters we need to refine, and they determine the quality of the model.


So, in order to calculate the derivatives of L(a, y) with respect to w1, w2, and b, namely dL/dw1, dL/dw2, and dL/db, we can use backward propagation.

  1. Calculate da = dL/da = -y/a + (1 - y)/(1 - a).
  2. Calculate dz = dL/dz = (dL/da)·(da/dz), where da/dz = a(1 - a) is the derivative of the sigmoid function. So dz = a - y.
  3. Calculate dw1 = x1·dz, dw2 = x2·dz, and db = dz in the same way.

So, this is how we can calculate, and then optimize, the parameters using a computation graph. (The chain-rule derivation of the composite derivatives is explained very well and makes the formulas easy to understand.)
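A minimal sketch of this single-example forward and backward pass (the example values x1, x2, y and the initial parameters are arbitrary assumptions for illustration):

import numpy as np

# hypothetical single example and parameters
x1, x2, y = 1.0, 2.0, 1.0
w1, w2, b = 0.5, -0.3, 0.1

# forward propagation
z = w1 * x1 + w2 * x2 + b
a = 1.0 / (1.0 + np.exp(-z))          # a = sigmoid(z)

# backward propagation
dz = a - y                             # dL/dz = a - y
dw1 = x1 * dz                          # dL/dw1
dw2 = x2 * dz                          # dL/dw2
db = dz                                # dL/db
print(dw1, dw2, db)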

Gradient Descent on m Samples

When there are m samples in total, indexed 1, 2, 3, ..., m, we use x^(i) to represent the i-th sample and dw^(i), db^(i) to represent the derivatives computed on sample i. For the whole sample set, the gradients should be the average over every sample. (That is, average the derivative of each parameter across all the samples.)

Python computation approach (a loop-based sketch follows):
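This is a minimal loop-based sketch of that approach, with a hypothetical tiny data set of n_x = 2 features and m = 3 samples:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical data: n_x = 2 features (x1, x2), m = 3 samples
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # shape (n_x, m)
Y = np.array([[1, 0, 1]])              # shape (1, m)
w1, w2, b = 0.0, 0.0, 0.0
m = X.shape[1]

J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0
for i in range(m):                      # first for loop: over the m samples
    x1, x2, y = X[0, i], X[1, i], Y[0, i]
    z = w1 * x1 + w2 * x2 + b
    a = sigmoid(z)
    J += -(y * np.log(a) + (1 - y) * np.log(1 - a))
    dz = a - y
    dw1 += x1 * dz                      # with n_x features, this accumulation
    dw2 += x2 * dz                      # would itself be a second for loop
    db += dz
J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m   # average over all samples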


There are two drawbacks:

You have to write two for loops (one over the m samples and, with more features, another over the n_x features), and for loops degrade performance. What can we do about it? Vectorization.
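A minimal vectorized sketch of the same computation, replacing both loops with matrix operations (same hypothetical data as in the loop version):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # shape (n_x, m) = (2, 3)
Y = np.array([[1, 0, 1]])                # shape (1, m)
w = np.zeros((2, 1))                     # weights as a column vector
b = 0.0
m = X.shape[1]

# vectorized forward and backward pass over all m samples at once
Z = np.dot(w.T, X) + b                   # shape (1, m)
A = sigmoid(Z)                           # predictions for every sample
J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
dZ = A - Y                               # shape (1, m)
dw = np.dot(X, dZ.T) / m                 # shape (n_x, 1), no loop over features
db = np.sum(dZ) / m                      # scalar, no loop over samples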


Broadcasting in Python

import numpy as np
A = np.array([[56.0,0.0,4.4,68.0],
             [1.2,104.0,52.0,8.0],
             [1.8,135.0,99.0,0.9]])


print(A)

[[ 56.    0.    4.4  68. ]
 [  1.2 104.   52.    8. ]
 [  1.8 135.   99.    0.9]]

cal = A.sum(axis=0)  #sum vertically, i.e. down each column

#cal is a 1-D (rank-1) array of shape (4,); by default it prints horizontally like a row
cal
array([ 59. , 239. , 155.4,  76.9])

#reshaped into a horizontal 2-D array of shape (1, 4); each inner array is one row
cal = cal.reshape(1,4)
cal
array([[ 59. , 239. , 155.4,  76.9]])

#for a vertical (column) layout in a 2-D array
cal_verti = cal.reshape(4,1)
cal_verti
array([[ 59. ],
       [239. ],
       [155.4],
       [ 76.9]])

percentage = 100 * A/cal
percentage

array([[94.91525424,  0.        ,  2.83140283, 88.42652796],
       [ 2.03389831, 43.51464435, 33.46203346, 10.40312094],
       [ 3.05084746, 56.48535565, 63.70656371,  1.17035111]])

#adding a constant: it is broadcast to every element
B = np.array([1,2,3,4])
B + 100

array([101, 102, 103, 104])

c = np.array([[1,2,3],[4,5,6]])
c1 = np.array([100,200,300])
c2 = np.array([[100],[200]])
print(c + c1)
print(c+c2)

[[101 202 303]
 [104 205 306]]
[[101 102 103]
 [204 205 206]]

#in other words, when two arrays have different dimensions, numpy's elementwise add/subtract/multiply automatically broadcasts them into matching shapes.

a = np.random.randn(5)
print(a)
#a is a (5,) rank-1 array with no second dimension; it's neither a row vector nor a column vector.
a.shape
(5,)

print(a.T)  #prints the same as a: transposing a rank-1 array has no effect
[ 0.61129989 -0.48008827  1.39754925 -0.90183129  0.13849732]

#so it's best not to use this form; add the column dimension explicitly to get a 2-D matrix
b = np.random.randn(5,1)
print(b)

[[ 1.46909732]
 [-0.70696235]
 [ 0.50947828]
 [-0.48711335]
 [-1.61225188]]

#now the transpose works, because b is a 2-D array
print(b.T)

[[ 1.46909732 -0.70696235  0.50947828 -0.48711335 -1.61225188]]

#so don't use rank-1 vectors; reshape them into explicit column vectors and assert the shape
a = a.reshape((5, 1))
assert(a.shape == (5, 1))







Reposted from blog.csdn.net/gaoyishu91/article/details/80358561