Chapter 1 Deep Learning Specialization
1.1 Neural Networks and Deep Learning
Logistic regression
Single layer neural network
Activation functions
Gradient descent
Links
https://dennybritz.com/posts/wildml/implementing-a-neural-network-from-scratch/
https://community.deeplearning.ai/t/dls-course-1-lecture-notes/11862
1.1.1 Logistic regression
Traditional formulation using logit link
Link function
\[\log\left(\frac{a_i}{1-a_i}\right) = w_1x_{i1} + w_2x_{i2} + ... + b = z_i\] where, \(a_i = p(y_i = 1\;|\;\mathbf{x}_i)\)
Inverse link function
\[a_i = \sigma(z_i) = \frac{\exp(z_i)}{1+\exp(z_i)} = \frac{1}{1+\exp(-z_i)}\]
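The link and inverse link functions above undo each other, which can be checked numerically. This is a minimal sketch; the function names `logit` and `sigmoid` and the example values of `z` are chosen here for illustration.

```python
import numpy as np

def logit(a):
    """Link function: log-odds of a probability a in (0, 1)."""
    return np.log(a / (1 - a))

def sigmoid(z):
    """Inverse link function: maps a log-odds value z back to a probability."""
    return 1 / (1 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])   # arbitrary example log-odds values
a = sigmoid(z)                   # probabilities in (0, 1)
z_back = logit(a)                # recovers z up to floating-point error
```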
Likelihood
\[L(\mathbf{w}, b\;|\;\mathbf{X}) = \prod_ip(y_i)\] \[= \prod_ip(y_i = 1\;|\;\mathbf{x}_i)^{y_i}p(y_i = 0\;|\;\mathbf{x}_i)^{(1-y_i)}\] \[= \prod_ia_i^{y_i}(1-a_i)^{(1-y_i)}\]
Negative log-likelihood
\[-\ell(\mathbf{w}, b\;|\;\mathbf{X}) = -\log\left[\prod_ia_i^{y_i} (1-a_i)^{(1-y_i)}\right]\] \[= -\sum_i\log\left[a_i^{y_i} (1-a_i)^{(1-y_i)}\right]\] \[= -\sum_i\left[\log\left(a_i^{y_i}\right)+\log\left((1-a_i)^{(1-y_i)}\right)\right]\] \[= -\sum_i\left[y_i\log\left(a_i\right)+(1-y_i)\log\left(1-a_i\right)\right]\]
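As a quick numerical check, the product form of the likelihood and the sum form of the negative log-likelihood should agree. The sketch below assumes made-up labels `y` and predicted probabilities `a`; the helper name `negative_log_likelihood` is hypothetical.

```python
import numpy as np

def negative_log_likelihood(a, y):
    """Sum over examples of -[y*log(a) + (1-y)*log(1-a)]."""
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

y = np.array([1, 0, 1])          # hypothetical labels
a = np.array([0.9, 0.2, 0.6])    # hypothetical predicted probabilities

likelihood = np.prod(a**y * (1 - a)**(1 - y))   # product form
nll = negative_log_likelihood(a, y)             # sum form
```

Here `nll` should equal `-np.log(likelihood)` up to rounding. Working with the sum form avoids multiplying many small numbers together.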
Derivatives
Equations
\(\mathcal{L} = -y\log\left(a\right)-(1-y)\log\left(1-a\right)\)
\(a = \sigma(z)\)
\(z = w_1x_1 + w_2x_2 + ... + b\)
Derivatives (using sigmoid inverse link)
\(\frac{d\mathcal{L}}{dw_1} = \frac{dz}{dw_1} \cdot \frac{da}{dz} \cdot \frac{d\mathcal{L}}{da} = \frac{dz}{dw_1} \cdot \frac{d\mathcal{L}}{dz} = x_1 \cdot \frac{d\mathcal{L}}{dz} = x_1 (a-y)\)
\(\frac{d\mathcal{L}}{dw_2} = \frac{dz}{dw_2} \cdot \frac{da}{dz} \cdot \frac{d\mathcal{L}}{da} = \frac{dz}{dw_2} \cdot \frac{d\mathcal{L}}{dz} = x_2 \cdot \frac{d\mathcal{L}}{dz} = x_2 (a-y)\)
\(\frac{d\mathcal{L}}{db} = \frac{dz}{db} \cdot \frac{da}{dz} \cdot \frac{d\mathcal{L}}{da} = \frac{dz}{db} \cdot \frac{d\mathcal{L}}{dz} = 1 \cdot \frac{d\mathcal{L}}{dz} = (a-y)\)
where,
\(\frac{d\mathcal{L}}{dz} = \frac{d\mathcal{L}}{da} \cdot \frac{da}{dz} = \left[-\frac{y}{a}+\frac{1-y}{1-a}\right] \cdot \left[a(1-a)\right] = -y(1-a) + (1-y)a = a-y\)
and where,
\(\frac{d\mathcal{L}}{da} = -\frac{y}{a}+\frac{1-y}{1-a}\)
\(\frac{da}{dz} = a(1-a)\)
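The results \(\frac{d\mathcal{L}}{dw_j} = x_j(a-y)\) and \(\frac{d\mathcal{L}}{db} = a-y\) can be checked against a finite-difference estimate. This is a sketch for a single training example with two features; all numeric values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b, x, y):
    """Single-example loss L = -y*log(a) - (1-y)*log(1-a)."""
    a = sigmoid(np.dot(w, x) + b)
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

# Hypothetical values for one training example with two features.
w = np.array([0.5, -0.3])
b = 0.1
x = np.array([1.5, 2.0])
y = 1.0

# Analytic gradients: dL/dw_j = x_j (a - y), dL/db = a - y.
a = sigmoid(np.dot(w, x) + b)
dw_analytic = x * (a - y)
db_analytic = a - y

# Central finite differences as an independent numerical estimate.
eps = 1e-6
dw_numeric = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = eps
    dw_numeric[j] = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * eps)
db_numeric = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
```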
Gradient descent
\(\frac{\partial \mathcal{J}}{\partial w_1} = \frac{1}{m} \sum_i \frac{\partial \mathcal{L}^{(i)}}{\partial w_1}\)
\(\frac{\partial \mathcal{J}}{\partial w_2} = \frac{1}{m} \sum_i \frac{\partial \mathcal{L}^{(i)}}{\partial w_2}\)
\(\frac{\partial \mathcal{J}}{\partial b} = \frac{1}{m} \sum_i \frac{\partial \mathcal{L}^{(i)}}{\partial b}\)
1.1.2 Single layer neural network
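A neural network with a single hidden layer is like logistic regression repeated many times: each node computes a weighted sum of its inputs plus a bias and passes the result through an activation function.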
Notation
\(m\) : number of training examples
\(n^{[k]}\) : number of nodes in layer \(k\) (where \(k=0\) indicates the input layer)
\(\mathbf{W}^{[k]}\) : a (\(n^{[k]}\), \(n^{[k-1]}\)) matrix of weights for layer \(k\)
\(\mathbf{w}_j^{[k]}\) : a (1, \(n^{[k-1]}\)) matrix of weights for layer \(k\) node \(j\)
\(\mathbf{b}^{[k]}\) : a (…, …) matrix ….
\(\mathbf{A}^{[k]}\) : a (\(n^{[k]}\), \(m\)) matrix of activations for layer \(k\) (where \(\mathbf{A}^{[0]}\) = \(\mathbf{X}\))
\(\mathbf{x}^{(i)}\) : vector of input feature values for the \(i\)th training example
\(x_p^{(i)}\) : value of input feature \(p\) for the \(i\)th training example
\(w_{j, p}^{[k]}\) : value of the weight for feature \(p\) for layer \(k\) node \(j\)
\(\mathbf{a}^{[k]}\) : activations for layer \(k\)
\(\mathbf{a}_{j}^{[k]}\) : activations for layer \(k\) node \(j\)
\(a_{j}^{[k](i)}\) : activation for the \(i\)th training example for node \(j\) layer \(k\)
Formulation
Input layer
E.g. three nodes - one node per input feature (\(n^{[0]} = 3\)).
\[ \mathbf{a}^{[0]} = \mathbf{X} \]
equivalently,
\[ \begin{bmatrix} \mathbf{a}_1^{[0]} \\ \mathbf{a}_2^{[0]} \\ \mathbf{a}_3^{[0]} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \mathbf{x}_3 \end{bmatrix} \]
\[ \begin{bmatrix} a^{[0](1)}_1 \ ... \ a^{[0](m)}_1 \\ a^{[0](1)}_2 \ ... \ a^{[0](m)}_2 \\ a^{[0](1)}_3 \ ... \ a^{[0](m)}_3 \\ \end{bmatrix} = \begin{bmatrix} x^{(1)}_1 \ ... x^{(m)}_1 \\ x^{(1)}_2 \ ... x^{(m)}_2 \\ x^{(1)}_3 \ ... x^{(m)}_3 \\ \end{bmatrix} \]
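The output layer below multiplies four weights by four rows of hidden activations, which implies a hidden layer with four nodes (\(n^{[1]} = 4\)). A sketch of that layer in the same notation follows. The activation function \(g\) is an assumption, since no choice is stated here.
\[\mathbf{a}^{[1]} = g(\mathbf{z}^{[1]}) = g(\mathbf{W}^{[1]}\mathbf{a}^{[0]} + \mathbf{b}^{[1]})\]
Under this assumption \(\mathbf{W}^{[1]}\) has shape (4, 3), \(\mathbf{b}^{[1]}\) has shape (4, 1) and is added to every column, and \(\mathbf{a}^{[1]}\) has shape (4, \(m\)).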
Output layer
Single node (\(n^{[2]} = 1\)).
\[\mathbf{a}^{[2]} = \sigma(\mathbf{z}^{[2]}) = \sigma(\mathbf{W}^{[2]}\mathbf{a}^{[1]} + b^{[2]}) = \widehat{\mathbf{y}}\]
equivalently,
\[ \begin{bmatrix} a_1^{[2](1)} \ ... \ a_1^{[2](m)} \\ \end{bmatrix} = \sigma\left( \begin{bmatrix} w_{1, 1}^{[2]} \ w_{1, 2}^{[2]} \ w_{1, 3}^{[2]} \ w_{1, 4}^{[2]} \\ \end{bmatrix} \begin{bmatrix} a_1^{[1](1)} \ ... \ a_1^{[1](m)} \\ a_2^{[1](1)} \ ... \ a_2^{[1](m)} \\ a_3^{[1](1)} \ ... \ a_3^{[1](m)} \\ a_4^{[1](1)} \ ... \ a_4^{[1](m)} \\ \end{bmatrix} + b_1^{[2]} \right) = \begin{bmatrix} \hat{y}^{(1)} \ ... \ \hat{y}^{(m)} \\ \end{bmatrix} \]
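The formulation above can be sketched as a forward pass in NumPy, using the layer sizes from this section (3 inputs, 4 hidden nodes, 1 output). The tanh hidden activation, the small random initial weights and the value of `m` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
m = 5                      # number of training examples (arbitrary)
n0, n1, n2 = 3, 4, 1       # nodes in the input, hidden, and output layers

X = rng.random((n0, m))                    # A[0], one column per example
W1 = rng.standard_normal((n1, n0)) * 0.01  # small random initial weights (assumed)
b1 = np.zeros((n1, 1))
W2 = rng.standard_normal((n2, n1)) * 0.01
b2 = np.zeros((n2, 1))

Z1 = W1 @ X + b1           # (4, m); b1 broadcasts across the m columns
A1 = np.tanh(Z1)           # hidden activations (tanh is an assumption)
Z2 = W2 @ A1 + b2          # (1, m)
A2 = sigmoid(Z2)           # predicted probabilities, one per example
```

Each column of `A2` is the prediction \(\hat{y}^{(i)}\) for one training example, so no explicit loop over examples is needed.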
1.1.3 Activation functions
\(\tanh(z)\) is generally preferred to \(\sigma(z)\) for hidden layers since its output lies in (-1, 1) rather than (0, 1), so the activations are centered around zero. \(\sigma(z)\) may still be preferred for the output layer when probabilities are required.
ReLU (rectified linear unit) is often preferred to both of the above for hidden layers: its gradient stays at 1 for all positive \(z\) instead of shrinking towards zero as \(|z|\) grows, which can lead to faster convergence. Its gradient is zero for negative \(z\).
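The difference in gradient behaviour can be seen by evaluating each derivative at a few points. This is a minimal sketch; the helper names and the sample values of `z` are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    a = sigmoid(z)
    return a * (1 - a)

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    # Gradient is 1 for z > 0 and 0 for z < 0 (taken as 0 at z = 0 here).
    return (z > 0).astype(float)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
for name, grad in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad), ("relu", relu_grad)]:
    print(name, np.round(grad(z), 4))
```

At large \(|z|\) the sigmoid and tanh gradients are close to zero, while the ReLU gradient is still 1 for positive \(z\).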
1.1.4 Gradient descent
Parameters
\[ \begin{array}{ll} \mathbf{W}^{[1]} & (n^{[1]}, \; n^{[0]}) \\ \mathbf{b}^{[1]} & (n^{[1]}, \; 1) \\ \mathbf{W}^{[2]} & (n^{[2]}, \; n^{[1]}) \\ \mathbf{b}^{[2]} & (n^{[2]}, \; 1) \\ \end{array} \]
Cost function
\[ J(\mathbf{W}^{[1]}, \mathbf{b}^{[1]}, \mathbf{W}^{[2]}, \mathbf{b}^{[2]}) = \frac{1}{m} \sum_{i = 1}^m{loss(\hat{y}_i, y_i)} \]
Gradient descent algorithm
- compute \(\hat{y}_i\) for all \(i\)
- compute partial derivatives
\[ \begin{array}{ll} d\mathbf{W}^{[1]} & = \frac{\partial J}{\partial \mathbf{W}^{[1]}} \\ d\mathbf{b}^{[1]} & = \frac{\partial J}{\partial \mathbf{b}^{[1]}} \\ d\mathbf{W}^{[2]} & = \frac{\partial J}{\partial \mathbf{W}^{[2]}} \\ d\mathbf{b}^{[2]} & = \frac{\partial J}{\partial \mathbf{b}^{[2]}} \\ \end{array} \]
- update parameters
\[ \begin{array}{ll} \mathbf{W}^{[1]} & := \mathbf{W}^{[1]} - \alpha \; d\mathbf{W}^{[1]} \\ \mathbf{b}^{[1]} & := \mathbf{b}^{[1]} - \alpha \; d\mathbf{b}^{[1]} \\ \mathbf{W}^{[2]} & := \mathbf{W}^{[2]} - \alpha \; d\mathbf{W}^{[2]} \\ \mathbf{b}^{[2]} & := \mathbf{b}^{[2]} - \alpha \; d\mathbf{b}^{[2]} \\ \end{array} \]
- repeat until convergence
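The algorithm above can be sketched end to end for a 3-4-1 network. Several things here are assumptions rather than part of these notes: the tanh hidden activation, the cross-entropy loss, the standard backpropagation expressions for the partial derivatives, the synthetic data, the learning rate, and a fixed iteration count standing in for a convergence check.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(A2, Y):
    """Mean cross-entropy loss over the m examples."""
    return -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

rng = np.random.default_rng(0)
m, n0, n1, n2 = 200, 3, 4, 1
X = rng.random((n0, m))                                  # synthetic inputs
Y = (X.sum(axis=0, keepdims=True) > 1.5).astype(float)   # synthetic labels, (1, m)

W1 = rng.standard_normal((n1, n0)) * 0.01
b1 = np.zeros((n1, 1))
W2 = rng.standard_normal((n2, n1)) * 0.01
b2 = np.zeros((n2, 1))
alpha = 0.5        # learning rate (arbitrary choice)
costs = []

for _ in range(1000):              # fixed iteration count, not a convergence test
    # Forward pass: compute y-hat for all examples
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    costs.append(cost(A2, Y))

    # Backward pass: partial derivatives of J
    dZ2 = A2 - Y                                  # (1, m)
    dW2 = dZ2 @ A1.T / m                          # (1, 4)
    db2 = dZ2.sum(axis=1, keepdims=True) / m      # (1, 1)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)            # (4, m); tanh'(z) = 1 - tanh(z)^2
    dW1 = dZ1 @ X.T / m                           # (4, 3)
    db1 = dZ1.sum(axis=1, keepdims=True) / m      # (4, 1)

    # Update parameters
    W1 -= alpha * dW1
    b1 -= alpha * db1
    W2 -= alpha * dW2
    b2 -= alpha * db2
```

The list `costs` records \(J\) before each update, so plotting it is a simple way to see whether the chosen learning rate is working.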
1.1.4.0.1 Python implementation
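import numpy as np

def sigmoid(x):
    # Logistic function, applied element-wise
    return 1 / (1 + np.exp(-x))

m = 1000   # number of training examples
nx = 3     # number of input features
x = np.random.rand(nx, m)               # inputs, one column per example
y = np.random.randint(2, size=(1, m))   # binary labels as a row vector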
Non-vectorized
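A non-vectorized version of one gradient computation for logistic regression might look like the sketch below, with explicit loops over examples and features. It is self-contained, so it generates its own random data; the variable names and the seeded generator are choices made here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
m, nx = 1000, 3
x = rng.random((nx, m))        # inputs, one column per example
y = rng.integers(2, size=m)    # binary labels

w = np.zeros(nx)
b = 0.0
dw = np.zeros(nx)
db = 0.0

for i in range(m):                     # loop over training examples
    z_i = b
    for j in range(nx):                # loop over features
        z_i += w[j] * x[j, i]
    a_i = sigmoid(z_i)
    dz_i = a_i - y[i]                  # dL/dz for example i
    for j in range(nx):
        dw[j] += x[j, i] * dz_i        # accumulate dL/dw_j
    db += dz_i                         # accumulate dL/db

dw /= m                                # average over examples to get dJ/dw
db /= m                                # average over examples to get dJ/db
```

The nested Python loops make the arithmetic explicit but are much slower than the matrix form for large `m`.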
Vectorized
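w = np.zeros((1, nx))          # one weight per input feature
b = np.zeros((1, 1))           # bias, broadcast across the m examples
z = np.dot(w, x) + b           # (1, m) linear predictor
a = sigmoid(z)                 # (1, m) predicted probabilities
dz = a - y.reshape(1, m)       # (1, m) dL/dz per example, labels as a row vector
dw = np.dot(dz, x.T) / m       # (1, nx) gradient of the cost w.r.t. w
db = np.sum(dz) / m            # gradient of the cost w.r.t. b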