# CSE 455 Homework 5 #
Welcome, it's time for neural networks! Needless to say, run
```
git pull
make clean
make
```
Then `./main test hw5` and/or `python tryhw5.py`.
## 1. Implementing neural networks ##
### 1.1 Activation functions ###
An important part of machine learning, be it linear classifiers or neural networks, is the activation function you use.
#### TO DO ####
Fill in `void activate_matrix(matrix m, ACTIVATION a)` to modify `m` to be `f(m)` applied elementwise where the function `f` is given by what the activation `a` is.
Remember that for our softmax activation we will take e^x for every element x, but then we have to normalize each element by the sum as well. Each row in the matrix is a separate data point so we want to normalize over each data point separately.
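As a rough sketch of that row-wise softmax (in Python rather than the C you'll write; the row-max subtraction is an optional numerical-stability trick, not something the assignment requires):

```python
import math

def softmax_rows(m):
    # Softmax each row independently: exponentiate every element,
    # then divide by that row's sum so each row sums to 1.
    out = []
    for row in m:
        shift = max(row)  # subtracting the row max avoids overflow (optional)
        exps = [math.exp(x - shift) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

probs = softmax_rows([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
```

Each row of `probs` sums to 1, and a row of equal inputs maps to a uniform distribution.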
### 1.2 Taking gradients ###
As we are calculating the partial derivatives for the weights of our model we have to remember to factor in the gradient from our activation function at every layer. We will have the derivative of the loss with respect to the output of a layer (after activation) and we need the derivative of the loss with respect to the input (before activation). To do that, we take each element in our delta (the partial derivative) and multiply it by the gradient of our activation function.
Normally, to calculate `g'(x)` we would need to remember what the input to the function, `x`, was. However, all of our activation functions have the nice property that we can calculate the gradient `g'(x)` only with the output `g(x)`. This means we have to remember less stuff when running our model.
The gradient of a linear activation is just 1 everywhere. The gradient of our softmax will also be 1 everywhere because we will only use the softmax as our output with a cross-entropy loss function. For an explanation of why, see [here](https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa) or [here](https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/#softmax-and-cross-entropy-loss).
The gradient of our logistic function is discussed [here](https://en.wikipedia.org/wiki/Logistic_function#Derivative):
g'(x) = g(x) * (1 - g(x))
As discussed in the backpropagation slides, we just multiply our partial derivative by `g'(x)` to backpropagate through an activation function.
#### TO DO ####
Fill in `void gradient_matrix(matrix m, ACTIVATION a, matrix d)` to multiply the elements of `d` by the correct gradient information, where `m` is the output of a layer that uses `a` as its activation function.
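A hedged Python sketch of what `gradient_matrix` should do, using the activation names as strings in place of the `ACTIVATION` enum (here `m` holds the saved output `g(x)`, exactly as discussed above):

```python
def gradient_matrix(m, a, d):
    # m holds the layer's output g(x); d holds dL/dy.
    # Scale each element of d in place by g'(x), computed from g(x) alone.
    for i in range(len(m)):
        for j in range(len(m[i])):
            gx = m[i][j]
            if a == "logistic":
                d[i][j] *= gx * (1 - gx)
            # "linear" and "softmax": g'(x) = 1, so d is unchanged
```

For example, a logistic output of 0.5 has gradient 0.5 * (1 - 0.5) = 0.25, so an incoming delta of 2.0 becomes 0.5.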
### 1.3 Forward propagation ###
Now we can fill in how to forward-propagate information through a layer. First check out our layer struct:
```
typedef struct {
    matrix in;              // Saved input to a layer
    matrix w;               // Current weights for a layer
    matrix dw;              // Current weight updates
    matrix v;               // Past weight updates (for use with momentum)
    matrix out;             // Saved output from the layer
    ACTIVATION activation;  // Activation the layer uses
} layer;
```
During forward propagation we will do a few things. We'll multiply the input matrix by our weight matrix `w`. We'll apply the activation function `activation`. We'll also save the input and output matrices in the layer so that we can use them during backpropagation. But this is already done for you.
#### TO DO ####
Fill in the TODO in `matrix forward_layer(layer *l, matrix in)`. You will need the matrix multiplication function `matrix matrix_mult_matrix(matrix a, matrix b)` provided by our matrix library.
Using matrix operations we can batch-process data. Each row in the input `in` is a separate data point and each row in the returned matrix is the result of passing that data point through the layer.
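The steps above can be sketched in Python, with a dict standing in for the C `layer` struct and a plain triple-loop standing in for `matrix_mult_matrix` (names and shapes here are illustrative, not the library's API):

```python
import math

def matmul(a, b):
    # Plain (n x k)(k x m) -> (n x m) matrix multiply.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def forward_layer(l, x):
    # Save the input, multiply by the weights, apply the activation,
    # then save and return the output.
    l["in"] = x
    out = matmul(x, l["w"])
    if l["activation"] == "logistic":
        out = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in out]
    # "linear": leave out unchanged
    l["out"] = out
    return out
```

With a batch of two rows and a 2-input, 1-output linear layer, each output row is one data point passed through the layer.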
### 1.4 Backward propagation ###
We have to backpropagate error through our layer. We will be given `dL/dy`, the derivative of the loss with respect to the output `y` of the layer. The output `y` is given by `y = g(xw)` where `x` is the input, `w` is our weights for that layer, and `g` is the activation function. What we want to calculate is `dL/dw` so we can update our weights and `dL/dx` which is the backpropagated loss to the previous layer.
#### TO DO ####
Fill in `matrix backward_layer(layer *l, matrix delta)`
#### 1.4.1 Gradient of activation function ####
First we need to calculate `dL/d(xw)` using the gradient of our activation function. Recall:
```
dL/d(xw) = dL/dy * dy/d(xw)
         = dL/dy * dg(xw)/d(xw)
         = dL/dy * g'(xw)
```
Use your `void gradient_matrix(matrix m, ACTIVATION a, matrix d)` function to change `delta` from `dL/dy` to `dL/d(xw)`. Pass in the layer's saved output `l->out` as `m`; it holds `g(xw)`, and as discussed in section 1.2 our activations let us compute `g'(xw)` from `g(xw)` alone. The `delta` passed into `backward_layer` is `dL/dy`.
#### 1.4.2 Δw: Derivative of loss w.r.t. weights ####
Next we want to calculate the derivative with respect to our weights, `dL/dw`. Recall:
```
dL/dw = dL/d(xw) * d(xw)/dw
      = dL/d(xw) * x
```
but remember from the slides, to make the matrix dimensions work out right we actually do the matrix operation of `xt * dL/d(xw)` where `xt` is the transpose of the input matrix `x`.
In our layer we saved the input as `l->in`. Calculate `xt` using that and the matrix transpose function in our library, `matrix transpose_matrix(matrix m)`. Then calculate `dL/dw` and save it into `l->dw` (free the old one first to not leak memory!). We'll use this later when updating our weights.
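A small Python sketch of this step, with plain-list stand-ins for the library's `transpose_matrix` and `matrix_mult_matrix` (the variable names are illustrative):

```python
def matmul(a, b):
    # (n x k)(k x m) -> (n x m) matrix multiply.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

x = [[1.0, 2.0]]       # one data point with two inputs (the saved l->in)
delta = [[0.5, -1.0]]  # dL/d(xw) for two outputs
dw = matmul(transpose(x), delta)  # shape (inputs x outputs), same as w
```

Note the shapes: `xt` is (inputs x batch) and `delta` is (batch x outputs), so `dw` comes out (inputs x outputs), matching `w`.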
#### 1.4.3 Δo: Derivative of loss w.r.t. input ####
Next we want to calculate the derivative with respect to the input (i.e. the output of the previous layer) as well, `dL/dx`. Recall:
```
dL/dx = dL/d(xw) * d(xw)/dx
      = dL/d(xw) * w
```
Again, we have to make the matrix dimensions line up, so it actually ends up being `dL/d(xw) * wt` where `wt` is the transpose of our weights `w`. Calculate `wt` and then calculate `dL/dx`. This is the matrix we will return.
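The same step as a hedged Python sketch (helpers and names are illustrative stand-ins for the C library calls):

```python
def matmul(a, b):
    # (n x k)(k x m) -> (n x m) matrix multiply.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

w = [[1.0, 2.0], [3.0, 4.0]]  # (2 inputs x 2 outputs)
delta = [[0.5, -1.0]]         # dL/d(xw) for one data point
dx = matmul(delta, transpose(w))  # (batch x inputs), same shape as the input
```

`dx` has one row per data point and one column per input, so it can be handed straight to the previous layer as its `delta`.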
### 1.5 Weight updates ###
After we've performed a round of forward and backward propagation we want to update our weights. Remember from class with momentum and decay our weight update rule is:
```
Δw_t    = dL/dw_t - λ w_t + m Δw_{t-1}
w_{t+1} = w_t + α Δw_t
```
We'll be doing gradient ascent (the partial derivative component is positive) instead of descent because we'll flip the sign of our partial derivative when we calculate it at the output. We saved `dL/dw_t` as `l->dw` and by convention we'll use `l->v` to store the previous weight change `Δw_{t-1}`.
Calculate the current weight change as a weighted sum of `l->dw`, `l->w`, and `l->v`. The function `matrix axpy_matrix(double a, matrix x, matrix y)` will be useful here; it calculates the result of the operation `ax + y` where `a` is a scalar and `x` and `y` are matrices. Save the current weight change in `l->v` for next round (remember to free the current `l->v` first).
Finally, apply the weight change to your weights by adding a scaled amount of the change based on the learning rate.
#### TO DO ####
Fill in `void update_layer(layer *l, double rate, double momentum, double decay)`.
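The update rule above can be sketched in Python, with a dict standing in for the C struct and a tiny `axpy` mirroring `axpy_matrix` (no freeing needed here, unlike the C version):

```python
def axpy(a, x, y):
    # a*x + y elementwise, mirroring the library's axpy_matrix.
    return [[a * xv + yv for xv, yv in zip(xr, yr)]
            for xr, yr in zip(x, y)]

def update_layer(l, rate, momentum, decay):
    # Δw_t = dL/dw_t - λ w_t + m Δw_{t-1}
    dw_t = axpy(-decay, l["w"], axpy(momentum, l["v"], l["dw"]))
    l["v"] = dw_t  # saved as Δw_{t-1} for the next round
    # w_{t+1} = w_t + α Δw_t  (ascent: the sign was flipped at the output)
    l["w"] = axpy(rate, dw_t, l["w"])
```

For example, with `dw = 0.5`, `v = 0.2`, `momentum = 0.9`, and no decay, the new change is `0.5 + 0.9*0.2 = 0.68`, and a learning rate of 0.1 moves the weight by 0.068.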
### 1.6 Read through the rest of the code ###
Check out the remainder of the functions which string layers together to make a model, run the model forward and backward, update it, train it, etc. Notable highlights:
#### `layer make_layer(int input, int output, ACTIVATION activation)` ####
Makes a new layer that takes `input` number of inputs and produces `output` number of outputs. Note that we randomly initialize the weight matrix, in this case with a uniform random distribution over `[-sqrt(2/input), sqrt(2/input)]`. The weights could be initialized with other random distributions but this one works pretty well. This is one of those secret tricks you just have to know from experience!
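That initialization scheme is easy to sketch in Python (the function name here is illustrative, not part of the library):

```python
import math
import random

def random_weights(n_in, n_out):
    # Uniform over [-sqrt(2/n_in), sqrt(2/n_in)], scaled by fan-in
    # so deeper inputs get smaller initial weights.
    s = math.sqrt(2.0 / n_in)
    return [[random.uniform(-s, s) for _ in range(n_out)]
            for _ in range(n_in)]
```

Every entry lands within the bound, and the matrix has one row per input and one column per output.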
#### `double accuracy_model(model m, data d)` ####
Will be useful to test the accuracy of your model `m` on a dataset `d`.