RNN: averaging noisy samples yields less noisy results.
RNN - music input: each row is a new note, like a beat in the song, so the number of rows is the number of beats in the song. The columns are mostly zero except for one entry in each row: the note, one-hot encoded.
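A minimal sketch of that encoding, assuming a made-up seven-note vocabulary (`NOTES`) and a hypothetical helper `one_hot_song`; a real dataset would define its own note set.

```python
# Hypothetical note vocabulary; a real song would define its own.
NOTES = ["C", "D", "E", "F", "G", "A", "B"]

def one_hot_song(song, vocab):
    """Return one row per beat; each row has a single 1 at the note's index."""
    index = {note: i for i, note in enumerate(vocab)}
    rows = []
    for note in song:
        row = [0] * len(vocab)   # mostly zeros...
        row[index[note]] = 1     # ...except the column of the note played
        rows.append(row)
    return rows

song = ["E", "D", "C", "D"]      # four beats, so four rows
encoded = one_hot_song(song, NOTES)
for row in encoded:
    print(row)
```

The number of rows equals the number of beats, and the number of columns equals the size of the note vocabulary.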
Timesteps: the length of the song segment we use for training. Cutting the song into segments of this length ensures every training sequence has the same length.
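One way this slicing could look, as a sketch only: `make_sequences` and its `step` parameter are hypothetical names, and the integers stand in for one-hot rows.

```python
def make_sequences(rows, timesteps, step=1):
    """Slice a song (list of rows) into windows that all have `timesteps` rows."""
    return [rows[start:start + timesteps]
            for start in range(0, len(rows) - timesteps + 1, step)]

song_rows = list(range(10))    # stand-in for 10 one-hot rows (beats)
sequences = make_sequences(song_rows, timesteps=4, step=2)
for seq in sequences:
    print(seq)
```

Leftover beats at the end that do not fill a whole window are dropped, so every sequence has exactly `timesteps` rows.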
LSTM overcomes the vanishing gradient problem of RNNs. Backpropagation through time can make the gradient too small, so information from early steps is lost; LSTM avoids this loss of information.
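A toy illustration of why the gradient shrinks, assuming a scalar RNN `h_t = tanh(w * h_{t-1})` with no input (a simplification chosen here, not the lecture's model). Backpropagation through time multiplies one chain-rule factor per step, and with `|w| < 1` each factor is below 1.

```python
import math

def gradient_through_time(w, h0, steps):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - tanh^2(w * h_prev))
        grad *= w * (1.0 - h * h)
    return grad

grads = {steps: gradient_through_time(w=0.5, h0=0.9, steps=steps)
         for steps in (1, 10, 50)}
for steps, grad in grads.items():
    print(steps, grad)
```

With `w = 0.5` every factor is at most 0.5, so after 50 steps the gradient with respect to the first hidden state is effectively zero.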
LSTM allows learning across many time steps, e.g. 1000 steps.
The cell is fully differentiable: all its functions (sigmoid, hyperbolic tangent, multiplication, addition) have a derivative, and hence a gradient that can be computed. This makes it easy to use backpropagation and SGD to update the weights.
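A quick sanity check of the two nonlinear derivatives, using the standard closed forms and comparing them against a finite-difference estimate. The helper names (`d_sigmoid`, `d_tanh`, `numeric_derivative`) are made up for this sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    """Closed form: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    """Closed form: tanh'(x) = 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def numeric_derivative(f, x, eps=1e-6):
    """Central finite difference, used only to check the closed forms."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 0.3
print(d_sigmoid(x), numeric_derivative(sigmoid, x))
print(d_tanh(x), numeric_derivative(math.tanh, x))
```

Multiplication and addition are differentiable too, so gradients can flow through every operation in the cell.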
Sigmoid gating is the key to managing what goes into the cell, what is retained within the cell, and what passes to the output. Each sigmoid outputs a value between 0 (block everything) and 1 (let everything through).
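A scalar sketch of one LSTM step using the commonly cited gate equations, to show the three sigmoids at work. The function `lstm_step`, the parameter layout `p`, and all the numbers are illustrative assumptions, not taken from the lecture.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell. `p` maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # what is retained within the cell
    i = sigmoid(pre("input"))         # what goes into the cell
    o = sigmoid(pre("output"))        # what passes to the output
    g = math.tanh(pre("candidate"))   # candidate new content
    c = f * c_prev + i * g            # new cell state (long-term memory)
    h = o * math.tanh(c)              # new hidden state (short-term memory)
    return h, c

# Hypothetical parameters: biases push forget ~ 1, input ~ 0, output ~ 1.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (1.0, 1.0, 0.0),
}
h, c = lstm_step(x=0.7, h_prev=0.1, c_prev=0.5, p=params)
print(h, c)
```

With the forget gate near 1 and the input gate near 0, the cell state stays very close to its previous value of 0.5, which is how the cell can carry information across many steps.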
If the RNN's hidden state is passed as None, the initial hidden state values will all be zero (it is the state that is zeroed, not the weights).
At first the blue line is just flat; it hasn’t learned anything yet. As it learns, it starts to track the red line well, and eventually it gets close. But suddenly, in this Udacity lecture, the graph looks like it flipped upside down?! It is the same graph, flipped for better visualization so that the two graphs look like they track each other nicely on the new axis. The lecturer didn’t point this out, so it looked surprising.
Detaching the hidden variable by assigning hidden.data to a new variable means backpropagation will not go through this detached variable, so gradients do not flow back into earlier batches.
GRU hidden state dimensions: (num_layers, batch_size, hidden_dim)
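A small PyTorch sketch that checks the hidden state shape, and also shows the None and detach behaviour from the notes above. It assumes `torch` is installed; the sizes are arbitrary, and `hidden.detach()` is used here in place of `hidden.data`.

```python
import torch
from torch import nn

num_layers, batch_size, hidden_dim = 2, 3, 8
input_size, seq_len = 5, 4             # arbitrary sizes for illustration

gru = nn.GRU(input_size, hidden_dim, num_layers, batch_first=True)
x = torch.randn(batch_size, seq_len, input_size)

# Passing None as the hidden state behaves like passing all zeros.
out_none, h_none = gru(x, None)
zeros = torch.zeros(num_layers, batch_size, hidden_dim)
out_zero, h_zero = gru(x, zeros)

print(h_none.shape)    # (num_layers, batch_size, hidden_dim)
print(out_none.shape)  # (batch_size, seq_len, hidden_dim) with batch_first=True

# Detach so backpropagation stops at this tensor when it is reused next batch.
hidden = h_none.detach()
print(h_none.requires_grad, hidden.requires_grad)
```

Note that `batch_first=True` only changes the layout of the input and output tensors; the hidden state keeps the `(num_layers, batch_size, hidden_dim)` layout.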
Gated Recurrent Unit
Works well in practice. Has only one working memory, not two (LSTM has long-term and short-term memory). Has an UPDATE GATE (combines the learn and forget gates) and runs the result through a COMBINE GATE.
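For comparison, a scalar sketch of one GRU step in the commonly cited formulation (update gate `z`, reset gate `r`, candidate state). The gate names differ from the lecture's "update" and "combine" picture, and `gru_step`, the parameter layout `p`, and the numbers are assumptions made for this sketch. Libraries also differ on whether `z` or `1 - z` weights the old state; here `z` weights the new candidate.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, p):
    """One step of a scalar GRU. `p` maps gate name -> (w_x, w_h, bias)."""
    def pre(name, h):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h + b

    z = sigmoid(pre("update", h_prev))                   # how much to overwrite
    r = sigmoid(pre("reset", h_prev))                    # how much old state feeds the candidate
    candidate = math.tanh(pre("candidate", r * h_prev))
    return (1.0 - z) * h_prev + z * candidate            # single memory, no separate cell state

base = {"reset": (0.0, 0.0, 0.0), "candidate": (1.0, 1.0, 0.0)}
keep = dict(base, update=(0.0, 0.0, -10.0))     # z ~ 0: keep the old state
replace = dict(base, update=(0.0, 0.0, 10.0))   # z ~ 1: take the candidate

h_keep = gru_step(x=0.7, h_prev=0.3, p=keep)
h_replace = gru_step(x=0.7, h_prev=0.3, p=replace)
print(h_keep, h_replace)
```

There is only one state `h` here, in contrast to the LSTM's separate cell state and hidden state.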