# Hackathon 8

Topics:
- TensorFlow RNNs and Cells
- LSTM

In today's demo, we'll teach an RNN how to speak English.

This is all set up in an IPython notebook, so you can run any code you want to experiment with. Feel free to edit any cell, or add new ones to run your own code.

In [None]:
# we'll start with our library imports...
# tensorflow to specify and run computation graphs
# numpy to run any numerical operations that need to take place outside of the TF graph
import tensorflow as tf
import numpy as np
# these ones let us draw images in our notebook
import matplotlib.pyplot as plt

#### RNN/LSTM theory recap

Recurrent neural networks (RNNs) are computation graphs with loops (i.e., not directed acyclic graphs). Because the backpropagation algorithm only works with DAGs, we have to unroll the RNN through time, applying a copy of the same unit at every timestep. TensorFlow provides code that handles this automatically.

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png">
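As a rough illustration of the unrolling described above, here is a minimal NumPy sketch of a plain tanh RNN applied step by step. It is not TensorFlow's implementation; the function name `unrolled_rnn`, the weight names, and the sizes are all assumptions made for the example.

```python
import numpy as np

def unrolled_rnn(inputs, h0, W_x, W_h, b):
    """Apply the same tanh RNN cell once per timestep (the 'unrolled' graph)."""
    h = h0
    outputs = []
    for x_t in inputs:                        # one copy of the cell per timestep
        h = np.tanh(x_t @ W_x + h @ W_h + b)  # the same weights are reused every step
        outputs.append(h)
    return np.stack(outputs), h

rng = np.random.default_rng(0)
T, INPUT_DIM, HIDDEN_DIM = 5, 3, 4            # illustrative sizes only
inputs = rng.normal(size=(T, INPUT_DIM))
W_x = rng.normal(size=(INPUT_DIM, HIDDEN_DIM))
W_h = rng.normal(size=(HIDDEN_DIM, HIDDEN_DIM))
b = np.zeros(HIDDEN_DIM)

outputs, final_h = unrolled_rnn(inputs, np.zeros(HIDDEN_DIM), W_x, W_h, b)
print(outputs.shape)   # (5, 4)
```

Because the loop has a fixed, finite length, the resulting computation is a DAG, which is what lets backpropagation run over it.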


The most common RNN unit is the LSTM, depicted below:

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png">

We can see that each unit takes 3 inputs and produces 3 outputs: two that are forwarded to the same unit at the next timestep, and one true output, $h_t$, depicted coming out of the top of the cell.

The upper output going to the next timestep is the cell state. It carries long-term information between cells, and is calculated as: 

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-C.png">

where the first term uses the forget gate $f_t$ to scale the previous cell state $C_{t-1}$ (potentially making it smaller to "forget" it), and the second term is the product of the update gate $i_t$ and the candidate state update $\tilde{C}_t$, i.e., $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$. Both the forget and update gates are activated with a sigmoid, so their values lie in (0,1).

The true output and the second, lower output on the diagram are calculated by the output gate:

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-o.png">

First, $o_t$ is calculated from the output of the previous timestep concatenated with the current input. It is then multiplied element-wise by the $\tanh$ of the cell state to get the true output, $h_t = o_t * \tanh(C_t)$. Passing this output on to the next timestep as the hidden state gives the unit a kind of short-term memory.

(Images sourced from [Colah's Blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/))
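The gate equations above can be sketched as a single LSTM timestep in NumPy. This follows the formulation in the linked blog post, with all four gate weight matrices packed into one matrix `W` for brevity; the names `lstm_step`, `W`, `b`, and the sizes are assumptions for illustration, not TensorFlow's internal layout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM timestep. W maps [h_prev, x_t] to the four gate pre-activations."""
    hx = np.concatenate([h_prev, x_t])
    f_pre, i_pre, g_pre, o_pre = np.split(hx @ W + b, 4)
    f_t = sigmoid(f_pre)                  # forget gate, in (0, 1)
    i_t = sigmoid(i_pre)                  # update (input) gate, in (0, 1)
    c_tilde = np.tanh(g_pre)              # candidate state update
    o_t = sigmoid(o_pre)                  # output gate, in (0, 1)
    c_t = f_t * c_prev + i_t * c_tilde    # new cell state (long-term memory)
    h_t = o_t * np.tanh(c_t)              # true output / hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
INPUT_DIM, HIDDEN_DIM = 3, 4              # illustrative sizes only
W = rng.normal(scale=0.1, size=(HIDDEN_DIM + INPUT_DIM, 4 * HIDDEN_DIM))
b = np.zeros(4 * HIDDEN_DIM)

h, c = np.zeros(HIDDEN_DIM), np.zeros(HIDDEN_DIM)
for x_t in rng.normal(size=(5, INPUT_DIM)):   # run five timesteps
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)   # (4,) (4,)
```

Note that `h` and `c` are the two values carried to the next timestep, matching the two horizontal arrows in the diagram.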

We're going to use some Google code to load the [PTB dataset](http://www.fit.vutbr.cz/~imikolov/rnnlm/), a text dataset to teach the RNN to speak English.

In [None]:
import ptb_reader

TIME_STEPS = 20
BATCH_SIZE = 20
DATA_DIR = '/work/cse496dl/shared/hackathon/08/ptbdata'

class PTBInput(object):
  """The input data.
  
  Code sourced from https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py
  """

  def __init__(self, data, batch_size, num_steps, name=None):
    self.batch_size = batch_size
    self.num_steps = num_steps
    self.epoch_size = ((len(data) // batch_size) - 1) // num_steps
    self.input_data, self.targets = ptb_reader.ptb_producer(
        data, batch_size, num_steps, name=name)

raw_data = ptb_reader.ptb_raw_data(DATA_DIR)
train_data, valid_data, test_data, _ = raw_data
train_input = PTBInput(train_data, BATCH_SIZE, TIME_STEPS, name="TrainInput")
print("The time distributed training data: " + str(train_input.input_data))
print("The similarly distributed targets: " + str(train_input.targets))

Each datum is a sequence of 20 (`TIME_STEPS`) successive words from the corpus, and the target is a similar window, but shifted forward by one word. This is set up to train the model to predict, given the preceding words, what the next word in the sequence will be.
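To make the input/target relationship concrete, here is a small pure-Python sketch of this kind of windowing. It mirrors the `epoch_size` formula used in `PTBInput` above, but `make_batches` is a hypothetical helper written for this example and is not part of `ptb_reader`.

```python
def make_batches(data, batch_size, num_steps):
    """Yield (inputs, targets) pairs; targets are inputs shifted forward by one token."""
    batch_len = len(data) // batch_size
    # Split the corpus into batch_size contiguous rows.
    rows = [data[r * batch_len:(r + 1) * batch_len] for r in range(batch_size)]
    epoch_size = (batch_len - 1) // num_steps   # same formula as PTBInput.epoch_size
    for i in range(epoch_size):
        start = i * num_steps
        inputs = [row[start:start + num_steps] for row in rows]
        targets = [row[start + 1:start + num_steps + 1] for row in rows]
        yield inputs, targets

toy_data = list(range(20))   # stand-in for a corpus of word ids
batches = list(make_batches(toy_data, batch_size=2, num_steps=3))
first_inputs, first_targets = batches[0]
print(first_inputs)    # [[0, 1, 2], [10, 11, 12]]
print(first_targets)   # [[1, 2, 3], [11, 12, 13]]
```

The `- 1` in `epoch_size` leaves room for the final target, which sits one position past the last input.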

Initially, each word in the sequence is represented in the data as an integer (notice the shape). This discrete representation fails to capture any semantic relationships between words. For example, the model wouldn't know that "crimson" and "scarlet" are more similar than "red" and "blue". The solution is to learn a word embedding as the first part of the model, transforming each integer into a relatively small, dense vector (as compared to a one-hot vector). Then, similar words will train to have similar embeddings.

We'll use [tf.nn.embedding_lookup](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) to do this, passing it a (usually trainable) VOCAB_SIZE x EMBEDDING_SIZE matrix along with the integer word ids.
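Conceptually, an embedding lookup is just row indexing into the matrix, which gives the same result as multiplying a one-hot vector by it. The NumPy sketch below shows that equivalence on toy sizes; the variable names and sizes are assumptions for the example and it does not call TensorFlow.

```python
import numpy as np

VOCAB, EMBED = 6, 3                             # toy sizes, not the notebook's
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(VOCAB, EMBED))
word_ids = np.array([[4, 0, 2], [1, 1, 5]])     # shape [batch, time]

looked_up = embedding_matrix[word_ids]          # row indexing: [batch, time, EMBED]
one_hot = np.eye(VOCAB)[word_ids]               # [batch, time, VOCAB]
via_matmul = one_hot @ embedding_matrix         # same values, far more work

print(looked_up.shape)                          # (2, 3, 3)
print(np.allclose(looked_up, via_matmul))       # True
```

Indexing avoids materializing the one-hot tensor, which matters when the vocabulary has 10,000 entries.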

In [None]:
VOCAB_SIZE = 10000
EMBEDDING_SIZE = 100

# setup input and embedding
embedding_matrix = tf.get_variable('embedding_matrix', dtype=tf.float32, shape=[VOCAB_SIZE, EMBEDDING_SIZE], trainable=True)
word_embeddings = tf.nn.embedding_lookup(embedding_matrix, train_input.input_data)
print("The output of the word embedding: " + str(word_embeddings))

TensorFlow separates the declaration of [RNNCells](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/RNNCell) from the [RNNs](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn) that run them. In the code below, we declare an [LSTM cell](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicLSTMCell), and create tensors for the inputs to the first unit. We use zeros for the initial hidden state and cell state, but it's also possible to declare trainable variables for these.

In [None]:
LSTM_SIZE = 200 # number of units in the LSTM layer, this number taken from a "small" language model

lstm_cell = tf.contrib.rnn.BasicLSTMCell(LSTM_SIZE)

# Initial state of the LSTM memory.
initial_state = lstm_cell.zero_state(BATCH_SIZE, tf.float32)
print("Initial state of the LSTM: " + str(initial_state))

Then, we'll pass the newly declared cell and the training sequence of word embeddings to [tf.nn.dynamic_rnn](https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn) as the inputs over time to the LSTM. `dynamic_rnn` runs an `RNNCell` using an internal `while` loop, and returns the sequence of outputs from the LSTM at each timestep and the final state of the LSTM.

In [None]:
# setup RNN
outputs, state = tf.nn.dynamic_rnn(lstm_cell, word_embeddings,
                                   initial_state=initial_state,
                                   dtype=tf.float32)
print("The outputs over all timesteps: "+ str(outputs))
print("The final state of the LSTM layer: " + str(state))
logits = tf.layers.dense(outputs, VOCAB_SIZE)

And to calculate the loss between two sequences, we'll import a function from [tf.contrib.seq2seq](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq) called [sequence_loss](https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/sequence_loss). It calculates the weighted cross-entropy loss between the first two arguments, and the third argument provides weights for averaging. We weight uniformly here, but weights could also be calculated based on where in the sequence the target is (e.g., penalize less earlier in the sequence, but more later) or based on the content of the target (e.g., low weight on guessing articles correctly and larger weight on getting nouns and verbs correct).
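As a sketch of what this weighted average computes, the NumPy function below takes logits of shape [batch, time, vocab], integer targets of shape [batch, time], and per-position weights, then averages the cross-entropy over all weighted positions. The name `sequence_loss_np` is hypothetical, and the real TensorFlow function may differ in numerical-stability details.

```python
import numpy as np

def sequence_loss_np(logits, targets, weights):
    """Weighted softmax cross-entropy, averaged over all weighted positions."""
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to each target word.
    batch_idx, time_idx = np.indices(targets.shape)
    nll = -log_probs[batch_idx, time_idx, targets]
    return (nll * weights).sum() / weights.sum()

B, T, V = 2, 3, 5                              # toy batch, time, and vocab sizes
targets = np.array([[0, 1, 2], [3, 4, 0]])
uniform_logits = np.zeros((B, T, V))           # a model that guesses uniformly
weights = np.ones((B, T))                      # uniform weighting, as in the notebook

loss = sequence_loss_np(uniform_logits, targets, weights)
print(np.isclose(loss, np.log(V)))             # True
```

A uniform guess over `V` words gives a loss of `log(V)`, so for the 10,000-word vocabulary an untrained model should start near `log(10000)`, roughly 9.2. Setting a weight to zero removes that position from the average, which is also how padding is usually masked.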

We'll optimize using TensorFlow's [RMSProp](https://www.tensorflow.org/api_docs/python/tf/train/RMSPropOptimizer) optimizer, which requires an explicit learning rate but is otherwise used as usual. We switch from the Adam optimizer because its adaptive updates can interact badly with the recurrent gradients. (Note that RMSProp also adapts per-parameter step sizes using a running average of squared gradients; what it leaves out by default is Adam's momentum term and bias correction.)
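For intuition, here is a minimal pure-Python sketch of the textbook RMSProp update on a one-dimensional quadratic. The function `rmsprop_minimize` and its hyperparameter values are assumptions for illustration; TensorFlow's optimizer differs in details such as accumulator initialization and where epsilon is added.

```python
import math

def rmsprop_minimize(grad_fn, w, learning_rate=0.01, decay=0.9, eps=1e-8, steps=2000):
    """Textbook RMSProp: divide each step by a running RMS of recent gradients."""
    mean_square = 0.0
    for _ in range(steps):
        g = grad_fn(w)
        # Exponential moving average of squared gradients.
        mean_square = decay * mean_square + (1.0 - decay) * g * g
        # Step size is roughly learning_rate regardless of gradient scale.
        w -= learning_rate * g / (math.sqrt(mean_square) + eps)
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = rmsprop_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(abs(w_final - 3.0) < 0.05)   # True
```

Because the gradient is divided by its own recent magnitude, very large or very small gradients produce steps of similar size, which is one reason this family of optimizers is popular for recurrent models.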

In [None]:
LEARNING_RATE = 1e-4

loss = tf.contrib.seq2seq.sequence_loss(
    logits,
    train_input.targets,
    tf.ones([BATCH_SIZE, TIME_STEPS], dtype=tf.float32),
    average_across_timesteps=True,
    average_across_batch=True)

optimizer = tf.train.RMSPropOptimizer(LEARNING_RATE)
train_op = optimizer.minimize(loss)

Then, before trying to run any operations, we need code to start the queue runners. The Google code we're using internally uses TensorFlow queues to load the data rather than input placeholders. This means that, rather than passing a `feed_dict` to each `run` call, we have to start the queue runners which spin up CPU threads to enqueue data. We declare a [Coordinator](https://www.tensorflow.org/api_docs/python/tf/train/Coordinator) and use it to call [tf.train.start_queue_runners](https://www.tensorflow.org/api_docs/python/tf/train/start_queue_runners) which does the job. If you're using someone's TF code and notice that the program hangs at the first `run` call, chances are that no queue_runner has been started (I've done this a few times in my own code...).
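The producer/consumer pattern behind queue runners can be illustrated with Python's standard `queue` and `threading` modules. This is only an analogy and not how TensorFlow implements queues; the `producer` function and the `stop_event` standing in for a coordinator are assumptions made for the example.

```python
import queue
import threading

def producer(batches, q, stop_event):
    """Background thread: enqueue batches until done or asked to stop."""
    for batch in batches:
        if stop_event.is_set():
            break
        q.put(batch)

batch_queue = queue.Queue(maxsize=2)
stop_event = threading.Event()      # plays the role of a coordinator's stop signal
worker = threading.Thread(
    target=producer,
    args=([[1, 2], [3, 4], [5, 6]], batch_queue, stop_event),
    daemon=True)

# Without this start() call, the first get() below would wait on an empty queue,
# which is the same symptom as a run call hanging when no queue runner is started.
worker.start()

received = [batch_queue.get(timeout=5) for _ in range(3)]
stop_event.set()
worker.join(timeout=5)
print(received)   # [[1, 2], [3, 4], [5, 6]]
```

The consumer never passes data in explicitly; it only pulls from the queue, which is why no `feed_dict` is needed in the notebook's `run` calls.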

In [None]:
session = tf.Session()
session.run(tf.global_variables_initializer())

# start queue runners
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=session, coord=coord)

# retrieve some data to look at
examples = session.run([train_input.input_data, train_input.targets])
# we can run the train op as usual
_ = session.run(train_op)

print("Example input data:\n" + str(examples[0][1]))
print("Example target:\n" + str(examples[1][1]))

## Hackathon 8 Exercise

Use a TF [Bidirectional RNN](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/stack_bidirectional_dynamic_rnn) to create a 2-layer [LSTMCell](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/BasicLSTMCell) network with an [Attention Wrapper](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/AttentionCellWrapper), just so you can say you did it once. Your code should use the `train_input` inputs as above, use different cells for the forward and backward directions, and finish with the `loss` tensor. This should be pretty easy with the TensorFlow documentation.

This model is very large and trains for a long time, so please don't try to optimize it in this notebook.

In [None]:
# Your code here