# Hackathon 9

Topics:
- Neural Machine Translation (NMT)
- Sequence to Sequence (Seq2Seq)
- Attention in NMT

In today's demo, we'll teach a model to translate from English to Vietnamese (the architecture walkthrough below uses French as its example target language). Significant portions are sourced from [TensorFlow's documentation](https://www.tensorflow.org/tutorials/seq2seq).

This is all set up in an IPython notebook, so you can run any code you want to experiment with. Feel free to edit any cell, or add some to run your own code.

In [1]:
# we'll start with our library imports...
# tensorflow to specify and run computation graphs
# numpy to run any numerical operations that need to take place outside of the TF graph
import tensorflow as tf
import numpy as np
# these ones let us draw images in our notebook
import matplotlib.pyplot as plt

## Neural Machine Translation

Researchers at Google have pioneered Neural Machine Translation (NMT), a method of translating between natural languages using RNN-based Seq2Seq models. Previous models relied on memorizing phrase translations and combining them without regard for context. This led, famously, to "The spirit is willing, but the flesh is weak" being translated from English to Russian and back to English as "The vodka is good, but the meat is rotten". NMT uses context and learned experience to translate using deep networks.

NMT relies on a model called Sequence to Sequence (Seq2Seq). Basically, an encoder reads the input sentence one word at a time and outputs a fixed-length "thought vector". Then, a decoder unrolls the thought vector into a sentence in the target language.

<img src="https://www.tensorflow.org/images/seq2seq/encdec.jpg">

Specifically, the English encoder takes one input word per timestep and isn't expected to produce any output while it reads the input sentence. Once it has read the full sentence, we take its final state as the thought vector and use that as the initial state of the French decoder. During training, the decoder is then fed the target sentence one word at a time as it predicts the next word of the output sentence.

<img src="https://www.tensorflow.org/images/seq2seq/seq2seq.jpg">
Here, `<s>` is a special token that marks the start of the decoding process, while `</s>` is a special token that tells the decoder to stop.

Just as in last week's hackathon, the first thing we'll do is a word embedding by mapping from integers to learned vectors. This week, we'll be working with time-major data. This simply switches the `batch` and `time` dimensions, but we need to carefully track this in the code to ensure correctness. We have some code here to load vocabularies and data from source and target languages. The data is an English-Vietnamese corpus of TED talks (133K sentence pairs) provided by the IWSLT Evaluation Campaign.
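
To make the time-major layout concrete, here's a minimal NumPy sketch. The two toy sentences and their ids are made up for illustration and aren't taken from the corpus.

```python
import numpy as np

# Two toy sentences already encoded as integer ids (hypothetical values).
sentences = [[4, 7, 9], [5, 2]]
lengths = [len(s) for s in sentences]

# Batch-major layout: one row per sentence, zero-padded on the right.
batch_major = np.zeros([len(sentences), max(lengths)], dtype=np.int32)
for i, s in enumerate(sentences):
    batch_major[i, :len(s)] = s

# Time-major layout: one row per timestep, one column per sentence.
time_major = batch_major.T

print(batch_major.shape)  # (2, 3) = [batch, time]
print(time_major.shape)   # (3, 2) = [time, batch]
print(time_major[0])      # first word of every sentence: [4 5]
```

Reading a row of the time-major array gives you the word at one timestep for every sentence in the batch, which is the slice an RNN consumes at each step.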

In [2]:
import codecs

def load_data(inference_input_file, hparams=None):
  """Load inference data."""
  with codecs.getreader("utf-8")(
      tf.gfile.GFile(inference_input_file, mode="rb")) as f:
    inference_data = f.read().splitlines()

  if hparams and hparams.inference_indices:
    inference_data = [inference_data[i] for i in hparams.inference_indices]

  return inference_data


def get_sequences(src_data, dst_data, src_encode_fn, dst_encode_fn, num, go_symbol, stop_symbol):
    # pull from source data (randint's upper bound is exclusive)
    decode_ids = np.random.randint(0, len(src_data), size=num)
    sentences = [src_data[x].split() for x in decode_ids]
    input_lengths = [len(s) for s in sentences]
    
    # pull from target data
    target_inputs = [[go_symbol] + dst_data[x].split() for x in decode_ids]
    target_outputs = [dst_data[x].split() + [stop_symbol] for x in decode_ids]
    target_lengths = [len(s) for s in target_inputs]

    # pad sequences to uniform lengths and be time-major
    def zero_pad(sequences, lengths):
        padded = np.zeros([np.max(lengths), len(sequences)], dtype=np.int32)
        for (i, s) in enumerate(sequences):
            padded[0:len(s), i] = s
        return padded
    
    input_sentences = zero_pad(src_encode_fn(sentences), input_lengths)
    target_inputs = zero_pad(dst_encode_fn(target_inputs), target_lengths)
    target_outputs = zero_pad(dst_encode_fn(target_outputs), target_lengths)
    
    return input_sentences, input_lengths, target_inputs, target_outputs, target_lengths


def code(vocab_dict, sequences):
    """Map every item through vocab_dict (vocab: word -> int, inverted vocab: int -> word)."""
    # items missing from vocab_dict fall back to the dict's first value
    return [[vocab_dict.get(x, list(vocab_dict.values())[0]) for x in s] for s in sequences]


list_to_dict = lambda l: {k: v for k, v in zip(l, range(len(l)))}
invert_dict = lambda d: {v: k for k, v in d.items()}

# load source vocab and data
base_path = '/work/cse496dl/shared/hackathon/09/'
src_vocab = list_to_dict(load_data(base_path + 'vocab.en'))
encode_src = lambda s: code(src_vocab, s)
decode_src = lambda s: code(invert_dict(src_vocab), s)
src_data = load_data(base_path + 'train.en')

# load target vocab and data
dst_vocab = list_to_dict(load_data(base_path + 'vocab.vi'))
encode_dst = lambda s: code(dst_vocab, s)
decode_dst = lambda s: code(invert_dict(dst_vocab), s)
dst_data = load_data(base_path + 'train.vi')

# set constants for later
SRC_VOCAB_SIZE = len(src_vocab)
DST_VOCAB_SIZE = len(dst_vocab)
GO_SYMBOL = '<s>'
END_SYMBOL = '</s>'
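
As a quick sanity check of how the encode and decode helpers behave, here's a self-contained toy version. The five-word vocabulary is hypothetical, and the sketch assumes the first vocabulary entry is an unknown-word token, which is what makes the "fall back to the first value" trick sensible.

```python
# Toy vocabulary file contents; the first entry doubles as the fallback.
toy_vocab_lines = ["<unk>", "<s>", "</s>", "hello", "world"]

word_to_id = {word: idx for idx, word in enumerate(toy_vocab_lines)}
id_to_word = {idx: word for word, idx in word_to_id.items()}

def encode(sentences):
    """Map words to ids, sending unknown words to the id of '<unk>'."""
    return [[word_to_id.get(w, word_to_id["<unk>"]) for w in s] for s in sentences]

def decode(id_sequences):
    """Map ids back to words."""
    return [[id_to_word.get(i, "<unk>") for i in seq] for seq in id_sequences]

encoded = encode([["<s>", "hello", "there", "world"]])
decoded = decode(encoded)
print(encoded)  # [[1, 3, 0, 4]]
print(decoded)  # [['<s>', 'hello', '<unk>', 'world']]
```

Note that the round trip is lossy: "there" isn't in the toy vocabulary, so it comes back as `<unk>`.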

Just as in the last hackathon, we'll use learned word embeddings for both source and target languages. Unlike the last hackathon, we're not using anyone else's code for loading the data, so we'll feed it in through placeholders.

In [3]:
EMBEDDING_SIZE = 100
MAX_TIME = 20

tf.reset_default_graph()
encoder_inputs = tf.placeholder(tf.int32, shape=[None, None])
source_sequence_length = tf.placeholder(tf.int32, shape=[None])

# Embedding
src_embedding_matrix = tf.get_variable('src_embedding_matrix', dtype=tf.float32,
                                   shape=[SRC_VOCAB_SIZE, EMBEDDING_SIZE], trainable=True)

# Look up embedding:
#   encoder_inputs: [max_time, batch_size]
#   encoder_emb_inp: [max_time, batch_size, embedding_size]
encoder_emb_inp = tf.nn.embedding_lookup(src_embedding_matrix, encoder_inputs)

# decoder placeholders
decoder_inputs = tf.placeholder(tf.int32, shape=[None, None])
decoder_lengths = tf.placeholder(tf.int32, shape=[None])

# Embed decoder input
dst_embedding_matrix = tf.get_variable('dst_embedding_matrix', dtype=tf.float32,
                                   shape=[DST_VOCAB_SIZE, EMBEDDING_SIZE], trainable=True)
decoder_emb_inp = tf.nn.embedding_lookup(dst_embedding_matrix, decoder_inputs)
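
If the shapes in the comments above feel abstract, the lookup can be mimicked in NumPy, where it's plain integer-array indexing. This is only a sketch with made-up sizes and a random matrix standing in for the learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_embedding_size = 6, 4  # illustrative sizes only
embedding_matrix = rng.normal(size=(toy_vocab_size, toy_embedding_size))

# Time-major ids: 3 timesteps, batch of 2 sentences.
ids = np.array([[4, 5],
                [1, 2],
                [3, 0]])

# Indexing with an integer array picks one embedding row per id.
embedded = embedding_matrix[ids]

print(embedded.shape)  # (3, 2, 4) = [max_time, batch_size, embedding_size]
```

The leading two dimensions of the ids are preserved and the embedding dimension is appended, so time-major ids give time-major embeddings.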

We'll declare the `LSTMCell` just as in the last hackathon, but this time we'll make sure to pass the `time_major` and `sequence_length` arguments to `dynamic_rnn`. We can use the default zero state to start the LSTM by passing the data type argument.

In [4]:
NUM_UNITS = 200

# Build RNN cell
encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(NUM_UNITS)

# Run Dynamic RNN
#   encoder_outputs: [max_time, batch_size, num_units]
#   encoder_state: LSTMStateTuple (c, h), each of shape [batch_size, num_units]
encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_emb_inp,
    sequence_length=source_sequence_length, time_major=True, dtype=tf.float32)

We use the same seq2seq `sequence_loss` again, but this time we dynamically set the batch size and time steps to allow variable-shaped inputs.

Google Brain's seq2seq code uses `Helper` objects to do dynamic decoding. Factoring the decoding code out from the rest allows us to easily switch between feeding the ground-truth target words during training (teacher forcing) and greedy decoding during inference. We use a flag to set which one to use.

Then, the seq2seq code uses `BasicDecoder`, which takes all the inputs the decoding process needs, and `dynamic_decode` to run the decoder RNN.

In [5]:
MODE = "train"

# Helper
if MODE == "train":
  helper = tf.contrib.seq2seq.TrainingHelper(
    inputs=decoder_emb_inp,
    sequence_length=decoder_lengths,
    time_major=True)
elif MODE == "infer":
  # GreedyEmbeddingHelper expects integer token ids rather than strings
  # (this assumes '<s>' and '</s>' are entries in vocab.vi)
  helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
      embedding=dst_embedding_matrix,
      start_tokens=tf.fill([tf.shape(encoder_inputs)[1]], dst_vocab[GO_SYMBOL]),
      end_token=dst_vocab[END_SYMBOL])

# Build RNN cell and projection layer
decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(NUM_UNITS)
projection_layer = tf.layers.Dense(DST_VOCAB_SIZE, use_bias=False, name="output_projection")

# Decoder
decoder = tf.contrib.seq2seq.BasicDecoder(
    decoder_cell, helper, encoder_state,
    output_layer=projection_layer)
# Dynamic decoding
(final_outputs, _, _) = tf.contrib.seq2seq.dynamic_decode(decoder, output_time_major=True)
logits = final_outputs.rnn_output
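
The difference between the two helpers is easiest to see in a toy setting. The sketch below isn't the seq2seq library's code: `toy_step` and its score table are hypothetical stand-ins for a decoder step, and a real decoder would also carry its recurrent state along.

```python
# Hypothetical stand-in for one decoder step: given the previous word id,
# return a score per vocabulary entry.
TOY_SCORES = {
    1: [0.0, 0.0, 0.1, 0.9, 0.2],  # after <s> (id 1) prefer id 3
    3: [0.0, 0.0, 0.2, 0.1, 0.8],  # after id 3 prefer id 4
    4: [0.0, 0.0, 0.9, 0.1, 0.1],  # after id 4 prefer </s> (id 2)
}
GO_ID, END_ID = 1, 2

def toy_step(prev_id):
    return TOY_SCORES[prev_id]

def argmax(scores):
    return max(range(len(scores)), key=lambda i: scores[i])

def teacher_forced(target_inputs):
    """Training-style decoding: inputs come from the ground-truth sentence."""
    return [argmax(toy_step(prev_id)) for prev_id in target_inputs]

def greedy_decode(max_steps=10):
    """Inference-style decoding: each prediction becomes the next input."""
    outputs, prev_id = [], GO_ID
    for _ in range(max_steps):
        prev_id = argmax(toy_step(prev_id))
        if prev_id == END_ID:
            break
        outputs.append(prev_id)
    return outputs

print(teacher_forced([GO_ID, 3, 4]))  # [3, 4, 2]
print(greedy_decode())                # [3, 4]
```

During training we already know the target sentence, so every step gets the correct previous word. At inference time there's no target to feed, so the decoder has to consume its own predictions until it emits the end token (or hits a step limit).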

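One of the important steps in training RNNs is gradient clipping. Here, we clip by the global norm: the norm of all gradient tensors taken together (the square root of the sum of their squared L2 norms) is computed, and if it exceeds the maximum, every tensor is scaled down by the same factor so that the global norm equals the maximum. The max value, `MAX_GRADIENT_NORM` in the code below, is often set to a value like 5 or 1.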
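
Here's a small NumPy sketch of global-norm clipping, just to show the arithmetic. It's written to mirror what `tf.clip_by_global_norm` computes, but it's an illustration with made-up gradients, not the library's implementation.

```python
import numpy as np

def clip_by_global_norm(tensors, clip_norm):
    """Scale all tensors by one shared factor so their joint norm <= clip_norm."""
    global_norm = np.sqrt(sum(np.sum(t ** 2) for t in tensors))
    # The factor is 1 when the global norm is already within the limit.
    scale = clip_norm / max(global_norm, clip_norm)
    return [t * scale for t in tensors], global_norm

# Two toy "gradients" of different shapes; together their norm is 5.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm = clip_by_global_norm(grads, clip_norm=1.0)

print(norm)        # 5.0
print(clipped[0])  # approximately [0.6, 0.0]
print(clipped[1])  # approximately [[0.0, 0.8]]
```

Because every tensor is scaled by the same factor, the direction of the overall update is preserved and only its size shrinks.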

Google Brain uses standard SGD (`tf.train.GradientDescentOptimizer`) with a decreasing learning rate schedule, which yields better performance but takes a lot of tuning to work, so we'll stick with the `RMSProp` optimizer.

In [6]:
MAX_GRADIENT_NORM = 1.0
LEARNING_RATE = 1e-4

decoder_outputs = tf.placeholder(tf.int32, shape=[None, None])

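dynamic_time_steps = tf.shape(logits)[0]
# logits and decoder_outputs are both time-major, so the weights are too;
# the mask zeroes out padded positions so they don't contribute to the loss
target_weights = tf.transpose(tf.sequence_mask(
    decoder_lengths, maxlen=dynamic_time_steps, dtype=tf.float32))
train_loss = tf.contrib.seq2seq.sequence_loss(
    logits,
    decoder_outputs,
    target_weights,
    average_across_timesteps=True,
    average_across_batch=True)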

# Calculate and clip gradients
params = tf.trainable_variables()
gradients = tf.gradients(train_loss, params)
clipped_gradients, _ = tf.clip_by_global_norm(
    gradients, MAX_GRADIENT_NORM)

# Optimization
optimizer = tf.train.RMSPropOptimizer(LEARNING_RATE)
update_step = optimizer.apply_gradients(
    zip(clipped_gradients, params))
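
To see what a sequence loss averages over, here's a NumPy sketch of a time-major cross-entropy that ignores padded positions. The function name and the toy inputs are ours, and this is a simplified illustration rather than the `sequence_loss` implementation.

```python
import numpy as np

def masked_sequence_loss(logits, targets, lengths):
    """Mean cross-entropy over real (non-padded) positions.

    logits:  [max_time, batch_size, vocab_size]
    targets: [max_time, batch_size] integer ids
    lengths: [batch_size] true length of each target sequence
    """
    max_time, batch_size, _ = logits.shape
    # Log-softmax over the vocabulary axis, shifted for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to each target id.
    t_idx, b_idx = np.meshgrid(np.arange(max_time), np.arange(batch_size), indexing="ij")
    nll = -log_probs[t_idx, b_idx, targets]
    # mask[t, b] is 1 while timestep t is inside sequence b, else 0.
    mask = (np.arange(max_time)[:, None] < np.asarray(lengths)[None, :]).astype(np.float64)
    return (nll * mask).sum() / mask.sum()

logits = np.zeros([3, 2, 4])  # uniform predictions over 4 words
targets = np.array([[1, 2], [3, 0], [0, 0]])
loss = masked_sequence_loss(logits, targets, lengths=[2, 1])
print(np.isclose(loss, np.log(4)))  # True
```

With uniform predictions over four words, every real position costs log(4), so the masked mean is log(4) no matter how much padding the batch contains.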

Finally, we use the `get_sequences` function we declared so long ago to get a minibatch from the corpus and run the update step.

In [7]:
BATCH_SIZE = 10

session = tf.Session()
session.run(tf.global_variables_initializer())

src_sequences, src_lens, dst_inputs, dst_outputs, dst_lens = get_sequences(src_data, dst_data, encode_src, encode_dst, BATCH_SIZE, GO_SYMBOL, END_SYMBOL)
feed_dict = {encoder_inputs: src_sequences, source_sequence_length: src_lens,
             decoder_inputs: dst_inputs, decoder_lengths: dst_lens,
             decoder_outputs: dst_outputs}

# we can run the train op as usual
_ = session.run(update_step, feed_dict)

## Attention in NMT

Now we'll look at the details of an attention system described in ([Luong et al., 2015](https://arxiv.org/abs/1508.04025)) which is commonly used in NMT systems.

<img src="https://www.tensorflow.org/images/seq2seq/attention_mechanism.jpg">

<img src="https://www.tensorflow.org/images/seq2seq/attention_vis.jpg" width=75%>

1. The current target hidden state is compared with all source states to derive **attention weights** (can be visualized as in the figure immediately above).
2. Based on the attention weights we compute a **context vector** as the weighted average of the source states.
3. Combine the context vector with the current target hidden state to yield the **final attention vector**.
4. The attention vector is fed as an input to the next time step (input feeding).

The first three steps can be summarized by the equations below:

<img src="https://www.tensorflow.org/images/seq2seq/attention_equation_0.jpg" width=80%>

Once computed, the attention vector $a$ is used to derive the softmax logit and loss. This is similar to the target hidden state at the top layer of a vanilla seq2seq model. The score function and the function $f$ can take other forms:

<img src="https://www.tensorflow.org/images/seq2seq/attention_equation_1.jpg" width=80%>

Practically, the code for using this attention mechanism in the decoder is below.

In [8]:
# attention_states: [batch_size, max_time, num_units], transposing because we used time major above
attention_states = tf.transpose(encoder_outputs, [1, 0, 2])

# Create an attention mechanism
attention_mechanism = tf.contrib.seq2seq.LuongAttention(
    NUM_UNITS, attention_states,
    memory_sequence_length=source_sequence_length)

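# Wrap the decoder cell
decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
    decoder_cell, attention_mechanism,
    attention_layer_size=NUM_UNITS)

# The wrapped cell's state is an AttentionWrapperState, so a decoder built on it
# needs a matching initial state that still starts from the encoder's final state
decoder_initial_state = decoder_cell.zero_state(
    tf.shape(encoder_inputs)[1], tf.float32).clone(cell_state=encoder_state)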
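
To connect the library call back to the first three attention steps listed earlier, here's a NumPy sketch of a single decoder step using a multiplicative score. The random weights and the tiny sizes are placeholders, and the variable names (`W_score`, `W_c`) are our own, so treat it as an illustration of the computation rather than what `LuongAttention` does internally.

```python
import numpy as np

def luong_attention(target_state, source_states, W_score, W_c):
    """One decoder step of multiplicative (Luong-style) attention.

    target_state:  [num_units]                current decoder hidden state
    source_states: [src_len, num_units]       encoder outputs, one per source word
    W_score:       [num_units, num_units]     weights of the score function
    W_c:           [num_units, 2 * num_units] weights mixing context and state
    """
    # 1. Score every source state against the target state, then softmax.
    scores = (target_state @ W_score) @ source_states.T
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # 2. Context vector: attention-weighted average of the source states.
    context = weights @ source_states
    # 3. Attention vector: combine the context with the target state.
    attention_vector = np.tanh(W_c @ np.concatenate([context, target_state]))
    return weights, context, attention_vector

rng = np.random.default_rng(0)
num_units, src_len = 4, 3
weights, context, attention_vector = luong_attention(
    target_state=rng.normal(size=num_units),
    source_states=rng.normal(size=(src_len, num_units)),
    W_score=rng.normal(size=(num_units, num_units)),
    W_c=rng.normal(size=(num_units, 2 * num_units)))

print(weights.shape)           # (3,) one weight per source word
print(attention_vector.shape)  # (4,)
```

The attention weights sum to one across the source words, which is what lets the context vector be read as a weighted average of the encoder outputs.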

# No exercise this week!

Instead, finish strong on homework 3 and make some progress on your projects. The next hackathon will be the first project help session. You should show up to work with your group, ask questions, and get my help with any roadblocks you've encountered.

Have a good spring break!