Language Translation with Recurrent Neural Networks (RNNs)
Have you ever wondered how machine translation systems like Google Translate work? These systems use advanced algorithms to translate text from one language to another. One such algorithm is the Recurrent Neural Network (RNN), a type of neural network designed to learn from sequential data such as sentences or time series.
In this tutorial, we will use RNNs to build a language translation system. We will implement the system using Python and TensorFlow, a popular machine learning library. By the end of this tutorial, you will have a basic understanding of how RNNs can be used for language translation, and you will have built your own translation system.
Let's get started!
Setup
First, we need to set up our environment. We'll start by importing the necessary libraries:
import tensorflow as tf
import numpy as np
We'll also set the random seed for reproducibility:
tf.random.set_seed(123)
Data
Next, we need some data to train our model. We'll use a small dataset of parallel English and French sentences stored in two plain-text files. Each line in the files contains one sentence, with English sentences in en.txt and their French translations in fr.txt. We'll load the data into memory:
with open("en.txt", "r", encoding="utf-8") as f: input_text = f.read().splitlines()
with open("fr.txt", "r", encoding="utf-8") as f: output_text = f.read().splitlines()
We'll also define the maximum sequence length for input and output sentences, as well as the size of the encoder and decoder RNN layers:
max_input_len = 20
max_output_len = 20
encoder_units = 256
decoder_units = 256
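Hard-coding the maximum lengths works for a toy dataset, but any sentence longer than 20 characters will simply be truncated during preprocessing. If you would rather size the arrays to your data, one option is to derive the limits from the corpus itself (a small sketch; the + 2 leaves room for the start and end tokens we add to the target sentences in the next step):
# Optional: derive the sequence limits from the data instead of hard-coding them
max_input_len = max(len(s) for s in input_text)
max_output_len = max(len(s) for s in output_text) + 2  # room for start/end tokens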
Preprocessing
Before we can train our RNN, we need to preprocess the data. This involves converting the sentences to integer sequences and creating the vocabulary for the input and output languages. We also need to mark where target sentences begin and end: we prepend a tab character ("\t") as a start-of-sequence token and append a newline ("\n") as an end-of-sequence token to each French sentence, since the decoder relies on these markers during both training and inference.
We'll start by adding the tokens and creating the character vocabulary for each language:
# Prepend the start token and append the end token to each target sentence
output_text = ["\t" + s + "\n" for s in output_text]
input_vocab = sorted(set(" ".join(input_text)))
output_vocab = sorted(set(" ".join(output_text)))
input_vocab_size = len(input_vocab)
output_vocab_size = len(output_vocab)
We'll also create dictionaries for converting characters to integers and vice versa:
input_char_to_int = {c: i for i, c in enumerate(input_vocab)}
output_char_to_int = {c: i for i, c in enumerate(output_vocab)}
input_int_to_char = {i: c for i, c in enumerate(input_vocab)}
output_int_to_char = {i: c for i, c in enumerate(output_vocab)}
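To see the mappings in action, here is a quick round trip of the first English sentence through the dictionaries (the exact integers depend on your vocabulary):
# Encode a sentence to integers and decode it back as a round-trip check
sample = input_text[0]
encoded = [input_char_to_int[c] for c in sample]
decoded = "".join(input_int_to_char[i] for i in encoded)
assert decoded == sample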
Now we can convert the sentences to integer sequences:
encoder_input_data = np.zeros((len(input_text), max_input_len), dtype="float32")
decoder_input_data = np.zeros((len(output_text), max_output_len), dtype="float32")
decoder_output_data = np.zeros((len(output_text), max_output_len, output_vocab_size), dtype="float32")
for i, (input_sentence, output_sentence) in enumerate(zip(input_text, output_text)):
    # Truncate sentences that exceed the maximum lengths to avoid index errors
    for t, char in enumerate(input_sentence[:max_input_len]):
        encoder_input_data[i, t] = input_char_to_int[char]
    for t, char in enumerate(output_sentence[:max_output_len]):
        decoder_input_data[i, t] = output_char_to_int[char]
        # The target at step t - 1 is the character the decoder sees at step t
        if t > 0:
            decoder_output_data[i, t - 1, output_char_to_int[char]] = 1.0
Here, we first create three NumPy arrays to hold the integer sequences for the encoder input, decoder input, and decoder output. We then loop over each sentence pair and convert each character to its integer representation using the dictionaries we created earlier. For the decoder output data, we one-hot encode each character, shifted one timestep earlier than the decoder input: at every step the decoder sees the current character and is trained to predict the next one, a scheme known as teacher forcing.
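A quick way to confirm the preprocessing did what we expect is to inspect the array shapes:
print(encoder_input_data.shape)   # (num_sentences, max_input_len)
print(decoder_input_data.shape)   # (num_sentences, max_output_len)
print(decoder_output_data.shape)  # (num_sentences, max_output_len, output_vocab_size)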
Model
Now that our data is preprocessed, we can define the model. We'll use an encoder-decoder architecture, where the encoder RNN processes the input sequence and outputs its final state, and the decoder RNN takes the encoder's final state as input and generates the output sequence.
Here's the code to define the model:
encoder_inputs = tf.keras.layers.Input(shape=(None,))
encoder_embedding = tf.keras.layers.Embedding(input_vocab_size, encoder_units)(encoder_inputs)
encoder_outputs, encoder_state_h, encoder_state_c = tf.keras.layers.LSTM(encoder_units, return_state=True)(encoder_embedding)
decoder_inputs = tf.keras.layers.Input(shape=(None,))
decoder_embedding_layer = tf.keras.layers.Embedding(output_vocab_size, decoder_units)
decoder_embedding = decoder_embedding_layer(decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM(decoder_units, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=[encoder_state_h, encoder_state_c])
decoder_dense = tf.keras.layers.Dense(output_vocab_size, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)
model = tf.keras.models.Model([encoder_inputs, decoder_inputs], decoder_outputs)
First, we define the inputs for the encoder and decoder, and pass them through embedding layers to convert the integer sequences to dense vectors. We then define the LSTM layers for the encoder and decoder, with the encoder LSTM returning its final state. The decoder LSTM also returns its output sequence, which we pass through a dense layer with a softmax activation to produce the final output probabilities. Finally, we define the model using the input and output layers. Note that we keep handles on the decoder's embedding, LSTM, and dense layers so we can reuse their trained weights when we build separate inference models later.
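Before training, it's worth calling model.summary() to confirm the layers are wired up as described:
# Print a layer-by-layer overview of the model, including parameter counts
model.summary()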
Training
Now we can train the model. We'll use the Adam optimizer and categorical cross-entropy loss (which matches our one-hot targets), train for 100 epochs, and hold out 20% of the data for validation.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit([encoder_input_data, decoder_input_data], decoder_output_data, batch_size=64, epochs=100, validation_split=0.2)
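On a small character-level dataset, 100 epochs can easily overfit. If you want to guard against that, Keras callbacks such as EarlyStopping and ModelCheckpoint are a common addition (a sketch; the checkpoint filename is arbitrary):
callbacks = [
    # Stop training once validation loss stops improving
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
    # Keep the best model seen so far on disk
    tf.keras.callbacks.ModelCheckpoint("translator.keras", monitor="val_loss", save_best_only=True),
]
model.fit([encoder_input_data, decoder_input_data], decoder_output_data,
          batch_size=64, epochs=100, validation_split=0.2, callbacks=callbacks)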
Inference
Once the model is trained, we can use it to translate new sentences. The training model always consumes the full target sequence at once, so for inference we need two separate models that reuse the trained layers: an encoder model that maps an input sentence to the encoder's final states, and a decoder model that generates one character at a time from the previous character and states.
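Here is one way to build these inference models, following the standard Keras pattern for seq2seq decoding (the decoder_state_input_h and decoder_state_input_c names are our own; they are new Input layers that feed the decoder its previous states):
# Encoder model for inference: input sequence -> final encoder states
encoder_model = tf.keras.models.Model(
    encoder_inputs, [encoder_outputs, encoder_state_h, encoder_state_c])
# Decoder model for inference: one character plus previous states in,
# next-character probabilities plus updated states out
decoder_state_input_h = tf.keras.layers.Input(shape=(decoder_units,))
decoder_state_input_c = tf.keras.layers.Input(shape=(decoder_units,))
decoder_state_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_lstm_outputs, decoder_state_h, decoder_state_c = decoder_lstm(
    decoder_embedding_layer(decoder_inputs), initial_state=decoder_state_inputs)
decoder_model = tf.keras.models.Model(
    [decoder_inputs] + decoder_state_inputs,
    [decoder_dense(decoder_lstm_outputs), decoder_state_h, decoder_state_c])
With these in place, we can wrap the decoding loop in an infer function that takes an input sentence and returns its translation: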
def infer(sentence):
    # Convert the input sentence to an integer sequence
    encoder_input = np.zeros((1, max_input_len), dtype="float32")
    for t, char in enumerate(sentence[:max_input_len]):
        encoder_input[0, t] = input_char_to_int[char]
    # Encode the input and use the final states to initialize the decoder
    encoder_outputs, encoder_state_h, encoder_state_c = encoder_model.predict(encoder_input)
    decoder_state = [encoder_state_h, encoder_state_c]
    # Start decoding from the start-of-sequence character
    decoder_input = np.zeros((1, 1), dtype="float32")
    decoder_input[0, 0] = output_char_to_int["\t"]
    output_sentence = ""
    for t in range(max_output_len):
        decoder_outputs, state_h, state_c = decoder_model.predict([decoder_input] + decoder_state)
        decoder_state = [state_h, state_c]
        # Greedily pick the most likely next character
        char_index = np.argmax(decoder_outputs[0, 0, :])
        char = output_int_to_char[char_index]
        output_sentence += char
        # Stop at the end-of-sequence character
        if char == "\n":
            break
        # Feed the prediction back in as the next decoder input
        decoder_input[0, 0] = char_index
    return output_sentence
In the infer function, we first convert the input sentence to an integer sequence using the input_char_to_int dictionary. We then use the encoder_model to get the encoder outputs and final states for the input sequence. We set the initial decoder input to the start-of-sequence character and initialize the decoder state with the encoder's final states. We then run the decoding loop, passing the current character and states through the decoder_model to get the next-character probabilities and updated states. We use the output_int_to_char dictionary to convert each prediction back to a character, and stop the loop when we reach the end-of-sequence character.
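With everything in place, translating a sentence is a single call (the quality of the output will depend on how well the model trained, and every character in the input must have appeared in the training data):
# Translate an English sentence; strip the trailing end-of-sequence character
translation = infer("how are you")
print(translation.rstrip("\n"))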
Conclusion
In this blog post, we've seen how to use recurrent neural networks to perform language translation. We've preprocessed the data, defined a model using an encoder-decoder architecture, trained the model using categorical cross-entropy loss and the Adam optimizer, and used the trained model to translate new sentences.
Language translation is just one application of recurrent neural networks; they are also used for many other sequence-modeling tasks in fields such as natural language processing, speech recognition, and time-series analysis.
Full Code
import tensorflow as tf
import numpy as np
# Set the random seed for reproducibility
tf.random.set_seed(123)
# Define the input and output languages
input_lang = "en" # English
output_lang = "fr" # French
# Define the maximum sequence length for input and output sentences
max_input_len = 20
max_output_len = 20
# Define the size of the encoder and decoder RNN layers
encoder_units = 256
decoder_units = 256
# Define the batch size and number of epochs for training
batch_size = 64
epochs = 100
# Load the input and output sentences
with open(f"{input_lang}.txt", "r", encoding="utf-8") as f:
input_text = f.read().splitlines()
with open(f"{output_lang}.txt", "r", encoding="utf-8") as f:
output_text = f.read().splitlines()
# Add a start token ("\t") and an end token ("\n") to each target sentence
output_text = ["\t" + s + "\n" for s in output_text]
# Create the vocabulary for input and output languages
input_vocab = sorted(set(" ".join(input_text)))
output_vocab = sorted(set(" ".join(output_text)))
input_vocab_size = len(input_vocab)
output_vocab_size = len(output_vocab)
# Create the dictionaries for converting characters to integers and vice versa
input_char_to_int = {c: i for i, c in enumerate(input_vocab)}
output_char_to_int = {c: i for i, c in enumerate(output_vocab)}
input_int_to_char = {i: c for i, c in enumerate(input_vocab)}
output_int_to_char = {i: c for i, c in enumerate(output_vocab)}
# Convert the sentences to integer sequences
encoder_input_data = np.zeros((len(input_text), max_input_len), dtype="float32")
decoder_input_data = np.zeros((len(output_text), max_output_len), dtype="float32")
decoder_output_data = np.zeros((len(output_text), max_output_len, output_vocab_size), dtype="float32")
for i, (input_sentence, output_sentence) in enumerate(zip(input_text, output_text)):
    # Truncate sentences that exceed the maximum lengths to avoid index errors
    for t, char in enumerate(input_sentence[:max_input_len]):
        encoder_input_data[i, t] = input_char_to_int[char]
    for t, char in enumerate(output_sentence[:max_output_len]):
        decoder_input_data[i, t] = output_char_to_int[char]
        # Targets are shifted one step earlier than the decoder inputs
        if t > 0:
            decoder_output_data[i, t - 1, output_char_to_int[char]] = 1.0
# Define the encoder RNN model
encoder_inputs = tf.keras.layers.Input(shape=(None,))
encoder_embedding = tf.keras.layers.Embedding(input_vocab_size, encoder_units)(encoder_inputs)
encoder_outputs, state_h, state_c = tf.keras.layers.LSTM(encoder_units, return_state=True)(encoder_embedding)
encoder_states = [state_h, state_c]
# Define the decoder RNN model
decoder_inputs = tf.keras.layers.Input(shape=(None,))
decoder_embedding_layer = tf.keras.layers.Embedding(output_vocab_size, decoder_units)
decoder_embedding = decoder_embedding_layer(decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM(decoder_units, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = tf.keras.layers.Dense(output_vocab_size, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)
# Define the overall model
model = tf.keras.models.Model([encoder_inputs, decoder_inputs], decoder_outputs)
# Compile the model with categorical cross-entropy loss and Adam optimizer
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# Train the model
model.fit([encoder_input_data, decoder_input_data], decoder_output_data,
          batch_size=batch_size, epochs=epochs, validation_split=0.2)