Introduction
One very broad and highly active field of research in AI (artificial intelligence) is NLP: Natural Language Processing. Scientists have been trying to teach machines how to understand and even write natural languages (such as English or Chinese) since the very beginning of computer science and artificial intelligence. One of the founding fathers of artificial intelligence, Alan Turing, suggested this as a possible application for the “learning machines” he imagined as early as the late 1940s (as discussed in a previous article). Other pioneers, such as Claude Shannon, who founded the mathematical theory of information and communication, have also suggested natural languages as a playground for the application of information technology and computer science.
The world has moved on since the days of these early pioneers, and today we use NLP solutions without even realizing it. We live in the world Turing dreamt of, but are scarcely aware of doing so!
The history of NLP is long and complex, involving several techniques once considered state of the art that are now barely remembered. Certain turning points in this history changed the field forever, focusing the attention of thousands of researchers on a single path forward. In recent years, the resources required to experiment and forge new paths in NLP have largely been available only outside academia. Such resources are most available to private hi-tech companies: hardware and large groups of researchers are more easily allocated to a particular task by Google, Facebook and Amazon than by the average university, even in the United States. Consequently, more and more new ideas arise out of big companies rather than universities. In NLP, at least two such ideas followed this pattern: the word2vec and BERT algorithms.
The former is a word embedding algorithm devised by Tomas Mikolov and others in 2013 (the original C++ code can be found here). The latter, BERT, was published by Google researchers in 2018 and, within about a year, was being used in the Google search engine. In both cases, the researchers released their solutions as open source, disclosing results, datasets and, of course, the full code.
Such rapid progress and impact on widely-used products is amazing and worthy of deeper analysis. This article will offer hints for developers who wish to play with this new tool.
Understanding natural language: from linguistics to word embedding
Linguistic rules
Before machine learning methods became effective and popular in the AI community, i.e., before the 1980s, natural language processing usually relied on linguistic and logical theories (and even on the philosophy of language). The distinction between syntax and semantics, for example, is typical of that approach, whose primary concern was representing grammar rules in an effective way. The software then used these rules to analyse and represent natural language texts, in order to classify or summarize them, find answers to questions, translate from one language to another, and so on.
The results achieved were generally poor when compared with the effort required to set up such implementations. The computational element was minimal, consisting of building logical representations of sentences (using languages like Prolog or Lisp), and then applying some expert systems to analyse them.
Information Retrieval and Neural Networks
However, a different type of representation had already been devised for natural language in the IR (Information Retrieval) community. IR was a hot topic in the 1960s, when computers were used mainly to store large archives of data and documents. Given that these computers were rather slow, clever techniques were needed to retrieve information from inside electronic documents. In particular, algebraic models called vector space models were developed, in which documents were associated with vectors in an N-dimensional space. This provided not only a practical indexing of documents, but also offered the chance to locate similar documents in the same region of the space. For example, imagine associating each document with a point on a plane: ideally, nearby points would correspond to similar documents, so that Euclidean distance could be used to measure how similar two documents are.
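As a minimal illustration of the vector space idea (using scikit-learn purely as a convenient tool; it plays no role in the rest of this article), documents can be turned into TF-IDF vectors and compared by their distance:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

documents = [
    "the stock market fell sharply today",
    "shares dropped as the market closed lower",
    "the recipe calls for two cups of flour",
]

# Each document becomes a vector in an N-dimensional space (one dimension per term).
vectors = TfidfVectorizer().fit_transform(documents)

# Nearby vectors should correspond to similar documents:
# the first two rows end up much closer to each other than to the third.
print(euclidean_distances(vectors))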
These techniques allowed users to deal numerically with documents, sentences and words. Once a linguistic object is mapped to a vector, the powerful weapons of numerical analysis and optimization may be applied to it. Neural networks offer a good example of one of the more powerful of such weapons: indeed, a neural network is simply an optimization algorithm that works on vectors, and thus on points in an N-dimensional space.
Nonetheless, generally speaking, neural networks are supervised learning algorithms: a newly set up network is a blank slate; to be properly used, it needs to be trained on a dataset containing both input records and the correct answer for each record.
If we are interested in NLP tasks more sophisticated than classification, e.g., summarization, translation, text generation, etc., the use of standard neural networks requires a lot of labelled training data, which are difficult to produce. For example, to create translations, examples of texts in both the source and target language are needed – and a lot of them, since natural languages have tens of thousands of words, which can be arranged into a number of possible sentences several orders of magnitude larger. That is even before one considers that deep learning algorithms require millions of training cases to be properly trained.
Unsupervised Algorithms: word2vec and BERT
To deal with texts, it is better to devise unsupervised algorithms, which can be fed with raw data taken from the Internet, for example. BERT was initially fed with Wikipedia pages.
How can we design an unsupervised neural network? Neural networks are intrinsically supervised, but if we make some assumptions, we can transform unsupervised datasets into supervised ones. This works for data arranged in series, such as texts, which may be considered series of words.
Consider a text; via some tokenizing process, imagine having the sequence of words which constitute it: w1,…,wN. If the text is long – an article, a short story, a book – most words will appear more than once, but in a certain order. The idea behind word embedding is to associate each word with a certain vector (unlike the vector space model of IR, which associates a vector with an entire document). In this way, the position of the vectors within the N-dimensional space should reflect the contextual relationships between words.
Another way to look at this idea: given a text and its series of consecutive words w1,…,wN, imagine the set of pairs (w1,w2),(w2,w3),…,(wN−1,wN) as a training set, by means of which the network is able to learn the function y = f(x) such that f(wi) = wi+1. Thus, the network learns to predict the word that will follow another given word. This is a non-linear generalization of certain simple Markov algorithms, such as those used to write bullshit generators.
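As a rough sketch (with a naive whitespace split standing in for a real tokenizer), building such a training set amounts to pairing each word with its successor:

text = "the quick brown fox jumps over the lazy dog"
words = text.split()  # a trivial stand-in for a real tokenizer

# Each word becomes an input and the word that follows it becomes the label:
# the network learns a function f such that f(wi) is (approximately) wi+1.
training_pairs = list(zip(words[:-1], words[1:]))
print(training_pairs)  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ...]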
A generalization of this process consists of considering each word in a text as surrounded by a ‘window’ of neighbouring words, and teaching the network to guess the missing word in the middle of that window. This is the process used in the word2vec algorithm to train its network which, notably, is a shallow feed-forward network rather than a deep one. Such algorithms are consequently context-free, as they simply associate vectors with single words.
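In practice, word2vec embeddings are rarely trained by hand; as an illustration only (this assumes the gensim library, version 4.x, which is not used elsewhere in this article), training them on a toy corpus looks roughly like this:

from gensim.models import Word2Vec

# A toy corpus: each sentence is a list of tokens.
corpus = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["a", "lazy", "dog", "sleeps", "all", "day"],
    ["the", "fox", "is", "quick", "and", "brown"],
]

# window is the size of the context window around each word,
# vector_size is the dimension of the embedding space.
model = Word2Vec(corpus, vector_size=16, window=2, min_count=1, epochs=100)

print(model.wv["fox"])                        # the vector associated with a single word
print(model.wv.most_similar("fox", topn=3))   # its nearest neighbours in the embedding space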
More sophisticated algorithms take context into account: the vector associated with a word varies according to its context, rather than remaining the same regardless of the other words surrounding it in the sentence.
Google’s BERT is such an algorithm.
BERT comes into play
BERT is an acronym for Bidirectional Encoder Representations from Transformers. The term bidirectional means that the context of a word is given both by the words that follow it and by the words preceding it. This makes the algorithm hard to train but very effective: exploring the text surrounding each word is computationally expensive, but it allows a deeper understanding of words and sentences.
Unidirectional context-oriented algorithms already exist: a neural network trained on a huge dataset of sentences can learn to predict which word will follow a given sequence of words. However, predicting a word from both the previous and the following words is not an easy task. The only way to do so effectively is to mask some words in a sentence and predict them, e.g., the sentence “the quick brown fox jumps over the lazy dog” might be masked as “the X brown fox jumps over the Y dog” with label (X = quick, Y = lazy) to become a labelled record in a training set of sentences. One can easily derive a training set from a bundle of unsupervised texts by simply masking 15% of the words (as BERT does), and training the neural network to deduce the missing words from the remaining ones.
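A minimal sketch of this masking step (plain Python on whitespace tokens, not BERT's actual data pipeline, and with a hypothetical mask_tokens helper) could look like this:

import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly hide about 15% of the tokens and return (masked tokens, labels)."""
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            masked.append(mask_token)
            labels[i] = tok  # the network must recover this word from its context
        else:
            masked.append(tok)
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
print(mask_tokens(tokens))
# e.g. (['the', '[MASK]', 'brown', 'fox', 'jumps', 'over', 'the', '[MASK]', 'dog'],
#       {1: 'quick', 7: 'lazy'})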
Notice that BERT is truly a deep learning algorithm, while context-free algorithms such as word2vec, based on shallow networks, are arguably not. As a consequence, BERT’s training is very expensive, due to its transformer architecture. Training on a huge body of text – for example, all English-language Wikipedia pages – is a Herculean effort that requires decidedly nontrivial computational power.
As a result, BERT’s creators disentangled the training phase from the fine-tuning phase required to properly apply the algorithm to a specific task. The model is pre-trained once, and then fine-tuned separately for each task.
Luckily, BERT comes with several pre-trained representations already computed. On the GitHub page of the project a number of pre-trained models can be found. However, such representations are not enough to solve a specific problem. The model must be fine tuned to the desired task. The following demonstrates how this is done, using a test example:
BERT: let’s play with it
For our purely explanatory purposes, we will use Python to play with a standard text dataset, the Stanford Sentiment Treebank (presented on Stanford’s ‘Deeply Moving’ page), which contains short movie reviews from the ‘Rotten Tomatoes’ website. The dataset can be downloaded from this page.
We assume that the dataset is stored inside a directory stanfordSentimentTreebank. The sentences are stored in the file datasetSentences.txt as pairs (index, sentence), one per line. A training/testing split can be found in the file datasetSplit.txt, whose rows contain pairs (index, ds), with ds = 1 for training and ds = 2 for testing sentences. The file sentiment_labels.txt contains the label attached to each phrase of the dataset (sentences included) as pairs (phrase index, score), with 0 being the most negative and 1 the most positive sentiment. The list of all phrases along with their indexes is stored in dictionary.txt.
To keep things simple, we’ll focus only on whole sentences, keeping the scores attached to them and leaving out the remaining phrases. The training and testing sets need to be built from those files, a simple process that is demonstrated below. BERT will then be applied to perform a test sentiment analysis on this dataset.
Whatever the task, it is not necessary to pre-train the BERT model, but only to fine-tune a pre-trained model on the specific dataset that relates to the problem we want to use BERT to study. We will try to use such a pre-trained model to perform our simple classification task: more exciting use cases may be found on the GitHub page of the project mentioned above, as well as elsewhere on the Web.
First, we choose the pre-trained model: the BERT GitHub repository offers several choices, and we will use the one known as ‘BERT-tiny’, aka bert_en_uncased_L-2_H-128_A-2.
This pre-trained representation was obtained by converting the training texts to lowercase, with accent markers stripped out. The model is a network with 2 layers and a hidden size of 128, for a total of about 4.4M parameters to train. This is the tiniest model; others include ‘BERT-mini’ (4 layers, hidden size 256), ‘BERT-small’ (4 layers, hidden size 512), ‘BERT-medium’ (8 layers, hidden size 512) and ‘BERT-base’ (12 layers, hidden size 768). Of course, the larger the network architecture, the more computational effort is needed to fine-tune the model. As the purpose of this article is purely explanatory, we’ll stick to the tiniest option, so that anyone can run the following code snippets even without a GPU.
The pre-trained model can be downloaded from the repository and extracted into a local folder. This folder will contain the following files:
bert_config.json
bert_model.ckpt.data-00000-of-00001
bert_model.ckpt.index
vocab.txt
The first file contains the configuration needed to build a network layer from this BERT model, the checkpoint files contain the pre-trained weights (the .data file is by far the largest), and vocab.txt is needed to properly tokenize our texts. The model can be loaded through the BERT library using the methods demonstrated below.
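For instance, the configuration is a plain JSON file whose keys follow the standard BERT configuration format; a quick way to inspect the architecture of the downloaded model (assuming it has been extracted into a folder named bert_layer, as in the code below) is:

import json, os

with open(os.path.join("bert_layer", "bert_config.json")) as f:
    config = json.load(f)

# For BERT-tiny this should report 2 hidden layers, a hidden size of 128,
# 2 attention heads and a vocabulary of roughly 30,000 word pieces.
print(config["num_hidden_layers"], config["hidden_size"],
      config["num_attention_heads"], config["vocab_size"])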
To remain focused on the model, we will assume that our code runs inside a directory containing a folder with those files (named bert_layer in the code below), as well as the directory stanfordSentimentTreebank with our dataset. With that in place, we can run the following programs.
Before setting up the model, our dataset is tokenized according to the format expected by the BERT layers; this can be done via the FullTokenizer class from the BERT package. Next, the tokenizer is fed with each sentence in our dataset, and its result, which is a list of strings, is enclosed between “[CLS]” and “[SEP]” tokens, as required by the BERT implementation. The output of our model will simply be a number between 0 and 1.
import bert
import numpy as np
import os
BERT_PATH = "bert_layer"
VOCAB_TXT = os.path.join(BERT_PATH, "vocab.txt")
DATASET_DIR = "stanfordSentimentTreebank"
SENTENCES = os.path.join(DATASET_DIR, "datasetSentences.txt")
SCORES = os.path.join(DATASET_DIR, "sentiment_labels.txt")
SPLITTING = os.path.join(DATASET_DIR, "datasetSplit.txt")
DICTIONARY = os.path.join(DATASET_DIR, "dictionary.txt")
MAX_LENGTH = 64 # Maximum length (in tokens) of the sequences the model accepts as input
tokenizer = bert.bert_tokenization.FullTokenizer(VOCAB_TXT, do_lower_case = True)
training_set, training_scores = [], []
testing_set, testing_scores = [], []
# For each sentence in the dataset, we tokenize it and store it inside either
# the training or the testing set, according to the value in `datasetSplit.txt`
def read_file(filename):
    with open(filename) as f:
        f.readline()  # skip the heading line
        return f.readlines()
sentences = read_file(SENTENCES)
scores = read_file(SCORES)
splitting = read_file(SPLITTING)
dictionary = read_file(DICTIONARY)
# scores[i] = sentiment score (a float) of the phrase with integer index i
scores = {int(s[:s.index("|")]): float(s[s.index("|")+1:]) for s in scores}
# let splitting[i] = int denoting the kind of dataset (1=training, 2=testing).
splitting = {int(s[:s.index(",")]): int(s[s.index(",")+1:]) for s in splitting}
# let dictionary[s] = phrase index of the corresponding string
dictionary = {s[:s.index("|")]: int(s[s.index("|")+1:]) for s in dictionary}
# Now look up each sentence in the dictionary, retrieve its phrase index and look up
# that index in the scores, building the lists of (tokenized) sentences and scores
for s in sentences:
    i = int(s[:s.index("\t")])  # sentence index, to be matched in splitting
    s = s[s.index("\t") + 1:][:-1]  # extract the sentence (strip the ending "\n")
    if s not in dictionary:
        continue
    ph_i = dictionary[s]  # associated phrase index
    # Tokenize the sentence and put it into the BERT input format
    s = tokenizer.tokenize(s)
    if len(s) > MAX_LENGTH - 2:
        s = s[:MAX_LENGTH - 2]
    s = tokenizer.convert_tokens_to_ids(["[CLS]"] + s + ["[SEP]"])
    if len(s) < MAX_LENGTH:
        s += [0] * (MAX_LENGTH - len(s))
    # Decide in which dataset to store the record
    if splitting[i] == 1:
        training_set.append(s)
        training_scores.append(scores[ph_i])
    else:
        testing_set.append(s)
        testing_scores.append(scores[ph_i])
training_set, training_scores = np.array(training_set), np.array(training_scores)
testing_set, testing_scores = np.array(testing_set), np.array(testing_scores)
Let’s use the BERT model that we downloaded from the GitHub repository. As usual with these kinds of models, fine-tuning requires setting some hyper-parameters, i.e., parameters external to the model, such as the learning rate, the batch size and the number of epochs. Finding the right combination is the nightmare of every ML practitioner, but in BERT’s case we have some suggestions from its inventors:
- Batch size: 8, 16, 32, 64, 128 (in general, the larger the BERT model, the smaller the batch size)
- Learning rate: 3e-4, 1e-4, 5e-5, 3e-5
- Epochs: 4
As usual, we opt for the Adam optimizer, even if it is computationally more expensive. Nothing special is added on top of the BERT layer provided by Google: the BERT output, which has both a sequence dimension and a feature dimension, is averaged over the sequence dimension via the GlobalAveragePooling1D layer, a pooling trick popularized by deep-learning research in the 2010s. Next, the pooled output is fed to a fully connected layer, whose result becomes the output of the network. The summary method of the Keras Model class lets us show the shapes of the layers, and the verbose option of the training method is turned on to monitor performance across the epochs.
Of course, this is a very simple architecture, which may be made more elaborate at will to fit more complex purposes. And, if time and resources permit, it is better to start from a larger pre-trained BERT model.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling1D, Input, Lambda
from tensorflow.keras.optimizers import Adam
# Loads the bert pre-trained layer to plug into our network
bert_params = bert.params_from_pretrained_ckpt(BERT_PATH)
bert_layer = bert.BertModelLayer.from_params(bert_params, name = "BERT")
bert_layer.apply_adapter_freeze()
# We arrange our layers by composing them as functions, with the input layer as the innermost one
input_layer = Input(shape=(MAX_LENGTH,), dtype = 'int32', name = 'input_ids')
output_layer = bert_layer(input_layer)
#output_layer = Dense(128, activation = "tanh")(output_layer)
#output_layer = Dropout(0.5)(output_layer)
#output_layer = Lambda(lambda x: x[:, :, 0])(output_layer) # we drop the second dimension
#output_layer = Dropout(0.5)(output_layer)
#output_layer = Dense(1)(output_layer)
#output_layer = Dropout(0.5)(output_layer)
output_layer = GlobalAveragePooling1D()(output_layer)
output_layer = Dense(128, activation = "relu")(output_layer)
output_layer = Dense(1, activation = "relu")(output_layer)
neural_network = Model(inputs = input_layer, outputs = output_layer)
neural_network.build(input_shape = (None, MAX_LENGTH))
# Load the pre-trained weights into the BERT layer (otherwise it would start from a random initialization)
bert.load_stock_weights(bert_layer, os.path.join(BERT_PATH, "bert_model.ckpt"))
neural_network.compile(loss = "mse", optimizer = Adam(learning_rate = 3e-5))
neural_network.summary()
neural_network.fit(
    training_set,
    training_scores,
    batch_size = 128,
    shuffle = True,
    epochs = 4,
    validation_data = (testing_set, testing_scores),
    verbose = 1
)
And here is the output:
Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_ids (InputLayer) [(None, 64)] 0
_________________________________________________________________
BERT (BertModelLayer) (None, 64, 128) 4369152
_________________________________________________________________
global_average_pooling1d_2 ( (None, 128) 0
_________________________________________________________________
dense_4 (Dense) (None, 128) 16512
_________________________________________________________________
dense_5 (Dense) (None, 1) 129
=================================================================
Total params: 4,385,793
Trainable params: 4,385,793
Non-trainable params: 0
_________________________________________________________________
Train on 8117 samples, validate on 3169 samples
Epoch 1/4
8117/8117 [==============================] - 100s 12ms/sample - loss: 0.0766 - val_loss: 0.0661
Epoch 2/4
8117/8117 [==============================] - 99s 12ms/sample - loss: 0.0629 - val_loss: 0.0635
Epoch 3/4
8117/8117 [==============================] - 101s 12ms/sample - loss: 0.0592 - val_loss: 0.0594
Epoch 4/4
8117/8117 [==============================] - 94s 12ms/sample - loss: 0.0550 - val_loss: 0.0571
<tensorflow.python.keras.callbacks.History at 0x1ba0ef8cf48>
Now we have a simple pre-trained BERT model fine-tuned and trained on our dataset. Let’s use it to check some sentences about imaginary movies: the network will order them from the most negative to the most positive.
some_sentences = [
    "The film is not bad but the actors should take acting lessons",
    "Another chiefwork by a master of western movies",
    "This film is just disappointing: do not waste time on it",
    "Well directed but poorly acted",
    "The movie is well directed and greatly acted",
    "A honest zombie movie with actually no new ideas",
]
print("The following sentences will be sorted from the most negative upward")
for s in some_sentences:
    print("\t", s)
ranked = []  # pairs (rank, sentence)
for s in some_sentences:
    t = tokenizer.tokenize(s)
    if len(t) > MAX_LENGTH - 2:
        t = t[:MAX_LENGTH - 2]
    t = tokenizer.convert_tokens_to_ids(["[CLS]"] + t + ["[SEP]"])
    if len(t) < MAX_LENGTH:
        t += [0] * (MAX_LENGTH - len(t))
    p = neural_network.predict(np.array([t]))[0][0]
    ranked.append((p, s))
print("Network ranking from negative to positive")
for r in sorted(ranked):
    print("\t", r[1])
Here is the output:
The following sentences will be sorted from the most negative upward
The film is not bad but the actors should take acting lessons
Another chiefwork by a master of western movies
This film is just disappointing: do not waste time on it
Well directed but poorly acted
The movie is well directed and greatly acted
A honest zombie movie with actually no new ideas
Network ranking from negative to positive
This film is just disappointing: do not waste time on it
A honest zombie movie with actually no new ideas
The film is not bad but the actors should take acting lessons
Another chiefwork by a master of western movies
Well directed but poorly acted
The movie is well directed and greatly acted
Not bad for the tiniest model with the minimum Keras wrapping…
Of course, this test exercise may be improved in several respects. A categorical multiclass classification could replace our numerical regression, e.g. by mapping the sentiment score x of each sentence to a class using the following cut-offs:
- very negative if 0 ≤ x ≤ 0.2
- negative if 0.2 < x ≤ 0.4
- neutral if 0.4 < x ≤ 0.6
- positive if 0.6 < x ≤ 0.8
- very positive if 0.8 < x ≤ 1
In this case, the output layer should be a softmax, the network should be compiled with a categorical cross-entropy loss, and one might also want to add more layers, dropout, and so on.
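A rough sketch of this variant, reusing the layers and arrays defined in the snippets above (the to_class helper and its simple binning are assumptions of this sketch, and the treatment of the boundary values differs slightly from the cut-offs listed above):

import numpy as np
from tensorflow.keras.layers import Dense, GlobalAveragePooling1D
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

def to_class(score):
    # bin a sentiment score in [0, 1] into 5 classes: 0 = very negative, ..., 4 = very positive
    return min(int(score * 5), 4)

training_classes = np.array([to_class(s) for s in training_scores])
testing_classes = np.array([to_class(s) for s in testing_scores])

# Same BERT backbone as before, but with a 5-way softmax head
x = GlobalAveragePooling1D()(bert_layer(input_layer))
x = Dense(128, activation = "relu")(x)
x = Dense(5, activation = "softmax")(x)

classifier = Model(inputs = input_layer, outputs = x)
classifier.compile(loss = "sparse_categorical_crossentropy",
                   optimizer = Adam(learning_rate = 3e-5),
                   metrics = ["accuracy"])
classifier.fit(training_set, training_classes, batch_size = 128, epochs = 4,
               validation_data = (testing_set, testing_classes), verbose = 1)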
Conclusions
Despite their simplicity of use, BERT models outperform previous NLP tools in several respects: they may be used not only to classify, but also to predict, translate, summarize, and generally to improve the automatic understanding of natural languages. The availability of multiple pre-trained models, including for languages other than English, keeps improving, and, while fine-tuning tasks are much harder than one might imagine after reading this description, I hope that the previous discussion has sufficiently highlighted the practical and theoretical importance of this class of algorithms for machine learning.