Is GPT-3 really the future of NLP?

OpenAI, a major artificial intelligence research laboratory based in San Francisco, has recently published a paper presenting GPT-3 (GPT stands for Generative Pre-trained Transformer), an important upgrade to its well-established language model. Whilst GPT-3 is certainly very impressive, how does it compare with other language models, and what are the implications for companies trying to apply NLP to their data in practice?

GPT-3 is a language model, a term which has a specific meaning within the field of Natural Language Processing (NLP). A language model is a statistical tool that predicts language without understanding it, by mapping the probability with which words follow other words - for instance, how often “wild” is followed by “roses”. The same sort of analysis can then be performed on sentences (“where the wild roses grow”) or even entire paragraphs. Given a prompt, for example “write me lyrics about where wild roses grow in the style of Metallica”, GPT-3 will use the statistical relationships it retained during training to come up with a song that hopefully matches the description. Language models work by finding patterns in human language. They are often used to predict the words spoken in an audio recording, to guess the next word in a sentence, to classify which emails are spam, or to autocomplete the sentence you were writing in a way you hadn’t even thought of.
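To make the statistical idea concrete, here is a toy sketch of my own (not anything from GPT-3 itself): the simplest possible language model, a bigram model, just counts which word follows which and predicts the most frequent continuation.

```python
from collections import Counter, defaultdict

# A bigram "language model": count how often each word follows another,
# then predict the most frequent continuation. Real models like GPT-3
# are vastly more sophisticated, but the statistical idea is the same.
corpus = "where the wild roses grow and the wild wind blows".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the word most often seen after `word`, or None."""
    counts = follows[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "wild" follows "the" twice in this tiny corpus
```

Scale that idea up from word pairs to long contexts, and from counting to a neural network with billions of parameters, and you are in GPT-3 territory.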

GPT-3 is the latest instance of an abundance of pre-trained language models, like Google’s BERT, Facebook’s RoBERTa and Microsoft’s Turing-NLG, meaning that the models (in the form of neural networks) are already trained on massive generic datasets, usually in an unsupervised manner.

But why has GPT-3 attracted so much attention? Mainly because it is so big. Its sheer size gives the model the potential to perform significantly better than current models, but more on this later.


First, some statistics: the largest version of the GPT-3 model contains 175 billion parameters, gained through training on hundreds of billions of words or tokens. These tokens are gathered from publicly available datasets such as Common Crawl (410 billion tokens), WebText2 (19 billion tokens), Books1 and Books2 (67 billion tokens) and Wikipedia (‘only’ 3 billion tokens). It is sometimes easy to forget that, not so long ago, a model trained on Wikipedia alone was considered to be really big.

Language Models are Few-Shot Learners, OpenAI

GPT-3 has its own supercomputer at its disposal for training purposes, hosted in Microsoft’s Azure cloud and consisting of 285,000 CPU cores and 10,000 high-end GPUs. Put another way, if you had access to only one V100 (NVIDIA’s most advanced, commercially available GPU to date), it would take you roughly 355 years, and could cost as much as $12 million, to train GPT-3 … once.

Benchmark examples for NLP


GPT-3 is different from other NLP systems in an important way besides its size. Models similar to GPT-3 are usually trained on a large corpus of text and then fine-tuned to perform a specific task (say, machine translation) and only that task. Taking pre-trained models and fine-tuning them to solve specific problems has become a popular and successful trend in NLP, helping developers shortcut model development and realise benefits more quickly and cost-efficiently.


GPT-3, by contrast, goes one step further and does not require fine-tuning; it seems able to perform a whole range of tasks reasonably well, from writing fiction, poetry and music, to producing working code, cracking jokes, compiling technical manuals and writing convincing news articles. Twitter is bursting with examples of all the things GPT-3 can do, so I am not going to repeat them here.


A great and really funny example (based on a quick Endila office survey) of GPT-3’s writing can be found here. It is interesting to see that text generated by GPT-3 looks great when you read it casually, but look a little closer and it becomes rather nonsensical quite quickly.


And that leads to some more critical assessments:


“GPT-3 often performs like a clever student who hasn’t done their reading trying to bullshit their way through an exam. Some well-known facts, some half-truths, and some straight lies, strung together in what first looks like a smooth narrative.”


Julian Togelius, Associate Professor researching A.I. at NYU


Not many people have had the opportunity to take a closer look at GPT-3 yet, and a lot more will be written about GPT-3-like systems in the future. In the references, however, I have included a really great, down-to-earth description of GPT-3’s architecture and a thorough YouTube video analysing the GPT-3 paper. The video is over an hour long, but worth your time if, after reading this piece, you still want to put proper effort into understanding what all the buzz is about.


For the rest of this article, I would like to offer some of my own observations.

. . .


GPT-3 is not new technology

GPT-3, like its predecessor GPT-2, uses a clever combination of two existing techniques. The first is called “the Transformer”. Transformers were introduced by Google in the context of machine translation to handle sequential data such as natural language, but unlike some of the older, established models such as the GRU or LSTM, they readily allow for massive parallelisation. Parallelisation in this context means distributing the workload over a large number of CPU or GPU processors, reducing the time a certain task takes (almost) linearly with the number of processors available.
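A toy contrast may help show why this matters (my own sketch, not real model code): a recurrent update must run step by step, because each step needs the previous one, while an attention-style computation for each position depends only on the full input, so the positions can be processed independently, on as many cores as you have.

```python
# Illustrative contrast between serial recurrence and parallelisable attention.
tokens = [0.1, 0.2, 0.3, 0.4]

# RNN-style: each step needs the previous hidden state -- inherently serial.
h = 0.0
for x in tokens:
    h = 0.5 * h + x          # toy recurrence, must run in order

# Attention-style: each position's output is computed from the whole
# sequence at once; the per-position results are independent of each
# other and could run on separate processors.
def attend(i):
    weights = [1.0 / len(tokens)] * len(tokens)   # toy uniform attention
    return sum(w * x for w, x in zip(weights, tokens))

outputs = list(map(attend, range(len(tokens))))   # trivially parallelisable
```

The `map` over positions is the crucial point: nothing in one call to `attend` waits on another, which is exactly the property that lets Transformers soak up thousands of GPUs.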


The second technique is called self-supervised learning, and it allows AI models to learn about language by examining billions of pages of publicly available documents: Wikipedia entries, self-published books, instruction manuals, online courses, programming snippets, Bulgarian Oriz pudding recipes, and so on. Again, this is not unique to GPT-3. Other language models like BERT (Google), XLNet (Google, CMU), RoBERTa (Facebook) and previous iterations of GPT rely on essentially the same self-supervised approach.
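The trick behind self-supervision is that raw text supplies its own labels: each position’s “answer” is simply the next token, so no human annotation is needed. A minimal sketch of how (context, target) training pairs fall straight out of plain text:

```python
# Self-supervised next-token objective in miniature: the text itself
# provides the training labels, so no human annotation is required.
text = "transformers learn language by predicting the next token".split()

# Every prefix of the text becomes a context; the following word is the target.
pairs = [(text[:i], text[i]) for i in range(1, len(text))]

for context, target in pairs[:2]:
    print(context, "->", target)
```

Run over hundreds of billions of tokens, this cheap labelling scheme is what makes training at GPT-3’s scale feasible at all.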


It is the combination of these two techniques that allows for GPT-3’s massive scaling. Virtually unlimited computer infrastructure can be utilized, as long as one can pay for it, and there is no need for supervised (human) training, which would be rather impractical for these training data sizes.


GPT-3 is advanced in many ways, but it has its own limitations

The first limitation is that it is hobbled by a limited context window. It cannot ‘see’ beyond this window, which is roughly 500–1,000 words long. This results in a rather short ‘attention span’, meaning it cannot write long passages of coherent text: it soon irreversibly forgets the beginning of its writing, or the context it was written in. This does not matter for tasks where the format is repetitive or limited in size (adding to your Twitter feed, for instance), but it will definitely limit its performance in other areas (writing the next “War and Peace”). Part of the reason for this limitation lies in GPT-3’s use of a technique called Byte Pair Encoding (BPE), adopted for efficiency reasons.
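For the curious, BPE itself is a simple idea: start from individual characters and repeatedly merge the most frequent adjacent pair of symbols, so that common fragments become single tokens. A minimal sketch of the merging step (illustrative only; GPT-3’s actual tokeniser operates on bytes with a merge table learned from its training corpus):

```python
from collections import Counter

# Minimal sketch of Byte Pair Encoding (BPE): repeatedly merge the most
# frequent adjacent pair of symbols, so common fragments become tokens.
def bpe_merges(word, num_merges):
    symbols = list(word)
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)   # fuse the chosen pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(bpe_merges("lowerlower", 3))
```

Because frequent fragments collapse into single tokens, each slot in the context window carries more text, which is exactly the efficiency gain BPE is used for.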


The second limitation, which all autoregressive language models trained with a likelihood loss seem to share, is that when generating free-form completions they have a tendency to eventually fall into repetitive loops of gibberish. The model can gradually go off-topic, diverge from whatever purpose there was for the writing, and become unable to come up with sensible continuations. This behaviour has been observed many times but, as far as I know, has not been fully explained. Its unpredictability can be amusing when playing with the model, but could become dangerous when more is at stake, for example if it were used to generate work orders or HSE documentation.
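A toy example makes the failure mode easy to see (the transition table below is hypothetical, chosen purely for illustration): once greedy, most-likely-next-word decoding enters a cycle, the output repeats indefinitely.

```python
# Toy illustration of the repetition failure mode: greedy decoding from
# a simple next-word model. Once the argmax choices form a cycle, the
# output repeats forever -- a behaviour real autoregressive models can
# also drift into during free-form generation.
transition = {              # hypothetical most-likely next word
    "the": "model",
    "model": "repeats",
    "repeats": "the",       # cycle: the -> model -> repeats -> the ...
}

def greedy_generate(start, steps):
    out, word = [start], start
    for _ in range(steps):
        word = transition[word]
        out.append(word)
    return " ".join(out)

print(greedy_generate("the", 6))
# prints "the model repeats the model repeats the"
```

Real systems mitigate this with sampling tricks (temperature, top-k and the like), but as the article notes, the underlying tendency is not fully understood.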


These two shortcomings remind us that improvements in AI deep-learning performance today are about increasing brute horsepower, not human-like cleverness or understanding. It is impressive to see GPT-3’s remarkable results across a variety of tasks after just giving it a few examples of what to do, but it is sometimes easy to forget that it does not, in any human-like way, understand what it is doing.


Training on the whole internet does not necessarily solve a problem better

It sounds obvious: if we train an NLP model on every text ever produced by humankind, then the results will get better the bigger the model becomes. In practice, however, some or most of the data needed for a particular project is not publicly available, or not in a format suitable for feeding to computers (more specifically, to deep-learning models). Suitable training data can be proprietary, sensitive or simply not freely available. Data-wrangling of various sorts traditionally takes up around 80% of the time spent on a typical AI project and, in my opinion, that percentage will not drop considerably in the near future for many real-world projects.


GPT-3 does a lot of tasks reasonably well, but for most real-world projects we need the best result we can achieve. That still means data-wrangling, careful thought about which algorithms and models to use and, unfortunately, considerable trial and error.

. . .


What does it all mean?

The cost of the resources needed to use GPT-3 makes it inaccessible even for a decent-sized company, and certainly for a single data science researcher. OpenAI will start charging for the use of GPT-3 and the infrastructure it runs on. This is perfectly understandable given the hardware required and the cost of running thousands of GPUs and CPUs.

AIaaS (AI as a Service) looks like a logical next step, providing the same advantages and disadvantages that well-established SaaS models provide, including access to advanced infrastructure, easy scalability and high reliability. Deep-learning models are however notoriously opaque and buying into an AIaaS is like putting a black box inside another black box, providing less opportunity for understanding and mitigating the risks of deployment. So, it is still essential to be able to put a human in the loop, every time an important decision is taken by an AI model.

This makes even more sense when using OpenAI’s API, which takes free-text input. Although it feels very intuitive and you don’t need to be a programmer to use it, it also adds another layer of fuzziness, potentially introducing misunderstandings about what is expected of the model.


The fundamental assumption of the computing industry is that number crunching gets cheaper all the time. Moore’s law, the computing industry’s metronome, has been predicting that the number of components that can be squeezed onto a microchip of a given size (and thus, loosely, the amount of computational power available at a given cost) doubles every two years, dramatically reducing computing cost over time. But that is not necessarily always the case. Ballooning complexity means that costs at the cutting edge of NLP are rising sharply, as we can clearly see with GPT-3. The growth in the size of language models clearly outpaces the growth of GPU memory, the scarcest and most expensive resource these models need, so it is not likely that we will see a GPT-4 model next year that is a couple of orders of magnitude larger than GPT-3.


Some serious thinking will need to go into the commercial reality of cost and the technical reality of how to build a system that still scales easily when it is no longer embarrassingly parallel.

I would also like to touch on how progress in the development of machine learning is measured. The State of the Art (SOTA) is typically assessed by running a number of benchmarks, measuring a single metric, or a small set of performance metrics, on a prescribed dataset. For some examples, and GPT-3’s performance on them, see the table below.

Benchmark examples for NLP


While this is an objective way to compare models, it risks measuring model performance poorly if the metric does not sufficiently cover the task the model will be used for in practice. It also incentivises building models that do really well on a particular benchmark, but not on practical use cases.

Related to this, it is becoming almost unavoidable that models ingesting massive training datasets, such as GPT-3, are trained on some of the benchmarks later used for evaluation. This is similar to giving a student the answers before they take the exam. In their defence, the authors of the GPT-3 paper do acknowledge this, but it is not yet clear how the issue can be avoided.
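The contamination check the authors describe amounts to looking for long overlapping word sequences between benchmark examples and the training data. A hedged sketch of that idea (the function names, example texts and the n-gram length here are my own illustrative choices, not the paper’s exact procedure):

```python
# Sketch of a train/test contamination check: flag a benchmark example
# if it shares any long n-gram with the training corpus.
def ngrams(text, n):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(benchmark_example, training_text, n=8):
    """True if any n-gram of the example also appears in the training text."""
    return bool(ngrams(benchmark_example, n) & ngrams(training_text, n))

train = "the quick brown fox jumps over the lazy dog near the river bank"
test_q = "does the quick brown fox jumps over the lazy dog sound familiar"
print(is_contaminated(test_q, train))  # True: they share an 8-word sequence
```

Even with such checks, filtering every contaminated example out of a corpus the size of Common Crawl is far from solved, which is the point the article makes.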


Last but not least, and even more fundamentally, statistical word-matching is no substitute for a coherent understanding of the world. GPT-3 generates grammatically correct text that is nonetheless unmoored from reality, claiming, for instance, that “it takes two rainbows to jump from Hawaii to 17”. It doesn’t have a built-in internal model of the world — or any world — and so it can’t do the reasoning that such a model requires, something that humans would recognise as common sense.


This means that depending only on an NLP model (or any other deep-learning model, for that matter) to make critical decisions can lead to unexpected and unpredictable outcomes. In computer programming, and engineering in general, planning for and gracefully handling edge cases is typically an important part of the (software) engineer’s job. It is at least as important when using machine learning models to solve important real-world tasks. The problem is that in the non-linear systems where machine learning models are particularly strong, it is not straightforward to judge what an edge case even looks like.


So, my advice would still be to put subject matter experts, data scientists and software engineers in one (virtual) room and make sure that you have guardrails in place so that your AI stays part of a wider system that you understand and can control. Or as the wise Russian proverb goes: Trust, but verify.


This article was (still) written by a human.

. . .



References