ChatGPT: AI Rising

So, in the realm of ones and zeros spun, ChatGPT emerged, a dialogue begun. In the digital tapestry, its story sown, A testament to the code, a creation well-known.

ChatGPT

ChatGPT and other Large Language Models have come to the forefront of AI research and public discourse over the last couple of years. What is this AI technology and where will it take us? I will discuss this in my next three blogs. First: what is the technology behind ChatGPT?

ChatGPT and its brethren, together called Large Language Models, are the latest in a long line of artificial neural networks, one of the approaches to AI that has been pursued since the 1950s. The artificial neural network approach to AI was inspired by the discovery that the human brain and other biological nervous systems are large networks of interconnected cells, neurons, each of which fires an output signal if its input signals together exceed a threshold.

This led AI researchers to the idea that large networks of simple computational units might lead to smart systems. An example of such a simple computational unit is shown below:

This computational unit (also called an “artificial neuron”) takes inputs from four other units, multiplies each input by a weight, adds up the inputs multiplied by the weights, and compares this sum to a threshold. If the sum is greater than the threshold, the unit outputs a value of “1”, otherwise the unit outputs “0”. This output is sent to other units in the network.

How the unit responds to its inputs is controlled by the values of the weights W and the threshold T. The weights and the threshold are referred to as the unit’s parameters.
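
To make this concrete, here is a minimal Python sketch (my own illustration, not from the original post) of the unit described above, with hypothetical weight and threshold values:

```python
# A single artificial neuron: compare the weighted sum of the inputs to a threshold.
def neuron(inputs, weights, threshold):
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0

# Hypothetical parameters for a unit with four inputs.
weights = [0.5, -1.0, 2.0, 0.25]   # the unit's weights W
threshold = 1.0                    # the unit's threshold T

print(neuron([1, 0, 1, 1], weights, threshold))  # 1, because 2.75 > 1.0
print(neuron([0, 1, 0, 0], weights, threshold))  # 0, because -1.0 <= 1.0
```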

Inspired by neural networks in the brain, these simple computational units can be put together into artificial neural networks like the simple one on the left.

Early work with artificial neural networks demonstrated that they could indeed recognize simple patterns, e.g., handwritten text or spoken words. Even more exciting, mathematical techniques were developed to automatically adjust the parameters of the computational units so they gave correct answers, using just examples of inputs and correct outputs. This was the beginning of machine learning, the process by which artificial neural networks are “trained” using training samples.

These early demonstrations of “electronic brains”, systems that could “learn” from examples, got people pretty excited. The term “artificial intelligence” was coined and people began worrying about robot overlords. Note, this was in the 1950s!

As early research progressed, it became clear that these artificial neural networks were quite limited in what they could do. Tasks such as recognizing objects in pictures and translating language seemed just too complicated for artificial neural networks. AI research went in new directions to try to capture the complexities of human intelligence and the world it inhabits.

As the decades marched on, a few die-hard researchers continued to explore artificial neural networks. Finally, about 10 years ago, they had their day. Improvements in computer technology and the availability of large amounts of training data on the internet allowed very large artificial neural networks to be built and trained. It was discovered that these “deep neural networks” performed very well, better than anyone expected.

ChatGPT is an example of one of the most recent developments in deep neural networks. It is huge! Recall that an artificial neural network’s calculations are controlled by parameters that are set through the training process using training samples. The simple network above has 56 parameters. GPT-3, the model behind the original ChatGPT, has about 175 billion parameters, and later versions are reported to be even larger!

Besides its size, ChatGPT has three distinctive features indicated by the “GPT” part of its name:

  • It is Generative – it generates text that completes missing parts of the text that is given as its input
  • It is Pre-trained – its parameters were adjusted by having it predict removed portions of text in a training set that included about 300 billion words
  • It is based on the Transformer architecture, a particular way of connecting the computational units in its deep neural network, built around an operation called attention (sketched below).
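
To give a flavor of the Transformer's core operation, here is a minimal NumPy sketch of scaled dot-product self-attention (an illustrative simplification, not ChatGPT's actual code). Each position in a sequence builds its output as a weighted blend of every position, with the weights computed from the data itself:

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # how strongly each position attends to each other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over positions
    return weights @ V                                 # blend the value vectors

# Toy example: a "sequence" of 3 token vectors, each of dimension 4.
tokens = np.random.default_rng(0).normal(size=(3, 4))
# In a real Transformer, Q, K, and V come from learned linear projections of the tokens.
print(self_attention(tokens, tokens, tokens).shape)    # (3, 4): one blended vector per position
```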

ChatGPT and other Large Language Models have demonstrated impressive capabilities. They can create news stories, essays, poetry, legal documents, translations, summarizations, computer code, and other documents. These tools are expected to transform businesses and other aspects of our society. In future blogs we will explore these impacts, including the risks and dangers of this technology.

How to Con an AI

AIs can perform well at all kinds of tasks, such as interpreting images or text. For example, these days deep neural networks (DNNs) achieve over 90% accuracy on the standard ImageNet classification benchmark, drawn from a database of over 14 million images spanning more than 20,000 object categories.

Even though today’s deep neural networks have roots in early attempts to mathematically model the human brain and nervous system, the ‘knowledge’ possessed by DNNs takes a form quite different from a human’s.

Just as a picture of a school bus is only a bunch of numbers to an AI (red, green and blue brightness levels for each pixel), an AI’s ‘knowledge’ is also only a bunch of numbers – its mathematical parameters.

Each parameter is like a calibration value, and large DNNs have billions of them. Through an exhaustive training process, millions or billions of examples are presented to a DNN, and its parameters adjusted little by little until it gets the right answers.

When kids learn to recognize that it’s a “bus” that takes them to school each morning, they gain a general concept of what a bus is – a large vehicle with lots of seats. They understand “bus” in the context of their everyday lives, with its vehicles, classrooms, classmates, and roads. This allows kids to learn what a bus is without having to see a lot of examples, and they become immediately proficient at recognizing all kinds of buses.

But to an AI, a bus is what happens when input pixel values churn through the DNN, getting multiplied and combined by the DNN’s parameters, and produce an output representing “bus”. The AI can become proficient if its training data has millions of examples of different types of buses, viewed in various contexts and from different angles and distances.

The stark difference between how humans and AIs ‘understand’ buses has an unfortunate side effect. Researchers have discovered that AIs can be fooled by making minute changes to the numerical values of their input data, changes that are imperceptible to humans. In the example above, an AI that had been trained to confidently recognize a school bus is fooled into ‘thinking’ a bus is an ostrich. This is done simply by making small perturbations to the school bus image.
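
To give a sense of how such perturbations can be computed, here is a minimal PyTorch sketch of one well-known white-box method, the fast gradient sign method (FGSM). The Black Box attack discussed below works without access to the model's gradients, but the basic idea of a tiny, targeted nudge to the pixels is the same; the model and image here are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, true_label, epsilon=0.01):
    """Nudge each pixel by +/- epsilon in the direction that increases the classifier's loss."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), true_label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()  # imperceptible step that most confuses the model
    return adversarial.clamp(0, 1).detach()            # keep pixel values in a valid range

# Usage sketch (placeholders): 'model' is any trained image classifier,
# 'bus_image' is a 1 x 3 x H x W tensor, 'bus_label' its correct class index.
# adversarial_bus = fgsm_perturb(model, bus_image, bus_label)
```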

Conning an AI this way is called an adversarial attack. In the first part of my iMerit article Four Defenses Against Adversarial Attacks, I discuss why AIs are vulnerable to these attacks, how adversarial attacks can be formulated, and how such attacks can cause harm. The diagram below from the article illustrates how to devise a particular type of adversarial attack, a Black Box attack.

Black Box adversarial attack

Adapting AI Systems to a Changing Environment

AI systems rely on machine learning, a process that uses data collected from the real world to train systems by optimizing their parameters. The data collected to train an AI system represents the world at a snapshot in time, the time during which the data was collected.

This makes an AI system vulnerable to changes in its operational environment. For example, if a self-driving car finds itself navigating in an area with traffic signs that are not represented in its training data, it may fail to recognize those signs and respond correctly.

Drift is a term AI engineers use to refer to this degradation in machine learning system performance due to changes in the operational environment. Everyone agrees that drift is a problem, and articles about drift talk about things like data drift, concept drift, model drift, and covariate shift. These terms for different types of drift are not really standardized, and they can overlap and be somewhat confusing.

In my iMerit blog article Staying Ahead of Drift in Machine Learning Systems I clarify drift by putting it in the context of the three basic components of a machine learning system: feature extraction, model encoding, and output decoding. I illustrate the types of drift with a simplified example, and show how to modify machine learning systems to remedy drift. Finally, I describe a comprehensive approach to detecting and mitigating drift, as summarized in the figure below from the article.
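
As a taste of what drift monitoring can look like in practice, here is a minimal sketch (my own illustration, not the approach from the article) that compares the live distribution of a single input feature against the training data using a two-sample Kolmogorov-Smirnov test; a small p-value flags possible data drift:

```python
import numpy as np
from scipy.stats import ks_2samp

def check_feature_drift(train_values, live_values, alpha=0.01):
    """Flag drift if a feature's live distribution differs significantly from training."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha, p_value

# Toy example: the live data has shifted upward relative to the training data.
rng = np.random.default_rng(1)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
live_feature = rng.normal(loc=0.5, scale=1.0, size=1000)

drifted, p = check_feature_drift(train_feature, live_feature)
print(f"drift detected: {drifted} (p = {p:.3g})")
```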

Review: Human Compatible: Artificial Intelligence and the Problem of Control by Stuart Russell

Professor Russell’s book starts out with an entertaining journey through the history of AI and automation, as well as cautionary thinking about them. This discussion is well informed – he is a renowned AI academic and co-author of a comprehensive and widely used AI textbook.

Having provided historical background, the remainder of the book argues two main points: (1) the current approach to AI development is having dangerous side-effects, and it could get much worse; and (2) what we need to do is build AIs that can learn to satisfy human preferences.

Concerning the dangers of AI, the author first addresses current perils: misuse of surveillance, persuasion, and control; lethal autonomous weapons; eliminating work as we know it; and usurping other human roles. I found this part of the book an informative and well-reasoned analysis.

Beyond AI’s current perils, the author next addresses the possibility of AIs acquiring superhuman intelligence and eventually ruling and perhaps exterminating humankind. The author believes this is a definite possibility, placing him in basic agreement with works such as Bostrom’s Superintelligence and Tegmark’s Life 3.0. AI’s existential threat is the subject of continuing debate in the AI community, and Russell attempts to refute the arguments made against his position.

Russell bases his case for AI’s existential threat on two basic premises. The first is that in spite of all the scientific breakthroughs required to initiate superintelligence (well documented by Russell), you cannot rule out humans achieving these breakthroughs. While I appreciate this respect for science and engineering, clearly some human achievements are more within reach than others. Humans understanding human intelligence, let alone creating human-level machine intelligence, seems to me too distant to speculate about except in science fiction.

Russell’s second premise is that unless we change course, superintelligence will be achieved using what he calls the standard model, which creates AIs by optimizing them to meet explicit objectives. This would pose a threat to humanity, because a powerful intellect pursuing explicitly defined objectives can easily spell trouble, for example if an AI decides to fix global warming by killing all the people.

I don’t follow this reasoning. I find it contradictory that an AI would somehow be both superintelligent and bound by fixed, concrete objectives. In fact, in the last part of the book, Russell goes to great pains to illustrate how human behavior, and presumably human-level intelligence, is far more complicated than sequences of explicit objectives.

In the last part of the book Russell advocates developing provably beneficial AI, a new approach that would build AIs that learn to satisfy human preferences instead of optimizing explicit objectives. While I can see how this would be an improvement over homicidal overlords, I don’t think Russell makes the case that this approach would be even remotely feasible.

To show how we might grapple with provably beneficial AI, he spends a good deal of time reviewing mathematical frameworks that address human behavior, such as utility theory and game theory, giving very elementary examples of their application. I believe these examples are intended to make this math accessible to a general audience, which I applaud. However, what they mainly illustrate is how much more complicated real life is compared to these trivial examples. Perhaps this is another illustration of Russell’s faith that human ingenuity can reach almost any goal as long as it knows where to start, like scaling up a two-person game to billions of interacting people.

I was very pleased to read Russell’s perspective on the future of AI. He is immersed in the game, and he is definitely worth listening to. However, I have real difficulty following his extrapolations from where we are today to either superintelligence or provably beneficial AI.

Learning Common Sense from Video

Common sense makes humans very efficient learners, so machine learning researchers have been working on ways to imbue machines with at least some ‘common sense’. In a previous blog we discussed using pictures to train natural language processing systems, in a sense giving the systems partial ‘knowledge’ of what words represent in the physical world. ML systems can get even closer to common sense with a little help from video ML models and human teachers.

In my latest iMerit blog I discuss an innovative deep learning architecture that applies the concept of attention, commonly used in sequence models for language processing, to analyze motion patterns in video using only 30 percent of the computations used in previous approaches.

Next I discuss training such a video analysis system to learn the basic language of movement. For this training the human teacher goes beyond typical training data annotation, drawing on knowledge of the physical world to improvise representative examples of the basic concepts of movement. It is hoped that this will give the ML system a bit of ‘common sense’, allowing it to more easily learn new video analysis tasks.

Learning Words with Pictures

Natural language processing (NLP) machines have made great progress by learning to recognize complex statistical patterns in sentences and paragraphs. Work with modern deep learning models such as the transformer has shown that sufficiently large networks (hundreds of millions of parameters) can do a good job processing language (e.g., translation), without having any information about what the words mean.

We humans make good use of meaning when we process language. We understand how the things, actions, and ideas described by language relate to each other. This gives us a big advantage over NLP machines – we don’t need the billions of examples these machines need to learn language.

NLP researchers have asked the question, “Is there some way to teach machines something about the meaning of words, and will that improve their performance?” This has led to the development of NLP systems that learn not just from samples of text, but also from digital images associated with the text, such as the one above from the COCO dataset. In my latest iMerit blog I describe such a system – the Vokenizer!

Machines Learning From Machines

‘If I have seen further, it is by standing on the shoulders of giants’

Sir Isaac Newton, 1675

Technical disciplines have always progressed by researchers building on past work, but the deep learning research community does this in spades with transfer learning. Transfer learning builds new deep learning systems on top of previously developed ones.

For example, in my recent iMerit blog, I describe a system to detect Alzheimer’s disease from MRI scans. It was built using a very large convolutional neural network (VGG16) that had been previously trained on 14 million visual images. The Alzheimer’s detection system replaced the last few layers of VGG16 with custom, fully connected layers. 6,400 MRI images were used to train the custom layers, while the parameters of the convolutional layers were ‘frozen’ at their previously trained values.

This approach works because VGG16 had already ‘learned’ some general ‘skills’, like how shape, contrast, and texture contribute to recognizing image differences. Applying this ‘knowledge’ allowed the Alzheimer’s detection system to be trained using a relatively small number of MRI images.
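
Here is a minimal Keras sketch of this kind of transfer-learning setup (a simplified illustration, not the actual system from the blog; the input size, layer sizes, and two-class output are assumptions):

```python
import tensorflow as tf

# Load VGG16 with its ImageNet-trained convolutional layers but without the original classification head.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False   # 'freeze' the pre-trained parameters

# Add custom, fully connected layers on top for the new task.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),   # e.g., disease vs. healthy
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(mri_images, labels, epochs=10)   # only the custom layers' parameters are updated
```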

Transfer learning is remarkably easy to implement. The deep learning community has many open source repositories, such as the ONNX Model Zoo, which provide downloadable, pre-trained ML systems. In addition, ML system development environments such as TensorFlow make it easy to load previously trained systems and modify and train custom final layers.

To learn more about how transfer learning works, and how new research is extending the ability of previously trained ML systems to tackle new problems, read my iMerit blog.

Navigating the Cost Terrain with Minibatches

Training a Machine Learning system requires a journey through the cost terrain, where each location in the terrain represents particular values for all ML system parameters, and the height of the terrain is the cost, a mathematical value that reflects how well the ML system is performing for that parameter set (smaller cost means better performance). For a very simple ML system with only two parameters, we can visualize the cost terrain as a mountainous territory with peaks and valleys, plateaus and saddlebacks. (Deep learning cost terrains are a lot like this, only instead of three dimensions they can have millions!)

Training mathematically explores the cost terrain, taking steps in promising directions, hoping not to fall off a cliff or get lost on a plateau. Our guide in this journey is gradient descent, which calculates the best next step in the search for the best ML system parameters, found in the lowest valley of the cost terrain.

Gradient descent can be very cautious and look at all the training samples before taking a step. This makes sure the step is a very good one, but progress is slow because it takes a long time to look at all the training samples. Or, it can make a guess and take a step after every training sample it looks at. These snap decisions produce rapid steps in the cost terrain, but a lot of motion with little progress, because each step is based on just one training sample, while we want the ML system to give good average performance across all the training samples.

The best way to efficiently navigate the cost terrain is a compromise between slow deliberation and snap judgement, called minibatching. This approach takes a step using a small subset of the training set – enough to get a pretty good idea of where to go, but a small enough sample size so that the calculations can be done quickly using modern vector processors.
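
Here is a minimal NumPy sketch of minibatch gradient descent on a simple least-squares problem (an illustration of the idea, not code from the blog); each step uses a small random batch rather than the full training set or a single sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: inputs X and targets y generated from a noisy linear rule.
X = rng.normal(size=(10_000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=10_000)

w = np.zeros(5)          # the ML system's parameters (our location in the cost terrain)
learning_rate = 0.1
batch_size = 32          # more than one sample, far fewer than all 10,000

for step in range(2_000):
    batch = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[batch], y[batch]
    error = Xb @ w - yb
    gradient = 2 * Xb.T @ error / batch_size   # gradient of the mean squared error on the batch
    w -= learning_rate * gradient              # one step downhill in the cost terrain

print(np.round(w, 2))    # close to the true weights [1.0, -2.0, 0.5, 3.0, -1.0]
```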

Read my latest iMerit blog to get a better idea of how minibatching works.

Learning Without a Teacher

Machine learning applications generally rely on supervised learning, learning from training samples that have been labeled by a human ‘teacher’. Unsupervised learning learns what it can from unlabeled training samples. What can be learned this way are basic structural characteristics of the training data, and this information can be a useful aid to supervised learning.

In my latest iMerit blog I describe how the long-used technique of clustering has been incorporated into deep learning systems, to provide a useful starting point for supervised learning and to extrapolate what is learned from labeled training data.
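
As a simple illustration of the general idea (my own sketch, not the specific systems described in the blog), unlabeled data can be clustered first, and the cluster assignments then serve as a starting point for supervised learning, for example by letting a human label just one representative sample per cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Unlabeled data: two blobs that we suspect correspond to two underlying classes.
unlabeled = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(500, 2)),
    rng.normal(loc=[3, 3], scale=0.5, size=(500, 2)),
])

# Step 1 (unsupervised): discover structure without any labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(unlabeled)
pseudo_labels = kmeans.labels_

# Step 2 (a little supervision): a human names each cluster, which effectively
# labels the whole dataset. The mapping below is hypothetical.
cluster_to_class = {0: "class A", 1: "class B"}
print([cluster_to_class[c] for c in pseudo_labels[:5]])
```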

The Road to Human-Level Natural Language Processing

Language is a hallmark of human intelligence, and Natural Language Processing (NLP) has long been a goal of Artificial Intelligence. The ability of early computers to process rules and look up definitions made machine translation seem right around the corner. However language proved to be more complicated than rules and definitions.

The observation that humans use practical knowledge of the world to interpret language set off a quest to create vast databases of human knowledge to apply to NLP. But it wasn’t until deep learning became available that human-level NLP was achieved, using an approach quite unlike human language understanding.

In my latest iMerit blog I trace the path that led to modern NLP systems, which leave meaning to humans and let machines do what they are good at – finding patterns in data.