Learning Common Sense from Video

Common sense makes humans very efficient learners, so machine learning researchers have been working on ways to imbue machines with at least some ‘common sense’. In a previous blog we discussed using pictures to train natural language processing systems, in a sense giving the systems partial ‘knowledge’ of what words represent in the physical world. ML systems can get even closer to common sense with a little help from video ML models and human teachers.

In my latest iMerit blog I discuss an innovative deep learning architecture that applies the concept of attention, commonly used in sequence models for language processing, to analyze motion patterns in video using only 30 percent of the computation required by previous approaches.
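
To give a feel for the attention mechanism itself, here is a generic sketch of scaled dot-product self-attention over a sequence of per-frame feature vectors (this illustrates the core idea only, not the blog's specific architecture):

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention: each frame's output is a
    similarity-weighted average of all frames' features."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                   # pairwise frame similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over frames
    return weights @ X

# e.g., a 16-frame clip, each frame encoded as a 64-dim feature vector
frames = np.random.default_rng(0).normal(size=(16, 64))
attended = self_attention(frames)                   # shape (16, 64)
```

A real transformer learns separate query, key, and value projections of the input; this stripped-down version keeps only the weighted-averaging step at the heart of attention.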

Next I discuss training such a video analysis system to learn the basic language of movement. For this training the human teacher goes beyond typical training data annotation, drawing on knowledge of the physical world to improvise representative examples of the basic concepts of movement. It is hoped that this will give the ML system a bit of ‘common sense’, allowing it to more easily learn new video analysis tasks.

Learning Words with Pictures

Natural language processing (NLP) machines have made great progress by learning to recognize complex statistical patterns in sentences and paragraphs. Work with modern deep learning models such as the transformer has shown that sufficiently large networks (hundreds of millions of parameters) can do a good job processing language (e.g., translation) without having any information about what the words mean.

We humans make good use of meaning when we process language. We understand how the things, actions, and ideas described by language relate to each other. This gives us a big advantage over NLP machines – we don’t need the billions of examples these machines need to learn language.

NLP researchers have asked the question, “Is there some way to teach machines something about the meaning of words, and will that improve their performance?” This has led to the development of NLP systems that learn not just from samples of text, but also from digital images associated with the text, such as the captioned images in the COCO dataset. In my latest iMerit blog I describe such a system – the Vokenizer!

Machines Learning From Machines

‘If I have seen further, it is by standing on the shoulders of giants’

Sir Isaac Newton, 1675

Technical disciplines have always progressed by researchers building on past work, but the deep learning research community does this in spades with transfer learning. Transfer learning builds new deep learning systems on top of previously developed ones.

For example, in my recent iMerit blog, I describe a system to detect Alzheimer’s disease from MRI scans. It was built using a very large convolutional neural network (VGG16) that had been previously trained on 14 million visual images. The Alzheimer’s detection system replaced the last few layers of VGG16 with custom, fully-connected layers. 6400 MRI images were used to train the custom layers, while the parameters of the convolutional layers were ‘frozen’ at their previously trained values.

This approach works because VGG16 had already ‘learned’ some general ‘skills’, like how shape, contrast, and texture contribute to recognizing image differences. Applying this ‘knowledge’ allowed the Alzheimer’s detection system to be trained using a relatively small number of MRI images.

Transfer learning is remarkably easy to implement. The deep learning community has many open source repositories, such as the ONNX Model Zoo, which provide downloadable, pre-trained ML systems. In addition, ML system development environments such as TensorFlow make it easy to load previously trained systems and modify and train custom final layers.
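
As a minimal sketch of this pattern in Keras (the layer sizes and binary output head here are my own illustrative choices, not the exact configuration of the Alzheimer’s system):

```python
import tensorflow as tf

# Load VGG16 pre-trained on ImageNet, minus its fully-connected head
base = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # 'freeze' the convolutional layers

# Stack custom, fully-connected layers on top for the new task
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g., disease vs. healthy
])

model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, ...) then trains only the custom layers
```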

To learn more about how transfer learning works, and how new research is extending the ability of previously trained ML systems to tackle new problems, read my iMerit blog.

Navigating the Cost Terrain with Minibatches

Training a Machine Learning system requires a journey through the cost terrain, where each location in the terrain represents particular values for all ML system parameters, and the height of the terrain is the cost, a mathematical value that reflects how well the ML system is performing for that parameter set (smaller cost means better performance). For a very simple ML system with only two parameters, we can visualize the cost terrain as a mountainous territory with peaks and valleys, plateaus and saddle points. (Deep learning cost terrains are a lot like this, only instead of three dimensions they can have millions!)

Training mathematically explores the cost terrain, taking steps in promising directions, hoping not to fall off a cliff or get lost on a plateau. Our guide on this journey is gradient descent, which calculates the most promising next step in the search for the best ML system parameters, the ones lying in the lowest valley of the cost terrain.

Gradient descent can be very cautious and look at all the training samples before taking a step. This makes each step a very good one, but progress is slow because it takes a long time to examine every training sample. Or, it can make a guess and take a step after each training sample it looks at. These snap decisions produce rapid steps in the cost terrain, but lots of motion with little progress, because each step is driven by a single training sample, while we want the ML system to give good average performance across all the training samples.

The best way to efficiently navigate the cost terrain is a compromise between slow deliberation and snap judgement, called minibatching. This approach takes a step using a small subset of the training set – enough to get a pretty good idea of where to go, but a small enough sample size that the calculations can be done quickly using modern vector processors.
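
Here is a toy sketch of the compromise for linear regression with a mean-squared-error cost (the model and numbers are my own illustration, not from the blog). Setting batch_size to the full training set recovers the cautious extreme, and batch_size=1 the snap-judgement extreme:

```python
import numpy as np

def train(X, y, lr=0.1, batch_size=32, epochs=100):
    """Minibatch gradient descent for linear regression with an MSE cost."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))             # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            error = X[batch] @ w - y[batch]
            grad = X[batch].T @ error / len(batch)  # gradient on this minibatch
            w -= lr * grad                          # one step down the cost terrain
    return w
```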

Read my latest iMerit blog to get a better idea of how minibatching works.

Learning Without a Teacher

Machine learning applications generally rely on supervised learning, learning from training samples that have been labeled by a human ‘teacher’. Unsupervised learning, by contrast, extracts what it can from unlabeled training samples. What can be learned this way are basic structural characteristics of the training data, and this information can be a useful aid to supervised learning.

In my latest iMerit blog I describe how the long-used technique of clustering has been incorporated into deep learning systems, to provide a useful starting point for supervised learning and to extrapolate what is learned from labeled training data.
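
A minimal sketch of the idea using k-means from scikit-learn (the feature vectors and labels here are made up for illustration): cluster the unlabeled data, have a human annotate one representative per cluster, and let the rest of each cluster inherit that label.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 16))  # stand-in for learned feature vectors

# Group the unlabeled samples into 10 clusters
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(features)

# A human annotator labels one representative sample per cluster...
representative = {k: int(np.where(kmeans.labels_ == k)[0][0]) for k in range(10)}
annotated = {k: f"label_{k}" for k in representative}  # hypothetical labels

# ...and every other sample inherits its cluster's label
pseudo_labels = [annotated[c] for c in kmeans.labels_]
```

These pseudo-labels give supervised training a warm start from far fewer human annotations.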

The Road to Human-Level Natural Language Processing

Language is a hallmark of human intelligence, and Natural Language Processing (NLP) has long been a goal of Artificial Intelligence. The ability of early computers to process rules and look up definitions made machine translation seem right around the corner. However, language proved to be more complicated than rules and definitions.

The observation that humans use practical knowledge of the world to interpret language set off a quest to create vast databases of human knowledge to apply to NLP. But it wasn’t until deep learning became available that human-level NLP was achieved, using an approach quite unlike human language understanding.

In my latest iMerit blog I trace the path that led to modern NLP systems, which leave meaning to humans and let machines do what they are good at – finding patterns in data.

Encoding Human and Machine Knowledge for Machine Learning

iMerit is a remarkable company of over 4000 people that specializes in annotating the data needed to train machine learning systems.

I am writing a series of blogs for them on various aspects of machine learning. In my latest blog I explain how ML systems embody both human intelligence and a form of machine ‘intelligence’.

Just as our biology provides the basis for human learning, human-provided ML system designs provide frameworks that enable machine learning. Through human engineering, these designs bring ML systems to the point where everything they need to ‘know’ about the world can be reflected in their parameters.

Analogous to the role of our parents and teachers, training data annotation drives the learning process toward competent action. Annotation is the crucial link between the ML system and its operational world, and accurate and complete annotation is the only way an ML system can learn to perform well.

The Three Edge Case Culprits: Bias, Variance, and Unpredictability

In my latest iMerit blog I explain how ML systems can be fooled when they are too ‘simple’, too ‘inexperienced’, or faced with too many surprises.

How Does Mislabeled Training Data Affect ML System Performance?

In my latest iMerit blog I explain how inaccuracies in training data labels (‘label noise’) affect ML system performance. It turns out that it’s not so much how many errors there are that matters, but how those errors are structured.

Thinking About Thinking Machines

AIs have been developed that respond to human language, drive cars, and play masterful chess. As these feats traditionally require human intelligence, it might be said that AIs possess a form of intelligence.

What do we humans make of this artificial ‘intelligence’? Are AIs intelligent entities in the same sense as we humans? Can machines think?

Thinking about thinking is second nature to us humans. What first evolved as an ability to guess what others are thinking, to better compete and collaborate, further evolved into self-reflection. While other animals self-reflect, humans have a unique ability to conceptualize thinking. While self-reflection is key to human intelligence in general, it shows up particularly in philosophy (Descartes’s famous ‘I think therefore I am’), and it has of course driven the invention of AI itself.

For a long time, philosophers have been thinking about whether machines can ever think. The philosophical debate centers on whether there is something inherent in human intelligence that can never be duplicated by a machine.

In a 1980 paper, the philosopher John Searle argues that machines can never achieve human-like intelligence. In his famous ‘Chinese room’ thought experiment, a non-Chinese-speaking man is able to respond, in Chinese, to Chinese messages passed through a slot in the door. He is able to do this without understanding any Chinese, simply by referring to Chinese-to-Chinese correspondence tables.

Searle says that like computers, the Chinese room only simulates thinking, which is clearly different from a person communicating from an understanding of the Chinese language. Thus, AIs do not understand as humans do, and cannot really be thinking. AIs only simulate thinking.

Hubert Dreyfus also believed that machines will never think like humans. In ‘What Computers Can’t Do’ he argues that human intelligence requires the context of a human body and a human life, which can never be reduced to machine algorithms.

On the other hand, in a 1950 paper Alan Turing concluded that there is no reason a machine might not eventually be judged as ‘thinking’, to the extent we are able to come up with a suitable test. He proposed his famous ‘Imitation Game’ (now called the Turing Test) as a criterion: if an AI can carry on an open-ended conversation with a person and not reveal itself as a non-person, we have no justification to say it is not thinking.

More recently, AI pioneer Geoff Hinton made another argument for the possibility of machines thinking. He believes deep neural networks may eventually achieve human-level intelligence. He points out that our brains work with patterns of billions of elementary signals (electrical and chemical) in a way not fundamentally different from the way deep neural networks encode patterns in their billions of parameters.

AI practitioners have tended to view the question of whether machines will ever think as more of a practical issue than a philosophical one. They point out the difficulty in pinning down what constitutes human intelligence, and the difficulty in predicting or ruling out technical breakthroughs. They prefer to look at what AI has accomplished so far, and speculate when continuing progress might produce AIs that match human-level performance.

It seems these speculations have often underestimated how far we have to go. For example, in the 1960s, AI pioneer Herbert Simon predicted human-level intelligence by the 1980s. In 2006, Ray Kurzweil predicted that a computer with the power of a human brain would be available around 2020 (for $1,000).

We are still waiting!