Natural language processing (NLP) machines have made great progress by learning to recognize complex statistical patterns in sentences and paragraphs. Work with modern deep learning models such as the transformer has shown that sufficiently large networks (hundreds of millions parameters) can do a good job processing language (e.g., translation), without having any information about what the words mean.
We humans make good use of meaning when we process language. We understand how the things, actions, and ideas described by language relate to each other. This gives us a big advantage over NLP machines – we don’t need the billions of examples these machines need to learn language.
NLP researchers have asked the question, “Is there some way to teach machines something about the meaning of words, and will that improve their performance?” This has led to the development of NLP systems that learn not just from samples of text, but also from digital images associated with the text, such as the one above from the COCO dataset. In my latest iMerit blog I describe such a system – the Vokenizer!