Common sense makes humans very efficient learners, so machine learning researchers have been working on ways to imbue machines with at least some ‘common sense’. In a previous blog post we discussed using pictures to train natural language processing systems, which in a sense gives those systems partial ‘knowledge’ of what words represent in the physical world. ML systems can get even closer to common sense with a little help from video ML models and human teachers.
In my latest iMerit blog post, I discuss an innovative deep learning architecture that applies the concept of attention, commonly used in sequence models for language processing, to analyze motion patterns in video while using only 30 percent of the computation required by previous approaches.
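To make the idea of attention over video a little more concrete, here is a minimal sketch of scaled dot-product self-attention applied across the frames of a clip. It is not the architecture discussed in the blog post: it assumes each frame has already been reduced to a feature vector, it omits the learned query, key, and value projections a real model would use, and it makes no attempt to reproduce the 30 percent computation figure. The function name `temporal_self_attention` and the random stand-in features are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(frames):
    """Scaled dot-product self-attention across video frames.

    frames: array of shape (T, D), one D-dimensional feature vector per frame.
    Returns the attended features (T, D) and the attention weights (T, T).
    Assumption: queries, keys, and values are the raw frame features; a
    trained model would apply learned projections first.
    """
    d = frames.shape[-1]
    scores = frames @ frames.T / np.sqrt(d)   # similarity of every frame pair
    weights = softmax(scores, axis=-1)        # each row sums to 1
    attended = weights @ frames               # weighted mix of frame features
    return attended, weights

# Stand-in data: 8 frames with 16 features each (random, for illustration only).
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))
attended, weights = temporal_self_attention(frames)
```

Each row of `weights` says how strongly one frame draws on every other frame in the clip, which is the mechanism that lets a model relate positions across time when looking for motion patterns.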
Next, I discuss training such a video analysis system to learn the basic language of movement. For this training, the human teacher goes beyond typical training data annotation, drawing on knowledge of the physical world to improvise representative examples of the basic concepts of movement. The hope is that this will give the ML system a bit of ‘common sense’, allowing it to learn new video analysis tasks more easily.