
Artificial Intelligence Is Helping Computers Understand Our Movements
Artificial intelligence (AI) is helping computers understand human behavior and the movements of objects. From just a few video frames, a computer can learn how objects change across different scenarios, and continued advances in machine learning are bringing machines closer to understanding both our movements and their surroundings.
Video frames are the raw material for building a machine learning system that recognizes activities. A person watching only a handful of frames can usually tell what is happening, narrate the action on screen, and even predict what is likely to occur in the next scene.
Computers, however, still struggle to anticipate what will happen next in a video. A new machine learning model aims to make computers far better at recognizing activities across frames, giving robotic systems a way to identify and interpret the events taking place around them.
MIT researchers recently presented the work at the European Conference on Computer Vision, describing it as an add-on module that helps the artificial intelligence systems known as convolutional neural networks (CNNs) fill the gaps between video frames.
Those gaps are exactly where computers falter: a person can easily infer the likely course of action between scenes, but a machine needs a sophisticated system to do the same. By filling in what happens between frames, the module significantly improves activity recognition within the network.
The New Module
The new module from the MIT researchers is called the Temporal Relation Network, or TRN, and it learns how objects in a video change over time. It picks out keyframes that depict an activity at various stages, for instance, a pile of objects that is later knocked down by a hand.
Repeating this process lets the module recognize the same action in a new video: the technique learns from one activity and applies what it learned to the next clip. Because objects in a video follow ordered sequences, those sequences help determine the next course of action.
If the next video contains similar patterns, the system recognizes them. In experiments, the TRN module identified hundreds of basic activities, such as poking an object so that it falls or giving a thumbs up, outperforming existing models.
The model can also accurately predict what will happen next in a video, forecasting the likely action from just a handful of frames. Comparable modules currently in use cannot predict the events that follow in this way.
Significance Of The Module
In the future, the TRN module could help robots understand what is going on around them. In other words, machine learning keeps advancing toward a stage where a system can interpret a scene much as a human being does, and continued development could take computer-controlled technology a long way toward imitating human behavior.
According to the researchers, the system learns to recognize how objects transform rather than what they look like. It does not need to step through every frame of a video; instead it picks a few frames and analyzes those to identify the significant features.
Selecting only the crucial keyframes improves the computer's efficiency in analyzing video and lets the system run accurately in real time. Earlier modules, by contrast, tend to be either inefficient or inaccurate.
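As a rough illustration of this sparse-sampling idea, here is a minimal Python sketch of segment-based keyframe selection, assuming only NumPy; the function name and segment count are illustrative choices, not details from the paper:

```python
import numpy as np

def sample_keyframes(num_video_frames: int, num_segments: int = 8) -> np.ndarray:
    """Pick one representative frame per equal segment of the video,
    instead of processing every frame one by one."""
    # Split the frame range into equal segments.
    edges = np.linspace(0, num_video_frames, num_segments + 1)
    # Use each segment's midpoint as its keyframe index.
    return ((edges[:-1] + edges[1:]) / 2).astype(int)

print(sample_keyframes(300, 8))  # [ 18  56  93 131 168 206 243 281]
```

A downstream classifier then only has to run on those eight frames rather than all 300, which is where the efficiency gain comes from.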
Robots are commonplace these days, performing tasks that make our lives easier. Smart Roomba robots, for instance, handle household chores while we relax, guided by programmed sensors. The new module aims to push that technology further, toward robots that learn to behave more like human beings.
Recognizing how objects transform, rather than how they appear, also helps machines learn the way the human mind does: the mind identifies a particular course of action and then tries to predict its likely outcome. When watching a video, we use the action on screen to predict the next move.
Picking Up Key Frames
Efficiency and accuracy are the two main issues the new module seeks to address, and the two models currently in use each fall short on one of them. A video consists of many frames, and processing them one by one takes time, costing both efficiency and accuracy.
The CNN-based modules commonly used for activity recognition today are good but not perfect. One model is accurate, but it must analyze every video frame before making a prediction, which makes the whole process slow and expensive. Because each frame must be examined and videos contain many of them, efficiency suffers, and real-time recognition becomes impractical since processing takes so long.
The other module, called the two-stream network, has the opposite problem: it is more efficient but less accurate. One stream extracts features from a single video frame and then merges the result with optical flows, information extracted about the movement of individual pixels.
The main problem with optical flows is that they are computationally expensive to extract, so the module remains inefficient because extraction takes so much time. The new TRN module therefore aims to land between these two earlier approaches, achieving both efficiency and accuracy.
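To see where that cost comes from, the sketch below computes the kind of dense optical flow a two-stream network's temporal branch consumes, using OpenCV's Farneback method. This illustrates the general technique and its expense, not the exact pipeline used in the paper:

```python
import cv2
import numpy as np

def dense_flow(gray_frames: list) -> list:
    """Compute a dense optical flow field for every pair of consecutive
    grayscale frames -- the per-pixel motion input of a flow stream."""
    flows = []
    for prev, nxt in zip(gray_frames, gray_frames[1:]):
        # One (H, W, 2) displacement field per frame pair; this call is
        # expensive and must be repeated across the whole video.
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flows.append(flow)
    return flows
```

Because a new flow field is needed for every consecutive pair of frames, the preprocessing alone scales with video length, which is exactly the inefficiency the TRN module is designed to avoid.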
The researchers trained the module on three crowdsourced datasets of short videos showing different activities. The first, Something-Something, was built by TwentyBN and contains more than 200,000 videos in 174 action categories, such as lifting an object or poking an object so that it falls.
The second dataset contains nearly 150,000 videos of 27 different hand gestures, including giving a thumbs up and lifting a fist. The third consists of about 10,000 videos covering 157 activity categories, such as playing basketball. Together, the datasets span a wide range of activity classes, improving the chances of accuracy.
When the TRN module receives a video file, it simultaneously processes ordered frames in groups of two, three, and four, spaced some time apart. It then quickly assigns a probability that the object's transformation across those frames matches a specific activity class.
For instance, when the module processes two frames in which an object appears on top in the earlier frame and on the bottom in the later one, it assigns a high probability to the activity class of moving an object from top to bottom. If a third frame shows the object in the middle of the screen, that probability increases, and so on: the activity class that the observed transformation most resembles comes to represent the action in the video.
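A minimal PyTorch sketch of this multi-scale relation idea follows. The layer sizes, the single evenly spaced frame tuple per scale, and names such as feature_dim are simplifying assumptions; the network described in the paper samples multiple frame tuples per scale:

```python
import torch
import torch.nn as nn

class TemporalRelation(nn.Module):
    """Sketch of a multi-scale temporal relation head: per-frame CNN
    features are combined in ordered groups of 2, 3, and 4 frames,
    and each group votes on an activity class."""

    def __init__(self, feature_dim: int = 256, num_classes: int = 174,
                 scales=(2, 3, 4)):
        super().__init__()
        self.scales = scales
        # One small MLP per scale, fed k concatenated frame features.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(k * feature_dim, 256), nn.ReLU(),
                          nn.Linear(256, num_classes))
            for k in scales])

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, feature_dim), in temporal order.
        b, t, d = frame_feats.shape
        logits = 0
        for k, head in zip(self.scales, self.heads):
            # Take k frames evenly spaced across the clip and let the
            # scale-k head score how their transformation matches a class.
            idx = torch.linspace(0, t - 1, k).long()
            logits = logits + head(frame_feats[:, idx].reshape(b, k * d))
        return logits  # class scores summed across the 2-, 3-, 4-frame scales

feats = torch.randn(1, 8, 256)       # features for 8 keyframes of one video
scores = TemporalRelation()(feats)   # (1, 174) activity-class scores
```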
Recognition And Forecasting Activities
A convolutional neural network (CNN) equipped with the new module can recognize many activities accurately using only two frames, and, interestingly, accuracy keeps rising as more frames are sampled.
The new module beats several existing modules, reaching 95% accuracy in activity recognition. That margin over existing models means it can significantly improve how computerized robots recognize objects and their transformations.
Where recognition was harder, the module simply sampled a few more frames. In Something-Something, for example, it had to distinguish a person pretending to open a book from someone actually opening one; sampling additional frames let the system tell the two actions apart, and drawing on more frames in this way improves the chances of accuracy.
Other activity-recognition models also process keyframes but ignore the temporal relationships between them, which limits their accuracy. The TRN researchers report that, by modeling those relationships, their approach roughly doubles the accuracy of such models.
The model's use of many frames also lets it beat existing modules at forecasting an activity: after processing only the first 25 percent of a video's frames, the TRN module outperformed a baseline model by several percentage points.
Given about 50 percent of the frames, it achieved 10 to 40 percent higher accuracy. From early frames showing hands holding a sheet of paper, for example, the module could predict that the paper was about to be torn. As this technology develops further, computers may become able to forecast events unfolding around them.
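Building on the TemporalRelation sketch above, such a forecast can be approximated by scoring only an early prefix of the frame features; the 25 percent cutoff below mirrors the experiment's setting and is otherwise illustrative:

```python
def forecast_from_prefix(model, frame_feats, fraction=0.25):
    """Score activity classes from only the earliest frames,
    predicting the action before it has finished."""
    t = frame_feats.shape[1]
    prefix = frame_feats[:, : max(2, int(t * fraction))]  # keep >= 2 frames
    return model(prefix)

early = forecast_from_prefix(TemporalRelation(), torch.randn(1, 16, 256))
```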
Major Findings Of The Study
The researchers show that a technique able to analyze many video frames at once can help a machine learning system understand what is going on around it and anticipate the events likely to unfold.
The TRN module is more efficient than existing modules and, more importantly, delivers state-of-the-art accuracy on action recognition across benchmark datasets. The model is well suited to a variety of robot applications and could even one day help blind people get around using self-driving cars.
The module can scan the surrounding environment and deliver accurate visual information in real time. It could also play a pivotal role in security applications and many others, and it stands a good chance of being able to predict events before they happen.
Further Development Of The Module
The researchers plan to make the module more sophisticated by combining object recognition with activity recognition. They also hope to add intuitive physics, helping the model understand real-world properties the way a person does.
Once that is possible, computer-controlled robots could learn and understand human behavior. In other words, if machine learning technology achieves these goals, robots could perform a wide variety of tasks without human intervention.
The researchers are confident they can train the module on the laws of physics so that it can apply them when interpreting new videos. Activity understanding is an exciting frontier of artificial intelligence.
A robot should be able to recognize an object, observe it carefully, and then anticipate the action it is about to take. A system that can recognize a specific activity can also help predict the next movement likely to occur in the video.
All in all, the many developments underway in artificial intelligence show that computers can learn human behavior. By analyzing video frames, computers can learn the movements of objects, and those frames play a pivotal role in helping a system recognize the activities taking place in a video.
The new TRN module proves more effective than existing modules because it is both accurate and efficient, and once a module can recognize activities reliably, the technology can be used to help computers mimic human behavior.