Modelling how we perceive video

Maria Laura Mele

Neuroscience Lead, Cogisen

Despite advances in technology, we are still not good at predicting how people implicitly perceive video quality. This matters because we need to know what people look at in order to compress video better. Every video you watch on the internet is compressed by making assumptions about what you will see, and throwing away what you won't. Better video compression matters enormously: 80% of internet traffic is video, so better models of what people look at can save massive amounts of data. This is important for our team at Cogisen because we develop video compression algorithms that use artificial intelligence to learn what people look at. Consequently, we conduct extensive research into how people react to video.

Today, to assess how people react to video, we rely on viewers reporting on video quality using a rating system. That approach has limitations. The main issue is that although we consciously make some ‘explicit’ decisions about video content, we also have ‘implicit’ emotional reactions and responses that affect how we perceive the material. That is why, at the Cogisen Cognitive Laboratory, we set out to learn how the brain makes implicit decisions about perception. In particular, we asked:

Do traditional ‘explicit’ video quality metrics based on what people say match our new ‘implicit’ metrics based on unconscious physiological responses?


(Cogisen’s model of how top-down and bottom-up saliency affect implicit and explicit quality)

Human quality estimation is not a straightforward process, because it depends on what we look at in a scene, which is called “visual saliency”. Some of what we choose to look at depends on the information in the scene, like visual complexity or contrast, and on innate human biases, such as looking at faces. This type of visual saliency is called “bottom-up”: it draws attention reflexively, as an involuntary response to visual information. The other type is deliberate “top-down” attention, driven by higher-level factors like task demands, context, emotional state and the viewer’s needs or motivations. For example, if we are asked to count the dogs in a video of animals, top-down saliency makes us pay more attention to dogs than to the other objects in the scene. But if a moving, blinking red dot suddenly appears, bottom-up saliency forces us to allocate our attention to it.
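Bottom-up saliency can be approximated computationally. As a toy illustration (not Cogisen's actual model), a crude centre-surround contrast map scores each pixel by how much its intensity deviates from its local surround:

```python
import numpy as np

def bottom_up_saliency(gray, surround=15):
    """Toy centre-surround saliency: score each pixel by how much its
    intensity deviates from the mean of its local neighbourhood.
    gray: 2-D array of intensities in [0, 1]."""
    h, w = gray.shape
    k = 2 * surround + 1                      # window size
    padded = np.pad(gray, surround, mode="edge")
    # An integral image (with a leading row/column of zeros) gives each
    # k x k window sum in constant time.
    S = np.zeros((h + k, w + k))
    S[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    window_sum = (S[k:k + h, k:k + w] - S[:h, k:k + w]
                  - S[k:k + h, :w] + S[:h, :w])
    local_mean = window_sum / (k * k)
    saliency = np.abs(gray - local_mean)      # contrast with the surround
    peak = saliency.max()
    return saliency / peak if peak > 0 else saliency
```

On a flat background the map is zero everywhere; a bright patch lights up strongly, mimicking the reflexive pull of the blinking red dot. Real saliency models add many more channels (colour, motion, faces), but the principle is the same.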

The salient things that we notice in a scene cause us to respond to them. Some of those responses are easy to see and measure, like what your eyes look at and the emotions on your face. Other responses are harder to see, like how your thought patterns change.

(Image with saliency from the MIT300 dataset)

Perception happens automatically, so we are not consciously aware of much of the process. If you are asked to rate the quality of a video, you might feel that you are assigning a score at random, but research shows that your scores will still correlate with the video quality levels. For other videos, you may know exactly why, and through what steps, you assigned that score. So even scores that you consciously choose have “explicit” and “implicit” parts.

One area that requires further examination is how you perceive scenes in subjective Video Quality Assessment. The current best practice is to use groups of people to assess video quality, by asking them to consciously assign a score to each video’s quality. Explicit scores can deliver good results, but they have limitations. Subjective quality scores are the result of conscious and deliberate cognitive processes of judgement. They are influenced by expectations, by biases that are not easily studied, and by motivational or emotional factors such as social desirability. Using human subjective scores means accepting some level of conscious or unconscious human bias. We are human after all. If we could measure the physiological responses of people watching video, we would see how people actually respond, not just what they choose to report or are consciously aware of.

Our research

We conducted research to find ‘implicit’ physiological metrics that show what people are actually looking at. The first step was to identify ‘implicit’ physiological responses involved in the visual quality evaluation process. The most obvious one to look at was gaze. How long the eye rests at each location (fixation duration) is an implicit, automatic process. Evaluating video quality increases the demand on attention and working memory (the things we keep in mind for a few seconds). Normally, when we look at a scene, our brain tells our eyes to pause at points called ‘fixations’ – moments where the eye rests and the brain processes information. To lower the working memory load when a task becomes too demanding, the brain reduces the time spent at each fixation (Orquin and Loose, 2013). That means we can measure how quickly a person’s eyes move over a scene, and infer the workload required to evaluate its video quality. Any change in fixations tells us whether higher video compression is changing a scene’s workload. If we can identify which parts of a scene don’t influence human perception, we will be able to compress those elements more heavily.
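To make “fixation duration” concrete: eye trackers emit a stream of gaze samples, and a common way to group them into fixations is a dispersion-threshold (I-DT style) algorithm. A minimal sketch, assuming pixel coordinates and timestamps in seconds; the thresholds here are illustrative defaults, not the study's actual parameters:

```python
import numpy as np

def fixation_durations(x, y, t, max_dispersion=30.0, min_duration=0.1):
    """Dispersion-threshold (I-DT style) fixation detection.
    x, y: gaze coordinates in pixels; t: sample timestamps in seconds.
    Returns the duration (s) of each detected fixation."""
    durations = []
    start, n = 0, len(t)
    while start < n:
        end = start + 1
        # Grow the window while the gaze stays within the dispersion limit.
        while end < n:
            dispersion = ((x[start:end + 1].max() - x[start:end + 1].min())
                          + (y[start:end + 1].max() - y[start:end + 1].min()))
            if dispersion > max_dispersion:
                break
            end += 1
        duration = t[end - 1] - t[start]
        if duration >= min_duration:          # long enough to count as a fixation
            durations.append(duration)
            start = end                       # jump past the fixation
        else:
            start += 1                        # saccade sample: slide onward
    return durations
```

Averaging the returned durations per clip yields the kind of per-video fixation metric the workload argument relies on: shorter average fixations suggest a heavier working memory load.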

Eye gaze isn’t the only physiological process that changes in response to human factors such as workload, stress, approach motivation or emotions. Brain function also changes, so it is possible to measure workload directly by recording brain activity with temporary electrodes placed on a person’s head, a technique called electroencephalography (EEG). EEG has a unique advantage: it can discriminate between different types of cognitive function by showing which parts of the brain are activated.

Emotions influence what we look at, so we also needed a physiological measure of emotion. We used facial expression analysis, which allowed us to measure how strongly each emotion was expressed on the faces of people watching the video.

In our study we wanted to investigate what happens to these ‘implicit’ physiological measures when video compression increases. Our investigation asked participants to compare two different video compression types while they were measured using eye tracking, facial expression analysis, and EEG. Then we investigated the relationship between ‘explicit’ reported judgments of video quality and ‘implicit’ physiological measures of video perception.

The results of our research were very positive. Using eye tracking, we found that when videos are highly compressed, fixation duration decreases, which indicates that higher video compression results in a greater working memory load. We also found that fixation duration predicted video quality scores, and was more sensitive than ‘explicit’ reports at distinguishing different video compression levels. Unlike traditional report-based video quality assessment methods, fixation measures vary while the video is playing, so they can highlight which parts of the video affect quality ratings.
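The claim that fixation duration predicts quality scores can be checked with a simple correlation between per-clip fixation averages and the mean opinion scores viewers report. A sketch with purely hypothetical numbers (not the study's data):

```python
import numpy as np

# Hypothetical per-clip averages (illustrative numbers, not study data).
# Five versions of the same content, encoded at decreasing quality:
mean_fixation_ms = np.array([310, 295, 270, 240, 205])  # mean fixation duration
mos = np.array([4.6, 4.3, 3.8, 3.1, 2.4])               # mean opinion score, 1-5

# If shorter fixations accompany heavier compression and lower reported
# quality, the implicit and explicit measures should correlate strongly.
r = np.corrcoef(mean_fixation_ms, mos)[0, 1]
print(f"Pearson r = {r:.2f}")
```

A strong positive correlation on data like this is what it would mean for the implicit gaze metric to track, and potentially replace, the explicit scores.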

We measured EEG and facial expression during the video presentation. These biometrics provided qualitative information about cognitive workload, interest, motivation, and emotions involved during the process of rating the video quality. The facial expression measures indicate whether you are feeling happy, sad or angry about a scene. The additional information may help us understand the threshold beyond which perceived video quality changes, and show where video quality rating methods have problems: for example the EEG values show people’s “approach motivation” decreased as the test progressed. Motivation is a top-down factor that can affect visual saliency. Combining EEG and face biometrics with eye tracking data can also help us understand which area of the screen causes a perception of quality.

Overall, our Lab’s study found that we can measure how compression affects both explicit and implicit human perception. It emphasises the importance of implicit measures when trying to build video compression technologies that are modelled on the way the human visual system works. If you can identify which points in a video affect subjective quality perception, then you will be able to maximise compression without affecting perception of the video.

From this study, we want to establish how Cogisen’s compression tool, SENCOGI®, can increase video compression without affecting how the content is perceived. This next generation of video compression will require implicit measurements, so we can work at the limits of human perception. It also points the way to a new generation of video quality estimation techniques that move beyond the limitations of explicit video quality reports. As a neuroscientist I’m incredibly excited to see where this research takes us. At the Cogisen Cognitive Laboratory we are at the forefront of current experimentation.

Maria Laura Mele is the Neuroscience Lead at Cogisen and holds a Doctorate in Cognitive and Physiological Psychology from Sapienza Università di Roma.
