There has been significant progress towards understanding the hierarchical sequence of operations during the first ~150 milliseconds of visual processing (“immediate vision”), culminating in a theoretical framework instantiated in successful computational algorithms for visual recognition. The bottom-up computations involved in immediate vision (Module 1) provide an initial and often accurate estimate of the contents within a radius subtending approximately 5 degrees around the fixation point. Module 2 aims to understand the visual routines and computations that take place during the subsequent ~300 ms of cortical processing and that are critical for the perceptual intelligence required to interpret a visual scene.
We think of the brain’s operating system as consisting of a set of visual routines (and subroutines), together with the ability to flexibly and dynamically call upon and combine them to solve specific visual tasks. As a working hypothesis, we postulate that the following visual routines are key components required for scene understanding:
- Extracting initial sensory map → Call VisualSampling
- Propose image gist → Call RapidPeripheralAssessment
- Propose foveal objects → Call FovealRecognition
- Inference → Call VisualInference
- Specific detectors → Call ObjectClassifier, call ObjectLocator
- Temporary information storage → Call VisualBuffer
- Task-dependent sampling → Call EyeMovement
- Determine spatial relationships → Call SpatialRelationships
- Determine object interactions → Call ObjectInteractions
- Decision making and answer → Call DecisionMaking, call TaskReport
The projects in Module 2 are actively pursuing the neural and computational mechanisms that instantiate these routines; the sketch below illustrates one way such routines might be composed.
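To make the hypothesized control structure concrete, the following Python sketch shows one way a task controller could flexibly chain these routines across fixations. It is purely illustrative: every function name, signature, and return value is a placeholder assumption that mirrors the routine names in the list above, not an implementation from the projects themselves.

```python
# Hypothetical sketch of a controller composing the visual routines above.
# All routine bodies are placeholder stubs; characterizing the actual neural
# computations behind them is the goal of Module 2.

from dataclasses import dataclass, field


@dataclass
class VisualBuffer:
    """Temporary information storage across fixations (routine: VisualBuffer)."""
    items: list = field(default_factory=list)

    def store(self, item):
        self.items.append(item)


def visual_sampling(image, fixation):
    """Extract the initial sensory map around the current fixation."""
    return {"fixation": fixation, "patch": image}  # placeholder sensory map


def rapid_peripheral_assessment(sensory_map):
    """Propose a coarse gist of the scene from peripheral information."""
    return "indoor scene"  # placeholder gist


def foveal_recognition(sensory_map):
    """Propose object identities within ~5 degrees of the fixation point."""
    return ["cup"]  # placeholder foveal objects


def eye_movement(gist, buffer, task):
    """Select the next task-dependent fixation, or None when sampling is done."""
    return None if len(buffer.items) >= 3 else (0, 0)  # stop after 3 fixations


def spatial_relationships(buffer):
    """Determine spatial relationships among buffered objects."""
    return [("cup", "on", "table")]  # placeholder relations


def decision_making(relations, task):
    """Combine buffered evidence into a task report (routine: TaskReport)."""
    return f"answer to {task!r}: {relations}"


def answer_visual_task(image, task):
    buffer = VisualBuffer()
    fixation = (0, 0)                                  # initial fixation
    while fixation is not None:
        smap = visual_sampling(image, fixation)        # initial sensory map
        gist = rapid_peripheral_assessment(smap)       # image gist
        objects = foveal_recognition(smap)             # foveal objects
        buffer.store((fixation, gist, objects))        # temporary storage
        fixation = eye_movement(gist, buffer, task)    # task-dependent sampling
    relations = spatial_relationships(buffer)          # spatial relationships
    return decision_making(relations, task)            # decision and report


print(answer_visual_task(image=None, task="where is the cup?"))
```

The design choice worth noting is that the controller does not hard-wire an order beyond the initial sensory sampling: the loop over fixations lets task demands (via the hypothetical eye_movement stub) determine how often and where the recognition routines are re-invoked, consistent with the idea of routines being called flexibly and dynamically.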