Slides transcription
Hierarchical models of the visual cortex
Thomas Serre
Brown University, Department of Cognitive, Linguistic & Psychological Sciences
Brown Institute for Brain Sciences, Center for Vision Research

Classifier-based importance maps
• ERP signals recorded from patient #004
Arslan, Singer, Madsen, Kreiman & Serre (unpublished)

Rapid presentation paradigms
• Subjects get the gist of a scene from ultra-rapid image presentations
- No time for eye movements
- No top-down influences / expectations
• Coarse initial base representation
- Enables rapid object categorization
- Does not require attention
- Sensitive to background clutter
- Insufficient for object localization
Potter 1971; Biederman 1972; Thorpe et al 1996; Li et al 2002; Evans & Treisman 2005; Serre et al 2007; see Fabre-Thorpe 2011 for review

Rapid categorization: Behavior
• Backward-masking paradigm: image, blank interval, then a 1/f noise mask at ~50 ms SOA; task: animal present or not?
• Tested in human subjects and in two monkeys (Dy and Ry)
[Figure: accuracy (%), 50–100, for the image / interval / image-mask conditions]
Cauchoix, Crouzet, Fize & Serre (unpublished)

• Familiar vs. novel images, humans and monkeys
[Figure: (A, B) accuracy (50–100%) for familiar vs. new images in humans and in monkeys M1 and M2; (C) scatter of monkey vs. human "animalness" scores, 0–1 (Mon: 0.49, Hum: 0.47); corrected correlations for the Hum/Hum, Hum/Dy, Hum/Ry and Dy/Ry pairs]
Cauchoix, Crouzet, Fize & Serre (unpublished)

Setup
• Recordings along the ventral visual stream; behavioral report by button release and touch screen on targets
Cauchoix, Crouzet, Fize & Serre (unpublished); image source: DiCarlo

Decoding
• Robust single-trial decoding of category information from fast ventral stream neural activity, ~70–80 ms on the fastest trials
• Neural activity linked to behavioral responses (both accuracy and reaction times)
Cauchoix, Crouzet, Fize & Serre (unpublished)
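A sliding-window linear decoder is the standard way to obtain this kind of single-trial, time-resolved readout. The sketch below is not the authors' pipeline: the data shapes, the 5 ms bin width, and the use of scikit-learn's LogisticRegression are assumptions made for illustration, with random numbers standing in for the recordings.

```python
# Sketch of single-trial category decoding with a sliding-window linear
# classifier. Hypothetical data shapes and parameters; not the authors' code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_channels, n_bins = 200, 32, 60   # 60 bins of 5 ms = 300 ms
X = rng.standard_normal((n_trials, n_channels, n_bins))  # stand-in for recordings
y = rng.integers(0, 2, n_trials)             # animal present (1) or absent (0)

window = 4  # 20 ms sliding window
for t0 in range(0, n_bins - window, window):
    feats = X[:, :, t0:t0 + window].reshape(n_trials, -1)
    acc = cross_val_score(LogisticRegression(max_iter=1000), feats, y, cv=5).mean()
    print(f"{t0 * 5:3d}-{(t0 + window) * 5:3d} ms: decoding accuracy = {acc:.2f}")
```

On real recordings, the time at which accuracy first rises above chance gives the ~70–80 ms latency figure quoted above.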
Feedforward hierarchical model of object recognition
Riesenhuber & Poggio 1999; Serre, Kouh, Cadieu, Knoblich, Kreiman & Poggio '05, '07; Serre, Oliva & Poggio '07
• System-level feedforward computational model: an initial attempt to reverse-engineer the ventral stream of the visual cortex
• Large-scale (~10^8 units), spans several areas of the visual cortex
• Some similarities with state-of-the-art computer vision systems based on hierarchies of reusable parts (Geman, Bienstock, Yuille, Zhu, etc.) as well as convolutional and deep belief networks (LeCun, Hinton, Bengio, Ng, etc.; see also Fukushima's neocognitron)
• But constrained by anatomy and physiology, and shown to be consistent with experimental data across areas of the visual cortex, from single-cell recordings to psychophysics on human subjects:

Area    Type of data                                        Ref. biol. data   Ref. model data
Psych.  Rapid animal categorization                         (1)               (1)
Psych.  Face inversion effect                               (2)               (2)
LOC     Face processing (fMRI)                              (3)               (3)
PFC     Differential role of IT and PFC in categorization   (4)               (5)
IT      Tuning and invariance properties                    (6)               (5)
IT      Read out for object category                        (7)               (8,9)
IT      Average effect in IT                                (10)              (10)
V4      MAX operation                                       (11)              (5)
V4      Tuning for two-bar stimuli                          (12)              (8,9)
V4      Two-spot interaction                                (13)              (8)
V4      Tuning for boundary conformation                    (14)              (8,15)
V4      Tuning for Cartesian and non-Cartesian gratings     (16)              (8)
V1      Simple and complex cells tuning properties          (17–19)           (8)
V1      MAX operation in subset of complex cells            (20)              (5)

1. Serre, T., Oliva, A., and Poggio, T. Proc. Natl. Acad. Sci. 104, 6424 (Apr. 2007).
2. Riesenhuber, M. et al. Proc. Biol. Sci. 271, S448 (2004).
3. Jiang, X. et al. Neuron 50, 159 (2006).
4. Freedman, D.J., Riesenhuber, M., Poggio, T., and Miller, E.K. Journ. Neurosci. 23, 5235 (2003).
5. Riesenhuber, M. and Poggio, T. Nature Neuroscience 2, 1019 (1999).
6. Logothetis, N.K., Pauls, J., and Poggio, T. Curr. Biol. 5, 552 (May 1995).
7. Hung, C.P., Kreiman, G., Poggio, T., and DiCarlo, J.J. Science 310, 863 (Nov. 2005).
8. Serre, T. et al. MIT AI Memo 2005-036 / CBCL Memo 259 (2005).
9. Serre, T. et al. Prog. Brain Res. 165, 33 (2007).
10. Zoccolan, D., Kouh, M., Poggio, T., and DiCarlo, J.J. Journ. Neurosci. 27, 12292 (2007).
11. Gawne, T.J. and Martin, J.M. Journ. Neurophysiol. 88, 1128 (2002).
12. Reynolds, J.H., Chelazzi, L., and Desimone, R. Journ. Neurosci. 19, 1736 (Mar. 1999).
13. Taylor, K., Mandon, S., Freiwald, W.A., and Kreiter, A.K. Cereb. Cortex 15, 1424 (2005).
14. Pasupathy, A. and Connor, C. Journ. Neurophysiol. 82, 2490 (1999).
15. Cadieu, C. et al. Journ. Neurophysiol. 98, 1733 (2007).
16. Gallant, J.L. et al. Journ. Neurophysiol. 76, 2718 (1996).
17. Schiller, P.H., Finlay, B.L., and Volman, S.F. Journ. Neurophysiol. 39, 1288 (1976).
18. Hubel, D.H. and Wiesel, T.N. Journ. Physiol. 160, 106 (1962).
19. De Valois, R.L., Albrecht, D.G., and Thorell, L.G. Vision Res. 22, 545 (1982).
20. Lampl, I., Ferster, D., Poggio, T., and Riesenhuber, M. Journ. Neurophysiol. 92, 2704 (2004).

"…finding that the model of the dorsal stream competed with a state-of-the-art action-recognition system (one that outperformed many other systems) on all three data sets. A direct extension of this approach led to a computer system […] this model produced a large dictionary of optic-flow patterns that seems consistent with the response properties of cells in the medial temporal (MT) area in response to both isolated gratings and plaids, i.e., two gratings superimposed."
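As a concrete illustration of the ventral-stream model's two alternating operations, template matching ("tuning") in the S layers and MAX pooling in the C layers (Riesenhuber & Poggio 1999), here is a minimal single-scale sketch. It is a toy version for illustration only, not the published implementation: the Gabor parameters, the rectification, and the pooling grid are simplifying assumptions.

```python
# Toy single-scale sketch of the S (tuning) / C (MAX pooling) alternation in
# HMAX-style hierarchies. Parameters are illustrative, not the published ones.
import numpy as np
from scipy.signal import convolve2d

def gabor(size=11, theta=0.0, wavelength=6.0, sigma=3.0):
    """Oriented Gabor filter: the classic S1 template."""
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    xr = x * np.cos(theta) + y * np.sin(theta)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)
    return g - g.mean()

def s_layer(image, filters):
    """S units: convolve with each template (tuning operation)."""
    return np.stack([convolve2d(image, f, mode="valid") for f in filters])

def c_layer(s_maps, pool=8):
    """C units: local MAX pooling over position, building invariance."""
    n, h, w = s_maps.shape
    h2, w2 = h // pool, w // pool
    trimmed = s_maps[:, :h2 * pool, :w2 * pool]
    return trimmed.reshape(n, h2, pool, w2, pool).max(axis=(2, 4))

image = np.random.rand(64, 64)
filters = [gabor(theta=t) for t in (0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)]
c1 = c_layer(np.abs(s_layer(image, filters)))
print(c1.shape)  # (4 orientations, pooled rows, pooled cols)
```

Higher S/C pairs in the full model repeat these same two operations over increasingly complex templates and larger pooling ranges.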
Beyond spatial orientation and spatial frequency
Issa et al 2000; Ts'o et al 2001
"…in functional terms, an additional level within the V2 organizational hierarchy, and at a finer grain than the view of V2 as a collection of CO stripes. The subcompartments for color and luminance seen in color stripes in optical imaging undoubtedly intermesh with the also observed representation of…" (Ts'o et al., Vision Research 41, 2001)
Fig. 10. Subcompartments for color, orientation and disparity within stripes of V2.
The three optical images were obtained from different animals. (A) Color-preferring and luminance-preferring subcompartments within a single thin stripe. (B) Pseudo-color coded image of orientation selectivity in V2, showing domains of orientation (blue arrows indicate zones containing pale and thick stripes, large patches of saturated colors), separated by regions of little apparent organization for orientation (thin stripes, lacking patches of saturated color). Color code: blue = horizontal, red = 45°, yellow = vertical, green = 135°. (C) Patches of tuned excitatory disparity cells (white patches, left blue arrow) within thick stripes; patches of color cells (dark patches, right blue arrow) can also be seen within thin stripes of V2. Note the similarity of the geometry of the subcompartments, 0.7–1.5 mm in size, regardless of functional type, whereas subcompartments for color (blobs) or (iso)orientation in V1 are smaller than those in V2, at ~0.2 mm in size.
See also Shmuel & Grinvald 1996

Color processing
Conway '01

Spatio-chromatic opponent operator
• Single-opponent (SO) and double-opponent (DO) stages over color-opponent channels (R/G, B/Y, Wh/Bl): oriented filtering, half-wave rectification / half-squaring, then divisive normalization
[Fig. 3. Schematic description of the spatio-chromatic opponent representation: single- vs. double-opponent receptive fields for the R/G, B/Y and Wh/Bl channels]

Comparison with glob cells in V4/PIT
[Figure: (A–C) polar hue-tuning curves for model units vs. glob cells (Conway & Tsao '09), spanning red, yellow, green, cyan and blue directions in color space]
Zhang & Serre, in prep

Color processing: Munsell data
[Figure: model SO representation fit to Munsell data, R² = 0.9952; CIELAB fit shown for comparison]
Zhang & Serre, in prep

Color processing: recognition benchmarks (ECCV)
• The SO/DO approach improves on all recognition and segmentation datasets tested, as compared to existing color representations

Table 2. Recognition performance on the Soccer Team and 17-category Flower datasets, in % classification accuracy. (Numbers in parentheses are the initial performance reported by [10, 31] using the same features in a bag-of-words scheme.)

                 Soccer team                  Flower
                 Color    Shape    Both       Color    Shape    Both
Hue/sift         69 (67)  43 (43)  73 (73)    58 (40)  65 (65)  77 (79)
Opp/sift         69 (65)  43 (43)  74 (72)    57 (39)  65 (65)  74 (79)
SOsift/DOsift    82       66       83         68       69       79
SOHmax/DOHmax    87       76       89         77       73       83
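One literal reading of the SO/DO operator described above (opponent channels, oriented filtering, half-squaring, then divisive normalization, with DO responses built by combining the two rectified signs of an opponent axis) is sketched below. The filter, the channel definitions, and the normalization constant are illustrative assumptions rather than the published parameters.

```python
# Sketch of single-opponent (SO) / double-opponent (DO) responses following
# the slide's pipeline: opponent channels -> oriented filtering ->
# half-squaring -> divisive normalization. Constants are illustrative.
import numpy as np
from scipy.signal import convolve2d

def half_square(x):
    """Half-wave rectification followed by squaring."""
    return np.maximum(x, 0.0) ** 2

def divisive_norm(responses, sigma=0.1):
    """Each response divided by the pooled activity of the population."""
    pooled = responses.sum(axis=0, keepdims=True)
    return responses / (sigma**2 + pooled)

rgb = np.random.rand(64, 64, 3)
r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
# Opponent channels: R/G, B/Y, plus a luminance (Wh/Bl) channel
opponent = np.stack([r - g, b - (r + g) / 2, (r + g + b) / 3])

# Oriented derivative filter as a stand-in for the oriented RF
k = np.array([[1.0], [0.0], [-1.0]])
so = np.stack([convolve2d(c, k, mode="same") for c in opponent])

# SO responses: both rectified signs of each opponent axis (R+G-, G+R-, ...)
so_pairs = half_square(np.concatenate([so, -so]))
so_norm = divisive_norm(so_pairs)

# DO responses: combine the two rectified signs of one opponent axis,
# giving sensitivity to chromatic edges regardless of contrast polarity
do = so_norm[:3] + so_norm[3:]
print(so_norm.shape, do.shape)
```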
Fig. 4. Filters and their components used in the spatio-chromatic opponent operator. (A) Gradient in the y direction used in sift computation [14]. (B) Gabor filters used in Hmax [15]. (C) Gaussian derivatives used in segmentation [19]. From left to right: the original filter and the individual center and surround components used to process the input color channels. Note that additional filters with different orientations, scales and phases are also used in Hmax.

• On these two datasets, the SO and DO (sift and Hmax) descriptors are highly diagnostic of object categories, more so than their grayscale counterparts
• It was shown, however, that the performance of various color descriptors could be further improved on this dataset (up to 96% performance) when used in conjunction with semantic features (i.e., Color Names) and bottom-up and top-down attentional mechanisms [32]; whether such an approach would similarly boost the performance of the SO and DO descriptors should be further studied

Pascal VOC challenge
Table 3. Recognition performance on the Pascal VOC 2007 dataset, as mean average precision (AP) over all 20 classes. (Numbers in parentheses are the best performance reported in [37, 6] for approaches that do not rely on any prior knowledge about object categories.)

Method   sift        Huesift     Opponentsift   Csift       SODOsift           SODOHmax
AP       40 (38.4)   41 (42.5)   43             43 (44.0)   46.5 (33.3/39.8)   46.8 (30.1/36.4)

(Table 4 reports recognition performance on scene categorization.)
Zhang, Barhomi & Serre '12

Disparity processing
• Extends the energy model of stereo disparity (Ohzawa et al '90, Qian '94, Fleet et al '96)
Riesen & Serre (unpublished data); see Sasaki et al '10 for qualitatively similar results
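For reference, the energy model being extended combines a quadrature pair of receptive fields in each eye: a positional offset between the left and right RFs sets the preferred disparity, and summing the binocular simple-cell responses and squaring across the pair yields a disparity-tuned complex-cell energy. A minimal 1-D sketch with illustrative parameters, not the model's actual front end:

```python
# Minimal 1-D binocular energy unit (Ohzawa et al. '90 style): left and
# right receptive fields in quadrature, with a position shift encoding
# the preferred disparity. Parameters are illustrative.
import numpy as np

x = np.linspace(-2, 2, 201)

def quad_pair(center, freq=2.0, sigma=0.5):
    """Even/odd (quadrature) Gabor pair centered at `center`."""
    env = np.exp(-((x - center) ** 2) / (2 * sigma**2))
    return (env * np.cos(2 * np.pi * freq * (x - center)),
            env * np.sin(2 * np.pi * freq * (x - center)))

def binocular_energy(left_img, right_img, preferred_disparity):
    # Right RF shifted by the preferred disparity (position-shift model)
    le_c, le_s = quad_pair(0.0)
    re_c, re_s = quad_pair(preferred_disparity)
    resp_c = left_img @ le_c + right_img @ re_c  # binocular simple-cell pair
    resp_s = left_img @ le_s + right_img @ re_s
    return resp_c**2 + resp_s**2                 # complex-cell energy

def bar(pos, width=0.1):
    """Stimulus: a bright bar at `pos`."""
    return ((x > pos - width) & (x < pos + width)).astype(float)

# Bar at 0 in the left eye, shifted by d in the right eye
for d in (0.0, 0.25, 0.5):
    e = binocular_energy(bar(0.0), bar(d), preferred_disparity=0.25)
    print(f"stimulus disparity {d:.2f}: energy {e:.3f}")
```

Sweeping the stimulus disparity and reading out the energy traces a tuning curve that, with these toy parameters, peaks near the RF offset.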
Motion processing
[Figure: dorsal "motion" pathway (V1 → MT/MST → STS) alongside the ventral "shape" pathway (V1/V2 → V4/IT); separable vs. non-separable space-time receptive fields shown as x-t plots]

From G. DeAngelis, I. Ohzawa and R. Freeman, "Receptive-field dynamics":
Fig. 3. Spatiotemporal RF profiles (X-T plots) for neurons recorded from the LGN and striate cortex of the cat. In each panel, the horizontal axis represents space (X) and the vertical axis represents time (T). For panels A–F, solid contours delimit bright-excitatory regions, whereas dashed contours indicate dark-excitatory regions. To construct these X-T plots, 1-D RF profiles (see Fig. 2) are obtained at finely spaced time intervals (5–10 ms) over a range of values of T. These 1-D profiles are then "stacked up" to form a surface, which is smoothed and plotted as a contour map (for details, see Refs. 8, 34). (A) An X-T profile is shown here for a typical ON-center, non-lagged X-cell from the LGN. For T < 50 ms, the RF has a bright-excitatory center and a dark-excitatory surround. However, for T > 50 ms, the RF center becomes dark-excitatory and the surround becomes bright-excitatory. Similar spatiotemporal profiles are presented elsewhere (Refs. 9, 36). (B) An X-T plot is shown for an ON-center, lagged X-cell. Note that the second temporal phase of the profile is strongest. (C) An X-T profile for a simple cell with a space-time separable RF. For T < 100 ms, the RF has a dark-excitatory subregion to the left of a bright-excitatory subregion. For T > 100 ms, each subregion reverses polarity, so that the bright-excitatory region is now on the left. Similar X-T data are presented elsewhere (Refs. 8, 30, 34). (D) Data for another simple cell with an approximately separable X-T profile. (E) Data for a simple cell with a clearly inseparable X-T profile. Note how the spatial arrangement of bright- and dark-excitatory regions…

Figure 13.1. Neural model for the processing of dynamic face stimuli. Form and motion features are extracted in two separate pathways. The addition of asymmetric recurrent connections at the top levels makes the units selective for temporal order. The highest level consists of neurons that fuse form and motion information.
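The separable vs. inseparable distinction in this caption is the linear substrate of direction selectivity: an RF whose X-T profile is oriented (inseparable) prefers one direction of motion, whereas a separable RF responds symmetrically. The sketch below, with made-up sizes and speeds, compares the two on a bar drifting left or right.

```python
# Space-time separable vs. inseparable (direction-selective) RFs, after the
# X-T plot distinction in the DeAngelis et al. figure. Illustrative sizes.
import numpy as np

x = np.linspace(-1, 1, 65)          # space
t = np.linspace(0, 0.2, 33)         # time (s)
X, T = np.meshgrid(x, t)

gauss_x = np.exp(-X**2 / 0.08)
env_t = np.exp(-((T - 0.1) ** 2) / 0.002)

# Separable RF: spatial profile multiplied by a temporal profile
rf_sep = gauss_x * np.cos(2 * np.pi * 2 * X) * env_t

# Inseparable RF: phase advances over time -> oriented in X-T,
# hence selective for one direction of motion
rf_insep = gauss_x * env_t * np.cos(2 * np.pi * (2 * X - 10 * T))

def drifting_bar(velocity):
    """Stimulus: a bright bar sweeping through the RF at `velocity`."""
    return (np.abs(X - velocity * (T - 0.1)) < 0.1).astype(float)

for v in (-5.0, 5.0):
    stim = drifting_bar(v)
    print(f"v={v:+.0f}: separable {np.abs((rf_sep * stim).sum()):7.2f}  "
          f"inseparable {np.abs((rf_insep * stim).sum()):7.2f}")
```

In a full motion-energy model this linear stage would be duplicated in quadrature and squared, exactly as in the disparity sketch above.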
Automated rodent behavioral analysis
Jhuang, Serre et al '07, '10; Kuehne, Jhuang et al '11

Table 1. Accuracy of the system. Accuracies are reported as averaged across frames / across behaviours (computed as the average of the diagonal entries of the confusion matrix in Figure 3; chance level is 12.5% for an eight-class classification problem).

                               'Set B' (1.6 h of video)   'Full database' (over 10 h of video)
Our system                     77.3% / 76.4%              78.3% / 77.1%
CleverSys commercial system    60.9% / 64.0%              61.0% / 65.8%
Human ('Annotator group 2')    71.6% / 75.7%

• Assessing the accuracy of the system is a critical task. We therefore made two comparisons: (I) between the resulting system and commercial software (HomeCageScan 2.0, CleverSys Inc.) for mouse home-cage behaviour classification, and (II) between the system and human annotators. The level of agreement between human annotators sets a benchmark for the system performance, as the system relies entirely on human annotations to learn to recognize behaviours. To evaluate the agreement between two sets of labellers, …
(Nature Communications, DOI: 10.1038/ncomms1064)

Visual control of navigation
[Figure: Humans vs. Model comparison]

What have we learned about visual processing?
What matters:
• Multi-stage / pooling mechanisms
• Normalization circuits:
- Tuning for 2D shape
- Color similarity ratings
- Tuning for relative disparity
- Perceived motion tuning in MT
- Classification accuracy
(see also Jarrett et al 2009; Pinto et al 2009)
What does not matter:
• Separate classes of simple and complex cells (Pinto et al 2009; O'Reilly et al 2013)
• MAX in HMAX
• Learning mechanisms?

Acknowledgments
• Past work at CBCL: C. Cadieu, H. Jhuang, M. Kouh, U. Knoblich, G. Kreiman, E. Meyers, A. Oliva, T. Poggio, M. Riesenhuber
• Lab members / Brown collaborators: A. Arslan, Y. Barhomi, K. Bath, S. Crouzet (now Charité – Universitätsmedizin Berlin), J. Kim, X. Li, M. McGill (now Caltech), D. Mely, S. Parker, D. Reichert, I. Sofer, W. Warren, J. Zhang (Hefei University of Technology), S. Zhang
• CNRS (France): E.J. Barbeau, G. Barragan-Jason, M. Cauchoix, D. Fize