Think! Evidence

An invariance-based account of feedforward categorization in a realistic model of the ventral visual pathway

dc.contributor Tomaso A. Poggio.
dc.contributor Massachusetts Institute of Technology. Department of Brain and Cognitive Sciences.
dc.creator Mutch, James Vincent
dc.date 2017-05-11T19:53:36Z
dc.date 2017
dc.date.accessioned 2019-05-10T17:26:30Z
dc.date.available 2019-05-10T17:26:30Z
dc.identifier http://hdl.handle.net/1721.1/108881
dc.identifier 986239832
dc.identifier.uri https://evidence.thinkportal.org/handle/1721.1/108881
dc.identifier.uri http://hdl.handle.net/1721.1/108881
dc.description Thesis: Ph. D., Massachusetts Institute of Technology, Department of Brain and Cognitive Sciences, 2017.
dc.description Cataloged from PDF version of thesis. "September 2016."
dc.description Includes bibliographical references (pages 115-118).
dc.description For the recognition of general objects in natural scenes, the current top-performing computer vision models owe a debt to visual neuroscience. The hierarchical architecture of convolutional networks, and related models such as HMAX, mimics that of the ventral stream of visual cortex. In essence, they apply the model of Hubel and Wiesel recursively, alternating layers of 'simple' cells, which are tuned to certain local features, and 'complex' cells, which pool the outputs of simple cells within a local region. With recent advances in deep learning, for many tasks in vision and speech, emphasis has moved away from so-called 'hand-designed' models and toward big data and high throughput computing, with models learning from millions of labeled examples. Yet CNNs only learn their features - the weights of connections in the network. All other aspects of the network (size, connectivity, response functions, etc.) are unlearned architectural choices made by their designers. Vision has not yet been reduced to a pure learning problem - human insight into the nature of visual problems continues to be important. To design a good vision system, one still has to understand vision. And, as evidenced by performance for many complex visual tasks, natural vision systems still 'understand' vision better than we do; there is still much to be learned from them. Our work is based on the HMAX model, which places greater weight on biological realism. Our goals are threefold: to better understand the ventral stream algorithm, as well as the visual problem it solves, and to improve the performance of artificial vision systems. In this work we take two main approaches. i-theory is an ongoing effort to explain the good performance of hierarchical models in terms of a formal theory of invariance to transformations. 
We provide a reinterpretation of V1 simple and complex cells in the context of i-theory as computing a high-dimensional, locally translation-invariant signature for the contents of a V1 receptive field. We describe a simple algorithm for learning them which can extend without modification to the learning of higher-order representations for V2 and beyond. The algorithm yields model V1 cells having a good fit to data from several animal species. We also demonstrate that a precondition of i-theory, covariance, can hold in upper layers, even for transformations not anticipated in the training of lower layers. No current hierarchical object recognition model incorporates realistic retinal resolution. Incorporating this detail forces a reevaluation of the role of the ventral stream's feedforward core in the larger task of scene understanding as well as many details of the model itself, particularly with respect to scale. We investigate the optimal shape of the input window used to select a subset of the visual information available in a scene for processing in a single feedforward pass, defined as a region in (x, y, λ), the handling of the λ dimension within the hierarchy, and the problem of clutter. Our main experimental results are (1) spatial wavelengths too small for the retina to perceive across the entire object do not play a significant role in the no-clutter case, but confer robustness in the presence of clutter, and (2) preservation by the hierarchy of information about the relative scale (distance along λ) of feature activations is more important than current models reflect.
dc.description by Jim Mutch.
dc.description Ph. D.
dc.format 118 pages
dc.format application/pdf
dc.language eng
dc.publisher Massachusetts Institute of Technology
dc.rights MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission.
dc.rights http://dspace.mit.edu/handle/1721.1/7582
dc.subject Brain and Cognitive Sciences.
dc.title An invariance-based account of feedforward categorization in a realistic model of the ventral visual pathway
dc.type Thesis


Files in this item

Files Size Format
986239832-MIT.pdf 12.33Mb application/pdf
