\documentstyle[12pt,titlepage]{article} \begin{document} \baselineskip = 0.3in \title{ Combining Behaviors } \author{Jude Mitchell } \date{Third Year IP \\ UCSD Cognitive Science Dept.\\ June 5, 1998 \\ \vspace{8ex} \noindent Committee: \\ Prof. David Zipser \\ Prof. Javier Movellan \\ Prof. Marty Sereno } \maketitle \section{Introduction} This paper reviews the problem of combining movement commands from different behaviors. New approaches to robotics construct systems that operate by combining commands from separate behavioral modules. Each module delivers commands that run a simple behavior such as obstacle avoidance or road following. This work is inspired by the ability of insects and amphibians to exhibit highly adaptive behavior in navigating their environments with a fairly limited behavioral repertoire. A key problem encountered in this approach is that different behaviors can attempt to execute incompatible movements at the same time. For example, the movement generated for scratching an itch is not compatible with drinking a cup of coffee at the same time. In most cases, averaging the movements from different behaviors yields undesirable results. Instead, control between the behaviors must be coordinated in time. The first section of this paper reviews the extent to which humans are able to coordinate two tasks at the same time. Interference between tasks depends largely on the demands that they place on controlled cognitive processes. In the second section, schema theory is introduced as a behavioral and biological theory of how to coordinate concurrent behaviors. In the third section, the approaches from robotics are reviewed; recent robotics architectures are similar to those proposed by schema theory. In the final section, reinforcement learning is briefly reviewed. It provides a method for optimizing control of movement directly from rewards in the environment, and it can be used to learn how to combine commands generated from separate behaviors.
The problem of combining commands from separate behaviors can be illustrated by a simple navigation problem. Suppose that a robot has two tasks: chasing a moving target and avoiding obstacles. A behavior for chasing a target will often command movement directions opposite to those commanded by the behavior avoiding obstacles. For example, if an obstacle lies between the robot and the target, the two commands will be exactly opposite. In such cases, averaging traps the robot in a local minimum where it never moves. Besides this problem, there is a second incompatibility that occurs at the level of the sensory input. This conflict emerges because the field of view of the visual sensor is limited. Since the two behaviors need to know where different objects are located in order to plan their respective movements, there can be a conflict over where the visual sensor should be focused. In part, this conflict lies in the control of gaze, but it also has interesting consequences through the limits it imposes on visual information. Since neither behavior can be guaranteed control of gaze, each must be robust to periods in which visual input is absent or irrelevant to it. This means that each must have a working memory for the locations of the objects relevant to its control. Further, since knowing when to take control of gaze may hinge on whether or not the information in working memory is accurate, each behavior must also estimate the reliability of its stored information. This example, and many others in which the visual sensor is active rather than passive, raises interesting control problems. \section{Dual-Task Performance} Even when no physical limitations are imposed on sensory input or motor output, cognitive limitations in processing behaviors simultaneously still occur, as evidenced by the psychological refractory period (PRP). The PRP is a slowing in the reaction time to a stimulus in one task when another task is performed concurrently (for review see Pashler, 1994).
In a typical experiment, two simple stimulus-to-response mappings are performed together. Subjects attempt to respond to each stimulus as quickly as possible without errors. Interference between the tasks is probed by varying the interval between the presentation of the first and second stimulus, called the stimulus onset asynchrony (SOA). If there is no interference, then reaction times for both tasks remain unchanged as the SOA interval shortens. If, however, they interfere, then the reaction times slow at shorter SOA intervals. A large variety of tasks display some degree of interference. Interference persists even when physical limitations are prevented by isolating responses to different effectors and different sensory modalities. Several PRP studies indicate that bottlenecks occur in processing that force one task to wait for another. Bottlenecks are revealed when the order of the two tasks is fixed and subjects are instructed that task 1 should have higher priority. This results in task 1 reaction times remaining unaffected while task 2 reaction times increase as the SOA interval becomes shorter. When task 2 reaction times are plotted as a function of the SOA, they have a slope near -1 at short intervals, reflecting that each reduction in the interval between the tasks adds an equal delay to task 2. These findings are consistent with task 1 locking task 2 out of a stage in the processing from stimulus to response. The response selection bottleneck (RSB) hypothesis places the processing limit at the stage where stimuli are mapped to responses (Pashler, 1994). Three stages of processing are assumed: stimulus identification, response selection, and response execution. Two tasks can run in parallel at the identification and execution levels, but must take turns at the selection level. Manipulating the difficulty of different stages of processing reveals that the bottleneck lies at the selection level.
First, when either the identification or selection stage of task 1 is lengthened, it produces extra delay in task 2. In contrast, lengthening the execution of task 1 has no effect. This suggests that the bottleneck is imposed before response execution. Another manipulation shows that if the identification stage of task 2 is prolonged, it leads to less delay for task 2. This finding is explained by the task 2 selection stage being pushed back so that it overlaps less with task 1 selection. In short, task 2 must wait for task 1 to complete at the selection stage. Many variations of the PRP experiments support this bottleneck (Pashler, 1994). Some tasks can avoid bottlenecks when their mappings from stimulus to response are particularly natural. These types of mappings are called {\it ideomotor compatible}. An early experiment by Greenwald and Shulman (1973) demonstrates that two ideomotor tasks do not delay each other when performed together. In the first task, a flashed arrow directs the subject to make a left or right movement. In the second task, the subject repeats an aurally prompted letter 'A' or 'B'. Interestingly, if the arrow in the visual-manual task is replaced by the word 'left' or 'right', then normal refractory periods return. Thus even though the stimulus is compatible with the response movement, it is not ideomotor. In general, ideomotor stimuli have physical characteristics that prompt the desired response. Some basic movements to visual targets cause no delays in processing. In double-step reaching experiments there is no delay for reprogramming a reach in progress when the target is moved to a different location (De Jong, 1995; Barrett, 1993). Two reaction times are measured during the reach: the first is the time to initiate the original movement, and the second is the time to change direction after the target has moved. The SOA interval is given by the time between the initial target appearance and its movement.
If programming the initial reach caused a bottleneck, then the second reaction time should be slower at short SOA intervals. Instead, reaction times remain unaffected. Further, the time needed to start a movement and the time needed to adjust it are very similar. Another experiment shows that certain saccadic movements avoid delays in processing (Pashler et al, 1993). In this experiment, a saccade task always follows a speeded manual response to a tone. Four variants of the saccade task are tested. In the first task, a saccade is made to a target that appears on the left or right of fixation. In the second task, a red and a green target appear on either side of fixation, and a saccade is made to the red one. Both of these tasks have negligible refractory period effects. Tasks in which the saccade direction is determined by the color of a single target at the fixation point (red means right, green means left) or by the larger of two adjacent digits have normal refractory periods. The delay is present for cues with symbolic relations to the target of the movement, but not for cues that are themselves the target of the movement. The distinction between automatic and controlled processes explains some of the differences between ideomotor tasks and those tasks that cause refractory periods. Controlled processes can map stimuli to arbitrary responses, but have limited capacity. Automatic processes avoid capacity limits, but implement inflexible mappings. In PRP studies, the response selection stage requires controlled processing to map stimuli to the relatively arbitrary responses specified in the task instructions. Due to the limited capacity of controlled processing, when two tasks both require the selection of novel responses, one must wait for the other. Tasks which are well practiced or common in natural experience can become automatic (Schneider and Shiffrin, 1977). Automatic processing avoids bottlenecks through direct mappings from stimuli to responses.
Automatic processes are thought to be inflexible and below the level of deliberate control. Two aspects of dual-task performance are obscured by this definition of automatic processing. First, although practice can reduce the magnitude of interference between tasks, it rarely abolishes it (Pashler, 1994). Thus some interference, whether from the response selection bottleneck or otherwise, remains even between automatic behaviors. Second, although automatic behaviors may be less flexible, they can still show remarkable coordination without deliberate control. For example, in normal conditions an itch may cause a fairly reflexive response to move the hand and scratch it. This response is suppressed or replaced by another movement when the hand happens to be holding a cup of coffee. These types of conflicts are widespread in everyday situations, and yet they seem to require little effort to detect or resolve. In short, even when controlled processing is left aside, a great deal remains to be explained about the flexibility achieved when automatic behaviors are in conflict. \section{Schema Theory and Competitive Processing} \subsection{Schema Theory} Norman and Shallice (1986) use schema theory to explain how automatic behaviors can be coordinated without deliberate control. Their model consists of schemas that plan actions in parallel. Here schemas are automatic behaviors that map perceptual routines to motor responses. Although each schema may be fairly inflexible, adaptive behavior can emerge from their interactions. A process they call contention scheduling determines which schema gains control of action through a competition between schemas. This competition resembles a winner-take-all network in which each node corresponds to a schema. The activity of each schema is determined by its relevance given current sensory cues and the context (here context may refer to internal goals or states as well as the external world).
Schemas executing compatible behaviors excite each other, while those executing incompatible behaviors inhibit each other. Competitive interactions between two schemas can arise at any stage of processing where they control the same effectors or access the same sensory or cognitive resources. If two schemas run along separable pathways, then no delays occur. However, if two are incompatible at any point in processing, then one yields control to the other and delays that stage of processing. The second part of their theory postulates that controlled processes are responsible for novel mappings, decision making, planning, and the inhibition of automatic responses. A supervisory system is proposed to implement controlled processing by adding extra excitation or inhibition to competing schemas. {\it Attention} is defined to be the modulatory influence of the supervisory system upon the schemas. It acts on a slow time scale relative to the schemas. Therefore, it does not control the precise timing of movements, but instead selects the schema executing the movements. Further, attention does not have absolute control over which schemas are active. If a schema detects conditions for which it has high relevance, its activity can rise enough to seize control of the system and redirect attention. For example, in navigation a schema for walking can take control of gaze to look at a pothole in a sidewalk that is detected in the periphery of the visual field. This is one example of an orienting response (for review see Rohrbaugh, 1994). \subsection{Modeling Frog Behavior} Schema theory has been used to model how separate visuomotor behaviors are combined in the frog (Cobas and Arbib, 1992; Arbib, 1991). When frogs detect small fly-like objects in their visual field, they turn and move towards them. In opposite fashion, frogs turn and move away from large looming stimuli that resemble predators. If two flies appear, the frog typically selects one of them to pursue.
Likewise, it typically flees one of two predators, not their average. Physiological evidence shows that separate classes of retinal ganglion cells are sensitive to prey and predator stimuli (Ingle, 1991). The two classes of retinal cells project to different layers of the tectum. Cobas and Arbib (1992) model the behavior of the frog with separate visuomotor schemas: one controls prey-catching and the other controls predator-avoidance. The visual input for prey-catching is a one-dimensional array of units. Each unit responds selectively to stimuli that match its location in the visual map. Recurrent connections between units excite near locations and inhibit far locations. This recurrent structure forces units at different locations to compete for activity. When more than one fly appears in the visual field, one location captures all the activity in the map. Movements are then planned toward that target. Along the other pathway, visual inputs for predators feed into a second map with similar winner-take-all dynamics. This map selects a single predator and then plans movements away from it. Movement commands from the visuomotor schemas are combined in a motor map that represents the desired heading for the frog. Each visual map location representing a prey or predator makes a one-to-one connection to the corresponding motor map location. Connections from the predator map are inhibitory while connections from the prey-catching map are excitatory. A higher priority is given to fleeing from predators by making the inhibitory connections larger. Winner-take-all dynamics refine activity in the motor map so that the location with peak activity is selected. This direction determines where the frog moves. If the commands of the schemas are incompatible (a prey and a predator appear at the same location), then the predator-avoidance command dominates due to its higher priority.
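This combination rule can be sketched in a few lines of Python. The sketch is illustrative only, not Cobas and Arbib's actual equations: the map size and weights are arbitrary, and a simple argmax stands in for the recurrent winner-take-all dynamics.

```python
def select_heading(prey_map, predator_map, w_prey=1.0, w_pred=2.0):
    """Combine schema commands in a motor map and pick the winning heading.

    prey_map and predator_map are 1-D activity arrays over candidate headings.
    Prey input is excitatory, predator input inhibitory, and the inhibitory
    weight is larger (w_pred > w_prey) to give predator-avoidance priority.
    The argmax stands in for the winner-take-all refinement of the motor map.
    """
    motor_map = [w_prey * p - w_pred * q
                 for p, q in zip(prey_map, predator_map)]
    return motor_map.index(max(motor_map))

# Eight candidate headings; a fly and a predator both appear at heading 2.
prey = [0.0] * 8; prey[2] = 1.0
pred = [0.0] * 8; pred[2] = 1.0
# The stronger inhibition suppresses the shared location, so the winning
# heading moves the frog away from the predator rather than toward the fly.
heading = select_heading(prey, pred)
```

With no predator input, the same function simply returns the prey location, so the schemas cooperate when they agree and the priority scheme only matters when they conflict.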
If there are no predators, or two directions are equally good for avoiding the predator, then the prey-catching schema selects the direction to move. These dynamics enable the schemas to cooperate when they agree, and compete when they are incompatible. This model provides a biologically feasible example of a contention scheduling process between two schemas. Taking some liberties, the frog model can illustrate how supervisory control acts in a biological model. Suppose that frogs understand the instructions of a psychology experiment, and are instructed not to flee from predators when a secondary cue is delivered. The supervisory system is responsible for detecting the secondary cue, and then suppressing the habitual response to flee. Further, the supervisory process is expected to take longer because it accesses memory of the experiment instructions. Due to the lag in supervisory control, the initial activity in the predator and motor maps matches the automatic response to flee. After a delay, the supervisory decision not to flee begins to suppress activity in the predator and motor maps. The main prediction is that supervisory processes do not impose a bottleneck on processing, but instead modify existing automatic responses. \subsection{Schema Theory in Humans and Monkeys} Prefrontal lesions cause deficits specific to the planning and selection of novel action, the functions attributed to the supervisory system (for review see Duncan, 1994). Patients retain competence in ``crystallized'' skills acquired prior to the lesion. For example, many of them can still score well on WAIS intelligence tests, which emphasize factual knowledge. In contrast, performance on novel tests that require planning or reasoning is severely impaired. In problems that consist of a series of steps, subjects often fail to proceed unless they are prompted with appropriate sub-goals. Also, subjects fail to switch to different behavioral sets.
For example, they have difficulty adopting new sorting strategies in the Wisconsin card sorting task. Even very simple tasks are impaired. Subjects performing a sequential delayed saccade task saccade to targets in the wrong order (Heide et al, 1995). Subjects also fail to suppress planned saccades when a 'don't go' signal is given (Godefroy, 1996). Somewhat similar deficits appear in monkeys. Monkeys with prefrontal lesions have trouble learning tasks that require delayed or sequential responses (Fuster, 1994). In short, frontal lesions impair behaviors that require control of habitual responses, or the selection of novel ones. In monkeys, recent physiological experiments show that the dynamics of selecting motor responses do not reflect bottlenecks in processing, but instead that a slow supervisory process modifies automatic plans. In several motor areas, activity among neurons forms quickly to program an automatic response to a visual stimulus. If the task requires suppression or reprogramming of the response, then the initial activity is changed after some latency to match the correct response. Kalaska (1996) has identified these changes during go/no go reaching tasks in premotor cortex. In these tasks, the color of a visual target cues whether or not the monkey should reach to it. Recordings are made during a delay period prior to the reach from the neurons that are normally active for a movement to the target. These cells have an initial burst that persists over the delay period on 'go' trials. On 'no go' trials the initial burst is suppressed below the cells' normal threshold. A second task likewise shows that when another cue directs the monkey to reach away from the target, the initial activity reflects the target location but is eventually modified to code for a movement in the opposite direction. Other studies have documented similar responses in primary motor cortex (Requin and Riehle, 1995) and in the frontal eye fields (Hanes et al, 1998).
These findings indicate that no bottleneck delays the programming of automatic responses to targets. Controlled processes act on a slower time scale by altering the initial programs. In humans, eye movements to visual targets reflect a winner-take-all type process in which a single alternative is selected, not the average of alternatives. Competition between alternative actions is important in schema theory for preventing incompatible movements from being combined. These studies consider what happens when a subject is expecting to make a speeded saccade to a single target, but instead two targets appear (Ottes et al, 1984). This task does not allow time to deliberate on the response, and thus should reflect an automatic response. The movements produced by subjects usually selected a single target, not an average of the two. This result depends critically on the spatial separation between the targets and the amount of time before the saccade is initiated. If the two targets appear in opposite hemifields, or they are separated by more than 30 visual degrees, then a single target is selected. Targets closer than 30 degrees are averaged. A follow-up study considered what happens if a delay period (300 ms) precedes the saccade (Ottes et al, 1985). With more time, subjects discriminate between nearby targets. Physiological models of saccade target selection predict the dynamics observed in human subjects. At the physiological level, alternative saccade movements are represented by different locations in motor maps. This type of population code is observed in several brain areas programming movements. When two targets are present in a saccade task, activity in these maps initially appears at both locations. Over time, it is refined so that one location is significantly more active than any other (Glimcher and Sparks, 1992). The dynamics of this target selection are modeled by a winner-take-all network similar to Cobas and Arbib's frog model (1992).
When two targets are far apart, the model dynamics produce activity at either of the two locations but not both. When they are close, the dynamics produce activity at an average between the two (Kopecz and Schoner, 1995). These results match saccade behavior in humans at short latencies. Averaging for near targets can be prevented if the magnitudes of their visual inputs are slightly modulated to favor one over the other. Thus one way to achieve the enhanced spatial discrimination observed in humans at longer latencies is to add a decision process to the visual inputs that slightly favors one target over another. Recent studies have found cells with activity that reflects the dynamics of a decision process in target selection. Shadlen and Newsome (1995) describe a class of cells in the lateral intraparietal area (LIP) that predict upcoming movements and are also modulated by the certainty of their preferred movement being the right choice. In their experiment, a monkey is presented with two alternative saccade targets on either side of the fixation point. The correct target is cued by a window of moving dots above the fixation point. The monkey must discriminate in which direction the dots are moving and saccade in that direction. The percentage of dots moving in the same direction, called the {\it motion coherence}, is manipulated to adjust the difficulty of the discrimination. Three classes of response are identified among LIP cells. Movement cell activity varies only as a function of the actual saccade performed. Visual cell activity varies only as a function of the direction of motion and its coherence. A third class of cells is tuned both to the motion stimulus and the selected movement. The activity of this third class reflects the dynamics of a decision process. When the stimulus first appears, initial activity among these cells does not discriminate which saccade will be chosen.
As observation of the stimulus continues, the differences in their activities grow such that one target is clearly selected over the other. The magnitude of these differences does not reflect a pure motor response, but also depends upon the coherence of the stimulus. Differences are small for 0\% coherence and large for 100\% coherence. Thus their activity reflects in some measure the certainty of a chosen response being correct, or the probability of the monkey getting a reward for making the movement. Similar cells are also identified in the frontal eye fields (Schall and Hanes, 1993) and the superior colliculus (Basso and Wurtz, 1997). They also appear in primary motor cortex among reaching cells (Salinas and Romo, 1998). At sensory levels, the dynamics of pre-attentive (automatic) feature selection bear many similarities to the competitive processes proposed for movement selection. Experiments in visual search suggest that the identification of a target in a cluttered visual field can occur quickly when the target has a unique visual feature (color, for example). In fact, the amount of time needed to find the target remains relatively constant regardless of the number of other objects (distractors) in the visual field (Treisman, 1980). This suggests that a search process can check for the target at multiple visual locations simultaneously. Competitive processing in a winner-take-all network has been used to model parallel search (Koch and Ullman, 1985). The network consists of several different feature maps. Each feature map is recurrently connected to itself, and to the other maps. Visual locations in the feature maps compete for activity. When a feature is unique in a map, activity is enhanced at its visual location due to lack of competition from surrounding neighbors. This results in a 'pop-out' effect in which unique features are labeled.
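The pop-out effect can be sketched in Python. The divisive form of the within-map competition below is an assumption chosen for simplicity, not Koch and Ullman's actual formulation; the maps and activity values are hypothetical.

```python
def conspicuity(feature_map):
    """Within-map competition: each location's activity is divided by the
    total activity elsewhere in the same map. A feature that is unique in
    its map faces no competition and scores highest, while a feature shared
    by many locations is suppressed by its neighbors."""
    total = sum(feature_map)
    return [v / (1.0 + (total - v)) for v in feature_map]

def pop_out(feature_maps):
    """Sum the conspicuity scores across feature maps into a single saliency
    map; its peak is the location selected by winner-take-all."""
    scored = [conspicuity(m) for m in feature_maps]
    combined = [sum(vals) for vals in zip(*scored)]
    return combined.index(max(combined))

# Five items: one red target among four green distractors. The 'red' map
# has a single active location, so that location wins the competition.
red   = [0.0, 0.0, 1.0, 0.0, 0.0]
green = [1.0, 1.0, 0.0, 1.0, 1.0]
target = pop_out([red, green])
```

Note that the selection time in this scheme does not depend on the number of distractors, matching the flat search times reported for unique-feature targets.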
Further, if the target of search possesses a unique feature, the corresponding feature map can be given higher priority in search by a multiplicative modulation of its activity. This results in that visual location also being enhanced across the other feature maps. At the physiological level, sensory neurons in almost every cortical area exhibit an increase in baseline firing rate when visual attention is directed to their locations (Luck et al, 1997). Recent studies in visual area V4 indicate this increase in baseline activity is the result of a multiplicative modulation, an attentional gain field (Salinas and Abbott, 1997; Connor et al, 1997). In behavior, this change in baseline firing rate is associated with faster reaction times and lower detection thresholds for stimuli at the attended area (Posner and Petersen, 1990). When targets are defined by novel combinations of features, a supervisory process must guide the search for them by modulating the competition between visual locations. A serial search process is indicated in the data by search times that increase linearly as distractors are added to the visual field (Treisman, 1980). The factor preventing parallel search is the need to recognize whether a conjunction of features matches the target. It is proposed that target recognition is only possible at the visual location that is highlighted by attention. To find targets then, attention must move serially to each candidate until the target combination is found. Koch and Ullman (1985) model this search process through modulation of the winner-take-all map by a supervisory process. In this case, initial parallel competition within feature maps selects a potential target location. Then a recognition process checks whether the features at that location match the target. If the target is not found, the recognition process inhibits that location, thus forcing the winner-take-all network to shift activity to another candidate.
Assuming that the inhibition has some decay time, the highlighted area moves to new locations without backtracking. Some physiological evidence suggests that cells in visual areas V4, IT, and ventral prefrontal cortex are involved in recognizing selected objects (Desimone, 1996). Many of these cells are tuned preferentially to complex visual features and pictures. Given an identical picture inside their receptive fields, the response of these cells is enhanced if the picture matches the target of a memory-guided search. These cells may signal whether or not a desired target has been identified in the spotlight. A broader perspective on attention stresses its importance for linking sensory parameters to the areas that plan behavior (Neumann, 1994; Allport, 1994). In this theory, the key limitation in dual-task performance results from cross-talk between different channels of information from sensory input to motor output. The enhanced activity observed at attended locations is considered the outcome of competition between behaviors to get sensory parameters from different visual locations. Focusing attention on a single visual location enables the sensory features at that location to be accessed by a schema without interference from surrounding locations. Allport illustrates this idea with an example in which an observer plans a movement to a banana while there is an apple nearby. Control of the grip during the reach must fit the banana, not the apple. Allport argues that attention prevents the sensory parameters of the apple from bleeding into the control of the reach to the banana. This example has been tested in humans (Castiello, 1996). In normal conditions, subjects have no trouble reaching to the banana even when the apple is close to it. If, however, attention is divided between the two of them by the addition of a secondary task involving the apple, then the grip parameters of the apple bleed into the movement.
This experiment is somewhat analogous to studies of illusory conjunctions (Treisman, 1984). These experiments found that when attention was divided between objects in a flashed display, subjects would sometimes attribute a feature such as color to the wrong object. If attention was cued to a single visual location before the display was flashed, then illusory conjunctions did not occur. In short, there appears to be a restriction on the amount of sensory information that can be channeled from sensory input into the control of behavior without interference. Attention may provide a flexible way of linking arbitrary visual locations and objects to the schemas that use that information to guide behavior. \subsection{Parallel Visuomotor Pathways} Interference between concurrent behaviors can be reduced through independent, and possibly redundant, channels of processing from sensory input to motor control. Goodale (1994) emphasizes that the evolutionary origin of vision is not to construct world models, but to guide movements in a fast and efficient manner. He suggests that the dorsal visual pathways from striate to parietal and premotor cortex are not focused on the representation of 'where' things are located, but rather on 'how' to guide movements to them. This distinction is supported by the organization of visual and movement areas in the dorsal stream. In both parietal and premotor cortex, separate sub-fields specialize in planning eye, head, arm, and grasping movements. Within each sub-field, the sensory parameters that are useful in guiding the movements of that part of the body are represented (Wise et al, 1997). For example, sensory neurons in the areas controlling hand movements (parietal area 5) are tuned to the properties of object shape and size, which are crucial for controlling the grip of the hand (Sakata and Kusunoki, 1992).
In humans, lesions including this area impair the ability to coordinate the hand grip in relation to an object, but not to recognize the object or to describe its shape and size. Conversely, Goodale (1991) has reported a patient with a lesion to the ventral stream who can no longer identify or describe an object's shape but can still reach to it with an accurate grip. A second example of unique sensory parameters being stored in movement areas for specific body parts is given by how object locations are represented. Each sub-field maintains a separate representation of object locations that is relative to the specific body part it controls. There does not appear to be a central representation of the visual world. Instead, the visual world is divided into separate regions, each of which is anchored to a part of the body. This parallel organization may facilitate programming movements by reducing cross-talk between areas and providing more direct routes from sensation to action. In amphibians, there appear to be parallel pathways controlling different visuomotor behaviors (Goodale, 1994). This is demonstrated in the frog by re-wiring projections from its retina to the opposite side of the tectum during development (Ingle, 1973). Re-wiring causes prey-catching behavior to be reorganized. When a prey-like stimulus is presented on the left of a rewired frog, it responds by moving to the mirror-symmetric location on the right. Re-wiring also reorganizes the predator-avoidance behavior. Instead of jumping away from a large looming stimulus, rewired frogs will jump towards it. In other visual behaviors, however, re-wiring produces no changes. For example, if a tactile stimulus pats a rewired frog from behind, causing it to flee, it still jumps in the right direction to avoid barriers in front of it. Thus, the barrier avoidance behavior appears to use a separate visual pathway for planning movement.
In a follow-up study, it has been shown that re-wiring a pretectal nucleus causes a reversal that is specific to the barrier avoidance behavior (Ingle, 1980). In this case, the frog jumps into barriers, but its responses to prey and predators remain normal. Thus, at least these two behaviors appear to have different routes from input to output. \section{Behavior-Based Robotics} Some success has been achieved in the control of autonomous robots by combining sets of behavioral modules that run in parallel (Brooks, 1991). Classic approaches to robotics typically do not decompose the control of the robot into separate behaviors. Instead, control is decomposed into functional modules that process information from sensation to output in serial stages. A module for perception typically attempts to reconstruct from visual input a representation of what things are and where they are located. This representation is then used by another module that makes plans to reach the robot's objectives. Commands from planning are passed on to a motor module that executes them. Brooks (1986) explores an alternative that reduces the emphasis of classical approaches on constructing internal representations. Instead, control is decomposed into separate modules that guide simple behaviors. Each of the modules uses the sensory input to find features that are relevant to controlling its behavior. For example, a module for obstacle avoidance uses sonar sensors placed around the robot's body to steer away from nearby objects. Another module that steers the robot towards distant landmarks uses visual sensors to locate them. Each of the modules operates in parallel, detecting cues from sensory input and generating commands that move the robot. When the commands from several modules are combined, they can produce robots that are capable of navigating unfamiliar environments without collision. 
The early work in behavior-based robotics used a fixed priority scheme called {\it subsumption architecture} to combine commands from behavior modules (Brooks, 1986). In subsumption architecture, behaviors are organized into ascending levels of competence. The behavior at the lowest level executes its commands with no awareness of the other behaviors above it. However, even at the lowest level of competence the robot can execute actions that are meaningful for its survival. For example, a basic behavior such as obstacle avoidance can still function even in the absence of navigation goals by at least moving the robot out of the way of approaching vehicles. As higher-level behaviors are added, they impose extra constraints on the robot's behavior. They take as input both sensory information and the outputs from lower levels. When necessary, higher levels modify the output from lower levels and substitute their own commands. For example, a behavior that moves the robot towards landmarks can replace the movement suggested by obstacle avoidance with an alternative that still avoids the obstacle, but also moves closer to the landmark. Robots can be constructed with more sophisticated behavior by incrementally adding higher levels of competence. Two features of subsumption architecture are particularly undesirable. First, forcing sensory information to be processed along separate channels reduces opportunities to fuse redundant evidence in order to obtain more reliable perceptual estimates. The problem of combining different sources of evidence is called sensor fusion. Allowing some cross-talk between modules in order to share perceptual estimates may be beneficial without requiring the overhead of a complete world representation. The second problem relates to using a fixed priority scheme to combine movement commands. Replacing a lower-level command with one from a higher level can throw away useful information. 
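In its simplest form, the fixed-priority scheme can be sketched as follows. The module names, sensor format, and thresholds are invented for illustration (and the sketch omits the refinement in which higher levels read lower-level outputs); the point is only that the highest level with an opinion completely replaces everything below it:

```python
# Sketch of subsumption-style fixed-priority arbitration.
# Each level proposes a steering command (or None if it has no
# opinion); the highest level that speaks suppresses all lower ones.

def avoid_obstacles(sonar):
    """Level 0: turn away from the nearest close obstacle."""
    distance, bearing = min(sonar)          # (distance, bearing) pairs
    if distance < 1.0:                      # obstacle within 1 m
        return -bearing                     # steer away from it
    return None                             # no opinion

def seek_landmark(landmark_bearing):
    """Level 1: steer toward a distant landmark, if one is visible."""
    if landmark_bearing is not None:
        return landmark_bearing
    return None

def arbitrate(levels):
    """Return the command of the highest level with an opinion."""
    for command in reversed(levels):        # highest priority first
        if command is not None:
            return command
    return 0.0                              # default: go straight

sonar = [(0.5, 0.3), (2.0, -1.0)]           # nearby obstacle at bearing 0.3
levels = [avoid_obstacles(sonar), seek_landmark(0.8)]
print(arbitrate(levels))                    # landmark level wins: 0.8
```

Note that when the landmark level wins, the obstacle-avoidance command is discarded entirely, along with the obstacle information it encoded.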
One example of this is reported for a military vehicle that navigates on dirt roads (Payton et al., 1990). The lowest-level behavior in this system turns the vehicle to follow the road, while a higher level turns it away from obstacles. In cases where the road bends around an obstacle, the vehicle sometimes decides to turn off the road. This failure occurs because the avoidance module, which has control at the time, chooses how to turn with no regard to whether or not the vehicle stays on the road. Subsumption architecture solves these types of problems by letting the output from lower levels be used as input to the higher level. Thus, for this example, the obstacle avoidance behavior would have to bear the burden of changing what it does based on the lower road-following behavior. This solution undermines the benefits of a parallel architecture by forcing higher stages to wait for the products of lower stages. A more recent approach combines commands from behavior modules by having them vote for different movement alternatives (Rosenblatt and Thorpe, 1995). This approach has solved the road-following problem presented above. The architecture used here is very similar to the schema model presented earlier for controlling movement in the frog (Cobas and Arbib, 1992). Commands from separate behavior modules are integrated into a heading map to control the vehicle. Each element in the heading map represents a direction for the vehicle to turn. Behavior modules submit votes ranging from -1 to 1 for each heading alternative. In a winner-take-all fashion, the alternative receiving the largest sum of votes from the behavior modules is selected as the direction to move. If one behavior has a higher priority than another, its votes can be weighted more heavily in the voting process. Using this architecture, the vehicle no longer leaves the road in the presence of obstacles. 
This is because votes from the road-following module, though smaller in magnitude, can still tip the scales in favor of picking the right direction. The authors further suggest that it may be useful to vary the relative weights of the different behaviors in voting depending on the context. For example, one module in their system is designed to control turning when the vehicle is in danger of tipping over. Tipping over can be expected to be a greater hazard when traveling at higher speeds. Therefore, it makes sense for the tipping module to have greater priority in such contexts. An additional module called the mode manager is assigned the role of computing appropriate weights for each module given the context. Context may depend upon the internal goals or objectives of the system as well as sensory input. It is interesting that the authors arrive independently at a system that weights the relevance of different modules depending on the context. The magnitude of schema activation plays a similar role in schema theory. \section{Reinforcement Learning} Reinforcement learning can be used to solve the types of control problems encountered in robotics. Unlike supervised techniques, reinforcement learning does not require a teaching algorithm or examples of expert behavior. Instead, it uses a scalar reward from its environment to improve its behavior. In simple terms, it learns by exploring the world and estimating from experience which actions yield better rewards. Learning from reward signals is more ecologically valid and, for certain control problems, may be more practical when it is difficult to define optimal behavior or to design a teacher algorithm that solves the task. Unfortunately, like other optimization techniques, it can fall prey to local minima. This typically occurs when the system adopts a behavior that prevents it from exploring other actions or regions of the world that give better payoffs. 
Nonetheless, it provides a valuable alternative for problems in which the desired output is unknown. Furthermore, recent physiological evidence supports its biological feasibility. To introduce the reinforcement learning technique, a simple account of it is outlined here. The goal in reinforcement learning is to find the policy $\pi^{*}: s \rightarrow a$ that maps from sensory states, $s$, to actions, $a$, such that the average reward, $E\{r(s,a)\}$, received from the environment is maximized. To find the best policy, the system predicts the average reward it can expect for taking each action in a given sensory state. This estimate of reward is called the Q-value and is written as $Q(s,a)$. Several techniques related to Monte Carlo estimation and dynamic programming have been developed for estimating these Q-values (for review see Sutton and Barto, 1998). The simplest method is Monte Carlo sampling: the system wanders around the world attempting random actions and uses the rewards it gets back to improve its estimates. Once the estimates for the Q-values have converged, the optimal policy is given by choosing the action for each sensory state that has the highest expected reward, $\pi^{*}(s)=\arg\max_{a}Q(s,a)$. This outlines an approach in which the objective is only to maximize the reward for the current state of the robot, without any foresight as to whether or not current actions will lead to good states in the future. In general, the importance of future rewards is considered by instead optimizing the average discounted reward given by $E_{\pi}\{\sum_{k=0}^{\infty}\lambda^{k}r(s_{t+k},a_{t+k})\}$, where $\lambda$ is the discount factor that ranges from 0 to 1. This problem is more difficult to solve, but also more practical. For example, a robot may only receive reward signals in the environment when it reaches its destination or hits an obstacle, but nothing in between those events. 
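A minimal tabular sketch of this idea uses the one-step Q-learning update, one of the methods reviewed by Sutton and Barto (1998). The corridor world below is invented for illustration: reward arrives only at the goal state, yet the discounted update propagates its value back to earlier states.

```python
import random

# Tabular Q-learning on a tiny corridor: states 0..4, reward only at
# state 4. The discounted update target r + lambda * max_a' Q(s', a')
# lets the delayed reward propagate back, so earlier states learn
# which action leads toward the goal.
random.seed(0)

n_states, actions = 5, [-1, +1]             # move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, lam = 0.5, 0.9                       # learning rate, discount

for episode in range(200):
    s = 0
    while s != n_states - 1:
        a = random.choice(actions)          # pure exploration
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == n_states - 1 else 0.0
        target = r + lam * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

# Greedy policy: the action with the highest Q-value in each state.
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states)}
```

After training, the greedy policy moves right in every non-terminal state, even though only the final transition was ever rewarded.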
The lookahead of the discounted reward enables the robot to find paths to states that are rewarded. Variants of reinforcement learning have now been applied to adapting the behavior of autonomous robotic systems (Beom and Cho, 1995; Maes, 1992; Kaelbling, 1992). The estimation of average reward for action alternatives in Q-learning bears an interesting similarity to decision neurons in motor areas that respond more when their movement alternative is certain to be correct. As an example, reinforcement learning can be applied to learning to combine prey-catching and predator avoidance in the frog model. Again, a heading map is used to represent the frog's movement alternatives. In this case, though, the activation at each point in the heading map is an estimate of the expected reward for taking that action, $Q(s,a)$. A neural network can learn to map from retinal inputs into the heading map of Q-values. Training proceeds by trying actions randomly, observing the reward, and training the network to compute that reward for that alternative given its sensory input. A small positive reward may be given for turning the frog towards prey, and a large negative reward for turning towards predators. The magnitude of the rewards reflects the relative priorities of the two behaviors. After learning, the frog can be controlled by a winner-take-all network that selects the action with the highest activation in the heading map of Q-values. The Q-value units in the heading map are similar to the decision neurons discussed by Shadlen and Newsome (1996). These neurons have been found both to code for specific movement directions and to have activity that increases with the certainty that the movement will be rewarded. One of the key difficulties in applying reinforcement learning to robot control problems revolves around the trade-off between exploration and exploitation. 
In order to estimate expected rewards for the full range of actions and states, the system must explore its world and gather samples. If the space of state and action pairs is not properly sampled, then a learning system can easily fall into local minima in its behavior. For example, a robot may excel at predicting its reward, but only because it never does anything. In order to achieve exploration, some noise must be added to the selection of actions. The penalty incurred by exploration is that the system will usually be testing actions that yield less reward than the current best estimate. Ideally, as estimates of the expected reward become more fully sampled, the system should shift from exploring to exploiting what it has learned. Making an intelligent transition from exploration to exploitation, and further, avoiding exploration that kills the system, is a major difficulty for the approach. In this regard, the importance of having initial biases, or 'innate' behaviors, has been recognized as beneficial to learning. It may also be useful to have a parent to foster safe exploration. Recent studies speculate that the dopamine system in the basal ganglia may be involved in some form of reinforcement learning. Failure of these dopamine systems in Parkinson's patients is known to impair motor control, the coordination of simultaneous movements, and the selection of appropriate actions (Jackson and Houghton, 1995). At the synaptic level, hebbian learning mechanisms appear to be modulated by whether or not dopamine is present at the synaptic sites of cortical input to spiny neurons in the basal ganglia. When dopamine is present, long-term potentiation increases the efficacy of synapses for which the pre- and post-synaptic neurons are active. However, in the absence of dopamine, similar coactivity can have the opposite effect and cause long-term depression (Wickens and Kotter, 1995). 
In behavior, the release of dopamine in the basal ganglia is increased in relation to rewards and decreased for aversive stimuli. Neurons in the mid-brain are known to mediate the release of dopamine at basal ganglia sites. These cells have been shown to respond to rewards, and further, to have anticipatory responses that predict upcoming rewards (Ljungberg et al., 1992). Further, electrical stimulation of dopamine neurons in the mid-brain can be used to condition animals in place of rewards. Finally, at the anatomical level, the basal ganglia appear to hold a special place in integrating information from across the cortex. The spiny neurons of the basal ganglia receive inputs from almost every cortical area (Houk, 1995). This convergence, in conjunction with reward signals from dopamine neurons, may enable these cells to predict the value of different movement commands. Spiny neurons exhibit a rough organization into areas programming movements for different effector systems (Alexander et al., 1986). Further, they connect to motor structures such as the superior colliculus through the pallidum. They also pass information back to the frontal cortex via the pallidum and thalamus. In short, though speculative, much evidence supports a role for reinforcement learning in the brain. \section{Discussion} Reinforcement learning may be combined with supervised techniques to solve the navigation problem discussed in the introduction. Reinforcement learning is much slower than supervised training. Therefore, if any constraints can be imposed on the component behaviors of the system through supervised signals, the speed of learning may improve. For each of the component behaviors in the navigation problem, target chasing and obstacle avoidance, the information required to guide movement and gaze is well defined. Target chasing must have an estimate of the location of the target. 
Obstacle avoidance must represent the obstacle locations, perhaps with a map of objects in its near space. A recurrent neural network can be trained via supervised learning to encode this information in each of these modules. The part of the problem that is difficult to solve through supervised learning is how to generate and combine commands for movement and gaze control. Reinforcement learning can be used for this part of the problem. Each module makes connections to a map of Q-values for movement direction and for gaze direction. A positive reward is given to the robot each time it catches a target, and a negative reward each time it hits an obstacle. Solutions to these types of problems should improve our understanding of how neural systems combine different behaviors in the brain. \section{References} { \tiny Alexander GE, DeLong MR, Strick PL (1986). Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Ann Rev Neurosci, 9:357-81. \\ Allport A (1987). Selection for action: some behavioral and neurophysiological considerations of attention and action. In: Perspectives on perception and action; Heuer H, Sanders AF (Eds.), Hillsdale, NJ: Lawrence Erlbaum Assoc. \\ Arbib MA (1991). Neural mechanisms of visuomotor coordination: the evolution of rana computatrix. In: Visual structures and integrated functions. Arbib MA, Ewert JP (Eds.), New York: Springer-Verlag. \\ Basso MA, Wurtz RH (1997). Modulation of neuronal activity by target uncertainty. Nature, 389:66-69.\\ Barrett CN (1993). The amendment of large-magnitude aiming-movement errors. Psych Research, 55:148-155.\\ Beom HR, Cho HS (1995). A sensor-based navigation for a mobile robot using fuzzy logic and reinforcement learning. IEEE Trans Sys Man Cyber, 25:464-477.\\ Brooks RA (1986). A robust layered control system for a mobile robot. IEEE J Robotics and Automation, 1:14-23.\\ Brooks RA (1991). New approaches to robotics. Science, 253:1227-1232. \\ Castiello U (1996). 
Grasping a fruit: selection for action. J Exp Psych: HPP, 22(3):582-603. \\ Cobas A, Arbib MA (1992). Prey-catching and predator-avoidance in frog and toad: defining the schemas. J Theo Bio, 152:271-304. \\ Connor CE, Preddie DC, Gallant JL, Van Essen DC (1997). Spatial attention effects in macaque area V4. J Neurosci, 17:3201-3214. \\ De Jong R (1995). Perception-action coupling and S-R compatibility. Acta Psychologica, 90:287-299.\\ Desimone R (1996). Neural mechanisms for visual memory and their role in attention. Proc Natl Acad Sci, 93:13494-13499. \\ Duncan J (1995). Intelligence and the frontal lobes. In: The cognitive neurosciences. Gazzaniga MS (Ed.), Cambridge, MA: MIT Press. \\ Fuster JM (1995). Temporal processing. Ann New York Acad Sci, 769:173-181. \\ Glimcher PW, Sparks DL (1992). Movement selection in advance of action in the superior colliculus. Nature, 355:542-545. \\ Godefroy O, Lhullier C, Rousseaux M (1996). Non-spatial attention disorders in patients with frontal or posterior brain damage. Brain, 119(1):191-202. \\ Goodale MA (1996). Visuomotor modules in the vertebrate brain. Can J Physiol, 74:390-400. \\ Greenwald AG, Shulman HG (1973). On doing two things at once: elimination of the psychological refractory period effect. J of Exp Psych, 101(1):70-76. \\ Heide W, Blankenburg M, Zimmermann E, Kompf D (1995). Cortical control of double-step saccades: implications for spatial orientation. Ann Neurol, 38:739-48. \\ Hanes DP, Patterson WF, Schall JD (1998). Role of frontal eye fields in countermanding saccades: visual, movement, and fixation activity. J Neurophysiol, 79:817-834. \\ Houk JC (1995). Information processing in modular circuits linking basal ganglia and cerebral cortex. In: Models of information processing in the basal ganglia. Houk JC, Davis JL, Beiser DG (Eds.), Cambridge, MA: MIT Press.\\ Ingle DJ (1973). Two visual systems in the frog. Science, 181:1053-1055.\\ Ingle DJ (1980). 
Some effects of pretectum lesions in the frog's detection of stationary objects. Behav Br Res, 1:139-163.\\ Ingle DJ (1991). Functions of subcortical visual system in vertebrates and the evolution of higher visual mechanisms. In: Vision and visual dysfunction (Vol 2). Evolution of eye and visual system. Gregory RL, Cronly-Dillon J (Eds.), London: Macmillan.\\ Jackson S, Houghton G (1995). Sensorimotor selection and the basal ganglia: a neural network model. In: Models of information processing in the basal ganglia. Houk JC, Davis JL, Beiser DG (Eds.), Cambridge, MA: MIT Press.\\ Kaelbling L (1992). An adaptable mobile robot. In: Towards a practice of autonomous systems. Varela FJ, Bourgine P (Eds.), Cambridge, MA: MIT Press. \\ Kalaska JF (1996). Parietal cortex area 5 and visuomotor behavior. Can J Physiol, 74:483-498. \\ Koch C, Ullman S (1985). Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobio, 4:219-227. \\ Kopecz K, Schoner G (1995). Saccadic motor planning by integrating visual information and pre-information on neural dynamic fields. Bio Cyb, 73:49-60. \\ Luck SJ, Chelazzi L, Hillyard SA, Desimone R (1997). Neural mechanisms of spatial selective attention in areas V1, V2, and V4 of macaque visual cortex. J Neurophysiol, 77:24-42. \\ Ljungberg T, Apicella P, Schultz W (1992). Responses of monkey dopamine neurons during learning of behavioral reactions. J Neurophysiol, 67:145-163. \\ Maes P (1992). Learning behavior networks from experience. In: Towards a practice of autonomous systems. Varela FJ, Bourgine P (Eds.), Cambridge, MA: MIT Press. \\ Neumann O (1987). Beyond capacity: a functional view of attention. In: Perspectives on perception and action; Heuer H, Sanders AF (Eds.), Hillsdale, NJ: Lawrence Erlbaum Assoc. \\ Norman DA, Shallice T (1985). Attention to action: willed and automatic control of behavior. In: Consciousness and self-regulation (Vol 4); Davidson RJ, Schwartz GE, Shapiro D (Eds.), New York: Plenum. 
\\ Ottes FP, Van Gisbergen J, Eggermont (1984). Metric of saccade responses to visual double stimuli: two different modes. Vision Res, 24(10):1169-1179. \\ Ottes FP, Van Gisbergen J, Eggermont (1985). Latency dependence of colour-based target vs nontarget discrimination by the saccadic system. Vision Res, 25:849-862. \\ Pashler H (1994). Dual-task interference in simple tasks: data and theory. Psychological Bulletin, 116(2):220-244. \\ Pashler H, Carrier M, Hoffman J (1993). Saccadic eye movements and dual-task interference. Quart J Exp Psych, 46A(1):51-82.\\ Payton DW, Rosenblatt JK, Keirsey DM (1990). Plan guided reaction. IEEE Trans Sys Man Cyber, 20(6):1370-1382. \\ Posner MI, Petersen SE (1990). The attention system of the human brain. Ann Rev Neurosci, 13:25-42. \\ Requin J, Riehle A (1995). Neural correlates of partial transmission of sensorimotor information in the cerebral cortex. Acta Psych, 90:81-95. \\ Rohrbaugh JW (1984). The orienting reflex: performance and central nervous system manifestations. In: Varieties of attention. Parasuraman R, Davies DR (Eds.), New York: Academic. \\ Rosenblatt J, Thorpe C (1995). Combining multiple goals in a behavior-based architecture. In: Proc 1995 Int Conf Int Robots Sys, Pittsburgh, PA. \\ Sakata H, Kusunoki M (1992). Organization of space perception: neural representation of three-dimensional space in the posterior parietal cortex. Curr Op Neurobio, 2:170-174. \\ Salinas E, Abbott LF (1998). Invariant visual responses from attentional gain fields. J Neurophysiol, 77:3267-3272. \\ Salinas E, Romo R (1998). Conversion of sensory signals into motor commands in primary motor cortex. J Neurosci, 18(1):499-511. \\ Schneider W, Shiffrin RM (1977). Controlled and automatic human information processing: detection, search, and attention. Psych Rev, 84:1-66.\\ Schall JD, Hanes DP (1993). Neural basis of saccade target selection in frontal eye field during visual search. Nature, 366:467-469. \\ Shadlen MN, Newsome WT (1996). 
Motion perception: seeing and deciding. Proc Natl Acad Sci, 93:628-633. \\ Sutton RS, Barto AG (1998). Reinforcement learning: an introduction. Cambridge, MA: MIT Press. \\ Treisman A, Gelade G (1980). A feature-integration theory of attention. Cog Psychol, 12:97-136. \\ Treisman A, Schmidt H (1982). Illusory conjunctions in the perception of objects. Cog Psychol, 14:107-141. \\ Wickens J, Kotter R (1995). Cellular models of reinforcement. In: Models of information processing in the basal ganglia. Houk JC, Davis JL, Beiser DG (Eds.), Cambridge, MA: MIT Press.\\ Wise SP, Boussaoud D, Johnson PB, Caminiti R (1997). Premotor and parietal cortex: corticocortical connectivity and combinatorial computations. Ann Rev Neurosci, 20:25-42. \\ } \end{document}