\documentstyle[12pt,titlepage]{article} \begin{document} \baselineskip = 0.3in \title{ Combining Behaviors } \author{Jude Mitchell } \date{Third Year IP \\ UCSD Cognitive Science Dept.\\ June 5, 1998 \\ \vspace{8ex} \noindent Committee: \\ Prof. David Zipser \\ Prof. Javier Movellan \\ Prof. Marty Sereno } \maketitle \section{Introduction} This paper reviews the problem of combining movement commands from different behaviors. New approaches to robotics construct systems that operate by combining commands from separate behavioral modules. Each module delivers commands that run a simple behavior such as obstacle avoidance or road following. This work is inspired by the ability of insects and amphibians to exhibit highly adaptive behavior in navigating their environments with a fairly limited behavioral repertoire. A key problem encountered in this approach is that different behaviors can attempt to execute incompatible movements at the same time. For example, the movement generated for scratching an itch is not compatible with drinking a cup of coffee at the same time. In most cases, averaging the movements from different behaviors yields undesirable results. Instead, control between the behaviors must be coordinated in time. The first section of this paper reviews the extent to which humans are able to coordinate two tasks at the same time. Interference between tasks depends largely on the demands that they place on controlled cognitive processes. In the second section, schema theory is introduced as a behavioral and biological theory of how to coordinate concurrent behaviors. In the third section, the approaches from robotics are reviewed; recent robotics architectures are similar to those proposed by schema theory. In the final section, reinforcement learning is briefly reviewed. It provides a method for optimizing control of movement directly from rewards in the environment, and it can be used to learn how to combine commands generated from separate behaviors.
The problem of combining commands from separate behaviors can be illustrated by a simple navigation problem. Suppose that a robot has two tasks: chasing a moving target and avoiding obstacles. A behavior for chasing a target will often command movement directions opposite to those commanded by the behavior avoiding obstacles. For example, if an obstacle lies between the robot and the target, the two commands will be exactly opposite. In such cases, averaging traps the robot in a local minimum where it never moves. Besides this problem, there is a second incompatibility that occurs at the level of the sensory input. This conflict emerges because the field of view of the visual sensor is limited. Since the two behaviors need to know where different objects are located in order to plan their respective movements, there can be a conflict over where the visual sensor should be focused. In part, this conflict lies in the control of gaze, but it also has interesting consequences through the limits it imposes on visual information. Since neither behavior can be guaranteed control of gaze, each must be robust to periods in which visual input is absent or irrelevant to it. This means that each must have a working memory for the locations of the objects relevant to its control. Further, since knowing when to take control of gaze may hinge on whether or not the information in working memory is accurate, each behavior must also estimate the reliability of its stored information. This example, and many others in which the visual sensor is active rather than passive, raises interesting control problems. \section{Dual-Task Performance} Even when no physical limitations are imposed on sensory input or motor output, cognitive limitations in processing behaviors simultaneously still occur, as evidenced by the psychological refractory period (PRP). The PRP is a slowing in the reaction time to a stimulus in one task when another task is performed concurrently (for review see Pashler, 1994).
In a typical experiment, two simple stimulus-to-response mappings are performed together. Subjects attempt to respond to each stimulus as quickly as possible without errors. Interference between the tasks is probed by varying the interval between the presentation of the first and second stimulus, called the stimulus onset asynchrony (SOA). If there is no interference, then reaction times for both tasks remain unchanged as the SOA interval shortens. If, however, they interfere, then the reaction times slow at shorter SOA intervals. A large variety of tasks display some degree of interference. Interference persists even when physical limitations are prevented by isolating responses to different effectors and different sensory modalities. Several PRP studies indicate that bottlenecks occur in processing that force one task to wait for another. Bottlenecks are revealed when the order of the two tasks is fixed and subjects are instructed that task 1 should have higher priority. This results in task 1 reaction times remaining unaffected while task 2 reaction times increase as the SOA interval becomes shorter. When task 2 reaction times are plotted as a function of the SOA, they have a slope near -1 at short intervals, reflecting that each reduction in the interval between the tasks adds an equal delay to task 2. These findings are consistent with task 1 locking task 2 out of a stage in the processing from stimulus to response. The response selection bottleneck (RSB) hypothesis places the processing limit at the stage where stimuli are mapped to responses (Pashler, 1994). Three stages of processing are assumed: stimulus identification, response selection, and response execution. Two tasks can run in parallel at the identification and execution levels, but must take turns at the selection level. Manipulating the difficulty of different stages of processing reveals that the bottleneck lies at the selection level.
First, when either the identification or selection stage of task 1 is lengthened, it produces extra delay in task 2. In contrast, lengthening the execution of task 1 has no effect. This suggests that the bottleneck is imposed before response execution. Another manipulation shows that if the identification stage of task 2 is prolonged, it leads to less delay for task 2. This finding is explained by the task 2 selection stage being pushed back so that it overlaps less with task 1 selection. In short, task 2 must wait for task 1 to complete at the selection stage. Many variations of the PRP experiments support this bottleneck (Pashler, 1994). Some tasks can avoid bottlenecks when their mappings from stimulus to response are particularly natural. These types of mappings are called {\it ideomotor compatible}. An early experiment by Greenwald and Shulman (1973) demonstrates that two ideomotor tasks do not delay each other when performed together. In the first task, a flashed arrow directs the subject to make a left or right movement. In the second task, the subject repeats an aurally prompted letter 'A' or 'B'. Interestingly, if the arrow in the visual-manual task is replaced by the word 'left' or 'right', then normal refractory periods return. Thus even though the stimulus is compatible with the response movement, it is not ideomotor. In general, ideomotor stimuli have physical characteristics that prompt the desired response. Some basic movements to visual targets cause no delays in processing. In double-step reaching experiments there is no delay for reprogramming a reach in progress when the target is moved to a different location (De Jong, 1995; Barrett, 1993). Two reaction times are measured during the reach: the first is the time to initiate the original movement, and the second is the time to change direction after the target has moved. The SOA interval is given by the time between the initial target appearance and its movement.
If programming the initial reach caused a bottleneck, then the second reaction time should be slower at short SOA intervals. Instead, reaction times remain unaffected. Further, the time needed to start a movement and the time needed to adjust it are very similar. Another experiment shows that certain saccadic movements avoid delays in processing (Pashler et al, 1993). In this experiment, a saccade task always follows a speeded manual response to a tone. Four variants of the saccade task are tested. In the first task, a saccade is made to a target that appears on the left or right of fixation. In the second task, a red and a green target appear on either side of fixation, and a saccade is made to the red one. Both of these tasks have negligible refractory period effects. Tasks in which the saccade direction is determined by the color of a single target at the fixation point (red means right, green means left) or by the larger of two adjacent digits have normal refractory periods. The delay is present for cues with symbolic relations to the target of the movement, but not for cues that are themselves the target of the movement. The distinction between automatic and controlled processes explains some of the differences between ideomotor tasks and those tasks that cause refractory periods. Controlled processes can map stimuli to arbitrary responses, but have limited capacity. Automatic processes avoid capacity limits, but implement inflexible mappings. In PRP studies, the response selection stage requires controlled processing to map stimuli to the relatively arbitrary responses specified in the task instructions. Due to the limited capacity of controlled processing, when two tasks both require the selection of novel responses, one must wait for the other. Tasks which are well practiced or common in natural experience can become automatic (Schneider and Shiffrin, 1977). Automatic processing avoids bottlenecks through direct mappings from stimuli to responses.
Automatic processes are thought to be inflexible and below the level of deliberate control. Two aspects of dual-task performance are obscured by this definition of automatic processing. First, although practice can reduce the magnitude of interference between tasks, it rarely abolishes it (Pashler, 1994). Thus some interference, whether from the response selection bottleneck or otherwise, remains even between automatic behaviors. Second, although automatic behaviors may be less flexible, they can still show remarkable coordination without deliberate control. For example, in normal conditions an itch may cause a fairly reflexive response to move the hand and scratch it. This response is suppressed or replaced by another movement when the hand happens to be holding a cup of coffee. These types of conflicts are widespread in everyday situations, and yet they seem to require little effort to detect or resolve. In short, even when controlled processing is left aside, a great deal remains to be explained about the flexibility achieved when automatic behaviors are in conflict. \section{Schema Theory and Competitive Processing} \subsection{Schema Theory} Norman and Shallice (1986) use schema theory to explain how automatic behaviors can be coordinated without deliberate control. Their model consists of schemas that plan actions in parallel. Here schemas are automatic behaviors that map perceptual routines to motor responses. Although each schema may be fairly inflexible, adaptive behavior can emerge from their interactions. A process they call contention scheduling determines which schema gains control of action through a competition between schemas. This competition resembles a winner-take-all network in which each node corresponds to a schema. The activity of each schema is determined by its relevance given current sensory cues and the context (here context may refer to internal goals or states as well as the external world).
Schemas executing compatible behaviors excite each other, while those executing incompatible behaviors inhibit each other. Competitive interactions between two schemas can arise at any stage of processing where they control the same effectors or access the same sensory or cognitive resources. If two schemas run along separable pathways, then no delays occur. However, if two are incompatible at any point in processing, then one yields control to the other and delays that stage of processing. The second part of their theory postulates that controlled processes are responsible for novel mappings, decision making, planning, and the inhibition of automatic responses. A supervisory system is proposed to implement controlled processing by adding extra excitation or inhibition to competing schemas. {\it Attention} is defined to be the modulatory influence of the supervisory system upon the schemas. It acts on a slow time scale relative to the schemas. Therefore, it does not control the precise timing of movements, but instead selects the schema executing the movements. Further, attention does not have absolute control over which schemas are active. If a schema detects conditions for which it has high relevance, its activity can rise enough to seize control of the system and redirect attention. For example, in navigation a schema for walking can take control of gaze to look at a pothole in a sidewalk that is detected in the periphery of the visual field. This is one example of an orienting response (for review see Rohrbaugh, 1994). \subsection{Modeling Frog Behavior} Schema theory has been used to model how separate visuomotor behaviors are combined in the frog (Cobas and Arbib, 1992; Arbib, 1991). When frogs detect small fly-like objects in their visual field, they turn and move towards them. In opposite fashion, frogs turn and move away from large looming stimuli that resemble predators. If two flies appear, the frog typically selects one of them to pursue.
Likewise, it typically flees one of two predators, not their average. Physiological evidence shows that separate classes of retinal ganglion cells are sensitive to prey and predator stimuli (Ingle, 1991). The two classes of retinal cells project to different layers of the tectum. Cobas and Arbib (1992) model the behavior of the frog with separate visuomotor schemas: one controls prey-catching and the other controls predator-avoidance. The visual input for prey-catching is a one-dimensional array of units. Each unit responds selectively to stimuli that match its location in the visual map. Recurrent connections between units excite near locations and inhibit far locations. This recurrent structure forces units at different locations to compete for activity. When more than one fly appears in the visual field, one location captures all the activity in the map. Movements are then planned toward that target. Along the other pathway, visual inputs for predators feed into a second map with similar winner-take-all dynamics. This map selects a single predator and then plans movements away from it. Movement commands from the visuomotor schemas are combined in a motor map that represents the desired heading for the frog. Each visual map location representing a prey or predator makes a one-to-one connection to the corresponding motor map location. Connections from the predator map are inhibitory while connections from the prey-catching map are excitatory. A higher priority is given to fleeing from predators by making the inhibitory connections larger. Winner-take-all dynamics refine activity in the motor map so that the location with peak activity is selected. This direction determines where the frog moves. If the commands of the schemas are incompatible (a prey and a predator appear at the same location), then the predator-avoidance command dominates due to its higher priority.
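This combination rule can be sketched in a few lines of Python. The sketch is illustrative only, not Cobas and Arbib's actual equations: the map size and weights are arbitrary, and a simple argmax stands in for the recurrent winner-take-all dynamics.

```python
def select_heading(prey_map, predator_map, w_prey=1.0, w_pred=2.0):
    """Combine schema commands in a motor map and pick the winning heading.

    prey_map and predator_map are 1-D activity arrays over candidate headings.
    Prey input is excitatory, predator input inhibitory, and the inhibitory
    weight is larger (w_pred > w_prey) to give predator-avoidance priority.
    The argmax stands in for the winner-take-all refinement of the motor map.
    """
    motor_map = [w_prey * p - w_pred * q
                 for p, q in zip(prey_map, predator_map)]
    return motor_map.index(max(motor_map))

# Eight candidate headings; a fly and a predator both appear at heading 2.
prey = [0.0] * 8; prey[2] = 1.0
pred = [0.0] * 8; pred[2] = 1.0
# The stronger inhibition suppresses the shared location, so the winning
# heading moves the frog away from the predator rather than toward the fly.
heading = select_heading(prey, pred)
```

With no predator input, the same function simply returns the prey location, so the schemas cooperate when they agree and the priority scheme only matters when they conflict.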
If there are no predators, or two directions are equally good for avoiding the predator, then the prey-catching schema selects the direction to move. These dynamics enable the schemas to cooperate when they agree, and compete when they are incompatible. This model provides a biologically feasible example of a contention scheduling process between two schemas. Taking some liberties, the frog model can illustrate how supervisory control acts in a biological model. Suppose that frogs understand the instructions of a psychology experiment, and are instructed not to flee from predators when a secondary cue is delivered. The supervisory system is responsible for detecting the secondary cue, and then suppressing the habitual response to flee. Further, the supervisory process is expected to take longer because it accesses memory of the experiment instructions. Due to the lag in supervisory control, the initial activity in the predator and motor maps matches the automatic response to flee. After a delay, the supervisory decision not to flee begins to suppress activity in the predator and motor maps. The main prediction is that supervisory processes do not impose a bottleneck on processing, but instead modify existing automatic responses. \subsection{Schema Theory in Humans and Monkeys} Prefrontal lesions cause deficits specific to the planning and selection of novel action, the functions attributed to the supervisory system (for review see Duncan, 1994). Patients retain competence in ``crystallized'' skills acquired prior to the lesion. For example, many of them can still score well on WAIS intelligence tests, which emphasize factual knowledge. In contrast, performance on novel tests that require planning or reasoning is severely impaired. In problems that consist of a series of steps, subjects often fail to proceed unless they are prompted with appropriate sub-goals. Also, subjects fail to switch to different behavioral sets.
For example, they have difficulty adopting new sorting strategies in the Wisconsin card sorting task. Even very simple tasks are impaired. Subjects performing a sequential delayed saccade task saccade to targets in the wrong order (Heide et al, 1995). Subjects also fail to suppress planned saccades when a 'don't go' signal is given (Godefroy, 1996). Somewhat similar deficits appear in monkeys. Monkeys with prefrontal lesions have trouble learning tasks that require delayed or sequential responses (Fuster, 1994). In short, frontal lesions impair behaviors that require control of habitual responses, or the selection of novel ones. In monkeys, recent physiological experiments show that the dynamics of selecting motor responses do not reflect bottlenecks in processing, but instead that a slow supervisory process modifies automatic plans. In several motor areas, activity among neurons forms quickly to program an automatic response to a visual stimulus. If the task requires suppression or reprogramming of the response, then the initial activity is changed after some latency to match the correct response. Kalaska (1996) has identified these changes during go/no go reaching tasks in premotor cortex. In these tasks, the color of a visual target cues whether or not the monkey should reach to it. Recordings are made during a delay period prior to the reach from the neurons that are normally active for a movement to the target. These cells have an initial burst that persists over the delay period on 'go' trials. On 'no go' trials the initial burst is suppressed below the cells' normal threshold. A second task likewise shows that when another cue directs the monkey to reach away from the target, the initial activity reflects the target location but is eventually modified to code for a movement in the opposite direction. Other studies have documented similar responses in primary motor cortex (Requin and Riehle, 1995) and in the frontal eye fields (Hanes et al, 1998).
These findings indicate that no bottleneck delays the programming of automatic responses to targets. Controlled processes act on a slower time scale by altering the initial programs. In humans, eye movements to visual targets reflect a winner-take-all type process in which a single alternative is selected, not the average of alternatives. Competition between alternative actions is important in schema theory for preventing incompatible movements from being combined. These studies consider what happens when a subject is expecting to make a speeded saccade to a single target, but instead two targets appear (Ottes et al, 1984). This task does not allow time to deliberate on the response, and thus should reflect an automatic response. The movements produced by subjects usually selected a single target, not an average of the two. This result depends critically on the spatial separation between the targets and the amount of time before the saccade is initiated. If the two targets appear in opposite hemifields, or they are separated by more than 30 visual degrees, then a single target is selected. Targets closer than 30 degrees are averaged. A follow-up study considered what happens if a delay period (300 ms) precedes the saccade (Ottes et al, 1985). With more time, subjects discriminate between nearby targets. Physiological models of saccade target selection predict the dynamics observed in human subjects. At the physiological level, alternative saccade movements are represented by different locations in motor maps. This type of population code is observed in several brain areas programming movements. When two targets are present in a saccade task, activity in these maps initially appears at both locations. Over time, it is refined so that one location is significantly more active than any other (Glimcher and Sparks, 1992). The dynamics of this target selection are modeled by a winner-take-all network similar to Cobas and Arbib's frog model (1992).
When two targets are far apart, the model dynamics produce activity at either of the two locations but not both. When they are close, the dynamics produce activity at an average between the two (Kopecz and Schoner, 1995). These results match saccade behavior in humans at short latencies. Averaging for near targets can be prevented if the magnitudes of their visual inputs are slightly modulated to favor one over the other. Thus one way to achieve the enhanced spatial discrimination observed in humans at longer latencies is to add a decision process to the visual inputs that slightly favors one target over another. Recent studies have found cells with activity that reflects the dynamics of a decision process in target selection. Shadlen and Newsome (1995) describe a class of cells in the lateral intraparietal area (LIP) that predict upcoming movements and are also modulated by the certainty of their preferred movement being the right choice. In their experiment, a monkey is presented with two alternative saccade targets on either side of the fixation point. The correct target is cued by a window of moving dots above the fixation point. The monkey must discriminate in which direction the dots are moving and saccade in that direction. The percentage of dots moving in the same direction, called the {\it motion coherence}, is manipulated to adjust the difficulty of the discrimination. Three classes of response are identified among LIP cells. Movement cell activity varies only as a function of the actual saccade performed. Visual cell activity varies only as a function of the direction of motion and its coherence. A third class of cells is tuned both to the motion stimulus and the selected movement. The activity of this third class reflects the dynamics of a decision process. When the stimulus first appears, initial activity among these cells does not discriminate which saccade will be chosen.
As observation of the stimulus continues, the differences in their activities grow such that one target is clearly selected over the other. The magnitude of these differences does not reflect a pure motor response, but also depends upon the coherence of the stimulus. Differences are small for 0\% coherence and large for 100\% coherence. Thus their activity reflects in some measure the certainty of a chosen response being correct, or the probability of the monkey getting a reward for making the movement. Similar cells are also identified in the frontal eye fields (Schall and Hanes, 1993) and the superior colliculus (Basso and Wurtz, 1997). They also appear in primary motor cortex among reaching cells (Salinas and Romo, 1998). At sensory levels, the dynamics of pre-attentive (automatic) feature selection bear many similarities to the competitive processes proposed for movement selection. Experiments in visual search suggest that the identification of a target in a cluttered visual field can occur quickly when the target has a unique visual feature (color, for example). In fact, the amount of time needed to find the target remains relatively constant regardless of the number of other objects (distractors) in the visual field (Treisman, 1980). This suggests that a search process can check for the target at multiple visual locations simultaneously. Competitive processing in a winner-take-all network has been used to model parallel search (Koch and Ullman, 1985). The network consists of several different feature maps. Each feature map is recurrently connected to itself, and to the other maps. Visual locations in the feature maps compete for activity. When a feature is unique in a map, activity is enhanced at its visual location due to lack of competition from surrounding neighbors. This results in a 'pop-out' effect in which unique features are labeled.
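The pop-out effect can be sketched in Python. The divisive form of the within-map competition below is an assumption chosen for simplicity, not Koch and Ullman's actual formulation; the maps and activity values are hypothetical.

```python
def conspicuity(feature_map):
    """Within-map competition: each location's activity is divided by the
    total activity elsewhere in the same map. A feature that is unique in
    its map faces no competition and scores highest, while a feature shared
    by many locations is suppressed by its neighbors."""
    total = sum(feature_map)
    return [v / (1.0 + (total - v)) for v in feature_map]

def pop_out(feature_maps):
    """Sum the conspicuity scores across feature maps into a single saliency
    map; its peak is the location selected by winner-take-all."""
    scored = [conspicuity(m) for m in feature_maps]
    combined = [sum(vals) for vals in zip(*scored)]
    return combined.index(max(combined))

# Five items: one red target among four green distractors. The 'red' map
# has a single active location, so that location wins the competition.
red   = [0.0, 0.0, 1.0, 0.0, 0.0]
green = [1.0, 1.0, 0.0, 1.0, 1.0]
target = pop_out([red, green])
```

Note that the selection time in this scheme does not depend on the number of distractors, matching the flat search times reported for unique-feature targets.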
Further, if the target of search possesses a unique feature, the corresponding feature map can be given higher priority in search by a multiplicative modulation of its activity. This results in that visual location also being enhanced across the other feature maps. At the physiological level, sensory neurons in almost every cortical area exhibit an increase in baseline firing rate when visual attention is directed to their locations (Luck et al, 1997). Recent studies in visual area V4 indicate this increase in baseline activity is the result of a multiplicative modulation, an attentional gain field (Salinas and Abbott, 1997; Connor et al, 1997). In behavior, this change in baseline firing rate is associated with faster reaction times and lower detection thresholds for stimuli at the attended area (Posner and Petersen, 1990). When targets are defined by novel combinations of features, a supervisory process must guide the search for them by modulating the competition between visual locations. A serial search process is indicated in the data by search times that increase linearly as distractors are added to the visual field (Treisman, 1980). The factor preventing parallel search is the need to recognize whether a conjunction of features matches the target. It is proposed that target recognition is only possible at the visual location that is highlighted by attention. To find targets then, attention must move serially to each candidate until the target combination is found. Koch and Ullman (1985) model this search process through modulation of the winner-take-all map by a supervisory process. In this case, initial parallel competition within feature maps selects a potential target location. Then a recognition process checks whether the features at that location match the target. If the target is not found, the recognition process inhibits that location, thus forcing the winner-take-all network to shift activity to another candidate.
Assuming that the inhibition has some decay time, the highlighted area moves to new locations without backtracking. Some physiological evidence suggests that cells in visual areas V4, IT, and ventral prefrontal cortex are involved in recognizing selected objects (Desimone, 1996). Many of these cells are tuned preferentially to complex visual features and pictures. Given an identical picture inside their receptive fields, the response of these cells is enhanced if the picture matches the target of a memory-guided search. These cells may signal whether or not a desired target has been identified in the spotlight. A broader perspective on attention stresses its importance for linking sensory parameters to the areas that plan behavior (Neumann, 1994; Allport, 1994). In this theory, the key limitation in dual-task performance results from cross-talk between different channels of information from sensory input to motor output. The enhanced activity observed at attended locations is considered the outcome of competition between behaviors to get sensory parameters from different visual locations. Focusing attention on a single visual location enables the sensory features at that location to be accessed by a schema without interference from surrounding locations. Allport illustrates this idea with an example in which an observer plans a movement to a banana while there is an apple nearby. Control of the grip during the reach must fit the banana, not the apple. Allport argues that attention prevents the sensory parameters of the apple from bleeding into the control of the reach to the banana. This example has been tested in humans (Castiello, 1996). In normal conditions, subjects have no trouble reaching to the banana even when the apple is close to it. If, however, attention is divided between the two of them by the addition of a secondary task involving the apple, then the grip parameters of the apple bleed into the movement.
This experiment is somewhat analogous to studies of illusory conjunctions (Treisman, 1984). These experiments found that when attention was divided between objects in a flashed display, subjects would sometimes attribute a feature such as color to the wrong object. If attention was cued to a single visual location before the display was flashed, then illusory conjunctions did not occur. In short, there appears to be a restriction on the amount of sensory information that can be channeled from sensory input into the control of behavior without interference. Attention may provide a flexible way of linking arbitrary visual locations and objects to the schemas that use that information to guide behavior. \subsection{Parallel Visuomotor Pathways} Interference between concurrent behaviors can be reduced through independent, and possibly redundant, channels of processing from sensory input to motor control. Goodale (1994) emphasizes that the evolutionary origin of vision is not to construct world models, but to guide movements in a fast and efficient manner. He suggests that the dorsal visual pathways from striate to parietal and premotor cortex are not focused on the representation of 'where' things are located, but rather on 'how' to guide movements to them. This distinction is supported by the organization of visual and movement areas in the dorsal stream. In both parietal and premotor cortex, separate sub-fields specialize in planning eye, head, arm, and grasping movements. Within each sub-field, the sensory parameters that are useful in guiding the movements of that part of the body are represented (Wise et al, 1997). For example, sensory neurons in the areas controlling hand movements (parietal area 5) are tuned to the properties of object shape and size, which are crucial for controlling the grip of the hand (Sakata and Kusunoki, 1992).
In humans, lesions including this area impair the ability to coordinate the hand grip in relation to an object, but not to recognize the object or to describe its shape and size. Conversely, Goodale (1991) has reported a patient with a lesion to the ventral stream who can no longer identify or describe an object's shape but can still reach to it with an accurate grip. A second example of unique sensory parameters being stored in movement areas for specific body parts is given by how object locations are represented. Each sub-field maintains a separate representation of object locations that is relative to the specific body part it controls. There does not appear to be a central representation of the visual world. Instead, the visual world is divided into separate regions, each of which is anchored to a part of the body. This parallel organization may facilitate programming movements by reducing cross-talk between areas and providing more direct routes from sensation to action. In amphibians, there appear to be parallel pathways controlling different visuomotor behaviors (Goodale, 1994). This is demonstrated in the frog by re-wiring projections from its retina to the opposite side of the tectum during development (Ingle, 1973). Re-wiring causes prey-catching behavior to be reorganized. When a prey-like stimulus is presented on the left of a rewired frog, it responds by moving to the mirror-symmetric location on the right. Re-wiring also reorganizes the predator-avoidance behavior. Instead of jumping away from a large looming stimulus, rewired frogs will jump towards it. In other visual behaviors, however, re-wiring produces no changes. For example, if a tactile stimulus pats a rewired frog from behind, causing it to flee, it still jumps in the right direction to avoid barriers in front of it. Thus, the barrier avoidance behavior appears to use a separate visual pathway for planning movement.
In a follow-up study, it has been shown that re-wiring a pretectal nucleus causes a reversal that is specific to the barrier avoidance behavior (Ingle, 1980). In this case, the frog jumps into barriers, but its responses to prey and predators remain normal. Thus, at least these two behaviors appear to have different routes from input to output. \section{Behavior-Based Robotics} Some success has been achieved in the control of autonomous robots by combining sets of behavioral modules that run in parallel (Brooks, 1991). Classic approaches to robotics typically do not decompose the control of the robot into separate behaviors. Instead, control is decomposed into functional modules that process information from sensation to output in serial stages. A module for perception typically attempts to reconstruct from visual input a representation of what things are and where they are located. This representation is then used by another module that makes plans to reach the robot's objectives. Commands from planning are passed on to a motor module that executes them. Brooks (1986) explores an alternative that reduces the emphasis of classical approaches on constructing internal representations. Instead, control is decomposed into separate modules that guide simple behaviors. Each of the modules uses the sensory input to find features that are relevant to controlling its behavior. For example, a module for obstacle avoidance uses sonar sensors placed around the robot's body to steer away from nearby objects. Another module that steers the robot towards distant landmarks uses visual sensors to locate them. Each of the modules operates in parallel, detecting cues from sensory input and generating commands that move the robot. When the commands from several modules are combined, they can produce robots that are capable of navigating unfamiliar environments without collision. 
The early work in behavior-based robotics used a fixed priority scheme called {\it subsumption architecture} to combine commands from behavior modules (Brooks, 1986). In subsumption architecture, behaviors are organized into ascending levels of competence. The behavior at the lowest level executes its commands with no awareness of the other behaviors above it. However, even at the lowest level of competence the robot can execute actions that are meaningful for its survival. For example, a basic behavior such as obstacle avoidance can still function even in the absence of navigation goals by at least moving the robot out of the way of approaching vehicles. As higher-level behaviors are added, they impose extra constraints on the robot's behavior. They take as input both sensory information and the outputs from lower levels. When necessary, higher levels modify the output from lower levels and substitute their own commands. For example, a behavior that moves the robot towards landmarks can replace the movement suggested by obstacle avoidance with an alternative that still avoids the obstacle, but also moves closer to the landmark. Robots can be constructed with more sophisticated behavior by incrementally adding higher levels of competence. Two features of subsumption architecture are particularly undesirable. First, forcing sensory information to be processed along separate channels reduces opportunities to fuse redundant evidence in order to obtain more reliable perceptual estimates. The problem of combining different sources of evidence is called sensor fusion. Allowing some cross-talk between modules in order to share perceptual estimates may be beneficial without requiring the overhead of a complete world representation. The second problem relates to using a fixed priority scheme to combine movement commands. Replacing a lower-level command with one from a higher level can throw away useful information. 
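In its simplest form, the fixed-priority scheme can be sketched as follows. The module names, sensor format, and thresholds are invented for illustration (and the sketch omits the refinement in which higher levels read lower-level outputs); the point is only that the highest level with an opinion completely replaces everything below it:

```python
# Sketch of subsumption-style fixed-priority arbitration.
# Each level proposes a steering command (or None if it has no
# opinion); the highest level that speaks suppresses all lower ones.

def avoid_obstacles(sonar):
    """Level 0: turn away from the nearest close obstacle."""
    distance, bearing = min(sonar)          # (distance, bearing) pairs
    if distance < 1.0:                      # obstacle within 1 m
        return -bearing                     # steer away from it
    return None                             # no opinion

def seek_landmark(landmark_bearing):
    """Level 1: steer toward a distant landmark, if one is visible."""
    if landmark_bearing is not None:
        return landmark_bearing
    return None

def arbitrate(levels):
    """Return the command of the highest level with an opinion."""
    for command in reversed(levels):        # highest priority first
        if command is not None:
            return command
    return 0.0                              # default: go straight

sonar = [(0.5, 0.3), (2.0, -1.0)]           # nearby obstacle at bearing 0.3
levels = [avoid_obstacles(sonar), seek_landmark(0.8)]
print(arbitrate(levels))                    # landmark level wins: 0.8
```

Note that when the landmark level wins, the obstacle-avoidance command is discarded entirely, along with the obstacle information it encoded.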
One example of this is reported for a military vehicle that navigates on dirt roads (Payton et al., 1990). The lowest-level behavior in this system turns the vehicle to follow the road, while a higher level turns it away from obstacles. In cases where the road bends around an obstacle, the vehicle sometimes decides to turn off the road. This failure occurs because the avoidance module, which has control at the time, chooses how to turn with no regard to whether or not the vehicle stays on the road. Subsumption architecture solves these types of problems by letting the output from lower levels be used as input to the higher level. Thus, for this example, the obstacle avoidance behavior would have to bear the burden of changing what it does based on the lower road-following behavior. This solution undermines the benefits of a parallel architecture by forcing higher stages to wait for the products of lower stages. A more recent approach combines commands from behavior modules by having them vote for different movement alternatives (Rosenblatt and Thorpe, 1995). This approach has solved the road-following problem presented above. The architecture used here is very similar to the schema model presented earlier for controlling movement in the frog (Cobas and Arbib, 1992). Commands from separate behavior modules are integrated into a heading map to control the vehicle. Each element in the heading map represents a direction for the vehicle to turn. Behavior modules submit votes ranging from -1 to 1 for each heading alternative. In a winner-take-all fashion, the alternative receiving the largest sum of votes from the behavior modules is selected as the direction to move. If one behavior has a higher priority than another, its votes can be weighted more heavily in the voting process. Using this architecture, the vehicle no longer leaves the road in the presence of obstacles. 
This is because votes from the road-following module, though smaller in magnitude, can still tip the scales in favor of picking the right direction. The authors further suggest that it may be useful to vary the relative weights of the different behaviors in voting depending on the context. For example, one module in their system is designed to control turning when the vehicle is in danger of tipping over. Tipping over can be expected to be a greater hazard when traveling at higher speeds. Therefore, it makes sense for the tipping module to have greater priority in such contexts. An additional module called the mode manager is assigned the role of computing appropriate weights for each module given the context. Context may depend upon the internal goals or objectives of the system as well as sensory input. It is interesting that the authors arrive independently at a system that weights the relevance of different modules depending on the context. The magnitude of schema activation plays a similar role in schema theory. \section{Reinforcement Learning} Reinforcement learning can be used to solve the types of control problems encountered in robotics. Unlike supervised techniques, reinforcement learning does not require a teaching algorithm or examples of expert behavior. Instead, it uses a scalar reward from its environment to improve its behavior. In simple terms, it learns by exploring the world and estimating from experience which actions yield better rewards. Learning from reward signals is more ecologically valid and, for certain control problems, may be more practical when it is difficult to define optimal behavior or to design a teacher algorithm that solves the task. Unfortunately, like other optimization techniques, it can fall prey to local minima. This typically occurs when the system adopts a behavior that prevents it from exploring other actions or regions of the world that give better payoffs. 
Nonetheless, it provides a valuable alternative for problems in which the desired output is unknown. Furthermore, recent physiological evidence supports its biological feasibility. To introduce the reinforcement learning technique, a simple account of it is outlined here. The goal in reinforcement learning is to find the policy $\pi^{*}: s \rightarrow a$ that maps from sensory states, $s$, to actions, $a$, such that the average reward, $E\{r(s,a)\}$, received from the environment is maximized. To find the best policy, the system predicts the average reward it can expect for taking each action in a given sensory state. This estimate of reward is called the Q-value and is written as $Q(s,a)$. Several techniques related to Monte Carlo estimation and dynamic programming have been developed for estimating these Q-values (for review see Sutton and Barto, 1998). The simplest method is Monte Carlo sampling: the system wanders around the world attempting random actions and uses the rewards it gets back to improve its estimates. Once the estimates for the Q-values have converged, the optimal policy is given by choosing the action for each sensory state that has the highest expected reward, $\pi^{*}(s)=\arg\max_{a}Q(s,a)$. This outlines an approach in which the objective is only to maximize the reward for the current state of the robot, without any foresight as to whether or not current actions will lead to good states in the future. In general, the importance of future rewards is considered by instead optimizing the average discounted reward given by $E_{\pi}\{\sum_{k=0}^{\infty}\lambda^{k}r(s_{t+k},a_{t+k})\}$, where $\lambda$ is the discount factor that ranges from 0 to 1. This problem is more difficult to solve, but also more practical. For example, a robot may only receive reward signals in the environment when it reaches its destination or hits an obstacle, but nothing in between those events. 
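A minimal tabular sketch of this idea uses the one-step Q-learning update, one of the methods reviewed by Sutton and Barto (1998). The corridor world below is invented for illustration: reward arrives only at the goal state, yet the discounted update propagates its value back to earlier states.

```python
import random

# Tabular Q-learning on a tiny corridor: states 0..4, reward only at
# state 4. The discounted update target r + lambda * max_a' Q(s', a')
# lets the delayed reward propagate back, so earlier states learn
# which action leads toward the goal.
random.seed(0)

n_states, actions = 5, [-1, +1]             # move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, lam = 0.5, 0.9                       # learning rate, discount

for episode in range(200):
    s = 0
    while s != n_states - 1:
        a = random.choice(actions)          # pure exploration
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == n_states - 1 else 0.0
        target = r + lam * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

# Greedy policy: the action with the highest Q-value in each state.
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states)}
```

After training, the greedy policy moves right in every non-terminal state, even though only the final transition was ever rewarded.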
The lookahead of the discounted reward enables the robot to find paths to states that are rewarded. Variants of reinforcement learning have now been applied to adapting the behavior of autonomous robotic systems (Beom and Cho, 1995; Maes, 1992; Kaelbling, 1992). The estimation of average reward for action alternatives in Q-learning bears an interesting similarity to decision neurons in motor areas that respond more when their movement alternative is certain to be correct. As an example, reinforcement learning can be applied to learning to combine prey-catching and predator avoidance in the frog model. Again, a heading map is used to represent the frog's movement alternatives. In this case, though, the activation at each point in the heading map is an estimate of the expected reward for taking that action, $Q(s,a)$. A neural network can learn to map from retinal inputs into the heading map of Q-values. Training proceeds by trying actions randomly, observing the reward, and training the network to compute that reward for that alternative given its sensory input. A small positive reward may be given for turning the frog towards prey, and a large negative reward for turning towards predators. The magnitude of the rewards reflects the relative priorities of the two behaviors. After learning, the frog can be controlled by a winner-take-all network that selects the action with the highest activation in the heading map of Q-values. The Q-value units in the heading map are similar to the decision neurons discussed by Shadlen and Newsome (1996). These neurons have been found both to code for specific movement directions and to have activity that increases with the certainty that the movement will be rewarded. One of the key difficulties in applying reinforcement learning to robot control problems revolves around the trade-off between exploration and exploitation. 
In order to estimate expected rewards for the full range of actions and states, the system must explore its world and gather samples. If the space of state and action pairs is not properly sampled, then a learning system can easily fall into local minima in its behavior. For example, a robot may excel at predicting its reward, but only because it never does anything. In order to achieve exploration, some noise must be added to the selection of actions. The penalty incurred by exploration is that the system will usually be testing actions that yield less reward than the current best estimate. Ideally, as estimates of the expected reward become more fully sampled, the system should shift from exploring to exploiting what it has learned. Making an intelligent transition from exploration to exploitation, and further, avoiding exploration that kills the system, is a major difficulty for the approach. In this regard, the importance of having initial biases, or 'innate' behaviors, has been recognized as beneficial to learning. It may also be useful to have a parent to foster safe exploration. Recent studies speculate that the dopamine system in the basal ganglia may be involved in some form of reinforcement learning. Failure of these dopamine systems in Parkinson's patients is known to impair motor control, the coordination of simultaneous movements, and the selection of appropriate actions (Jackson and Houghton, 1995). At the synaptic level, hebbian learning mechanisms appear to be modulated by whether or not dopamine is present at the synaptic sites of cortical input to spiny neurons in the basal ganglia. When dopamine is present, long-term potentiation increases the efficacy of synapses for which the pre- and post-synaptic neurons are active. However, in the absence of dopamine, similar coactivity can have the opposite effect and cause long-term depression (Wickens and Kotter, 1995). 
In behavior, the release of dopamine in the basal ganglia is increased in relation to rewards and decreased for aversive stimuli. Neurons in the mid-brain are known to mediate the release of dopamine at basal ganglia sites. These cells have been shown to respond to rewards, and further, to have anticipatory responses that predict upcoming rewards (Ljungberg et al., 1992). Further, electrical stimulation of dopamine neurons in the mid-brain can be used to condition animals in place of rewards. Finally, at the anatomical level, the basal ganglia appear to hold a special place in integrating information from across the cortex. The spiny neurons of the basal ganglia receive inputs from almost every cortical area (Houk, 1995). This convergence, in conjunction with reward signals from dopamine neurons, may enable these cells to predict the value of different movement commands. Spiny neurons exhibit a rough organization into areas programming movements for different effector systems (Alexander et al., 1986). Further, they connect to motor structures such as the superior colliculus through the pallidum. They also pass information back to the frontal cortex via the pallidum and thalamus. In short, though speculative, much evidence supports a role for reinforcement learning in the brain. \section{Discussion} Reinforcement learning may be combined with supervised techniques to solve the navigation problem discussed in the introduction. Reinforcement learning is much slower than supervised training. Therefore, if any constraints can be imposed on the component behaviors of the system through supervised signals, the speed of learning may improve. For each of the component behaviors in the navigation problem, target chasing and obstacle avoidance, the information required to guide movement and gaze is well defined. Target chasing must have an estimate of the location of the target. 
Obstacle avoidance must represent the obstacle locations, perhaps with a map of objects in its near space. A recurrent neural network can be trained via supervised learning to encode this information in each of these modules. The part of the problem that is difficult to solve through supervised learning is how to generate and combine commands for movement and gaze control. Reinforcement learning can be used for this part of the problem. Each module makes connections to a map of Q-values for movement direction and for gaze direction. A positive reward is given to the robot each time it catches a target, and a negative reward each time it hits an obstacle. Solutions to these types of problems should improve our understanding of how neural systems combine different behaviors in the brain. \section{References} { \tiny Alexander GE, DeLong MR, Strick PL (1986). Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Ann Rev Neurosci, 9:357-81. \\ Allport A (1987). Selection for action: some behavioral and neurophysiological considerations of attention and action. In: Perspectives on perception and action; Heuer H, Sanders AF (Eds.), Hillsdale, NJ: Lawrence Erlbaum Assoc. \\ Arbib MA (1991). Neural mechanisms of visuomotor coordination: the evolution of rana computatrix. In: Visual structures and integrated functions. Arbib MA, Ewert JP (Eds.), New York: Springer-Verlag. \\ Basso MA, Wurtz RH (1997). Modulation of neuronal activity by target uncertainty. Nature, 389:66-69.\\ Barrett CN (1993). The amendment of large-magnitude aiming-movement errors. Psych Research, 55:148-155.\\ Beom HR, Cho HS (1995). A sensor-based navigation for a mobile robot using fuzzy logic and reinforcement learning. IEEE Trans Sys Man Cyber, 25:464-477.\\ Brooks RA (1986). A robust layered control system for a mobile robot. IEEE J Robotics and Automation, 1:14-23.\\ Brooks RA (1991). New approaches to robotics. Science, 253:1227-1232. \\ Castiello U (1996). 
Grasping a fruit: selection for action. J Exp Psych: HPP, 22(3):582-603. \\ Cobas A, Arbib MA (1992). Prey-catching and predator-avoidance in frog and toad: defining the schemas. J Theo Bio, 152:271-304. \\ Connor CE, Preddie DC, Gallant JL, Van Essen DC (1997). Spatial attention effects in macaque area V4. J Neurosci, 17:3201-3214. \\ De Jong R (1995). Perception-action coupling and S-R compatibility. Acta Psychologica, 90:287-299.\\ Desimone R (1996). Neural mechanisms for visual memory and their role in attention. Proc Natl Acad Sci, 93:13494-13499. \\ Duncan J (1995). Intelligence and the frontal lobes. In: The cognitive neurosciences. Gazzaniga MS (Ed.), Cambridge, MA: MIT Press. \\ Fuster JM (1995). Temporal processing. Ann New York Acad Sci, 769:173-181. \\ Glimcher PW, Sparks DL (1992). Movement selection in advance of action in the superior colliculus. Nature, 355:542-545. \\ Godefroy O, Lhullier C, Rousseaux M (1996). Non-spatial attention disorders in patients with frontal or posterior brain damage. Brain, 119(1):191-202. \\ Goodale MA (1996). Visuomotor modules in the vertebrate brain. Can J Physiol, 74:390-400. \\ Greenwald AG, Shulman HG (1973). On doing two things at once: elimination of the psychological refractory period effect. J of Exp Psych, 101(1):70-76. \\ Heide W, Blankenburg M, Zimmermann E, Kompf D (1995). Cortical control of double-step saccades: implications for spatial orientation. Ann Neurol, 38:739-48. \\ Hanes DP, Patterson WF, Schall JD (1998). Role of frontal eye fields in countermanding saccades: visual, movement, and fixation activity. J Neurophysiol, 79:817-834. \\ Houk JC (1995). Information processing in modular circuits linking basal ganglia and cerebral cortex. In: Models of information processing in the basal ganglia. Houk JC, Davis JL, Beiser DG (Eds.), Cambridge, MA: MIT Press.\\ Ingle DJ (1973). Two visual systems in the frog. Science, 181:1053-1055.\\ Ingle DJ (1980). 
Some effects of pretectum lesions in the frog's detection of stationary objects. Behav Br Res, 1:139-163.\\ Ingle DJ (1991). Functions of subcortical visual system in vertebrates and the evolution of higher visual mechanisms. In: Vision and visual dysfunction (Vol 2). Evolution of eye and visual system. Gregory RL, Cronly-Dillon J (Eds.), London: Macmillan.\\ Jackson S, Houghton G (1995). Sensorimotor selection and the basal ganglia: a neural network model. In: Models of information processing in the basal ganglia. Houk JC, Davis JL, Beiser DG (Eds.), Cambridge, MA: MIT Press.\\ Kaelbling L (1992). An adaptable mobile robot. In: Towards a practice of autonomous systems. Varela FJ, Bourgine P (Eds.), Cambridge, MA: MIT Press. \\ Kalaska JF (1996). Parietal cortex area 5 and visuomotor behavior. Can J Physiol, 74:483-498. \\ Koch C, Ullman S (1985). Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobio, 4:219-227. \\ Kopecz K, Schoner G (1995). Saccadic motor planning by integrating visual information and pre-information on neural dynamic fields. Bio Cyb, 73:49-60. \\ Luck SJ, Chelazzi L, Hillyard SA, Desimone R (1997). Neural mechanisms of spatial selective attention in areas V1, V2, and V4 of macaque visual cortex. J Neurophysiol, 77:24-42. \\ Ljungberg T, Apicella P, Schultz W (1992). Responses of monkey dopamine neurons during learning of behavioral reactions. J Neurophysiol, 67:145-163. \\ Maes P (1992). Learning behavior networks from experience. In: Towards a practice of autonomous systems. Varela FJ, Bourgine P (Eds.), Cambridge, MA: MIT Press. \\ Neumann O (1987). Beyond capacity: a functional view of attention. In: Perspectives on perception and action; Heuer H, Sanders AF (Eds.), Hillsdale, NJ: Lawrence Erlbaum Assoc. \\ Norman DA, Shallice T (1985). Attention to action: willed and automatic control of behavior. In: Consciousness and self-regulation (Vol 4); Davidson RJ, Schwartz GE, Shapiro D (Eds.), New York: Plenum. 
\\ Ottes FP, Van Gisbergen J, Eggermont (1984). Metric of saccade responses to visual double stimuli: two different modes. Vision Res, 24(10):1169-1179. \\ Ottes FP, Van Gisbergen J, Eggermont (1985). Latency dependence of colour-based target vs nontarget discrimination by the saccadic system. Vision Res, 25:849-862. \\ Pashler H (1994). Dual-task interference in simple tasks: data and theory. Psychological Bulletin, 116(2):220-244. \\ Pashler H, Carrier M, Hoffman J (1993). Saccadic eye movements and dual-task interference. Quart J Exp Psych, 46A(1):51-82.\\ Payton DW, Rosenblatt JK, Keirsey DM (1990). Plan guided reaction. IEEE Trans Sys Man Cyber, 20(6):1370-1382. \\ Posner MI, Petersen SE (1990). The attention system of the human brain. Ann Rev Neurosci, 13:25-42. \\ Requin J, Riehle A (1995). Neural correlates of partial transmission of sensorimotor information in the cerebral cortex. Acta Psych, 90:81-95. \\ Rohrbaugh JW (1984). The orienting reflex: performance and central nervous system manifestations. In: Varieties of attention. Parasuraman R, Davies DR (Eds.), New York: Academic. \\ Rosenblatt J, Thorpe C (1995). Combining multiple goals in a behavior-based architecture. In: Proc 1995 Int Conf Int Robots Sys, Pittsburgh, PA. \\ Sakata H, Kusunoki M (1992). Organization of space perception: neural representation of three-dimensional space in the posterior parietal cortex. Curr Op Neurobio, 2:170-174. \\ Salinas E, Abbott LF (1998). Invariant visual responses from attentional gain fields. J Neurophysiol, 77:3267-3272. \\ Salinas E, Romo R (1998). Conversion of sensory signals into motor commands in primary motor cortex. J Neurosci, 18(1):499-511. \\ Schneider W, Shiffrin RM (1977). Controlled and automatic human information processing: detection, search, and attention. Psych Rev, 84:1-66.\\ Schall JD, Hanes DP (1993). Neural basis of saccade target selection in frontal eye field during visual search. Nature, 366:467-469. \\ Shadlen MN, Newsome WT (1996). 
Motion perception: seeing and deciding. Proc Natl Acad Sci, 93:628-633. \\ Sutton RS, Barto AG (1998). Reinforcement learning: an introduction. Cambridge, MA: MIT Press. \\ Treisman A, Gelade G (1980). A feature-integration theory of attention. Cog Psychol, 12:97-136. \\ Treisman A, Schmidt H (1982). Illusory conjunctions in the perception of objects. Cog Psychol, 14:107-141. \\ Wickens J, Kotter R (1995). Cellular models of reinforcement. In: Models of information processing in the basal ganglia. Houk JC, Davis JL, Beiser DG (Eds.), Cambridge, MA: MIT Press.\\ Wise SP, Boussaoud D, Johnson PB, Caminiti R (1997). Premotor and parietal cortex: corticocortical connectivity and combinatorial computations. Ann Rev Neurosci, 20:25-42. \\ } \end{document}