Demonstrator CLEVER-B3 2012

This page describes the models and implementation of the third bio-inspired demonstrator of IM-CLeVeR.


This page documents the CLEVER-B3 bio-constrained robot demonstrator shown at the third IM-CLeVeR Review Meeting (Aberystwyth, UK, 02-04/07/2012). The demonstrator reproduces and investigates the board experiment run with capuchin monkeys (by CNR-ISTC-UCP; see below) and children (by UCBM; see below). The model was tested with the robot iCub on the same mechatronic board used with the real experiment participants. The goal of the model is to provide an operational hypothesis on the brain mechanisms that might underlie the acquisition and exploitation of actions driven by intrinsic motivation mechanisms, as observed in the monkeys and children of the board experiment. The demonstrator builds on the demonstrator of the previous year, but with important differences.

This page is organised as follows. First, it presents the target experiment and how the system is based on two components, the Decision Making Component (DMC) and the Sensorimotor Component (SMC). Second, it explains how the DMC of the previous year's demonstrator (CLEVER-B2) has been developed to arrive at a publication in Neural Networks. Third, it shows how this system has been expanded with an additional component, the amygdala, to allow a biologically constrained recall of goals in the second phase of the target experiment, based on the value assigned to them on the fly (in CLEVER-B2 this aspect was hardwired). Fourth, it shows the videos of the robot iCub, controlled by the model, in the two phases of the target experiment. Fifth, it explains in detail how the SMC of the system has been developed. Sixth, it shows the videos of the iCub robot experiments that have been used to develop the SMC of the system.

The target experiment

As last year, the modelled board experiment is divided into two phases, both employing the mechatronic board. The board has three buttons and three boxes that can be opened by an automatic device.

In the first learning phase, the participants can press any button of the board. Pressing a button causes the opening of the box corresponding to it. This causes a surprising, unexpected event that should drive the learning of the action that caused it. It should also drive the learning of the action-outcome relation between the executed action and its effect (the opening of the box).

In the second test phase of the experiment, one particular outcome (e.g., the opening of box 1) is given high value (e.g., by putting food into box 1: the box is closed by a transparent door, so the food is directly visible). The participant should then be able to immediately recall the action that delivers this outcome, thanks to the action-outcome relation learned in the learning phase.

The two components of the model

The experiments with monkeys and children show that when they face the board experiment they are already endowed with a well-developed repertoire of actions acquired during life before the experiment, for example for looking at salient parts of the board and for manipulating the foveated objects. On this basis, the model was divided into two main components:

  • A ''decision making'' component (DMc), mainly developed by CNR-ISTC-LOCEN and USFD: this component is in charge of deciding the actions to perform based on mechanisms putatively implemented by the striato-cortical loops of the brain.
  • A ''sensorimotor component'' (SMc), under the responsibility of AU, in charge of executing the selected actions based on mechanisms putatively implemented in the cortical neural pathways.

To best coordinate the two components at the technical level, CNR-ISTC-LOCEN and AU designed a specific API that interfaces the DMc with the SMc, and interfaces the SMc with the YARP-based simulated and real iCub robot. This API has been very important for the success of the interaction between the teams involved in the design and implementation of CLEVER-B2.

Decision Making component (DMc)


The development and refinement of the DMc used for the CLEVER-B3 demonstrator has recently resulted in a paper accepted by the journal “Neural Networks”, special issue on "Autonomous Learning" (Baldassarre et al., in revision). The DMc, which is grounded in a bio-constrained implementation of the striato-cortical loops and of intrinsic motivations for the acquisition of action repertoires (CNR-ISTC-LOCEN – USFD), has been used jointly with the SensoriMotor Component (SMc), which allows fine-grained actions based on sensorimotor mappings (CNR-ISTC-LOCEN – AU). This section describes how the DMc has been developed to arrive at the Neural Networks publication, while the SMc is explained below.

The DMc has been conceived to simulate the behaviours recorded on a mechatronic board (UCBM) during experiments carried out with monkeys (CNR-ISTC-UCP) and children (UCBM), designed to investigate intrinsic motivations (Baldassarre, 2011). The core of the system (Baldassarre et al., 2012) relies on:

  • USFD's neuroscientific theory and empirical experiments on intrinsically-motivated learning (Redgrave et al., 2006).
  • Theory and models of extrinsic and intrinsic motivations (Mirolli et al., subm).
  • Computational models of the board experiment (see above).
  • Models of the development of sensorimotor skills and theoretical aspects of CLEVER-K (see also the seminal model of Schmidhuber, 1991).

Specific goals

This bio-constrained model aims at giving a computational framework to the theory and data that USFD has produced on intrinsic motivations. The model targets three fundamental processes related to intrinsic motivations (IM) in the subjects dealing with the mechatronic board:

  • The transient focussing of attention and behaviour on interesting portions of the environment (selection of eye gazes and selection of arm actions) based on a typical IM dynamic of dopamine learning signals: the signals, generated by phasic unexpected events (the opening of a box), first arise, and then progressively decrease once the system learns to predict the phasic event (dopamine inhibition) (see also Santucci et al., 2010).
  • The acquisition of action-outcome associations, i.e. the formation of connections between the representations of the outcomes (e.g., “box 1 opening”) and the actions that caused them (“look at the button opening box 1”, “press the looked button”), driven by these same dopamine learning signals and the eye and arm actions.
  • The subsequent recall of actions (e.g., “look at the button opening box 1”, “press the looked button”) based on the internal reactivation of the representations of their outcomes (e.g., “box 1 opening”).
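The rise-then-decay dynamic of the dopamine signal in the first point above can be sketched in a few lines (an illustrative update rule with made-up parameters, not the model's published equations): a phasic burst is triggered by the sudden event, and shrinks as a predictor learns to anticipate it.

```python
def run_trials(n_trials=10, lr=0.5):
    """Simulate repeated box openings; return the phasic DA burst
    amplitude on each trial (surprise = event minus prediction)."""
    prediction = 0.0  # learned prediction of the phasic event
    bursts = []
    for _ in range(n_trials):
        event = 1.0                              # the box opens
        da = max(event - prediction, 0.0)        # surprise-gated burst
        prediction += lr * (event - prediction)  # predictor update
        bursts.append(da)
    return bursts

bursts = run_trials()
# Burst amplitude starts at its maximum and decays toward zero
# as the event becomes predictable (dopamine inhibition).
```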

Approach and method

The detailed architecture of the model is illustrated in the figure below: each neural unit of the model simulates the mean activity of a population of real neurons. The model relies on relevant constraints coming from behavioural and brain analyses. In particular, the overall architecture of the model has been constrained both on the basis of the anatomical connections between areas and on the basis of the specific functions implemented by the components. The neural architecture of the model is based on three weakly-coupled basal ganglia-cortical loops (Redgrave et al., 1999):
  • The oculomotor loop has a constant input representing the overall context and can select one among six possible salient stimuli (three boxes and three buttons; Hikosaka et al., 2000).
  • The arm loop receives six different possible input patterns corresponding to the perceived object and on this basis selects one among three possible actions (“press the looked object” and two other dummy actions used to challenge the model mechanisms).
  • The goal loop encodes goals (e.g., the intention to “open box 1”), receives input from the amygdala, and on this basis is capable of recalling the suitable actions for accomplishing a certain goal.
During the learning phase, when the proper press action is performed on a button, the corresponding box opens, resulting in a sudden change in the environment. This sudden change (box opening) causes the activation of the superior colliculus (SC) and a consequent release of phasic dopamine, which is responsible for the learning processes taking place in the striatum (supporting the repetition of the just-performed actions: repetition bias) and among the cortices (resulting in action-outcome learning). A further component (hardwired in this version of the model) progressively inhibits the phasic DA signals after the sudden change has been experienced several times: this inhibition simulates the system's learned ability to foresee the results of an action performed on a specific target.
During the test phase, the associations formed during the training phase between the representations of open boxes within the goal loop and the representations of attention and arm actions within the other two loops allow the whole model to recall the proper action in correspondence to a goal activated within the goal loop. In the version of the model described here, the goal is activated by a hardwired signal putatively generated by the amygdala (Amg), which assigns a value to one outcome (e.g., “open box 1”), mimicking the fact that the agent sees food in the corresponding box.
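The two DA-driven learning processes just described can be sketched as dopamine-gated Hebbian updates (hypothetical update rules and variable names, not the published equations): the same phasic burst both biases the striatum toward repeating the just-performed action and strengthens the cortico-cortical action-outcome links later used for recall.

```python
import numpy as np

N_ACTIONS, N_OUTCOMES = 3, 3
striatal_bias = np.zeros(N_ACTIONS)                    # repetition bias
w_outcome_action = np.zeros((N_OUTCOMES, N_ACTIONS))   # action-outcome links

def on_surprising_event(action, outcome, da, lr=0.1):
    """Apply the two DA-gated updates after a phasic event."""
    # Repetition bias: the action that preceded the DA burst is favoured.
    striatal_bias[action] += lr * da
    # Action-outcome learning: Hebbian link, gated by the same DA signal.
    w_outcome_action[outcome, action] += lr * da

def recall_action(goal_outcome):
    """Test phase: reactivating an outcome recalls its associated action."""
    return int(np.argmax(w_outcome_action[goal_outcome]))
```

With `da` driven to zero once the event is predicted (see the inhibitor component), both updates naturally stop, which is what lets the system move on to other parts of the board.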


In general, the simulations show that the model is able to create its own repertoire of action-outcome links on the basis of IM, and is also able to exploit this knowledge by recalling suitable goals and the corresponding actions to pursue an extrinsic reward. In further detail, the results show that:
1. A simple but biologically inspired implementation of the repetition bias (the ability to focus on and repeat actions causing phasic DA production) greatly favours action-outcome learning.
2. Action-outcome associations in the cortical connections linking the goal loop to the eye and arm loops can encode useful information during IM-based exploration.
3. The action-outcome contingencies learned due to IM can be used by the system to accomplish goal-driven behaviours triggered by the presence of EM.

Advancement of the work and relation to other tasks

Despite the presence of a few hardwired functions, the present model is a first integrated implementation of the theory of Redgrave and Gurney (2006). Preliminary data from the empirical experiments run with children show a general agreement with the model's behaviour. Results with the embodied version of the model in a humanoid robot are encouraging and will be reported elsewhere. Overall, the model allowed producing a number of empirical predictions in relation to the different roles of the repetition bias in visual focusing and in the focusing of arm actions: these were systematically investigated on the basis of selective lesions, and the outcomes might be tested in future empirical experiments.


Figure: The architecture of the model. This is formed by: three striato-cortical loops, realising the selection of arm actions (left), attention focus (centre), and goals (right); a dopamine signal generator inhibited by an inhibitor; and a component for the computation of values, responsible for goal-driven action recall.

Selected bibliography

  • Baldassarre, G. (2011). What are intrinsic motivations? a biological perspective. In A. Cangelosi, J. Triesch, I. Fasel, K. Rohlfing, F. Nori, P.-Y. Oudeyer, M. Schlesinger, and Y. Nagai (Eds.), Proceedings of the International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob-2011), pp. E1–8. New York: IEEE.
  • Baldassarre, G., Mannella, F., Fiore, V.G., Redgrave, P., Gurney, K. and Mirolli, M. (2012, accepted pending revisions). Intrinsically motivated action-outcome learning and goal-based action recall: a system-level bio-constrained computational model. Neural Networks, special issue: Autonomous Learning.
  • Baldassarre, G., Fiore, V., Mannella, F., Santucci, V., Veutrelle, N., Gurney, K., Redgrave, P. (2012). CLEVER-B3: New intrinsic-motivation system pivoting on the superior colliculus, and new striato-cortical loop for action-outcome (goal) formation. IM-CLEVeR Internal Report.
  • Hikosaka, O., Y. Takikawa, and R. Kawagoe (2000, Jul). Role of the basal ganglia in the control of purposive saccadic eye movements. Physiol Rev 80 (3), 953–978.
  • Mirolli, M., Santucci, V., Baldassarre, G. (subm) Phasic dopamine as a prediction error of intrinsic and extrinsic reinforcements driving both action acquisition and reward maximization: A simulated robotic study. Neural Networks.
  • Redgrave, P., T. J. Prescott, and K. Gurney (1999). The basal ganglia: a vertebrate solution to the selection problem? Neuroscience 89 (4), 1009–1023.
  • Redgrave, P. and K. Gurney (2006). The short-latency dopamine signal: a role in discovering novel actions? Nature Reviews Neuroscience 7 (12), 967–975.
  • Santucci, V., Baldassarre, G., Mirolli, M. (2010). Biological cumulative learning through intrinsic motivation: A simulated robotic study on the development of visually-guided reaching. In Johansson, B., Sahin, E., Balkenius, C., eds.: Proceedings of the Tenth International Conference on Epigenetic Robotics, 121-128 Lund University Cognitive Studies, Lund.
  • Schmidhuber, J. (1991) Curious model-building control system. In: Proceedings of International Joint Conference on Neural Networks. Volume 2,1458-1463, IEEE, Singapore.


Videos CLEVER-B3

Demonstrator on real robot.
The next video shows the robot iCub learning to interact with a mechatronic board based on intrinsic motivations. The video describes what happens and how the robot learns.


Demonstrator on real robot.

The next video shows a test of what the robot has learned during training. In the test, a coloured cardboard is put in front of a box. The cardboard represents a reward that is given to the robot if the robot opens the box.


Demonstrator on real robot.
The next video shows a situation where two coloured cardboards are put in front of two boxes: the cardboards represent rewards that are given to the robot if the robot opens the related boxes. At a certain point, one reward loses its value (is ``devalued''): the video shows how the robot no longer tries to open the corresponding box.


Demonstrator on real robot.

The next video shows an experiment that integrates the previous two tests.


DMC: Amygdala-based goal recall


According to the Work Plan, the goal of CLEVER-B3 was to allow the system to assign value to goals based on the experience of food (all based on the capacity of the amygdala to assign value to seen objects). The objective of CLEVER-B4 is instead to endow the system with the capacity to recall actions based on the activation of goals. CLEVER-B3 achieves both objectives, thus anticipating the delivery of next year's results. The reason is that the task run with monkeys and children was originally planned to be formed by three phases, not two as eventually run. This section explains how CLEVER-B3 accomplishes this.

Specific goals

The critical innovation of the model is as follows (see the architecture of the system in the figure below). In CLEVER-B2, the value used to select goals was directly provided as a hardwired external input to the goal loop. Instead, in CLEVER-B3 the goal is activated by dopamine. In this model a visual input always reaches the goal loop, both during the training and the test phases: whether an input activates a goal or not is signalled by the concomitant release of tonic dopamine (DAt), caused by the sight of food in the looked-at box. This has a multiplicative effect on the input and so allows it to trigger a goal activation.
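The multiplicative gating can be illustrated with a minimal sketch (illustrative numbers and names; the actual model uses neural population dynamics): tonic dopamine multiplies the visual drive to the goal loop, so a goal is activated only when a box is fixated while food is seen.

```python
def goal_input(visual_input, da_tonic):
    """Gated input to the goal loop.

    visual_input: per-goal visual drive (always present).
    da_tonic: ~0 without food in the fixated box, ~1 with food.
    """
    return [v * da_tonic for v in visual_input]

# Training phase: boxes are seen but no food, so DAt ~ 0 -> no goal active.
no_goal = goal_input([1.0, 0.0, 0.0], 0.0)
# Test phase: food in box 1 drives DAt -> only goal 1 receives input.
goal_1 = goal_input([1.0, 0.0, 0.0], 1.0)
```

The multiplication is what makes the gating work: an additive bias would let a strong visual input activate a goal even without food, whereas the product is zero whenever either factor is absent.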

Approach and methods

In more detail, the amygdala-based selection of goals works as follows. Extrinsic values are provided to the agent only during the test phase: in this phase the "food" is set in one of the three transparent closed boxes and the agent must be able to open that box by recalling suitable actions (i.e., focussing on the button corresponding to the box with the food, and pressing the button to open the box). In order to do so, the amygdala assigns a value to the perceived food, and this causes the release of tonic dopamine (DAt). This biases the selection of the goal corresponding to the perceived box: the DAt allows locking the selection of a goal connected with the current visual input (box with food), so that this goal wins the competition with other goals within the basal ganglia and, via cortico-cortical connections with the arm and oculomotor loops, can trigger the execution of the suitable action sequence.
Note that a second important change with respect to CLEVER-B2 is that the goal loop has been changed: in CLEVER-B2 the goal loop had 3 units (and hence striato-cortical channels) representing the three possible phasic events (box openings); in CLEVER-B3 the goal loop has 6 units representing buttons and boxes independently of their status (e.g., open/closed box). Note that the semantics of a goal unit in PFC is acquired when such a unit becomes linked to specific eye and arm actions.
The independence of the goal units from the status of the seen object is motivated as follows:
  • If we represent the open box and the closed box with distinct units in PFC, these form a strong association between them during the test: so this situation is equivalent to having only one unit representing both the closed and the open box.
  • We wanted to focus on the superior colliculus and its inhibitor, so we decided to simplify the goal loop as much as possible.
  • In the future we might separate the two representations of the open and closed box if needed (and add the associative learning between the two).
  • The mechanism of attribution of value by the amygdala to the closed/open box, described below, is not affected by whether we represent the open/closed box with one or two units, so it is quite orthogonal to the problem of our interest.


The figure below shows some results related to phasic dopamine (related to the repetition bias) and tonic dopamine (related to goal selection) during the training and test. The results show that the system is capable of both producing phasic DA, used to drive the repetition-bias based learning taking place within the striatum in the learning phase, and tonic DA, used to select the goals in the test phase with food.
Interestingly (data not shown), the system allows simulating the typical devaluation task (two foods are put in two different boxes, but one food is not valuable for the organism, as it has previously been satiated with it, so the organism opens the other box). The reproduction of this experiment is as follows. To activate a goal, the goal loop needs: (a) a seen state (the closed box, and possibly the states associated with it: the open box); (b) the production of dopamine activating it, caused by the sight of food. If the DA production is inhibited by an internal satiation state, as done in the simulation, the DAt is no longer produced, and in turn this prevents the goal from being selected. If satiation involves only one food, only the goal of the non-devalued food is activated and the corresponding actions (to open the corresponding box) are recalled.
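The devaluation logic just described can be sketched as follows (hypothetical function names and a made-up satiation threshold; a toy illustration of the mechanism, not the simulation code): satiation suppresses the tonic DA for one food, so only the non-devalued goal survives the gating.

```python
def tonic_da(food_seen, satiation):
    """DAt is released only when food is seen and the organism is not
    satiated with it (0.5 is an arbitrary illustrative threshold)."""
    return 1.0 if food_seen and satiation < 0.5 else 0.0

def select_goals(boxes):
    """boxes: list of (food_seen, satiation) per box.
    Returns the indices of goals whose DAt-gated input survives."""
    gated = [tonic_da(food, sat) for food, sat in boxes]
    return [i for i, g in enumerate(gated) if g > 0.0]

# Food in boxes 0 and 1; the organism is satiated on the food in box 0,
# so only the goal for box 1 is activated and its actions recalled.
active = select_goals([(True, 0.9), (True, 0.0), (False, 0.0)])
```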

Advancement of the work and relation to other tasks

The solution proposed here solves a type of “binding problem”. Indeed, the multiplication between the tonic dopamine and the currently active unit of NAcc binds: (a) the value related to the food; (b) the sensory input related to the box goal. The logical AND between the two generates the activation of the goal. Note that we have taken into consideration various alternative solutions, and this problem proved to be the most challenging one among those characterising the new version of the model. The DAt-based mechanism described here solves it elegantly by exploiting the multiplicative power of tonic dopamine, capable of implementing an AND. The importance of the binding problem is exemplified by considering the case where the food is moved from one box to another: at this point, the value has to shift, on the fly, from one box to the other. In this case, the binding must be dynamic (i.e., computed by the multiplication) and cannot rely on a modification of connection weights. Finally, note that we solved this “binding problem” between value and goals by relying on the same solution that was used for the original binding problem involving “where and what” in the realm of vision: namely, attention focussing that exploits the spatial correspondence between the food (value) and the goal to be valued (the box to be opened).

SMC: The sensorimotor component

The sensorimotor component of the CLEVER-B demonstrator handles the sensor and motor capabilities required for interaction with objects in the real world. This includes: directing gaze toward stimuli of interest; reaching to, or pointing at, stimuli; grasping and manipulating objects; and pressing buttons on the experimental board.

These abilities are developed on the robot by modelling the data from the infant psychology literature. Learning occurs in stages, and follows cephalocaudal and proximo-distal learning directions, reflecting the patterns evident in infancy.

Specific goals

The target experiment for the CLEVER-B demonstrator (see Deliverable 7.3) requires the robot to look at the buttons and openings on the experimental board, and to push the buttons. This was achieved by the SMc during year 2. The goals for CLEVER-B3 were to improve the ability of the robot to interact with its environment by 1) increasing the range of its reaching and visual systems, 2) adding new manipulation skills, 3) using play-like behaviour to drive further learning.

Approach and methods

Using our mapping framework, and constraint-based learning approach, we have extended the range of behaviours that can be performed by the iCub. In particular, we have focussed on increasing the range of positions that can be seen and reached to, by modelling additional infant behaviour.

The architecture for learning gaze control, within the SMc, enables the robot to attend to visual targets, currently identified using simple vision processes (more advanced visual abstraction techniques will be employed in future demonstrators). This has been extended to incorporate torso movements, which enable the robot to fixate on targets outside of eye and head range. A mapping of torso rotation to visual change, similar to that used to learn eye and head movements, can be learnt, allowing the torso to contribute to gaze shifts. An important property of movement at the torso is that it does not change the hand-eye relationship, and so can also be used to move the workspace of the robot to envelop the stimuli of interest. Simple vergence control has also been added, and is used to estimate the distance to an object in the gaze space.
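The vergence-based distance estimate can be illustrated with simple geometry (a symmetric-vergence sketch; the baseline value below is an assumption for illustration, not the iCub's actual camera calibration):

```python
import math

BASELINE = 0.068  # assumed distance in metres between the two cameras

def distance_from_vergence(vergence_rad):
    """Distance to the fixated point from the total vergence angle
    (angle between the two optical axes), assuming symmetric vergence:
    tan(v/2) = (baseline/2) / distance."""
    return (BASELINE / 2.0) / math.tan(vergence_rad / 2.0)
```

Larger vergence angles correspond to nearer fixation points, which is what lets the gaze space (and hence the reach space, below) be mapped in 3D.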

Reaching is learnt by mapping arm joint positions to gaze positions, and reflects findings on early infant reaching (Berthier, 2011). Constraints analogous to those identified in early reaching, such as locking the elbow joint, are imposed to reduce redundancy and remove discontinuities in the joint space. Reaching in the current system is visually elicited, rather than guided, and is made directly from a preferred pre-reaching position to the target with no trajectory control. In the absence of elbow flexion, the torso is rotated to position the target at a distance reachable by the learnt arm postures. A simple engineered grasp has been implemented to allow grasping of objects.
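A minimal sketch of this mapping-based reaching (the class and method names are hypothetical; the real system uses its own mapping framework): during babbling the robot records which gaze position each arm posture brings the hand to, and a reach simply looks up the posture nearest the fixated target.

```python
class ReachMap:
    """Toy lookup table linking learnt arm postures to the gaze
    position at which the hand was fixated during babbling."""

    def __init__(self):
        self.entries = []  # (gaze_position, arm_posture) pairs

    def babble(self, gaze_position, arm_posture):
        # Motor babbling: record where this posture put the hand.
        self.entries.append((gaze_position, arm_posture))

    def reach(self, target_gaze):
        # Visually elicited reach: select the posture whose stored
        # gaze position is nearest the target (no trajectory control).
        def sq_dist(entry):
            gaze, _ = entry
            return sum((a - b) ** 2 for a, b in zip(gaze, target_gaze))
        return min(self.entries, key=sq_dist)[1]
```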

We have also focussed on play behaviour in infancy, and its role in the generation and discovery of new actions. Rather than build a set of actions for the robot to use in the board experiments, the goal is for the robot to generate its own actions by exploring the environment. We can provide scaffolding via the environment to direct learning towards particular skills.

An improved babble generator enables babbling to include both primitive actions (i.e. those that cannot be decomposed into smaller actions) and composite actions (e.g. reach and press). In the previous system all stimuli were immediately processed (or ignored) and selection was performed by simple attention and saliency methods. This meant there was no long-term memory of events and effects. The babble generator is based on a schema model that is able to memorise small fragments of behaviour in terms of sensorimotor relations. These can be generalised, enabling the system to make informed decisions about scenarios it hasn't encountered yet but which are similar to past experiences.
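A schema memory of this kind can be sketched as a store of (context, action, outcome) triples (the generalisation-by-context-overlap rule below is an assumption about the mechanism, for illustration only):

```python
class SchemaMemory:
    """Toy schema store: memorises sensorimotor fragments and
    generalises them to similar, unseen contexts."""

    def __init__(self):
        self.schemas = []  # (context, action, outcome) triples

    def record(self, context, action, outcome):
        self.schemas.append((tuple(context), action, outcome))

    def predict(self, context, action):
        # Among schemas with the same action, return the outcome of
        # the one whose context overlaps most with the current one;
        # an exact match always wins, a partial match generalises.
        best_outcome, best_score = None, -1
        for ctx, act, out in self.schemas:
            if act != action:
                continue
            score = sum(a == b for a, b in zip(ctx, context))
            if score > best_score:
                best_outcome, best_score = out, score
        return best_outcome
```

Chaining such predictions (use the predicted outcome as the next context) is what allows the planning of action sequences described below.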

Following the developmental timeline, the robot quickly acquires the skills which allow it to look at, reach to, and pick up objects. These arise through simple motor babbling, and a sequence of constraints on behaviour. It then experiments with the learnt behaviours, using play-like behaviour, which enable it to learn about action effects, plan sequences of actions, and make generalisations about learnt actions and how they can be used in new situations.


The gaze space of the robot has been increased by the addition of movements of the torso, and extended from 2D to 3D by the addition of mechanisms for controlling the vergence of the cameras.

The reach space has been increased by the re-enabling of two-arm reaching, which allows the robot to select either arm for reaching to a target, and by the addition of movements of the torso. The addition of vergence control enables the reach space to be mapped in 3D, allowing reaches to be conducted to positions, closer to or further from the robot, that appear on the same line of sight. Together these improvements allow greater flexibility in the positioning of the experimental board and of the objects to be manipulated. The virtual skin is used to ensure the robot does not move into any dangerous positions, and a palmar grasp has been added to enable the robot to pick up simple objects.

Schema learning enables the robot to learn about sensorimotor actions, and their effects, through play-like behaviour. The robot can use schemas to make generalisations about actions and their outcomes based on prior experience, and so select when to use actions in new situations. Learning is driven by novelty, and so the robot is capable of creating its own action sequences through experimentation. It can also use learnt schemas to plan a sequence of actions to achieve a goal.

Advancement of the work and relation to other tasks

The work described here demonstrates how the abstract representations of Task 4.4 and the novelty-driven learning of Task 5.4 can be used to model infant-like development using the hierarchical architectures of Task 6.2. The sensorimotor actions learnt using this process can be selected and driven by the DMC to provide real-world interactions.

The image processing and vergence modules are currently based on simple engineered mechanisms, and are not biologically plausible. They will be replaced in the future by biologically constrained modules from FIAS.

The virtual skin software, provided by IDSIA, is being used to protect the robot from unwanted collisions and dangerous postures.


Figure: Architecture of the SMc, showing the schema mechanism for generating play behaviour, and its interactions with the real and simulated iCub.

Selected bibliography

  • Berthier, N.E. (2011). The syntax of early human infant reaching. In Sayama, H., Minai, A.A., Braha, D., Bar-Yam, Y. (Eds) Unifying Themes in Complex Systems Volume VIII: Proceedings of the Eighth International Conference on Complex Systems. New England Complex Systems Institute Series on Complexity, NECSI Knowledge Press, 1477-1487.


The following video shows the Aberystwyth iCub learning to perform visually triggered reaching and grasping by modelling the developmental stages observed in infancy.


The following video shows the Aberystwyth iCub learning using play-like behaviour.  It experiments with behaviours it has previously learnt, storing the results in its schema memory. The video shows schemas being created, generalised, and chained, through a novelty-driven process.  It also shows how learning can be shaped by interaction, in this case through the use of language.