Documentation of demo CLEVER-K4, the “kitchen scenario” demonstrator: machine-learning-integrated models of abstraction, novelty, and hierarchical architectures.
A major challenge for this task is the integration of the methods and models developed in the work packages WP4, WP5 and WP6 on the iCub robot. This requires a safe environment for experimenting with the iCub robot, a specification of how modules should interact, robust testing environments for the individual modules developed in the work packages, and an integrated experimental setup for benchmarks and demonstration. Some of these issues were solved during the previous project periods, some were improved during the current project period, and some are still under development.
The NES optimization framework is extended to build maps by iteratively solving new posture optimization problems with slightly different requirements. These requirements are defined by a task function, which specifies the movement freedoms required for the specific task (see Fig.2). The resulting framework builds sets of points in the relevant task dimensions, which are defined by functions representing the dimensions in which we want to move. In the example in Fig.1 (middle) we control the angle and distance of the head with respect to a certain point, which is useful for investigating an object from different angles. In a second example in Fig.1 (right), the roadmap is built to move the left hand to different positions; plotting the position of the left hand across the different postures gives a glimpse of the movement flexibility the roadmap provides.
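The iterative roadmap-building loop can be sketched on a toy model. The sketch below uses a 2-link planar arm and a simple hill-climbing search as a stand-in for NES; the forward-kinematics model, cost terms and all parameter values are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

def forward_kinematics(q):
    # Toy 2-link planar arm with unit link lengths: joint angles -> hand (x, y).
    return np.array([np.cos(q[0]) + np.cos(q[0] + q[1]),
                     np.sin(q[0]) + np.sin(q[0] + q[1])])

def optimize_posture(task_target, q0, iters=2000, sigma=0.1, seed=0):
    # Hill-climbing stand-in for NES: minimize task error plus a joint-limit penalty.
    rng = np.random.default_rng(seed)
    q, best = q0.copy(), np.inf
    for _ in range(iters):
        cand = q + rng.normal(0, sigma, size=q.shape)
        cost = (np.linalg.norm(forward_kinematics(cand) - task_target)
                + np.sum(np.maximum(np.abs(cand) - np.pi, 0)))  # joint-limit penalty
        if cost < best:
            q, best = cand, cost
    return q, best

def build_roadmap(task_points, q0):
    # Iteratively solve slightly different posture problems, warm-starting each
    # from the previous solution, and collect the resulting postures.
    roadmap, q = [], q0
    for target in task_points:
        q, _ = optimize_posture(target, q)
        roadmap.append(q.copy())
    return roadmap

# Task dimension here: the hand position; sample it along a short arc.
targets = [np.array([1.5, 0.0]), np.array([1.4, 0.4]), np.array([1.2, 0.8])]
roadmap = build_roadmap(targets, q0=np.zeros(2))
```

Each roadmap node is a full posture; the task function decides which coordinates of the posture are swept, exactly as the head-angle and left-hand examples above sweep different task dimensions.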
Figure 2: The framework architecture uses NES to iteratively minimize constraint violations while maximizing coverage of the task space. All information about body-part poses and the world state can be used.
In order to understand how intrinsic motivation can help an RL Agent with respect to the goal of learning to manipulate objects, we have focused on developing an RL agent that captures a meaningful problem in robotics, namely planning and executing reaches to arbitrary pre-grasp configurations. To implement states and actions, we developed the Modular Behavioral Environment (MoBeE), and this year we have embodied an intelligent Agent in the iCub robot using MoBeE.
Last year we demonstrated reactive/reflexive collision avoidance while learning a simple Markov Decision Process to represent the iCub's joint space (see Fig.3). To implement state transitions, we used a switching controller, which acted as a proxy, interrupting the Agent to avoid danger. This approach was elegant in that it makes few assumptions about how the Agent controls the robot. However, switching control according to a binary danger signal proved impractical due to the noise and asynchrony involved in communicating with hardware over the network. This year, we have adopted a different control strategy, wherein each body part is controlled by a second-order dynamical system (see Fig.4). This introduces many parameters to tune, but it also offers two key advantages: (1) fictitious forces can be used to smoothly avoid infeasible configurations; (2) an action can be defined as an arbitrary piece of code that forces the dynamical system and/or moves its attractor and then terminates.
Figure 3: The iCub autonomously re-plans a motion to move from one side of the ball to the other. If the ball is not a solid object (top), the Agent moves the hand through it. When the ball is suddenly made an obstacle (bottom), the Agent quickly finds the path around it. The active plan is shown with red edges in the inset graphs.
Figure 4: A second-order dynamical system acts as an attractor for a reactive behavior control module of the iCub robot. In this example the target position for the right hand moves continuously on a circle in front of the robot. The executed path is an ellipsoid due to the repulsive forces between body components that prevent self-collisions.
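The attractor dynamics with a fictitious repulsive force can be sketched in a plane. The sketch below is a minimal 2-D toy, not the MoBeE controller: the obstacle, gains and repulsion law are all assumed values chosen only to reproduce the qualitative behavior (a circular target orbit deformed by a repulsive force).

```python
import numpy as np

def simulate(attractor_fn, steps=2000, dt=0.01, k=20.0, d=8.0):
    # x'' = k (x* - x) - d x' + f_rep : a spring-damper toward a moving
    # attractor x*, plus a fictitious short-range repulsion from an obstacle.
    x = np.array([0.6, 0.0])            # start away from the obstacle
    v = np.zeros(2)
    obstacle = np.array([0.0, 0.0])     # stand-in for a body part to avoid
    path = []
    for i in range(steps):
        target = attractor_fn(i * dt)
        diff = x - obstacle
        dist = np.linalg.norm(diff)
        # Repulsion active only near the obstacle, growing as 1/dist^2.
        f_rep = 2.0 * diff / dist**3 if dist < 1.0 else np.zeros(2)
        a = k * (target - x) - d * v + f_rep
        v = v + dt * a                  # semi-implicit Euler integration
        x = x + dt * v
        path.append(x.copy())
    return np.array(path)

# Attractor moving on a circle of radius 0.5 around the obstacle.
path = simulate(lambda t: 0.5 * np.array([np.cos(t), np.sin(t)]))
```

Because the repulsion pushes the state outward while the spring pulls it toward the circle, the executed orbit settles at a radius larger than the commanded 0.5, mirroring the deformed path described in the caption above.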
We have used a Katana robotic arm to teach an iCub humanoid robot how to perceive the location of the objects it sees. To do this, the Katana accurately positions an object within the shared workspace and informs the iCub about the object's position. While the iCub moves, it observes the object from various poses, and a neural network learns to relate the pose and visual inputs (image position) to the object location.
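The supervised setup can be sketched as a regression from (pose, image position) to object location. The sketch below uses synthetic data and a linear least-squares fit as a stand-in for the neural network; the toy camera model and noise level are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic training data: the Katana supplies ground-truth object locations;
# the iCub records its head pose and the object's position in the image.
n = 500
pose = rng.uniform(-1, 1, (n, 2))       # e.g. head pan/tilt (assumed)
target = rng.uniform(-1, 1, (n, 2))     # ground-truth object position (from Katana)
image_xy = target - 0.8 * pose + 0.01 * rng.normal(size=(n, 2))  # toy camera model

# Fit a linear map from (pose, image position) to object location,
# standing in for the neural network used on the real robot.
X = np.hstack([pose, image_xy, np.ones((n, 1))])
W, *_ = np.linalg.lstsq(X, target, rcond=None)

pred = X @ W
err = np.mean(np.linalg.norm(pred - target, axis=1))
```

The key point is the supervision signal: the accurate robot labels the data, so the learner needs no external calibration rig.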
The task-relevant roadmaps form a basis for learning algorithms, where the robot states in the roadmap correspond to states and movements between those states form the actions. A video shows various simple and complex sequences of previously learned tasks on the iCub robot: http://vps9114.xlshosting.net/trm.mp4. The thumbnails in Fig.5 show a task in which the robot moves a box with both hands.
Figure 5: A sequence from the video available at http://vps9114.xlshosting.net/trm.mp4: an object manipulation task that requires coordinated planning for both arms and hands to move a large box from one place to another.
The MoBeE Model was upgraded to reflect our new control strategy, and this has enabled us to define a Reinforcement Learner that can coordinate deliberate planning and reactive control. So far we have experimented with Agents that have two different kinds of actions in their repertoire. Reaching actions greedily force the dynamical system toward pre-grasp poses near target objects using gradient descent. These actions work well some of the time, depending on the state of the robot prior to reaching. Therefore there are also planning actions, which move the dynamical system's attractor to a state from which the reach is more likely to succeed.
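The division of labor between the two action types can be sketched abstractly. The toy below is an assumed 1-D abstraction, not the MoBeE agent: each action is a piece of code that manipulates the dynamical system (or its attractor) and then terminates, and the greedy reach only succeeds reliably from nearby configurations.

```python
import random

random.seed(0)

class Robot:
    # Minimal stand-in for a body part driven by an attractor dynamics.
    def __init__(self):
        self.state = 0.0        # stand-in for the arm configuration
        self.attractor = 0.0

    def settle(self):
        self.state = self.attractor  # let the dynamics converge

TARGET = 1.0  # pre-grasp pose (assumed scalar for illustration)

def reach(robot):
    # Reaching action: greedily force the system toward the pre-grasp pose.
    # Reliable only when started from a nearby configuration.
    if abs(robot.state - TARGET) < 0.5:
        robot.attractor = TARGET
        robot.settle()
        return True
    return random.random() < 0.2  # occasionally works from far away

def plan(robot):
    # Planning action: move the attractor to a state from which the
    # subsequent reach is likely to succeed, then terminate.
    robot.attractor = TARGET - 0.3
    robot.settle()
    return True

robot = Robot()
plan(robot)
ok = reach(robot)
```

A learner choosing between these two callables faces exactly the coordination problem in the text: when to spend a planning action so that the cheap greedy reach becomes reliable.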
We show that satisfactory results can be obtained for localization even in scenarios without prior information about the robot kinematics. Furthermore, we demonstrate that this task can be accomplished safely. For this task we extend MoBeE to model multiple robots and prevent collisions between these independently controlled, heterogeneous robots in the same workspace (see Figure 6).
Figure 6: The iCub robot and the Katana arm are interacting in the real world (left). Both robots are controlled via YARP. The models of both robots are loaded into MoBeE to perform collision detection for both of them while working in a shared workspace. This setup allows us to transfer spatial perception from the accurate Katana arm to improve localization accuracy on the iCub.
The new framework creates task-related roadmaps to plan complex movements on a 41-DOF humanoid and can execute manipulation tasks as skills, which can be used as modules in a hierarchical learning environment. The method improves on other methods in the flexibility of its constraint and task definitions. Using RL we are able to learn feasible reaches in simple environments, and we can remember and reuse different reaches using a version of the PowerPlay concept. Interaction of two robots within the same framework (MoBeE) allows object localization to be learned without external supervision.
Frank, M., Leitner, J., Stollenga, M., Kaufmann, G., Harding, S., Förster, A., Schmidhuber, J. (2012). The Modular Behavioral Environment for Humanoids & other Robots (MoBeE). 9th International Conference on Informatics in Control, Automation and Robotics (ICINCO). Rome, Italy. July 2012.
Leitner, J., Harding, S., Frank, M., Förster, A., Schmidhuber, J. (2013). An Integrated, Modular Framework for Computer Vision and Cognitive Robotics Research (icVision). In Chella, A., et al. (Eds.): Biologically Inspired Cognitive Architectures 2012. Advances in Intelligent Systems and Computing (Vol. 196), pp. 205-210. Springer Berlin Heidelberg, 2013. Presented at the Int'l Conference on Biologically Inspired Cognitive Architectures (BICA). Palermo, Italy. November 2012.
Leitner, J., Harding, S., Frank, M., Förster, A., Schmidhuber, J. (2012). Transferring Spatial Perception Between Robots Operating In A Shared Workspace. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Vilamoura, Portugal. October 2012.
Stollenga, M., Pape, L., Frank, M., Leitner, J., Förster, A., Schmidhuber, J. (2013). Task-Relevant Roadmaps: A Framework for Humanoid Motion Planning. (under review)
Previous work has focused on the development and improvement of core static and dynamic novelty detection algorithms for identification and tracking of objects based on visual perception. Visual perception novelty detection methodologies were also extended with action learning; we also investigated the tactile domain. The bulk of our work in the final year focused on collaboration and engagement with CNR specifically related to intrinsically motivated action-outcome learning and goal-based action recall. Our methods are integrated with FIAS's gaze control and object learning and IDSIA's motion planning and motor control framework. The K4 demo (as described early in year 4) focused on the agreed phases of an intrinsically motivated learning sequence.
Our specific goals were to collaborate with partner CNR to practically apply conceptual intrinsic-motivation action-outcome learning and goal-based recall, and to advance the existing algorithms. Our research aims and phases associated with the agreed elements of the K4 demo are described in the stages below.
The Stage 1 UU framework included three blocks: eye control, head control (with defined possible actions) and achievement of goal. The basic experimental setup has the PR2 robot recognizing and monitoring balls on a table. Holes exist in the table so that the balls can roll away and disappear from view. The setup was designed to be an action-learning analogy of the buttons/boxes approach used in the monkey experiments (see Figure 1). In UU's PR2 experiment, a ball was used instead of buttons, and holes instead of boxes, in comparison with Baldassarre et al. (2012). Four different actions were defined, and three holes were located in different positions on the table.

In the experiment the robot learnt the outcomes of a scenario whose goal was to get the ball into a hole, i.e. how to push the ball to one of the target holes, learning via the intrinsic motivation approach. Both cortex and striatal learning methods were used. Specifically, cortex learning modified the connection weights from the goal block to the robot arm and view control, learning the relationship between the outcome and the focus direction and also between the outcome and the action. Striatal learning allowed the system to look at the balls/holes and to select the correct action, so that the system learnt that when it saw a ball, this caused an interesting change in the environment. If the randomly selected action caused a ball to move, then another event was triggered: activate/adjust signals; refocus on the change location (receive context input); consider the outcome an achieved 'goal' (learn associations later); change the environment perception (change signal and process two learning actions). When events became 'non-interesting', the view control explored the environment and the arm control selected other actions to discover new (unpredicted) events.
Figure 1: Experimental setup: [LHS] PR2 robot monitoring environment and [RHS] striking ball into hole
The video shows the implementation of the Clever-B model within a different context. The three buttons and boxes are replaced by a ball with three useful angular trajectories and three different objectives (holes); see above for further explanation. Top left: the current selection within the striatal layer. Top right: the associated Hebbian learning within the cortex layer. Bottom left: the exterior view of the experiment. Bottom right: the first-person view from the PR2 robot.
Stage 2: We believe that humans and other biological organisms bias their selections based on their previous experiences. For this reason, UU implemented a new probabilistic biased selection (PBS) approach based on formerly acquired knowledge as a replacement for the random selection step of the previous model. We suggest that the PBS is always active, instead of forcing the striatal selection until a threshold is reached, hence removing the need for any threshold in the first place, whilst remaining compatible with Clever-B's model and principles, specifically the preference for re-selecting the striatal instance. This is expressed by allowing the current striatal instance (and any other recent one) to have a higher probability of being selected than any other unselected available action, i.e. we "bias" the system to select the striatal instance, instead of forcing it, while giving all other actions a small chance of being selected. The architecture of our PBS integrated within the Clever-B model is shown in Figure 2.
Figure 2: Clever-B IM model framework (2012) extended with UU's probabilistic biased selection approach (PBS) and improved by prediction learning. Green blocks represent extensions to the model. As in the Clever-B model, arrowheads represent excitatory connections and circle heads represent inhibitory connections. Blocks with a bold border are phenomenologically modeled ('hard-wired') components, and dashed lines represent dopamine-enabled learning signals.
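The biasing scheme can be sketched directly. The toy below is an assumed minimal form of PBS: every available action keeps a non-zero selection probability, while the current (and recent) striatal instances receive a multiplicative bias; the action names and bias value are illustrative only.

```python
import random

random.seed(1)

def pbs_probs(actions, recent, bias=5.0):
    # Each action gets a base weight of 1; recently selected (striatal)
    # instances get a multiplicative bias, so they are preferred, not forced.
    w = {a: (bias if a in recent else 1.0) for a in actions}
    z = sum(w.values())
    return {a: w[a] / z for a in actions}

def pbs_select(actions, recent, bias=5.0):
    # Roulette-wheel sampling from the biased distribution.
    probs = pbs_probs(actions, recent, bias)
    r, acc = random.random(), 0.0
    for a in actions:
        acc += probs[a]
        if r < acc:
            return a
    return actions[-1]

actions = ["push_hole1", "push_hole2", "push_hole3", "wave"]
probs = pbs_probs(actions, recent={"push_hole1"})
chosen = pbs_select(actions, recent={"push_hole1"})
```

With a bias of 5, the recent striatal instance is selected with probability 5/8 while each other action keeps a 1/8 chance, which is exactly the "bias, don't force" behavior described above.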
Stage 3: Although PBS improves the selection and, in our opinion, is closer to how biological organisms choose among a set of actions, i.e. using formerly acquired knowledge to make a suitable selection, there is still one additional improvement to be made. Clever-B's model learns the actions/outcomes that lead to a novel outcome, ignoring the actions that have no consequences. In its original form, where random selection is applied, this makes no difference to the overall selection process. In the case of PBS, however, it means that previously tried dummy actions have the same chance of being selected as unselected actions. We believe that biological organisms also store information about actions that had no positive result and use this information in their future choices. For this reason, we have implemented a mechanism that allows the PBS to take previously seen dummy actions into account and suppress their selection bias relative to unselected actions. This allows the system to better explore areas where there is a higher potential of learning something useful.
Stage 4: We demonstrate the effectiveness of PBS-PL by implementing a sequence learner by way of reinforcement learning (RL). In this stage, we show that the sequence learner benefits from our previous extensions, learning at a much faster rate.
Stage 2 - Probabilistic Biased Selection (PBS): The video below shows results that illustrate the behavior and benefits of the system outlined above for Stage 2. Selection is random at the beginning of the experiment, when all choices have equal probability of being selected (t0). Once a useful selection has been made, it becomes a biased selection and is afforded a higher probability of being selected again. A dummy selection produces no visible output and retains the same proportion of probability as an unselected choice. Upon convergence (i.e. all useful actions have been fully learnt within the cortex layer), the probability reverts to random choice (the probability of each choice is equal again), so the system still resembles the Clever-B model at this point.
In this video, the top left shows the probability of all the choices, with the exploded wedge being the currently selected choice. The graphs in the top right show the cortex Hebbian weights, which indicate how well a look-action-outcome association is known. The useful actions are denoted **B,P1/**B,P2/**B,P3.
Stage 3 – PBS with Prediction Learning (PL): Extending Stage 2 with PL allows the model to also learn dummy actions and thus learn from its mistakes. We believe this is a more biologically inspired approach, as humans also learn from uninteresting actions and from their mistakes. One noticeable difference in this extension is that dummy actions are also afforded an initial boost in probability after their first selection. This handles cases where a performed action might not actually be a dummy, but rather a hardware error occurred. If the action is performed again with the same outcome, its proportion of probability for reselection is reduced faster than that of a truly useful action. The video below shows the behavior of the model with PL.
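The suppression mechanism can be sketched as a small extension of the PBS weighting. The toy below is an assumed minimal form of the PL extension: actions known to be useful keep their bias, while each observed dummy outcome multiplicatively decays an action's weight below that of an untried action; the decay rate and action names are illustrative only.

```python
def pbs_pl_probs(actions, useful, dummy_count, bias=5.0, decay=0.5):
    # PBS extended with prediction learning: actions previously observed to
    # be useful keep a selection bias, while actions repeatedly seen to
    # produce no outcome ("dummies") have their weight suppressed below
    # that of an untried action (weight 1.0).
    weights = {}
    for a in actions:
        if a in useful:
            weights[a] = bias
        else:
            weights[a] = decay ** dummy_count.get(a, 0)  # 1.0 if never tried
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

actions = ["push_hole1", "push_hole2", "wave", "blink"]
probs = pbs_pl_probs(actions, useful={"push_hole1"},
                     dummy_count={"wave": 2})  # "wave" twice seen to do nothing
```

A single dummy observation only halves the weight, so an action that failed once due to a hardware glitch still gets a second chance, while a repeatedly confirmed dummy is pushed toward the bottom of the distribution.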
In this video, we believe the model behaves in a more biologically inspired way. Upon convergence, the model retains the knowledge of both useful actions and dummy actions and affords a higher chance for reselection on all useful actions. This becomes especially useful when learning new skills using the useful actions.
Stage 4 – Learning sequences of actions: The sequence learner added to the experiment demonstrates the effectiveness of our PBS-PL approach to striatal selection. Here, learning proceeds at a much faster rate after convergence, because the memory of useful actions is retained in the form of higher reselection probabilities. Figure 3 compares the converged states of the Stage 2 and Stage 3 experiments. The important difference between them is that PBS-PL retains the knowledge of the useful actions upon convergence, whilst the Stage 2 model reverts to random selection. When extra learning is required, i.e. sequence learning, PBS-PL achieves its goals substantially more quickly. The video below shows a synopsis of this experiment.
Figure 3: Sequence learning associated with the converged states of a) the Stage 2 and b) the Stage 3 models. Both sequence learners refer to the same time step (t=320). The Stage 2 model converges faster (t=71), though upon convergence the Stage 3 model promotes a faster learning rate.
In this video, we demonstrate the effectiveness of the PBS-PL approach by reviewing the rate at which a particular sequence is learned. The sequence in this case is a 'look-action-outcome' sequence which results in the ball being placed in the holes in the order 1-3-2. The top left shows the converged state of the model, whilst the top right shows the Q-learning for the sequences. When the target sequence is encountered, it is given a positive reward; otherwise the reward is negative.
To summarise, the outcome of our work in collaboration with CNR and Sheffield was the reimplementation of the Clever-B model in our system with the advancement of the model in three original ways:
Intrinsically motivated selection based on a probabilistic framework
Prediction learner of useful/dummy actions for improved selection of the probabilistic biased selector (inspired by IDSIA)
Sequence learner based on previously learnt basic skills
These skills were demonstrated in a real-world robot experiment which bridged the Clever-K and Clever-B projects (Clever-KB). We would like to thank Gianluca Baldassarre of CNR and Kevin Gurney of the University of Sheffield for their help and support in answering our numerous questions.