Interactive Perception: Leveraging Action in Perception and Perception in Action

Jeannette Bohg, Karol Hausman, Bharath Sankaran, Oliver Brock, Danica Kragic, Stefan Schaal, Gaurav Sukhatme

I Introduction

There is compelling evidence that perception in humans and animals is an active and exploratory process. Even the most basic categories of biological vision seem to be based on active visual exploration, rather than on the analysis of static image content. For example, Noë argues that the visual category circle or round cannot be based on the direct perception of a circle, as (i) we rarely look at round objects from directly above, and (ii) the projection of a circle onto our retina is not a circle at all. Instead, we perceive circles by the way their projection changes in response to eye movements.

Held and Hein analyzed the development of visually-guided behavior in kittens. They found that this development critically depends on the opportunity to learn the relationship between self-produced movement and concurrent visual feedback. The authors conducted an experiment with kittens that were only exposed to daylight when placed in the carousel depicted in Fig. 1. Through this mechanism, the active kittens (A) transferred their own, deliberate motion to the passive kittens (P) that were sitting in a basket. Although both types of kittens received the same visual stimuli, only the active kittens showed meaningful visually-guided behavior in test situations.

Gibson showed that physical interaction further augments perceptual processing beyond what can be achieved by deliberate pose changes. In the specific experiment, human subjects had to find a reference object among a set of irregularly-shaped, three-dimensional objects (see Fig. 2). They achieved an average accuracy of 49% if these objects were shown in a single image. This accuracy increased to 72% when subjects viewed rotating versions of the objects. They achieved nearly perfect performance (99%) when touching and rotating the objects in their hands.

These three examples illustrate that biological perception and perceptually-guided behavior intrinsically rely on active exploration and knowledge of the relation between action and sensory response. This contradicts our introspection, as we seem to just passively see. In reality, visual perception is similar to haptic exploration. “Vision is touch-like” [3, p.73] in that perceptual content is not given to the observer all at once but only through skillful, active looking.

This stands in stark contrast to how perception problems are commonly framed in Machine Vision. Often, the aim is to semantically annotate a single image while relying on a minimum set of assumptions or prior knowledge. These requirements render the considered perception problems under-constrained and thereby make them very hard to solve.

The most successful approaches learn models from data sets that contain hundreds of thousands of semantically annotated static images, such as Pascal VOC, ImageNet or Microsoft COCO. Recently, Deep Learning based approaches led to substantial progress by being able to leverage these large amounts of training data. In these methods, data points provide the most important source of constraints to find a suitable solution to the considered perception problem. The success of these methods over more traditional approaches suggests that previously considered assumptions and prior knowledge did not correctly or sufficiently constrain the solution space.

Different from disembodied Computer Vision algorithms, robots are embodied agents that can move within the environment and physically interact with it. Similar to biological systems, this creates rich and more informative sensory signals that are concurrent with the actions and would otherwise not be present. There is a regular relationship between actions and their sensory response. This regularity provides the additional constraints that simplify the prediction and interpretation of these high-dimensional signals. Therefore, robots should exploit any knowledge of this regularity. Such an integrated approach to perception and action may reduce the requirement of large amounts of data and thereby provide a viable alternative to the current data-intensive approaches towards machine perception.

II Interactive Perception

Recent approaches in robot perception are subsumed by the term Interactive Perception (IP). They exploit any kind of forceful interaction with the environment to simplify and enhance perception. Thereby, they enable robust perceptually-guided manipulation behaviors. IP has two benefits. First, physical interaction creates a novel sensory signal that would otherwise not be present. Second, by exploiting knowledge of the regularity in the combined space of sensory data and action parameters, the prediction and interpretation of this novel signal becomes simpler and more robust. In this section, we will define what we mean by forceful interaction. Furthermore, we explain the two postulated benefits of IP in more detail.

Any action that exerts a potentially time-varying force upon the environment is a forceful interaction. A common way of creating such an interaction is through physical contact that may be established for the purpose of moving the agent (e.g. in legged or wheeled locomotion), for changing the environment (e.g. to open a door or pushing objects on a table out of the way) or for exploring environment properties while leaving it unchanged (e.g. by sliding along a surface to determine its material). It may also be a contact-free interaction that is due to gravitational or magnetic forces or even lift. An interaction may only be locally applied to the scene (e.g. through pushing or pulling a specific object) or it may affect the scene globally (e.g. shaking a tray with objects standing on it). This interaction can be performed either by the agent itself or by any other entity, e.g. a teacher to be imitated or someone who demonstrates an interaction through kinesthetic teaching.

In this survey, we are interested in approaches that go beyond the mere observation of the environment towards approaches that enable its Perceptive Manipulation. (We consider Perceptive Manipulation to be the equivalent term to Interactive Perception. This emphasizes that the boundary traditionally drawn between manipulation and perception is blurred.) Therefore, we focus on physical interactions for the purpose of changing the environment or for exploring environment properties while leaving it unchanged. We are not concerned with interactions for locomotion and environment mapping.

II-B Benefits of Interactive Perception

Forceful interactions create novel, rich sensory signals that would otherwise not be present (CNS), such as haptic, audio and visual data correlated over time. These signals are beneficial for estimating quantities that are relevant to manipulation problems. Relevant quantities include object weight, surface material or rigidity.

Action Perception Regularity (APR)

Forceful interactions reveal regularities in the combined space ( $S\times A\times t$ ) of sensor information ( $S$ ) and action parameters ( $A$ ) over time ( $t$ ). This regularity is constituted by the repeatable, multi-modal sensory data that is created when executing the same action in the same environment. Not considering the space of actions amounts to marginalizing over them. The corresponding sensory signals would then possess a significantly higher degree of variation compared to the case where the regularity in $S\times A\times t$ is taken into account. Therefore, although $S\times A\times t$ is of much higher dimension, the signal represented in this space has more structure.
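The effect of marginalizing over actions can be illustrated with a toy simulation. The sketch below assumes a hypothetical one-dimensional sensor whose response is linear in a scalar action parameter plus Gaussian noise. The function name, the gain, the noise level and the three push strengths are illustrative assumptions and are not taken from any of the surveyed approaches.

```python
import random
import statistics

def sensor_response(action, rng, gain=2.0, noise_std=0.1):
    """Hypothetical sensor model: the response depends linearly on the action."""
    return gain * action + rng.gauss(0.0, noise_std)

rng = random.Random(0)

# Actions are drawn from three push strengths (assumed for illustration).
push_strengths = (0.5, 1.0, 1.5)
actions = [rng.choice(push_strengths) for _ in range(3000)]
signals = [sensor_response(a, rng) for a in actions]

# Ignoring the action: variation of the signal when viewed in S alone.
marginal_std = statistics.pstdev(signals)

# Using the action: variation of the signal for each fixed action (S x A).
conditional_std = {}
for a in push_strengths:
    subset = [s for s, act in zip(signals, actions) if act == a]
    conditional_std[a] = statistics.pstdev(subset)

print(f"std ignoring actions: {marginal_std:.3f}")
for a, std in conditional_std.items():
    print(f"std given action {a}: {std:.3f}")
```

In this toy setting, the spread of the signal for a fixed action is governed by the sensor noise only, whereas the spread of the marginalized signal is dominated by the unknown action.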

Knowing this regularity corresponds to understanding the causal relationship between action and sensory response given specific environment properties. Thereby, it allows the robot to (i) predict the sensory signal given knowledge about the action and environment properties, (ii) update the knowledge about some latent properties of the environment by comparing the prediction to the observation and (iii) infer the action that has been applied to generate the observed sensory signal given some environment properties. These capabilities simplify perception but also enable optimal action selection.
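The three capabilities can be made concrete with a small example. The following is a minimal sketch under strong assumptions: a robot presses a fingertip into a surface by a commanded indentation (the action) and reads a contact force (the sensory signal), with the latent environment property being one of two hypothetical materials modeled as linear springs. The stiffness values, the noise level and all function names are invented for illustration.

```python
import math

STIFFNESS = {"soft": 50.0, "stiff": 400.0}  # N/m, assumed values
NOISE_STD = 0.5                             # N, assumed force sensor noise

def predict(action, stiffness):
    """(i) Predict the force reading for a commanded indentation in meters."""
    return stiffness * action

def likelihood(signal, action, stiffness):
    """Unnormalized Gaussian likelihood of a reading under the forward model."""
    error = signal - predict(action, stiffness)
    return math.exp(-0.5 * (error / NOISE_STD) ** 2)

def update(prior, action, signal):
    """(ii) Bayes update of the belief over the latent material."""
    unnorm = {m: p * likelihood(signal, action, STIFFNESS[m])
              for m, p in prior.items()}
    total = sum(unnorm.values())
    return {m: w / total for m, w in unnorm.items()}

def infer_action(signal, material, candidates):
    """(iii) Most likely action given the signal and a known material."""
    return max(candidates,
               key=lambda a: likelihood(signal, a, STIFFNESS[material]))

prior = {"soft": 0.5, "stiff": 0.5}
# The robot pressed 1 cm into the surface and felt 3.8 N.
posterior = update(prior, action=0.01, signal=3.8)
print(posterior)

# A 2.1 N reading on the soft material: which indentation was applied?
likely_action = infer_action(2.1, "soft", candidates=[0.01, 0.02, 0.04])
print(likely_action)
```

All three functions share the same forward model, which is the point of the example: a single representation of the regularity supports prediction, estimation and action inference.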

Learning these regularities corresponds to identifying the causal relationship between action and sensory response. This requires information about the action that produced an observed sensory effect. If the robot autonomously interacts with the environment, this information is automatically available. Information about the action can also be provided by a human demonstrator.

III Historical Perspective

In robotics, the research field of Active Perception (AP) pioneered the insight that perception is active and exploratory. In this section, we relate Interactive to Active Perception. Additionally, we discuss the relation of IP to other perception approaches that neglect either the sensory or action space in $S\times A\times t$ . Figure 3 summarizes the section.

Sensorless manipulation is an approach to perception that does not require any sensing. It aims at finding a sequence of actions that brings the system of interest from an unknown into a defined state. Therefore, after performing these actions, the system state can be considered as perceived. This kind of sensorless manipulation was demonstrated first by Erdmann and Mason, who used it for orienting a planar part that is randomly dropped onto a tray. The goal of the proposed algorithm is to generate a sequence of tray tilting actions that uniquely moves the part into a goal orientation without receiving sensor feedback or knowing the initial state. It uses a simple mechanical model of sliding and information on how events like collisions with walls reduce the number of possible part orientations. More recently, Dogar et al. extended this line of thought to grasping. The authors plan for the best push-grasp such that the object of interest has a high probability of moving into the gripper while other objects are pushed away. The plan is then executed open-loop without taking feedback of the actual response of the environment into account.

We argue that IP critically depends on representing a signal in the combined space of sensory information and action parameters over time. Sensorless manipulation is similar in that it also requires a model of how actions funnel the uncertainty about the system state into a smaller region in state space. However, different from the approaches in this survey, it does not require sensory feedback as it assumes that the uncertainty can be reduced to the required amount only through the actions. For complex dynamical systems, this may not always be the case or a sufficiently expressive forward model may not be available.
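The funnelling of uncertainty through actions alone can be sketched with a toy problem. The example below is not the algorithm of Erdmann and Mason. It assumes a hypothetical rectangular tray discretized into cells, a point-like part whose position (rather than orientation) is unknown, and tilt actions that slide the part until it hits a wall. A breadth-first search over sets of possible positions then finds a tilt sequence that ends in a single known state without any sensing.

```python
from collections import deque

WIDTH, HEIGHT = 4, 3  # assumed tray size in cells
TILTS = {"left": (-1, 0), "right": (1, 0), "down": (0, -1), "up": (0, 1)}

def slide(cell, tilt):
    """Slide the part in the tilt direction until it hits a tray wall."""
    x, y = cell
    dx, dy = TILTS[tilt]
    while 0 <= x + dx < WIDTH and 0 <= y + dy < HEIGHT:
        x, y = x + dx, y + dy
    return (x, y)

def apply_tilt(belief, tilt):
    """Propagate the whole set of possible part positions through one tilt."""
    return frozenset(slide(cell, tilt) for cell in belief)

def plan(belief, goal):
    """Breadth-first search over belief sets for tilts that end in {goal}."""
    start = frozenset(belief)
    target = frozenset([goal])
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        current, actions = queue.popleft()
        if current == target:
            return actions
        for tilt in TILTS:
            nxt = apply_tilt(current, tilt)
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, actions + [tilt]))
    return None  # no tilt sequence funnels the belief into the goal

# The part may initially be anywhere on the tray.
unknown_start = {(x, y) for x in range(WIDTH) for y in range(HEIGHT)}
tilt_sequence = plan(unknown_start, goal=(0, 0))
print(tilt_sequence)
```

The planner reasons about the set of all states the part could be in, which is exactly the forward model of how actions reduce uncertainty that sensorless manipulation requires.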

III-B Perception of Visual Data

The vast research area of Computer Vision focuses on interpreting static images, video or other visual data. The majority of approaches completely neglect the active and exploratory nature of human and robot perception. Nevertheless, there are examples in the Computer Vision literature that show how exploiting the regularity in $S\times A\times t$ simplifies perception problems. The first example aims at human activity recognition in video. It is somewhat obvious that this task becomes easier when observing the activity over a certain course of time. Less obvious is the result by Kjellström et al. who showed that classifying objects is easier if they are observed while being used by a person. More recently, Cai et al. support these results. They show that recognizing manipulation actions in single images is much easier when modeling the associated grasp and object type in a unified model.

Another example considers the problem of image restoration. Xue et al. exploit whole image sequences to separate obstructing foreground like fences or window reflections from the main subject of the images, i.e. the background. This would be a very hard problem if only a single image were given or without the prior knowledge of the relation between optical flow and depth.

Aloimonos et al. show how challenging vision problems, such as shape from shading or structure from motion, are easier to solve with an active than a passive observer. Given known camera motion and associated images, the particular problem can be formulated such that it has a unique solution and is linear. The case of the passive observer usually requires additional assumptions or regularization and sometimes non-linear optimization.
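As a minimal worked illustration of why known camera motion helps, consider a textbook pinhole example. It is not the formulation of Aloimonos et al., and the symbols $f$, $b$, $Z$, $d$ and $\rho$ are introduced here for illustration only. A camera with focal length $f$ translates parallel to its image plane by a distance $b$ while observing a static point at depth $Z$. The image of the point then shifts by

$$ d = \frac{f\,b}{Z} = f\,b\,\rho, \qquad \rho = \frac{1}{Z}. $$

For an active observer that knows $b$, this is a linear equation in the inverse depth $\rho$ with the unique solution $\rho = d/(f\,b)$. A passive observer that does not know $b$ can only recover the product $b\,\rho$, so depth is determined only up to an unknown scale unless further assumptions are added.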

III-C Active Perception

In 1988, Bajcsy introduced AP as the problem of intelligent control strategies applied to the data acquisition process. Ballard and Aloimonos et al. further analyzed this concept for the particular modality of vision. In this context, researchers developed artificial vision systems with many degrees of freedom and models of visual attention that these active vision systems could use for guiding their gaze.

Recently, Bajcsy et al. revisited AP giving an excellent historical perspective on the field and a new, broader definition of an active perceiver based on decades of research:

“An agent is an active perceiver if it knows why it wishes to sense, and then chooses what to perceive, and determines how, when and where to achieve that perception.”

The authors identify the why as the central component that distinguishes an active perceiver from a passive observer. It requires the agent to reason about so-called Expectation-Action tuples to select the next best action. The expected result of the action can be confirmed by its execution. Expectation-Action tuples capture the predictive power of the regularity in $S\times A\times t$ to enable optimal action selection.

The new definition of AP is not restricted to vision. However, the majority of approaches gathered under the term of active perception consider vision as the sole modality and the manipulation of extrinsic or intrinsic camera parameters as possible actions. This is also reflected by the choice of examples in . The focus on the visual sense has several implications for Active Perception in relation to Interactive Perception. First, an active perceiver with the ability to move creates a richer and more informative visual signal (e.g. from multiple viewpoints or when zooming) that would otherwise not be present. However, this may not provide all relevant information, especially not the information required for manipulation problems. Natale et al. emphasize that only through physical interaction can a robot access object properties that would otherwise not be available (like weight, roughness or softness).

Second, as shown in Aloimonos et al., we have a very good understanding of multi-view and perspective geometry that can be leveraged to formulate a vision problem in such a way that its solution is simple and tractable. However, when it comes to predicting the effect of physical interaction that changes not only the viewpoint of the agent on the environment but the environment itself, we are yet to develop rich, expressive and tractable models.

Lastly, AP mainly focuses on simplifying challenging perception problems. However, a robot should also be able to manipulate the environment in a goal-directed manner. Sandini et al. [24, p.167] formulate this as a difference in how visual information is used: in AP it is mainly devoted to exploration of the environment whereas in IP it is also used to monitor the execution of motor actions.

III-C2 Early Examples of Interactive Perception

There are a number of early approaches within the area of Active Perception that exploit forceful interaction with the environment and are therefore early examples of IP approaches. Tsikos and Bajcsy propose to use a robot arm to make the scene simpler for the vision system through actions like pick, push and shake. The specific scenario is the separation of random heaps of objects into sets of similar shapes. Bajcsy as well as Bajcsy and Sinha propose the Looker and Feeler system that performs material recognition of potential footholds for legged locomotion. The authors hand-design specific exploration procedures of which the robot observes the outcome (visually or haptically) to determine material attributes. Salganicoff and Bajcsy show how the mapping between observed attributes, actions and rewards can be learned from training data gathered during real executions of a task. Sandini et al. [24, Section 3] propose to use optical flow analysis of the object motion while it is being pushed. The authors show that through this analysis, they can retrieve both geometrical and physical object properties which can then be used to adapt the action.

III-D Active Haptic Perception

Haptic exploration of the environment relies on haptic sensing that requires contact with the environment. Interpretation of a sequence of such observations is part of IP as it requires a forceful and time-varying interaction. The interpretation of an isolated haptic frame without temporal information is similar to approaches in Computer Vision such as semantic scene understanding from static images.

Early approaches that use touch in an active manner are applied to problems such as reconstructing shape from touch, recognizing objects through tracing their surface or exploring texture and material properties. The complementary nature of vision and touch has been explored by Allen and Bajcsy in reconstructing 3D object shape. A more complete review of these early approaches towards active haptic perception is contained in .

More recent examples include approaches for learning haptic object representations, for object detection and pose estimation, for reconstructing the shape of objects or the environment, as well as for texture classification or description. The most apparent difference between these recent approaches and earlier work lies in the use of machine learning techniques to automatically find suitable exploration strategies, to learn suitable feature representations or to better estimate different quantities.

In general, active haptic perception requires deliberate contact interaction, but in the majority of cases it does not aim at changing the environment. Instead, for simplification, objects or the environment are often assumed to be rigid and static during contact.

IV Applications of Interactive Perception

Interactive perception methods may be applied to achieve an estimation or a manipulation goal. Currently, the vast majority of IP approaches estimate some quantity of interest through forceful interaction. Other IP approaches pursue either a grasping or manipulation goal. This means that they aim to manipulate the environment to bring it into a desired state. Usually this includes the estimation of quantities that are relevant to the manipulation task.

Existing IP approaches can be broadly grouped into ten major application areas as visualized in Figure 4. In this section, we briefly describe each of these areas. For the first three applications (Object Segmentation, Articulation Model Estimation and Object Dynamics Learning), we use a couple of simple examples (Figures 5 and 6) to allow the reader to better appreciate the benefits of IP and understand its distinction to Active Perception.

Object segmentation is a difficult problem and, in the area of Computer Vision, it is often performed on single images. To illustrate the challenges, consider the simple example scenario depicted in Figure 5. Two Lego blocks are firmly attached to the table. The robot is supposed to estimate the number of objects on the table. When the robot is a passive observer of the scene as in Fig. 5 [Left], it would be very challenging to estimate the correct number of Lego blocks on the table without incorporating a lot of prior knowledge. The situation does not improve in this static scenario even with more sensory data from different viewpoints or after zooming in.

When the robot observes another agent interacting with the scene as shown in Fig. 5 [Center], it will be able to segment the Lego blocks and correctly estimate the number of objects in the scene. This is an example of how forceful interactions can create rich sensory signals that would otherwise not be present (CNS). The new evidence in form of motion cues simplifies the problem of object segmentation.
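How motion cues simplify segmentation can be sketched with a toy example. The code below assumes that a set of image features has been tracked before and after an interaction and that each object undergoes an approximately pure translation in the image. Both are simplifying assumptions made for illustration, and the function name, the tolerance and the coordinates are hypothetical.

```python
def segment_by_motion(before, after, tol=0.5):
    """Group feature indices whose displacement vectors agree within tol."""
    segments = []  # list of (representative displacement, member indices)
    for i, ((x0, y0), (x1, y1)) in enumerate(zip(before, after)):
        d = (x1 - x0, y1 - y0)
        for rep, members in segments:
            if abs(d[0] - rep[0]) <= tol and abs(d[1] - rep[1]) <= tol:
                members.append(i)
                break
        else:
            # No existing group moves like this feature: start a new segment.
            segments.append((d, [i]))
    return [members for _, members in segments]

# Six features on two adjacent blocks; in the static image they form one blob.
before = [(0, 0), (1, 0), (1, 1), (2, 0), (3, 0), (3, 1)]
# One block is pushed by about three units, the other stays in place.
after = [(3.1, 0.0), (4.0, 0.1), (3.9, 1.0), (2.0, 0.1), (3.1, 0.0), (3.0, 0.9)]

segments = segment_by_motion(before, after)
print(segments)
```

The feature positions before the interaction carry no evidence for two objects, while the displacement vectors separate them directly.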

The ability to interact with the scene allows a robot to also autonomously generate more informative sensory information as visualized in Fig. 5 [Right]. Reasoning about the regularity in $S\times A\times t$ may lead to even better segmentation since the robot can select actions that are particularly well suited for reducing the segmentation uncertainty (APR).

For these reasons, object segmentation has become a very popular topic in Interactive Perception. For example, Fitzpatrick and Metta as well as Metta and Fitzpatrick are able to segment the robot’s arm and the objects that were moved in a scene. Gupta and Sukhatme as well as Chang et al. use predefined actions to segment objects in cluttered environments. van Hoof et al. can probabilistically reason about optimal actions to segment a scene.

IV-B Articulation Model Estimation

Another problem that is simplified through Interactive Perception is the estimation of object articulation mechanisms. The robot has to determine whether the relative movement of two objects is constrained or not. Furthermore, it has to understand whether this potential constraint is due to a prismatic or revolute articulation mechanism and what the pose of the joint axis is. Fig. 5 [Left] visualizes an example situation in which the robot has to estimate the potential articulation mechanism between two Lego blocks given only visual observations of a static scene. This is almost impossible to estimate from single images without including a lot of prior semantic knowledge. It is also worth noting that this situation is not improved by gathering more information from multiple viewpoints of this otherwise static scene.

In Fig. 5 [Center], the robot observes an agent lifting the top-most Lego block. This is another example of how forceful interactions create a novel, informative sensory signal (CNS). In this case it is a straight-line, vertical motion of one Lego block. It provides evidence in favor of a prismatic joint in between these two objects (although in this case, this is still incorrect).

When the robot autonomously interacts with the scene it creates these informative sensory signals not only in the visual but also haptic sensory modality. This data is strongly correlated with a particular articulation mechanism. Fig. 5 [Right] visualizes this scenario. By leveraging knowledge of the regularity in $S\times A\times t$ , the robot can also form a correct hypothesis of the articulation model (APR). The Lego blocks are rigidly attached at first, but when the robot applies enough vertical force to the top-most Lego block, there is sensory evidence for a free body articulation model.

In the literature, there are offline estimation approaches towards this problem that either rely on fiducial markers or marker-less tracking. There are also online approaches where the model is estimated during the movement. Most recent methods include reasoning about actions to actively reduce the uncertainty in the articulation model estimates.
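To illustrate how an observed motion discriminates between candidate articulation models, the following minimal sketch classifies a planar track of the position of one body relative to another. It is a deliberately simple geometric test and not one of the estimation approaches from the literature. It assumes noise-free 2D positions, and the function names, the tolerance and the label for the remaining case are hypothetical.

```python
import math

def circumcenter(a, b, c):
    """Center of the circle through three points, or None if collinear."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy

def classify_articulation(track, tol=1e-2):
    """Label a relative-position track as rigid, prismatic or revolute."""
    x0, y0 = track[0]
    xn, yn = track[-1]
    # Rigid: the relative position never changes.
    if max(math.hypot(x - x0, y - y0) for x, y in track) < tol:
        return "rigid"
    # Prismatic: all points lie on the line through the first and last point.
    span = math.hypot(xn - x0, yn - y0)
    if span > tol:
        line_err = max(abs((xn - x0) * (y - y0) - (yn - y0) * (x - x0)) / span
                       for x, y in track)
        if line_err < tol:
            return "prismatic"
    # Revolute: all points are equidistant from a common center.
    center = circumcenter(track[0], track[len(track) // 2], track[-1])
    if center is not None:
        radii = [math.hypot(x - center[0], y - center[1]) for x, y in track]
        if max(radii) - min(radii) < tol:
            return "revolute"
    return "unconstrained"

lifted = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
hinged = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
          for a in (0, 30, 60, 90)]
print(classify_articulation(lifted), classify_articulation(hinged))
```

As in the Lego example, a short straight-line track is labeled prismatic even if the bodies are in fact free to move apart, which is why interpreting the motion together with the applied action matters.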

IV-C Object Dynamics Learning and Haptic Property Estimation

Interactive Perception has also made major inroads into the challenge of estimating haptic and inertial properties of objects. Fig. 6 shows a simple example scenario that shall serve to illustrate why IP simplifies the problem. Consider a sphere that is lying on a table. The robot is supposed to estimate the weight of the sphere given different sources of information. We assume that the robot knows the relationship between push force, distance the sphere traveled and sphere weight. In the trivial static scene scenario illustrated in Fig. 6 [Left], the robot is not able to estimate any of the inertial properties. It encounters similar problems as in the previous example (Fig. 5) even if it were able to change the viewpoint.

In Fig. 6 [Center], the robot can observe the motion of the sphere that is pushed by a person. Now, the robot can easily segment the ball from the table due to the additional sensory signal that was not present before (CNS). However, it remains very difficult for the robot to estimate the inertial properties of the sphere because it does not know the strength of the push. Without this information, the known regularity in $S\times A\times t$ cannot be exploited. The robot will only be able to obtain a very uncertain estimate of the sphere weight because it needs to marginalize over all the possible forces the person may have applied.

In Fig. 6 [Right], the robot interacts with the sphere. It can control the push force that is applied and observe the resulting distance at which the sphere comes to rest. Given the knowledge of the strength of the push, it can now exploit the known associations between actions and sensory responses to estimate the sphere’s inertial properties (APR).
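The difference between the observing and the interacting robot can be sketched numerically. The example below assumes a deliberately crude forward model that is not part of the original scenario: the push is an impulse, and the object decelerates under constant sliding friction until it stops (a real sphere would roll). The friction coefficient, the impulse range and all names are hypothetical, and the sketch estimates mass rather than weight.

```python
import math

G = 9.81   # gravitational acceleration in m/s^2
MU = 0.2   # assumed friction coefficient between object and table

def travel_distance(impulse, mass):
    """Forward model: distance travelled until rest after an impulsive push."""
    v0 = impulse / mass
    return v0 ** 2 / (2.0 * MU * G)

def mass_from_push(impulse, distance):
    """Invert the forward model for the mass, given impulse and distance."""
    return impulse / math.sqrt(2.0 * MU * G * distance)

true_mass = 0.5  # kg, unknown to the robot
observed_distance = travel_distance(impulse=1.0, mass=true_mass)

# Interacting robot: it applied the push itself, so the impulse is known.
own_estimate = mass_from_push(1.0, observed_distance)

# Observing robot: the person's impulse is only known to lie in a range,
# so every candidate impulse yields a different mass estimate.
candidate_impulses = [0.2 + 0.1 * i for i in range(19)]  # roughly 0.2 to 2.0 Ns
watched_estimates = [mass_from_push(j, observed_distance)
                     for j in candidate_impulses]

print(f"known push:   {own_estimate:.2f} kg")
print(f"unknown push: {min(watched_estimates):.2f} to "
      f"{max(watched_estimates):.2f} kg")
```

With the action known, the same observation pins down the mass, whereas without it the estimate spreads over the whole range implied by the possible pushes.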

There are several examples that leverage the insight that IP enables the estimation of haptic and inertial properties. For example, it has been shown that surface and material properties of objects can be more accurately estimated if the robot’s haptic sensor is moved along the surface of the object.

Atkeson et al. and Zhang and Trinkle move the object to estimate its inertial properties or other parameters of object dynamics which are otherwise unobservable.

IV-D Object Recognition or Categorization

Approaches to detect object instances or objects of a specific category have to learn the appearance or shape of these objects under various conditions. There are many challenges in object recognition or categorization that make this task very difficult given only a single input image. A method has to cope with occlusions, different lighting conditions and the scale of the images, to name just a few. State-of-the-art approaches in Computer Vision require enormous amounts of training data to handle these variations.

Interactive Perception approaches allow a robot to move objects and hence reveal previously hidden features. Thereby it can resolve some of the aforementioned challenges autonomously and may alleviate the need for enormous amounts of training data. Example approaches that perform object segmentation and categorization can be found in . The challenge of object recognition/categorization has been tackled by Sinapov et al. and Hausman et al.

IV-E Multimodal Object Model Learning

Learning models of rigid, articulated and deformable objects is a central problem in the area of Computer Vision. In the majority of the cases, the model is learned or built from multiple images of the same object or category of objects. Once the model is learned, it can be used to find the object in new, previously unseen contexts.

A robot can generate the necessary data through interaction with the environment. For example, Krainin et al. present an approach where a robot autonomously builds an object model while holding the object in its hand. The object model is completed by executing actions informed by next best view planning. Kenney et al. push an object on the plane and accumulate visual data to build a model of the object.

There are also approaches that build an object model from haptic sensory data, e.g. by Dragiev et al. Allen and Bajcsy, Bohg et al., Ilonen et al. and Björkman et al. show examples that initialize a model from visual data and then further augment it with tactile data. Sinapov et al. present a method where a robot grasps, lifts and shakes objects to build a multi-modal object model.

IV-F Object Pose Estimation

Interactive perception has also been applied to the problem of object pose estimation. Related approaches focus on reducing object pose uncertainty by either touching or moving it.

Koval et al. employ manifold particle filters for this purpose. Javdani et al. use information-theoretic criteria such as information gain to actively reduce the uncertainty of the object pose. In addition to reducing uncertainty, they also provide optimality guarantees for their policy.
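Selecting an action by its expected information gain can be sketched for a small discrete problem. The code below is a generic greedy criterion and not the policy of Javdani et al. It assumes a finite set of pose hypotheses and touch actions whose outcome is deterministic for each hypothesis. The hypothesis labels, the probe names and their outcome tables are invented for illustration.

```python
import math

def entropy(belief):
    """Shannon entropy in bits of a discrete belief {hypothesis: probability}."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0.0)

def expected_information_gain(belief, outcome_model):
    """Expected entropy reduction for an action.

    outcome_model[pose] is the (assumed deterministic) sensor outcome that
    the action would produce if the object were in that pose.
    """
    by_outcome = {}
    for pose, p in belief.items():
        by_outcome.setdefault(outcome_model[pose], {})[pose] = p
    expected_posterior_entropy = 0.0
    for group in by_outcome.values():
        p_outcome = sum(group.values())
        if p_outcome == 0.0:
            continue  # this outcome cannot occur under the current belief
        posterior = {pose: p / p_outcome for pose, p in group.items()}
        expected_posterior_entropy += p_outcome * entropy(posterior)
    return entropy(belief) - expected_posterior_entropy

belief = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
touch_actions = {
    "probe_left":   {"A": "contact", "B": "free", "C": "free", "D": "free"},
    "probe_center": {"A": "contact", "B": "contact", "C": "free", "D": "free"},
    "probe_far":    {"A": "free", "B": "free", "C": "free", "D": "free"},
}

gains = {name: expected_information_gain(belief, model)
         for name, model in touch_actions.items()}
best_action = max(gains, key=gains.get)
print(gains)
print(best_action)
```

A probe that splits the hypotheses evenly is preferred over one that confirms what is already likely, and a probe that never touches the object yields no information at all.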

IV-G Grasp Planning

Cluttered scenes and premature object interactions used to be considered as obstacles for grasp planning that had to be avoided by all means. In contrast, Interactive Perception approaches in this domain take advantage of the robot’s ability to move objects out of the way or to explore them to create more successful plans even in clutter or under partial information.

Hsiao et al. use proximity sensors to estimate the local surface orientation to select a good grasp. Dragiev et al. devise a grasp controller for objects of unknown shape which combines both exploration and exploitation actions. Object shape is represented by a Gaussian Process implicit surface. Exploration of the shape is performed using tactile sensors on the robot hand. Once the object model is sufficiently well known, the hand does not prematurely collide with the real object during grasping attempts.

IV-H Manipulation Skill Learning

In some cases the goal of Interactive Perception is to accomplish a particular manipulation skill. This manipulation skill is generally a combination of some of the pre-specified goals.

To learn a manipulation skill, Kappler et al. and Pastor et al. represent the task as a sequence of demonstrated behaviors encoded in a manipulation graph. This graph provides a strong prior on how the actions should be sequenced to accomplish the task. Lee et al. use a set of kinesthetic demonstrations to learn the right variable-impedance control strategy. Cusumano-Towner et al. propose a planning approach that uses a previously-learned Hidden Markov Model to fold clothes.

The approaches discussed above can be thought of as methods that capture the regularity of complex manipulation behaviors in $S\times A\times t$ by learning them via demonstration.

IV-I State Representation Learning

In the majority of IP approaches, the representations of sensory data and latent variables are pre-specified based on prior knowledge about the system and task. There are, however, some approaches that learn these representations. Most notable are Jonschkowski and Brock, Levine et al. and Wahlström et al. All of them learn some mapping from raw, high-dimensional sensory input (in this case images) to a lower-dimensional state representation. All of these example approaches fix the structure of this mapping, e.g. a linear mapping with task-specific regularizers or Convolutional Neural Networks. The parameters of this mapping are learned from data.

V Taxonomy of Interactive Perception

In this section, we identify a number of important aspects that characterize existing IP approaches. These are additional to the two benefits of CNS and APR and independent of the specific application of an approach. We use these aspects to taxonomize and group approaches in Tables 8 and 9. In the following, each table column is described in detail in a subsection along with example approaches. We use paper sets to refer to groups of similar approaches that address the same application, e.g. either object segmentation or manipulation skill learning. We split paper sets further into approaches that either exploit CNS or APR. We also list papers separately that do not pursue a unique goal, e.g. they perform both Object Segmentation and Recognition.

An IP approach leverages at least one of the two aforementioned benefits: (i) it exploits a novel sensory signal that is due to some time-varying, forceful interaction (CNS) or (ii) also leverages prior knowledge about the regularity in the combined space of sensory data and action parameters over time $S\times A\times t$ for predicting or interpreting this signal (APR).

Approaches that exploit the novel sensory signal (CNS) also rely on regularities in the sensory response to an interaction. In its most basic form, this regularity is usually linked to some assumed characteristic of the environment that thereby restricts the expected response of the world to an arbitrary action. Even more useful to robust perception and manipulation is to also include prior knowledge about the response to a specific interaction (APR).

Existing approaches towards IP cover a broad spectrum of how the possibilities afforded by this combined space $S\times A\times t$ are leveraged as visualized in Fig. 7. On the one end of the spectrum, there are approaches such as visual tracking or optical flow that use very weak priors to regularize the solution space while maintaining generality (e.g. brightness constancy or local motion). In the middle, there are approaches that heavily rely on the regularity in the sensory response to an arbitrary interaction (e.g. rigid body dynamics, motion restricted to a plane or smooth motion). At the very end of the spectrum, there are approaches that leverage both assumptions about environmental constraints and knowledge about the specific interaction, to robustly interpret the resulting sensory signal and enable perceptually-guided behavior. While using this kind of prior knowledge loses generality, it may result in more robust and efficient estimation in a robotics scenario due to a simplification of the solution space. If an approach leverages APR, then it also automatically leverages CNS.

V-A2 Example approaches

We start with approaches that exploit the informative sensory signal that is due to some forceful interaction (CNS). For instance, Fitzpatrick and Metta , Kenney et al. ease the task of visual segmentation and object model learning by making some general assumptions about the environment and thereby about the possible responses to an arbitrary interaction performed by the robot. Example assumptions are that only rigid objects are present in the scene and that motion is restricted to a plane. Although the interaction is carried out by a robot, the available proprioceptive information is not used in the interpretation of the signal. Katz et al. , Sturm et al. , Pillai et al. , Martín Martín and Brock aim at understanding the structure of articulated objects by observing their motion when they are interacted with. While objects are not restricted to be rigid or to only move in a plane, they are restricted to be piece-wise rigid and to move according to some limited set of articulation mechanisms. Approaches by Bergström et al. , Chang et al. , Gupta and Sukhatme , Hausman et al. , Kuzmic and Ude , Schiebener et al. devise different heuristics for selecting actions that generate informative sensory signals. These are used to ease perceptual tasks such as object segmentation or object model learning. Similar to the above, none of the potentially available knowledge about interaction parameters is used to predict their effect.

The aforementioned approaches use vision as the source for informative sensory signals. Chu et al. , Culbertson et al. demonstrate how either unconstrained interactions in a plane or fixed interaction primitives lead to novel haptic sensory signals to ease the learning of material properties.

Other approaches utilize the regularity in $S\times A\times t$ to a much larger extent for easing perception and/or manipulation (APR). For example, Atkeson et al. estimate the dynamics parameters of a robotic arm and the load at the end-effector. This requires a sufficient amount of arm motion, measurements of joint torques, angles, velocities and accelerations, as well as knowledge of the arm kinematics. The appropriate model that predicts arm motion from input motor torques can only be learned given this prior information on the structure of the space $S\times A\times t$ and data from interaction. Sinapov et al. and Sinapov and Stoytchev let a robot interact with a set of objects that are characterized by different attributes such as rigid or deformable, heavy or light, slippery or not. Features computed on the different sensor modalities serve as the basis to learn object similarity. The authors show that this task is eased when the learning process is conditioned on joint torques and the different interaction behaviors. They also use knowledge of the interaction to correlate various sensor modalities in $S\times A\times t$.

Zhang and Trinkle , Koval et al. track object pose using visual and tactile data while a robot is pushing this object on a plane. Zhang and Trinkle solve a non-linear complementarity problem within their dynamics model to predict object motion given the control input. At the same time, they use observations of the object during interaction for estimating parameters of this model such as the friction parameters. Koval et al. assume knowledge of a lower-dimensional manifold that describes the different contact states between a specific object and hand during a push motion. Hypotheses about future object poses are constrained to lie on this manifold. Hausman et al. , Hsiao et al. condition on the action to drive the estimation process. Hsiao et al. estimate the belief state by conditioning the observations on the expected action outcomes. Hausman et al. adopt a similar approach to estimate the distribution of possible articulation models based on action outcomes.

V-B What priors are employed?

To devise an IP system means to interpret and/or deliberately generate a signal in $S\times A\times t$. The regularity of this signal can be programmed into the system as a prior incorporating knowledge of the task, it can be learned from scratch, or the system can pick up these regularities using a mixture of priors and learning. An important component of any IP system is therefore this regularity and how it is encoded and exploited for performing a perception and/or manipulation task.

Interactive Perception requires knowledge of how actions change the state of the environment. Encoding this kind of regularity can be done in a dynamics model i.e. the model for predicting the evolution of the environment after a certain action has been applied. Dependent on the number of objects in the environment, this prediction may be very costly to compute. Furthermore, due to uncertainty and noise in robot-object and object-object interactions, the effects of the interactions are stochastic.
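A stochastic dynamics model of this kind can be sketched as a function that samples the next state given the current state and an action. The example below assumes a point object pushed on a plane with additive Gaussian noise; the function name, the noise level and the action parameterization are illustrative assumptions, not a model taken from the surveyed work.

```python
import numpy as np

def push_dynamics(state, action, rng, noise_std=0.01):
    """Sample the next planar position (x, y) of an object after a push.

    action = (push direction in radians, push length in meters).
    The nominal outcome is a pure translation; noise models uncertain contact.
    """
    angle, length = action
    nominal = state + length * np.array([np.cos(angle), np.sin(angle)])
    return nominal + rng.normal(scale=noise_std, size=2)

rng = np.random.default_rng(0)
state = np.array([0.0, 0.0])
action = (0.0, 0.10)  # push 10 cm along the x-axis

# Predicting the effect of an action means forming a distribution over outcomes.
samples = np.array([push_dynamics(state, action, rng) for _ in range(1000)])
mean_outcome = samples.mean(axis=0)
```

Even in this simplified setting, prediction requires many samples per action, which hints at why the cost grows quickly with the number of interacting objects.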

There are many approaches that rely on priors which simplify the dynamics model and thereby make it less costly to predict the effect of an action. Examples of commonly used priors are the occurrence of only rigid objects (RO), of articulated objects (AO) with a discrete set of links or of only deformable objects (DO). Another prior includes the availability of a set of action primitives (AP) such as push, pull, grasp, etc. These action primitives are assumed to be accurately executed without failure. Many approaches assume that object motion is restricted to a plane (PM) or other simplifications of the scene dynamics (SD), e.g. quasi-static motion during multi-contact interaction between objects. In this section, each prior will be explained in more detail by using one or several example approaches that exploit them.

Of the highlighted priors, some are more commonly used than others. For instance, apart from the papers in paper sets Object Segmentation II, Object Segmentation - Object Recognition II and Haptic Property Estimation II, almost all approaches make assumptions about the nature of objects in the environment, i.e. they assume that all objects present in the environment belong exclusively to one of three classes: rigid, articulated or deformable.

The majority of approaches in Interactive Perception assume that the objects are rigid (RO). Only approaches concerned with estimating an articulation model assume the existence of articulated objects. Similarly, Levine et al. , Cusumano-Towner et al. , Lee et al. in paper set Manipulation Skill Learning are unique in that they are the only ones that deal with the manipulation of deformable objects (DO).

Many approaches in the paper set Object Segmentation I utilize the planar motion prior (PM). In instances such as Gupta and Sukhatme , this prior is used for scene segmentation where all the objects in the scene are assumed to lie on a table plane. In other approaches e.g. in Object Segmentation I, in Multimodal Object Model Learning I and in Multimodal Object Model Learning II the planar motion assumption is used not only for scene segmentation, but also to track the movement of objects in the environment.

Then there are approaches which make additional simplifying assumptions about the dynamics of the system (SD). For instance Koval et al. assume that the object being manipulated has quasi-static dynamics and moves only on a plane (PM). Such an assumption becomes particularly useful in cases where action selection is performed via a multi-step planning procedure because it simplifies the forward prediction of object motion.

There are approaches that learn a dynamics model of the environment given an action. Some of these let the robot learn this model autonomously through trial and error. Early approaches in this direction are by Christiansen et al. and Metta and Fitzpatrick, who learn a simple mapping from the current state and action to a most likely outcome. The former demonstrate this in a tray-tilting task for bringing the object lying on the tray into a desired configuration. The latter demonstrate their approach in an object pushing behavior and learn the response of an object to a certain push direction. Both model the non-determinism of the object's response to an action. More recent approaches are presented by Levine et al., Han et al. and Wahlström et al., where the authors learn the mapping from the current state to the next best action in a policy search framework. Kappler et al., Pastor et al. and Lee et al. bootstrap the trial-and-error search process by demonstrating actions.
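A simple mapping from state and action to a most likely outcome, which also keeps track of non-determinism, can be sketched with outcome counts collected during trial and error. The class below and the tray-like state and action labels are hypothetical and serve only to illustrate the idea; they do not reproduce any of the cited systems.

```python
from collections import Counter, defaultdict

class OutcomeModel:
    """Count-based model of p(outcome | state, action) learned from trials."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, state, action, outcome):
        self.counts[(state, action)][outcome] += 1

    def distribution(self, state, action):
        counter = self.counts[(state, action)]
        total = sum(counter.values())
        return {o: n / total for o, n in counter.items()} if total else {}

    def most_likely(self, state, action):
        counter = self.counts[(state, action)]
        return counter.most_common(1)[0][0] if counter else None

model = OutcomeModel()
# Hypothetical trials: (object location, tilt action, observed new location).
trials = [("left", "tilt_right", "right")] * 3 + [("left", "tilt_right", "center")]
for state, action, outcome in trials:
    model.update(state, action, outcome)

prediction = model.most_likely("left", "tilt_right")
outcome_probs = model.distribution("left", "tilt_right")
```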

V-B2 Priors on the Observations

Regularities can also be encoded in the observation model that relates the state of the system to the raw sensory signals and can thereby predict the observation given the current state estimate. Only if this relationship is known can an IP robot gain information from observations. This information may concern some quantity of interest that needs to be estimated, or it may directly provide the distance to some goal state.
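The role of the observation model in gaining information can be illustrated with a discrete Bayes update. In the sketch below, the two joint-type hypotheses, the observation labels and all probabilities are made-up values chosen for illustration.

```python
def bayes_update(prior, likelihood, observation):
    """Return the posterior p(state | observation) for a discrete state set.

    likelihood[state][observation] = p(observation | state).
    """
    unnormalized = {s: prior[s] * likelihood[s][observation] for s in prior}
    evidence = sum(unnormalized.values())
    return {s: p / evidence for s, p in unnormalized.items()}

# Hypothetical observation model: how likely each joint type is to produce
# an arc-shaped or a straight-line trajectory of a tracked feature.
prior = {"revolute": 0.5, "prismatic": 0.5}
likelihood = {
    "revolute": {"arc": 0.8, "line": 0.2},
    "prismatic": {"arc": 0.1, "line": 0.9},
}
posterior = bayes_update(prior, likelihood, "arc")
```

If both rows of `likelihood` were identical, the posterior would equal the prior: without a discriminative observation model, the observation carries no information about the state.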

Traditionally, the relationship between the state and raw sensory signals is hand-designed based on expert knowledge. Examples are models of multi-view or perspective geometry for camera sensors. Often, approaches also assume access to an object database (OD) that allows them to predict how the objects will be observed through a given sensor, e.g. Chu et al.

More recently, an increasing number of approaches learn a suitable, task-specific state representation directly from observations. Examples include Jonschkowski and Brock, Levine et al. and Wahlström et al., who each use raw pixel values as input and learn the lower-dimensional representation jointly with the policy that maps these learned states to actions. Jonschkowski and Brock achieve this by introducing a set of hand-defined priors in a loss function that is minimal if the state representation best matches these priors. The mapping from raw pixels to the lower-dimensional representation is linear. Levine et al. map the raw pixel values through a non-linear Convolutional Neural Network (CNN) to a set of feature locations in the image and initialize the weights for an object pose estimation task. Both the type of function approximator (CNN encoding receptive fields) and the data for initialization can be seen as a type of prior. Wahlström et al. use an autoencoder framework in which they not only minimize the reconstruction error from the low-dimensional space back to the original space but also optimize the consistency in the latent, low-dimensional space.

In the case where the mapping between state and observation is hand-designed, the state usually refers to some physical quantity. When the state representation is learned, it is not as easily interpretable.

V-C Does the approach perform action selection?

Knowledge about the structure of $S\times A\times t$ can also be exploited to select appropriate actions. A good action will reveal as much information as possible and at the same time bring the system as close as possible to the manipulation goal. If we know something about the structure of $S\times A\times t$ , we can perform action selection so as to make the resulting sensor information as meaningful as possible. The agent must balance between exploration (performing an action to improve perception as much as possible) and exploitation (performing an action that maximizes progress towards the manipulation goal).

For optimal action selection, the IP agent needs to know a policy that given the current state estimate returns the optimal action or sequence of actions to take. Here, optimal means that the selected actions yield a maximum expected reward to the IP agent. The specific definition of the reward function heavily depends on the particular task of the robot. If it is a purely perceptual task, actions are often rewarded when they reduce the uncertainty about the current estimate (exploration), e.g. van Hoof et al. . If the task is a manipulation task, actions may be rewarded that bring you closer to a goal (exploitation), e.g. Levine et al. .

Finding this policy is one of the core problems for action selection. Its formalization depends on whether the state of the dynamical system is directly observable or whether it needs to be estimated from noisy observations. It also depends on whether the dynamics model is deterministic or stochastic.

V-C2 Dynamics Model

Knowing the dynamics model is even more important for action selection than for improving perception. It allows the agent to predict the effect of an action on the quantity of interest and thereby the expected reward. A common way to find the optimal sequence of actions that maximizes reward under deterministic dynamics is forward or backward value iteration.

As mentioned earlier, a realistic dynamics model should be stochastic to account for uncertainty in sensing and execution. In this case, to find the optimal sequence of actions the agent has to form an expectation over all the possible future outcomes of an action. The dynamical system can then be modeled as an MDP. Finding the optimal sequence of actions can be achieved through approaches such as value or policy iteration .
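Value iteration for a small MDP with stochastic transitions can be sketched as follows. The three-state pushing task, the transition probabilities, the reward and the discount factor are assumptions chosen for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Solve a finite MDP by value iteration.

    P[a, s, s2] = p(s2 | s, a); R[s, a] = expected immediate reward.
    Returns the optimal state values and a greedy policy.
    """
    V = np.zeros(P.shape[1])
    while True:
        # Expectation over all possible outcomes of each action.
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Hypothetical task: states 0 -> 1 -> 2 (goal, absorbing).
# Action 0 = wait, action 1 = push, which advances the object with prob. 0.8.
P = np.zeros((2, 3, 3))
P[0] = np.eye(3)
P[1, 0] = [0.2, 0.8, 0.0]
P[1, 1] = [0.0, 0.2, 0.8]
P[1, 2] = [0.0, 0.0, 1.0]
R = np.zeros((3, 2))
R[1, 1] = 0.8  # expected reward of 1 for reaching the goal from state 1

values, policy = value_iteration(P, R)
```

The resulting policy pushes in states 0 and 1; in the absorbing goal state both actions are equivalent.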

In an MDP, we assume that the state of the system is directly observable. However in a realistic scenario, the robot can only observe its environment through noisy sensors. This can be modeled with a POMDP where the agent has to maintain a probability distribution over the possible states, i.e. the belief, based on an observation model. For most real-world problems, it is intractable to find the optimal policy of the corresponding POMDP. Therefore, there exist many methods that find approximate solutions to this problem .
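For a discrete state set, maintaining the belief amounts to a recursive Bayes filter. The notation below is an assumption introduced for illustration: $x$ denotes a state from a finite set $\mathcal{X}$, $a_t$ the action, $o_{t+1}$ the resulting observation, $b_t$ the belief and $\eta$ a normalization constant.

```latex
b_{t+1}(x') \;=\; \eta \; p(o_{t+1} \mid x') \sum_{x \in \mathcal{X}} p(x' \mid x, a_t)\, b_t(x),
\qquad
\eta^{-1} \;=\; \sum_{x' \in \mathcal{X}} p(o_{t+1} \mid x') \sum_{x \in \mathcal{X}} p(x' \mid x, a_t)\, b_t(x)
```

The sum propagates the belief through the dynamics model, and the factor $p(o_{t+1} \mid x')$ corrects it with the observation model. A policy for the POMDP maps such beliefs, rather than states, to actions, which is one way to see why exact solutions become intractable.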

PSRs are another formalism for action selection. Here, the system dynamics are represented directly by observable quantities in the form of a set of tests instead of over some latent state representation as in POMDPs .

V-C3 Planning Horizon

Action selection methods can be categorized based on the number of steps they look ahead in time. There are approaches that have a single step look ahead which are called myopic or greedy (M). Here the agent’s actions are optimized for rewards in the next time step given the current state of the system. Most approaches to interactive perception that exploit the knowledge of the outcome of an action in $S\times A\times t$ are myopic (M). Myopic approaches do not have to cope with the evolution of complex system dynamics or observation models beyond a single step. Hence this considerably reduces the size of the possible solution space. Examples of such approaches can be seen in paper sets Object Segmentation II , in Articulation Model Estimation II and in the paper set Pose Estimation.
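A myopic strategy for a purely perceptual task can be sketched as choosing the action with the largest expected one-step reduction in belief entropy. The belief, the two actions and the observation probabilities below are hypothetical values; expected information gain is used here as one possible reward, not as the criterion of any particular cited approach.

```python
import math

def entropy(belief):
    """Shannon entropy (in nats) of a discrete belief {state: probability}."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0.0)

def expected_information_gain(belief, obs_model, action):
    """Expected entropy reduction after executing one action.

    obs_model[action][state][obs] = p(obs | state, action).
    """
    observations = next(iter(obs_model[action].values())).keys()
    gain = entropy(belief)
    for obs in observations:
        joint = {s: belief[s] * obs_model[action][s][obs] for s in belief}
        p_obs = sum(joint.values())
        if p_obs > 0.0:
            posterior = {s: p / p_obs for s, p in joint.items()}
            gain -= p_obs * entropy(posterior)
    return gain

def greedy_action(belief, obs_model):
    """One-step look-ahead: pick the most informative action."""
    return max(obs_model, key=lambda a: expected_information_gain(belief, obs_model, a))

belief = {"revolute": 0.5, "prismatic": 0.5}
obs_model = {
    "pull": {"revolute": {"arc": 0.9, "line": 0.1},
             "prismatic": {"arc": 0.1, "line": 0.9}},
    "look": {"revolute": {"arc": 0.5, "line": 0.5},
             "prismatic": {"arc": 0.5, "line": 0.5}},
}
best = greedy_action(belief, obs_model)
```

Only a single step is simulated per action, so the cost grows linearly with the number of actions and observations rather than exponentially with the horizon.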

Then, there are approaches which look multiple steps ahead in time to inform their action selection process. These multi-step look-ahead solutions decide an optimal course of action also based on the current state of the system. The time horizon for these multi-step look-aheads can either be fixed or variable. In either case, the time horizons are generally dictated by a budget, examples of which include computational resources, uncertainty about the current state, costs associated with the system, etc. For instance, a popular multi-step look ahead approach relies on the assumption that the maximum likelihood estimate (MLE) observation will be obtained in the future. This way, one can predict the behavior of the system within the time horizon and use it to select an action. Overall we label such approaches to action selection as planning horizon approaches (PH). Examples of these approaches include .

Another set of methods tries to find global policies that specify the action to apply at any point in time for any state. We categorize such approaches as methods that have global policies (GP). Among these, there are approaches that take into account all possible distributions over the state space (beliefs) and offer globally optimal policies. These policies account for stochastic belief dynamics, i.e. they maintain probabilities over the possible current states and the probable outcomes of an action. Such problems are often formulated as POMDPs. In practice, exact solutions to these problems are intractable, so they are often solved by approximate offline methods. Javdani et al. and Koval et al. demonstrate such an approach to action selection for interactive perception. Another way of finding global policies is reinforcement learning, which provides a methodology to improve a policy over time. Examples of specific policy search methods are presented by Levine et al., Han et al. and Levine et al.

Apart from planning-based approaches that perform action selection, there are approaches that focus on low-level control. In these approaches, the control input for the next cycle is computed online based on a global control law. We also classify these methods as global-policy (GP) approaches, as they compute the next control input based on a control law that is global, e.g. the feedback matrix in Linear Gaussian Controllers. The actions are generated at a high frequency and operate on low-level control commands. Examples of these approaches include .
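One common form of such a control law is the time-varying linear-Gaussian controller. The notation is an assumption introduced for illustration: $x_t$ is the state estimate, $u_t$ the control input, $K_t$ the feedback matrix, $k_t$ a feedforward term and $\Sigma_t$ the covariance of the control noise.

```latex
u_t \;=\; K_t\, x_t + k_t + \varepsilon_t,
\qquad
\varepsilon_t \sim \mathcal{N}(0, \Sigma_t)
```

Computing $u_t$ requires only one matrix-vector product per control cycle, which is consistent with the high control frequencies mentioned above.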

V-C4 Granularity of Actions

Action selection can be performed at various granularities. For example, a method may either select the next best control input or an entire high-level action. The next best controls can be low-level motor torques that are sent to the robot in the next control cycle. The corresponding action selection loop is executed at a very high frequency and is dependent on the immediate feedback from different sensors .

High-level action primitives are generally used in approaches that do not require reasoning about fine motor control such as pushing or grasping actions that are represented by motion primitives. In such cases, reasoning about observations is purely dependent on the outcome of high-level actions. There are numerous approaches that utilize high-level actions for interactive perception. Examples include: Barragán et al. and the following authors in paper set Object Segmentation I: Fitzpatrick and Metta , Metta and Fitzpatrick , Kenney et al. and Bergström et al. , Sturm et al. , Pillai et al. , Martín Martín and Brock .

V-D What is the objective: Perception, Manipulation or Both?

Approaches to Interactive Perception may pursue a perception or a manipulation goal and in some cases both (see Fig. 4). Object segmentation, recognition and pose estimation, multi-modal object model learning and articulation model estimation are examples of areas where interactive perception is utilized to service perception.

Then there are interactive perception approaches whose primary objective is to achieve a manipulation goal (e.g. grasping or learning manipulation skills). For instance, Kappler et al. and Pastor et al. exploit regularities in $S\times A\times t$ to enable better action selection. The robot compares the observed perceptual signal with the signal expected under the current manipulation primitive. It then picks controls that drive the system towards the expected signal. Similarly, Koval et al., Kaelbling and Lozano-Pérez, and Platt et al. exploit the regularities in $S\times A\times t$ to facilitate task-oriented grasping, i.e. locating and grasping an object of interest.

The final thread of interactive perception approaches combines perception and manipulation. For instance, Dragiev et al. and Koval et al. simultaneously improve perception (object model reconstruction or pose estimation, respectively) and select better actions under uncertainty (efficient grasping). In Jain and Kemp and Karayiannidis et al. in paper set Articulation Model Estimation II, the knowledge about the regularity in both the observations and dynamics in $S\times A\times t$ is used to improve articulation model estimation as well as to enable better control. In the case of Karayiannidis et al., the control input is directly incorporated into the state estimation procedure. In contrast, Jain and Kemp use the position of the end effector in the articulation mechanism estimation. The manipulation goal in both these approaches is to enable a robot to open doors and drawers.

V-E Are multiple sensor modalities exploited?

Some approaches exploit multiple modalities in the $S\times A\times t$ space, whereas other approaches restrict themselves to a single informative modality. The various sensing modalities can be broadly categorized into contact and non-contact sensing. Examples of non-contact sensing include vision, proximity sensors, sonar, etc. Contact sensing is primarily realized via tactile sensors and force-torque sensors. Approaches that only use tactile sensing include the works of Chu et al. , Koval et al. , Javdani et al. . There are also approaches that use both contact and non-contact sensing to inform the signal in the $S\times A\times t$ space. These include some of the works listed in paper sets Articulation Model Estimation II, Pose Estimation - Object Dynamics Learning II, Multimodal Object Model Learning I & II and Manipulation Skill Learning in Tables 8 and 9.

V-F How is uncertainty modeled and used?

In Interactive Perception tasks, there are many sources of uncertainty about the quantity of interest. One of them is the noisy sensors through which an agent can only partially observe the current state of the world. Another is the dynamics of the environment in response to an interaction. Some approaches towards Interactive Perception model this uncertainty in either their observations and/or the dynamics model of the system. Depending on their choice, there are a wide variety of options for estimating the quantity of interest from a signal in $S\times A\times t$ . For updating the current estimate, some approaches use recursive state estimation and maintain a full posterior distribution over the variable of interest, e.g. . Others frame their problem in terms of energy minimization in a graphical model and only maintain the maximum a posteriori (MAP) solution from frame to frame, e.g. . An MLE of the variable of interest is computed in approaches that do not maintain a distribution over possible states. Examples are clustering methods that assign fixed labels to the variable of interest. More recently non-parametric approaches have also been utilized. For instance Boularias et al. use kernel density estimation.
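As an illustration of a non-parametric estimate, the sketch below implements a one-dimensional Gaussian kernel density estimator. The sample values, the bandwidth and the function name are assumptions; the snippet shows the general technique and is not the specific estimator used by Boularias et al.

```python
import numpy as np

def gaussian_kde_1d(samples, query, bandwidth):
    """Evaluate a Gaussian kernel density estimate at the query points."""
    samples = np.asarray(samples, dtype=float)
    query = np.asarray(query, dtype=float)
    # One kernel per sample, evaluated at every query point.
    scaled = (query[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * scaled ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical measurements of a scalar quantity, e.g. a friction coefficient.
samples = [0.10, 0.12, 0.50]
grid = np.linspace(-1.0, 2.0, 3001)
density = gaussian_kde_1d(samples, grid, bandwidth=0.05)
total_mass = density.sum() * (grid[1] - grid[0])
```

Unlike a single MAP or MLE value, the estimate retains both modes of the data, at the cost of storing all samples.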

Methods that model uncertainty of the variable(s) of interest can cope better with noisy observations or dynamics, but they become slower to compute as the size of the solution space grows. This creates a natural trade-off between modeling uncertainty and computational speed. The above choices also have implications for action selection. If we maintain a full distribution over the quantity of interest, then computing a policy that takes the stochasticity in the dynamics and observation models into account is generally intractable . If an approach assumes a known state, the dynamical system can also be modeled by an MDP with stochastic dynamics given an action. The least computationally demanding model for action selection is the one that neglects any noise in the observations or dynamics. However, it might also be the least robust depending on the true variance in the real dynamical system that the agent tries to control.

Based on the above, we propose five labels for IP approaches with respect to how they model and incorporate uncertainty in estimation and manipulation tasks. Approaches that assume deterministic dynamics are labeled (DDM), stochastic dynamics (SDM), deterministic observations (DOM) and stochastic observations (SOM); approaches that estimate uncertainty are labeled (EU).

Fitzpatrick and Metta , Metta and Fitzpatrick , Kenney et al. propose example approaches that assume no stochasticity in the system, and model both the dynamics and observations deterministically. Then there are approaches that assume deterministic observations but do not model the dynamics at all. These are listed in paper set Object Segmentation I which include the works of Chang et al. , Gupta and Sukhatme , Hausman et al. , Kuzmic and Ude , Schiebener et al. . Then there are approaches that model only stochastic observations but no dynamics because they assume that the environment is static upon interaction, e.g. Hsiao et al. . Most approaches that assume both stochastic dynamics and observations have some form of uncertainty estimation technique implemented to account for the stochasticity in the system. An approach that assumes stochasticity in its observations but does not estimate uncertainty is Chu et al. . Here the authors train a max-margin classifier to assign labels to stochastic observations.

VI Discussion and Open Questions

If Interactive Perception is about merging perception and manipulation into a single activity then the natural question arises of how to balance these components. When have manipulation actions (that are in service of perception) elicited sufficient information about the world such that manipulation actions can succeed that are in service of a manipulation goal? This question bears significant similarities with the exploration/exploitation trade-off encountered in reinforcement learning. One can further ask: how can manipulation actions be found that combine these two objectives—achieving a goal and obtaining information—in such a way that desirable criteria about the resulting sequence of actions (time, effort, risk, etc.) are optimized?

When performing manipulation tasks, humans aptly combine different sources of information, including prior knowledge about the world and the task, visual information, haptic feedback, and acoustic signals. Research in Interactive Perception is currently mostly concerned with visual information. New algorithms are necessary to extend IP towards a multi-modal framework, where modalities are selected and balanced so as to maximally inform manipulation with the least amount of effort, while achieving a desired degree of certainty. Furthermore, for every sensory channel, one might differentiate between passively (e.g. just look), actively (e.g. change vantage point to look), and interactively (e.g. observe interaction with the world) acquired information. Each of these is associated with a different cost but also with a different expected information gain. In addition to adequately mining information from multiple modalities, Interactive Perception must be able to decide in which of these different ways the modality should be leveraged.

Also at the lower levels of perception significant changes might be required. It is conceivable that existing representations of sensory data are not ideal for Interactive Perception. Given the focus on dynamic scenes with multiple moving objects, occlusions, lighting changes, and new objects appearing and old ones disappearing — does it make sense to tailor visual features and corresponding tracking methods to the requirements of Interactive Perception? Are there fundamental processing steps, similar to edge or corner detection, that are highly relevant in the context of Interactive Perception but have not seen a significant need in other applications of computer vision? The same for haptic or acoustic feedback: when combined with other modalities in the context of Interactive Perception, what might be the right features or representations we should focus on?

VI-B A Framework for Interactive Perception?

All of the aforementioned arguments indicate that Interactive Perception might require a departure from existing perception frameworks, as they can be found in applications outside of robotics, such as surveillance, image retrieval, etc. In Interactive Perception, manipulation is an integral component of perception. The perceptual process must continuously trade off multiple sensor modalities that might each be passive, active, or interactive. There is no stand-alone perceptual process and not only a single aspect of the environment that must be extracted from the sensor stream as the optimization objectives may change when the robot faces different tasks over its lifetime.

After reviewing existing work in the field, we conclude that there is as yet no framework that can address all the challenges in Interactive Perception. There are, however, candidates that represent the regularity in $S\times A\times t$ in a way that caters to a particular challenge encountered in IP. For instance, Krüger et al. present a concept for symbolically representing continuous sensory-motor experience: Object-Action Complexes (OACs). The concept's current instantiations focus on learning and detecting affordances, which describe the relationship between a certain situation (often including an object) and the action that it allows.

Other popular formalisms lend themselves particularly well to the problem of optimal action selection (see Section V-C). Examples include MDPs, POMDPs, PSRs or multi-armed bandits. They rely on different assumptions (e.g. the Markov assumption, observable state) and make different algorithmic choices (e.g. probabilistic modeling). Approaches that rely on these decision-making frameworks often assume the availability of transition, observation and reward functions and the possibility to analytically compute the optimal action.

For complex real-world problems this is often not the case and information about the world can only be collected through interaction. The data collected in this way is then used to update the relevant models. The problem of selecting the next best action may be based on submodularity , the variance in a Gaussian Process or the Bhattacharyya coefficient between two normal distributions .
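The Bhattacharyya coefficient between two univariate normal distributions has a closed form, sketched below. How the coefficient enters an action-selection criterion is not specified in the text; the usage shown, which prefers the action whose predicted outcomes under two competing hypotheses overlap least, is a hypothetical example with made-up numbers.

```python
import math

def bhattacharyya_coefficient(mu1, sigma1, mu2, sigma2):
    """Overlap in [0, 1] between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    distance = (0.25 * (mu1 - mu2) ** 2 / var_sum
                + 0.5 * math.log(var_sum / (2.0 * sigma1 * sigma2)))
    return math.exp(-distance)

# Hypothetical predictions (mean, std) of a measured displacement under two
# competing hypotheses, for each candidate action.
predictions = {
    "push_soft": ((0.02, 0.01), (0.03, 0.01)),
    "push_hard": ((0.05, 0.01), (0.12, 0.01)),
}
overlap = {
    action: bhattacharyya_coefficient(*hyp_a, *hyp_b)
    for action, (hyp_a, hyp_b) in predictions.items()
}
# The action with the smallest overlap discriminates best between hypotheses.
most_discriminative = min(overlap, key=overlap.get)
```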

Reinforcement learning is also a common choice to learn a policy for action selection under these complex conditions. Many approaches assume the availability of some reliable state estimator (e.g. by using motion capture or marker-based systems) where the state is of relatively low dimension and hand-designed. Particularly relevant to Interactive Perception are recent approaches that directly learn a state representation from data and employ reinforcement learning on this learned state representation .

All these formalisms have been used to solve particular subproblems encountered in the context of Interactive Perception. We do not claim that this list is complete. However, the wealth of very different approaches suggests that there is currently not one framework for IP that can address all the relevant challenges. It is an open question what such a framework would be and how it could enable coordinated progress by developing adequate subcomponents.

VI-C New Application Areas

The majority of the work included in this survey is concerned with Interactive Perception for manipulating and grasping objects in the environment. In the context of the recent DARPA Robotics Challenge (DRC), we have also seen a need to bridge the gap between perception and action in whole-body, multi-contact motion planning and control. The ability to physically explore unstructured environments (such as those encountered in disaster sites) is of utmost importance for the safety and robustness of a robot. Probing and poking with the legs as well as the hands can help extract more information. Currently, these robots rely extensively on teleoperation and carefully designed user interfaces. We argue that they can achieve a much higher degree of autonomy if they rely on Interactive Perception.

VII Summary

This survey paper provides an overview of the current state of the art in Interactive Perception research. In addition to presenting the benefits of IP, we discuss various criteria for categorizing existing work. We also include a set of problems, such as object segmentation, manipulation skills and object dynamics learning, that are commonly eased using concepts of Interactive Perception.

We identify and define the two main aspects of Interactive Perception: (i) any type of forceful interaction with the environment creates a new type of informative sensory signal that would otherwise not be present, and (ii) any prior knowledge about the nature of the interaction supports the interpretation of the signal in the Cartesian product space of $S\times A\times t$. We use these two crucial aspects of IP as criteria for deciding whether to include a paper as related. Furthermore, we compare IP to existing perception approaches and name a few formalisms that can capture an IP problem.

We hope that this taxonomy helps to establish benchmarks for comparing various approaches and to identify open problems.

Acknowledgment

The authors would like to thank the anonymous reviewers for their insightful comments and all the cited authors who provided feedback upon our request. They would also like to sincerely thank Aleksandra Waltos for providing the visuals in Figures 5 and 6.

This research is supported in part by National Science Foundation grants IIS-1205249, IIS-1017134, EECS-0926052, the Office of Naval Research, the Okawa Foundation, and the Max-Planck-Society. It is also supported by grant BR 2248/3-1 by the German Science Foundation (DFG), and grant H2020-ICT-645599 on Soft Manipulation (Soma) by the European Commission. The authors would also like to thank the Swedish Research Council and the Swedish Foundation for Strategic Research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding organizations.