Deep Spatial Autoencoders for Visuomotor Learning

Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, Pieter Abbeel

I Introduction

One of the fundamental challenges in applying reinforcement learning to robotic manipulation tasks is the need to define a suitable state space representation. Typically, this is handled manually, by enumerating the objects in the scene, designing perception algorithms to detect them, and feeding high-level information about each object to the algorithm. However, this highly manual process makes it difficult to apply the same reinforcement learning algorithm to a wide range of manipulation tasks in complex, unstructured environments. What if a robot could automatically identify the visual features that might be relevant for a particular task, and then learn a controller that accounts for those features? This would amount to automatically acquiring a vision system that is suitable for the current task, and would allow a range of object interaction skills to be learned with minimal human supervision. The robot would only need to see a glimpse of what task completion looks like, and could then figure out on its own how to change the scene into that configuration.

Directly learning a state space representation from raw sensory signals, such as images from a camera, is an active area of research , and while considerable progress has been made in recent years , applications to real robotic systems remain exceedingly difficult. The difficulties stem from two challenges. First, learning a good representation with unsupervised learning methods, such as deep learning, often requires a very large amount of data . Second, arbitrary learned visual representations do not always lend themselves well to control. We address these challenges by using a spatial autoencoder architecture to learn a state representation that consists of feature points. Intuitively, these feature points encode the configurations of objects in the scene, and the spatial autoencoder that we describe provides for data-efficient learning by minimizing the number of non-convolutional parameters in the encoder network. Since our learned feature points correspond to object coordinates, this architecture also addresses the second challenge, since real-valued quantities such as positions are more amenable to control than the types of discrete or sparse features that are more commonly learned with deep learning methods .

In fact, our experiments show that our learned feature point representation can be used effectively in combination with an efficient trajectory-centric reinforcement learning algorithm. This method iteratively fits time-varying linear models to the states visited at the previous iteration , estimating how the robot’s actions (which correspond to joint torques) affect the state. When the state includes visual feature points, this method can learn how the robot’s actions affect the objects in the world, and the trained controllers can perform closed-loop control on the configuration of these objects, allowing the robot to move and manipulate them. Furthermore, this reinforcement learning algorithm can easily be combined with guided policy search to learn complex, nonlinear policies that generalize effectively to various initial states , which we demonstrate by training a nonlinear neural network policy on top of our learned representation.

Our main contribution is a method for learning vision-based manipulation skills by combining unsupervised learning using deep spatial autoencoders with simple, sample-efficient trajectory-centric reinforcement learning. We demonstrate that this approach can be used to learn a variety of manipulation skills that require “hand-eye” coordination, including pushing a free-standing toy block, scooping objects into a bowl, using a spatula to lift a bag of rice from the table (shown in Figure 1), and hanging a loop of rope on hooks at various positions. Each of these tasks are learned using the same algorithm, with no prior knowledge about the objects in the scene, and using only the onboard sensors of the PR2 robot, which consist of joint encoders and an RGB camera. We also compare our approach to a number of baselines that are representative of previously proposed visual state space learning methods.

II Related Work

Reinforcement learning and policy search methods have been applied to perform a variety of robotic tasks, such as table tennis , object manipulation , and locomotion . One key challenge to using RL successfully is handling high-dimensional sensory observations needed to perform sensorimotor control. A standard approach is to hand-design low-dimensional features from the observations; however, this requires manual task-specific engineering, which undermines the goal of using an automatic learning-based approach. Other approaches instead use function approximation to replace the value function in standard RL algorithms like TD-learning and approximate Q-learning . These approaches typically require large amounts of time and data to learn the parameter space. In this work, we use the approach of first learning a low-dimensional state representation from observations via unsupervised learning, and then use a data-efficient RL algorithm with the learned state representation to learn robotic manipulation tasks involving perception.

Prior work has proposed to learn representations using general priors from the physical world or the ability to predict future observations . Representation learning has been applied to control from high-dimensional visual input in several recent works. Lange et al. use autoencoders to acquire a state space for Q-learning and evaluate on simple real-world problems with 2 to 3 dimensions. Several methods also train autoencoders that can predict the next image, effectively learning dynamics , though such methods have only been demonstrated on toy problems with synthetic images. The quantity of data required to apply deep learning methods to real-world images is known to far exceed that of toy tasks , and prior methods generally avoid this issue by using synthetic images. We also use an autoencoder, but we use a spatial architecture that allows us to acquire a representation from real-world images that is particularly well suited for high-dimensional continuous control. Our architecture is also data-efficient and can handle relatively small datasets. We demonstrate our method on a range of real-world robotic tasks.

Other methods have been proposed to train policies from high-dimensional observations, without first acquiring a low-dimensional state space . However, running these methods on a real robotic platform requires either impractical amounts of data or an instrumented training setup that provides knowledge about relevant objects at training time . Our method does not require this knowledge, which enables it to learn tasks such as pushing, tossing, and scooping without using any additional instrumentation. In fact, our method has no knowledge of which objects make up the scene at all. These tasks cannot be learned using our prior method without mechanisms such as motion capture.

The goal of our method is similar to that of visual servoing, which performs feedback control on image features . However, our controllers and visual features are learned entirely from real-world data, rather than being hand-specified. This gives our method considerable flexibility in how the visual input can be used and, as a result, the tasks that it can learn to perform. Furthermore, our approach does not require any sort of camera calibration, in contrast to many visual servoing methods (though not all – see e.g. ).

Most deep learning architectures focus on semantic representation learning, e.g. . Some spatial network architectures have been proposed for tasks such as video prediction and dynamics modeling . The architecture that we use for the encoder in our unsupervised learning method is based on a spatial softmax followed by an expectation operation, which produces a set of spatial features that correspond to points of maximal activation in the last convolutional layer. We previously proposed this type of architecture in the context of policy search for visuomotor policies . However, in our prior work, this type of network was trained with supervised learning, regressing directly to joint torques. Obtaining these joint torque targets requires being able to solve the task, which requires a representation that is already sufficient for the task. In this paper, we show how the representation can be learned from scratch using unsupervised learning, which extends the applicability of this architecture to a much wider range of tasks where a suitable representation is not known in advance.

III Preliminaries

In this paper, we are primarily concerned with the task of learning a state representation for reinforcement learning (RL) from camera images of robotic manipulation skills. However, we consider this problem in the context of a specific RL algorithm that is particularly well suited to operate on our learned state representations. We describe the RL method we use in this background section, while the Section V describes our representation learning method. The derivation in this section follows prior work , but is repeated here for completeness.

The controller can be optimized using a variety of model-based and model-free methods . We employ a trajectory-centric algorithm that optimizes time-varying linear-Gaussian controllers, which can be thought of as a trajectory with time-varying linear stabilization . While linear-Gaussian controllers are simple, they admit a very efficient optimization procedure that works well even under unknown dynamics. This method is summarized in Algorithm 1. At each iteration, we run the current controller p(utxt)p(\mathbf{u}_{t}|\mathbf{x}_{t}) on the robot to gather NN samples (N=5N=5 in all of our experiments), then use these samples to fit time-varying linear-Gaussian dynamics of the form p(xt+1xt,ut)=N(fxtxt+futut+fct,Ft)p(\mathbf{x}_{t+1}|\mathbf{x}_{t},\mathbf{u}_{t})=\mathcal{N}(f_{\mathbf{x}t}\mathbf{x}_{t}+f_{\mathbf{u}t}\mathbf{u}_{t}+f_{ct},\mathbf{F}_{t}). This is done by using linear regression with a Gaussian mixture model prior, which makes it feasible to fit the dynamics even when the number of samples is much lower than the dimensionality of the system . We also compute a second order expansion of the cost function around each of the samples, and average the expansions together to obtain a local approximate cost function of the form

When the cost function is quadratic and the dynamics are linear-Gaussian, the optimal time-varying linear-Gaussian controller of the form p(utxt)=N(Ktxt+kt,Ct)p(\mathbf{u}_{t}|\mathbf{x}_{t})=\mathcal{N}(\mathbf{K}_{t}\mathbf{x}_{t}+\mathbf{k}_{t},\mathbf{C}_{t}) can be obtained by using the LQR method. This type of iterative approach can be thought of as a variant of iterative LQR , where the dynamics are fitted to data. In our implementation, we initialize p(utxt)p(\mathbf{u}_{t}|\mathbf{x}_{t}) to be a random controller that is centered around the initial state with low-gain PD feedback, in order to avoid unsafe configurations on the real robot.

One crucial ingredient for making this approach work well is to limit the change in the controller p(utxt)p(\mathbf{u}_{t}|\mathbf{x}_{t}) at each iteration, since the standard iterative LQR approach can produce a controller that is arbitrarily far away from the previous one, and visits parts of the state space where the fitted dynamics are no longer valid. To that end, our method solves the following optimization problem at each iteration:

where pˉ(τ)\bar{p}(\tau) is the trajectory distribution induced by the previous controller. Using KL-divergence constraints for controller optimization was proposed in a number of prior works . In the case of linear-Gaussian controllers, we can use a modified LQR algorithm to solve this problem. We refer the reader to previous work for details .

III-B Learning Nonlinear Policies with Guided Policy Search

Linear-Gaussian controllers are easy to train, but they cannot express all possible control strategies, since they essentially encode a trajectory-following controller. To extend the applicability of linear-Gaussian controllers to more complex tasks, prior work has proposed to combine it with guided policy search , which is an algorithm that trains more complex policies, such as deep neural networks, by using supervised learning. The supervision is generated by running a simpler algorithm, such as the one in the previous section, from multiple initial states, generating multiple solutions. In our experimental evaluation, we combine linear-Gaussian controllers with guided policy search to learn a policy for hanging a loop of rope on a hook at different positions. A different linear-Gaussian controller is trained for each position, and then a single neural network policy is trained to unify these controllers into a single policy that can generalize to new hook positions. A full description of guided policy search is outside the scope of this paper, and we refer the reader to previous work for details .

IV Algorithm Overview

Our algorithm consists of three stages. In the first stage, we optimize an initial controller for the task without using vision, using the method described in the previous section. The state space during this stage corresponds to the joint angles and end-effector positions of the robot, as well as their time derivatives. This controller is rarely able to succeed at tasks that require keeping track of objects in the world, but it serves to produce a more focused exploration strategy that can be used to collect visual data, since an entirely random controller is unlikely to explore interesting parts of the state space. In general, this stage can be replaced by any reasonable exploration strategy.

In the second stage, the initial controller is used to collect a dataset of images, and these images are then used to train our deep spatial autoencoder with unsupervised learning, as described in Section V. Once this autoencoder is trained, the encoder portion of the network can be used to produce a vector of feature points for each image that concisely describes the configuration of objects in the scene. The final state space is formed by concatenating the joint angles, end-effector positions, and feature points, and also including their velocities to allow for dynamic tasks to be learned. We must also define a cost function using this new state space, which we do by presenting an image of the target state to the feature encoder and determining the corresponding state.

In the third stage, a vision-based controller is trained using the new state space that now contains visual features from the encoder, with the new cost function defined in terms of the visual features and the robot’s configuration. This controller is trained using the same trajectory-centric reinforcement learning algorithm as in the first stage, and is able to perform tasks that require controlling objects in the world that are identified and localized using vision.

An overview of this procedure is provided in Figure 2. In the next section, we describe our representation learning method.

V Unsupervised State Representation Learning from Visual Perception

Autoencoders acquire features by learning to map their input to itself, with some mechanism to prevent trivial solutions, such as the identity function. These mechanisms might include sparsity or a low-dimensional bottleneck. Our autoencoder architecture, shown in Figure 3, maps from full-resolution RGB images to a down-sampled, grayscale version of the input image, and we force all information in the image to pass through a bottleneck of spatial features, which we describe below. Since low-dimensional, dense vectors are generally well-suited for control, a low-dimensional bottleneck is a natural choice for our learned feature representation. A critical distinction in our approach is to modify this bottleneck so that it is forced to encode spatial feature locations rather than feature values – the “where” rather than the “what.” This architecture makes sense for a robot state representation, as it forces the network to learn object positions; we show in Section VII, that it outperforms more standard architectures that focus on the “what,” both for image reconstruction, and for robotic control. The final state space for RL is then formed by concatenating this learned encoding, as well as its time derivatives (the feature “velocities”), with the robot’s configuration.

The first part of the encoder architecture is a standard 3-layer convolutional neural network with rectified linear units of the form acij=max(0,zcij)a_{cij}=max(0,z_{cij}) for each channel cc and pixel (i,j)(i,j). We compute the spatial features from the last convolutional layer by performing a “spatial soft arg-max” operation that determines the image-space point of maximal activation in each channel of conv3. This set of maximal activation points forms our spatial feature representation and forces the autoencoder to focus on object positions. The spatial soft arg-max consists of two operations. The response maps of the third convolutional layer (conv3) are first passed through a spatial softmax scij=eacij/α/ijeacij/αs_{cij}={e^{a_{cij}/\alpha}}/{\sum_{i^{\prime}j^{\prime}}{e^{a_{ci^{\prime}j^{\prime}}/\alpha}}}, where the temperature α\alpha is a learned parameter. Then, the expected 2D position of each softmax probability distribution sc\mathbf{s}_{c} is computed according to fc=(iiscij,jjscij)\mathbf{f}_{c}=(\sum_{i}i*s_{cij},\sum_{j}j*s_{cij}), which forms the bottleneck of the autoencoder. This pair of operations typically outputs the image-space positions of the points of maximal activation for each channel of conv3. Finally, the decoder hdech_{\text{dec}} is simply a single linear (fully connected) mapping from the feature points f\mathbf{f} to the down-sampled image. We found that this simple decoder was sufficient to acquire a suitable feature representation.

With this architecture, the bottleneck representation f=henc(I)\mathbf{f}=h_{\text{enc}}(I), which we refer to as learned “feature points,” encodes the positions of the learned features in the image. The autoencoder is forced to compress all of the image information through this spatial feature point representation, which will therefore be capable of directly localizing objects in the image. This means that the state representation used for control will capture the spatial organization of objects in the scene. Example learned feature points are visualized in Figure 4. Note that these feature points pick out task-relevant objects (those which move during the collection of the image data.), and learn to ignore background features. One drawback of this representation, however, is that there is always exactly one output point per feature channel, which does not gracefully handle occlusions (zero features present) or repeated structure and objects (more than one feature).

A feature representation that can reconstruct the image clearly contains the state information captured by the camera. However, the learning algorithm must also be able to predict the dynamics of the state representation, which is not necessarily inherently possible with unconstrained learned features . Thus, we add a penalty to the autoencoder objective, gslow(ft)=(ft+1ft)(ftft1)22g_{\operatorname{slow}}(\mathbf{f}_{t})=||(\mathbf{f}_{t+1}-\mathbf{f}_{t})-(\mathbf{f}_{t}-\mathbf{f}_{t-1})||_{2}^{2}, to encourage the feature points to slowly change velocity. As a result, the overall autoencoder objective becomes:

where IdownsampI_{\operatorname{downsamp}} is a downsampled, grayscale version of the input image II, and fk,t=henc(Ik,t)\mathbf{f}_{k,t}=h_{\text{enc}}(I_{k,t}), the feature point encoding of the ttth image in the kkth sequence.

We optimize the auto encoder using stochastic gradient descent (SGD), and with batch normalization following each of the convolutional operations. The filters of the first convolutional layer are initialized with a network trained on ImageNet .

V-B Filtering and Feature Pruning Using Feature Presence

Not all of the learned feature points will encode useful information that can be adequately modeled with a time-varying linear dynamical system by our RL algorithm. Lighting and camera exposure can cause sudden changes in the image that do not reflect motion of relevant objects. However, the autoencoder will assign some subset of its features to model these phenomena to improve its reconstruction, producing noisy feature points that make it difficult to train the controller. Additionally, even useful features can periodically become occluded, resulting in low activations in the last convolutional layer and unpredictable peak locations.

To handle these issues, we introduce the concept of feature presence. The presence of a feature is determined by the softmax probability value scij\mathbf{s}_{cij} at the pixel location of the feature point fc=(i,j)\mathbf{f}_{c}=(i,j).Since the feature point position may fall on a boundary between pixels, we actually sum the probability over a 3×33\times 3 window of nearby pixels. This measure reflects the probability mass after the softmax that is actually located at the position of the maximum activation. This measure is usually very high. As shown in Figure 4, learned feature points usually place almost all of their probability mass at fc\mathbf{f}_{c}, the location of the maximum. We therefore use a threshold on scij\mathbf{s}_{cij} that marks the feature as present if scijβ\mathbf{s}_{cij}\geq\beta, with β=0.95\beta=0.95. The positions of the feature points during a trial are filtered by a Kalman filter, which receives observations of fc\mathbf{f}_{c} only when scijβ\mathbf{s}_{cij}\geq\beta, producing smooth predictions when the feature is not present. We use a second order Kalman filter and fit the parameters using expectation maximization.

In addition to filtering features that might become occluded, we also automatically determine the features that do not adequately represent task-relevant objects in the scene and prune them from the state. As discussed in Section IV, the goal positions for the feature points are defined by showing an image of the goal state to the robot. To mitigate the effects of noise, we record a short sequence of fifty images over the course of two seconds and average the resulting feature positions. We also measure the feature presence indicators during this period, and prune those points that are not present for the entire sequence, since they are unlikely to be relevant. In our evaluation, we show that this pruning step is important for removing extraneous and noisy features from the state. To validate this heuristic, we conducted a more thorough analysis of the predictability of the feature points to determine its effectiveness

To analyze the usefulness of the learned feature points for modeling the dynamics of the scene, we define the following measure of feature predictiveness:

VI Control Over Visual Features

VII Experimental Evaluation

We evaluated our method on a range of robotic manipulation tasks, ranging from pushing a lego block to scooping a bag of rice into a bowl. The aim of these experiments was to determine whether our method could learn behaviors that required tracking objects in the world that could only be perceived with vision. To that end, we compared controllers learned by our approach to controllers that did not use vision, instead optimizing for a goal end-effector position. We also compared representations learned with our spatial autoencoder to hidden state representations learned by alternative architectures, including ones proposed in previous work.

The four experimental tasks in our evaluation are shown in Figure 5. The first and simplest task requires sliding a lego block 30 cm across a table. The robot does not grasp the lego block, but rather pushes it along the table with its gripper, which requires coordinating the motion of the block and balancing it at the end of the fingers. In the second task, the robot is holding a spoon that contains a white bag, and it must use the spoon to drop the bag into a bowl. The robot must balance the bag on the spoon without dropping it, and must angle the spoon such that the bag falls into the bowl. In the third task, the robot must use a spatula to scoop a bag of rice off of the table and into a bowl. In this task, the robot must perform a quick, precise sliding motion under the bag and then lift and angle the spatula such that the rice bag slides or flips into the bowl. This task is illustrated in detail in Figure 1. If the robot does not coordinate the motion of the spatula and rice bag, the bag is pushed aside and cannot be lifted properly.

In the fourth task, we combine our method with guided policy search by training four separate linear-Gaussian controllers for hanging a loop of rope on a hook at different positions. These four controllers are combined into a single neural network policy using guided policy search, and we test the capability of this policy to then hang the rope on the hook in new positions. The robot must use vision to locate the hook, so making use of the spatial feature points is essential for performing this task.

VII-B Results and Analysis

For the generalization experiment with the rope loop on hook task, we trained linear-Gaussian controllers for hook positions that were spaced 8 cm apart, and trained a single neural network using guided policy search that unified these controllers into a single nonlinear policy. This nonlinear policy could succeed at each of the training positions, and was also successful at the three test positions not seen during training, as shown in Table III. The policy trained without vision was unable to localize the hook, and resorted to a random strategy, which was occasionally successful but generally failed at the task. The supplementary video shows the behavior of both the “blind” policy and the policy that used our learned visual state representation.

We also evaluate two alternative autoencoder architectures that are representative of prior work, trained with the same training and validation datasets as our model. The first comparison closely mirrors the method of Lange et al. , but with a bottleneck dimensionality of 1010 to account for the greater complexity of our system. The second comparison reflects a more recent architectures, using max-pooling to reduce the dimensionality of the image maps and batch normalization after the convolutional layers. The bottleneck for this architecture is 3232, matching that of our architecture. Details of both baseline architectures are provided in Appendix -B of the supplementary materials2. We evaluated both architectures with and without a smoothness penalty. The results, shown in Table IV, show that these methods struggle with our high-dimensional, real-world tasks, despite the larger model achieving a lower reconstruction loss than our autoencoder. For several of the models, we could not obtain a stable controller, while the others did not effectively integrate visual input into the control strategy. Table IV also shows the performance of our method without the smoothness penalty, and without feature pruning, showing that both of these components are critical for obtaining good results.

Another advantage of our approach is its sample efficiency, which is enabled both by the use of simple linear-Gaussian controllers and a data-efficient neural network architecture. The autoencoders used around 50 trials for each task, with each trial consisting of 100 image frames and 5 second of interaction time, for a total of 5000 frames per task. Training the final vision-based controller required another 50-75 trials, which means that each controller was trained with around 10-15 minutes of total robot interaction time.

VIII Discussion and Future Work

We presented a method for learning state representations using deep spatial autoencoders, and we showed how this method could be used to learn a range of manipulation skills that require close coordination between perception and action. Our method uses a spatial feature representation of the environment, which is learned as a bottleneck layer in an autoencoder. This allows us to learn a compact state from high-dimensional real-world images. Furthermore, since this representation corresponds to image-space coordinates of objects in the scene, it is particularly well suited for continuous control. The trajectory-centric RL algorithm we employ can learn a variety of manipulation skills with these spatial representations using only tens of trials on the real robot.

While we found that controllers trained without vision could adequately explore the space for each task to generate training data for representation learning, there are tasks where visiting a sufficient range of states can be difficult without vision. A promising direction for tackling such tasks is to interleave representation learning with controller optimization in a similar spirit to iterative model learning approaches . Another promising future direction is to use additional sensory modalities to learn more advanced sensory state representations, e.g. depth and haptic sensing.

The prospect of autonomously learning compact and portable state representations of complex environments entirely from raw sensory inputs, such as images, has widespread implications for autonomous robots. The method presented in this paper takes one step in this direction, by demonstrating that a spatial feature architecture can effectively learn suitable representations for a range of manipulation tasks, and these representations can be used by a trajectory-centric reinforcement learning algorithm to learn those tasks. Further research in this direction has the potential to make it feasible to apply out-of-the-box learning-based methods to acquire complex skills for tasks where a state space is very difficult to design by hand, such as tasks with deformable objects, complex navigation in unknown environments, and numerous other tasks.

Acknowledgements: This research was funded in part by ONR through a Young Investigator Program award, by the Army Research Office through the MAST program, and by the Berkeley Vision and Learning Center (BVLC). Chelsea Finn was also supported in part by a SanDisk fellowship.

References

-A Robotic Experiment Details

All of the robotic experiments were conducted on a PR2 robot. The robot was controlled at 20 Hz via direct effort control,The PR2 robot does not provide for closed loop torque control, but instead supports an effort control interface that directly sets feedforward motor voltages. In practice, these voltages are roughly proportional to feedforward torques, but are also affected by friction and damping. and camera images were recorded using the RGB camera on a PrimeSense Carmine sensor. The images were downsamples to 240×240240\times 240. The learned policies controlled one 7 DoF arm of the robot. The camera was kept fixed in each experiment. Each episode was 55 seconds in length. For each task, the cost function required reaching the goal state, defined both by visual features and gripper pose. Similar to previous work, the cost was given by the following equation:

The lego block task required the robot to push a lego block 30 cm30\text{ cm} to the left of its initial position. For this task, we measured and reported the distance between the top corner of the goal block position to the nearest corner of the lego block at the end of the trial. In some of the baseline evaluations, the lego block was flipped over, and the nearest corner was still used to measure distance to the goal.

Bag transfer

The bag transfer task required the robot to place a white bag into a bowl, using a spoon. At the start of each trial, the robot was grasping the spoon with the bag in the spoon. A trial was considered successful if the bag was inside the bowl and did not extend outside of the bowl. In practice, the bag was very clearly entirely in the bowl, or entirely outside of the bowl during all evaluations.

Rice scoop

The rice scooping task required the robot to use a spatula to lift a small bag of rice off of a table and place it in a bowl. At the start of each trial, the spatula was in the grasp of the robot gripper, and the bag of rice was on the table, about 3 cm3\text{ cm} from the bowl. As with the bag transfer task, a trial was considered successful if the bag of rice was inside the bowl and did not extend outside of the bowl. In practice, the bag was very clearly in the bowl, or outside of the bowl during all evaluations.

Loop hook

The loop hook task required the robot to place a loop of rope onto a metal hook attached to a scale, for different positions of the hook. At training time, the scale was placed at four different starting positions along a metal pole that were equally spaced across 24 cm24\text{ cm} of the pole. The test positions were the three midpoints between the four training positions. A trial was considered successful if, upon releasing the rope, the loop of rope would hang from the hook. In practice, the failed trials using our approach were often off by only 11-2 mm2\text{ mm}, whereas the controller with no vision was typically off by several centimeters.

-B Neural Network Architectures for Prior Work Methods

We compare our network with two common neural network architectures. The first baseline architecture is the one used by Lange et al. . The network is composed of 8 encoder layers and 8 decoder layers. To match the original architecture as closely as possible, we converted our 240×240240\times 240 RGB images into 60×6060\times 60 grayscale images before passing them through the network. The encoder starts with 3 convolution layers with filter size 7×77\times 7, where the last convolution layer has stride 22. The last convolution layer is followed by 6 fully connected layers, the size of which are 288288, 144144, 7272, 3636, 1818 and 1010 respectively. The last fully connected layer forms the bottleneck of the autoencoder. We chose 1010 as the dimension of the bottleneck, since the system has roughly 1010 degrees of freedom. The decoder consists of 6 mirrored fully connected layers followed by 33 deconvolution layers, finally reconstructing the down sampled 60×6060\times 60 image. We used ReLU nonlinearities between each layer. Following , we pre-train each pair of the encoder-decoder layers for 40004000 iterations. Then, we perform fine tuning on the entire network until the validation error plateaus.

We also experimented with a more widely adopted convolutional architecture. The 240×240×3240\times 240\times 3 image is directly passed to the network. This network starts with 3 convolutional layers. As in our network architecture, conv1 consists of 6464 7×77\times 7 filters with stride 22, conv2 has 3232 5×55\times 5 filters with stride 11, and conv3 has 1616 5×55\times 5 filters with stride 11, each followed by batch normalization and ReLU nonlinearities. Unlike our architecture, this baseline architecture performs max-pooling after each convolution layer in order to decrease the dimensionality of the feature maps. The convolution layers are followed by two fully connected layers with 512512 and 3232 units respectively, the last of which forms the bottleneck of the network. These layers together form the encoder, and a mirroring architecture, consisting of fully connected layers and deconvolution layers, forms the decoder. We initialize the first convolution layer with weights trained on ImageNet, and train the network until validation error plateaus.