On the Transfer of Inductive Bias from Simulation to the Real World: a New Disentanglement Dataset

Muhammad Waleed Gondal, Manuel Wüthrich, Đorđe Miladinović, Francesco Locatello, Martin Breidt, Valentin Volchkov, Joel Akpo, Olivier Bachem, Bernhard Schölkopf, Stefan Bauer

stat.ML cs.LG

Introduction

In representation learning it is commonly assumed that a high-dimensional observation $\mathbf{X}$ is generated from low-dimensional factors of variation $\mathbf{G}$. The goal is usually to revert this process by searching for a latent embedding $\mathbf{Z}$ which replicates the underlying generative factors $\mathbf{G}$, e.g. shape, size or color. Learning well-disentangled representations of complex sensory data has been identified as one of the key challenges in the quest for artificial intelligence (AI), since they should contain all the information present in the observations in a compact and interpretable structure while being independent of the task at hand.

Disentangled representations may be useful for (semi-)supervised learning of downstream tasks, transfer and few-shot learning. Further, such representations make it possible to filter out nuisance factors, to perform interventions and to answer counterfactual questions. Algorithms for learning disentangled representations have first been applied to visual concept learning, sequence modeling, curiosity-based exploration and even domain adaptation in reinforcement learning. The research community is in general agreement on the importance of this paradigm and much progress has been made in the past years, particularly on the algorithmic level [e.g. 18, 24], fundamental understanding [e.g. 17, 52] and experimental evaluation. However, research has thus far focused on synthetic toy datasets.

The main motivation for using synthetic datasets is that they are cheap, easy to generate and the independent generative factors can be easily controlled. However, real-world recordings exhibit imperfections such as chromatic aberrations in cameras and complex surface properties of objects (e.g. reflectance, radiance and irradiance), making transfer learning from synthetic to real data a nontrivial task. Despite the growing importance of the field and the potential societal impact in the medical domain or fair decision making [e.g. 6, 10, 37], the performance of state-of-the-art disentanglement learning on real-world data is unknown.

To bridge the gap between simulation and the physical world, we built a recording platform which allows us to investigate the following research questions: (i) How well do unsupervised state-of-the-art algorithms transfer from rendered images to physical recordings? (ii) How much does this transfer depend on the quality of the simulation? (iii) Can we learn representations on low dimensional recordings and transfer them from the current state-of-the-art of $64\times 64$ images to high quality images? (iv) How much supervision is necessary to encode the necessary inductive biases? (v) Are the confounding and distortions of real-world recordings beneficial for learning disentangled representations? (vi) Can we disentangle causal mechanisms in the data generating process? (vii) Are disentangled representations useful for solving real-world downstream tasks?

While answering all of the above questions is beyond the scope of this paper, our key contributions can be summarized as follows:

We introduce the first real-world 3D data set recorded in a controlled environment, defined by 7 factors of variation: object color, object shape, object size, camera height, background color and two degrees of freedom of motion of a robotic arm. The dataset is made publicly available at https://github.com/rr-learning/disentanglement_dataset.

We provide synthetic images produced by computer graphics with two levels of realism. Since the robot arm and the objects are printed from a 3D template, we can ensure a close similarity between the realistic renderings and the real-world recordings.

The collected dataset of physical 3D objects consists of over one million images, and each of the two simulated datasets contains the same number of images as well.

We investigate the role of inductive bias, the transfer of different hyper-parameter settings between the different simulations and the real world, and the requirements on the quality of the simulation for a successful transfer.

Background and Related Work

We assume a set of observations of a (potentially high dimensional) random variable $\bm{X}$ which is generated by $K$ unobserved causes of variation (generative factors) $\bm{G}=[G_{1},\dots,G_{K}]$ (i.e., $\bm{G}\rightarrow\bm{X}$) that do not cause each other. These latent factors represent elementary ingredients to the causal mechanism generating $\bm{X}$. The elementary ingredients $G_{i},i=1,\dots,K$ of the causal process work on their own and are changeable without affecting others, reflecting the independent mechanisms assumption. However, for some of the factors a hierarchical structure may exist, for which this may only hold true when the hierarchical structure as a whole is seen as one component. The graphical model corresponding to this framework, adapted from prior work, is depicted in figure 2. The hierarchical structure of the factors $G_{K-1}^{1}$ and $G_{K-1}$ might represent one compositional process, e.g. connected joints of a robot arm. The most commonly accepted understanding of disentanglement is that each learned feature in $\bm{Z}$ should capture one factor of variation in $\bm{G}$.

Current state-of-the-art disentanglement approaches use the framework of variational auto-encoders (VAEs). The (high-dimensional) observations $\bm{x}$ are modelled as being generated from some latent features $\bm{z}$ with chosen prior $p(\bm{z})$ according to the probabilistic model $p_{\theta}(\bm{x}|\bm{z})p(\bm{z})$. The generative model $p_{\theta}(\bm{x}|\bm{z})$ as well as the proxy posterior $q_{\phi}(\bm{z}|\bm{x})$ can be represented by neural networks, which are optimized by maximizing the variational lower bound (ELBO) of $\log p(\bm{x}_{1},\dots,\bm{x}_{N})$.
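For reference, the variational lower bound can be written in its standard textbook form as follows. This is a sketch using the symbols introduced in this section; writing $\bm{x}_{i}$ for the $i$-th observation and $D_{\mathrm{KL}}$ for the Kullback-Leibler divergence is a notational assumption made here.

$$\log p(\bm{x}_{1},\dots,\bm{x}_{N}) \geq \sum_{i=1}^{N}\Big(\mathbb{E}_{q_{\phi}(\bm{z}|\bm{x}_{i})}\big[\log p_{\theta}(\bm{x}_{i}|\bm{z})\big] - D_{\mathrm{KL}}\big(q_{\phi}(\bm{z}|\bm{x}_{i})\,\|\,p(\bm{z})\big)\Big)$$

The first term rewards accurate reconstruction, while the second keeps the approximate posterior close to the prior $p(\bm{z})$.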

Since the above objective does not enforce any structure on the latent space, except for some similarity to the typically chosen isotropic Gaussian prior $p(\bm{z})$, various proposals for more structure-imposing regularization have been made. These either use some form of supervision [e.g. 43, 4, 35, 40, 9] or are completely unsupervised [e.g. 19, 24, 7, 27, 13]. The $\beta$-VAE penalizes the Kullback-Leibler divergence (KL) term in the VAE objective more strongly, which encourages similarity to the factorized prior distribution. Others used techniques to encourage statistical independence between the different components in $\bm{Z}$, e.g., FactorVAE or $\beta$-TCVAE, while DIP-VAE proposed to encourage factorization of the inferred prior $q_{\phi}(\bm{z})=\int q_{\phi}(\bm{z}|\bm{x})p(\bm{x})\,d\bm{x}$. For other related work we refer to the detailed descriptions in the recent empirical study.
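To make the effect of the $\beta$ weight concrete, the following is a minimal per-sample sketch of a $\beta$-weighted VAE loss. It assumes a diagonal Gaussian posterior and a standard normal prior, for which the KL term has a closed form; the function names, the argument layout and the default `beta=1.0` are illustrative choices and are not taken from any particular implementation.

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))

def beta_vae_loss(recon_nll, mu, log_var, beta=1.0):
    """Negative beta-weighted ELBO for one sample (a quantity to minimize).

    recon_nll is the reconstruction negative log-likelihood, assumed to be
    computed elsewhere; beta=1.0 corresponds to the plain VAE objective.
    """
    return recon_nll + beta * gaussian_kl(mu, log_var)

# Posterior equal to the prior: the KL term vanishes, so beta has no effect.
print(beta_vae_loss(10.0, [0.0, 0.0], [0.0, 0.0], beta=4.0))
# Posterior shifted away from the prior: a larger beta gives a larger penalty.
print(beta_vae_loss(10.0, [1.0, 0.0], [0.0, 0.0], beta=1.0))
print(beta_vae_loss(10.0, [1.0, 0.0], [0.0, 0.0], beta=4.0))
```

With $\beta>1$ any deviation of the posterior from the factorized prior costs more, which is the mechanism by which the reconstruction term is traded against the regularizer.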

Real-world data is costly to generate and ground truth is often not available since significant confounding may exist. To bypass this limitation, many recent state-of-the-art disentanglement models have heavily relied on synthetic toy datasets, trying to solve a simplified version of the problem in the hope that the conclusions drawn might likewise be valid for real-world settings. A quantitative summary of the most widely used datasets for learning disentangled representations is provided in table 1.

For quantitative analysis, dSprites (https://github.com/deepmind/dsprites-dataset) is the most commonly used dataset. This synthetic dataset contains binary 2D images of hearts, ellipses and squares in low resolution. In Color-dSprites the shapes are colored with a random color, Noisy-dSprites considers white-colored shapes on a noisy background and in Scream-dSprites the background is replaced with a random patch in a random color shade extracted from the famous painting The Scream. The dSprites shape is embedded into the image by inverting the color of its pixels. The SmallNORB dataset (https://cs.nyu.edu/~ylclab/data/norb-v1.0-small/) contains images of 50 toys belonging to 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. The objects were imaged by two cameras under 6 lighting conditions, 9 elevations (30 to 70 degrees every 5 degrees), and 18 azimuths (0 to 340 degrees every 20 degrees). For Cars3D (http://www.scottreed.info/files/nips2015-analogy-data.tar.gz), 199 CAD models were used to generate 64x64 color renderings from 24 rotation angles, each offset by 15 degrees. Recently, 3dshapes, a dataset of 3D shapes procedurally generated from 6 ground truth independent latent factors, was made publicly available (https://github.com/deepmind/3dshapes-dataset/). These factors are floor colour, wall colour, object colour, scale, shape and orientation.

Bridging the Gap Between Simulation and the Real World: A Novel Dataset

While other real-world recordings, e.g. CelebA, exist, they offer only qualitative evaluations. A more controlled dataset is needed to quantitatively investigate the effects of inductive biases, sample complexity and the interplay of simulations and the real world.

In order to record a controlled dataset of physical 3D objects, we built the mechanical platform illustrated in figure 3. It consists of three cameras mounted at different heights, a robotic manipulator carrying a 3D printed object (which can be swapped) in the center of the platform and a rotating table at the bottom. The platform is shielded with black sheets from all sides to avoid any intrusion of external factors (e.g. light) and the whole environment is relatively uniformly illuminated by three light sources installed within the platform.

The generative factors of variation $\bm{G}$ mentioned in section 2 are listed in the following for our recording setup.

All objects have one of six different colors: red (255, 0, 0), green (0, 255, 0), blue (0, 0, 255), white (255, 255, 255), olive (210, 210, 80) and brown (153, 76, 0) (see figure 4).

There are objects of six different shapes in the dataset: a cylinder, a hexagonal prism, a cube, a sphere, a pyramid with square base and a cone. All objects exhibit rotational symmetries about some axes; however, the kinematics of the robot are such that these axes never align with the degrees of freedom of the robot. This is important because it ensures that the robot degrees of freedom are observable given the images.

There are objects of two different sizes in the dataset, categorized as large (roughly 65 mm in diameter) and small (roughly 45 mm in diameter).

The dataset is recorded with three cameras mounted at three different heights (see figure 7 on the right), which represents another factor of variation in the images.

The rotation table (see figure 7) allows us to change background color. Note that for all images in the dataset we orient the table in such a way that only one background color is visible at a time. The colors are: sea green, salmon and purple.

Each object is mounted on the tip of the manipulator shown in figure 3. This manipulator has two degrees of freedom, a rotation about a vertical axis at the base and a second rotation about a horizontal axis. We move each joint in a range of $180^{\circ}$ in 40 equal steps (see figure 8 and figure 9). Note that these two factors of variation are independent, just like all other factors (i.e. we record all possible combinations between the two).
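Since all factors are recorded in every combination, the size of the dataset follows directly from the number of values per factor listed above. The short sketch below multiplies these counts and maps a flat image index to one value per factor. The factor names, their ordering and the index layout are assumptions made for illustration and need not match the layout of the released files.

```python
# Number of values per factor, as listed in the text (six colors, the six
# listed shapes, two sizes, three cameras, three backgrounds, 40 steps per joint).
factors = {
    "object_color": 6,
    "object_shape": 6,
    "object_size": 2,
    "camera_height": 3,
    "background_color": 3,
    "first_dof": 40,
    "second_dof": 40,
}

def num_combinations(sizes):
    """Total number of factor combinations, i.e. images per dataset."""
    total = 1
    for n in sizes.values():
        total *= n
    return total

def index_to_factors(index, sizes):
    """Map a flat index to one value per factor (last factor varies fastest).

    This row-major layout is a hypothetical convention, not the dataset's.
    """
    values = {}
    for name in reversed(list(sizes)):
        index, values[name] = divmod(index, sizes[name])
    return {name: values[name] for name in sizes}

print(num_combinations(factors))
print(index_to_factors(41, factors))
```

The product is 1,036,800, which agrees with the statement that the dataset contains over one million images.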

Simulated Data

In addition to the real-world dataset we recorded two simulated datasets of the same setup, hence all factors of variation are identical across the three datasets. One of the simulated datasets is designed to be as realistic as possible and the synthetic images are visually practically indistinguishable from real images (see figure 1 middle). For the second simulated dataset we used a deliberately simplified model (see figure 1 left), which makes it possible to investigate transfer from simplified models to real data.

The synthetic data was generated using Autodesk 3ds Max (2018). Most parts of the scene were imported from SolidWorks CAD files that were originally used to construct the experimental stage including the manipulator and 3D printing of the test objects. The surface shaders are based on Autodesk Physical material with hand-tuned shading parameters, based on full resolution reference images. The camera poses were initialized from the CAD data and then manually fine-tuned using reference images. The realistic synthetic images were obtained using the Autodesk Raytracer (ART) with three rectangular light sources, mimicking the LED panels. The simplified images were rendered with the Quicksilver hardware renderer.

First Experimental Evaluations of (unsupervised) Disentanglement Methods on Real-World Data

Some fields have been able to narrow the gap between simulation and reality, which has led to remarkable achievements (e.g. for in-hand manipulation). In contrast, for disentanglement methods this gap has not been bridged yet: state-of-the-art algorithms seem to have difficulties transferring learned representations even between toy datasets. The proposed dataset will enable the community to systematically investigate how such transfer of information between simulations with different degrees of realism and real data can be achieved. In the following we present a first experimental study in this direction.

We apply all the disentanglement methods ($\beta$-VAE, FactorVAE, $\beta$-TCVAE, DIP-VAE-I, DIP-VAE-II, AnnealedVAE) that were used in a recent large-scale study to our three datasets. Due to space constraints, the models are abbreviated with numbers zero to five in the plots, in the same order. We use the disentanglement_lib library and evaluate on the same scores as that study. In all the experiments, we used images with resolution 64x64. This resolution is used in the recent large-scale evaluations and by state-of-the-art disentanglement learning algorithms. Each of the six methods is trained on each of the three datasets with five different hyperparameter settings (see table 2 in the appendix for details) and with three different random seeds, leading to a total of 270 models. Each model is trained for 300,000 iterations on Tesla V100 GPUs. Details about the evaluation metrics can be found in appendix C.
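The size of the experimental grid can be checked by enumerating it. The sketch below is purely illustrative: the dataset labels and the integer identifiers for hyperparameter settings and seeds are placeholders, and no actual training configuration is implied.

```python
from itertools import product

methods = ["beta-VAE", "FactorVAE", "beta-TCVAE",
           "DIP-VAE-I", "DIP-VAE-II", "AnnealedVAE"]
datasets = ["real", "realistic_sim", "toy_sim"]  # placeholder labels
hyperparameter_ids = range(5)                    # five settings per method
seeds = range(3)                                 # three random seeds

# One training run per (method, dataset, hyperparameter setting, seed).
runs = list(product(methods, datasets, hyperparameter_ids, seeds))
print(len(runs))
```

The enumeration yields 6 x 3 x 5 x 3 = 270 runs, matching the number of trained models stated above.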

Experimental Results

Reconstruction Across Datasets: Figure 10 shows that there is a difference in reconstruction score across datasets: The score is the lowest on real data, followed by the realistic simulated dataset (R) and the simple toy (T) images. This indicates that there is a significant difference in the distribution of the real data compared to the simulated data, and that it is harder to learn a representation of the real data than of the simulated data. However, the relative behaviour of different methods seems to be similar across all three datasets, which indicates that despite the differences, the simulated data may be useful for model selection.

In figure 11 we show the Mutual Information Gap (MIG) scores attained by different methods for different evaluations. The same plots for different metrics look qualitatively similar (see figure 22 in the appendix). Given the high variance, it is difficult to make conclusive statements. However, it seems quite clear that all methods perform significantly better when they are trained and evaluated on the same dataset (three plots on the left). Direct transfer of learned representations from simulated to real data (two plots on the right) seems to work rather poorly.

We have seen that transferring representations directly from simulated to real data seems to work poorly. However, it may be possible to instead transfer information at a higher level, such as the choice of the method and its hyperparameters as an inductive bias.

In order to quantitatively evaluate whether such a transfer is possible, we pick the model (including hyperparameters) which performs best in simulation (according to a metric chosen at random), and we compute the probability of outperforming (according to a metric and seed chosen randomly) a model which was chosen at random. If no transfer is possible, we would expect this probability to be $50\%$ .

However, we find that model selection from realistic simulated renderings (R) outperforms random model selection $72\%$ of the time while transferring the model from the simpler synthetic images (T) to real-world data even beats random selection $78\%$ of the time.
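The selection procedure described above can be sketched as follows. This simplified version fixes a single metric and a single seed, whereas the evaluation in the text additionally randomizes over both; the scores are made-up numbers and the function name is hypothetical.

```python
def transfer_win_rate(sim_scores, real_scores):
    """Fraction of models that the simulation-selected model beats on real data.

    Ties (including the comparison of the selected model with itself) count
    as half a win, so a selection that carries no information about real
    performance scores 0.5 on average.
    """
    selected = max(sim_scores, key=sim_scores.get)
    wins = 0.0
    for score in real_scores.values():
        if real_scores[selected] > score:
            wins += 1.0
        elif real_scores[selected] == score:
            wins += 0.5
    return wins / len(real_scores)

# Made-up scores for four (method, hyperparameter) configurations.
sim = {"a": 0.10, "b": 0.30, "c": 0.20, "d": 0.05}
real = {"a": 0.08, "b": 0.15, "c": 0.22, "d": 0.02}
print(transfer_win_rate(sim, real))
```

In this toy example configuration "b" is selected in simulation and outperforms two of the three alternatives on the real data, giving a win rate above the 0.5 chance level.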

This finding is confirmed by figure 12, where we show the rank-correlation of the performance of models (including hyperparameters) trained on one dataset with the performance of these models trained on another dataset. The performance of a model trained on some dataset seems to be highly correlated with the performance of that model trained on any other dataset. In figure 12 we use the DCI disentanglement metric as a score; however, qualitatively similar results can be observed using most of the disentanglement metrics (see figure 25 in the appendix).
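A rank correlation between the scores obtained on two datasets can be computed as in the sketch below. It assumes Spearman's coefficient, i.e. the Pearson correlation of the ranks; the text does not state which rank-correlation measure is used, and the scores are made-up numbers.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up scores of the same four configurations on two datasets.
scores_sim = [0.10, 0.30, 0.20, 0.05]
scores_real = [0.08, 0.15, 0.22, 0.02]
print(spearman(scores_sim, scores_real))
```

A value close to 1 means that configurations which rank highly on one dataset also rank highly on the other, which is the property that makes model selection in simulation informative.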

Summary: These results indicate that the simulated and the real data distribution have some similarities, and that these similarities can be exploited through model and hyperparameter selection. Surprisingly, it seems that the transfer of models from the synthetic toy dataset may work even better than the transfer from the realistic synthetic dataset.

Conclusions

Despite the intended applications of disentangled representation learning algorithms to real data in fields such as robotics, healthcare and fair decision making [e.g. 6, 10, 20], state-of-the-art approaches have only been systematically evaluated on synthetic toy datasets. Our work effectively complements related efforts [e.g. 38] to address current challenges of representation learning, offering the possibility of investigating the role of inductive biases, sample complexity, transfer learning and the use of labels using real-world images.

A key aspect of our datasets is that we provide rendered images of increasing complexity for the same setup used to capture the real-world recordings. The different recordings offer the possibility of investigating whether disentangled representations can be transferred from simulation to the real world and how the transferability depends on the degree of realism of the simulation. Beyond the evaluation of representation learning algorithms, the proposed dataset can likewise be used for other tasks such as 3D reconstruction and scene rendering or learning compositional visual concepts. Furthermore, we are planning to use the novel experimental setup for recording objects with more complicated shapes and textures under more difficult conditions, such as dependence among different factors.

This research was partially supported by the Max Planck ETH Center for Learning Systems and Google Cloud. We thank Alexander Neitz and Arash Mehrjou for useful discussions. We would also like to thank Felix Grimminger, Ludovic Righetti, Stefan Schaal, Julian Viereck and Felix Widmaier whose work served as a starting point for the development of the robotic platform in the present paper.

References

Appendix A Platform

Appendix B A Discussion on Disentangled Representations and their Transfer

Empirically, we observed that some of the best trained models were able to disentangle, though imperfectly, the factors of camera height, background color, object size and the motions along the first and second degrees of freedom. In contrast, they performed poorly in disentangling some object shapes (e.g. pyramid and cone) and some colors (e.g. olive and brown). This may well be because these factors cause less pixel variation. The factors of camera height and background color cause the largest difference (maximum L2 distance) in image space. Similarly, the object positions at different joint configurations (not the consecutive frames) also have a large L2 distance, which may explain why the models focus more on learning these factors.

The image reconstructions become blurrier as we move from the simple simulated dataset to the more complex real-world dataset, as reflected by the reconstruction scores in figure 10. It has been previously noted that the use of overly simplistic priors such as the standard Gaussian in VAE models can lead to overregularization. To achieve disentanglement, VAE-based methods put a higher weight ($\beta>1$) on the KL divergence, which decreases the reconstruction quality. With the increased complexity in the dataset, the decrease in reconstruction quality becomes even more pronounced, as illustrated in figures 17, 18 and 19.

In figures 20 and 21, we show the reconstruction results of transfers from the simple simulated and realistic simulated datasets to the real-world dataset. The models completely fail in transferring the representations from the simple simulated to the real-world data. On the other hand, in the realistic simulated to real-world transfer (figure 21) the models almost always reconstruct the correct background strip color, manipulator pose and camera height. However, the object properties seem to differ a lot. This suggests that in complex environments the models focus on learning the environment to achieve better reconstruction accuracy rather than on learning the important but relatively small changing factors. This behaviour of VAE models has also been reported in prior work.

Appendix C Details of the Experimental Protocol

Various methods to validate a learned representation for disentanglement based on known ground truth generative factors $\bm{G}$ have been proposed [e.g. 11, 47, 7, 24]. This has for example been expressed as the mutual information of a single latent dimension $Z_{i}$ with generative factors $G_{1},\dots,G_{K}$, where in the ideal case each $Z_{i}$ has some mutual information with one generative factor $G_{k}$ but none with all the others. Similarly, predictors (e.g., Lasso or random forests) can be trained to predict a generative factor $G_{k}$ from the representation $\bm{Z}$. In a disentangled model, each dimension $Z_{i}$ is only useful (i.e., has high feature importance) to predict one of those factors. An interventional robustness score has also been proposed. The graphical model adapted to our setup is illustrated in figure 2. Another form of validation, especially without known generative factors, is the visual inspection of "latent traversals" [see e.g. 7].
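As an illustration of the mutual-information view, the sketch below computes a Mutual Information Gap for discrete samples: for each factor, the difference between the two largest mutual information values over all latent dimensions, normalized by the entropy of the factor and averaged over factors. This follows the commonly used definition of MIG and assumes latents that are already discretized, at least two latent dimensions and non-constant factors; it is not the evaluation code used for the reported results.

```python
import math
from collections import Counter

def entropy(values):
    """Empirical entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in Counter(values).values())

def mutual_information(a, b):
    """I(A; B) = H(A) + H(B) - H(A, B) for paired discrete samples."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def mig(latents, factors):
    """Mutual Information Gap.

    latents: list of per-dimension sample lists (already discretized).
    factors: list of per-factor sample lists, aligned with the latents.
    """
    gaps = []
    for g in factors:
        mi = sorted((mutual_information(z, g) for z in latents), reverse=True)
        gaps.append((mi[0] - mi[1]) / entropy(g))
    return sum(gaps) / len(gaps)

# Two binary factors observed in all four combinations.
g1 = [0, 0, 1, 1]
g2 = [0, 1, 0, 1]
# Disentangled code: each latent copies exactly one factor.
print(mig([g1, g2], [g1, g2]))
# Entangled code: both latents encode the joint configuration.
joint = [0, 1, 2, 3]
print(mig([joint, joint], [g1, g2]))
```

The disentangled code attains the maximal score because each factor is captured by a single latent dimension, while the entangled code scores approximately zero because two dimensions are equally informative about every factor.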

All the models used the same convolutional encoder and decoder architecture with the fixed latent size of 10.

The training hyperparameters were kept fixed for each of the considered methods.