On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models

Juergen Schmidhuber

General Reinforcement Learning (RL) agents must discover, without the aid of a teacher, how to interact with a dynamic, initially unknown, partially observable environment in order to maximize their expected cumulative reward signals, e.g., . There may be arbitrary, a priori unknown delays between actions and perceivable consequences. The RL problem is as hard as any problem of computer science, since any task with a computable description can be formulated in the RL framework, e.g., .

To become a general problem solver that is able to run arbitrary problem-solving programs, the controller of a robot or an artificial agent must be a general-purpose computer . Artificial recurrent neural networks (RNNs) fit this bill. A typical RNN consists of many simple, connected processors called neurons, each producing a sequence of real-valued activations. Input neurons get activated through sensors perceiving the environment, other neurons get activated through weighted connections or wires from previously active neurons, and some neurons may affect the environment by triggering actions. Learning or credit assignment is about finding real-valued weights that make the NN exhibit desired behavior, such as driving a car. Depending on the problem and how the neurons are connected, such behavior may require long causal chains of computational stages, where each stage transforms the aggregate activation of the network, often in a non-linear manner.

Unlike feedforward NNs (FNNs; ) and Support Vector Machines (SVMs; ), RNNs can in principle interact with a dynamic partially observable environment in arbitrary, computable ways, creating and processing memories of sequences of input patterns . The weight matrix of an RNN is its program. Without a teacher, reward-maximizing programs of an RNN must be learned through repeated trial and error.

It is possible to train small RNNs with a few 100 or 1000 weights using evolutionary algorithms to search the space of NN weights , or through policy gradients (PGs) [245, Sec. 6.6]. For example, our evolutionary algorithms outperformed traditional, Dynamic Programming -based RL methods [245, Sec. 6.2] in partially observable environments, e.g., . However, these techniques by themselves are insufficient for solving complex control problems involving high-dimensional sensory inputs such as video, from scratch. The program search space for networks of the size required for these tasks is simply too large.

However, the search space can often be reduced dramatically by evolving compact encodings of neural networks (NNs), e.g., through Lindenmeyer Systems , graph rewriting , Cellular Encoding , HyperNEAT , and other techniques [245, Sec. 6.7]. In very general early work, we used universal assembler-like languages to encode NNs , later coefficients of a Discrete Cosine Transform (DCT) . The latter method, Compressed RNN Search , was used to successfully evolve RNN controllers with over a million weights (the largest ever evolved) to drive a simulated car in a video game, based solely on a high-dimensional video stream —learning both control and visual processing from scratch, without unsupervised pre-training of a vision system. This was the first published Deep Learner to learn control policies directly from high-dimensional sensory input using RL.

One can further facilitate the learning task of controllers through certain types of supervised learning (SL) and unsupervised learning (UL) based on gradient descent techniques. In particular, UL/SL can be used to compress the search space, and to build predictive world models to accelerate RL, as will be discussed later. But first let us review the relevant NN algorithms for SL and UL.

2 Deep Learning in NNs: Supervised & Unsupervised Learning (SL & UL)

The term Deep Learning was first introduced to Machine Learning in 1986 and to NNs in 2000 . The first deep learning NNs, however, date back to the 1960s (certain more recent developments are covered in a survey ).

To maximize differentiable objective functions of SL and UL, NN researchers almost invariably use backpropagation (BP) in discrete graphs of nodes with differentiable activation functions [245, Sec. 5.5]. Typical applications include BP in FNNs , or BP through time (BPTT) and similar methods in RNNs, e.g., . BP and BPTT suffer from the Fundamental Deep Learning Problem first discovered and analyzed in my lab in 1991: with standard activation functions, cumulative backpropagated error signals decay exponentially in the number of layers, or they explode . Hence most early FNNs had few layers. Similarly, early RNNs [245, Sec. 5.6.1] could not generalize well under both short and long time lags between relevant events. Over the years, several ways of overcoming the Fundamental Deep Learning Problem have been explored. For example, deep stacks of unsupervised RNNs or FNNs help to accelerate subsequent supervised learning through BPTT or BP . One can also “distill” or compress the knowledge of a teacher RNN into a student RNN by forcing the student to predict the hidden units of the teacher .

Long Short-Term Memory (LSTM; ) alleviates the Fundamental Deep Learning Problem, and was the first RNN architecture to win international contests (in connected handwriting), e.g., . Connectionist Temporal Classification (CTC) is a widely used gradient-based method for finding RNN weights that maximize the probability of teacher-provided label sequences, given (typically much longer and more high-dimensional) streams of real-valued input vectors. For example, CTC was used by Baidu to break an important speech recognition record . Many recent state-of-the-art results in sequence processing are based on LSTM, which learned to control robots , and was used to set benchmark records in prosody contour prediction (IBM), text-to-speech synthesis (Microsoft), large vocabulary speech recognition (Google), and machine translation (Google). CTC-trained LSTM greatly improved Google Voice and is now available to over a billion smartphone users. Nevertheless, at least in some applications, other RNNs may sometimes yield better results than gradient-based LSTM . Alternative NNs with differentiable memory have been proposed .

Today’s faster computers, such as GPUs, mitigate the Fundamental Deep Learning Problem for FNNs . In particular, many recent computer vision contests were won by fully supervised Max-Pooling Convolutional NNs (MPCNNs), which consist of alternating convolutional and max-pooling layers topped off by standard fully connected output layers. All weights are trained by backpropagation . Ensembles of GPU-based MPCNNs achieved dramatic improvements of long-standing benchmark records, e.g., MNIST (2011), won numerous competitions , and achieved the first human-competitive or even superhuman results on well-known benchmarks, e.g., . There are many recent variations and improvements . Supervised Transfer Learning from one dataset to another can speed up learning. A combination of Convolutional NNs (CNNs) and LSTM led to best results in automatic image caption generation .

3 Gradient Descent-Based NNs for RL

Perhaps the most well-known RL application is Tesauro’s backgammon player from 1994 which learned to achieve the level of human world champions, by playing against itself. It uses a reactive (memory-free) policy based on the simplifying assumption of Markov Decision Processes: the current input of the RL agent conveys all information necessary to compute an optimal next output event or decision. The policy is implemented as a gradient-based FNN trained by the method of temporal differences [245, Sec. 6.2]. During play, the FNN learns to map board states to predictions of expected cumulative reward, and selects actions leading to states with maximal predicted reward. A very similar approach (also based on over 20-year-old methods) employed a CNN (see Sec. 1.2) to play several Atari video games directly from 84 $\times$ 84 pixel 60 Hz video input , using Neural Fitted Q-Learning (NFQ) based on experience replay (1991) . Even better results were achieved by using (slow) Monte Carlo tree planning to train comparatively fast deep NNs .

Such FNN approaches cannot work in realistic partially observable environments where memories of previous inputs have to be stored for a priori unknown time intervals. This triggered work on partially observable Markov decision problems (POMDPs) . Traditional RL techniques [245, Sec. 6.2] based on Dynamic Programming can be combined with gradient descent methods to train an RNN as a value-function approximator that maps entire event histories to predictions of expected cumulative reward . LSTM (see Sec. 1.2) was used in this way for RL robots .

Gradient-based UL may be used to reduce an RL controller’s search space by feeding it only compact codes of high-dimensional inputs [245, Sec. 6.4]. For example, NFQ was applied to real-world control tasks where purely visual inputs were compactly encoded in hidden layers of deep autoencoders [245, Sec. 5.7 and and 5.15]. RL combined with unsupervised learning based on Slow Feature Analysis enabled a humanoid robot to learn skills from raw video streams . A RAAM RNN was employed as a deep unsupervised sequence encoder for RL .

One important application of gradient-based UL is to obtain a predictive world model, $M$ , that a controller, $C$ , may use to achieve its goals more efficiently, e.g., through cheap, “mental” $M$ -based trials, as opposed to expensive trials in the real world . The first combination of an RL RNN $C$ and an UL RNN $M$ was ours and dates back to 1990 , generalizing earlier similar controller/model systems ( $CM$ systems) based on FNNs ; compare related work [245, Sec. 6.1]. $M$ tries to learn to predict $C$ ’s inputs (including reward signals) from previous inputs and actions. $M$ is also temporarily used as a surrogate for the environment: $M$ and $C$ form a coupled RNN where $M$ ’s outputs become inputs of $C$ , whose outputs (actions) in turn become inputs of $M$ . Now a gradient descent technique (see Sec. 1.2) can be used to learn and plan ahead by training $C$ in a series of $M$ -simulated trials to produce output action sequences achieving desired input events, such as high real-valued reward signals (while the weights of $M$ remain fixed). An RL active vision system, from 1991 , used this basic principle to learn sequential shifts (saccades) of a fovea to detect targets in a visual scene, thus learning a rudimentary version of selective attention.

Those early $CM$ systems, however, did not yet use powerful RNNs such as LSTM. A more fundamental problem is that if the environment is too noisy, $M$ will usually only learn to approximate the conditional expectations of predicted values, given parts of the history. In certain noisy environments, Monte Carlo Tree Sampling (MCTS; ) and similar techniques may be applied to $M$ to plan successful future action sequences for $C$ . All such methods, however, are about simulating possible futures time step by time step, without profiting from human-like hierarchical planning or abstract reasoning, which often ignores irrelevant details.

3.2 Early Predictive RNN World Models Combined with Traditional RL

In the early 1990s, an RNN $M$ as in Sec. 1.3.1 was also combined with traditional temporal difference methods [245, Sec. 6.2] based on the Markov assumption (Sec. 1.3). While $M$ is processing the history of actions and observations to predict future inputs and rewards, the internal states of $M$ are used as inputs to a temporal difference-based predictor of cumulative predicted reward, to be maximized through appropriate action sequences. One of our systems described in 1991 actually collapsed the cumulative reward predictor into the predictive world model, $M$ .

4 Hierarchical & Multitask RL and Algorithmic Transfer Learning

Work on NN-based Hierarchical RL (HRL) without predictive world models has been published since the early 1990s. In particular, gradient-based subgoal discovery with RNNs decomposes RL tasks into subtasks for submodules . Numerous alternative HRL techniques have been proposed . While HRL frameworks such as Feudal RL and options do not directly address the problem of automatic subgoal discovery, HQ-Learning automatically decomposes problems in partially observable environments into sequences of simpler subtasks that can be solved by memoryless policies learnable by reactive sub-agents. Related methods include incremental NN evolution , hierarchical evolution of NNs , and hierarchical Policy Gradient algorithms . Recent HRL organizes potentially deep NN-based RL sub-modules into self-organizing, 2-dimensional motor control maps inspired by neurophysiological findings . The methods above, however, assign credit in hierarchical fashion by limited fixed schemes that are not themselves improved or adapted in problem-specific ways. The next sections will describe novel $CM$ systems that overcome such drawbacks of above-mentioned methods.

General methods for incremental multitask RL and algorithmic transfer learning that are not NN-specific include the evolutionary ADATE system , the Success-Story Algorithm for Self-Modifying Policies running on general-purpose computers , and the Optimal Ordered Problem Solver , which learns algorithmic solutions to new problems by inspecting and exploiting (in arbitrary computable fashion) solutions to old problems, in a way that is asymptotically time-optimal. And PowerPlay incrementally learns to become a more and more general algorithmic problem solver, by continually searching the space of possible pairs of new tasks and modifications of the current solver, until it finds a more powerful solver that, unlike the unmodified solver, solves all previously learned tasks plus the new one, or at least simplifies/compresses/speeds up previous solutions, without forgetting any.

Algorithmic Information Theory (AIT) for RNN-based AIs

Our early RNN-based $CM$ systems (1990) mentioned in Sec. 1.3.1 learn a predictive model of their initially unknown environment. Real brains seem to do so too, but are still far superior to present artificial systems in many ways. They seem to exploit the model in smarter ways, e.g., to plan action sequences in hierarchical fashion, or through other types of abstract reasoning, continually building on earlier acquired skills, becoming increasingly general problem solvers able to deal with a large number of diverse and complex tasks. Here we describe RNN-based Artificial Intelligences (RNNAIs) designed to do the same by “learning to think.”The terminology is partially inspired by our RNNAISSANCE workshop at NIPS 2003 .

While FNNs are traditionally linked to concepts of statistical mechanics and information theory , the programs of general computers such as RNNs call for the framework of Algorithmic Information Theory (AIT) (own AIT work: ). Given some universal programming language for a universal computer, the algorithmic information content or Kolmogorov complexity of some computable object is the length of the shortest program that computes it. Since any program for one computer can be translated into a functionally equivalent program for a different computer by a compiler program of constant size, the Kolmogorov complexity of most objects hardly depends on the particular computer used. Most computable objects of a given size, however, are hardly compressible, since there are only relatively few programs that are much shorter. Similar observations hold for practical variants of Kolmogorov complexity that explicitly take into account program runtime . Our RNNAIs are inspired by the following argument.

According to AIT, given some universal computer, $U$ , whose programs are encoded as bit strings, the mutual information between two programs $p$ and $q$ is expressed as $K(q\mid p)$ , the length of the shortest program $\bar{w}$ that computes $q$ , given $p$ , ignoring an additive constant of $O(1)$ depending on $U$ (in practical applications the computation will be time-bounded ). That is, if $p$ is a solution to problem $P$ , and $q$ is a fast (say, linear time) solution to problem $Q$ , and if $K(q\mid p)$ is small, and $\bar{w}$ is both fast and much shorter than $q$ , then asymptotically optimal universal search for a solution to $Q$ , given $p$ , will generally find $\bar{w}$ first (to compute $q$ and solve $Q$ ), and thus solve $Q$ much faster than search for $q$ from scratch .

2 One RNN-Like System Actively Learns to Exploit Algorithmic Information of Another

The AIT argument 2.1 above has broad applicability. Let both $C$ and $M$ be RNNs or similar general parallel-sequential computers . $M$ ’s vector of learnable real-valued parameters $w_{M}$ is trained by any SL or UL or RL algorithm to perform a certain well-defined task in some environment. Then $w_{M}$ is frozen. Now the goal is to train $C$ ’s parameters $w_{C}$ by some learning algorithm to perform another well-defined task whose solution may share mutual algorithmic information with the solution to $M$ ’s task. To facilitate this, we simply allow $C$ to learn to actively inspect and reuse (in essentially arbitrary computable fashion) the algorithmic information conveyed by $M$ and $w_{M}$ .

Let us consider a trial during which $C$ makes an attempt to solve its given task within a series of discrete time steps $t=t_{a},t_{a}+1,\ldots,t_{b}$ . $C$ ’s learning algorithm may use the experience gathered during the trial to modify $w_{C}$ in order to improve $C$ ’s performance in later trials. During the trial, we give $C$ an opportunity to explore and exploit or ignore $M$ by interacting with it. In what follows, $C(t)$ , $M(t)$ , $sense(t)$ , $act(t)$ , $query(t)$ , $answer(t)$ , $w_{M}$ , $w_{C}$ denote vectors of real values; $f_{C},f_{M}$ denote computable functions.

At any time $t$ , $C(t)$ and $M(t)$ denote $C$ ’s and $M$ ’s current states, respectively. They may represent current neural activations or fast weights or other dynamic variables that may change during information processing. $sense(t)$ is the current input from the environment (including reward signals if any); a part of $C(t)$ encodes the current output $act(t)$ to the environment, another a memory of previous events (if any). Parts of $C(t)$ and $M(t)$ intersect in the sense that both $C(t)$ and $M(t)$ also encode $C$ ’s current $query(t)$ to $M$ , and $M$ ’s current $answer(t)$ to $C$ (in response to previous queries), thus representing an interface between $C$ and $M$ .

$M(t_{a})$ and $C(t_{a})$ are initialized by default values. For $t_{a}\leq t<t_{b}$ ,

with learnable parameters $w_{C}$ ; $act(t)$ is a computable function of $C(t)$ and may influence $in(t+1)$ , and $M(t+1)=f_{M}(C(t),M(t),w_{M})$ with fixed parameters $w_{M}$ . So both $M(t+1)$ and $C(t+1)$ are computable functions of previous events including queries and answers transmitted through the learnable $f_{C}$ .

According to the AIT argument, provided that $M$ conveys substantial algorithmic information about $C$ ’s task, and the trainable interface $f_{C}$ between $C$ and $M$ allows $C$ to address and extract and exploit this information quickly, and $w_{C}$ is small compared to the fixed $w_{M}$ , the search space of $C$ ’s learning algorithm (trying to find a good $w_{C}$ through a series of trials) should be much smaller than the one of a similar competing system $C^{\prime}$ that has no opportunity to query $M$ but has to learn the task from scratch.

For example, suppose that $M$ has learned to represent (e.g., through predictive coding ) videos of people placing toys in boxes, or to summarize such videos through textual outputs. Now suppose $C$ ’s task is to learn to control a robot that places toys in boxes. Although the robot’s actuators may be quite different from human arms and hands, and although videos and video-describing texts are quite different from desirable trajectories of robot movements, $M$ is expected to convey algorithmic information about $C$ ’s task, perhaps in form of connected high-level spatio-temporal feature detectors representing typical movements of hands and elbows independent of arm size. Learning a $w_{C}$ that addresses and extracts this information from $M$ and partially reuses it to solve the robot’s task may be much faster than learning to solve the task from scratch without access to $M$ .

The setups of Sec. 5.3 are special cases of the general scheme in the present Sec. 2.2.

3 Consequences of the AIT Argument for Model-Building Controllers

The simple AIT insight above suggests that in many partially observable environments it should be possible to greatly speed up the program search of an RL RNN, $C$ , by letting it learn to access, query, and exploit in arbitrary computable ways the program of a typically much bigger gradient-based UL RNN, $M$ , used to model and compress the RL agent’s entire growing interaction history of all failed and successful trials.

Note that the $\bar{w}$ of Sec. 2.1 may implement all kinds of well-known, computable types of reasoning, e.g., by hierarchical reuse of subprograms of $p$ , by analogy, etc. That is, we may perhaps even expect $C$ to learn to exploit $M$ for human-like abstract thought.

Such novel $CM$ systems will be a central topic of Sec. 5. Sec. 6 will also discuss exploration based on efficiently improving $M$ through $C$ -generated experiments.

The RNNAI and its Holy Data

In what follows, let $m,n,o$ denote positive integer constants, and $i,k,h,t,\tau$ positive integer variables assuming ranges implicit in the given contexts. The $i$ -th component of any real-valued vector, $v$ , is denoted by $v_{i}$ . Let the RNNAI’s life span a discrete sequence of time steps, $t=1,2,\ldots,t_{death}$ .

To be able to retrain its components on all observations ever made, the RNNAI stores its entire, growing, lifelong sensory-motor interaction history $H(\cdot)$ including all inputs and actions and reward signals observed during all successful and failed trials , including what initially looks like noise but later may turn out to be regular. This is normally not done, but is feasible today.

That is, all data is “holy”, and never discarded, in line with what mathematically optimal general problem solvers should do . Remarkably, even human brains may have enough storage capacity to store 100 years of sensory input at a reasonable resolution .

Many RNN-like models can be used to build general computers, e.g., neural pushdown automata , NNs with quickly modifiable, differentiable external memory based on fast weights , or closely related RNN-based meta-learners . Using sloppy but convenient terminology, we refer to all of them as RNNs. A typical implementation of $M$ uses an LSTM network (see Sec. 1.2). If there are large 2-dimensional inputs such as video images, then they can be first filtered through a CNN (compare Sec. 1.2 and 4.3) before fed into the LSTM. Such a CNN-LSTM combination is still an RNN.

Here we briefly summarize information processing in standard RNNs. Using notation similar to the one of a previous survey [245, Sec. 2], let $i,k,s$ denote positive integer variables assuming ranges implicit in the given contexts. Let $n_{u},n_{w},T$ also denote positive integers.

At any given moment, an RNN (such as the $M$ of Sec. 4) can be described as a connected graph with $n_{u}$ units (or nodes or neurons) in a set $N=\{u_{1},u_{2},\ldots,u_{n_{u}}\}$ and a set $H\subseteq N\times N$ of directed edges or connections between nodes. The input layer is the set of input units, a subset of $N$ . In fully connected RNNs, all units have connections to all non-input units.

The RNN’s behavior or program is determined by $n_{w}$ real-valued, possibly modifiable, parameters or weights, $w_{i}$ $(i=1,\ldots,n_{w})$ . During an episode of information processing (e.g., during a trial of Sec. 3.2), there is a partially causal sequence $x_{s}(s=1,\ldots,T)$ of real values called events. Here the index $s$ is used in a way that is much more fine-grained than the one of the index $t$ in Sec. 3, 4, 5: a single time step may involve numerous events. Each $x_{s}$ is either an input set by the environment, or the activation of a unit that may directly depend on other $x_{k}(k<s)$ through a current NN topology-dependent set, $in_{s}$ , of indices $k$ representing incoming causal connections or links. Let the function $v$ encode topology information, and map such event index pairs, $(k,s)$ , to weight indices. For example, in the non-input case we may have $x_{s}=f_{s}(net_{s})$ with real-valued $net_{s}=\sum_{k\in in_{s}}x_{k}w_{v(k,s)}$ (additive case) or $net_{s}=\prod_{k\in in_{s}}x_{k}w_{v(k,s)}$ (multiplicative case), where $f_{s}$ is a typically nonlinear real-valued activation function such as $tanh$ . Other $net$ functions combine additions and multiplications ; many other activation functions are possible. The sequence, $x_{s}$ , may directly affect certain $x_{k}(k>s)$ through outgoing connections or links represented through a current set, $out_{s}$ , of indices $k$ with $s\in in_{k}$ . Some of the non-input events are called output events.

Many of the $x_{s}$ may refer to different, time-varying activations of the same unit, e.g., in RNNs. During the episode, the same weight may get reused over and over again in topology-dependent ways. Such weight sharing across space and/or time may greatly reduce the NN’s descriptive complexity, which is the number of bits of information required to describe the NN (Sec. 4). Training algorithms for the RNNs of our RNNAIs will be discussed later.

2 Alternating Training Phases for Controller 𝑪𝑪\boldsymbol{C} and World Model 𝑴𝑴\boldsymbol{M}

Several novel implementations of $C$ are described in Sec. 5. All of them make use of a variable size RNN called the world model, $M$ , which learns to compactly encode the growing history, for example, through predictive coding, trying to predict (the expected value of) each input component, given the history of actions and observations. $M$ ’s goal is to discover algorithmic regularities in the data so far by learning a program that compresses the data better in a lossless manner. Example details will be specified in Sec. 4.

Both $C$ and $M$ have real-valued parameters or weights that can be modified to improve performance. To avoid instabilities, $C$ and $M$ are trained in alternating fashion, as in Algorithm 1.

The Gradient-Based World Model 𝑴𝑴\boldsymbol{M}

Let us address details of training $M$ in a “sleep phase” of step 4 in algorithm 1. (The training of $C$ will be discussed in Sec. 5.) Consider some $M$ with given (typically suboptimal) weights and a default initialization of all unit activations. One example way of making $M$ compress the history (but not the only one) is the following. Given $H(t)$ , we can train $M$ by replaying $H(t)$ in semi-offline training, sequentially feeding $all(1),all(2),\ldots all(t)$ into $M$ ’s input units in standard RNN fashion (Sec. 1.2, 3.1). Given $H(\tau)$ ( $\tau<t$ ), $M$ calculates $pred(\tau+1)$ , a prediction of $sense(\tau+1)$ . A standard error function to be minimized by gradient descent in $M$ ’s weights (Sec. 1.2) would be $E(t)=\sum_{\tau=1}^{t-1}\|pred(\tau+1)-sense(\tau+1)\|^{2}$ , the sum of the deviations of the predictions from the observations so far.

However, $M$ ’s goal is not only to minimize the total prediction error, $E$ . Instead, to avoid the erroneous “discovery” of “regular patterns” in irregular noise, we use AIT’s sound way of dealing with overfitting , and measure $M$ ’s compression performance by the number of bits required to specify $M$ , plus the bits needed to encode the observed deviations from $M$ ’s predictions . For example, whenever $M$ incorrectly predicts certain input pixels of a perceived video frame, those pixel values will have to be encoded separately, which will cost storage space. (In typical applications, $M$ can only execute a fixed number of elementary computations per time step to compress and decompress data, which usually has to be done online. That is, in general $M$ will not reflect the data’s true Kolmogorov complexity , but at best a time-bounded variant thereof .)

Let integer variables, $bits_{M}$ and $bits_{H}$ , denote estimates of the number of bits required to encode (by a fixed algorithmic scheme) the current $M$ , and the deviations of $M$ ’s predictions from the observations on the current history, respectively. For example, to obtain $bits_{H}$ , we may naively assume some simple, bell-shaped, zero-centered probability distribution $P_{e}$ on the finite number of possible real-valued prediction errors $e_{i,\tau}=(pred_{i}(\tau)-sense_{i}(\tau))^{2}$ (in practical applications the errors will be given with limited precision), and encode each $e_{i,\tau}$ by $-logP_{e}(e_{i,\tau})$ bits . That is, large errors are considered unlikely and cost more bits than small ones. To obtain $bits_{M}$ , we may naively multiply the current number of $M$ ’s non-zero modifiable weights by a small integer constant reflecting the weight precision. Alternatively, we may assume some simple, bell-shaped, zero-centered probability distribution, $P_{w}$ , on the finite number of possible weight values (given with limited precision), and encode each $w_{i}$ by $-logP_{w}(w_{i})$ bits. That is, large absolute weight values are considered unlikely and cost more bits than small ones . Both alternatives ignore the possibility that $M$ ’s entire weight matrix might be computable by a short computer program , but have the advantage of being easy to calculate. Moreover, since $M$ is a general computer itself, at least in principle it has a chance of learning equivalents of such short programs.

2 𝑴𝑴\boldsymbol{M}’s Training

To decrease $bits_{M}+bits_{H}$ , we add a regularizing term to $E$ , to punish excessive complexity .

Step 1 of algorithm 1 starts with a small $M$ . As the history grows, to find an $M$ with small $bits_{M}+bits_{H}$ , step 4 uses sequential network construction: it regularly changes $M$ ’s size by adding or pruning units and connections . Whenever this helps (after additional training with BPTT of $M$ —see Sec. 1.2) to improve $bits_{M}+bits_{H}$ on the history so far, the changes are kept, otherwise they are discarded. (Note that even animal brains grow and prune neurons.)

Given history $H(t)$ , instead of re-training $M$ in a sleep phase (step 4 of algorithm 1) on all of $H(t)$ , we may re-train it on parts thereof, by selecting trials randomly or otherwise from $H(t)$ , and replay them to retrain $M$ in standard fashion (Sec. 1.2). To do this, however, all of $M$ ’s unit activations need to be stored at the beginning of each trial. ( $M$ ’s hidden unit activations, however, do not have to be stored if they are reset to zero at the beginning of each trial.)

3 𝑴𝑴\boldsymbol{M} may have a Built-In FNN Preprocessor

To facilitate $M$ ’s task in certain environments, each frame of the sensory input stream (video, etc.) can first be separately compressed through autoencoders or autoencoder hierarchies based on CNNs or other FNNs (see Sec. 1.2) used as sensory preprocessors to create less redundant sensory codes . The compressed codes are then fed into an RNN trained to predict not the raw inputs, but their compressed codes. Those predictions have to be decompressed again by the FNN, to evaluate the total compression performance, $bits_{M}+bits_{H}$ , of the FNN-RNN combination representing $M$ .

The Controller 𝑪𝑪\boldsymbol{C} Learning to Exploit RNN World Model 𝑴𝑴\boldsymbol{M}

Here we describe ways of using the world model, $M$ , of Sec. 4 to facilitate the task of the RL controller, $C$ . Especially the systems of Sec. 5.3 overcome drawbacks of early $CM$ systems mentioned in Sec. 1.3.1, 1.3.2. Some of the setups of the present Sec. 5 can be viewed as special cases of the general scheme in Sec. 2.2.

We start with details of an approach whose principles date back to the early 1990s (Sec. 1.3.2). Given an RNN or RNN-like $M$ as in Sec. 4, we implement $C$ as a traditional RL machine [245, Sec. 6.2] based on the Markov assumption (Sec. 1.3). While $M$ is processing the history of actions and observations to predict future inputs, the internal states of $M$ are used as inputs to a predictor of cumulative expected future reward.

$C$ is an RL machine with $2m+2n+h$ -dimensional inputs and $o$ -dimensional outputs. At time $t$ , $state(t)$ is fed into $C$ , which then computes action $out(t)$ . Then $M$ computes from $sense(t)$ , $hidden(t)$ and $out(t)$ the values $hidden(t+1)$ and $pred(t+1)$ . Then $out(t)$ is executed in the environment, to obtain the next input $sense(t+1)$ .

The parameters or weights of $C$ are trained to maximize reward by a standard RL method such as Q-learning or similar methods . Note that most of these methods evaluate not only input events but pairs of input and output (action) events.

In one of the simplest cases, $C$ is just a linear perceptron FNN (instead of an RNN like in the early system ). The fact that $C$ has no built-in memory in this case is not a fundamental restriction since $M$ is recurrent, and has been trained to predict not only normal sensory inputs, but also reward signals. That is, the state of $M$ must contain all the historic information relevant to maximize future expected reward, provided the data history so far already contains the relevant experience, and $M$ has learned to compactly extract and represent its regular aspects.

This approach is different from other, previous combinations of traditional RL [245, Sec. 6.2] and RNNs which use RNNs only as value function approximators that directly predict cumulative expected reward, instead of trying to predict all sensations time step by time step. The $CM$ system in the present section separates the hard task of prediction in partially observable environments from the comparatively simple task of RL under the Markovian assumption that the current input to $C$ (which is $M$ ’s state) contains all information relevant for achieving the goal.

2 𝑪𝑪\boldsymbol{C} as an Evolutionary RL (R)NN whose Inputs are 𝑴𝑴\boldsymbol{M}’s Activations

This approach is essentially the same as the one of Sec. 5.1, except that $C$ is now an FNN or RNN trained by evolutionary algorithms applied to NNs , or by policy gradient methods [245, Sec. 6.6], or by Compressed NN Search; see Sec. 1. $C$ has $2m+2n+h$ input units and $o$ output units. At time $t$ , $state(t)$ is fed into $C$ , which computes $out(t)$ ; then $M$ computes $hidden(t+1)$ and $pred(t+1)$ ; then $out(t)$ is executed to obtain $sense(t+1)$ .

3 𝑪𝑪\boldsymbol{C} Learns to Think with 𝑴𝑴\boldsymbol{M}: High-Level Plans and Abstractions

Our RNN-based $CM$ systems of the early 1990s (Sec. 1.3.1) could in principle plan ahead by performing numerous fast mental experiments on a predictive RNN world model, $M$ , instead of time-consuming real experiments, extending earlier work on reactive systems without memory . However, this can work well only in (near-)deterministic environments, and, even there, $M$ would have to simulate many entire alternative futures, time step by time step, to find an action sequence for $C$ that maximizes reward. This method seems very different from the much smarter hierarchical planning methods of humans, who apparently can learn to identify and exploit a few relevant problem-specific abstractions of possible future events; reasoning abstractly, and efficiently ignoring irrelevant spatio-temporal details.

We now describe a $CM$ system that can in principle learn to plan and reason like this as well, according to the AIT argument (Sec. 2.1). This should be viewed as a main contribution of the present paper. See Figure 1.

Consider an RNN $C$ (with typically rather small feasible search space) as in Sec. 5.2. We add standard and/or multiplicative learnable connections (Sec. 3.1) from some of the units of $C$ to some of the units of the typically huge unsupervised $M$ , and from some of the units of $M$ to some of the units of $C$ . The new connections are said to belong to $C$ . $C$ and $M$ now collectively form a new RNN called $CM$ , with standard activation spreading as in Sec. 3.1. The activations of $M$ are initialized to default values at the beginning of each trial. Now $CM$ is trained on RL tasks in line with step 3 of algorithm 1, using search methods such as those of Sec. 5.2 (compare Sec. 1). The (typically many) connections of $M$ , however, do not change—only the (typically relatively few) connections of $C$ do.

What does that mean? It means that now $C$ ’s relatively small candidate programs are given time to “think” by feeding sequences of activations into $M$ , and reading activations out of $M$ , before and while interacting with the environment. Since $C$ and $M$ are general computers, $C$ ’s programs may query, edit or invoke subprograms of $M$ in arbitrary, computable ways through the new connections. Given some RL problem, according to the AIT argument (Sec. 2.1), this can greatly accelerate $C$ ’s search for a problem-solving weight vector $\hat{w}$ , provided the (time-bounded ) mutual algorithmic information between $\hat{w}$ and $M$ ’s program is high, as is to be expected in many cases since $M$ ’s environment-modeling program should reflect many regularities useful not only for prediction and coding, but also for decision making. An alternative way of letting $C$ learn to access the program of $M$ is to add $C$ -owned connections from the weights of $M$ to units of $C$ , treating the current weights of $M$ as additional real-valued inputs to $C$ . This, however, will typically result in a much larger search space for $C$ . There are many other variants of the general scheme described in Sec. 2.2.

This simple but novel approach is much more general than previous computable, but restricted, ways of letting a feedforward $C$ use a model $M$ (Sec. 1.3.1)[245, Sec. 6.1], by simulating entire possible futures step by step, then propagating error signals or temporal difference errors backwards (see Section 1.3.1). Instead, we give $C$ ’s program search an opportunity to discover sophisticated computable ways of exploiting $M$ ’s code, such as abstract hierarchical planning and analogy-based reasoning. For example, to represent previous observations, an $M$ implemented as an LSTM network (Sec. 1.2) will develop high-level, abstract, spatio-temporal feature detectors that may be active for thousands of time steps, as long as those memories are useful to predict (and thus compress) future observations . However, $C$ may learn to directly invoke the corresponding “abstract” units in $M$ by inserting appropriate pattern sequences into $M$ . $C$ might then short-cut from there to typical subsequent abstract representations, ignoring the long input sequences normally required to invoke them in $M$ , thus quickly anticipating a few possible positive outcomes to be pursued (plus computable ways of achieving them), or negative outcomes to be avoided.

Note that $M$ (and by extension $M$ ) does not at all have to be a perfect predictor. For example, it won’t be able to predict noise. Instead $M$ will have learned to approximate conditional expectations of future inputs, given the history so far. A naive way of exploiting $M$ ’s probabilistic knowledge would be to plan ahead through naive step-by-step Monte-Carlo simulations of possible $M$ -predicted futures, to find and execute action sequences that maximize expected reward predicted by those simulations. However, we won’t limit the system to this naive approach. Instead it will be the task of $C$ to learn to address useful problem-specific parts of the current $M$ , and reuse them for problem solving. Sure, $C$ will have to intelligently exploit $M$ , which will cost bits of information (and thus search time for appropriate weight changes of $C$ ), but this is often still much cheaper in the AIT sense than learning a good $C$ program from scratch, as in our previous non-RNN AIT-based work on algorithmic transfer learning , where self-invented recursive code for previous solutions sped up the search for code for more complex tasks by a factor of 1000.

Numerous topologies are possible for the adaptive connections from $C$ to $M$ , and back. Although in some applications $C$ may find it hard to exploit $M$ , and might prefer to ignore $M$ (by setting connections to and from $M$ to zero), in some environments under certain $CM$ topologies, $C$ can greatly profit from $M$ .

While $M$ ’s weights are frozen in step 3 of algorithm 1, the weights of $C$ can learn when to make $C$ attend to history information represented by $M$ ’s state, and when to ignore such information, and instead use $M$ ’s innards in other computable ways. This can be further facilitated by introducing a special unit, $\hat{u}$ , to $C$ , where $\hat{u}(t)all(t)$ instead of $all(t)$ is fed into $M$ at time $t$ , such that $C$ can easily (by setting $\hat{u}(t)=0$ ) force $M$ to completely ignore environmental inputs, to use $M$ for “thinking” in other ways.

Should $M$ later grow (or shrink) in step 4 of algorithm 1, in line with Sec. 4.2, $C$ may in turn grow additional connections to and from $M$ (or lose some) in the next incarnation of step 3.

4 Incremental / Hierarchical / Multitask Learning of 𝑪𝑪\boldsymbol{C} with 𝑴𝑴\boldsymbol{M}

A variant of the approach in Sec. 5.3 incrementally trains $C$ on a never-ending series of tasks, continually building on solutions to previous problems, instead of learning each new problem from scratch. In principle, this can be done through incremental NN evolution , hierarchical NN evolution , hierarchical Policy Gradient algorithms , or asymptotically optimal ways of algorithmic transfer learning .

Given a new task and a $C$ trained on several previous tasks, such hierarchical/incremental methods may freeze the current weights of $C$ , then enlarge $C$ by adding new units and connections which are trained on the new task. This process reduces the size of the search space for the new task, giving the new weights the opportunity to learn to use the frozen parts of $C$ as subprograms.

Incremental variants of Compressed RNN Search (Sec. 1) do not directly search in $C$ ’s potentially large weight space, but in the frequency domain by representing the weight matrix as a small set of Fourier-type coefficients. By searching for new coefficients to be added to already learned set responsible for solving previous problems, $C$ ’s weight matrix is fine tuned incrementally and indirectly (through superpositions). Given a current problem, in AIT-based OOPS style , we may impose growing run time limits on programs tested on $C$ , until a solution is found.

Exploration: Rewarding 𝑪𝑪\boldsymbol{C} for Experiments that Improve 𝑴𝑴\boldsymbol{M}

Humans, even as infants, invent their own tasks in a curious and creative fashion, continually increasing their problem solving repertoire even without an external reward or teacher. They seem to get intrinsic reward for creating experiments leading to observations that obey a previously unknown law that allows for better compression of the observations—corresponding to the discovery of a temporarily interesting, subjectively novel regularity (compare also ).

For example, a video of 100 falling apples can be greatly compressed via predictive coding once the law of gravity is discovered. Likewise, the video-like image sequence perceived while moving through an office can be greatly compressed by constructing an internal 3D model of the office space . The 3D model allows for re-computing the entire high-resolution video from a compact sequence of very low-dimensional eye coordinates and eye directions. The model itself can be specified by far fewer bits of information than needed to store the raw pixel data of a long video. Even if the 3D model is not precise, only relatively few extra bits will be required to encode the observed deviations from the predictions of the model.

Even mirror neurons are easily explained as by-products of history compression as in Sec. 3 and 4. They fire both when an animal acts and when the animal observes the same action performed by another. Due to mutual algorithmic information shared by perceptions of similar actions performed by various animals, efficient RNN-based predictive coding (Sec. 3, 4) profits from using the same feature detectors (neurons) to encode the shared information, thus saving storage space.

Given the $C$ - $M$ combinations of Sec. 5, we motivate $C$ to become an efficient explorer and an artificial scientist, by adding to its standard external reward (or fitness) for solving user-given tasks another intrinsic reward for generating novel action sequences ( $=$ experiments) that allow $M$ to improve its compression performance on the resulting data .

At first glance, repeatedly evaluating $M$ ’s compression performance on the entire history seems impractical. A heuristic to overcome this is to focus on $M$ ’s improvements on the most recent trial, while regularly re-training $M$ on randomly selected previous trials, to avoid catastrophic forgetting.

A related problem is that $C$ ’s incremental program search may find it difficult to identify (and assign credit to) those parts of $C$ responsible for improvements of a huge, black box-like, monolithic $M$ . But we can implement $M$ as a self-modularizing, computation cost-minimizing, winner-take-all RNN . Then it is possible to keep track of which parts of $M$ are used to encode which parts of the history. That is, to evaluate weight changes of $M$ , only the affected parts of the stored history have to be re-tested . Then $C$ ’s search can be facilitated by tracking which parts of $C$ affected those parts of $M$ . By penalizing $C$ ’s programs for the time consumed by such tests, the search for $C$ is biased to prefer programs that conduct experiments causing data yielding quickly verifiable compression progress of $M$ . That is, the program search will prefer to change weights of $M$ that are not used to compress large parts of the history that are expensive to verify . The first implementations of this simple principle were described in our work on the PowerPlay framework , which incrementally searches the space of possible pairs of new tasks and modifications of the current program, until it finds a more powerful program that, unlike the unmodified program, solves all previously learned tasks plus the new one, or simplifies/compresses/speeds up previous solutions, without forgetting any. Under certain conditions this can accelerate the acquisition of external reward specified by user-defined tasks.

Conclusion

We introduced novel combinations of a reinforcement learning (RL) controller, $C$ , and an RNN-based predictive world model, $M$ . The most general $CM$ systems implement principles of algorithmic as opposed to traditional information theory. Here both $M$ and $C$ are RNNs or RNN-like systems. $M$ is actively exploited in arbitrary computable ways by $C$ , whose program search space is typically much smaller, and which may learn to selectively probe and reuse $M$ ’s internal programs to plan and reason. The basic principles are not limited to RL, but apply to all kinds of active algorithmic transfer learning from one RNN to another. By combining gradient-based RNNs and RL RNNs, we create a qualitatively new type of self-improving, general purpose, connectionist control architecture. This RNNAI may continually build upon previously acquired problem solving procedures, some of them self-invented in a way that resembles a scientist’s search for novel data with unknown regularities, preferring still-unsolved but quickly learnable tasks over others.