Observed Universality of Phase Transitions in High-Dimensional Geometry, with Implications for Modern Data Analysis and Signal Processing

David L. Donoho, Jared Tanner

Introduction

Recent work has exposed a phenomenon of abrupt phase transitions in high-dimensional geometry. The phase transitions amount to rapid shifts in the likelihood of a property’s occurrence when a dimension parameter crosses a critical level (a threshold). We start with a concrete example, and then identify surprising parallels in data analysis and signal processing.

Suppose we have a sample X1X_{1}, …, XnX_{n} of independent standard normal random variables in dimension dd, forming a point cloud of nn points in Rd\bf R^{d}. Our intuition suggests that a few of the points will lie on the ‘surface’ of the dataset, that is, the boundary of the convex hull; the rest will lie ‘inside’, i.e. interior to the hull. However, if dd is a fixed fraction of nn and both are large, our intuition is completely violated. Instead, all of the points are on the boundary of the convex hull – none is interior. Moreover, the line segment connecting the typical pair of points does not intersect the interior; in complete defiance of expectation, it stays on the boundary. Even more, for kk in some appreciable range, the typical kk-tuple spans a convex hull which does not intersect the interior! For humans stuck all their lives in three-dimensional space, such a situation is hard to visualize.

The phenomenon of phase transition appears as follows: such seemingly strange behavior continues for quite large kk, up to a predictable threshold given by a formula k=dρ(d/n;T)k^{*}=d\cdot\rho(d/n;T), where ρ()\rho() is defined in §2 below. Below this threshold (i.e. kk a bit smaller than kk^{*}), the strange behavior is observed; but suddenly, above this threshold (i.e. for kk a bit larger) our normal low-dimension intuition works again – convex hulls of kk-tuples of points indeed intersect the interior.

This curious phenomenon in high-dimensional geometric probability is one of a small number of fundamental such phase transitions. We claim they have consequences in several applied fields:

in selecting models for statistical data analysis of large datasets,

in coping with outlying measurements in designed experiments,

in determining how many samples we need to take in designing imaging devices.

The consequences can be both profound and important. They range from negative-philosophical – if your database has too many ’junk’ variables in it nothing can be learned from it – to positive-practical – it isn’t really necessary to sit cooped up for an hour in a medical MRI scanner: with the right software, the necessary data could be collected in a fraction of the time commonly used today.

Our paper will help the reader understand more precisely what these phase transitions are and where they may occur in science and technology; it will then discuss our

Main contribution. We have observed a universality of threshold locations across a range of underlying probability distributions. We are able to change the underlying distribution from Gaussian to any one of a variety of non-Gaussian choices, and we still observe phase transitions at the same locations.

We compiled evidence based on millions of random trials and observed the same phase transitions even for several highly non-iid ensembles. We here formally state and test the universality hypothesis.

Our research leads to an intriguing challenge for high-dimensional geometric probability:

Open problem. Characterize the universality class containing the standard Gaussian: i.e. the class of matrix ensembles leading to phase transitions matching those for Gaussian polytopes.

Evidently this class is fairly broad. In view of the significance of these phase transitions in applications, this is quite an attractive challenge. We begin by illustrating three surprising appearances of these phase transitions.

2. First surprise: model selection with large databases

A characteristic feature of today’s data deluge is the tendency in each field to collect ever more and more measurements on each observed entity, whether it be a pixel of sky, a sample of blood or a sick patient. Technology continually puts in our hands high-throughput measurement equipment making ever more varied and ever more detailed numerical measurements on the spectrum of light, the protein expression in whole blood or fluctuations in neural or muscular activity.

As a result, observed entities are represented by ever higher-dimensional feature vectors. In fact the transition between the 20th and 21st centuries marked a sudden increase in the dimensionality of typical datasets that scientists studied, so that it became unremarkable for each observational unit to be represented as a data point in a pp-dimensional space with pp very large – in the hundreds, thousand or millions.

The modern trend to high-throughput measurement devices often does not address the fundamental difficulty of obtaining good observational units. Scientists face the same troubles they always have faced when searching for subjects affected by a rare disease, or observing rare events in distant galaxies. Hence, in many fields the number of observational units stays small, perhaps in the hundreds (or even dozens), but each of those few units can now be routinely subjected to unprecedented density of numerical description.

Orthodox statistical methods assumed a quite different set of conditions: an abundance of observational units and a very limited set of measured characteristics on each unit. Modern statistical research is intensively developing new tools and theory to address the new unorthodox setting; such research comprised much of the activity in the recent 6-month Newton Institute programme Statistical Theory and Methods for Complex, High-Dimensional Data.

Consider a linear modelling scenario going back to Legendre, Gauss and perhaps even before. We have available a response variable YY which we intend to model as a linear function of up to pp numerical predictor variables X1X_{1}, …, XpX_{p}. We contemplate an utterly standard multvariate linear model, Y=α+jβjXj+ZY=\alpha+\sum_{j}\beta_{j}X_{j}+Z where the βj\beta_{j} are regression coefficients and ZZ is a standard normal measurement error. In words, the expected value of YY given {Xj}\{X_{j}\} is a linear combination with coefficients βj\beta_{j}.

Suppose we have a collection of measurements (Yi,Xi,j,j=1,,p)(Y_{i},X_{i,j},j=1,\dots,p), one for each observational unit ii. We will use these data to estimate the βj\beta_{j}’s, allowing future predictions of YY given the XjX_{j}’s.

In the ‘21st Century Setting’ described above, we have more predictor variables than observations, meaning p>np>n. While Legendre and Gauss may have understood the p<np<n case, they would have been very troubled by the p>np>n case: there are more unknowns than equations, and there is noise to boot!

A key feature of high-throughput analysis is that batches of potential predictors are automatically measured but one does not know in advance which, if any, may be useful in a particular project. Researchers in applied sciences where high throughput studies are popular (e.g. genomics, proteomics, metabolonomics) believe that some small fraction of the measured features are useful, among many useless ones. Unfortunately, high-throughput techniques give us everything, useful and useless, all mixed together in one batch.

In this setting, a reasonable response is forward stepwise linear regression. We proceed in stages, starting with the simple model Y=α+ZY=\alpha+Z (i.e. no dependence on XX’s) and at each stage expand the model by adding the single variable XjX_{j} offering the strongest improvement in prediction.

Donoho & Stodden (2006) conducted a simulation experiment using forward stepwise regression with False-Discovery-Rate control stopping rule. Their experiment chose p=200p=200 and explored a range of n<pn<p cases. Letting kk denote the number of useful predictors among the pp potential predictors, they set up true underlying regression models reflecting the choice (k,n,p)(k,n,p), ran the stepwise regression routine, and recorded the mean squared prediction error of the resulting estimate. Figure 1 displays results: the coloured attribute gives the relative mean-squared error of the estimate; the axes present the ratio δ=n/p\delta=n/p of observations to variables, with 0<δ<10<\delta<1 in this brave new world, and ρ=k/n\rho=k/n of useful variables to observational units. Evidently there is an abrupt change in performance: one can suddenly ‘fall off a cliff’ by slightly increasing the number of useless variables per useful variable. The surprise is that the ‘cliff’ is roughly at the same position as the overlaid curve. That curve, denoted by ρ(δ;C)\rho(\delta;C) and defined fully below, derives from combinatorial geometryThe auxilary parameter CC in ρ(δ;C)\rho(\delta;C) is used to indicate the connection of this curve with the standard cross-polytope CC, defined in (2.4), from which it is derived. notions similar to those in §1.1.

A standard scientific data analysis approach in the ‘21st century setting’ ‘falls off a cliff’, failing abruptly when the model becomes too complex.

The location of this failure (ratio of model variables to observations) matches a curve derived from the field of geometric combinatorics!

3. Surprise 2: robustness in designed experiments

We now consider a problem in robust statistics. Suppose that a response variable YY is thought to depend on pp independent variables X1X_{1}, …, XpX_{p}. Unlike the data-drenched high-throughput observational studies of §1.2 we are in a classical designed experiment, with p<np<n. The dependence is linear, so we again have Y=jβjXj+Z+WY=\sum_{j}\beta_{j}X_{j}+Z+W. The error ZZ is again normal, but WW is a ‘wild’ variable containing occasional very large outliers.

is purported to be ‘robust’, particularly in designed experiments where wild XX’s do not occur. Let us focus on a specific designed experiment, where XX is an nn by pp partial Hadamard matrix, i.e. pp columns chosen at random from an n×nn\times n Hadamard matrix. This design chooses Xi,jX_{i,j} either +1+1 or 1-1 in a very specific way; there are no wild XX’s. We let the outlier generators WW have most entries , but, in our study, a small fraction ϵ=k/n\epsilon=k/n – at randomly-chosen sites – will have very large values; in any one realization they can either be all large positive or all large negative.

The simulation results, properly calibrated, closely match seemingly unrelated phenomena in sparse linear modelling in the p>np>n case.

This critical fraction matches a known phase transition in geometric combinatorics.

4. Surprise 3: compressed sensing

We now leave the field of data analysis for the field of signal processing.

Since the days of Shannon, Nyquist, Whittaker and Kotelnikov, the ‘sampling theorem’ has helped engineers decide how much data need to be acquired in design of measurement equipment. Consider, therefore, the following imaging problem. We wish to acquire a signal x0x_{0} having NN entries. Now suppose that only kNk\ll N of those pixels are actually nonzero – we do not know which ones are nonzero, or even that this is true. There are NN degrees of freedom here, since any of the NN pixel values vary.

Figure 3 shows results of computational experiments conducted by the authors. In those experiments we chose N=200N=200 and varying levels of kk and nn. The horizontal axis δ=n/N\delta=n/N measures the undersampling ratio – how many fewer measurements we are making than the customary NN. The vertical axis ρ=k/n\rho=k/n measures to what extent the effective number of degrees of freedom kk is smaller than the number of measurements. The contours indicate the success rate. The superimposed curve ρ(δ;C)\rho(\delta;C) roughly coincides with the empirical 50% success rate curve.

Interpretation: We can violate the usual ‘sampling theorem’ (nNn\geq N) with impunity! The true limit is nk/ρ(n/N;C)n\gtrsim k/\rho(n/N;C), where ρ(δ;C)\rho(\delta;C) is the curve decorating the displayThe symbol \gtrsim denotes an asymptotic relatonship; for precise conditions see Donoho & Tanner (2009aa).. We have seen this curve twice before already; it arises in a superficially unrelated problem in high-dimensional geometric combinatorics.

5. The connection to high-dimensional geometric combinatorics

Recall the problem in geometric probability we discussed in §1.1. Draw a sequence of nn samples X1,,XnX_{1},\dots,X_{n} from a standard dd-dimensional normal distribution. Let P=Conv(X1,Xn)P=Conv(X_{1},\dots X_{n}) denote the convex hull of these nn points; this is a random convex polytope.

Suppose that dd and nn are both large, and let δ=d/n\delta=d/n. Figure 4 presents a black curve to be called ρ(δ;T)\rho(\delta;T) and formally definedThe auxilary parameter TT in ρ(δ;T)\rho(\delta;T) associates this curve with TN1T^{N-1}, the standard simplex (2.3); see §2. in §2. It has the following interpretation.

Let ρ=k/d\rho=k/d. Suppose that ρ<ρ(d/n;T)\rho<\rho(d/n;T) and that nn and dd are both large, with d<nd<n. For the typical k+1k+1 tuple (Xi1,Xik+1)(X_{i_{1}},\dots X_{i_{k+1}}),

every XijX_{i_{j}} is a vertex of PP, 1jk+11\leq j\leq k+1;

every line segment [Xij,Xij][X_{i_{j}},X_{i_{j^{\prime}}}] is an edge of PP, 1j,jk+11\leq j,j^{\prime}\leq k+1;

the convex polytope Conv(Xi1,Xik+1)Conv(X_{i_{1}},\dots X_{i_{k+1}}) is a kk-dimensional face of PP.

Figure 4 also presents a second, lower, black curve, which is actually the one we have seen in our three surprises. This curve, denoted by ρ(δ;C)\rho(\delta;C) and defined in §2, has the following interpretation. Draw the same nn samples X1,,XnX_{1},\dots,X_{n} from a standard dd-dimensional normal distribution. Now let Q=Conv(X1,,Xn,X1,,Xn)Q=Conv(X_{1},\dots,X_{n},-X_{1},\dots,-X_{n}) denote the convex hull of the 2n2n points including the original nn points and their reflections through the origin. This is a random centrosymmetric convex polytope.

Let ρ=k/d\rho=k/d and ϵ>0\epsilon>0. Suppose that ρ<ρ(d/n;C)(1ϵ)\rho<\rho(d/n;C)(1-\epsilon) and that nn and dd are both large. For the typical k+1k+1 tuple (Xi1,Xik+1)(X_{i_{1}},\dots X_{i_{k+1}}),

every ±Xij\pm X_{i_{j}} is a vertex of QQ, 1jk+11\leq j\leq k+1;

every line segment [±Xij,±Xij][\pm X_{i_{j}},\pm X_{i_{j^{\prime}}}] is an edge of QQ, 1j,jk+11\leq j,j^{\prime}\leq k+1;

the typical convex polytope Conv(±Xi1,±Xik+1)Conv(\pm X_{i_{1}},\dots\pm X_{i_{k+1}}) is a kk-dimensional face of QQ.

In short, the curve arising each time in Surprises 1-3 involves convex hulls of symmetrized Gaussian point clouds. The curve involving ρ(δ;T)\rho(\delta;T) would arise if we had instead positivity constraints on the objects to be recovered (Surprises 1 and 3) or on the outliers (Surprise 2).

6. This paper

The curve ρ(δ;C)\rho(\delta;C) describes properties of high-dimensional polytopes deriving from the Gaussian distribution. What now seems surprising about Surprises 1-3 is the lack of formal connection to those polytopes in the applications. For example

In Surprise 1, stepwise regression as practised by statisticians seems unrelated to convex polytopes.

In Surprise 2, neither a Hadamard design nor outliers have any formal connection to any Gaussian distribution.

In Surprise 3, observing random frequencies of the Fourier transform of a two-dimensional signal has no visible connection to any Gaussian distribution.

The curves ρ(δ;C)\rho(\delta;C) and ρ(δ;T)\rho(\delta;T) accurately describe thresholds in many situations where the Gaussian distribution is not present; in fact we have witnessed it in cases where the points of a point cloud were chosen deterministically. We believe this signals a new kind of limit theorem in probability theory that, when formalized, will make precise a new kind of universality phenomenon in high-dimensional geometry.

Geometric combinatorics and phase transitions

Let PP be a convex polytope in RN\bf R^{N}, i.e. the convex hull of points p1,,pmp_{1},\dots,p_{m}. Let AA be an n×Nn\times N matrix. The image Q=APQ=AP lives in Rn\bf R^{n}; it is a convex set, in fact a polytope, the convex hull of points Ap1,,ApmAp_{1},\dots,Ap_{m}. QQ is the result of ‘projecting’ PP from RN\bf R^{N} down to Rn\bf R^{n} and will be called the projected polytope.

The polytopes PP and QQ are have vertices, edges, 22-dimensional faces, … . Let fk(P)f_{k}(P) and fk(Q)f_{k}(Q) denote the the number of such kk-dimensional faces; thus f0(P)f_{0}(P) is the number of vertices of PP and fN(P)f_{N}(P) the number of facets, while f0(Q)f_{0}(Q) is the number of vertices and fn(Q)f_{n}(Q) the number of facets. Projection can only reduce the number of faces, so

Three very special families of polytopes PP are available in every dimension N>2N>2, the so-called regular polytopes. Here we consider two of the three:

the simplex (an (N1)(N-1)-dimensional analogue of the equilateral triangle)

cross-polytope (an NN-dimensional analogue of the octahedron)

Statements concerning the hypercube similar to those made here for the simplex and cross-polytope are available to an interested reader in Donoho & Tanner (2008aa).

2. Connection to underdetermined systems of equations

The regular polytopes are simple and beautiful objects, but are not commonly thought to be useful objects. However, their face counts reveal solution properties of underdetermined systems of equations. Such underdetermined systems arise frequently in modern applications and the existence of unique solutions to such systems is responsible for the three surprises given in the introduction. Consider the case of the simplex.

Consider the underdetermined system of equations y0=Axy_{0}=Ax, where AA is n×Nn\times N, n<Nn<N, and the optimization problem (LP):

Of course, ordinarily, the system has an infinite number of solutions as does the problem (LP) .

Let AA be a fixed matrix with nn columns in general position in RNR^{N}. Consider vectors y0y_{0} with a sparse solution y0=Ax0y_{0}=Ax_{0} where x00x_{0}\geq 0 has kk nonzeros. The fraction of systems (y0,A)(y_{0},A) where (LP) has that underlying x0x_{0} as its only solution is

In short, the ratio of face counts between the projected simplex and the unprojected simplex tells us the probability that the (LP) correctly reconstructs a kk-sparse x0x_{0}.

Consider the case of the cross-polytope. Apply the optimization problem (P1) to the problem instance (y0,A)(y_{0},A) generated by

an underdetermined system of equations y0=Ax0y_{0}=Ax_{0}, where AA is n×Nn\times N, n<Nn<N . Of course, ordinarily, both the linear system and the problem (P1) have an infinite number of solutions.

Let AA be a fixed matrix with nn columns in general position in RNR^{N}. Consider vectors y0y_{0} with a sparse solution y0=Ax0y_{0}=Ax_{0} where x0x_{0} has kk nonzeros. The fraction of systems (y0,A)(y_{0},A) where (P1) has that underlying x0x_{0} as its unique solution is

In short, the ratio of face counts between the projected cross-polytope and the unprojected cross-polytope tells us the probability that (P1) can successfully recover a true underlying kk-sparse object.

In short, it is essential to know whether or not

3. Asymptotics of face counts with Gaussian matrices A𝐴A

We now consider the case where the nn by NN matrix AA has iid Gaussian random entries. Then the mapping PQP\mapsto Q is a random projection. In this case, rather amazingly, tools from polytope theory and probability theory can be combined to study the expected face counts in high dimensions. The results demonstrate rigorously the existence of sharp thresholds in face count ratios.

Let the n×Nn\times N random matrix AA have iid N(0,1)N(0,1) Gaussian elements. Consider sequences of triples (N,n,k)(N,n,k) where n=δNn=\delta N , k=ρnk=\rho n, and NN\rightarrow\infty. There are functions ρ(δ;Q)\rho(\delta;Q) for Q{T,C}Q\in\{T,C\} demarcating phase transitions in face counts:

Figure 4 displays the two curves referred to in this theorem. The simplex’s transition is higher than the cross-polytope’s: ρ(δ,T)>ρ(δ,C)\rho(\delta,T)>\rho(\delta,C) for δ(0,1)\delta\in(0,1).

Empirical results for non-Gaussian ensembles

Over the last few years we ran computer experiments generating millions of underdetermined systems of equations of various kinds, using standard optimization tools to select specific solutions, and checking whether or not the solution was unique and/or sparse. In overwhelmingly many cases, Gaussian polytope theory accurately matches the experimental results, even when the matrices involved are not Gaussian. We here summarize results about experiments with the non-Gaussian ensembles listed in table 1. Further detail is provided in the Electronic Materials Supplement (Donoho & Tanner 2009bb).

We varied the matrix shape δ=n/N\delta=n/N, and the solution sparsity levels ρ=k/n\rho=k/n. At problem size N=1600N=1600 we varied nn systematically through a grid ranging from n=160n=160 up to n=1440n=1440 in 9 equal steps. At each combination N,n,kN,n,k we considered M=200M=200 different problem instances x0x_{0} and AA, each one drawn randomly as above. We both generated nonnegative sparse vectors and solved (LP), and generated signed sparse vectors and solved (P1). The ‘signal processing language’ event ‘exact reconstruction’ corresponds to the ‘polytope language’ event ‘specific kk-face of QQ is also a kk-face of AQAQ’. In both cases we speak of success, and we call the frequency of success in MM empirical trials at a given (k,n,N)(k,n,N) the success rate. At each combination N,nN,n, we varied kk systematically to sample the success rate transition region from 5% to 95%. Figure 4 presents summary results, showing the level curves for 50% success rate, for each of the 9 ensembles above. The appropriate theoretical curves ρ(δ;Q)\rho(\delta;Q) are overlaid. The uppermost nine curves give the case of nonnegative solutions, Q=TQ=T, where we solve (LP); and the nine lower curves present the data for Q=CQ=C, where we solved (P1). (Note: the Hadamard case is exceptional and uses N=512N=512.)

At first glance, figure 4 shows excellent agreement between the actual empirical results in each matrix ensemble Visual evidence, similar to figure 4, of qualitative agreement was presented at conferences in 2006-2009 by Donoho and Tanner for all but the Expander ensemble. Inclusion of the Expander ensemble in the results presented here was motivated by evidence in Bernide et al. (2008) for an Expander ensemble (with a different choice of P(0)P(0)) which also showing qualitative agreement with the asymptotic phase transition ρ(δ;C)\rho(\delta;C). and the asymptotic theory for the Gaussian. This is not very surprising for AA from the Gaussian ensemble; it merely proves that the large-NN polytope theory works accurately already at moderate N=1600N=1600. For the other ensembles there is not, to our knowledge, any existing theory suggesting what we see so clearly here: phase transition behaviour in non-Gaussian ensembles that accurately matches the Gaussian case (compare §4).

2. Universality hypothesis

Hypothesis. Universality of Phase Transitions. Suppose that the nn by NN matrix AA is sampled randomly from a “well-behaved” probability distribution. Suppose that the NN by 11 vector x0x_{0} is sampled randomly from the set of kk-sparse vectors, either with or without positivity constraints on the nonzeros of x0x_{0}. The observed behaviour of solutions to (LP)(LP) and (P1)(P_{1}) will exhibit, as a function of (N,n,k)(N,n,k), success probabilities matching those which are proven to hold when sampling from the Gaussian distribution with large NN.

This hypothesis really contains two assertions: (a) that many matrix ensembles behave like the Gaussian; and (b) that moderate-sized NN exhibit behaviour in line with the NN\rightarrow\infty asymptotic.

The hypothesis also contains an element of vagueness, since we do not know at the time of writing how to delineate the ensembles of random matrices over which Gaussian-like behaviour will hold. Of course universality results are well known in probability theory; the Central Limit Theorem is the most well-known universality result for the distribution of sums of independent random variables. The precise universality class of the Gaussian distribution for such sums was only discovered two centuries after the phenomenon itself was identified. Apparently we are here at the stage of just identifying a comparable phenomenon. We hope it does not take two centuries to identify the corresponding universality class!

Clarification 1. In fact, our hypothesis could also be called a rigidity of the phase transition – it is invariantly located at the same place in the phase diagram across a range of matrix ensembles. In statistical physics, universality of a phase transition means something different, and much weaker – not a rigidity, but instead a flexibility of the location of a phase transition while preserving an underlying structural similarity. Our hypothesis is far stronger.

Clarification 2. In fact, there are trivial counterexamples to the hypothesis; for example the matrix AA of all ones does not generate any useful phase transition behaviour.

3. Experimental procedure

We conducted a Monte Carlo experiment to test the Universality Hypothesis.

The general procedure was like our earlier exploratory studies. We call a suite a distribution of problem instances (y,A)(y,A) fully specified by two factors: (1) the ensemble of matrices AA and (2) the ensemble of coefficients x0x_{0} generating y=Ax0y=Ax_{0}. Matrix ensembles include Bernoulli, Ternary, … Coefficient ensembles studied here have vectors of NN coefficients with only kk nonzeros, in sites chosen at random. The positive sign coefficient ensemble indicated by ++ has all nonzeros drawn uniformly from $.Thesignedcoefficientensembleindicatedby. The signed coefficient ensemble indicated by\pmhasnonzerosdrawnuniformlyfromhas nonzeros drawn uniformly from.Foreachsuitewevisitedacollectionoftriples. For each suite we visited a collection of triples(N,n,k)$. At each triple we drew a sequence of random problem instances of the given size and shape from the given problem suite. We then ran optimization software to compute the solution of the random problem. We computed observables from the obtained solution, in particular the binary observable ExactRecon, which takes the value 1 when the obtained solution is equal to the true solution within 6 digits accuracy, and zero otherwise.

We aimed to be confirmatory rather than exploratory: to use formal inferential tools, and carefully explain apparent departures from our hypothesis. Our experiments differed from earlier efforts in scope and attention to detail.

Scale. We performed 2,948,000 separate optimizations spanning 16984 different situations. Our computations required the use of as many as 200 CPUs in an available cluster and overall required 6.8 CPU years. We considered 16 problem suites based on 8 different matrix ensembles; see table 1. The scope of previous exploratory studies, which can be measured in CPU-days, is tiny by comparison.

Calibration. The vast majority of our experimental computations relied on Mosek, a commercial package. We made runs comparing the results with CVX, a popular open-source optimization package. We believe our results are consistent across optimizers.

4. Inferential formulation

Rather than go on a fishing expedition, from the outset we chose to frame our evaluation of the evidence using standard inferential procedures.

Two-sample comparisons. The strict form of the Universality Hypothesis says that the probability of unique solution under the Gaussian Ensemble is the same as the probability of unique solution at each other ensemble in the universality class. It follows that we may compare two sets of results at the same problem size, one with the Gaussian ensemble and one where everything else in the problem is the same except that a specific non-Gaussian ensemble is used. If at each ensemble we generate MM problem instances and obtain MM realizations of the observable ExactRecon, strict universality requires that the number of successes in each ensemble have a binomial probability distribution with the same success probability in both ensembles. Hence, the hypothesis really amounts to the assertion that two binomial distributions are the same. We proceed with traditional tests for equality of two binomial distributions. We chose to work with the ZZ-score:

Here p^i\hat{p}_{i} denotes “the fraction of cases where Exact Recon = 1 in ensemble ii”, and SD^(p^0p^1;M0,M1)\hat{SD}(\hat{p}_{0}-\hat{p}_{1};M_{0},M_{1}) is the appropriate standard error for comparing proportions with possibly unequal sample sizes MiM_{i}. In this comparison p^0\hat{p}_{0} describes the Gaussian baseline experiment, and p^1\hat{p}_{1} describes the non-Gaussian alternative experiment. Under the Universality Hypothesis, ZZ has an approximate standard normal distribution.

Reducing our problem merely to consideration of ZZ-scores we can formalize our hypothesis:

Strong Null Hypothesis: The ZZ scores have an approximate N(0,1)N(0,1) distribution at each value of (k,n,N)(k,n,N).

Study of asymptotics with problem size. The Strong Null Hypothesis seems implausible a priori on the strength of experience from other settings.

Consider another setting where the Gaussian distribution is universal: the central limit theorem. There, although the Gaussian distribution provides the correct limiting behaviour, there are well-understood departures from Gaussian behaviour at small problem sizes. Such departures of course decay with increasing problem size. The theory of Edgeworth expansions shows that such deviations from Gaussianity decay with problem size according to a specific power of size. Hence, for a symmetric distribution, we will see deviations of order 1/(\mboxproblemsize)1/(\mbox{problem size}) and, for an asymmetric one, deviations of order 1/\mboxproblemsize1/\sqrt{\mbox{problem size}} occur.

Analogously, in this setting we may see systematic behaviour of the ZZ-scores varying with problem size NN and perhaps also with kk. We used three problem sizes - N=200,400N=200,400 and 16001600 - so we might identify trends in the ZZ-scores with problem size.

Weak Null Hypothesis: The ZZ-scores exhibit discrepancies from the standard N(0,1)N(0,1) distribution (e.g. in means, variances, tail probabilities) which decay to zero with increasing NN.

5. Results

Results of our experiment were already summarized in figure 4. For each suite in table 1, for each value of δ=n/N\delta=n/N, we measured the value of ρ=k/n\rho=k/n at which the empirical probability of success crossed 50%. Each of the 18 different curves in figure 4 presents results for one suite; each one depicts the 50% success rate curves as a function of δ\delta. These “empirical ρ\rho curves” exhibit very strong visual agreement with the corresponding theoretical curves ρ(δ;T)\rho(\delta;T) and ρ(δ;C)\rho(\delta;C).

Formal statistical tests are much more sensitive and objective than visual impressions. Figure 5 displays the ZZ-scores (3.5) for two-sample comparisons between the Gaussian ensemble with nonnegative coefficients and the odd numbered suites; similarly, figure 6 presents two-sample comparisons between the Gaussian ensemble and the odd numbered suites. The eight panels in each figure depict differences between each of the non-Gaussian ensembles and the Gaussian ensemble. The problem shape δ=n/N\delta=n/N runs along the horizontal axis; these plots display results for N=200,400,1600N=200,400,1600 all combined.

The vast majority of the ZZ-scores in these displays fall in the range 2,2-2,2.

Finding 1: The ZZ-scores in bulk are consistent with our hypothesis of no difference between the distributions.

In effect, our experiment conducted 16,984 hypothesis tests and found relatively few ‘significant differences’ at the individual test level.

5.2. Rejection of strict universality

There are marked ‘tilts’ in the display of ZZ-scores in figures 5 and 6; linear trends with δ\delta are visually evident. Consider the general mean-shift model

where ZN(0,1){\mathcal{Z}}\sim N(0,1) is standard normal. This expresses the idea that the observed ZZ-scores exhibit ‘drift’ as a function of δ\delta and NN, but otherwise have the expected statistical properties of such scores.

If μ\mu is not truly zero in this model, then of course the null hypothesis of no difference fails. In our setting this means that the Gaussian ensemble does not give truly the same success probabilities as the ensemble being compared to it. Our analysis below rejects the hypothesis that μ=0\mu=0:

Finding 2: The ZZ-scores are not consistent with the hypothesis of Strict Universality. Exact finite-problem-size agreement of success probabilities p1p_{1} and p0p_{0} between each alternative ensemble and the Gaussian ensemble is not supported by our experiments.

5.3. Non-rejection of weak universality

We reformulate the weak universality hypothesis in terms of moments:

Refined Null Hypothesis: for each ensemble EE, μ(δ;N,E)\mu(\delta;N,E) tends to zero with increasing NN, and the standard deviation of ZZ scores approaches 11.

This hypothesis has a clear motivation. Earlier displays aided the eye with lines fitted to the means. Evidently, the mean ZZ-scores within a given suite are generally closer to zero for NN large than for NN small. Hence, in an informal appraisal, the refined null hypothesis seems quite plausible. Inspired by the ‘Higher Criticism’ (see Donoho & Jin (2009)), we compare the observed bulk distribution of ZZ scores with the theoretical distribution N(0,1)N(0,1). Table 2 shows roughly as many large ZZ scores as one would expect under the null hypothesis.

Finding 3: The ZZ-scores do not reject the hypothesis of Weak Universality. The difference between success probabilities p1p_{1} and p0p_{0} for each alternative ensemble and the Gaussian ensemble can be adequately modelled as a matrix-dependent random variable with stochastic order p1(A)p0(A)=Op(N1/2)p_{1}(A)-p_{0}(A^{\prime})=O_{p}(N^{-1/2}), where AA and AA^{\prime} are realizations in the two matrix ensembles.

6. Results not presented in the main text

In the appendix, we present a fuller record of our analyses. Key points include the following.

Transition zone scaling with NN. We verified that the width w(δ,N;Q)w(\delta,N;Q) of the zone where success probability drops from 1 to 0 scales as wN1/2w\propto N^{-1/2}.

Adequacy of probit model. We verified that the success rate varies with ρ\rho as a Probit function Φ((ρρ(δ;Q))/w(δ,N;Q))\overline{\Phi}((\rho-\rho(\delta;Q))/w(\delta,N;Q)). Here Φ\overline{\Phi} is the Gaussian survival function and ww is the transition width.

Exceptional Ensembles. It is evident from figures 5-6 that, at small N=200N=200, certain ensembles offer a relatively poor match to the Gaussian case. Most of these discrepancies can be accounted for by saying that in these exceptional ensembles, at small problem sizes, the level curve for 50% success rate is shifted noticeably below the 50% curve for the Gaussian ensemble. However, at the larger problem size N=1600N=1600, both the shift and the exceptional character of the ensembles are no longer evident. For details, see the appendix.

7. Limitations of our conclusions

We considered a limited set of matrix ensembles in this study. The ensembles are not all based on iid elements – there are dependencies among rows in the Fourier and Hadamard ensembles and among columns in the Expander ensemble. Even so, there is a certain air of ‘orthogonality’ or ‘weak independence’ in these examples.

There are exceptions to the pattern presented here. The classical example of cyclic polytopes shows that (LP) can have a notably higher success rate for very special matrices than it does for random matrices (Donoho & Tanner, 2005aa).

In addition to forward stepwise regression, (LP) and (P1), there are several competing algorithms which we do not study here. Maleki & Donoho (2009) conducted extensive empirical testing of many such algorithms and observed clear phase-transition-like behaviour, which varies from algorithm to algorithm and from the results presented here. Unlike the phase transitions presented here, which match theoretical results in combinatorial geometry, the phase transitions observed for competing algorithms are not yet supported by theoretical derivations.

Conclusion, and a glimpse beyond

Certain phase transitions in high-dimensional combinatorial geometry have been derived assuming a Gaussian distribution. We had informally observed that the Gaussian theory seemed approximately right even in some non-Gaussian cases. In this study, we made extensive computational experiments with more than a dozen matrix ensembles considering millions of instances at a range of problem sizes. Empirical results for both Gaussian and non-Gaussian ensembles show finite-NN transition bands centred around the asymptotic phase transition derived from a Gaussian assumption. The bands have a width of size O(N1/2O(N^{-1/2}), consistent with the proven behaviour for the Gaussian ensemble (Donoho & Tanner 2008bb). Such behaviour at non-Gaussian ensembles goes far beyond current theory. Adamczak et al. (2009) proved that, for a range of random matrix ensembles with independent columns, there is a region in the phase diagram where the expected success fraction tends to one, but there is no suggestion that this region matches the region for the Gaussian.

We used standard two-sample statistical inference tools to compare results from non-Gaussian ensembles with their Gaussian counterparts at the same problem size and sparsity level. We observed fairly good agreement of the two-sample ZZ-scores with the null hypothesis of no difference; however, fitting a linear model to an array of such ZZ-scores we were able to identify statistically significant trends of the ZZ-scores with problem size and with undersampling fraction δ=n/N\delta=n/N. The fitted trends vary from ensemble to ensemble, decay with problem size, and are consistent with weak, ‘asymptotic’, universality but not with strong, finite-NN, universality.

Our evidence points to a new form of ‘high-dimensional limit theorem’. There is some as-yet-unknown class of matrix ensembles that yield phase transitions at the same location as the Gaussian polytope transitions. Delineating this universality class seems an important new task for future work in stochastic geometry.

Appendix: suplementary statistical analysis

Gaussian ensemble. We study the basic properties of success probabilites at the Gaussian ensemble, as a function of δ\delta and ρ\rho.

Transition zone scaling with NN. We quantify the width w(δ,N;Q)w(\delta,N;Q) of the zone where success probability drops from 1 to 0. We verify that our measurement scales as N1/2N^{-1/2}.

Adequacy of Probit/Logit models. We verify that the success probability varies with ρ\rho approximately as a Probit function Φ((ρρ(δ;Q))/w(δ,N;Q))\overline{\Phi}((\rho-\rho(\delta;Q))/w(\delta,N;Q)). Here Φ\overline{\Phi} is the Gaussian survival function and ww is the transition width. A logit function fits just about as well.

Analysis of ZZ-scores. We compare success probabilities at the non-Gaussian ensembles to those at the Gaussian using ZZ-scores arising from two-sample tests for binomial proportions.

Methodology of ZZ-score comparison. We verify that in the null case of no difference, our methodology indeed finds no difference; we also verify that in the case of known difference, it indeed finds a difference.

Scaling of moments with NN. We identify nonzero means in the ZZ-scores, and show that the scaling law μ(δ,N;E)=O(1/N1/2)\mu(\delta,N;E)=O(1/N^{1/2}) best describes the data.

Exceptional ensembles. Two matrix ensembles exhibit substantial lack-of agreement with the others (e.g. some individual ZZ-scores as large as 20) for nn small. In effect the location for 50% success in those ensembles obeys a slight shift away from ρ(δ;Q)\rho(\delta;Q), of order ww. While this is an asymptotically negligible shift, failing to model it causes a noticeable lack of fit at N=200N=200 and nn small. This lack of agreement is observed to dissipate as nn increases.

Validation ensembles. Two matrix ensembles were studied only after all other analysis had been completed. Using the models arrived at in the prior analysis without changing the model form, we found that the same models describe the validation ensembles adequately, reinforcing the validity of our analysis.

All noticeable elements of lack of fit are best accounted for as evidence of effects consistent with the weak universality hypothesis.

We study n×Nn\times N random matrices AA, N>nN>n.

A matrix ensemble is a generating device for n×Nn\times N matrices. We report here results on the 9 different random matrix ensembles , listed in table 1.

We generate vectors y=Ax0y=Ax_{0} where x0x_{0} has kk nonzeros.

The nonzeros in x0x_{0} are either drawn uniformly from (0,1)(0,1) or (1,1)(-1,1), designated as coefficients ++ and ±\pm respectively.

The instances (y,A)(y,A) where the underlying x0x_{0} is nonnegative by intent are then processed using an optimizer to approximately solve (LP)(LP). The instances where the underlying x0x_{0} can be of both signs are processed by using an optimizer to approximately solve (P1)(P_{1}). In principle, the precise values of the nonzeros do not matter for properties of (P1)(P_{1}) and (LP)(LP). (n.b. For other sparsity seeking algorithms this would not be the case.)

The optimizer is presented with the problem instance (y,A)(y,A), but not x0x_{0}.

After running the optimizer, we measure ExactRecon, which takes the value 1 when the obtained solution x1x_{1}, say, is equal to the desired solution x0x_{0}, within 6 digits accuracy. It is zero otherwise.

We conduct MM independent replications at each fixed combination of N,n,kN,n,k, matrix ensemble, and coefficient type.

The variable SS totals the number of times ExactRecon was 11 in the MM replications. Results are tabulated in a data file with column headings

Here EE is an integer code specifying the suite of problem instances. Such a suite specifies both the matrix ensemble (eg Gaussian, Bernoulli, Rademacher, …) and the coefficient type (++ or ±\pm). For each matrix ensemble we consider both coefficient types.

For analysis and presentation, we use coordinates δ=n/N\delta=n/N, the matrix ‘shape’, and the solution sparsity level ρ=k/n\rho=k/n. We generally consider the success fraction p^=S/M\hat{p}=S/M.

The most important structure of the dataset concerns the constant-δ\delta slices, where nn, NN, EE are held constant and kk is varying. The success fraction p^(k,n,N;E)\hat{p}(k,n,N;E) is generally monotone decreasing in such a slice: monotone decreasing in kk for fixed nn, NN, and EE.

We focus on what is called the LD50LD50 in bioassays, the 50% quantal response more generally. It is the value of k/nk/n where S/MS/M is expected to be 1/21/2, for fixed nn,NN, EE.

1.2. Range of experiments

For each suite in table 1 and each combination of N,n,kN,n,k, we consider M=200M=200 different problem instances (y,A)(y,A) each one drawn randomly as above. Each suite is compared with a “baseline” of either suite 1 or 2 for the same combination of N,n,kN,n,k but a larger independent draw of M=1000M=1000 problem instances. We varied the matrix shape δ=n/N\delta=n/N, and the solution sparsity levels ρ=k/n\rho=k/n. At problem size N=1600N=1600, we varied nn systematically through a grid ranging from n=160n=160 up to n=1440n=1440 in 9 equal steps. The ‘signal processing language’ event ‘exact reconstruction’ corresponds to the ‘polytope language’ event ‘specific kk-face of QQ is also a kk-face of AQAQ’. In both cases we speak of success, and we call the frequency of success in MM empirical trials at a given (k,n,N)(k,n,N) the success rate. At each combination N,nN,n, we varied kk systematically to sample the success rate transition region from 5%5\% to 95%95\%.

1.3. Suites studied

In sections 3-7 we report details about experiments with the non-Gaussian suites 3123-12 and 151615-16 listed in table 1. As it happens, after the analysis of these suites was conducted, data became available for four other suites, based on the Hadamard and Rademacher ensembles. The analysis of those suites will be reported only in section 8. The Rademacher ensemble, suites 19 and 20, generated data with the same problem sizes and other parameters as suites 3123-12 and 151615-16. Our study of the Hadamard ensemble, suites 13 and 14, is restricted since only two problem sizes N=256N=256 and N=512N=512 were run.

2. Behaviour of the Gaussian ensemble

In this section, we restrict attention to the Gaussian ensemble, suites 1 and 2, and investigate these questions:

Width of transition zone. How does the width w(δ;N,Q)w(\delta;N,Q) of the transition zone at phase transition vary, as a function of NN?

Quantal Response Profile. How does the probability of success vary as a function of the reduced (ρρ(δ;Q))/w(δ;N,Q)(\rho-\rho(\delta;Q))/w(\delta;N,Q)? Where ρ=k/n\rho=k/n and δ=n/N\delta=n/N.

Behaviour of LD50LD50. In what manner does the empirical 50% point of the quantal response function approach the underlying asymptotic limit ρ(δ;Q)\rho(\delta;Q)?

The Gaussian ensemble is an appropriate place to focus attention, because:

A complete, rigorous understanding of the asymptotic behaviour exists in the Gaussian case (Donoho & Tanner 2009aa); we know that as NN\rightarrow\infty with δ\delta fixed and ρ\rho fixed away from ρ(δ,Q)\rho(\delta,Q), the success probability tends to either zero or 1. So we know that there is a transition zone, and that its width tends to as NN\rightarrow\infty.

A rigorous set of finite-NN bounds has been rigorously proven (Donoho & Tanner 2008bb); we know that the width scales like O(1/n)O(1/\sqrt{n}).

Hence, there are rigorous theoretical constraints: we know the phase transition exists asymptotically and we can constrain its width.

In the field of bioassays, the Quantal response function gives the probabiliy of organism failure (eg death) as a function of dose. In our setting, the analogous concept is the probability of algorithm failure at a fixed problem size n,Nn,N as a function of ρ=k/n\rho=k/n, the “complexity dose”.

Considering a constant-δ\delta slice at the Gaussian ensemble, suite 2, we see a roughly monotone increasing probability of failure as a function of k/nk/n. Figure 7 presents the fraction of success S/MS/M as a function of k/nk/n, at three incompleteness ratios δ=n/N\delta=n/N.

A Probit model for the dose response states that, for parameters aa, and bb, the expected fractional success rate is given by

where Φˉ\bar{\Phi} is the complementary normal distribution, and E{\mathcal{E}} denotes expectation. Figure 8 presents a first pass at checking the suitability of such a model. It identifies empirical estimates of the points where E(S/M)=α{\mathcal{E}}(S/M)=\alpha for α{1/4,1/2,3/4}\alpha\in\{1/4,1/2,3/4\}. and then chooses aa and bb in relation (5.6) to match those. It displays the raw data, the model curves, and residuals from the model.

It is standard in biostatistics to fit generalized linear models to such data; the binomial response model is appropriate here. Such models take the form

where the success probability p(δ)p(\delta), after a fixed transformation η()\eta() , obeys a linear model:

the function η\eta is called the link function.

We considered three standard link functions: the logit, probit and cauchyit links. Figure 9 presents fitted models and what the statistics analysis package R calls the working residuals for these three links. In fact the best loglikelihood is achieved among the three at the probit link, but there is a large residual at the most extreme response; it seems the probit link goes to zero too fast (this is not unexpected, owing to the ’thin tails’ of the normal distribution, and also owing to the finite-N large deviations analysis in Donoho and Tanner (2008bb)). The logistic link is nearly as good in deviance or likelihood senses, makes sense on theoretical grounds and gives more balanced residuals. (Note however, that as figure 8 showed, the Probit fit is adequate as long as we look at ordinary rather than the more statistically sensitive working residuals.)

We used R to fit these models; this has the advantage of automatically providing standard inferential tools – confidence bounds for aa and bb and goodness-of-link tests.

2.2. Behaviour of L​D​50𝐿𝐷50LD50

In bioassays, the LD50LD50 is the dose that corresponds to 50/50 chance of failure. This can be estimated from binomial response data in two ways.

The first, ‘nonparametric’ method finds the largest ratio k/nk/n where S>M/2S>M/2 for a given nn, NN and EE. We found a slight refinement useful: we fit a linear spline to the success ratios and solved for the (smallest) value of δ\delta where the spline crosses 5050%.

Using this method, we obtained table 3, which presents, for suite 2 and M=1000M=1000, the difference between the estimated LD50LD50 and the theoretical large NN limit.

Evidently, the LD50LD50 is approaching the expected phase transition with increasing NN. To quantify this effect, we have table 4, which shows that the LD50LD50 typically approaches its limit at roughly the rate 1/N1/N.

2.3. Transition zone width

We can define the α\alpha-width of the transition zone as the horizontal distance between p=αp=\alpha and p=1αp=1-\alpha on the dose-response.

We again can measure this nonparametrically and parametrically. We present here a nonparametric analog based on fitting splines to the empirical success fractions and measuring the α=0.1\alpha=0.1 and 1α=0.91-\alpha=0.9 quantile locations. We then normalize by the corresponding distance on the standard Probit curve

Table 5 presents values of Nw(δ,N)\sqrt{N}\cdot w(\delta,N) for suite 2 with M=1000M=1000.

3. Methodology of Z𝑍Z-score comparison

How does the methodology of ZZ-score comparison work on cases where we know the ground truth – both where we know there is no difference and we know there is an asymptotic difference? For each of the suites in table 1 the suite with M=200M=200 is compared against suite 1 or 2 with M=1000M=1000 independent problem instances for the same values of N,n,kN,n,k. Suites 1 and 2 with M=1000M=1000 form the baseline against which all ZZ-scores are calculated. Unless specified otherwise, suites with the same coefficient sign are compared; for example suite 9 is compared with suite 1 and suite 16 is compared with suite 2.

When the two problem settings being compared are simply replications of the same underlying conditions, we can be sure that H0H_{0}: no difference is true. We compare here suites 1 and 2 with M=200M=200 against the baseline experiments of suites 1 and 2 with M=1000M=1000. This follows the same procedure as will be conducted later when comparing non-Gaussian ensembles against the baseline.

Figure 10 Panel(a) presents the bulk distribution of ZZ-scores for suite 2; Panel (b) presents the bulk distribution of ZZ-scores for suite 1. The figure presents PP-plots: the fraction of ZZ-scores exceeding a threshold versus the fraction to be expected at the standard Normal. If the ZZ-scores were exactly standard normal, these plots would be close to the identity line, which is, in fact what we see.

Table 6 shows that of 180 ZZ-scores associated with comparisons of suite 11, 170 were less than 22 in absolute value; for 94.494.4 % – very much in line with an assumed standard N(0,1)N(0,1) null distribution. It also shows that of 181 ZZ-scores associated with comparisons of suite 22, 175 were less then 22 in absolute value; for 96.796.7 % – very much in line with an assumed standard N(0,1)N(0,1) null distribution. We thus see that under a true null hypothesis, our ZZ-scores behave largely as if they were N(0,1)N(0,1). This is an observation that needed to be checked, since ZZ-scores, when constructed in the way we have done so here, only are known to have an asymptotically normal distribution.

Figures 5 and 6 presented ZZ-scores in scatterplots of Z(k,n,N;E)Z(k,n,N;E) versus δ\delta comparing non-Gaussian ensembles with Gaussian ensembles; what happens when we have a true null hypothesis?

Figure 11 (a) and (b) presents the ZZ-scores for suites 1 and 2 respectively with fitted lines modelling the dependence on δ\delta. In principle, the ZZ-scores all have mean zero and there is no expected trend. However, owing to sampling fluctuation, we obtain nonzero intercepts and slopes. Table 7 shows the results that obtained in this truly null case. We learn from this that fitted intercepts and slopes of about the size indicated in the table can be viewed as consistent with a true null hypothesis.

3.2. Under a true alternative hypothesis

Is our methodology powerful? Can it detect any differences from null?

To study this question, we considered a simple and blatant mismatch: compare suite 1 with M=200M=200 against the suite 2 with the baseline M=1000M=1000.

The baseline, suite 2, should reflect a transition near ρ(δ;C)\rho(\delta;C) while the comparison group, suite 1, should reflect a transition near ρ(δ;T)\rho(\delta;T). As these two curves are very different we should see this reflected in the ZZ-scores. And we do. Figure 12 shows the QQ-plot, which is noticeably far from the identity line.

4. Bulk behaviour of Z𝑍Z-scores

In figures 13-15 we present a sequence of QQ plots showing the bulk distribution of ZZ-scores for suites 312,15163-12,15-16 at N=200N=200, 400400 and 16001600. In each plot the identity line is also displayed; if the ZZ-scores had truly a standard normal distribution, they would oscillate around this line.

While in most suites we see a good match between the ZZ-scores and the standard normal already at N=200N=200, it is not until N=1600N=1600 when every ensemble seems to yield approximately N(0,1)N(0,1) scores. Even then, suites 11 and 12 (Ternary (1/10)) exhibit some noticeable deviations. These effects are apparent for small nn. In fact, at small nn, trivial linear dependencies are found among the columns of typical random AA at these ensembles. Such linear dependencies force affine dependencies which force lost polytope faces. Consequently, the observed S/MS/M will be decreased, this effect is discussed further in §5.7.

This is consistent with our conclusions in the main text that strict finite-NN universality does not hold, but a weaker notion of asymptotic universality does hold.

5. Linear modeling of the Z𝑍Z-scores

The QQ Plots of ZZ-scores in figures 13-15 show that suites 11, 12, 15, and 16 (the highly sparse matrix suites) behave somewhat differently than other suites.

We now report results excluding those suites, and focus instead on suites 3-10. The excluded suites 1112,151611-12,15-16 will be discussed in §5.7.

In the main text we reported fits explaining the observed ZZ-scores using two models of the form,

We considered two such models, one with the restrictions α0=0\alpha_{0}=0 and β0=0\beta_{0}=0, and one without.

We first report results using the submodel α0=0\alpha_{0}=0 and β0=0\beta_{0}=0, and later the full model, without the restriction. The display below shows the output of the RR linear model command for the restricted model. The symbol de denotes δ1/2\delta-1/2 and the symbol probSize denotes 1/N1/\sqrt{N}. The output lists interaction terms such probSize:En. It turns out that α1(N,E)\alpha_{1}(N,E) is the sum of probSize and probSize:En. Similarly β1(N,E)\beta_{1}(N,E) is the sum of probSize:de and probSize:En:de.

One sees that the coefficients in this model are significant, and that the residual variance is roughly 11, as is to be expected of proper ZZ-scores. (The reduction in sum of squares in the null case due to fitting 16 degrees of freedom would negligible and has no significant bearing on our evaluation of the adequacy of residuals).

In contrast, here is the report on the fit of the full model, where there is no restriction α00\alpha_{0}\neq 0 and β00\beta_{0}\neq 0.

Every fitted term associated to α0\alpha_{0} or β0\beta_{0} is not statistically significant even at the 0.10 level. In contrast, coefficients associated to α1\alpha_{1} or β1\beta_{1} terms are mostly significant.

The adjusted R2R^{2} of the unrestricted model is actually worse than the adjusted R2R^{2} of the restricted model. The standard analysis of variance table produced by the R software anova command gives:

The improvement in variance explained using the unrestricted model is definitely not significant, as the PP value exceeds 1/2.

This lack of significance justifies our finding in the main text that weak universality holds, at least for suites 3-10.

6. Justification of scaling model with exponent 1/2121/2

in words we are saying that, within one suite, the means vary as a function of δ\delta at a given sample size, and that across sample sizes there is a root-n scaling of the means as a function of NN.

The root-n scaling can be motivated both theoretically and empirically.

We first sketch a theoretical motivation. Donoho & Tanner (2008bb) proved that, for δ\delta fixed, the success probability for the Gaussian ensemble p1(A)(0,1)p_{1}(A)\in(0,1) has a transition zone near the asymptotic phase transition ρ(δ;Q)\rho(\delta;Q) of root-N width. Namely, we showed that for ρ\rho below the asymptotic transition, p1(A)Cexp((ρρ(δ;Q))/w(δ,N,Q))p_{1}(A)\leq C\cdot\exp(-(\rho-\rho(\delta;Q))/w(\delta,N,Q)), where the width wN1/2w\asymp N^{-1/2}.

From the viewpoint of probability theory, a width w=O(N1/2)w=O(N^{-1/2}) would be typical when describing the frequency of success of an event among O(N)O(N) weakly dependent indicator variables. Perhaps there is some such interpretation here; although the rigorous proof in Donoho & Tanner (2008bb) does not provide one. If this were the case, it would not be surprising for the hypothesized underlying indicator variables to have success probabilities differing by order 1/N1/\sqrt{N} at the non-Gaussian ensembles from the corresponding ones at the Gaussian ensembles; such discrepancies in limit theorems are common. This would generate N1/2N^{-1/2} scaling in the mean ZZ-scores between the Gaussian and other ensembles.

We now turn to empirical motivation. We fit model (5.6) to each suite and sample size separately, obtaining the full collection of coefficients α(N,E)\alpha(N,E) and β(N,E)\beta(N,E) for N=200,400,N=200,400, and 16001600. We then fit an intercept-free linear model to the collection of fitted intercepts:

Table 8 presents the R2R^{2} of the different fitted models. Evidently exponent γ=1/2\gamma=1/2 gives the best fit both to intercepts and to slopes.

We don’t see an easy way to attach statistical significance to this finding, using standard software.

This fitting exercise also shows that the exponent γ=1/2\gamma=1/2 fits well enough to make the case for α00\alpha_{0}\neq 0 and β00\beta_{0}\neq 0 very weak. We fit two joint models to two datasets, one containing all the fitted intercepts α^(N,E)\hat{\alpha}(N,E) and the other containing all the fitted slopes β^(N,E)\hat{\beta}(N,E) from the same collection of ensembles and the standard problem sizes N=200,400,1600N=200,400,1600.

The first joint model for absolute intercepts included terms that do not vanish as NN grows large

The second joint model did not include such terms:

The traditional t-statistics associated with γ0\gamma_{0} terms were nonsignificant with one exception; this must be treated as unimpressive owing to multiple comparison effects. In contrast, the majority of α1\alpha_{1} terms were significant. The analysis of variance comparing the two fits gave an F statistic of 1.531.53 on 1212 and 2323 degrees of freedom with a nominal PP-value of 0.23940.2394. In short the larger model which includes a constant term independent of NN explains little. Although this cannot be used with the usual interpretation it does show that any tendency to not vanish must be very weak.

The first joint model for slopes included intercept terms:

The second joint model did not include intercept terms:

The traditional t-statistics associated with β0\beta_{0} terms were nonsignificant with one exception; this again must treated as unimpressive owing to multiple comparison effects. In contrast, the majority of β1\beta_{1} terms were significant. The analysis of variance comparing the two fits gave an F statistic of 1.351.35 on 1212 and 2323 degrees of freedom with a nominal PP-value of 0.300.30. Again the larger model which includes a constant term independent of NN explains little. Although this cannot be used with the usual interpretation it does show that any tendency to not vanish must be very weak.

7. Exceptional suites

In §5.5 we excluded suites 11,12,15,16 from analysis. Our decision was based on the fact that three of these – 11,12,15 – were to the naked eye quite different than the others, by two criteria:

Behaviour of the bulk distribution of ZZ-scores (see figures 13-15; panels in the lower row)

Behaviour of the plots of ZZ-scores versus δ\delta.

We grouped suite 16 with these based on the fact that the underlying matrix ensemble was the same as suite 15; they differ only in properties of the solution coefficients.

We remark that although the naked eye perceives differences in the ZZ-scores for δ\delta small, these suites are overall consistent with weak universality. As NN increases, at each fixed δ\delta, in each suite, the distribution of ZZ-scores gets closer to N(0,1)N(0,1). For example, the evident discrepancies that exist for small δ\delta in the QQ Plots of ZZ-scores at N=200N=200 are not apparent for large δ\delta; moreover, the discrepancies are dramatically attenuated by N=1600N=1600.

7.2. Modelling of Z𝑍Z-scores

A major part of the exceptional behaviour of these suites is the apparent nonlinear dependence of mean ZZ-score on δ\delta. We have seen that for suites 3-10, an adequate description is provided by linear dependence on δ\delta. However, for the exceptional suites this is no longer the case. Plots of residual ZZ-scores versus δ\delta show that the exceptional suites exhibit nonlinear dependence on δ\delta; typically downward sloping at δ<.5\delta<.5 and no sloping or upward sloping at δ>.5\delta>.5.

Mild nonlinearity in μ\mu can be modelled through hinged models:

These model δ\delta dependence as a linear spline with knot at δ=1/2\delta=1/2. As before we can model κ\kappa’s dependence on NN in the same way as we have done with α\alpha and β\beta:

Such hinged models are to be preferred over the linear models in the 4 exceptional suites.

We first present the R session transcript of the linear fit with N\sqrt{N} scaling terms but no constant terms:

In this fit, there is substantial lack of fit: the standard deviation of residuals, 1.854, is dramatically larger than 1.0 (the fit for suites 3-10 gave instead a residual standard deviation close to 1).

We fit the hinged model with N\sqrt{N} scaling:

Here dea2 denotes (1/2δ)+(1/2-\delta)_{+}, so κ\kappa terms are associated with dea2, in an impromptu notation: κ(N,E15)=\kappa(N,E15)= probSize:E15:dea2 + probSize:dea2 /N\sqrt{N}.

The key points to observe here are: (1) the individual coefficients associated with hinge terms are significant; and (2) the adjusted R2R^{2} (0.6071) is substantially higher than it was for a linear fit (0.5358). The analysis of variance comparing the hinged model with the linear model gives an FF statistic of 113 on 4720 and 8 DF, which is wildly significant; see the transcript:

In the above fits we considered only models imposing N1/2N^{-1/2} scaling on μ\mu. Allowing terms which do not decay in NN does not improve the fit. The following transcript shows fits of a a model for mean ZZ-score in suite EE containing non-scaling terms: μ(N,E)=μ0(N,E)+μ1(N,E)/N\mu(N,E)=\mu_{0}(N,E)+\mu_{1}(N,E)/\sqrt{N}; the fits shown in the paragraphs immediately above correspond instead to restrictions μ0=0\mu_{0}=0.

The key points to observe are: (1) The standard error of residuals is not meaningfully improved by allowing the extra explanatory terms: it drops from 1.706 for the N1/2N^{-1/2} scaling model to 1.689 for the full model; and (2) The adjusted R2R^{2} is worse for the full model than it is for the scaling model. At the level of individual effects, the bulk of the non-scaling terms are not significant, and the scaling hinge terms remain significant. We view the few significant non-scaling terms as possibly caused by the significant modeling error that still remains: i.e. since the residual standard error, at about 1.7, is about 70% higher than it would be if everything were explained in a satisfactory way.

The next table summarizes the residuals from the hinged scaling model, grouped by suite and NN. The summaries include group means, standard deviations, medians and median absolute values.

Suite 16 is adequately explained, since the standard deviation is fairly close to 1.00 at each NN.

In all suites the residual standard deviation is decreasing with NN, and in every case the residual standard deviation is less than 1.10 at N=1600N=1600. In our view this, together with the structure of the scaling model, merits the conclusion that these random matrix ensembles agree with the Gaussian ensemble for large nn.

However, there is noticeable structure at small nn. The key pattern visible in the table is the tendency of residuals to be positive at small nn.

Some further structure is visible in the δ\delta variable; figures 16 and 17 show that at N=200N=200 there is a dramatic ’blowup’ in variance at small δ\delta, which has largely disappeared at the larger problem size N=1600N=1600. They also show that suite 16 is already well-behaved at N=200N=200, and the ”variance blowup” is largely a phenomenon of suites 11 and 12. Finally, at N=1600N=1600 one can see indications of curvilinear structure, so evidently the hinged model can only be regarded as an approximation.

As it turns out, the ”variance blowup” is not a variance phenomenon at all; it is caused by significant unmodeled structure in the means as a function of ρ=k/n\rho=k/n. Figure 18 shows the ZZ-scores at suite 12, for problem size N=200N=200. The ZZ-scores are quite large and nonrandom, indicating significant unmodelled structure.

On the other hand, this unmodeled structure is largely a phenomenon of small problem size. Figure 19 shows the ZZ-scores at suite 12, now for problem size N=1600N=1600. The ZZ-scores are not so large and much more random, indicating that unmodelled structure is less of a problem, except at small δ\delta.

The unmodelled structure can be largely accounted for as a displacement of the LD50LD50 by about 1 transition width at N=200N=200. The asymptotic phase transition is at ρ(δ;Q)\rho(\delta;Q); if we assume that in suite 12 the LD50LD50 is not at ρ(;Q)\rho(\cdot;Q) but instead at ρ=ρ(δ;Q)w(δ;N,Q)\rho^{\dagger}=\rho(\delta;Q)-w(\delta;N,Q), and we assume a probit link, we can expect the ZZ-scores to have a mean

where ϕ\phi denotes the standard Gaussian density.

Figure 20 fits this model to the ZZ-scores at suite 12, for sample size N=200N=200. It accounts for most of the structure in the ZZ-scores at these values of δ\delta.

7.3. Understanding the exceptional ensembles

When NN is small and δ\delta is small, the exceptional ensembles have LD50LD50’s displaced substantially below the asymptotic phase transition, It also seems that if we keep δ\delta fixed and increase NN, such displacement eventually disappears.

It seems to us that one can understand this phenomenon as a transient effect of random sampling in discrete structures. To explain this, note that when ”success” is declared in our computations, this means that the image of a specific kk-face FF of the polytope QQ ‘survives’ projection under AA, i.e. AFAF is a face of AQAQ. (Here of course QQ is either TN1T^{N-1} or CNC^{N}, see the main text; and the preceding statement deserves and requires rigorous proof; it receives such proof in our other papers). In particular that there is no pair (x0,x1)(x_{0},x_{1}) both in FF with Ax0=Ax1Ax_{0}=Ax_{1}. But then there is no vector vv supported on those k+1k+1 sites with Av=0Av=0.

On the other hand, if there exist vectors supported on k+1k+1 sites that belong to this null space, then there definitely will be kk-faces FF that get lost under projection.

Let Ek,n,NE_{k,n,N} denote the event that there exist k+1k+1-sparse vectors in the null space of AA. Then when event Ek,n,NE_{k,n,N} occurs there will be a kk-face of QQ that does not survive projection.

For a Gaussian random matrix AA, with probability one the columns of AA are in general position, and so P(Ek,n,N)=0P(E_{k,n,N})=0 provided k<nk<n. However, for the exceptional ensembles P(Ek,n,N)>0P(E_{k,n,N})>0 even for quite small kk.

The Ternary (1/10) and Expander (1/15) ensembles generate highly sparse matrices AA. Particularly when nn is small, we have observed that these can have sparse solutions in their null space; that is, Ek,n,NE_{k,n,N} can have substantial probability.

It may help some readers to know we are effectively discussing the sparsity of the sparsest vector in the null space of AA; this is often denoted spark(A)spark(A) (Donoho & Elad, 2003). For Gaussian random matrices AA, spark(A)=nspark(A)=n, while for the Ternary and Expander ensembles, we find that spark(A)spark(A) has a good chance of being much smaller than nn.

Dossal et al. (2009) have proposed a greedy algorithm which can be used to find sparse vectors in the null space of a matrix. It provides an heuristic upper bound spark+(A)spark^{+}(A) on the spark of a matrix . Applying this algorithm to random draws of Ternary (1/10) and Expander (1/15) we observe sparse vectors satisfying Az=0Az=0; ranges of these observed sparsity levels are recorded in columns 2 and 3 of table 10.

The sparsity levels recorded in columns 2 and 3 of table 10 indicate clearly that Ek,n,NE_{k,n,N} can occur for quite small kk, but they do not indicate the size of P(Ek,n,N)P(E_{k,n,N}) or give any other information about the prevalence of these sparse null space vectors. Columns 4-9 of table 10 indicate the empirical bound spark+(A)spark^{+}(A) for suites 1-2,11-12,15-16 with S/MS/M closest to 50%. For Ternary (1/10) with N=200N=200 and n60n\leq 60 it is common to draw matrices which contain a column of all zeros, allowing for 1-sparse vectors in the null space. In such cases, spark(A)=1spark(A)=1, which is dramatically smaller than nn, what we would see with the Gaussian ensemble.

With such easy counterexamples to uniqueness it now seems remarkable that the typical case, as measured by the observed LD50LD50, is as high as we observed. This may be explained by the fact that typically, the bad kk-sets do not intersect the particular kk-set we are interested in (i.e. the one supporting the solution of interest).

For the Expander ensemble with N=200N=200 and n40n\leq 40 there are vectors in the null space with only 6 nonzeros, and null space vectors with more than 6 nonzeros become common; as a consequence, it becomes increasingly common that of the M=200M=200 problem instances presented to (P1) and (LP) for each value of kk moderately larger than 6 there will be vectors x1x_{1} which are sparser than x0x_{0} and yet which satisfy Ax1=Ax0=yAx_{1}=Ax_{0}=y.

We observe that the LD50LD50 for suites 15-16 is significantly displaced below the asymptotic value for this range of (N,n)(N,n). Suites 15-16 are observed to have similar values of LD50LD50 for this range of (N,n)(N,n). Although these LD50LD50s are similarly displaced, the effect is more noticeable in the ZZ-scores of suite 15 due to its being compared with suite 1 which has a markedly higher LD50LD50 than does suite 2 (which suite 16 is compared against).

If we keep n/Nn/N and k/nk/n constant, and k/nk/n small, but let NN increase it seems that the probability P(Ek,n,N)P(E_{k,n,N}) tends to zero rapidly. For N=1600N=1600 and δ=0.1\delta=0.1, the observed LD50LD50 values for these exceptional ensembles are within 1 of those observed for the Gaussian suites. We also observe that for N=1600N=1600 and n=160n=160 we are unable to find null space vectors with fewer than 161161 nonzeros. This suggests the following interpretation of what is happening: for each (δ,ρ)(\delta,\rho) there is a transient effect, such that as NN increases, initially it is most likely that spark(A)spark(A) is less than ρn\rho\cdot n, but for sufficiently large NN, it becomes likely that spark(A)nspark(A)\approx n.

Under our interpretation, the driving effect behind the exceptional ensembles is the phenomenon that for small NN the matrix AA has its columns not in general position; and, for fixed δ=n/N\delta=n/N the chance of observing this phenomenon decays rapidly with increasing problem size NN.

8. Validation ensembles: Rademacher and Hadamard

The computer runs reported here took place over a substantial interval of calendar time. As it happens, two of the matrix ensembles – Rademacher and partial Hadamard – were completed long after the others and the data were not available at the time of the analyses reported so far.

A basic principle in science is out-of-sample validation, namely building a model using data available at a certain moment in time and then later using data that were not available to the model construction to test the model construction.

The fresh data provided by the Rademacher and partial Hadamard runs provide an excellent opportunity for validation of our fitted models.

Such validation could be helpful for the following reason. A critic of our analysis could point out that we have used the same data to choose the decay exponent γ=1/2\gamma=1/2 in our scaling model μN1/2\mu\sim N^{-1/2}, as well as to study lack of fit of that model. This is not inferentially rigorous, as the exponent has specifically been chosen to minimize lack of fit, and we then argue that the lack of fit at such γ\gamma is not significant. The use of the same data for both tasks means that the tt-scores and pp-values no longer have the assumed distributions.

In our case, we have about 10,000 ZZ-scores, and we believe this criticism is not as serious as in many other occurrences of this practice, and we are prepared to argue this point. However, in this instance because we have an independent set of validation data, we can actually perform direct model validation and avoid lengthy rationalizations of our earlier analysis.

The validation data have one oddity: The Hadamard ensemble is computable only for special choices of NN, and we happen only to have data for N=256N=256 and N=512N=512, while for the Rademacher ensemble we have data for N=200N=200, 400400, and 16001600 as per usual.

We now use those data to validate the following points in our analysis.

Fit of N1/2N^{-1/2} model for scaling of means;

Lack of Fit at small δ\delta caused by displacement of the LD50.

It will turn out that the Rademacher ensemble is not exceptional while the Hadamard is; and for both Rademacher and Hadamard the models developed using the other ensembles fit rather well.

In figures 21 and 22 we present a sequence of QQ plots showing the bulk distribution of ZZ-scores for suites 192019-20 (at N=200N=200, 400400 and 16001600) and for suites 13-14 (at N=256N=256 and 512512). In each plot the identity line is also displayed; if the ZZ-scores had truly a standard normal distribution, they would oscillate around this line.

The Rademacher suite ZZ-scores are typical of what we have already seen in suites 3-10. Recall that ZZ-scores are expected to cluster around the line Y=XY=X at the null hypothesis. There is some noticeable bulk misfit at N=200N=200, less at N=400N=400, and by N=1600N=1600 the visual misfit has mostly disappeared. The misfit at N=200N=200 is mostly in the lower tail, and in a downward shift away from the Y=XY=X line, meaning that there are situations where our realization of the Rademacher ensemble at N=200N=200 gives noticeably better success probabilities than the Gaussian ensemble, and it also gives slightly better success probabilities on average.

The Hadamard suite ZZ-scores are typical of what we have already seen in suites 11,12, and 15. There is some very noticeable bulk misfit at N=256N=256, and much less at N=512N=512. We do not have data at larger NN. The misfit at N=256N=256 is mostly in the upper tail, meaning that there are some situations where the Hadamard ensemble at N=256N=256 gives noticeably worse success probabilities than the Gaussian ensemble. This upper tail effect has mostly disappeared by N=512N=512, however for the suite with positive coeficients at N=512N=512 we see that the slope of the ZZ-score graph is higher than 1, so the bulk distribution of ZZ-scores has seemingly higher standard deviation than 1.0. This is a sign of overdispersion, i.e. the underlying success probability p1p_{1} might best be treated as a random variable which varies from realization to realization, but which has the correct expectation p0p_{0}.

8.2. Linear models of the Z𝑍Z-scores

Above we reported fits explaining the observed ZZ-scores using two models of the form (5.7)-(5.8). We considered two such models, one with the restrictions α0=0\alpha_{0}=0 and β0=0\beta_{0}=0, and one without.

The display below shows the output of the RR linear model command for the restricted model. The symbol de denotes δ1/2\delta-1/2 and the symbol probSize denotes 1/N1/\sqrt{N}. The output lists interaction terms such as probSize:de, which is what we have called β1(N)\beta_{1}(N).

Scaling with NN: the mean ZZ-score tends toward zero with increasing NN.

Standard deviation 1\approx 1. The standard devation of the residual ZZ-scores is close to 1.

Hinge terms. The linear trend term is not significant but the hinge term is.

When we fit the unrestricted model, allowing non scaling terms that do not decay with increasing NN, we see that in fact the only term approaching significance is again the scaling Hinge term.

The unrestricted model has a worse adjusted R2R^{2} than the scaling model and does not produce a meaningfully better residual standard deviation. We conclude that the scaling model fits the Rademacher ensemble adequately. Also, since the form of the model and the particular analyses we are presenting were both specified before the data became available, the statistics can be considered inferentially rigorous.

The situation with the Hadamard ensemble is quite different. It belongs with the exceptional ensembles, in the sense that substantial lack of fit is evident, particularly at small δ\delta and and small NN. Also, since we have only two values of NN256256 and 512512 – it is unclear that fitting the linear model will in any way validate scaling.

8.3. Lack of fit at the Hadamard ensembles

The Hadamard ensemble displays, at N=256N=256, the lack of fit which is familiar to us from the exceptional ensembles 11,12, and 15. Figure 23 displays the behaviour of ZZ-scores as a function of ρ=k/n\rho=k/n within each constant-δ\delta slice, together with fits using the displaced LD50LD50 model which we showed described the other exceptional ensembles, with the same displacement of one transition width. Note that this model was developed entirely before these data became available. Hence this display presents a validation of the earlier approach.

One can see in figure 23 that the displaced LD50LD50 model indeed adequately describes the structure at small δ\delta in the Hadamard ensemble at N=256N=256. For comparison, we show in figure 24 which demonstrates that no such model is needed in the Rademacher ensemble.

8.4. Implications of the validation ensembles

Our analysis of the validation ensembles follows the script we arrived at in earlier analysis. Since the models being fit – which in earlier analysis depended on the same data as those being explained – now do not depend on the data being explained, we gain additional confidence about the model we developed. In particular, the partitioning into ordinary and exceptional ensembles, the modelling of mean ZZ-scores with power law means having decay N1/2N^{-1/2} in ordinary ensembles and the existence of transient effects in exceptional ensembles at small NN and δ\delta – all seem confirmed in our validation ensembles.

9. Conclusions

We made extensive computational experiments, solving underdetermined systems of equations y=Axy=Ax using linear programming based algorithms in cases where a known solution x0x_{0} is a kk-sparse vector, and AA is an nn by NN matrix with n<Nn<N. We tracked the success of the linear programming algorithms at recovering this known sparse solution.

We considered millions of such systems at a range of problem sizes and covering a range of random matrix ensembles. Our work would have required more than 6 years using a single modern desktop computer.

The Gaussian ensemble is well understood in the large-problem limit based on prior theoretical work. It is known there is a phase transition in the (δ,ρ)(\delta,\rho) phase diagram at the curve (δ,ρ(δ;Q))(\delta,\rho(\delta;Q)). For (δ,ρ)(\delta,\rho) below this curve we see success – the probability of success tends to 11 – and above this curve we see failure – the probability of success tends to . Here δ=n/N\delta=n/N is a measure of matrix shape and ρ=k/n\rho=k/n is a measure of sparsity of the solution vector.

Empirical work at the Gaussian ensemble shows that for finite NN the location of 50% success closely matches the asymptotic phase transition ρ(δ;Q)\rho(\delta;Q). In fact the success probability behaves with ρ\rho as Φˉ((ρρ(δ;Q))/w(δ,N))\bar{\Phi}((\rho-\rho(\delta;Q))/w(\delta,N)) where Φˉ\bar{\Phi} denotes the N(0,1)N(0,1) survival function. and w(δ,N)w(\delta,N) is a width parameter. In picturesque terms there is a transition zone: for ρ<ρ(;Q)3w\rho<\rho(\cdot;Q)-3w success is overwhelmingly likely; for ρ>ρ(;Q)+3w\rho>\rho(\cdot;Q)+3w success is overwhelmingly unlikely, and in between there is a transition zone of width w\propto w. ww tends to zero with increasing problem size NN as O(N1/2)O(N^{-1/2}).

Broadly similar behaviour is evident at all but two of the non-Gaussian matrix ensembles considered here. Hence, the evidence points to asymptotic phase transitions at the conforming matrix ensembles which are all located at precisely the same place as in the Gaussian case. This is the hypothesis of weak universality advanced and maintained in the main text.

The above conclusions emerged from using standard two-sample statistical inference tools to compare results from non-Gaussian ensembles with their Gaussian counterparts at the same problem size and sparsity level. We considered the behaviour of over 16,000 ZZ-scores and observed good bulk agreement of the two-sample ZZ-scores with the N(0,1)N(0,1) distribution, supporting the null hypothesis of no difference; however, fitting a linear model to an array of such ZZ-scores we were able to identify statistically significant nonzero mean ZZ-scores varying with problem size, with undersampling fraction δ=n/N\delta=n/N, and, at small δ\delta, with ρ\rho. The means exhibit trends varying from ensemble to ensemble, but they decay with problem size NN as O(N1/2)O(N^{-1/2}). The nonzero means are consistent with weak, ‘asymptotic’, universality but not with strong, finite-NN, universality. After accounting for these means, the ZZ-scores exhibit standard deviations close to one, with a small fraction of exceptions.

For the non-Gaussian ensembles studied here the fact that phase diagram behaviour matches the Gaussian case goes far beyond current theory (e.g. Adamczak et al. (2009) )

References