A review of matrix scaling and Sinkhorn's normal form for matrices and positive maps
Martin Idel
Introduction
It is very common for important and accessible results in mathematics to be discovered several times. Different communities adhere to different notations and rarely read papers in other communities also because the reward does not justify the effort. In addition, even within the same community, people might not be aware of important results - either because they are published in obscure journals, they are poorly presented in written or oral form or simply because the mathematician did not notice them in the surrounding sea of information. This is a problem not unique to mathematics but instead inherent in all disciplines with epistemological goals.
The scaling of matrices is such a problem that has constantly attracted attention in various fields of pure and applied mathematicsThe term “Babylonian confusion” to describe the history of this problem was first used in . Recently, generalisations have been studied also in physics to explore possibilities in quantum mechanics where it turns out that a good knowledge of the vast literature on the problem can help a lot in formulating approaches. This review tries to tell the mathematical story of matrix scaling, including algorithms and pointers to applications.
As a motivation, consider the following problem: Imagine you take a poll, where you ask a subset of the population of your country what version (if any) of a certain product they buy. You distinguish several groups in the population (for instance by age, gender, etc.) and you distinguish several types of product (for instance different brands of toothbrushs). From the sales statistics, you know the number of each product sold in the country and from the country statistics you know the number of people in different groups. Given the answers of a random sample of the population, how can you extrapolate results?
Central to a solution is the following innocuous theorem:
Given a matrix with positive entries, one can find matrices such that is doubly stochastic.
The literature on Sinkhorn’s theorem and its generalisations is vast. As we will see, there are some natural ways to attack this problem, which further explains why the different communities were often not aware of the efforts of their peers in other fields.
One of the main motivations for this review was a generalisation of Sinkhorn’s theorem to the noncommutative setting of positive maps on matrix algebras:
with the adjoint matrices and the adjoint map .
Some results and approaches can be translated to this noncommutative setting, but many questions remain open and the noncommutativity of the problem makes progress difficult.
Very recently, a new generalisation of Sinkhorn’s theorem to a noncommutative setting has appeared in .
The goal of this review is therefore threefold:
Trace the historical developments of the problem and give credit to the many people who contributed to the problem.
Illuminate the many approaches and connections between the approaches to prove Sinkhorn’s theorem and its generalisations.
Sketch the generalisation to positive maps and its history and highlight the questions that are yet unanswered and might be attacked using the knowledge from the classical version.
In addition, I will try to give a sketch of the algorithmic developments and pointers to the literature for applications. I will probably have forgotten and/or misrepresented contributions; comments to improve the review are therefore very welcome.
Notation and Preliminaries
Most of the concepts and notations discussed in this short section are well-known and can be found in many books. I encourage the reader to refer to this section only if some notation seems unclear.
An important concept for nonnegative matrices is the pattern. The support or pattern of a matrix is the set of entries where . A subpattern of the pattern of is then a pattern with fewer entries than the pattern of . We write if is a subpattern of , i.e. for every we have .
Finally, let us introduce irreducibility and decomposability. Details and connections to other notions for nonnegative matrices are explained in Appendix A. If is nonnegative, then is fully indecomposable if and only if there do not exist permutations such that
where neither nor contain a zero row or column and . The matrix is irreducible, if no permutation can be found such that already is of form (1). In particular, this implies that all fully indecomposable matrices are at least irreducible.
When talking about positive maps, we will also adopt the notation that denotes the complex matrices, while the shorter is used for complex square matrices.
Different approaches to equivalence scaling
This section explores the historical development and current form of the mathematical landscape surrounding the following extension to Theorem 1.1:
if and only if there exists a matrix with and and the same pattern as . Here, which means that contains the row sums of the scaled matrix and contains the column sums.
Furthermore, if the matrix has only positive entries, and are unique up to a constant factor.
In Section 4, we give maximal formulations of this theorem. Some immediate questions emerge, such as: How to compute and the scaled matrix? Can this be generalised to arrays of higher dimension? All of these questions and many more have been answered in the literature.
Multiply each row of with to obtain with row sums .
Multiply each column of with to obtain with column sums .
If the row sums of are very far from , repeat steps one and two.
If the algorithm converges, the limit will be the scaled matrix. However, there is a priori no guarantee that exist, in which case we can only ask for approximate scaling, i.e. matrices such that .
The iterative algorithm 3.2 is extremely natural and it is therefore not surprising that it was rediscovered several times. It is at least known as Kruithof’s projection method () or Kruithof double-factor model (especially in the transportation community; ), the Furness (iteration) procedure (), iterative proportional fitting procedure (IPFP) (), the Sinkhorn-Knopp algorithm (), the biproportional fitting procedure (in the case of ; ) or the RAS method (especially in economics and accounting; ). Sometimes, it is also referred to simply as matrix scaling (), which is mostly used as the term for scalings of the form , or matrix balancing, which is mostly used for scalings to equal row and column sums. The algorithm is a special case of a number of other algorithms such as Bregman’s balancing method (cf. ) as we will see later on.
When was interest sparked in the RAS method and diagonal equivalence? The earliest claimed appearance of the model dates back to at least the 30s and Kruithof’s use of the method in telephone forecasting (). At a similar time, according to , the Soviet architect Sheleikhovskii considered the method. claims that when he started to evaluate the method, it had already been proposed and in use. His example is the unpublished report . acknowledges in transportation science, who popularised the RAS method in the English speaking communities.
None of these approaches seem to have been thoroughly justified. Bacharach notes that Deming and Stephan only propose an ad-hoc justification for using their method to study their problem, which turned out to be wrong (cf. ). He further claims that the first well-founded approach to use the RAS - this time in economics - was given by Richard Stone, who also coined the name “RAS model” (Bacharach cites , although the name RAS must have occurred earlier as it already occurs in without attribution and explanation). However, one can argue that the first justified approach occurred earlier: had already posed a question regarding models of Brownian motion when given a priori estimates, which led to a similar problem. His approach was justified, albeit the ultimate justification in terms of large deviation theory needed to wait for the development of modern probability theory (cf. ). The problem boils down to solving a continuous analogue of Sinkhorn’s theorem, which leads to the matrix problem using discrete distributions (essentially similar to ) and was first attacked in using a fixed point approach similar to Algorithm 3.13.
However, none of the original papers provided a convergence proof with the possible exception of The notation and writing is very difficult to read today, so I am not entirely sure whether the proof is correct and captures the case we are interested in.. As noted by , after Deming and Stephan provided their account, their community started to develop the ideas, but a proof was still lacking ().
Summarising the last paragraphs, the RAS method was discovered independently for different reasons in the 30s to 40s, although none of the authors provided a proof (with the possible exception of Fortet). A more theoretical analysis developed in the 60s after Stone’s results in economics (e.g. ) and in statistics and algebra. Since then, a large number of papers has been published analysing or applying Theorem 3.1. Every decade since the sixties contains papers where proving the theorem or an extension thereof is among the main results (examples are ).
Many authors are aware of at least some other attempts, but only a few try to give an overview.This suggests once again that the problem had a very complicated history which also makes it difficult to find out whether a problem has already been solved in the past. Several authors have attempted more complete historical overviews such as . In , the authors claims that a colleague collected more than 400 papers on the topic of matrix scaling. The situation is further complicated by a the fact that the technical answer to the question of scalability is tightly linked with the question of patterns, which has a rich history in itself, probably starting with Fréchet (overview of a long line of work in ).
The last point is particularly interesting: In fact, one could summarise matrix scaling matrix scaling as follows: Given a nonnegative matrix it is scalable to a matrix fulfiling some constraints (mostly linear but some nonlinear constraints are allowed), a matrix is scalable with diagonal matrices (in different ways, mostly where and need not be independent) if and only if there exists a matrix with the same pattern as fulfiling the constraints.
Today, proofs and generalisations of Theorem 3.1 and similar questions about scaling matrices in the form or form a knot of largely interconnected techniques. We will now try to give an overview of these results and highlight their connections. A graphical overview is presented in Figure 1.
2 The logarithmic barrier function
Potentials and barrier functions have been important in the study of matrix scaling since at least the unpublished results of . Here, we largely follow , who give a very lucid account about the interconnections between different barrier function formulations for .
If we take partial derivatives, we obtain
which implies that for any stationary point we have
and setting and solves the scaling problem. Conversely, any scaling gives a stationary point of the logarithmic barrier function. In summary:
According to , this observation was first made by who also gave the first complete and correct proof. However, the paper only circulated privately. Gorman apparently did not consider this scaling function directly but used an approach similar or identical to the ones considered in convex geometry described in Section 3.5.
The potential barrier function can also be seen from the perspective of Lagrangian multipliers:
the function is bounded away from zero and is unbounded whenever . The function then attains a minimum defining and .
This was used to prove our Theorem 3.1 in . We observe:
Lemma 3.4 and 3.3 are equivalent: The logarithmic barrier function is the Lagrange function of the optimisation problem in Lemma 3.4.
Now consider for a fixed . Since is a convex function and is linear in , is convex in . The same holds for a fixed , i.e. is convex in both direction. It is then natural to consider the coordinate descent algorithm (for an introduction and overview see ):
Given a nonnegative matrix , take a starting point for , e.g. and iterate:
For fixed , find by searching for the minimum of .
For fixed , find by searching for the minimum of .
It is possible to solve or analytically:
Define and . Then we have and , which implies that we perform successive row- and column normalisations as in the RAS method. ∎
Using the fact that the algorithm is a coordinate descend method, one can obtain a convergence proof including a discussion of convergence speed of this algorithm and a dual algorithm (). See also Observation 3.17 for a discussion of coordinate ascent methods.
However, is not jointly convex. For a purely (jointly) convex reformulation, consider the minimum for along any line , where is convex. If we define
minimising is still equivalent to minimising . The corresponding will be homogeneous and the domain for minimisation will in fact be compact.
hence minimising is equivalent to minimising .
The function is also similar to Karmakar’s potential function for linear programming and Algorithm 3.2 is the coordinate descent method for this function ().
Setting , we arrive at another formulation of the problem. In the doubly stochastic case, this formulation is due to and was later adapted to arbitrary column and row sums in later studied in , who used an entropic approach for the generalised problem and in , who used a direct convergence approach reminiscent of Sinkhorn and others.:
Note that the infimum is attained iff the infimum
is attained. This is the formulation in Lemma 3.4.
Finally, let us sketch a proof using potential methods.
on the set of with and . Consider an arbitrary point on the boundary (i.e. for at least one ). For , since and always, we have that . Hence the function takes its minimum in the interior. At the minimum, the partial derivatives must vanish and we obtain:
If we take all conditions for , then this is equivalent to the condition
The more technical part for nonnegative matrices is a more careful analysis of what happens for nonnegative matrices that are not positive. For doubly stochastic matrices, we can use the fact that fully indecomposable matrices have a positive diagonal, which implies once again that . A similar argument can be made for arbitrary patterns, but we leave it out in this sketch. ∎
3 Nonlinear Perron-Frobenius theory
Another early approach uses nonlinear Perron-Frobenius theory which is essentially a very general approach to tackle fixed point problems for (sub)homogeneous maps on cones. A short overview is given in appendix B. The basic idea is given by:
This also suggests another simple algorithm:
Given a nonnegative matrix . Let . Iterate until convergence:
The development of this idea that started with and was used to provide a full proof of Theorem 3.1 in for doubly stochastic matrices. consider arbitrary row- and column sums and give a complete study of the spectrum of the Menon-operator. Some contraction properties were used to give a direct proof of convergence of the RAS algorithm in . The connection to Hilbert’s projective metric, and therefore to “Nonlinear Perron-Frobenius theory” (cf. ), became clear later on and allowed to give upper bounds on the covergence speed of the RAS ().
However, Menon was not the first to define the operator : Looking closely at the arguments given in , one can see the continuous version of , which lead to an independent rediscovery of and its connection to the Hilbert metric in . Probably, Menon was not even the first to define the discrete version of the operator and to note that the existence of a fixed point can be seen by invoking Brouwer’s fixed point theorem. This dates back to , building on work about matrix patterns (). According to , was also the first paper to conjecture the necessary and sufficient conditions for scalabilityHe also notes that the early history around is a little bit curious, since the authors claim to have a convergence proof but never publish it.. The ideas where rediscovered another time in , where the authors used the fixed point argument to prove that a scaling exists and fulfils their axiomatic approach.
Let us connect the approach to Section 3.2. First note that the algorithm is nothing else but a slight variation of the RAS method:
Setting and we can immediately see that one iteration of Algorithm 3.13 is one complete iteration of the RAS method 3.2.
The connection with the logarithmic barrier method is also very close:
Any fixed point of the Menon operator defines a stationary point of the logarithmic barrier function (2) and vice versa.
Let be a nonnegative matrix. The derivative conditions for the stationary points of (2) are given in equation (3), which are equivalent to:
This implies immediately that , hence is a fixed point of . Similarly, any positive fixed point immediately gives a scaling as a minimum of the logarithmic barrier function. ∎
However, it is not immediately clear when the fixed point is positive if contains zero-entries. This is the main technical difficulty for a complete proof. show that if is fully indecomposable, has at least entries which are nonzero if has exactly entries which are nonzero, which immediately proves that the fixed point must be positive.
Upon closer observation, the map is contractive under Hilbert’s metric and Banach’s fixed point theorem immediately provides existence and uniqueness of the scaled matrix. The fixed point itself provides the diagonal of . ∎
4 Entropy optimisation
Another approach, which underlies many justifications for applications, considers entropy minimisations under linear constraints. An overview of entropy minimisation and its relation to diagonal equivalence can be found in , a broader overview about the relation of the RAS algorithm to entropy scaling with a focus on economic settings can be found in .
where we use the convention that the summand is zero if and infinity if . The relative entropy is nonnegative and zero if and only if and it is therefore similar to a distance measure. Given a set, what is the smallest “distance” of a point to this set in relative entropy? This is known as I-projection (cf. ).
Let be a nonnegative matrix and define
We ask for the I-projection of onto the set , i.e. we want to find such that
The connection to scaling was probably first used in , where the RAS method is used to improve an estimate for positive probability distributions of dimensions in the relative entropy measure (Brown cites as a justification for his approach, where the relative entropy is justified as a “closeness” measure). According to , this approach was later generalised to all multidimensional tables in based on some duality of optimisation by Both references were not available to me.. Another early use of relative entropy occurs in (see also ), where it was noted without proof that the results were the same as the RAS.
A very natural approach to obtain would be to try an iterative I-projection:
If is even, find such that
If is odd, find such that
The algorithms 3.16 and 3.2 are the same.
Partial derivatives and lead to the system of equations:
The latter is the column renormalisation as in the RAS (Alg. 3.2). ∎
This implies that if the iterated I-projection converges to the I-projection of (13), then matrix scalability solves equation (13). This was supposedly proved in In , it is pointed out that a simplified version of this proof appeared in , which however is unavailable to me. and , but the proofs contain an error as pointed out in (see also ). A corrected proof appeared in , however for some of the theorems it is not immediately clear whether more assumptions are needed as noted in .
In addition, the proof in for positive matrices proves that the RAS converges using relative entropy as a “progress measure”. He shows that it decreases under RAS steps to a unique stationary point. Another direct proof appeared in .
At this point, let us make the following observation:
The RAS method can also be seen as the coordinate ascent method to the dual problem of entropy minimisation.
This is justified as follows: When deriving the I-projections of each step of the algorithm, we set up the Lagrangian
and calculate its solution. This consists in explicitly solving the resulting equations for the Lagrangian multipliers . In this sense, the algorithm is not really a primal problem. This is also consistent with the nomenclature above: In Section 3.5 we see that the dual problem of entropy minimisation is a convex program that is basically just the (negative) logarithmic barrier function above. Since the RAS is the coordinate descent algorithm of this problem, it is the coordinate ascent method of the dual problem of entropy minimisation.
In other word, the justification of this observation is due to:
and in particular, the minimum is the scaled matrix.
The proof of this observation will essentially follow from the results in Section 3.5.
Let us finish this section by giving another proof sketch of Sinkhorn’s theorem:
We sketch the proof given in restricted to our scenario, which is similar to the proof in (see also for a comment on the connection). We prove convergence of Algorithm 3.16, essentially by showing that the relative entropy of two successive iterations decreases to zero.
Given a nonnegative matrix , assume that there exists a matrix with required row- and column sums. Otherwise, the relative entropy will always be infinite and the problem has no solution.
The crucial observation is that if is the I-projection of onto , then for any we have
This “Pythagorean identity” usually only holds with . The equality case is a special case of the “minimum discrimination principle” () and it is proven for constraints in . This equality leads to a very useful transitivity result (see also ) stating that if has I-projection on for some and I-projection on , then has I-projection on . This is not necessarily true in the general case.
Let be the I-projection of onto . Denoting by the repeated I-projection as defined in Algorithm 3.16, repeated application of equation (15) shows
Since the first and last term converge to zero, and the I-projection is indeed the limit of Algorithm 3.16. ∎
5 Convex programming and dual problems
Recall the logarithmic barrier function in equation (2) and that it is not jointly convex. However, it is very beneficial to make convex for several reasons:
Convex programming is efficient in the complexity theoretic sense ().
The duality theory for convex programming is very well developed and can lead to new algorithms (see for a heuristic introduction and for a more careful analysis).
Uniqueness proofs can become simpler: A convex function has a unique minimum iff it is strictly convex at the minimum.
To obtain a convex program, one simply needs to substitute and into to obtain ():
A proof based on this approach can be found in .In it is also noted that the function is used in the approach by later to be simplified by . Both papers are unavailable to me. We have already seen:
The convex programming formulation in Lemma 3.20 is equivalent to the logarithmic barrier function approach in Lemma 3.3.
The convex programming formulation 3.20 is the Wolfe dual () or Lagrangian dual () of the entropy minimisation approach.
The entropy minimisation problem was given as:
This implies that the Wolfe dual is given by
and inserting this into the Wolfe dual function (see e.g. ) we obtain:
which is (up to the constant ) the optimisation problem in 3.20. The calculation for the Lagrangian dual is similar (see ). ∎
Another connection to the barrier function is to use geometric programming:
Minimisation of the logarithmic barrier function is equivalent to
This is in standard form of a geometric program, which implies that a substitution gives a convex program (, Section 4.5.3).
This observation was made in , which also gives necessary and sufficient conditions for a matrix to be scalable or approximately scalable.
As described in , one can also reduce the problem to an unconstrained optimisation problem for only a single variable by taking the formulation of Lemma 3.10 and substituting as above to obtain the minimising function
Finally, let us return to entropy minimisation: Relative entropy is jointly convex and therefore a convex program. In fact, relative entropy is a special case of a broader class of functions called Bregman divergences which we will sketch in Section 6.5.
A proof of Theorem 3.1 using convex programming is often similar to the approach in Section 3.2. The advantage is that any critical point is automatically a minimum and one does not need to consider the boundary.
6 Topological (non-constructive) approaches
In the proof of Theorem 3.1 in Section 3.3, the result was achieved by Brouwer’s fixed point theorem, but it is only one of many topological proofs.
is onto. This is somewhat problematic, because the spaces involved are not compact, but by normalising both diagonal matrices and row- and column-sums, one can consider the map as a map from a compact space into itself. This approach was taken in for positive matrices (based on his thesis) and the map was shown to be surjective using a topological theorem, which Bapat claims is sometimes known as Kronecker’s index theorem.I could not find any other instance of where the theorem is given that name. The theorem simply states that for any map , if maps into itself and is of nonzero degree, then it must be surjective. The case for general nonnegative matrices could only be covered by combining the approach with (see ).
and using the dual of this maximisation, one can show that it scales the matrix.
There is a simple connection to entropy minimalization, since .
However, we can also take the converse road: Instead of exploring the possibilites for every , we can start with the set of matrices with prescribed row- and column sums and matrix pattern (call the set ) and map it to the set of all nonnegative matrices of pattern (call it ) by diagonal equivalence, i.e. consider the map:
Again, it would be enough to show surjectivity. As such, it cannot be injective, because we can obviously shift a scalar from to , hence we would at least have to restrict the first coordinate of to be . The resulting map is indeed a homeomorphism as shown in an overlooked paper of .
Another topological proof has recently been proposed in . To describe the approach, note that the following two statements are equivalent:
There exist such that has row sums and column sums .
There exist such that is a stochastic matrix with .
The proof is trivial, in fact and . In , the author therefore restricts to stochastic matrices. To do this, he defines the following map:
7 Other ideas
A very general approach to prove Theorem 3.1 was provided in , where matrix theorems are derived as a consequence of the following theorem:
Sinkhorn’s theorem follows as an easy corollary:
First, let , then we first define the following maps:
Clearly, the choice of is unique up to , which can be made explicit and leads to the usual conditions. ∎
In principle, we have already two geometric interpretations of the RAS: First, the RAS is akin to iterated I-projections and second, the RAS is the application of a contractive mapping on a cone with a projective metric. Two other “geometric” proofs are known:
shows that the RAS is a contractive mapping in the Euclidean metric using that the RAS preserves cross-ratios of a matrix. Given a matrix, the products
remain invariant. This was first observed in , where it was used to justify the use of the RAS in statistical settings (see Section 8). Fienberg then follows that if one associates any positive matrix to a point of the simplex
by normalising the matrix, then any point reachable by diagonal equivalence scaling lies on a certain type of manifold inside the simplex. Using some structural knowledge of these manifolds he then shows that each full cycle of the RAS corresponds to a contraction mapping with respect to the Euclidean metric. The result is general enough to cover multidimensional tables, but in this simplicity handles only positive matrices.
There is an interesting connection: While the cross-ratios within the matrix remain constant, Hilbert’s metric is also closely connected to cross-ratios. In fact, the contraction ratio is connected to the largest cross-ratio within the matrix and it is not finite if the matrix contains zeros. In that case, the matrix does not easily define a contraction in Hilbert’s metric. The same holds true for Fienberg’s proof.
consider the column space of scaled matrices and notes that is doubly stochastic if the columns are included in the convex hull of the columns of and the barycentre of the sets of the two columns coincide. The observation of the barycentre then leads them to a proof involving Brouwer’s fixed point theorem once again. By some continuity argument, the proof can be extended to nonnegative matrices.
7.2 Other direct convergence proofs
Many papers contain direct convergence proofs, not least the original approach in and the proof of the full result (another proof based on this approach is given in ). The idea is to show that some seemingly unrelated quantity always converges. Often, this quantity turns out to be very much related to some potential barrier function or entropy and we already cited the approach in the corresponding section.
One different proof is the short convergence proof of establishing that and similarly for every . This proof is in some sense derived from Bacharach’s approach (, see also ) and is very straightforward.Macgill also mentions yet another work that contains a proof of Theorem 3.1, namely , however no details are given beyond the fact that it contains also approximate scaling. In parallel to Bacharach’s earlier work but not cited in his later , proved the convergence of the RAS method in the general case of multidimensional matrices via the same idea which he attributes to (see the appendix of ).
A second direct proof of convergence in , uses a norm difference as convergence measure. More precisely, he considers the map
A third proof of direct and approximate scaling is given in by combining the approach of Bacharach with a simple -error function borrowed from .
Equivalence scaling
Let us now collect maximal results. A similar but scarcely referenced collection of results was provided in . We follow the cleaner presentation style of . Starting with equivalence scaling, we have:
There exist positive diagonal matrices such that has row sums and column sums .
There exists a matrix with row sums and column sums with the same pattern as ().
For every such that we have that
and equality holds if and only if ().
The RAS method converges and the product of the diagonal matrices of the iteration also converges to positive diagonal matrices ().
The equivalence of the first two items was essentially established in the proof sketches in section (3). The equivalence to the fourth item follows from the characterisation of matrix patterns (see appendix A) and the third follows from studying the geometric program 3.23.
For doubly stochastic scaling, using the classification of doubly stochastic patterns, we then know that scalability is equivalent to having total support (cf. ). The scaling matrices are unique up to scalar multiplication if and only if is fully indecomposable.
For approximate equivalence scaling, the results are similar. The only difference is that certain elements of can become zero in the limit (which implies that elements of must become zero and others infinite, hence the diagonal matrices cannot exist):
For every there exist diagonal matrices such that satisfies
There exists a matrix such that is scalable to a matrix with row sums and column sums .
There exists a matrix with row sums and column sums ().
For every such that we have that
The RAS method converges ( for the d.s. case).
For doubly stochastic scaling, using the classification of doubly stochastic patterns, we have that approximate scalability is equivalent to having support. Using , this is a trivial consequence of the fact that a matrix has total support if and only if it has doubly stochastic pattern and Proposition A.5One recent observation of this is in . The observation has however already been made before such as in .
The uniqueness conditions are also simple enough to state:
Furthermore, if there exist no permutations such that is a direct sum of block matrices, then are unique up to scalar multiples. Otherwise, the scaled matrices are only unique up to a scalar multiple in each block.
For doubly-stochastic scaling, this result already appears in . For the case of general marginals, it occurs in and for general matrices and marginals in . The tools can also be applied to prove that the approximately scaled matrix is unique.
Let us now have a closer look at the difference between approximate scaling and equivalence scaling. What can be said about the convergence of the RAS?
Then converges to a matrix and the same result holds for with if and else.
The continuity of the scaling can also be studied:
Let be nonnegative and be prescribed row- and column sums. Then the limit of the Sinkhorn iteration procedure is a continuous function of on the space of matrices with -pattern.
When the scaling matrices are unique up to a scalar multiple, this also implies that the scaling is continuous in .
The first proof of this result limited to the doubly-stochastic case was given in . The full result follows directly from the homeomorphism properties of the map (20) from . A discussion is also presented in . Furthermore, the continuity can be achieved using arguments of Section 9.3.
Finally, let us mention another characterisation of equivalence scaling using transportation graphs.
Then it is easy to see that the matrix is approximately scalable if and only if the maximum flow of this network is equal to . The flows along the edges then define a matrix with the wanted pattern. The matrix is exactly scalable if and only if the maximum flow of this network is equal to and every edge contains flow.
Other scalings
The problem of equivalence scaling is closely connected to different forms of scalings, the most prominent ones asking for a diagonal matrix such that is row-stochastic or such that has equal row and column-sums.
Many modern approaches to matrix equivalence scaling are general enough to cover most of those different scalings (see Section 6).
Given a matrix , does there exist a matrix such that has equal column- and row sums? Clearly, this is a special case of scaling with a different set of constraints. We have the following characterisation:
There exists a diagonal matrix such that fulfills .
is completely reducible or equivalently, a direct sum of irreducible matrices ().
There exists with the same pattern as and ().
The scaling of is unique and is unique up to scalars for each irreducible block of .
The problem was first considered in in the context of preconditioning matrices (see Section 8) by proposing an algorithm and proving its convergence (and uniqueness). , building on Osborne’s results, considers the matrix balancing method and provides an algorithm and convergence proof for completely reducible matrices. Unaware of the effort of Osborne and Grad, but considering “the analogue of [Sinkhorn’s] result in terms of irreducible matrices” proves essentially the same result. His approach is based on a progress measure which is basically the maximum difference of the row- and column sums. provided an interpretation in terms of patterns. The same was later proved in by yet different means.
Similar to the RAS method, one can propose a simple iterative approximation algorithm:
For , let be the row sum and similarly be the column sum. Then define as the minimum index such that is maximal among .
Define such that .
Let with at the -th position. Then define and iterate.
According to , this algorithm is also similar to the proposed scheme in . At any step, the -th row is already correctly scaled, while all other rows change their scaling a bit. Note that unlike in the RAS method, the selection of the row and column to be scaled are done using norm differences. Given the results of that the RAS converges regardless of the order of column and row sum normalisations, a similar condition might also accelerate RAS convergence.
The algorithm converges to a balanced matrix . This matrix is also the unique minimiser of the function
The fact that the balanced matrix minimises the entropy functional can be seen by direct calculation (the minimiser must be a scaling of the original matrix and the balancing conditions ensure that the scaling is of the form ).
A proof is similar to observation 3.17: Each step of the algorithm is an I-projection onto the set of matrices with only one row/column balancing constraint. Since the conditions are linear, the repeated projection will converge.
It remains to see that the order of the projections does not matter as long as all directions are chosen arbitrarily often. ∎
2 DAD scaling
Another closely related problem is the question, whether given a nonnegative matrix , there exists a single diagonal matrix such that has prespecified row- or column sums. A short but quite good overview is given in .
Let us first focus on the case where is symmetric. It seems natural that this follows directly from Sinkhorn’s theorem: If has equal row-sums and is symmetric, so does . By uniqueness of up to scaling, this implies that one can choose . This was noted for example in .
The first discussion of the case of symmetric can be traced back to the announcements This is covered in many papers, for instance .. A first proof for the case of positive matrices and doubly stochastic scaling was given in . Shortly later, consider the case of doubly stochastic scaling for nonnegative matrices with positive main diagonal, while shows that a doubly stochastic scaling exists if and only if there exists a symmetric doubly stochastic matrix with the same zero pattern if and only if the matrix has total support. This was extended in to cover the case of arbitrary row sums giving the following theorem:
There exists a symmetric nonnegative matrix with the same pattern as and row sums .
For all partitions of such that , with equality if and only if .
The equivalence of 2. and 3. is given in . 1. follows from 2. using Sinkhorn’s theorem and the reverse direction is proved via contradiction. Using the uniqueness in Sinkhorn’s theorem then provides uniqueness for the scaling.
Note that the following observation gives a very simple proof of Theorem 3.1:
has a row-sum symmetric scaling to .
First assume that there exist positive diagonal such that fulfills
Then setting we have
and clearly .
Conversely, if has a row-sum symmetric scaling , by an analogous argument will have an equivalence scaling with row sums and column sums . ∎
This was already known in the 70s, maybe even earlier; explicit formulations include . Note that the observation can easily be extended to not just row- and column sums, but all -norms for as considered in Section 6.7. It can also be extended to approximate scalings with the same proof. This implies:
Results from symmetric scaling for symmetric nonnegative matrices can always be translated to cover equivalence scaling for arbitrary nonnegative matrices.
The other direction is not true, since clearly not all symmetric matrices are of the special form (24). However, it can still be beneficial to study equivalence scaling on its own, as many algorithms (e.g. the RAS) do not preserve symmetry.
Theorem 5.4 can be generalised to cover matrices that are not necessarily nonnegative:
If is positive semidefinite, then is scalable if and only if is strictly copositive ().
Any principal submatrix of (including ) is scalable if and only if is strictly copositive ().
In general, at least one of the following two propositions is true ():
More general conditions for scalability of arbitrary symmetric can be found in . We make a number of remarks concerning the results:
Another necessary condition for scalability (the matrix must be diluted) is provided in .
The question of equivalent conditions for the scalability of matrices remains open. However, these conditions might not have a very useful description, since scalability of arbitrary symmetric matrices is NP-hard (This was conjectured also in , who noted that deciding whether a matrix is (strictly) copositive is NP-complete according to . The authors seemed to have been unaware of the paper by Khachiyan. The alternative in Theorem 5.7 is also not very useful computationally, because deciding the emptiness of the set (25) is also NP-hard (, according to ).).
The second result implies in particular that if a matrix is strictly copositive, it is scalable, which was first proved in . Note that positive definite matrices are in particular strictly copositive, which means that this result encompasses the claimed proofs of scalability of completely positive matrices in . An elementary proof for matrices with strictly positive entries has recently appeared in based on an iterative procedure.
For doubly stochastic scaling, the alternative conditions of can also be derived using linear programming duality and/or the hyperplane separation theorem using extremely general methods of duality in self-concordant cones ().
Scaling of the special class of Euclidean predistance matrices has been considered in . It turns out that all such matrices are scalable.
Note that the equivalence conditions for positive semidefinite matrices can be strengthened. If a matrix is scalable and positive semidefinite,
can be bounded in terms of the matrix dimension (cf. , where it is also noted that the scaling problem is related to linear programming).
Uniqueness of matrix scaling has also been studied:
If has two or more distinct scalings, then there exists a matrix such that has eigenvalues and ().
For scalable positive definite matrices there exist diagonal matrices such that , one for each sign pattern of (). In particular, scaling by positive diagonal matrices is unique.
If is positive semidefinite, then if is scalable to row sums , the positive diagonal matrix is unique ().
For the scaling of positive semidefinite matrices, upper and lower bounds on were derived in .
also note that for nonnegative matrices uniqueness holds in particular if is primitive (including the case of positive matrices already covered in ) or if is irreducible and there does not exist a permutation such that
If we do not restrict to symmetric matrices we can only hope to scale to a matrix with given row-sums. The only notable result seems to be:
The theorem can be extended to cover arbitrary row sums. The first proof occurred in . Likewise, the proof in does not need symmetry of .
3 Matrix Apportionment
Another scaling problem which is interesting particularly for its applications, is asking for an equivalence scaling, but with the added constraint that the resulting matrix have integer entries. This is important for instance when attributing votes to seats in a parliament and has been applied as early as 1997 (, see also for one of many explicit accounts for actual changes).
This problem, which is often called matrix apportionment has first been studied in . Algorithms akin to the RAS method exist and others based on network flows can be obtained from ; an overview and many references can be found in .
4 More general matrix scalings
This review has so far largely been concerned with nonnegative matrix scaling, with the exception of symmetric scaling. This is understandable, as most of the applications concern nonnegative matrices. However, in view of completeness, let us mention a few of the (mostly quite recent) other cases of matrix scaling.
While arbitrary scaling is interesting for real symmetric matrices, scalings of general real matrices have never sparked a similar amount of interest. It is merely known that the question whether or not a matrix is scalable is NP-hard () - a question that has also been considered for matrices over the algebraic numbers in . Since the problem of nonnegative matrix scaling turns out to be equivalent to the existence of matrices with given pattern, it seems natural to ask whether the -pattern of matrices with prescribed row- and column sums play a similar role. For positive diagonal scaling the sign pattern of the matrix cannot change and it is a necessary condition for scalability, which is not sufficient as shown in . Nevertheless, the authors achieve a characterisation of general matrix patterns (generalising , see also ).
Note that in case all entries are nonnegative the matrix is doubly stochastic. For the rest of this section, let us restrict to square matrices. Quasistochasticity is interesting, because if is quasistochastic, then , where and is the discrete Fourier transformation. This is true since and is an eigenvector of by quasistochasticity. A doubly quasistochastic matrix therefore satisfies that has as its first row and column. Repeating diagonal scalings and Fourier transform can then lead to new matrix decompositions.
The natural generalisation of scaling would be -scalings for positive semidefinite matrices. These were first studied in and later in . Observing that the proof of extends to complex entries, the authors obtain already part of the following partial results:
Neither nor the scaled matrices are necessarily unique. However, there exists at most one scaling with positive matrices .
The authors suggested that such scalings can be applied to generate highly entangled symmetric states. They furthermore conjectured that the number of such scalings would be upper-bounded, but this was disproved recently in by giving counterexamples for , which have infinitely many scalings. For , there exist at most four scalings. An RAS type algorithm can be obtained from the fact that an equivalent version of Observation 3.14 also holds in the complex case.
For the subclass of unitaries, we proved the following theorem:
For every unitary matrix there exist diagonal unitary matrices such that is doubly quasistochastic. Neither nor are generally unique, in fact in some cases there may even be a continuous group of scalings.
An algorithm how to obtain similar to the RAS method is given and studied in , however its convergence is unknown.
The theorem was conjectured in and used later () to prove that any unitary matrix can be considered as a product of diagonal unitary matrices and Fourier transforms on principal submatrices. Recently, it has also been applied to prove an analogue of the famous Birkhoff theorem for doubly-stochastic matrices ().
The proof of Theorem (5.13) boils down to noticing that a scaling exists if and only if there exists a vector with and for all . This is a problem of symplectic topology in disguise and can be solved using a theorem in . When we published the theorem in we were unaware of the fact that this proof had in principle already been found, since the equation with , which defines so called biunimodular vectors (see for instance ), also pops up in several other places. In this context, essentially the same proof was described in . A first formal publication containing this proof was probably applying it to error-disturbance relations in quantum mechanics.
Generalised approaches
All of the approaches above can be generalised to some extend. Many can then incorporate also different scalings. With an eye towards matrix equivalence, we will attempt to see the different ways of generalisations and what can be gained. A quick summary can be found in Table LABEL:tab:generalise.
Especially in transportation planning, equivalence scaling of arrays with three indices has been important from the beginning. Except for nonlinear Perron-Frobenius theory, the approaches can be readily generalised to this case. As already pointed out, was the first to consider multidimensional scaling. According to (see also ), Furness pointed out iterative scaling as a possible solution to certain transportation planning problems in the unpublished paper . Evans and Kirby themselves proved convergence in a limited scenario by extending the convex programming approach of equation (16) and proofs have been provided or pointed out in several other papers such as . The case of approximate multidimensional scaling is discussed in .
For multidimensional exact or approximate scaling, the convergence results of reflected in Theorem 4.4 still hold. In addition, the order in which we normalise any of the indices of the multidimensional array is irrelevant:
Let be an array with indices (or dimensions) and let be the dimension of the array that is scaled in the -th step. If each element of appears in the sequence infinitely often, then the scaling converges to the limit of the cyclic RAS method, the I-projection of .
2 Log-linear models and matrices as vectors
Most of the ideas above use matrices as matrices, as sets of numbers with two indices. One can likewise consider just vectors of numbers and define columns and rows by defining partitions of the vectors. This approach has the advantage that the generalisation to multidimensional matrices is immediate. It was probably pioneered by , although credit Murchland, who circulated his results later (The papers were not available to me). The approach was then taken on in (see also , Chapter 6 for an overview and a more lucid presentation of their ideas). While used an entropic approach, is based on a combination of optimisation and topological approaches as discussed in Section 3.6. The same theorem is also proved in in a very elementary fashion and in using optimisation techniques.
The original goal of was not to study matrix scaling but rather obtaining probability distributions using so called log-linear models. Given a positive (sub)probability distribution over some finite index set , a log-linear model is a probability distribution such that
which satisfies some constraints . Here, and have to be determined while is given from the problem. The name derives from the fact that the solution is an exponential family of probability distributions.
Depending on the choice of , one can write matrix balancing, equivalence scaling or scaling as finding a log-linear model.
To achieve equivalence scaling with row-sums and column sums , consider for simplicity the case of a matrix. Then and are given by
and we define (example from ).
To achieve matrix balancing with row-sums equaling column sums, consider for simplicity the case , then is given by
and and we order again as before (example from ).
Note that this is a major generalisation of scaling as the matrix can contain any real numbers.
The limiting factor of the theorem is the boundedness of . While the constraints in the case of matrix equivalence are bounded, the constraint set defined by (28) is not necessarily bounded. applies a completely different proof which only works for positive matrices. However we can still apply Theorem 6.2: is unbounded, because the matrix entries can become unbounded since we only want equal row and column sums but do not specify them further. We fix that by using
and . The last row just implies that the sum of all matrix entries should be one which makes a bounded set. A simple calculation then shows that this is equivalent to searching for a diagonal matrix and a scalar such that has equal row- and column sums and the sum of all matrix entries is one. Clearly, this is equivalent to matrix balancing and we can apply Theorem 6.2.
The connection to entropy minimisation is simple:
Given a positive (sub)probability distribution , if a positive probability distribution satisfying (27) and the linear constraints exists, then it minimises relative entropy subject to the linear constraints.
The proof in is a straightforward calculation and follows directly from . If is a probability distribution satisfying the linear constraints, then
which implies the lemma by the nonnegativity of relative entropy. ∎
3 Continuous Approaches
Nonnegative matrices were always tied to joint probability distributions. Obviously, there is no reason to only study discrete probability distributions. The first such generalisation was obtained in . Also the basic theorems of and are more general than counting measures (although both have problems with parts of their arguments, see ).
As pointed out in , there are essentially two approaches to continuous versions, the entropy maximisation approach studied by Kullback and later Csiszár, and the approach via fixed point theorems or contractive ratios (one can see this as a precursor to nonlinear Perron-Frobenius theory) studied in, for instance, . The natural continuous extension of the theorem for symmetric matrices was studied in (via fixed points or iterative contractions). The most general results in combine these two approaches. To give a flavour of their results, we cite
Given a finite measure spaces and marginal distributions , consider the following minimisation problem:
where . Furthermore, we require . Then the minimisation problem has a unique optimal solution. If there exists a which fulfils the constraints and there exist and such that almost everywhere, then is the unique solution. Conversely, if there exists a feasible solution with almost everywhere, then the unique optimal solution satisfies almost everywhere and there exist sequences and such that
This in fact also covers the approximate scaling case. Note that the results also extend to more than two marginals.
4 Infinite matrices
Instead of continuous functions, we can also consider infinite matrices. Results usually posit that row and column sums should be finite in some norm.
The first such result was obtained in , which proves Theorem 3.1 in the case where the column and row sums are uniformly bounded in the -norm and the matrix entries are uniformly bounded. The proof is reminiscent of and Nonlinear Perron-Frobenius theory, using a fixed point argument involving Schauder’s fixed point theorem.
Another approach was presented in , where matrices that are infinite in one direction are studied (rows or columns are finite in a -norm). Once again, the matrix entries must be bounded uniformly (in this case, in a -norm) and convergence of the iterative algorithm to a unique solution is proved in certain topologies.
5 Generalised entropy approaches
As we saw in Section 3.4, we can write matrix scaling as the problem
where is an intersection of linear constraints. This approach can be generalised in two ways: First, one could consider other functions than relative entropy but related to it or second, one can consider more general sets .
behaves similarly to a metric, although it is not necessarily symmetric and obeys no triangle inequality. If one takes (negative entropy modulo the linear term), then . This example was already studied in giving in addition an iterative algorithm to find the projections onto the minimum Bregman distance given linear constraints, which is a variant of the RAS method (see also ).
Another way to generalise is the basic observation underlying : Relative entropy for matrices is equivalent to the sum of cross-entropies between matrix columns, where a cross-entropy of the column of the matrices is just . Instead of taking the sum of all cross-entropies, it might be justified to take weighted sums of cross-entropies. This is relevant in economic settings and, aside from , was studied in e.g. References corrected but taken from as they were unavailable to me.
On the other hand, we can work with relaxed constraints. This was covered in : The extension of linear families of probability distributions is still covered by , while finding the I-projection for closed, convex but nonlinear constraints requires different means such as Dykstra’s iterative fitting procedure (cf. ).
6 Row and column sum inequalities scaling
Instead of wishing for matrices to have prespecified row and column sums, it might be interesting to consider cases where only lower and upper bounds on the row and column sums and the matrix entries are considered.
such that if , then and , then and the same conditions for and . This problem was studied in , where they call such a matrix a fair share matrix. The main purpose of the approach is described in Section 8.2. Using the arguments from nonlinear Perron-Frobenius theory (Section 3.3), they prove
Let be a nonnegative matrix. There exists a unique fair share matrix for if and only if there exists a matrix with the same pattern as .
A different generalisation is called truncated matrix scaling. It is studied in and can also account for equivalence scaling. While the proofsuse a combination of optimisation techniques for an optimisation problem defined via entropy functionals, the motivation and interpretation uses graphs and transportation problems for graphs.
There is a matrix whose pattern is a subpattern of such that is balanced and ,
and asks for a diagonal matrix and a nonnegative matrix such that
is balanced and ,
The connection to equivalence scaling is simple (cf. ): Given a nonnegative matrix and row and column sums , we start with the graph of Figure 2: we join the two vertices and into one vertex (call it ), keeping everything else fixed. We label the edges between the nodes with the corresponding matrix entries . is now the matrix corresponding to the graph.
Now we copy the graph twice and erase the weights of the edges and instead label the first graph by and the second by where
Now () is the matrix corresponding to the graph with labels (). Finally, is the triple for truncated matrix scaling.
The main result of the paper then includes:
Once again, the answer is dominated by the pattern of the matrix and the conditions boil down to the usual conditions for similarity scaling. In a sense, this gives another explanation as to why both problems, equivalence scaling and matrix similarity, need pattern conditions for feasibility: They are both similar graph-related problems.
7 Row and column norm scaling
Instead of asking the question whether one can scale a matrix to prescribed row- and column sums, one can ask for a scaling to prescribed row- and column norms.
For the -norm, this is discussed in . Their proof relies on an algorithm for symmetric scaling using Observation 5.5. In fact, an algorithm for the problem had already been studied for the symmetric case in Reference from among others. I could not obtain the reference..
There exist diagonal matrices and such that has prescribed row and column maxima and .
There exists a matrix with row and column maxima and with the same pattern as .
There exists a matrix with row and column maxima and with some subpattern of .
for every subsets and such that .
There are two further technical conditions given in as well as an algorithm that converges to the solution.
The usual equivalence scaling now corresponds to -norm scaling. For -norms of row and columns with , it is shown that this problem reduces to -norm scaling in . If denotes the entrywise power, we have:
There exist matrices and such that has prescribed row and column -norms and .
There exist matrices and such that has prescribed row and column sums and .
There exists a matrix with the same pattern as and row and column sums given by and .
Hence the answer again reduces to a question of patterns. Likewise, the -scalability can immediately be transferred. Much weaker results were obtained in , where the problem of -norm scaling was studied for arbitrary (not necessarily nonnegative) matrices. A (fast) algorithm is also derived in .
8 Row and column product scaling
At this point, one might wonder what happens when replacing the row- and column sums by row- and column products. This has been treated in , however it is not connected to entropy or maximum likelihood estimation, but instead to least square estimations, which is why we will not discuss the techniques here. However, this is interesting in light of the original justification of the RAS method in transportation planning by . The results are simple:
Let be a nonnegative matrix. Then the following are equivalent:
There exist positive diagonal matrices and such that has row and column products and .
There exists a matrix with the same zero pattern as and row and column products and .
The scalded matrix is unique.
Furthermore, if has no zero rows or columns, there always exists a matrix such that has equal row and column products.
Note that in the case of matrix balancing to equal row and column products, the result is also the same: This is possible if and only if a balanced matrix with the same pattern exists which is always the case (cf. , Theorem 5.2).
Algorithms and Convergence complexity
After the basic existence problems of matrix scaling were solved in the 60s to 80s, the focus shifted to algorithms and complexity theory in the 90s. The story is equally convoluted, not least because algorithmic complexity is difficult and not always well-defined in itself: One can decide to study worst case or average convergence speed, count algorithm steps or computational operations. Given the RAS method and the fact that it is a coordinate descent method for an intrinsically convex optimisation problem, which is amenable to a host of other techniques, the choice of a relevant class of algorithms is already not unique.
Since this review is geared more towards the mathematical aspects of the problem, our focus will lie on exact complexity results instead of proofs by example. Papers focussed on numerical aspects appeared as early as the late 70s, early 80s with average convergence considerations, and . A small overview about many of the recent developments can be found in .
Most algorithms explicitly require that the matrix be scalable (or positive). This means that we first need to check for scalability.
The fact that approximate scalability can be efficiently checked was probably first seen in . A complete and well-readable proof giving explicit bounds appeared in , exact scalability can be found in .
We first follow the proof in , which uses the transportation graph described in Figure 2.
The matrix is approximately scalable iff the maximum flow of this network is equal to . The flows along the edges then define a matrix with the wanted pattern. Such a network flow problem can be solved in time with and the number of nonzero elements in ().
In order to check for exact scalability, one has to check whether there exists a solution where each edge has a positive amount of flow (otherwise the entry would have to be reduced to zero). We can check for a solution to the maximum flow problem with minimum flow through each edge bigger than a prespecified value with the same costs as solving a maximum flow problem twice. Clearly, this does not help as, we would have to check scalability for any .
However (following ) if have only rational entries, we can find a number such that have only integer values. In this case, the flow problem has a solution iff there exists a matrix with column sum and row sum where each positive entry fulfils , where denotes the number of edges in .
This implies that it suffices to check for a solution with capacities and minimum flow through each edge having prespecified value . ∎
For positive semidefinite matrices, scalability can also be checked easily:
The formulation is already nearly in canonical form. We maximize subject to the equality constraints and . ∎
For arbitrary matrices scalability is mostly NP-hard (see Section 5.4).
2 The RAS algorithm
The RAS algorithm, being the natural algorithm to compute approximate scaling, is also the most studied algorithm. For the case of doubly stochastic matrices, it has long been known (cf. ) that for positive matrices, the RAS converges linearly (sometimes called geometrically) in the norm. gave a simple argument that the iteration will get better at any step. This can also be inferred from the fact that the RAS method is iterated I-projection onto a convex set using . Later, showed that the convergence is also linear in Hilbert’s projective metric, while showed linear convergence for all exactly scalable matrices basically in arbitrary vector norms, albeit without explicit bounds. Conversely, it was shown that only scalable matrices can have linear convergence meaning that the RAS converges sublinear for matrices with support that is not total (). We have the following best bounds:
where is the second largest singular value of .
The proof of this theorem crucially relies on the fact that matrices with a doubly stochastic pattern are direct sums of primitive matrices (modulo row and column permutations). Hence it cannot easily be extended to matrices with arbitrary row and column sums if those matrix patterns allow for non-primitive matrices. The approach in for positive matrices can also be extended to arbitrary row and column sums.
We observe that the occurrence of the second singular value should not come as a big surprise: Given a stochastic matrix , converges to a fixed matrix and the convergence is dominated also by the gap between the largest singular value and the second largest singular value of .
steps, where all matrix entries are in the interval and the maximal error is upper-bounded by . (, Theorem 1). They also derive a bound for a randomised version of the RAS, where at each step, the direction of descent is selected randomly and once in a while, the whole error function is computed randomly. The expected runtime is then slightly lower.
The two results (specific bounds and asymptotic linear behaviour for convergence speed) imply that the RAS method has generally good convergence properties if the matrix is positive.
A fully polynomial time algorithm (i.e. without a factor involving the size of the matrix entries) for general marginals was given in based on the RAS method with preprocessing. However, the algorithm scales with for the general -scaling and (using a different algorithm closer to the RAS) with for doubly-stochastic scaling.
In summary, the RAS method, while not fully polynomial by itself, can be tweaked in various ways to allow for fully polynomial algorithms. In addition, it has the advantage of being parallelisable as demonstrated in . However, the scaling behaviour is not particularly fast in specific examples (see for instance ), in particular it doesn’t seem to be very good at handling sparse matrices.
3 Newton methods
One of the first alternative algorithms to the RAS methods was provided in as a minimisation of using a modified Newton method as described in (it is not related how the equality constraints are introduced into the problem. This can be done using a -penalty function).
Newton methods were also developed to solve the scaling problem for positive semidefinite matrices. They can either be seen as Newton’s method applied to for symmetric (cf. ) or as Newton’s method applied to the Sinkhorn iteration equation (cf. ). Yet a different method was considered in .
shows that their algorithm converges in Newton iteration steps, where , if the matrix is scalable.
4 Convex programming
As noted in section 3.5, the convex programming formulation of the problem makes it amenable to a host of (polynomial time) techniques such as the ellipsoid method or interior point algorithms.
The bound uses ellipsoid methods. Later, the bounds were extended to cover generalised marginals in (also including the generalisation discussed in ) specifically using ellipsoid methods for the convex optimisation formulation of equation (16). The first instance of an interior point algorithm was probably formulated in applied to the entropy formulation. The authors find a strongly polynomial algorithm which scales better than with It seems that the authors were unaware of Kalantari et al. and ..
A different ansatz for an algorithm was used in , where the author uses the duality in convex programming and a coordinate ascent algorithm for the dual problem of his truncated matrix scaling. This algorithm will then be some form of generalisation of the RAS method.
5 Other ideas
We give a short primer of other algorithms considered in the literature:
The first paper to develop new algorithms with a focus on speed and not only concepts was , where a bunch of slightly different and optimised algorithms is derived.
An algorithm which is somewhat related to convex algorithms is considered in . It is a total gradient based, steepest descent algorithm for the homogeneous log-barrier potential.
In , using an algorithm for the matrix apportionment problem involving network flows and using ideas of , they provide an algorithm where the number of iterations scales with
Once again, for all .
With ever larger matrices, it is sometimes infeasible to access each element of the matrix on its own, because the matrix is not stored in that form or processed somewhere else. This makes it interesting to consider algorithms that do not need access to all elements, such as the RAS method for nonnegative matrices. Algorithms that are “matrix free” in this sense were developed in for doubly stochastic scaling of positive semidefinte matrices.
Finally, let us mention that algorithms were also developed for infinity norm scaling (cf. ) and other norm scaling (cf. ).
6 Comparison of the algorithms
A first comparison of several algorithms was performed in , however, the comparison is not really in terms of speed (for instance, all algorithms were implemented on different programming platforms), but in terms of useability.
While there have been many papers claiming superior convergence speed for their algorithm, the most comprehensive analysis has probably been achieved in , which is limited to doubly-stochastic scalings. In the paper, the authors compare the RAS method, a Gauss-Seidel implementation of the ideas of , and the fastest algorithm in with their own Newton-method algorithm. The test matrices are mostly large sparse matrices and the new algorithm is usually the fastest and most robust algorithm. The authors also claim that their Newton-based implementation is superior to and . They also suggest that the algorithm should outperform the convex optimisation based algorithms, albeit a direct comparison to the most recent algorithm in is missing, who only showed that their algorithm clearly outperforms the RAS method. also mention that their purely matrix free algorithm will outperform explicit methods such as those in if accessing single elements in the matrix is actually slow.
In general, matrix scaling can today be done on a routine basis even for very large matrices.
Applications of Sinkhorn’s theorem
The following problem can be encountered in many areas of applied mathematics (see also ):
We could also ask for balanced marginals or any other type of marginals. The problem is certainly not well-posed. What does “close” mean? This part of the review will try to give an overview why “close” means equivalence scaling in many applications. We will limit our attention mostly to the mathematical justification of matrix scalings, but I will try to give pointers to other literature.
This implies that we only consider “nearness” leading to equivalence scaling or matrix balancing as the result. In the literature, other nearest matrices have also been considered such as addition of small matrices (e.g. ). describe network flow algorithms that allow for a wider variety of applications.
Matrix scaling has many different real world applications, which implies that it also needs different justifications. While statistical justifications exist, many applications argue with the simplicity of the method and the fact that it performs well in practice. These are valid arguments, but they are unsatisfactory from a mathematical point of view. In the two following subsections we collect mathematically rigorous (or partly rigorous) justifications and their history.
As seen, matrix scaling solves entropy minimisation with marginal constraints. This is one of the most powerful entries for justifications of equivalence scaling as the right model, since relative entropy has strong statistical justifications, mostly in the form of maximum entropy or minimum discrimination information - see or later in for a justification in physics, or and for a justification in statistics.
Another justification closely connected to entropy minimisation is maximum likelihood models. For instance, if given a set of distributions and an empirical i.i.d. sample , then the maximum likelihood for being a sample of is given by the minimal relative entropy (). Max-Likelihood justifications for applications in contingency tables are given in .
A different class of justifications for the validity of the matrix scaling approach are arguments showing that matrix scaling conserves certain form of interactions within the matrix. For instance, matrix scaling conserves cross products () and so-called -cycles (, -cycles are certain products of matrix and inverse matrix entries). Both can be desirable for modeling reasons.
Finally, let us mention that the original justification (matrix scaling is a least-square type optimisation) made in turned out to be wrong very quickly and was superseded by real least-square methods in and later in or . Those however are not the same as equivalence scaling (see Section 6.8).
2 Axiomatic justification
Another justification for matrix scaling, which is particularly useful for application in elections is given in . Instead of considering just any matrix “close” to the original estimate, we want this matrix to fulfil a set of axioms.
Let be a nonnegative matrix, () be upper (lower) bounds to the row and column sums and be a scalar. As in Section 6.6, we denote the set of all matrices fulfiling the bounds with by . For any matrix and any set of bounds , we search for a method to allocate one out of potentially many matrices fulfiling and the following axioms:
Excactness: If and then
Relevance: If is another set of bounds such that and there exists a possible , then .
Uniformity: For any matrix with bounds , if we construct a new matrix by exchanging any submatrix by another submatrix which fulfils the same row and column sums minus the part of these bound allocated in , then .
Monotonicity: If we have two matrices with for all , then it also holds that for all possible allocations.
Homogeneity: Suppose and . Then, if two rows of are proportional and are constrained to the same row sum, then the corresponding rows in are always equal.
Then show that equivalence scaling (the fair share matrix of Section 6.6) is the unique allocation method for all nonnegative matrices where contains a matrix with the same pattern as .
3 A primer on applications
We will only sketch applications here since a complete list and discussion is probably infeasible.
A natural problem in geography is connected to predicting flows in a traffic network. If one considers for example a network of streets in a city at rush hour and a number of workers that want to get home, it is important to know how the traffic will be routed through the network. This is to a large degree a problem of physical modeling and a number of methods have been developed in the last century (for a recent introduction and overview see The authors also discuss the RAS method in chapter 5. I am however not convinced by their claim that Bregman provided the best analysis of the mathematics of the problem.).
For our purposes, the most interesting question results from estimating trip distribution patterns from prior or incomplete data. In a simplified model, one could consider only origin and destination nodes (e.g. home quarters and work areas), given by a nonnegative matrix . While the matrix is known for one year, it might be necessary to predict the changes given that the amount of trips to and from one destination change.
Several papers have treated a justification of the RAS method in this case. For instance, argues that the method provides a unique outcome and it is easier to handle and to compute than other methods (Detroit method, growth factor method,…). A discussion of trip distribution with respect to Problem 1 can be found in .
In many situations ranging from biology to economics, contingency tables need to be estimated from sample data. Contingency tables list the frequency distributions of events in surveys, experiments, etc. They are highly useful to map several variables and study their relations.
As a specific example, suppose a small census in Germany tries to estimate migration between the states. While the number of citizens is recorded, which means that the total net migration is known, it is not known where each individual migrant came from. From a small survey among migrants, how can one estimate the true table with correct marginals in the best possible way? If one does a maximum likelihood estimation, the result is once again matrix scaling (cf. ).
Social accounting matrices, or SAMs, are an old tool developed in and later popularised in to represent the national account of a country. To date, it is an important aspect of national and international accounting (as a random example see from the German national institute of statistics. An introduction to social accounting can also be found in and, from a short mathematical perspective, in ).
The idea is to represent income and outcome of a national economy in a matrix. Often, good growth estimates are known for the row and column sums and certain estimates are known for individual cells. The account estimates are then often not balanced, which can be achieved using matrix balancing or matrix scaling. Justifications can be imported from statistics, most notably maximum likelihood.
In , the author considered the following setup: Suppose we have a Brownian motion and a model which we are very confident about. In an experiment we observe its density at two times . Now suppose they differ significantly from the model predictions. How can we reconcile these observations by updating our model without discarding it completely?
This problem has been studied in a whole line of papers since then from to . The minimum relative entropy approach can be justified using large deviations (see ).
Given a system of linear equations with nonsingular , solving it relies on the Gaussian elimination procedure, which is known to be numerically unstable for matrices with bad condition number . In order to increase the stability, we have to modify , for example by multiplying with diagonal matrices and considering . Given that linear systems are ubiquitous in numerical analysis, it is of paramount importance to know how best to precondition a matrix in order to minimise calculation errors (see for instance the survey ). The answer to this question is problem dependent. Particular problems, where equivalence scaling is helpful to go include integral controllability tests based on steady-state information and the selection of sensors and actuators using dynamic information (see ).
One of the first papers to consider minimisation of using diagonal scaling was , who focused on matrix balancing instead of equivalence scaling (see also and for sparse matrices). Since the condition number contains the maximum norm, it might be best to require balanced maximum rows and columns instead of balanced row sums as observed in . This works particularly well for sparse matrices. Equivalence scaling has been studied as early as Reference from and later in . If we use other -norms in the definition of , a convex programming solution for minimising using equivalence scaling is provided in .
Note that unlike in all applications studied so far, preconditioning a matrix is useful not only for nonnegative matrices. This is one reason why matrix balancing was studied for copositive and not simply nonnegative matrices. A very different measure of the “goodness” of scaling which might also be of numerical relevance was studied in , where the authors solved the problem of matrix balancing of a matrix such that the ratio between the biggest and smallest element of the scaled matrix becomes minimial.
A very important application of equivalence scaling can be found in Voting: Given election results in a federal election, how can one best distribute the seats among the parties within the states such that each party and each state is represented according to the outcome of the election? Note that here, we need to adjust for natural numbers, which requires rounding (cf. ).
Early methods based on a discretised RAS method were developed in . The problem is very intricate in itself, because the justifications rely on what is perceived as “fair” and any method that is fair in some instances is unfair in others (see for instance the discussions in ; for an overview, see ).
Various other applications of equivalence scaling and matrix balancing exist such as:
A Sudoku Solver based on a stochastic algorithm based on the RAS was developed in .
An algorithm to rank web-pages was developed in . The RAS allows to derive an algorithm similar in scope to the HITS algorithm ().
The RAS method is analysed as a relaxed clustering algorithm in data mining (). However, it turns out that methods based on other Bregman-divergences are more favourable.
Given a (discretised) quantum mechanical time evolution, can we construct a local hidden variable model of its evolution corresponding to a deterministic stochastic transition matrix ()?
Given a Markov chain with a doubly stochastic transition matrix and given an estimate of the transition matrix, the best estimate of the real transition matrix is given by a scaled matrix ().
Regularising optimal transportation by an entropy penalty term such that it can be computed using the RAS, which is already much faster than optimal transport algorithms ().
Scalings for positive maps
We have already seen that Sinkhorn scaling is interesting for classical Schrödinger bridges as well as for scaling transition maps of Markov processes, etc. From a physics perspective, all these applications are classical physics, transforming classical states (probability distributions) to classical states.
In quantum mechanics, the basic objects are quantum states. For finite dimensional systems (such as spin systems), these quantum states are positive semidefinite matrices with unit trace. A quantum operation then maps states to states, i.e. it needs to be positive: If , then . In fact, this is not all that is required for quantum operations, but one actually needs to be completely positive (for an overview about quantum operations and quantum channels, see ). (Completely) Positive trace-preserving maps are then the natural generalisation of stochastic matrices. This raises the question whether concepts as irreducibility and a Perron-Frobenius theorem exist also for quantum channels and indeed they do. A Perron-Frobenius analogue was probably first described in , while the analogue for full indecomposability was first used in .
Likewise, it is called fully indecomposable if for any two nonzero orthogonal projections with the same rank such that
Finally, a map is called positivity improving (the analogue to positive matrices) if for all , .
A lot of different characterisations have been found (see Appendix C).
Let be a positive map. Then is called rank non-decreasing if for all
It is called rank increasing, if the sign in equation (38) is replaced by a .
The connections of Definitions 9.1 and 9.2 are explained in Appendix C.
Let us now define what we mean by scaling a positive map:
Let be a positive, linear map. We say that is scalable to a doubly stochastic map, if there exist such that
We call -scalable if there exists a scaling as in equation (39) to an -doubly-stochastic map .
The error function , which is similar to an -error function for matrices, will serve twofold: first, it defines approximate scalability (which can alternatively be defined by convergence of the RAS) and second, it defines a progress measure for convergence similar to error functions as considered in .
We can now state the full analogue of equivalence scaling to doubly stochastic form:
Given a positive map , it is scalable to a doubly stochastic map iff there exist some matrices and such that is a direct sum of fully indecomposable maps.
The scaling matrices are unique iff is fully indecomposable.
The fact that fully indecomposable matrices are uniquely scalable was first proved in . His work built on earlier work in (see also ), based on a generalisation of the convex approach in equation (16) and the London-Djokovic approach in equation (8).
Recently, the problem was considered with the hope to apply it for unital quantum channels in (which has never been formally published) and shortly afterwards in while trying to define and study “quantum” Schrödinger bridges. Both approaches use nonlinear Perron-Frobenius theory and thereby a generalisation of equation (11) to get a result. The approaches derived from classical results are discussed in Section 9.1.
Even earlier than Gurvits, a very limited version of the theorem was proven in (with subsequent generalisations) using an approach that does not derive from any of the classical approaches but instead uses the Choi-Jamiolkowski isomorphism. This is described in Section 9.2.
The extension of the theorem to necessary and sufficient conditions has as far as I know not been formally publishedGurvits actually claims a proof for the fact that a positive map is uniquely scalable iff it is fully indecomposable, but I did not understand how the only if part follows..
Furthermore, we can state an analogue of approximate scaling, which to date has only been considered in :
Let be a positive, linear map. Then is approximately scalable (i.e. -scalable for any ) if and only if is rank non-increasing.
An overview about different approaches and how they derive from existing approaches can be found in Figure 4.
We will now study how the theorems above were derived extending classical approaches to positive maps starting with an analogue of the RAS method for positive maps:
Let be a positive, linear map.
Start with .
By construction, we iterate between trace-preserving (even) and unital (odd) maps.
As stated, an approach along the lines of the London-Djokovic approach of equation (8) is found in (with methods of ). Since the complete proofs are lengthy and scattered over several papers, we provide full proofs in Appendix D for the benefit of the reader. In this section, we only sketch the path of the proofs.
Recall that a matrix scaling exists iff the following is positive and the minimum is attained:
Exchanging products with determinants and sums with traces, we obtain the following definition:
Let be a positive, linear map. Then define the capacity via
We will start with covering approximate scaling:
The capacity is the right functional to study scaling:
Let be a positive map. If then the RAS method of Algorithm 9.6 converges and is -scalable for any .
The proof of this lemma uses the following observation: For any we have
Then, a quick calculation shows that Algorithm 9.6 only decreases using this equality. If , one can then show that for .
Next, we need to see when the capacity is actually positive. To do this, for every unitary we need to define the tuple
where is the -th column of . This is done to connect the capacity with so called mixed discriminants (see also C), which are needed for the proof. In fact, we have:
Let be a positive, linear map and a fixed unitary. Then defining
where are once again the rows of , we have the following properties:
Using the mixed discriminant defined in Appendix C, we have
Most of the proof is very technical and found in . Some parts are explained in Appendix D.
Finally, this proves most of Theorem 9.5: We know that is rank non-decreasing if and only if is positive. In that case, the RAS algorithm converges in which case the map is approximately scalable. For the other direction, one can use a simple contradiction argument: Any map close to a doubly stochastic map must be rank non-decreasing and as scaling does not change this property of a map, any approximately scalable map must be rank non-decreasing.
The capacity is also the correct generalisation for exact scaling:
Let be a positive map. Then is scalable to a doubly-stochastic map if and only if and the capacity can be achieved.
The proof of this Lemma following is given in Appendix D. The direction “Capacity is achieved the map is scalable”, is proved by taking the Lagrangian and showing that at the minimum we have that
which implies scalability by Lemma 9.14. The converse direction is given by a direct calculation.
In order to prove that a map can be scaled to doubly stochastic form, one then needs to connect this lemma to full indecomposability of matrices. The proof is done using an argument involving strict convexity. Like the original London-Djokovic potential (8), the capacity is not a convex function, but one can make a substitution similar to Formulation (16) by considering the following function for any tuple of positive definite matrices:
Let be a positive, linear map and given , let . Then
The proof is technical and uses mixed discriminants as well as results about them from , which is why we only discuss it in the appendices. This is then used to prove
Let be a positive, linear map. If is fully indecomposable, there exists a unique scaling of to a doubly stochastic map.
The idea is somewhat similar to the approximate Sinkhorn theorem: Since fully indecomposable maps are in particular rank non-decreasing, we know that the capacity is positive. For any , which is diagonalised by , using the tuple , one can then see that with the eigenvalues of . Showing that the infimum must lie inside a compact set then finishes the proof, since Lemma 9.11 implies existence and uniqueness of the minimum as is strictly convex.
Using Lemma C.7, we can see then see that up to a unitary, every doubly stochastic map is a direct sum of fully indecomposable maps (much like doubly stochastic matrices are up to permutations a direct sum of fully indecomposable matrices, see Proposition A.5). Hence, any map which is a direct sum of fully indecomposable maps up to some scaling is clearly scalable to doubly stochastic maps. Sadly, the condition seems not very useful and the question remains open, whether one can simplify this condition.
However, that does in fact answer the question of unique scaling: Since the scaling for direct sums is not unique (we can always interchange summands), a map is uniquely scalable if and only if it is fully indecomposable.
1.2 Nonlinear Perron-Frobenius theory
The main idea of the alternative proofs in my Master’s Thesis () and the paper is to extend the Menon operator to positive semidefinite matrices:
Let be a positive map, such that for all . Let denote matrix inversion, then we define the following nonlinear map:
This map is well-defined and after normalisation, it sends positive definite matrices of trace one onto itself.
We can then reformulate the existence problem into a fixed point problem:
Let be the positive definite eigenvector of . Then define . Since is an eigenvector, one immediately sees that with . Now define and (i.e. and ), then are positive definite and if we define the map:
On the other hand, a similar calculation shows
but since was shown to be unital, is trace-preserving and to begin with. Conversely, given as in the lemma, would be a fixed point of the Menon-operator. ∎
Note that this completes the proof of Lemma 9.10. We only have to see that the conditions at the minimum are met if and only if the Menon operator has a fixed point. This also provides the connection between the Menon operator and Gurvits’ approach: As in the classical case, the conditions for a fixed point of the Menon operator are given by the Lagrange conditions of the London-Djokovic potential.
We observe that the Menon-operator, if it is defined, is a continuous, homogeneous positive map. This lets us give a proof of a weak form of the Operator Sinkhorn Theorem:
where we used the unitality of in the third step. This implies
Let be a trace-preserving and positivity improving map, then there exist maps such that is a doubly stochastic map.
1.3 Other approaches and generalised scaling
We have seen that at least two classical approaches for proving that a matrix can be scaled to a doubly stochastic matrix can be extended to the quantum case without too much trouble (the proofs however might be more difficult): nonlinear Perron-Frobenius theory and the barrier function approach. In a sense, we have also seen convex programming approaches. An immediate question is whether one can extend the entropy approach. This was also asked in and it is a major open question in , since the motivation of Schrödinger bridges heavily relies on relative entropy minimisation. The answer is not clear since something like a quantum relative entropy is only used only on the level of matrices and a justification via the Choi-Jamiolkowski isomorphism is not immediate (see Section 9.2):
Another question is how to extend the theorems from the doubly-stochastic map to cover arbitrary marginals, i.e. we want to scale a positive map such that
with some prespecified . For Gurvits’ approach based on equation (8) this is not really straightforward, since it is unclear how to take appropriate powers of . For the approach via nonlinear Perron-Frobenius theory, this can be done to some degree:
Given a positivity improving map and two matrices with , there exist matrices and a constant such that fulfills
Step 1: Let be positivity improving, then is a well-defined, continuous, and homogeneous map. It is well-defined, since maps and and send if . It is homogeneous, because is linear and for . Finally, is continuous as taking the square root of a positive definite matrix is continuous and matrix multiplication and inversion of positive definite matrices is continuous. Likewise, is continuous and thus as composition of continuous maps.
Step 2: We now claim that a scaling of with as in the theorem with exists iff has an eigenvector. This was observed in and is a straightforward but lengthy calculation.
Step 3: Finally, we can prove the existence of such that a scaling exists by invoking Brouwer’s fixed point theorem. The map
is a continuous, well-defined map, hence it has a fixed point. This is necessarily an eigenvector of , hence defines a scaling. ∎
The problem with this proof is that this Menon operator is no longer clearly a contraction mapping, hence uniqueness and convergence speed of the algorithm are not clear. Also, the obvious algorithm derived from this proof differs from the usual RAS algorithm. It is not clear how to remedy this or extend one of the other approaches. Also, this map has even worse prospect of being generalised to positive and not necessarily positivity improving maps. In any case, it is not immediately clear what the right necessary and sufficient conditions are. In the case of matrices, patterns were the important concept, but what is a pattern supposed to be for positive maps? One can always choose a basis and represent the map as matrix, but this is very much map-dependent and it is not clear what the correct interpretation will be.
2 Operator Sinkhorn theorem via state-channel duality
Another formulation of the operator Sinkhorn theorem is given by the Choi-Jamiolkowski isomorphism. It states that given any positive map , we have that
is a block-positive matrix (i.e. for all ). Here is the so-called maximally entangled state. If is completely positive, i.e. is a positive map, then is a positive semi-definite matrix. Now consider and . We have
Therefore, the task can be reformulated: Given a block positive matrix , find such that
We can then state an equivalent version of Sinkhorn scaling for positive map:
Let be a positive definite density matrix. Then there exist matrices such that
But then, since the are linearly independent, for all . Likewise, for all and we have the required normal form. ∎
3 Convergence speed and stability results
Gurvits’ proof already gives an estimate for the convergence speed of the scheme (see Theorem 4.7.3. in ). Let us give an alternative proof using Hilbert’s metric which is equivalent to the classical proof in and reminiscent of the convergence proof in . Throughout the proof, we use several results from Appendix B, in particular the definition of Hilbert’s projective metric on the cone of positive semidefinite matrices and the definition of the contraction ratio in equation (68). To proceed, we define a metric on the space of positive maps that are scalable:
Let be two positive maps such that for some positive matrices . Then
defines a metric on the space of positive maps (two maps that cannot be scaled to each other have infinite distance).
A proof that this constitutes a metric may be found in , Chapter 2. Recall the Sinkhorn iteration as defined in equations (41)-(42). For convenience we use a slightly different notation:
Let be a positivity improving, trace preserving map. Let be the unique doubly stochastic scaling limit.
where is the contraction ratio of equation (68). In particular, this implies via proposition B.6 (implying that here, ) that the convergence is geometric.
The proof is similar to the classical one in . First recall the definition of from Appendix B:
Then but finite, since is a positivity improving map and the maximum is attained. This is true, because it suffices to consider on the compact set using Proposition B.6 (iii).
We first make the following observations:
Equation (58) follows from the definition of . Equation (57) follows from the definition of and in the definition of the Hilbert metric and the fact that taking noncommutative inverses does not change positivity.
Let us now focus on . Let be invertible, then
using observation (58) and then being invertible. In particular, this implies that for every we have a universal upper bound
Since but finite, this implies that we can upper bound each and by some . The rest is basically an iteration.
These are the key observations. Now using the definition of we obtain:
and therefore, if denotes the limit of the Sinkhorn iteration, using the geometric series
The other inequality for the maps follows from symmetric arguments. ∎
Note that in contrast to the classical case in , because of the noncommutativity in equation (57), a simple extension to the general scaling of positivity improving maps seems not possible.
Next, we wish to generalise also the stability results. It seems natural that this should follow from the contraction results above:
Let be positivity improving, then the scaling is continuous in .
Let be a positivity improving map and be a perturbation which is again positivity preserving, where is a positive map with (for instance in the operator norm).
Then let be such that they scale to a doubly stochastic map. This implies that
But then, by equation (64), we have that if is the scaling of to a doubly stochastic map, then
Using the triangle inequality and the fact that for some constant finishes the proof. ∎
As noted, both theorems can be extended to cover all exactly scalable positive maps using Gurvits’ Theorem 4.7.3. of . Given the result for classical matrices, it seems natural that the convergence speed for rank non-decreasing but not exactly scalable matrices should not be geometric.
4 Applications of the Operator Sinkhorn Theorem
Let us finally mention applications of the operator version of Sinkhorn’s theorem. The state-version of the theorem, since it can be seen as a normal form for states under local operations, has been applied in the study of states under LOCC operations (see for instance ).
The approximate operator version was developed to obtain polynomial-time algorithms (Sinkhorn scaling) for a problem known as “Edmond’s problem”. It asks the following question (): Given a linear subspace of , does there exist a nonsingular matrix in ? The question can be asked also over different number fields and in different contexts. It is particularly interesting, because it is related to rational identity testing over non-commutative variables as studied in . For further input we refer the reader to the extended and well-written review of the applications in .
Finally, the exact scalability of fully indecomposable positive maps provided bounds on the mixed discriminant of matrix tuples, which is interesting to provide permanent bounds ().
I was first acquainted with the topic of matrix scaling through my Master’s thesis supervisor Michael Wolf. We also worked together on unitary scaling and I thank him for his valuable input. I am supported by the Studienstiftung des deutschen Volkes.
References
Appendix A Preliminaries on matrices
Sinkhorn’s theorem is closely related to irreducibility and notions connected to it.
Therefore, we first recall the following charactersation of irreducible matrices.
The digraph associated to is strongly connected.
For each and there exists a such that .
For any partition of , there exists a and an such that .
An overview about these and similar properties can be found in . Graph related properties are proven in .
A bipartite graph has a perfect matching, if it contains a subgraph where the degree of any vertex is exactly one, i.e. any vertex is matched with exactly one other vertex.
Note that this definition is not dependent on the size of the entries of , it only depends on whether an entry is positive or zero.
is fully indecomposable for all permutations ,
There exist permutations such that is irreducible and has a positive main diagonal.
For any the edge set of the bipartite graph for , there exists a perfect matching in containing this edge.
The equivalence is obvious and is done in . The direction follows from the Frobenius-König theorem (cf. , Chapter 2). The converse direction follows from a short contradiction proof.
Finally, follows essentially from Theorem 4.1.1 in , which was first observed in In Hungarian. Reference taken from . ∎
Since multiplication of positive diagonal matrices from the right and from the left does not change the pattern of a matrix, having a matrix that has the required row and column sums is an easy necessary condition for scalability.
Let us now consider the special case of doubly stochastic matrices in more detail. We define:
Furthermore, has support, if there exists a positive diagonal, i.e. there exists an such that for some permutation with we have .
After independent permutations of rows and columns, is a direct sum of fully indecomposable matrices.
Furthermore, has support if and only if there exists a matrix with a subpattern of with total support.
For we follow the proof in .
Let be doubly stochastic. Since we can permute rows and columns independently, we can assume that is of the form
Finally, consider a matrix with total support. Clearly, if it is a submatrix of some other matrix , then will have support, since any element which is contained in will lie on a nonzero diagonal. Conversely, if has support, setting any element which does not lie on a positive diagonal to zero produces a matrix that has total support. ∎
Appendix B Introduction to Nonlinear Perron-Frobenius theory
The basic result underlying Perron-Frobenius theory is an old theorem from and stating:
The theorem was later interpreted geometrically in , where the authors noted that it follows using Hilbert’s projective metric and contraction principles. From then on, it was slowly extended to (not necessarily linear) operators, which lies at the heart of nonlinear Perron-Frobenius theory. The connection to matrix scaling became clear int the 60s to 80s and parts of each theory have developed alongside each other since. Probably the best current reference on the topic is . In the following, we sketch some of the most important ideas surrounding the theory:
Recall that given a topological vector space , a cone is a set such that for all , for all . A convex cone is a cone that contains all convex combinations. By definition, it is equivalent to say that is a convex cone if and only if for all and all . A cone is called solid, if it contains an interior point in the topology of the vector space and closed if it is closed in the given topology. It is called polyhedral, if it is the intersection of finitely many closed half-spaces (cf. ).
As is the case with classical Perron-Frobenius theory, the spectral radius is the crucial notion. For general maps between cones, there are several definitions, which turn out to be the same for most purposes, hence we restrict to one such notion:
Let be a solid closed cone in a finite dimensional vector space with a fixed norm and a continuous homogeneous map. Define the cone spectral radius as
The first crucial observation is that the spectral radius is actually attained for homogeneous order-preserving maps (which is not the case for general maps):
Let be a solid closed cone in a finite dimensional vector space . If is a continuous, homogeneous, order-preserving map, then there exists with
Note that the theorem does not tell us whether the eigenvector lies inside the cone or on its boundary. The next and very powerful theorem does also not settle existence, but if existence is known, then it assures uniqueness and convergence:
Let be a solid closed cone in a finite dimensional vector space and let the dual cone. If is a homogeneous strongly order-preserving map and there exists a with such that , then
for all .
This theorem is essentially due to the fact that the map is contractive in what is known as Hilbert’s projective metric of the cones. Once the attractiveness is established, uniqueness follows immediately, since if we had another fixed point , then would imply a contradiction to equation 66. Since the attractiveness can be used to estimate convergence speeds (see ), we include the relevant proposition due to Birkhoff ().
Let be a closed, convex, solid cone in some vector space, then for any such that and for some , define
Then we can define Hilbert’s projective metric as
We will leave out , when it is clear from the context. Furthermore, we set and if is otherwise not well-defined.
Let be a closed, convex, solid cone in some vector space . Then satisfies:
and for all .
for all such that the quantities are well-defined.
for all and .
Note that the first two properties show that is indeed a metric and the third property shows why it is called a projective metric.
Let be a bounded, closed, convex and solid cone in some vector space . Let be a linear map, then
where .
Furthermore, and .
A proof can be found in . The result can be extended to much more general scenarios, see also and references therein. One can actually show that equality holds, i.e. is also attained, but this is not important here.
With all this machinery, we still need to prove existence of a fixed point in the interior of . The general theory for proving that a fixed point lies in the interior is weak and we generally have to prove it “by hand”.
For, the Menon operator we have an additional problem: It is at first only well-defined on the interior of the cone of positive semidefinite matrices. A natural question is whether it can be extended to cover the closed cone as well. For matrices, this is covered by the following theorem:
Let be cones and be an order-preserving, homogeneous map. If is solid and polyhedral, then there exists a continuous, order-preserving, homogeneous extension .
Appendix C Preliminaries on Positive Maps
Let us start with irreducible maps. Many different characterizations exist, which we recall for the reader’s convenience:
For a positive, linear map the following properties are equivalent:
Most of these properties are well-known. A proof can be found in .
As with matrices in the original Sinkhorn theorem, irreducibility is not the right characterization to work with, since given an irreducible map and two , is not necessarily irreducible. Before giving a characterization of fully indecomposable maps, we will study rank non-decreasing maps.
To every positive map and any unitary , we associate the decoherence operator via:
where is the -th row of . Furthermore, we associate to every decoherence operator the tuple
This will be important during the proof of the Sinkhorn scaling, because every map will be associated to the mixed discriminants of its decoherence operators:
Let be an -tuple with , then
Then we have the following characterization of rank non-decreasing maps, which is essentially due to :
Let be a positive, linear map. Then the following expressions are equivalent:
is rank non-decreasing for any unitary .
For any , if , then
For any , .
is rank non-decreasing for any of full rank.
The proofs that (i), (ii), (iii) and (v) are equivalent are essentially the same as for the fully indecomposable case in C.6. It remains to show the equivalence of (v) with (i). This was done in .
We can define what will turn out as a measure of being indecomposable for a tuple of matrices:
Let be an -tuple of matrices and denote by the tuple where is substituted by . Then define:
For fully decomposable maps, we have the following characterization (part of which is already present in ):
Let be a positive, linear map. Then the following expressions are equivalent:
is fully indecomposable
For all singular, but nonzero , .
Property (iii) holds for for every .
is fully indecomposable for all .
For any , if , then
for all .
Furthermore, when this is satisfied, and via also map the open cone into itself. Note also that the properties (vi) to (viii) are also equivalent for any fixed unitary.
(i) (iii): let be fully indecomposable and assume there was a nonzero , with . Since the kernels are vector spaces, this implies we can find a unitary matrix transforming the basis such that . Thus:
(iii) (i): Note that given of the same rank such that , we have in particular for some . Since is of the same rank as , , which is a contradiction.
(i) (v): Let . By positivity of , we have
Hence in particular and .
(i) (ii): This equivalence follows directly from (iv) by expressing and in terms of the projections onto the orthogonal complements.
The remaining equivalences (i) (vi),(vii),(viii) can be found in (with proofs scattered throughout the earlier papers by the same author). We just repeat them here using our notation for the reader’s convenience.
(iii) (vi): By definition, where . Obviously, is doubly stochastic, hence rank non-decreasing. Therefore, if (iii) holds for , it must also hold for .
(vi) (iii): Let be fully indecomposable for all unitaries and assume that for some with . Let be the unitary that diagonalizes , then , hence is not fully indecomposable. This is a contradiction.
(vi) (vii): Let be fully indecomposable. Then
Note that if , then and hence follows the claim. For the other direction, we can use the same idea.
(vii) (viii): Let for all unitary fulfill (vii). Define as the tuple where the -th element is replaced by the -th. Note that
Conversely, let not fulfill (vii) for some unitary , i.e. for some with we have . Let , then for the tuple as before, we have:
But then, by proposition C.4, and hence also . ∎
This proposition shows in particular that any fully indecomposable map is primitive: For any unit trace , . Note that the converse might not be true. By the characterization of primitive maps (, Theorem 6.7), this implies that each fully indecomposable map has only one fixed point.
Appendix D Gurvits’ proof of scaling and approximate scaling
This appendix provides the details of Gurvits’ approach, hence it does not contain original material. For easier readability, we repeat all Lemmata.
Let and a positive, linear map. Then we define a locally scalable functional to be a map such that
Locally bounded functionals are the right tools to study scalability:
Let be a positive, linear map. Given a bounded locally scalable functional such that , the Sinkhorn-iteration procedure converges:
We follow . Recall the definitions of the Sinkhorn iteration in equations (41)-(42). Because of property (73), we have
where we used that increases monotonically in the last inequality.
Let us now only consider even. Then we have just seen that
Since is trace-preserving as is even, for all . If we set , then by strict concavity of the logarithm,
since and . But then:
hence for . ∎
is a bounded locally scalable functional.
hence is a locally scalable functional. Via the AGM inequality, we have
Let be a positive, linear map and a fixed unitary. Then defining
where are once again the rows of , we have the following properties:
Using the mixed discriminant , we have
The first part of the lemma is one of the main results of . Since the proof is long due to many technicalities, we leave it out here.
The second part gives the relation between the two capacities. Let with and be such that , . Then there exist unitaries such that is diagonal with diagonal entries . By construction,
where are again the columns of . Hence
Likewise, we can construct a sequence of such that converges to the infimum and we can construct a sequence with for each such that
Taking the diagonal sequence we obtain a sequence converging to . Finally, define , then and
Finally, we can write down the Operator Sinkhorn theorem (Theorem 9.5 in the main text):
Let be a positive, linear map. Then is -scalable for all iff is rank non-decreasing.
We mostly need to combine the lemmas. By lemma D.3, the capacity is a bounded, locally scalable functional, which implies by proposition D.2 that converges, if . Now, by lemma 9.9,
Since is compact, it suffices to show that for every , . Again, by lemma 9.9,
but for every if and only if is rank non-decreasing by proposition C.4. Hence, converges for rank non-decreasing maps.
Now, suppose that is not rank non-decreasing. Then, by proposition A.5 (vii), there is a such that fulfills
Using equation (77), by the Cauchy Schwarz inequality,
which contradicts , hence must be rank non-decreasing. ∎
D.2 Exact scalability
Let be a positive map. Then is scalable to a doubly-stochastic map if and only if and the capacity can be achieved.
Suppose there exists with . The Lagrangian of the capacity is
We claim that the conditions are equivalent to
Let be the usual matrix unit, then
where in the last step we use Cramer’s rule. For we obtain the right hand side of equation (79) from equation (78), hence follows the claim. It now follows from Lemma 9.14 that any fulfilling Equation (79) defines a scaling.
A quick calculation shows that attains equality in Equation (81). This then necessarily minimises the capacity. ∎
Let be a positive, linear map and given , let . Then
We follow the proof of . Given a tuple of positive definite matrices, one can show ():
where denotes the general inner product, is the set of -tuples of integers such that and
This implies that we can rewrite :
It is well known that for positive matrices this is a convex function, but let us follow the proof of here.
Let . We need to prove that , the Hessian, is positive (semi)definite. By definition, , hence it is sufficient that .
hence the Hessian of is positive semi-definite and therefore is convex.
Now, assume that is fully indecomposable, hence the tuple fulfills proposition A.5 (vii) and (viii) for all . In particular, if is the tuple with the -th entry being replaced by the -th. entry. Then in particular. Note that by equation (83), where
where and by proposition A.5 (viii).
We only need to consider and show that this is a positive definite matrix on the hyperplane . Using the usual matrix units we can write:
We find that , since only the first four summands contribute to the diagonal terms. For the off-diagonal terms, note that in , all unordererd combinations of occur, twice with a positive sign and four times with a negative. Hence we obtain terms with either or that are not cancelled by other terms and therefore . In short:
Note that the image of is just the hyperplane and it is easy to see that is a multiple of the projection onto the hyperplane . Therefore, is strictly convex on . ∎
Let be a positive, linear map. If is fully indecomposable, there exists a unique scaling of to a doubly stochastic map.
where denotes the general inner product, is the set of -tuples of integers such that and .
where we use that certainly for all , with for all and , is a valid -tuple where the coefficient . Since all the terms in the sum of equation 84 are positive, we can just leave out all other . By definition, for every , hence:
from the lemma above. But then, the infimum must be attained on the compact subset . Therefore, also for the capacity the infimum can be considered on a compact subset of and is then attained. Uniqueness is ensured by the strict convexity of . ∎