In this post, we continue the design and analysis, started last time, of an interior point method for solving a linear program.

1. Overview and intuition for the analysis

In this section we try to distil the essential elements of the analysis, so that the technical sections later on make more sense. We make some non-rigorous simplifications along the way, so take them with a grain of salt.

We solved the equations giving us the Newton steps for the solution of the perturbed linear program. We now need to figure out what to do with them. One option would be to iterate them a bunch of times until we get really close to the perturbed solution. This is not what we'll do; instead, we'll take only one Newton step, and then decrease the perturbation. This makes sense because, on the one hand, the Newton step is expensive to compute (there's a matrix inversion in there), and on the other, each Newton step makes a lot of relative progress: the convergence is Q-quadratic, in the following sense:

Definition 1.

A sequence $ \{x_n\}_{n\in\mathbb{N}}$ converges Q-quadratically to $ x$ if there's a constant $ C$ such that for all big enough $ n$ we have \begin{align*} \frac{\left\| x_{n+1}-x \right\|}{\left\| x_n-x \right\|^2} \leq C \end{align*}

That's fast! A prototypical example of a sequence with exactly this rate of convergence is $ 2^{-2^k}\to 0$. A caveat is that we need to be close enough to the optimum for Newton's method to have this property. So we will have to carefully coordinate the way we decrease the perturbation with our bound on the improvement from a single Newton step in order to squeeze the optimal speed-accuracy tradeoff from our method, but let's worry about that later.
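Going back to the prototypical example: writing $ x_k = 2^{-2^k}$, we have \begin{align*} \frac{x_{k+1}}{x_k^2} = \frac{2^{-2^{k+1}}}{\left(2^{-2^k}\right)^2} = \frac{2^{-2^{k+1}}}{2^{-2^{k+1}}} = 1 \end{align*} so this sequence satisfies the definition with $ C=1$.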

Now that we've made this design decision, we need a bound on the improvement from the Newton step. Let's stare real hard at the equations defining the step: we have \begin{align*} A\Delta x &= 0 \\ A^T\Delta y + \Delta s &=0 \\ x_js_j+\Delta x_js_j + x_j\Delta s_j &= \mu \end{align*} where the current values satisfy $ Ax=b$ and $ A^Ty+s=c$. The ideal state of affairs is that $ (x_j+\Delta x_j)(s_j+\Delta s_j) = \mu$ for all $ j$, and we're off from that by $ \Delta x_j\Delta s_j$, so we will measure progress in terms of the deviation from this in the $ 2$-norm (because it's nice); so we're interested in bounding the new deviation \begin{align*} \sum_{j}^{} \Delta x_j^2\Delta s_j^2 \end{align*} in terms of the previous deviation $ \sum_{j}^{} (x_js_j-\mu)^2$. In light of $ x_js_j+\Delta x_js_j + x_j\Delta s_j = \mu$, the new deviation feels like a degree-4 thing, while the old one feels like a degree-2 thing, so it looks promising; in particular, this is the intuition for why we get the Q-quadratic convergence property.

But! Think about bounding $ \sum_{j}^{} \Delta x_j^2\Delta s_j^2$. Our handle on $ \Delta x$ and $ \Delta s$ comes from the equation $ x_js_j+\Delta x_js_j + x_j\Delta s_j = \mu$, which essentially tells us something about the sum $ \Delta x_j +\Delta s_j$ after appropriate rescaling: the sum will be something very related to $ \mu-x_js_j$, which looks nice as it's related to the current deviation. However, $ \Delta x_j^2\Delta s_j^2$ may have very little to do with $ \Delta x_j + \Delta s_j$ or the absolute value or square thereof, as the latter could well be zero all the time... but it can't, because $ \Delta x$ and $ \Delta s$ are suitably independent.

This is where the interplay between the primal and dual is crucial. Going back to staring at the equations, we notice something very nice. The increment in the primal variables satisfies $ \Delta x\in\ker A$; and the increment in the slack variables satisfies $ \Delta s\in \mathop{\rm im}\nolimits A^T$. That's because we're moving along $ Ax=b$ in the `primal space', and along $ A^Ty+s=c$ in the `dual space'. Thus $ \Delta x\perp\Delta s$. This is what allows us to write down something of the sort \begin{align*} \sum_{j}^{} \Delta x_j^2\Delta s_j^2\lesssim \left(\sum_{j}^{} (x_js_j-\mu)^2\right)^2 \end{align*}

Now, let's think about what happens when we change $ \mu$ after one Newton step. Suppose we take a Newton step and replace $ \mu$ by $ (1-\theta)\mu$ where $ \theta\ll 1$. This introduces a uniform error term $ \theta\mu$ in all coordinates, so that the new error in coordinate $ j$ is $ \Delta x_j\Delta s_j + \theta\mu$. But the orthogonality $ \Delta x\perp\Delta s$ tells us that the old error vector $ v$ with $ v_j = \Delta x_j\Delta s_j$ is orthogonal to the space of uniform vectors. So, the 2-norm of the new error is going to be \begin{align*} \sqrt{\left\| v \right\|^2 + \theta^2\mu^2n} \end{align*} because the squared 2-norm of the all-1s vector is $ n$. So we see that setting $ \theta$ of order $ 1/\sqrt{n}$ is probably a good idea: if it's polynomially bigger, our deviation blows up as $ n$ increases, and if it's any smaller, it will make us take more steps than necessary. This way, it will take about $ \sqrt{n}$ iterations to decrease $ \mu$ (and hence the duality gap) by a constant factor, and thus $ O\left(\sqrt{n}\log(\frac{1}{\varepsilon})\right)$ iterations to decrease it to $ \varepsilon$.
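To see the scaling concretely: with $ \theta = c/\sqrt{n}$ the uniform term contributes \begin{align*} \theta^2\mu^2 n = c^2\mu^2 \end{align*} independently of $ n$, so the new deviation stays of order $ \mu$; a polynomially larger $ \theta$ makes this term grow with $ n$.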

While the orthogonality (almost-orthogonality would have worked too) of $ \Delta x$ and $ \Delta s$ was essential in bounding $ \left\| v \right\|$, it doesn't seem essential in the calculation we just did: we would get pretty much the same behavior if $ v$ were collinear with the all-1s vector.

That hopefully gave you some intuition for the proof! Another curious consequence of the orthogonality that we should mention is that \begin{align*} \sum_{j}^{} \Delta x_j \Delta s_j = 0 \end{align*} and so our first-order approximation to the constraint $ (x_j+\Delta x_j)(s_j+\Delta s_j)=\mu$ is right on average, in the sense that \begin{align*} \left<x+\Delta x,s+\Delta s\right> = \mu n \end{align*}
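Indeed, expanding coordinate by coordinate and using the Newton equation $ x_js_j+\Delta x_js_j+x_j\Delta s_j=\mu$ together with $ \sum_{j}^{}\Delta x_j\Delta s_j=0$, \begin{align*} \left<x+\Delta x,s+\Delta s\right> = \sum_{j}^{}\left(x_js_j+\Delta x_js_j+x_j\Delta s_j\right) + \sum_{j}^{}\Delta x_j\Delta s_j = \mu n \end{align*}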

2. Convergence of the Newton steps

Recall what we got at the end of the last post: the Newton step for the KKT equations is given by \begin{align*} \Delta y &= (AXS^{-1}A^T)^{-1}(b-\mu AS^{-1}e) \\ \Delta s &= -A^T\Delta y \\ \Delta x &= S^{-1}(\mu e + XA^T\Delta y) - x \end{align*} where the current values satisfy $ Ax=b$ and $ A^Ty+s=c$. We want to get to values $ (x',s',y')$ such that $ x'_js'_j = \mu$ for all $ j$, and the Newton step represents an effort in that direction. The hope is that if we iterate this step, we will converge to values of $ x$ and $ s$ for which we indeed get $ x_js_j=\mu$.
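To make the formulas concrete, here is a minimal numpy sketch of one such step. The function name `newton_step` and the dense `np.linalg.solve` call are illustrative choices only; a serious implementation would exploit sparsity and reuse factorizations rather than solving from scratch.

```python
import numpy as np

def newton_step(A, b, x, s, mu):
    """One Newton step for the perturbed KKT system, assuming Ax = b,
    A^T y + s = c, x > 0, s > 0. Returns (dx, dy, ds)."""
    n = x.size
    e = np.ones(n)
    # M = A X S^{-1} A^T, where X = diag(x), S = diag(s)
    M = (A * (x / s)) @ A.T
    rhs = b - mu * (A @ (e / s))            # b - mu * A S^{-1} e
    dy = np.linalg.solve(M, rhs)            # Delta y
    ds = -A.T @ dy                          # Delta s = -A^T Delta y
    dx = (mu * e + x * (A.T @ dy)) / s - x  # Delta x = S^{-1}(mu e + X A^T Delta y) - x
    return dx, dy, ds
```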

But first we need to make sure we even stay feasible, in the sense that $ x>0$ and $ s>0$. The hope is that if we're close enough to the actual solution (which we know is feasible), the Newton step won't do anything too crazy and will stay within the feasible region. We can write down some conditions that easily imply that:

Observation 1.

If $ (x_j+\Delta x_j)(s_j+\Delta s_j)>0$, then both factors are $ >0$. Indeed, by the third Newton equation we have \begin{align*} 0<\mu + x_js_j = (x_j+\Delta x_j)s_j + (s_j+\Delta s_j)x_j \end{align*} and since $ x_j,s_j>0$, the two factors can't both be negative; as their product is positive, they're both positive.

In particular, the condition $ (x_j+\Delta x_j)(s_j+\Delta s_j)>0$ is sort of saying the product is close to what we want it to be ($ \mu$), and it will be easy to incorporate it into some metric of closeness to that.

To be more quantitative, what we're really hoping is that $ \Delta x_j \Delta s_j $ is small compared to $ \mu$ across all $ j$. To see how far we are from that, we introduce the vector $ v(x,s,\mu)$ given by $ v(x,s,\mu)_j = \frac{x_js_j}{\mu}-1$, motivated by the fact that $ v(x+\Delta x, s + \Delta s,\mu)_j =\frac{\Delta x_j \Delta s_j}{\mu}$, which we want to be small. We call $ \sigma(x,s,\mu)=\left\| v(x,s,\mu) \right\|$ the proximity measure.
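In code the proximity measure is a one-liner; the helper name `proximity` below is again just for illustration.

```python
import numpy as np

def proximity(x, s, mu):
    """Return the vector v(x, s, mu) and sigma = ||v||_2."""
    v = x * s / mu - 1.0          # v_j = x_j s_j / mu - 1
    return v, np.linalg.norm(v)
```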

Then we ideally want to show this vector keeps shrinking after each iteration. After some poking around, we get something nice:

Theorem 1.

If $ \sigma(x,s,\mu)=\sigma < 1$, then \begin{align*} \sigma(x+\Delta x,s+\Delta s,\mu) \leq \frac{1}{2}\frac{\sigma^2}{(1-\sigma)} \end{align*}

Proof 1.

We have \begin{align*} \mu^2\left\| v(x+\Delta x,s+\Delta s,\mu) \right\|^2 = \sum_{i}^{} \Delta x_i^2 \Delta s_i^2 \end{align*} We know that $ \Delta x_is_i + \Delta s_ix_i=\mu-x_is_i$, so it's natural to try to bound the above sum by applying appropriate weights to the $ \Delta x$ and $ \Delta s$ variables; in doing so, we also want to preserve the products $ \Delta x_i\Delta s_i$.

Thus, we consider the vectors $ dx,ds$ with $ dx_i = \Delta x_i \sqrt{\frac{s_i}{x_i}}, ds_i = \Delta s_i \sqrt{\frac{x_i}{s_i}}$. We will try to get some bound on $ \sum_{i}^{} dx_i^2ds_i^2$ using the sum $ dx_i+ds_i$ (which we know a nice expression for).

The key part now is that since $ A\Delta x = 0$ (remember the KKT conditions from the previous post) and $ \Delta s = -A^T\Delta y$, we have $ \Delta x \perp \Delta s$, and consequently, since $ dx_ids_i = \Delta x_i\Delta s_i$, $ dx\perp ds$. So, \begin{align*} \sum_{i}^{} \Delta x_i^2 \Delta s_i^2 &= \sum_{i}^{} dx_i^2 ds_i^2\leq \frac{1}{4}\sum_{i}^{} (dx_i^2 + ds_i^2)^2 \\ &\leq \frac{1}{4} \left(\sum_{i}^{} dx_i^2 + ds_i^2\right)^2 \\ &= \frac{1}{4}\left(\left\| dx \right\|^2 + \left\| ds \right\|^2\right)^2 \\ &=\frac{1}{4}\left(\left\| dx+ds \right\|^2\right)^2 \\ &= \frac{1}{4} \left\| dx + ds \right\|^4 \end{align*} where the second-to-last equality is Pythagoras' theorem, using $ dx\perp ds$. We compute \begin{align*} \left\| dx+ds \right\|^2 &= \sum_{i}^{} \left(\frac{\Delta x_is_i + \Delta s_ix_i}{\sqrt{s_ix_i}} \right)^2 \\ &= \sum_{i}^{} \left(\frac{\mu-x_is_i}{\sqrt{x_is_i}} \right)^2 = \sum_{i}^{} \frac{\mu^2}{x_is_i} v(x,s,\mu)_i^2 \end{align*} Piecing things together, \begin{align*} \left\| v(x+\Delta x, s+\Delta s,\mu) \right\|&\leq \frac{1}{2} \sum_{i}^{} \frac{\mu}{x_is_i}v(x,s,\mu)_i^2 \\ &\leq \frac{1}{2}\displaystyle\max_{i} \frac{\mu}{x_is_i} \left\| v(x,s,\mu) \right\|^2 \\ &=\frac{1}{2}\displaystyle\max_{i} \frac{\mu}{x_is_i}\sigma^2 \end{align*} It remains to argue that the maximum is at most $ \frac{1}{1-\sigma}$: since $ \left| v_i \right|\leq\sigma$ for all $ i$, we have $ \frac{x_is_i}{\mu} = 1+v_i \geq 1-\sigma$, which gives the bound.

You should be impressed by this rate of convergence, as we talked about earlier. Yet, we will not follow it through, but just take a single step, as we discussed in the intuitive description of the algorithm!

3. Following the path

Now we need to execute the path-following strategy with some care of the parameters. Let's first see how our proximity measure changes when we scale down $ \mu$:

Theorem 2.

If $ \left<x,s\right>=n\mu$ and $ \sigma=\sigma(x,s,\mu)$, then \begin{align*} \sigma(x,s,\mu(1-\theta)) = \frac{1}{1-\theta}\sqrt{\sigma^2+\theta^2n} \end{align*}

Proof 2.

We have \begin{align*} v(x,s,(1-\theta)\mu) &= \frac{XSe}{\mu(1-\theta)}-e = \frac{1}{1-\theta}\left(\frac{XSe}{\mu}-e\right) + \frac{\theta}{1-\theta}e \\ &= \frac{1}{1-\theta}v(x,s,\mu) + \frac{\theta}{1-\theta}e \end{align*} Our assumption $ \left<x,s\right>=n\mu$ tells us that $ v(x,s,\mu)\perp e$, so by Pythagoras' theorem we're done, as $ \left\| e \right\|^2 = n$.

So now we need to get our iterations right. By poking around the parameter space, we see that

Observation 2.

If $ \sigma(x,s,\mu)<2/3$ then $ \sigma(x+\Delta x,s+\Delta s,\mu) < \sigma(x,s,\mu)$, and if $ \theta = \frac{1}{4\sqrt{n}}$, then $ \sigma(x,s,\mu)\leq 1/2\implies \sigma(x+\Delta x,s+\Delta s, \mu(1-\theta))<1/2$. Moreover, going back to our analysis of whether $ x+\Delta x,s+\Delta s > 0$, it's easy to see that $ \sigma(x,s,\mu)<2/3$ implies that too (it gives $ \sigma(x+\Delta x,s+\Delta s,\mu)<1$, so $ \mu+\Delta x_j\Delta s_j>0$ for every $ j$, and Observation 1 applies).
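To verify the constants: Theorem 1 with $ \sigma\leq 1/2$ gives $ \sigma(x+\Delta x,s+\Delta s,\mu)\leq \frac{1}{2}\cdot\frac{(1/2)^2}{1-1/2}=\frac{1}{4}$, and Theorem 2 (which applies because $ \left<x+\Delta x,s+\Delta s\right>=n\mu$) with $ \theta=\frac{1}{4\sqrt{n}}$ then gives \begin{align*} \sigma(x+\Delta x,s+\Delta s,\mu(1-\theta)) \leq \frac{1}{1-\theta}\sqrt{\frac{1}{16}+\frac{1}{16}} = \frac{\sqrt{2}}{4(1-\theta)} < \frac{1}{2} \end{align*} since $ 1-\theta\geq 3/4$. Similarly, $ \sigma(x+\Delta x,s+\Delta s,\mu)<\sigma(x,s,\mu)$ amounts to $ \frac{\sigma}{2(1-\sigma)}<1$, i.e. $ \sigma<2/3$.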

Thus, to decrease the duality gap $ n\mu$ to $ \varepsilon$, we need about $ \sqrt{n} \log \left(\frac{n\mu_0}{\varepsilon}\right)$ iterations! So, we end up with the following main theorem:

Theorem 3 (Primal-dual interior point method) .

Suppose we have a pair of feasible solutions $ Ax_0=b, A^Ty_0+s_0=c$ with $ x_0,s_0>0$, and a starting point $ (x_0,s_0,\mu_0)$ with proximity measure $ \sigma(x_0,s_0,\mu_0)\leq 1/2$. Then, using the iterations \begin{align*} x_{t+1}\gets x_t+\Delta x_t, s_{t+1}\gets s_t + \Delta s_t, \mu_{t+1}\gets\mu_t(1-\theta) \end{align*} with $ \theta = \frac{1}{4\sqrt{n}}$, we have that the duality gap of the solution $ (x_t,s_t,\mu_t)$ after $ t$ iterations is \begin{align*} n\mu_t \leq \varepsilon \end{align*} for $ t= O \left(\sqrt{n}\left(\log \frac{\mu_0}{\varepsilon}+\log n\right)\right) $. Consequently, the objective value of the primal solution $ x_t$ is within $ \varepsilon$ of the optimum.
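For concreteness, here is how the whole loop might look in numpy, reusing the hypothetical `newton_step` and `proximity` sketches from above. This is only a schematic rendering of Theorem 3: it assumes we are already handed the strictly feasible, well-centered starting point that the theorem requires.

```python
import numpy as np

def primal_dual_ipm(A, b, x0, y0, s0, mu0, eps):
    """Path-following primal-dual interior point method (sketch of Theorem 3).
    Assumes A x0 = b, A^T y0 + s0 = c, x0 > 0, s0 > 0."""
    n = x0.size
    theta = 1.0 / (4.0 * np.sqrt(n))
    assert proximity(x0, s0, mu0)[1] <= 0.5       # well-centered start, as in Theorem 3
    x, y, s, mu = x0.copy(), y0.copy(), s0.copy(), mu0
    while n * mu > eps:                           # duality gap ~ n*mu still above target
        dx, dy, ds = newton_step(A, b, x, s, mu)  # one Newton step at the current mu
        x, y, s = x + dx, y + dy, s + ds
        mu *= 1.0 - theta                         # shrink the perturbation
    return x, y, s
```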

What we brushed under the rug was getting the initial good solution with proximity measure $ \leq 1/2$. Maybe we'll do it in another post, maybe not! Peace