In this post, we begin describing a basic interior point method for solving linear programs. We follow material from these notes by Professor Michel Goemans (the relevant lectures are 6, 7, and 8), filling in the details.

1. Motivation

In interior point methods, we approach the classical linear programming problem \begin{align*} \min \left<c,x\right> \quad \text{ subject to } x\geq 0 \text{ and }Ax=b \end{align*} from the following perspective: we want to find a point inside the feasible region $P = \left\{ x \ \big| \ x\geq 0, Ax=b\right\}$, and by `feeling our way' trace a path from that point to a solution of the linear program.

In particular, we perturb the objective function in a special way to force it to have a single minimum in the interior of $P$. Such a minimum can be found essentially by differentiation and some second-order approximation to solve the resulting system. The key point is that passing to a function minimized on the interior helps us avoid the major difficulty of dealing with minima on the boundary of the feasible set, and makes some powerful tools applicable.

Once we find an approximate minimum for the perturbed objective, we scale down the perturbation term, so that the new minimum doesn't move too far away from the previous one, and recompute an approximation of the new minimum with a good starting point (the old minimum). As we decrease the perturbation term more and more, we converge to a solution of the original linear program.

There is a lot to be understood about this iterative process of changing the objective function, and many points of view on it. Additionally, there are qualitatively different ways one can do the iterations. In this post, to keep things short, we focus only on achieving a basic understanding of one instance of an interior point method.

It comes from the family of primal-dual interior point methods, which emphasize the interplay between a feasible solution of the primal and a feasible solution of the dual. The duality gap between the two feasible solutions serves as a guide for the progress the algorithm is making towards the real solution of the linear program.

From now on, suppose that $A\in\mathbb{R}^{m\times n}$ has full row rank $m$, with rows $r_1,\ldots,r_m$ and columns $c_1,\ldots,c_n$ (not to be confused with the cost vector $c$).

2. The barrier problem and its basic properties

Observation 1.

Since $A$ has full row rank, its columns span $\mathbb{R}^m$, and the cone $C = \left\{ Ax \ \big| \ x\geq 0\right\}$ they generate satisfies \begin{align*} \mathop{\rm int}\nolimits(C) = \left\{ x_1c_1 + \ldots + x_nc_n \ \big| \ x_i>0\right\} = \left\{ Ax \ \big| \ x>0\right\} \end{align*} In particular, the barrier problem defined below has a feasible point exactly when $b\in\mathop{\rm int}\nolimits(C)$. To see that the set $\left\{ Ax \ \big| \ x>0\right\}$ is open, take any $y=x_1c_1 + \ldots + x_nc_n$ with all $x_i>0$; since the $c_i$ span $\mathbb{R}^m$, every vector in a small enough ball around $y$ can be written as $y+A\beta$ with $\beta$ small, i.e. as a linear combination of $c_1,\ldots,c_n$ with positive coefficients. Hence $\left\{ Ax \ \big| \ x>0\right\}\subseteq \mathop{\rm int}\nolimits(C)$.

Conversely, suppose $y\in\mathop{\rm int}\nolimits(C)$, so that some ball $B(y,r)\subseteq C$. Let $y'=c_1+\ldots+c_n=Ae$, where $e=(1,\ldots,1)^T$. For small enough $\varepsilon>0$ we have $y-\varepsilon y'\in B(y,r)\subseteq C$, so $y-\varepsilon y' = A\beta$ for some $\beta\geq 0$, and therefore $y = A(\beta+\varepsilon e)$ is a combination of the $c_i$ with strictly positive coefficients.

We will try to produce points that converge to a solution of the LP while staying strictly inside the polytope $P$. To enforce $x>0$, we perturb the objective by defining a class of new optimization problems, denoted $\textbf{BP}(\mu)$: \begin{align*} \min \left<c,x\right> - \mu \sum_{j}^{} \ln(x_j) \quad \text{ subject to }x>0 \text{ and }Ax=b \end{align*} where $\mu>0$. The function we added is called a logarithmic barrier function. It enforces $x>0$ by making the objective function go to $+\infty$ as any coordinate approaches $0$, but it's also chosen to have many other useful properties.
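To see the barrier in action, here is a tiny numerical sketch (using NumPy; the vectors and the value of $\mu$ are arbitrary choices for illustration): as a coordinate shrinks toward $0$, the perturbed objective blows up.

```python
import numpy as np

mu = 1.0
c = np.array([1.0, 2.0])  # an arbitrary cost vector for illustration

def barrier_objective(x):
    # <c, x> minus the logarithmic barrier term
    return c @ x - mu * np.sum(np.log(x))

# As the first coordinate approaches 0, the objective tends to +infinity,
# which is what keeps iterates strictly inside the region x > 0.
values = [barrier_objective(np.array([t, 1.0])) for t in (1e-1, 1e-4, 1e-8)]
assert values[0] < values[1] < values[2]
```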

One of them is that it enforces a unique minimum by strict convexity:

Observation 2.

For any $\mu>0$, the function $f(x) = \left<c,x\right> - \mu \sum_{j}^{} \ln(x_j)$ is strictly convex on its domain, the relative interior $\left\{ x \ \big| \ x>0,\ Ax=b\right\}$ of $P$. Indeed, when verifying the inequality defining strict convexity, the linear term $\left<c,x\right>$ cancels, and we're done by strict convexity of $-\ln x$.
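Equivalently, the Hessian of $f$ is $\mu X^{-2}$ with $X=\mathop{\rm diag}\nolimits(x_1,\ldots,x_n)$ (the linear term contributes nothing), which is positive definite for any $x>0$. A quick numerical sketch, with arbitrary values:

```python
import numpy as np

# The Hessian of f(x) = <c, x> - mu * sum(log(x_j)) is mu * diag(1/x_j^2):
# the linear term <c, x> has zero second derivatives.
mu = 0.5
x = np.array([0.3, 1.0, 2.5])  # any point with x > 0
hessian = mu * np.diag(1.0 / x**2)
eigenvalues = np.linalg.eigvalsh(hessian)
# All eigenvalues are positive, so f is strictly convex at x.
assert np.all(eigenvalues > 0)
```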

This ensures there can be at most one minimum. How do we go about finding this minimum when it exists? We can differentiate the objective function in the appropriate manner, which reduces our task to solving a system of equations. As it turns out, this system boils down to finding a particular feasible solution of the dual program with an additional constraint, which takes a particularly convenient form because we're using the logarithmic barrier function.

Proposition 1 (Karush-Kuhn-Tucker optimality conditions) .

For any $\mu>0$, $x>0$ is a minimizer for $\textbf{BP}(\mu)$ iff there exist $y,s>0$ such that \begin{align*} Ax=b,& \quad x>0 \\ A^Ty + s = c & \\ x_js_j = \mu,&\quad \forall j \end{align*}

Proof 1.

Suppose $x$ is a minimizer. Let $f(x) = \left<c,x\right> - \mu \sum_{j}^{} \ln(x_j)$. Observe that if we pick some $x_0$ with $Ax_0=b$, then translating $\left\{ x \ \big| \ Ax=b\right\}$ by $-x_0$ gives us the vector subspace $\left\{ x \ \big| \ Ax=0\right\} =\ker A$. Thus, if we define $g:\ker A \to \mathbb{R}$ by \begin{align*} g(v) = c^T(x_0+v) -\mu \sum_{j}^{} \ln(x_{0j}+v_j) \end{align*} we have \begin{align*} f(x) = g(x-x_0) \end{align*} Since $\ker A$ is a subspace of $\mathbb{R}^n$, the gradient of $g$ is the projection onto $\ker A$ of the gradient of the extension of $g$ to $\mathbb{R}^n$ (this can be seen from the coordinate-free formulation of the gradient). Thus the gradient of $g$ is zero iff the gradient of its extension is orthogonal to $\ker A$, which gives \begin{align*} \nabla g(v) = 0 \iff c-\mu \left(\frac{1}{x_{01}+v_1}, \ldots, \frac{1}{x_{0n}+v_n}\right)^T\perp \ker A = \left<r_1,\ldots,r_m\right>^\perp \end{align*} since the kernel of $A$ is precisely the orthogonal complement of the row span of $A$. Hence, $x$ is a critical point of $f$ iff this vector lies in the row span of $A$, i.e. iff there is a $y$ such that \begin{align*} c - \mu X^{-1}e = A^Ty, \text{ i.e. } A^Ty + \mu X^{-1}e = c \end{align*} where $X=\mathop{\rm diag}\nolimits(x_1,\ldots,x_n)$ and $e=(1,\ldots,1)^T$. Then we can set $s = \mu X^{-1}e>0$, and we see that indeed $x_js_j=\mu$ as wanted. Conversely, if such $y$ and $s$ exist, the same computation shows that $x$ is a critical point of $f$, and by strict convexity (Observation 2) a critical point is the unique minimizer.
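The characterization can be sanity-checked numerically: we can build an instance where the KKT conditions hold by construction, and then verify that moving within $\ker A$ (which preserves feasibility) only increases the objective. A sketch with random data (the construction and names here are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, mu = 2, 4, 0.7

# Build an instance where x lies on the central path by construction:
# pick x > 0 and y freely, set s = mu / x (so x_j * s_j = mu), then
# define c = A^T y + s and b = A x so that all three KKT conditions hold.
A = rng.standard_normal((m, n))
x = rng.uniform(0.5, 2.0, n)
y = rng.standard_normal(m)
s = mu / x
c = A.T @ y + s
b = A @ x

def f(z):
    # the barrier objective <c, z> - mu * sum(log(z_j))
    return c @ z - mu * np.sum(np.log(z))

# Perturbing x within ker A keeps Az = b, and the objective must increase
# since x is the unique minimizer of BP(mu) for this instance.
null_basis = np.linalg.svd(A)[2][m:]  # rows spanning ker A
for v in null_basis:
    assert f(x + 1e-3 * v) > f(x)
    assert f(x - 1e-3 * v) > f(x)
```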

We should note that there is a more straightforward way to treat the gradient of the constrained function, using Lagrange multipliers to pass to an unconstrained problem (there should be a post about this at some point; maybe my future self has written it by the time you're reading this). However, it's nice seeing the proof from first principles too.

Finally, a minimum will always exist as long as the original primal (with $x>0$) is feasible and finite. The intuition for the following proposition is that by strong duality, if the original primal is feasible and bounded, then the original dual is also feasible, and a dual feasible solution gives us a convenient certificate for lower bounding $\left<c,x\right>$. The difference here is that we require strict inequalities on the primal variable and the slack variable.

Proposition 2.

Suppose both the original primal and the original dual are strictly feasible, i.e. there exists a $\widehat{x}>0$ with $A \widehat{x}=b$, and there exist a $\widehat{s}>0$ and a $\widehat{y}$ such that $A^T \widehat{y}+ \widehat{s}=c$. Then $\textbf{BP}(\mu)$ is finite and has a unique solution.

Proof 2.

In this case we can write the objective at any feasible $x$ as \begin{align*} c^Tx - \mu \sum_{j}^{} \ln(x_j) = \left<\widehat{y},b\right> + \sum_{j}^{} \left(\widehat{s}_jx_j - \mu \ln x_j\right) \end{align*} since $c^Tx = (A^T \widehat{y}+ \widehat{s})^Tx = \left<\widehat{y},Ax\right> + \widehat{s}^Tx = \left<\widehat{y},b\right> + \sum_j \widehat{s}_jx_j$. Now consider the function $f_s(t) = st-\mu \ln t$ for some $s,\mu>0$ on $\mathbb{R}_{+}$. At this point, we make use of some nice properties of the logarithm: $f_s$ is continuous on its domain, $\lim_{t\to 0^+} f_s(t) = \lim_{t\to\infty}f_s(t) = \infty$, and moreover $f_s'(t) = s - \mu/t$, so $f_s$ has only one critical point, $t=\mu/s$. It follows that $f_s$ is bounded below and attains a unique minimum. Thus the objective \begin{align*} c^Tx - \mu \sum_{j}^{} \ln(x_j) = \left<\widehat{y},b\right> + \sum_{j}^{} f_{\widehat{s}_j}(x_j) \end{align*} is bounded below on the feasible region, and since each $f_{\widehat{s}_j}(x_j)\to\infty$ as $x_j\to 0^+$ or $x_j\to\infty$, its sublevel sets within the feasible region are compact. Hence a minimum is attained, and by strict convexity (Observation 2) it is unique.
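The claim about $f_s$ is easy to check numerically; a small sketch with arbitrary $s$ and $\mu$:

```python
import numpy as np

s, mu = 1.5, 0.4  # arbitrary positive parameters

def f(t):
    # f_s(t) = s*t - mu*log(t), defined for t > 0
    return s * t - mu * np.log(t)

# f_s'(t) = s - mu/t vanishes only at t* = mu/s, the unique minimizer.
t_star = mu / s
for t in (0.01, 0.1, t_star / 2, 2 * t_star, 10.0):
    assert f(t) > f(t_star)
```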

The problem with the Karush-Kuhn-Tucker conditions is that they are not linear, because of the $x_js_j=\mu$ terms. We solve this by applying Newton's method: starting with some $x,y,s$ which satisfy the first two KKT conditions but not necessarily the third, we will move to $x+\Delta x, y+\Delta y, s+\Delta s$, which will hopefully be closer to satisfying all three.

3. Solving the Newton step for the KKT system

As described, we assume we are given $x,y,s$ such that \begin{align*} Ax=b,&\quad x>0 \\ A^Ty+s=c,&\quad s>0 \end{align*} and we want to solve the system \begin{align*} A(x+\Delta x) = b,& \quad x+\Delta x>0 \\ A^T(y+\Delta y) + (s+\Delta s) = c,&\quad s+\Delta s>0 \\ (x_j+\Delta x_j)(s_j+\Delta s_j) = \mu,&\quad \forall j \end{align*} by approximating it to first-order terms (dropping the quadratic terms $\Delta x_j\Delta s_j$), which gives \begin{align*} A\Delta x &= 0 \\ A^T\Delta y + \Delta s &=0 \\ x_js_j+\Delta x_js_j + x_j\Delta s_j &= \mu \end{align*} For convenience, we define matrices $X=\mathop{\rm diag}\nolimits(x_1,\ldots,x_n)$ and $S=\mathop{\rm diag}\nolimits(s_1,\ldots,s_n)$ and the vector $e=(1,\ldots,1)\in \mathbb{R}^n$. In this notation the third equation reads $S\Delta x + X\Delta s = \mu e - XSe$, and substituting $\Delta s = -A^T\Delta y$ from the second equation, we get \begin{align*} \Delta x &=S^{-1}(\mu e - X\Delta s) - x = S^{-1}(\mu e + XA^T\Delta y) - x \end{align*} Applying $A$ to both sides and using $A\Delta x = 0$ and $Ax=b$, we get \begin{align*} AS^{-1}XA^T\Delta y = b - \mu AS^{-1}e \end{align*} The key step is that since $A$ has full row rank, the $m\times m$ matrix $AS^{-1}XA^T$ is invertible, which follows from this lemma:

Lemma 3.

Suppose $A\in\mathbb{R}^{m\times n}$; then the rank of $A$ is the same as the rank of $AA^T$ and $A^TA$.

Proof 3.

Since the rank over $\mathbb{R}$ is invariant under taking the transpose, it suffices to show that $\mathop{\rm rank}\nolimits(A)=\mathop{\rm rank}\nolimits(A^TA)$. Both $A$ and $A^TA$ have $n$ columns, so by rank-nullity it suffices to show that $\ker A = \ker A^TA$.

We get $\ker A\subseteq\ker A^TA$ since $Ax=0\implies A^TAx=0$, and $\ker A^TA\subseteq\ker A$ since $A^TAx=0\implies x^TA^TAx=0 \implies \left<Ax,Ax\right>=0 \implies Ax=0$.

To apply this to $AS^{-1}XA^T$, notice that since $S^{-1}X$ is a diagonal matrix with positive diagonal entries, we can write it as $DD^T=\sqrt{S^{-1}X}\sqrt{S^{-1}X}^T$ where $D$ is again diagonal with positive diagonal entries. Then $AS^{-1}XA^T = ADD^TA^T = (AD)(AD)^T$, and $AD$ has the same rank as $A$ because its columns are just the columns of $A$ but rescaled nontrivially (i.e. no scaling by zero).
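Both the lemma and its application can be checked numerically; a small sketch with a random full-row-rank $A$ and arbitrary positive $x,s$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))  # random A has full row rank almost surely

# The lemma: rank(A) = rank(A A^T) = rank(A^T A).
assert np.linalg.matrix_rank(A) == 3
assert np.linalg.matrix_rank(A @ A.T) == 3
assert np.linalg.matrix_rank(A.T @ A) == 3

# The application: for diagonal S, X with positive entries,
# A S^{-1} X A^T = (A D)(A D)^T with D = sqrt(S^{-1} X), hence invertible.
x = rng.uniform(0.5, 2.0, 5)
s = rng.uniform(0.5, 2.0, 5)
M = A @ np.diag(x / s) @ A.T
assert abs(np.linalg.det(M)) > 1e-12
```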

So, we can finish solving the system: \begin{align*} \Delta y &= (AS^{-1}XA^T)^{-1}(b-\mu AS^{-1}e) \\ \Delta s &= -A^T\Delta y \\ \Delta x &= S^{-1}(\mu e + XA^T\Delta y) - x \end{align*}
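These formulas translate directly into a short routine. The following sketch (the function name and the random test instance are choices made here, not from the notes) verifies that the computed step satisfies the three linearized equations:

```python
import numpy as np

def newton_step(A, b, c, x, y, s, mu):
    """One Newton step for the KKT system, following the closed-form
    solution above. Assumes A has full row rank and x, s > 0."""
    M = A @ np.diag(x / s) @ A.T                   # A S^{-1} X A^T
    dy = np.linalg.solve(M, b - mu * A @ (1.0 / s))
    ds = -A.T @ dy
    dx = (mu * np.ones_like(x) + x * (A.T @ dy)) / s - x
    return dx, dy, ds

# Consistency check on a random instance satisfying the first two
# KKT conditions (a sketch, not a full interior point algorithm).
rng = np.random.default_rng(2)
m, n, mu = 2, 4, 0.3
A = rng.standard_normal((m, n))
x = rng.uniform(0.5, 2.0, n)
s = rng.uniform(0.5, 2.0, n)
y = rng.standard_normal(m)
b = A @ x
c = A.T @ y + s  # makes (x, y, s) feasible for primal and dual

dx, dy, ds = newton_step(A, b, c, x, y, s, mu)
assert np.allclose(A @ dx, 0)                      # A (dx) = 0
assert np.allclose(A.T @ dy + ds, 0)               # A^T (dy) + ds = 0
assert np.allclose(x * s + dx * s + x * ds, mu)    # linearized x_j s_j = mu
```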

4. What's next

In the next post, we will see how the Newton step affects the current estimates for $x$ and $s$, and will derive initial conditions under which feasibility is maintained. Then, we will combine the Newton steps with the strategy of decreasing the perturbation parameter $\mu$ in order to converge to a solution of the linear program!