Friday, March 28, 2014

SwissQuant Math Challenge on LASSO Regression: the MOSEK way

The Swiss company SwissQuant publishes a monthly math challenge in which they invite people to work out and discuss problems of various kinds, related in some way to the company's business. This month the focus is on (quadratic) convex optimization: they ask for specialized algorithms for the OLS-LASSO problem that can beat established commercial quadratic programming codes such as MOSEK. You can find the details here: Math Challenge March 2014.

Clearly we are glad they somehow appreciate our product, but we also feel challenged to tackle the problem itself. MOSEK is not specialized for OLS-LASSO, but we thought it could still be very efficient, especially in dual form. Moreover, we are always eager to find good test problems, both for the core solver and for our APIs.

The OLS-LASSO problem

So let's start from the formulation given in the challenge, where we are given $T$ samples in dimension $m$:

\begin{equation}\label{main}\tag{OLS-LASSO}
  \begin{aligned}
    \min_{u,b} & \frac{1}{2} || u ||^2 + \sum_{j=1}^{m} \lambda_j |b_j|\\
    &\\
    s.t.&\\
    & u = y - X b
  \end{aligned}
\end{equation}

where $b,\lambda\in \mathbb{R}^m$ and $u,y\in \mathbb{R}^T$; eliminating $u$ shows this is the usual LASSO objective $\frac{1}{2}\|y-Xb\|^2+\sum_{j=1}^m \lambda_j|b_j|$. We reformulate problem \ref{main} so as to obtain a linear objective function:

\begin{equation*}
  \begin{aligned}
    \min_{s,z,u,w,b} &\quad z + \lambda^T w\\
    &\\
    s.t.&\\
    & u + X b = y\\
    & s=1\\
    & (s,z,u) \in Q_R^{T+2}\\
    & w_i \geq |b_i|& \forall i=1,\ldots,m.
  \end{aligned}
\end{equation*}

where $Q_R^{T+2}$ is the rotated quadratic cone $\{(s,z,u) : 2sz \geq \|u\|^2,\ s,z\geq 0\}$; fixing $s=1$ makes $z$ an upper bound on $\frac{1}{2}\|u\|^2$. There are two ways to get rid of the absolute value:

(1) Linear constraints (this gives model PL)

\[ w_i\geq |b_i| \Leftrightarrow \left\{ \begin{array}{l}  w_i \geq b_i \\ w_i \geq -b_i \\\end{array}\right.\]

(2) Second-order conic constraint (this gives model PQ)

\[ w_i \geq |b_i| \Leftrightarrow w_i\geq \sqrt{b_i^2} \equiv (w_i,b_i)\in Q^2\]
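In MOSEK Fusion (Python) the two encodings take one constraint statement each. A quick sketch, assuming a Fusion Model M with the vector variables w and b already declared (see the full model further below):

    # (1) PL: componentwise linear inequalities w >= b and w >= -b
    M.constraint(Expr.sub(w, b), Domain.greaterThan(0.0))
    M.constraint(Expr.add(w, b), Domain.greaterThan(0.0))

    # (2) PQ: each row of [w b] in a two-dimensional quadratic cone Q^2
    M.constraint(Expr.hstack(w, b), Domain.inQCone())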


We will see which primal form works best. But it is the dual that we really want to see at work. Using the standard procedure outlined in our post about duality, we derive the dual of problem PQ; it is easier than for PL, believe us.

First let's fix a partition of the variables as $[s,z,u,w,b]$ and state the relevant vectors and matrices:


(1) cost vector: $c^T= [0 \quad 1 \quad 0_T^T \quad \lambda^T \quad 0_m^T]$;
(2) right-hand side: $[1 \quad y^T]^T$;
(3) coefficient matrix: $A= \left[ \begin{array}{ccccc} 1 & 0&0_T^T&0_m^T&0_m^T\\ 0_T&0_T&I_T& 0_{T\times m}&X\end{array}\right]$.


We then introduce as many dual variables as primal linear constraints:
$\mu_0$ for the first constraint, and $\mu\in \mathbb{R}^T$ for the
second one. We also include additional slack variables $\gamma$
partitioned in the same way as the primal variables, i.e. $\gamma^T=[\gamma_s,\gamma_z,\gamma_u,\gamma_w,\gamma_b]$.

The dual problem is

\begin{equation*}
  \begin{aligned}
    \max_{\mu_0, \mu, \gamma} & \quad \mu_0 + y^T \mu \\
    &\\
    s.t.&\\
    & c - A^T \left[ \begin{array}{c}\mu_0\\\mu\end{array} \right]= \gamma\\
    & (\gamma_s,\gamma_z,\gamma_u) \in Q_R^{T+2}\\
    & (\gamma_w,\gamma_b) \in \Pi_{i=1}^m Q^2.
  \end{aligned}
\end{equation*}
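Writing out $\gamma = c - A^T[\mu_0 \quad \mu^T]^T$ block by block gives

\begin{equation*}
\gamma^T = \left[ -\mu_0 \quad 1 \quad -\mu^T \quad \lambda^T \quad -\mu^T X \right],
\end{equation*}

so the constraint $(\gamma_w,\gamma_b) \in \Pi_{i=1}^m Q^2$ simply reads $\lambda_i \geq |(X^T\mu)_i|$ for every $i$.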

Eliminating $\gamma$ accordingly, we obtain:

\begin{equation}\tag{D}
  \begin{aligned}
    \max_{\mu_0,\mu,\gamma_z} &  \quad \mu_0 + y^T \mu \\
    &\\
    s.t.&\\
    &  -\lambda \leq X^T \mu \leq  \lambda\\
    & \gamma_z=  1\\
    & \left(-\mu_0, \gamma_z, -\mu \right) \in Q_R^{T+2}.
  \end{aligned}
\end{equation}

Try to do the same with PL and you will end up with the same model.
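For concreteness, here is a minimal sketch of how (D) could be written with the MOSEK Fusion Python API; the function name solve_dual and all variable names are ours, not the benchmark code:

    import numpy as np
    from mosek.fusion import Model, Domain, Expr, Matrix, ObjectiveSense

    # Solve (D); by strong duality its optimal value equals that of PQ/PL.
    def solve_dual(X, y, lam):
        T, m = X.shape
        with Model("D") as M:
            mu0 = M.variable(1, Domain.unbounded())
            mu  = M.variable(T, Domain.unbounded())
            gz  = M.variable(1, Domain.equalsTo(1.0))   # gamma_z = 1

            # -lambda <= X^T mu <= lambda
            M.constraint(Expr.mul(Matrix.dense(X.T.copy()), mu),
                         Domain.inRange(list(-lam), list(lam)))
            # (-mu0, gamma_z, -mu) in the rotated cone Q_R^{T+2}
            M.constraint(Expr.vstack(Expr.neg(mu0), gz, Expr.neg(mu)),
                         Domain.inRotatedQCone())

            M.objective(ObjectiveSense.Maximize,
                        Expr.add(mu0, Expr.dot(list(y), mu)))
            M.solve()
            return M.primalObjValue()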

The results

We implement these three models using the MOSEK Fusion Python API and run a batch of tests for $m\in\{10,25,75,100\}$ and $T$ from $10000$ to $100000$. To keep it simple, the entries of $X$, $\lambda$ and $y$ are all uniformly generated in $[0,1]$. The code that interfaces with the model looks roughly as follows.
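A minimal sketch of the PQ model in Fusion; the function solve_pq and the small driver at the bottom are ours and only illustrate the setup:

    import numpy as np
    from mosek.fusion import Model, Domain, Expr, Matrix, ObjectiveSense

    # Build and solve the conic primal PQ; returns the optimal b.
    def solve_pq(X, y, lam):
        T, m = X.shape
        with Model("PQ") as M:
            s = M.variable(1, Domain.equalsTo(1.0))
            z = M.variable(1, Domain.unbounded())
            u = M.variable(T, Domain.unbounded())
            w = M.variable(m, Domain.unbounded())
            b = M.variable(m, Domain.unbounded())

            # u + X b = y
            M.constraint(Expr.add(u, Expr.mul(Matrix.dense(X), b)),
                         Domain.equalsTo(list(y)))
            # (s, z, u) in Q_R^{T+2}: 2 s z >= ||u||^2 with s fixed to 1
            M.constraint(Expr.vstack(s, z, u), Domain.inRotatedQCone())
            # w_i >= |b_i|; for PL replace this line with w >= b, w >= -b
            M.constraint(Expr.hstack(w, b), Domain.inQCone())

            M.objective(ObjectiveSense.Minimize,
                        Expr.add(z, Expr.dot(list(lam), w)))
            M.solve()
            return np.array(b.level())

    # Random data as in the tests: all entries uniform in [0, 1].
    T, m = 10000, 25
    X, y, lam = np.random.rand(T, m), np.random.rand(T), np.random.rand(m)
    b_opt = solve_pq(X, y, lam)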


Here you can see the running times for problem PL in green, PQ in red and D in blue. In the first plot we fix $m$ and let $T$ vary:


and if we instead fix $T$ to either $10000$ or $100000$ we get



Again, the dual is a clear winner, as in a previous post! No way the primal can even get close to the dual.

But it is also interesting to see that PQ is as good as PL: the red and green lines are almost indistinguishable. Only for larger $m$, i.e. when the number of cones increases, do we see a significant difference; otherwise the difference is very small.




Friday, March 21, 2014

Working with the dual in practice

Recently we have been discussing the use of the dual problem: in our last post we showed that building the dual can be a fairly automated process; we also argued that there are at least two good reasons why we may want or need to work with it:
  1. solving the dual may be more efficient (see this example);
  2. the dual variables relate to reduced costs and shadow prices (see this post).
So either we solve the primal and would like to have the values of the dual variables, or the other way round! In both cases it boils down to the question (remember: the dual of the dual is the primal):

Given the solution of a linear program, how can we recover its dual variables?

We will discuss how to obtain the dual solution for linear programs, but most of our arguments apply to conic programs as well. Remember our primal in standard form:

\begin{equation}\tag{P}
\begin{aligned}
& \min \quad c^T x\\
& s.t.\\
& Ax -b =0\\
& x\geq 0,
\end{aligned}
\end{equation}

and the corresponding dual

\begin{equation}\tag{D}
\begin{aligned}
& \max \quad b^Ty\\
& s.t.\\
& c - A^Ty - s =0\\
& s\geq 0.
\end{aligned}
\end{equation}

For a given optimal primal solution $x^\star$ and dual $y^\star,s^\star$, duality theory tells us:

(1) Strong Duality: the optimal values of (P) and (D) coincide

(2) Complementary Slackness: it must hold that $x_i^\star s_i^\star = 0$ for all $i=1,\ldots,n$. Thus $s_i^\star=0$ for all $i$ such that $x_i^\star>0$.

(3) Optimal Basis: using the partition of the primal variables induced by the optimal basis, we can define a full-rank sub-matrix of the constraint coefficient matrix and use it to compute $y^\star$... but this goes way too far for a simple blog post; you should refer for instance to Section 3.5 in [1].

While points (1) and (2) are straightforward, point (3) involves some work and attention... should we really spend time on that? The answer is NO! The reason is that solvers usually provide functionality to retrieve the dual variables after the primal has been solved:

1. Primal-dual linear solvers (such as MOSEK) solve the primal and the dual at the same time, and thus provide the dual variables directly.

2. Primal or dual solvers based on the simplex method have all the information needed to recover the dual variables easily from the optimal basis (see again [1]).

So you just need to query the solver you use for the dual values. Let's play with MOSEK and a toy linear programming problem taken from Section 6.2 of [2]. In LP format the primal is


and its dual is

Note that the two problems are formulated in a natural way. MOSEK represents linear problems as

\begin{equation}\tag{MP}
\begin{aligned}
& \min c^T x + c_f \\
& s.t. & \\
&& l^c \leq Ax \leq u^c\\
&& l^x \leq x \leq u^x
\end{aligned}
\end{equation}

which is a more general and natural form for the user. After some boring but trivial calculations we obtain the dual problem in the form

\begin{equation}\tag{MD}
\begin{aligned}
& \max (l^c)^T s_l^c - (u^c)^T s_u^c + (l^x)^T s_l^x - (u^x)^T s_u^x\\
& s.t. & \\
&& A^T(s_l^c - s_u^c)  + s_l^x - s_u^x=c\\
&& s_l^x,s_u^x, s^c_l, s_u^c \geq 0
\end{aligned}
\end{equation}
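For the record, one way to do those calculations is to attach a nonnegative multiplier to each bound of (MP) and form the Lagrangian (dropping the constant $c_f$):

\begin{equation*}
L = c^Tx + (s_l^c)^T(l^c - Ax) + (s_u^c)^T(Ax - u^c) + (s_l^x)^T(l^x - x) + (s_u^x)^T(x - u^x).
\end{equation*}

Minimizing over the free $x$ forces the coefficient of $x$ to vanish, which is precisely the equality constraint of (MD), and what remains of $L$ is the (MD) objective.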

As you may see, there is exactly one dual variable for each bound in the primal. Running the solver on the primal we get a log and a solution report; the columns named DUAL contain the corresponding dual variables. Running the solver on the dual we obtain an analogous log and solution report.

But wait a minute... it seems MOSEK just swapped primal and dual!?

True, in some sense! MOSEK implements a primal-dual algorithm that solves both the primal and the dual at the same time. For small problems there might be no difference at all. A few comments on the solver output:
  1. Feasibility and optimality are achieved at the same time: during the execution the intermediate solutions are neither primal nor dual feasible. This is shown in the columns PFEAS and DFEAS as the norm of the constraint violation.
  2. Columns POBJ and DOBJ show the progress towards the optimal value and, as the Strong Duality Theorem tells us, they coincide once optimality is reached.
  3. Dual variables are reported in the LOWER/UPPER DUAL columns.
  4. To each non-active bound corresponds a zero dual variable, by Complementary Slackness.
  5. Dual values are zero for each missing bound: this is because a solver such as MOSEK actually sets a very large bound (in absolute value) in order to overcome numerical issues. A missing bound thus resolves in practice to a non-active bound, as in the previous point.

If you want to dig deeper into the meaning of the dual information, follow Section 6.2 in [2]. For us, it is important that you get the message:

Dual information is readily available from your solver and can be useful!
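For instance, with the MOSEK Optimizer API for Python the dual values can be pulled out with a few calls. A minimal sketch, assuming a problem stored in a file primal.lp (a placeholder name) and a basic solution:

    import mosek

    with mosek.Env() as env:
        with env.Task() as task:
            task.readdata("primal.lp")   # placeholder; any LP/MPS file works
            task.optimize()

            numcon = task.getnumcon()
            numvar = task.getnumvar()

            # Dual variables attached to the constraints (y in (D)).
            y = [0.0] * numcon
            task.getsolutionslice(mosek.soltype.bas, mosek.solitem.y,
                                  0, numcon, y)

            # Duals of the variable bounds (s_l^x and s_u^x in (MD)).
            slx = [0.0] * numvar
            sux = [0.0] * numvar
            task.getsolutionslice(mosek.soltype.bas, mosek.solitem.slx,
                                  0, numvar, slx)
            task.getsolutionslice(mosek.soltype.bas, mosek.solitem.sux,
                                  0, numvar, sux)

In Fusion the same information is available through the dual() method of constraints and variables.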

Most of what we have said is also, and even more, true for second-order conic programs: duality holds, but to play with it you should ask your preferred solver.

[1] Papadimitriou, C. H., & Steiglitz, K. (1998). Combinatorial Optimization: Algorithms and Complexity. Dover Publications.
[2] Williams, H. P. (1999). Model Building in Mathematical Programming. Wiley.

Tuesday, March 11, 2014

How to write the dual problem in a nutshell

Duality in linear programming is often mentioned and sometimes even used (see [2]). But in our experience, many practitioners unfortunately have quite a blurred idea about the dual problem and its meaning. Things are even worse in conic programming, despite the fact that most of the theory remains unchanged. There are at least three good reasons to have a good understanding of conic duality:
  • The dual problem could be faster to solve (see our post), even if the complexity is the same.
  • Sensitivity analysis uses dual information (see our post), just think about shadow prices.
  • Infeasibility and unboundedness certificates are stated in terms of duality, via Farkas' lemma.
Duality theory for conic optimization is rich and well-established and deserves much more space than a blog post (see for instance [1]). As a first step, we will show how to easily derive the dual formulation of a given conic optimization problem. For the sake of clarity, we will not consider semidefinite optimization problems, even though the conic framework readily extends to them.

Let us assume we are given an optimization problem, which we refer to as the primal. If we are lucky, the primal can be reformulated as a conic optimization problem by means of some transformations (see [1],[3]). A hopefully small additional effort is usually needed to cast our conic problem in (primal) standard form. Let the primal variables be reordered and partitioned such that $x=(x_1,x_2,\ldots,x_k)^T\in \mathbb{R}^n$ with $x_i\in \mathbb{R}^{n_i}$. Then the standard primal form is:

\begin{equation}\tag{1}
\begin{aligned}
  &\min \quad c^Tx &&\\
  & s.t. &&\\
  &&Ax - b = 0 &\\
  &&x_i \in \mathit{K_i^{n_i}} &\quad i=1,\ldots,k
\end{aligned}
\end{equation}

Note how each block variable $x_i$ belongs to a cone $\mathit{K_i^{n_i}}$; moreover, we can recover the standard LP formulation just by setting $k=1$, $n_1=n$ and $\mathit{K_1^{n_1}}=\mathbb{R}^n_+$. In short we may write $x\in \Pi_{i=1}^k K_i^{n_i}$. For our purpose, $\mathit{K_i^{n_i}}$ can be

- the positive orthant $\mathbb{R}_+^{n_i}$
- a Lorentz cone $L_i^{n_i}$ of dimension $n_i$
- a rotated Lorentz cone $R_i^{n_i}$ of dimension $n_i$

We introduce a dual variable for each linear constraint in Problem (1), collected in the vector $y$, and obtain the dual problem as follows:

\begin{equation}\tag{2}
\begin{aligned}
  &\max \quad b^Ty \\
  & s.t. \\
  &&  c - A^T y \in\Pi_{i=1}^k \mathit{K}_i^*\\
\end{aligned}
\end{equation}

where $\mathit{K}_i^* = \{y : y^Tx \geq 0 \ \forall x\in \mathit{K}_i\}$ is the dual cone of $\mathit{K}_i$. In practice, the cones we work with are all self-dual, i.e. $\mathit{K}_i^*\equiv \mathit{K}_i$. Introducing slack variables $s$, Problem (2) reads:

\begin{equation}\tag{3}
\begin{aligned}
  &\max \quad b^Ty \\
  & s.t. \\
  && c- A^T y - s= 0\\
   &&s_i \in \mathit{K}^*_i  &\quad i=1,\ldots,k
\end{aligned}
\end{equation}

which can be converted back into standard form by noticing that $\min f = -\max(-f)$ and by transforming free variables into nonnegative ones (see for instance [2]).
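As a quick sanity check, take $k=1$ and $\mathit{K_1^{n}}=\mathbb{R}^n_+$: the positive orthant is self-dual, so Problem (3) becomes

\begin{equation*}
\begin{aligned}
  &\max \quad b^Ty \\
  & s.t. \\
  && c- A^T y - s= 0\\
  &&s \geq 0,
\end{aligned}
\end{equation*}

which is exactly the textbook LP dual.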

Summarizing, we recommend the following steps:
  1. if possible recast the given problem in conic form
  2. transform the conic form in standard form
  3. derive the dual formulation
  4. simplify the dual if possible
In practice, step 1 requires the biggest effort, while the others follow quite directly. Indeed, the crucial point is to certify that the given problem can be expressed in conic form, and we do this by actually constructing such a representation. Once we have the conic formulation, most of the job is done. So working in conic form also has the interesting side-effect of much simpler access to the dual formulation and information.

Check this post to see a practical example! Others will follow in the coming days.


[1] Ben-Tal, A., & Nemirovski, A. (2001). Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. SIAM.
[2] Papadimitriou, C. H., & Steiglitz, K. (1998). Combinatorial Optimization: Algorithms and Complexity. Dover Publications.
[3] MOSEK ApS (2013). The MOSEK Modeling Manual.