Solving a system of equations with sparse data

I am attempting to solve a set of equations which has 40 independent variables (x1, ..., x40) and one dependent variable (y). The total number of equations (number of rows) is ~300, and I want to solve for the set of 40 coefficients that minimizes the total sum-of-square error between y and the predicted value.

My problem is that the matrix is very sparse and I do not know the best way to solve the system of equations with sparse data. An example of the dataset is shown below:

   y     x1  x2  x3  x4  x5  x6  ... x40
 87169   14   0   1   0   0   2  ...   0
 46449    0   0   4   0   1   4  ...  12
846449    0   0   0   0   0   3  ...   0
...

I am currently using a Genetic Algorithm to solve this and the results are coming out with roughly a factor of two difference between observed and expected.
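For reference, what my GA is effectively minimizing is the ordinary sum of squared residuals. A minimal Python sketch (the random matrix below is just a stand-in for my real data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the real data: 300 rows, 40 mostly-zero predictors.
X = rng.integers(0, 15, size=(300, 40)) * (rng.random((300, 40)) < 0.1)
y = rng.random(300) * 1e5

def sse(coeffs, X, y):
    """Total sum-of-square error between y and the linear prediction."""
    residuals = y - X @ coeffs
    return float(residuals @ residuals)

# Any candidate coefficient vector (e.g. a GA individual) can be scored:
candidate = np.zeros(40)
print(sse(candidate, X, y))  # with all-zero coefficients this is just sum(y**2)
```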

Can anyone suggest different methods or techniques that are capable of solving a set of equations with sparse data?

Topic: genetic algorithms, regression, machine-learning

Category: Data Science


If I understand you correctly, this is a case of multiple linear regression with sparse data (sparse regression). Assuming so, I hope you will find the following resources useful.

1) NCSU lecture slides on sparse regression, with an overview of algorithms, notes, formulas, graphics, and references to the literature: http://www.stat.ncsu.edu/people/zhou/courses/st810/notes/lect23sparse.pdf

2) The R ecosystem offers many packages useful for sparse regression analysis, among them SparseM and glmnet, both of which appear in the posts below.

3) A blog post with an example of sparse regression solution, based on SparseM: http://aleph-nought.blogspot.com/2012/03/multiple-linear-regression-with-sparse.html

4) A blog post on using sparse matrices in R, which includes a primer on using glmnet: http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r

5) More examples and some discussion on the topic can be found on StackOverflow: https://stackoverflow.com/questions/3169371/large-scale-regression-in-r-with-a-sparse-feature-matrix
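As a quick illustration of the data structure these packages operate on, a sparse design matrix stores only (row, column, value) triplets for the nonzero entries. Here is a sketch in Python with SciPy (the sparse formats used by SparseM and glmnet are analogous; the numbers simply mirror the sample table in the question):

```python
import numpy as np
from scipy import sparse

# Nonzero entries of the question's 3-row sample, as (row, col, value) triplets.
rows = np.array([0, 0, 0, 1, 1, 1, 1, 2])
cols = np.array([0, 2, 5, 2, 4, 5, 39, 5])
vals = np.array([14.0, 1.0, 2.0, 4.0, 1.0, 4.0, 12.0, 3.0])

X = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 40))

print(X.nnz)     # 8 stored nonzeros instead of 120 cells
print(X[1, 39])  # 12.0, i.e. x40 in the second sample row
```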

UPDATE (based on your comment):

If you're trying to solve a linear programming (LP) problem with constraints, you may find this theoretical paper useful: http://web.stanford.edu/group/SOL/papers/gmsw84.pdf.
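For a concrete toy of that setup, a small constrained LP can be sketched with SciPy's linprog (the numbers below are invented):

```python
import numpy as np
from scipy.optimize import linprog

# Invented toy LP: minimize x1 + 2*x2 subject to x1 + x2 >= 1 and x >= 0.
c = np.array([1.0, 2.0])
A_ub = np.array([[-1.0, -1.0]])  # x1 + x2 >= 1, rewritten as -x1 - x2 <= -1
b_ub = np.array([-1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.status, res.x)  # status 0 means an optimum was found; here x = (1, 0)
```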

Also, check the R package limSolve: http://cran.r-project.org/web/packages/limSolve. And, in general, check the packages in the CRAN Task View "Optimization and Mathematical Programming": http://cran.r-project.org/web/views/Optimization.html.

Finally, check the book "Using R for Numerical Analysis in Science and Engineering" by Victor A. Bloomfield. It has a section on solving systems of equations represented by sparse matrices (section 5.7, pages 99-104), which includes examples based on some of the above-mentioned packages: http://books.google.com/books?id=9ph_AwAAQBAJ&pg=PA99&lpg=PA99&dq=r+limsolve+sparse+matrix&source=bl&ots=PHDE8nXljQ&sig=sPi4n5Wk0M02ywkubq7R7KD_b04&hl=en&sa=X&ei=FZjiU-ioIcjmsATGkYDAAg&ved=0CDUQ6AEwAw#v=onepage&q=r%20limsolve%20sparse%20matrix&f=false.


Aleksandr's answer is completely correct.

However, the way the question is posed implies that this is a straightforward ordinary least squares regression question: minimizing the sum of squared residuals between a dependent variable and a linear combination of predictors.

Now, while there may be many zeros in your design matrix, your system as such is not overly large: 300 observations on 40 predictors is no more than medium-sized. You can run such a regression in R without any special effort for sparse data; just use the lm() command (short for "linear model"), and see ?lm for its help page.

Note that lm() will by default silently add a constant column of ones to your design matrix (the intercept); include a -1 on the right-hand side of your formula to suppress it. Overall, assuming all your data (and nothing else) is in a data.frame called foo, you can do this:

model <- lm(y ~ . - 1, data = foo)

And then you can look at parameter estimates etc. like this:

summary(model)
residuals(model)
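If you want to sanity-check the result outside R, the same no-intercept least-squares fit can be sketched with plain NumPy (the data below is invented in place of foo):

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented stand-in for foo: 300 observations on 40 predictors.
X = rng.normal(size=(300, 40))
y = X @ rng.normal(size=40) + rng.normal(scale=0.1, size=300)

# No-intercept OLS, the analogue of lm(y ~ . - 1, data = foo).
coeffs, rss, rank, _ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coeffs

print(rank)             # 40 when the design matrix has full column rank
print(residuals.shape)  # one residual per observation: (300,)
```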

If your system is much larger, say on the order of 10,000 observations and hundreds of predictors, looking at specialized sparse solvers as per Aleksandr's answer may start to make sense.
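At that scale, an iterative sparse solver never needs to form a dense matrix; here is a sketch with SciPy's lsqr on synthetic data:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(3)

# Synthetic larger system: 10,000 observations, 200 predictors, ~99% zeros.
X = sparse.random(10_000, 200, density=0.01, format="csr", random_state=3)
b_true = rng.normal(size=200)
y = X @ b_true

# lsqr minimizes ||X b - y||_2 using only sparse matrix-vector products.
b = lsqr(X, y, atol=1e-10, btol=1e-10)[0]

print(np.linalg.norm(X @ b - y))  # should be tiny for this noiseless system
```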

Finally, in your comment to Aleksandr's answer, you mention constraints on your equation. If that is actually your key issue, there are ways to calculate constrained least squares in R. I personally like pcls() in the mgcv package. Perhaps you want to edit your question to include the type of constraints (box constraints, nonnegativity constraints, integrality constraints, linear constraints, ...) you face?
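For example, under nonnegativity (box) constraints, a constrained fit can be sketched in Python with SciPy's lsq_linear (data invented):

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(4)

# Invented system whose true coefficients are all nonnegative.
X = rng.random((300, 40))
b_true = np.abs(rng.normal(size=40))
y = X @ b_true

# Box constraints: every coefficient restricted to [0, inf).
res = lsq_linear(X, y, bounds=(0, np.inf))

print(res.success, bool(np.all(res.x >= 0)))
```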
