scikit-learn OMP mem error

I tried to use the OMP (Orthogonal Matching Pursuit) algorithm available in scikit-learn. My total data size, including both the target signal and the dictionary, is about 1 GB. However, when I ran the code, it exited with a MemoryError. The machine has 16 GB of RAM, so I don't think this should have happened. I added some logging to see where the error occurred and found that the data was loaded completely into NumPy arrays; it was the algorithm itself that caused the error. Can someone help me with this or suggest a more memory-efficient algorithm for feature selection, or is subsampling the data my only option? Are there any good deterministic subsampling techniques?

EDIT: Relevant code piece:

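from sklearn.linear_model import OrthogonalMatchingPursuit

# mydata: 2-D NumPy array loaded earlier; column 0 is the target
n = 8
y = mydata[:, 0]
X = mydata[:, [1, 2, 3, 4, 5, 6, 7, 8]]
print("here")
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5, copy_X=False, normalize=True)
omp.fit(X, y)
coef = omp.coef_
print(coef)
# Indices of the features that OMP selected (non-zero coefficients)
idx_r, = coef.nonzero()
# vars: sequence of feature names defined elsewhere in the script
for i in idx_r:
    print("%s %s\n" % (coef[i], vars[i]))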

The error I get:

  File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 324, in score
    return r2_score(y, self.predict(X), sample_weight=sample_weight)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/metrics.py", line 2332, in r2_score
    numerator = (weight * (y_true - y_pred) ** 2).sum(dtype=np.float64)
MemoryError



One option is to set precompute=True, which precomputes the Gram matrix (X^T X) and the product X^T y, so that the solver works on those much smaller arrays instead of on X itself. It would be something like:

from sklearn.linear_model import OrthogonalMatchingPursuit

omp = OrthogonalMatchingPursuit(precompute=True, n_nonzero_coefs=5, copy_X=False, normalize=True)
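
To make the idea of precomputing the Gram and Xy matrices concrete, here is a minimal sketch that accumulates them over row chunks with NumPy and hands them to scikit-learn's orthogonal_mp_gram function. The helper name accumulate_gram, the chunk size, and the synthetic data are made up for illustration. Unlike the estimator, this sketch does no centering or column normalization, so treat it as an illustration of the approach rather than a drop-in replacement.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp_gram


def accumulate_gram(X, y, chunk_rows=100000):
    """Build X^T X and X^T y one row chunk at a time (hypothetical helper).

    Only one chunk of X is converted to float64 at any moment, and the
    results have shapes (n_features, n_features) and (n_features,).
    """
    n_features = X.shape[1]
    gram = np.zeros((n_features, n_features), dtype=np.float64)
    xy = np.zeros(n_features, dtype=np.float64)
    for start in range(0, X.shape[0], chunk_rows):
        Xc = np.asarray(X[start:start + chunk_rows], dtype=np.float64)
        yc = np.asarray(y[start:start + chunk_rows], dtype=np.float64)
        gram += Xc.T.dot(Xc)
        xy += Xc.T.dot(yc)
    return gram, xy


# Synthetic stand-in for the real data: 8 features, 3 of them informative.
rng = np.random.RandomState(0)
X_demo = rng.randn(1000, 8)
true_coef = np.array([0.0, 2.0, 0.0, 0.0, -3.0, 0.0, 1.5, 0.0])
y_demo = X_demo.dot(true_coef)

gram, xy = accumulate_gram(X_demo, y_demo, chunk_rows=128)
# OMP on the precomputed matrices; X itself is no longer needed here.
coef = orthogonal_mp_gram(gram, xy, n_nonzero_coefs=3)
print(coef)
```

With 8 features the Gram matrix is only 8 x 8, so the memory used by the solver no longer depends on the number of rows.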

Also upgrading to Python 3 might help with memory issues.
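
Note that the traceback in the question ends inside score / r2_score, where full-length temporary arrays such as (y_true - y_pred) ** 2 are created. If scoring is what runs out of memory, one possible workaround is to compute R^2 over row chunks yourself. The sketch below assumes a single target and no sample weights; chunked_r2 and its chunk_rows parameter are hypothetical names, and predict can be any callable, for example omp.predict.

```python
import numpy as np


def chunked_r2(predict, X, y, chunk_rows=100000):
    """Unweighted, single-target R^2 computed chunk by chunk (hypothetical helper).

    predict is called on one block of rows at a time, so the temporaries
    never exceed chunk_rows elements.
    """
    y = np.asarray(y)
    y_mean = y.mean(dtype=np.float64)
    ss_res = 0.0  # residual sum of squares
    ss_tot = 0.0  # total sum of squares around the mean of y
    for start in range(0, y.shape[0], chunk_rows):
        stop = start + chunk_rows
        y_chunk = y[start:stop].astype(np.float64)
        pred_chunk = np.asarray(predict(X[start:stop]), dtype=np.float64)
        ss_res += np.sum((y_chunk - pred_chunk) ** 2)
        ss_tot += np.sum((y_chunk - y_mean) ** 2)
    return 1.0 - ss_res / ss_tot


# Small synthetic check with a fixed linear model standing in for omp.predict.
rng = np.random.RandomState(0)
X_small = rng.randn(50, 4)
w = np.array([1.0, 0.0, -2.0, 0.5])
y_small = X_small.dot(w) + 0.1 * rng.randn(50)

r2 = chunked_r2(lambda A: A.dot(w), X_small, y_small, chunk_rows=7)
print(r2)
```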
