I tried to use the OMP (Orthogonal Matching Pursuit) algorithm available in scikit-learn. My total data size, which includes both the target signal and the dictionary, is about 1 GB. However, when I ran the code it exited with a memory error. The machine has 16 GB of RAM, so I don't think this should have happened. I added some logging to find where the error occurred and found that the data loaded completely into NumPy arrays; it was the algorithm itself that caused the error. Can someone help me with this …
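One thing worth checking is whether the solver is materialising the full Gram matrix of the dictionary. Below is a minimal sketch, assuming the dictionary `D` and the targets `Y` are already NumPy arrays, that disables the Gram precomputation and solves the target columns in chunks; the names and chunk size are placeholders, not the original code.

```python
# Minimal sketch: run OMP without precomputing the Gram matrix and solve the
# targets in chunks to keep peak memory low. D and Y are assumed NumPy arrays.
import numpy as np
from sklearn.linear_model import orthogonal_mp

def omp_in_chunks(D, Y, n_nonzero_coefs=10, chunk_size=1000):
    """Solve OMP for the columns of Y in chunks of `chunk_size` targets."""
    coefs = []
    for start in range(0, Y.shape[1], chunk_size):
        chunk = Y[:, start:start + chunk_size]
        # precompute=False avoids building the dense Gram matrix D.T @ D
        coefs.append(orthogonal_mp(D, chunk,
                                   n_nonzero_coefs=n_nonzero_coefs,
                                   precompute=False))
    return np.hstack(coefs)
```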
I need to detect objects in multiple video streams in real time (or close to it, say 10 FPS). How many GPUs do I need to run object detection with YOLOv3 or MobileNet on, say, 10 video streams? Is it possible to use a CPU or something else? I don't need an exact number; I just need to understand the scalability perspective and the cost per stream.
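The usual back-of-envelope estimate is: measure how many detections per second one GPU sustains for your model and resolution, then divide the total required frame rate by that figure. A tiny sketch, where the per-GPU throughput is an assumed number you would replace with your own benchmark:

```python
# Back-of-envelope sketch, not a benchmark: the per-GPU throughput below is
# an assumption to be replaced with a measured value for your hardware/model.
import math

def gpus_needed(n_streams, fps_per_stream, gpu_fps):
    """Number of GPUs needed if one GPU sustains `gpu_fps` detections/sec."""
    required_fps = n_streams * fps_per_stream
    return math.ceil(required_fps / gpu_fps)

# e.g. assume a single mid-range GPU runs YOLOv3 at ~40 FPS (batch size 1)
print(gpus_needed(n_streams=10, fps_per_stream=10, gpu_fps=40))  # -> 3
```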
In order to get an R/Shiny forecasting app ready for production, I am concerned about speeding up the model tuning process. I am already using parallel processing on Windows 10. There are other possible improvements to go for, such as Kaggle kernels or Amazon AWS, but I would prefer to try some open-source improvements first. A C++ developer recently told me that it is worth running resource-intensive tasks on Linux/Ubuntu machines. I was thinking about virtual …
I am building something very similar to this BigQuery ML example project. My system is different in two ways. Firstly, it will need several thousand time series, so I would prefer to use the multiple-series feature rather than maintaining thousands of individual models. Secondly, the data is more unpredictable in the long run (rather than periodic or seasonal), so it needs retraining quite often, with only local trends being detected. The data is actually monitoring voltages in battery-operated devices, which usually …
I'm a beginner in ML and I have a large dataset with 15 features and 6M rows, so it becomes challenging to work on it locally. I can train one model locally, but when I perform hyperparameter tuning and cross-validation on my MacBook Pro it runs out of memory and lacks the processing speed and capacity. I tried Spark, but that gives poor results, so I would prefer the native Python ecosystem of pandas and sklearn. So I want to …
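If staying within pandas/sklearn is the priority, a common compromise is to downcast to float32 and run the hyperparameter search on a stratified subsample, then refit the winning configuration on the full data. A sketch under those assumptions (the file name, estimator and parameter grid are placeholders):

```python
# Sketch assuming tuning on a stratified subsample is acceptable; the file
# name, estimator and parameter grid are placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

df = pd.read_csv("data.csv")                          # hypothetical file
X = df.drop(columns=["target"]).astype(np.float32)    # downcast to halve memory
y = df["target"]

# Tune on ~10% of the rows, then refit the best parameters on the full data.
X_sub, _, y_sub, _ = train_test_split(X, y, train_size=0.1, stratify=y,
                                      random_state=0)
search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1),
    param_distributions={"n_estimators": [100, 300], "max_depth": [8, 16, None]},
    n_iter=5, cv=3, random_state=0)
search.fit(X_sub, y_sub)
best_model = search.best_estimator_.fit(X, y)
```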
I know that Spark is fully integrated with Scala. Its use case is specifically large data sets. Which other tools have good Scala support? Is Scala best suited for larger data sets, or is it also suited for smaller data sets?
I've got some big datasets of images (a few million each), and I would like to cluster them according to their visual similarity. I've extracted a feature vector for each image; in practice, the space of feature representations is what I would like to cluster. However, I've had some difficulties because of the following constraints: there's no way I can know the number of clusters in advance; the clustering algorithm must be stable (it must not be order-dependent); I …
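One family of methods that does not require choosing the number of clusters up front is density-based clustering. A hedged sketch, assuming the feature vectors fit in memory after dimensionality reduction and that scikit-learn ≥ 1.3 is available for its HDBSCAN implementation; the path and parameters are placeholders:

```python
# Sketch of one density-based option that does not need k in advance.
# Assumes feature vectors are stored as an (N x D) float array and that
# sklearn.cluster.HDBSCAN (scikit-learn >= 1.3) is available.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import HDBSCAN

features = np.load("features.npy")                       # hypothetical path
reduced = PCA(n_components=50).fit_transform(features)   # cheaper distances
labels = HDBSCAN(min_cluster_size=100).fit_predict(reduced)
# label -1 marks points HDBSCAN leaves unclustered (noise)
```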
I'm working on a problem where I need to cluster user types on a scale in an unsupervised manner. I've been looking at basics like KNN and k-means, but I found them hard to scale, as these methods are quite resource-intensive. What are some highly scalable clustering methods that have a low ("big O") complexity?
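Two scikit-learn options that scale roughly linearly in the number of samples are MiniBatchKMeans and Birch; a small sketch with placeholder data:

```python
# Sketch of two scikit-learn clusterers that scale roughly linearly in the
# number of samples; X here is random placeholder data standing in for the
# real user-feature matrix.
import numpy as np
from sklearn.cluster import MiniBatchKMeans, Birch

X = np.random.rand(100_000, 20).astype(np.float32)  # placeholder data

# MiniBatchKMeans: processes small random batches, one O(n) pass over the data.
mbk_labels = MiniBatchKMeans(n_clusters=8, batch_size=10_000,
                             random_state=0).fit_predict(X)

# Birch: builds a CF-tree in a single pass, also suitable for streaming data.
birch_labels = Birch(n_clusters=8).fit_predict(X)
```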
Vocabulary. Face detection: finding all faces in an image. Face representation: the simplest way to represent a face is as an image (pixels / colour values); this is not very space-efficient and likely makes follow-up tasks hard. Face embeddings are another representation: in this case a face is a point on the unit sphere in $\mathbb{R}^{128}$, IIRC. Face verification: given two face representations, deciding whether they are the same. Question: I was just wondering how identifying a person with …
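For concreteness, face verification on such embeddings usually reduces to a distance threshold; a minimal sketch, where the threshold value is an assumption that would need calibrating on labelled pairs:

```python
# Hedged sketch of face verification on unit-sphere embeddings: two faces are
# declared "the same" when their embeddings are close enough. The threshold
# value is an assumption and must be calibrated on labelled pairs.
import numpy as np

def verify(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.6) -> bool:
    """emb_a, emb_b: L2-normalised 128-d embeddings of two face crops."""
    distance = np.linalg.norm(emb_a - emb_b)   # Euclidean distance on the sphere
    return distance < threshold
```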
Background: I would like to build a face recognition model for registration and login for some kind of service, for example using this approach (CNN + SVM). When a new user registers for the service, images of his/her face are recorded and the machine learning model is trained on these images. Then, when a person requests the service, the model recognises whether this person is a member or not. Question: But when a new user comes …
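As a reference point, the CNN + SVM approach described above usually amounts to training a classifier on CNN embeddings of the registered faces, which is also why adding a user forces a refit. A hedged sketch, where `embed()` is a hypothetical wrapper around whatever CNN is used:

```python
# Sketch of the CNN + SVM idea: a pretrained CNN turns each face image into a
# fixed-length embedding and an SVM is (re)trained on the embeddings of all
# registered users. `embed()` is a hypothetical function, not a real API.
import numpy as np
from sklearn.svm import SVC

def train_membership_model(face_images, user_ids, embed):
    """face_images: list of face crops; user_ids: matching user labels."""
    X = np.stack([embed(img) for img in face_images])   # CNN embeddings
    clf = SVC(kernel="linear", probability=True)
    clf.fit(X, user_ids)
    return clf   # note: adding a new user currently means refitting the SVM
```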
Although it might sound like a pure techie question, I would like to know which approaches you usually try out for very data-science-like processes when you need to speed things up (given that data retrieval is not a problem and that the data also fits in memory, etc.). Some of those could be the following, but I would like to receive feedback about any others: good practices such as always using NumPy when possible for numeric operations, and …
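As a tiny illustration of the NumPy practice mentioned above, replacing a Python-level loop with a vectorised call is often the single biggest win:

```python
# Tiny example of the "use NumPy for numeric operations" practice: the
# vectorised version avoids the Python-level loop entirely.
import numpy as np

x = np.random.rand(1_000_000)

# Pure-Python loop over a generator
total_loop = sum(v * v for v in x)

# Vectorised NumPy equivalent, typically orders of magnitude faster
total_np = np.dot(x, x)
```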
I have to predict next-minute traffic for multiple cities (100+). I am thinking of using an LSTM. My main concern is how to scale with the number of cities: how does an LSTM learn the different traffic volumes and other related features of all the cities in order to predict the next state? What should the network architecture be for such cases? I was thinking of the following process: normalise the data with a city-specific min/max scaler, feed sliding-window data (t_1 to t_60) …
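A minimal sketch of the pipeline described above (per-city min/max scaling, 60-step sliding windows, one shared LSTM); the shapes and layer sizes are assumptions, not recommendations:

```python
# Sketch with assumed shapes: per-city MinMax scaling, 60-step sliding
# windows, and one shared LSTM. Layer sizes are placeholders.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf

WINDOW = 60

def make_windows(series):
    """series: 1-D array of per-minute traffic -> (samples, 60, 1) X and targets y."""
    X, y = [], []
    for i in range(len(series) - WINDOW):
        X.append(series[i:i + WINDOW])
        y.append(series[i + WINDOW])
    return np.array(X)[..., None], np.array(y)

def prepare_city(series):
    scaler = MinMaxScaler()                                  # city-specific scaler
    scaled = scaler.fit_transform(series.reshape(-1, 1)).ravel()
    return make_windows(scaled), scaler

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, 1)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```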
Lots of people use the term big data in a rather commercial way, as a means of indicating that large datasets are involved in the computation, and therefore that potential solutions must have good performance. Of course, big data always carries associated terms, like scalability and efficiency, but what exactly defines a problem as a big data problem? Does the computation have to be related to some set of specific purposes, like data mining/information retrieval, or could an algorithm for general …
According to Wikipedia, "the distance matrix of size $\frac{(n^2-n)}{2}$ can be materialized to avoid distance recomputations, but this needs $O(n^2)$ memory, whereas a non-matrix based implementation of DBSCAN only needs $O(n)$ memory." $\frac{(n^2-n)}{2}$ is simply the number of entries in the triangular part of the distance matrix. However, it says that a non-matrix-based implementation only requires $O(n)$ memory. How does that work? Regardless of what data structure you use, don't you always need the $\frac{(n^2-n)}{2}$ distance values? That would still be $O(n^2)$ space complexity, no? Is there …
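The usual answer is that the distances are recomputed on demand from a spatial index rather than stored: only one point's eps-neighbourhood lives in memory at a time. A rough sketch of that idea (placeholder data, not a full DBSCAN implementation):

```python
# Sketch of the non-matrix idea: distances are recomputed on demand from a
# spatial index, so at any moment only one point's eps-neighbourhood is held
# in memory, never the full (n^2 - n)/2 pairwise matrix.
import numpy as np
from sklearn.neighbors import KDTree

X = np.random.rand(10_000, 3)      # placeholder data
eps = 0.05
tree = KDTree(X)                   # O(n)-memory spatial index

for i in range(len(X)):
    neighbours = tree.query_radius(X[i:i + 1], r=eps)[0]  # indices within eps
    # ... DBSCAN's core-point test / cluster expansion would use `neighbours`
    # here, then the array is discarded before moving on to the next point.
```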
Applying density-based clustering (DBSCAN) on $50k$ data points with about $2k$-$4k$ features, I achieve the desired results. However, scaling this to $10$ million data points requires a creatively efficient implementation, since DBSCAN needs $O(n^2)$ memory to materialise the distance matrix and crushes my memory. There must be some efficient sampling-based method to overcome this, ideally something similar to MinHash, but I'm not sure how to approach this, or whether there exists a solution that can work with the existing sklearn …
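One sampling-based workaround (not MinHash itself, but in the same spirit of trading exactness for memory) is to run DBSCAN on a random subsample and then assign every remaining point to the cluster of its nearest non-noise sample point. A hedged sketch using existing sklearn components; the parameter values are placeholders:

```python
# Hedged sampling-based sketch: cluster a random subsample with DBSCAN, then
# label every point by its nearest non-noise sample point. X is assumed to be
# an (n_samples, n_features) array; eps/min_samples are placeholders.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def dbscan_on_sample(X, sample_size=100_000, eps=0.5, min_samples=10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    sample = X[idx]
    sample_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(sample)

    # Assign all points to the cluster of their nearest non-noise sample point.
    keep = sample_labels != -1
    nn = NearestNeighbors(n_neighbors=1).fit(sample[keep])
    _, nearest = nn.kneighbors(X)
    return sample_labels[keep][nearest.ravel()]
```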
This is an issue for all data scientists who have worked with this stack: Python, scikit-learn, scipy.stats, matplotlib, etc. We are looking for ways to make a project already implemented in the aforementioned stack scale to very large datasets with the minimum amount of work. Counterexamples would be rewriting everything in the TensorFlow framework or using industry tools that are unrelated to Python.
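Within that constraint, one low-effort route is out-of-core learning: stream the data in chunks with pandas and use one of sklearn's estimators that support partial_fit. A sketch with placeholder file and column names:

```python
# Sketch of one low-effort option that stays inside the pandas/sklearn stack:
# stream the data in chunks and train incrementally with partial_fit.
# The file name and column names are placeholders.
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = [0, 1]                                     # must be known up front

for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    X = chunk.drop(columns=["target"]).to_numpy()
    y = chunk["target"].to_numpy()
    clf.partial_fit(X, y, classes=classes)           # one incremental pass
```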
At the moment we use different methods for record-linking locations across different datasets. Theoretically, given two locations, we can produce a prediction of how well they match (i.e. whether they are the same). This is based not only on address data (street, house number, zip, city, country, latitude, longitude) but also on the name, the type of establishment and other properties such as phone number. Since most features are prone to fuzzy errors (different spellings, writing styles, formatting, human entry errors, null values (absent)), this …
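A common way to make the fuzziness explicit is to turn every candidate pair of records into a small vector of similarity features (string similarity on names/addresses, geographic distance, etc.) and let a classifier score the pair. A sketch under an assumed record schema:

```python
# Hedged sketch of turning a candidate pair of location records into a small
# feature vector for a matching classifier; the record keys are assumptions.
import math
from difflib import SequenceMatcher

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))

def pair_features(a, b):
    """a, b: dicts with 'name', 'street', 'lat', 'lon' keys (hypothetical schema)."""
    return [
        SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio(),
        SequenceMatcher(None, a["street"].lower(), b["street"].lower()).ratio(),
        haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]),
    ]
# These feature vectors can then be fed to any binary classifier that
# predicts "same location" vs "different location".
```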
I have a data set that keeps track of who referred someone to a program, and it includes the geo coordinates of both parties for each record. What would be the best way to visualize this kind of data set? The visualization should also be able to use the geo coordinates to place these entities on a map to form clusters, or to superimpose them on a real map. I am interested in an algorithm and/or a library that will help …
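One possible library for this is folium, which renders Leaflet maps from Python; a hedged sketch that draws each referral as a line between the two parties, with the column names assumed:

```python
# Hedged sketch using folium (one possible library): each referral becomes a
# line from referrer to referred person on a real map. The file and column
# names are assumptions about the data set described above.
import folium
import pandas as pd

df = pd.read_csv("referrals.csv")            # hypothetical file
m = folium.Map(location=[df["referrer_lat"].mean(), df["referrer_lon"].mean()],
               zoom_start=10)

for _, row in df.iterrows():
    folium.PolyLine([(row["referrer_lat"], row["referrer_lon"]),
                     (row["referred_lat"], row["referred_lon"])],
                    weight=1).add_to(m)

m.save("referrals_map.html")
```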
My situation is that I have many thousands of devices, each of which has its own specific LSTM model for anomaly prediction. These devices behave wildly differently, so I don't think there is any way to have a shared global model, unfortunately. Periodically I will update each device model with the new data from that device; so maybe once per day I will load an additional daily batch of readings and use the properties of stateful LSTM training to update …
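The daily update step can stay quite small if each model is stored on disk and warm-started on the new batch. A sketch assuming Keras stateful LSTMs saved per device; the paths, shapes and window-building step are placeholders:

```python
# Sketch of a per-device daily update, assuming each device's model is a saved
# Keras stateful LSTM; model_path, X_new and y_new are placeholders.
import tensorflow as tf

def update_device_model(model_path, X_new, y_new, epochs=1):
    """Continue training one device's model on its latest daily batch."""
    model = tf.keras.models.load_model(model_path)
    model.fit(X_new, y_new, epochs=epochs, batch_size=1, shuffle=False)
    # Clear the recurrent state carried by stateful LSTM layers after the pass.
    for layer in model.layers:
        if getattr(layer, "stateful", False):
            layer.reset_states()
    model.save(model_path)
```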
I am trying to build a price recommendation solution for clients in a scalable manner. I have two choices, as below. Professional service: statistician involvement to build a regression model, or any other kind of predictive model, that fits a specific client's data and can be used. Issue: in the long run there will be scalability issues, as one analyst cannot simultaneously build models for the hundreds of clients who want to come on board and use this service. Hiring 1 …