I tried to use the OMP (Orthogonal Matching Pursuit) algorithm available in scikit-learn. My total data size, which includes both the target signal and the dictionary, is about 1 GB. However, when I ran the code it exited with a memory error. The machine has 16 GB of RAM, so I don't think this should have happened. I added some logging to find where the error occurred and found that the data loaded completely into NumPy arrays; it was the algorithm itself that caused the error. Can someone help me with this …
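One thing worth checking is whether the solver is materialising the full Gram matrix of the dictionary. Below is a minimal sketch, assuming the dictionary `D` and the targets `Y` are already NumPy arrays, that disables the Gram precomputation and solves the target columns in chunks; the names and chunk size are placeholders, not the original code.

```python
# Minimal sketch: run OMP without precomputing the Gram matrix and solve the
# targets in chunks to keep peak memory low. D and Y are assumed NumPy arrays.
import numpy as np
from sklearn.linear_model import orthogonal_mp

def omp_in_chunks(D, Y, n_nonzero_coefs=10, chunk_size=1000):
    """Solve OMP for the columns of Y in chunks of `chunk_size` targets."""
    coefs = []
    for start in range(0, Y.shape[1], chunk_size):
        chunk = Y[:, start:start + chunk_size]
        # precompute=False avoids building the dense Gram matrix D.T @ D
        coefs.append(orthogonal_mp(D, chunk,
                                   n_nonzero_coefs=n_nonzero_coefs,
                                   precompute=False))
    return np.hstack(coefs)
```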
I need to detect objects in multiple video streams in real time (or close to it, say 10 FPS). How many GPUs do I need to run object detection with YOLOv3 or MobileNet on, say, 10 video streams? Is it possible to use a CPU or something else? I don't need an exact number; I just need to understand the scalability perspective and the cost per stream.
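The usual back-of-envelope estimate is: measure how many detections per second one GPU sustains for your model and resolution, then divide the total required frame rate by that figure. A tiny sketch, where the per-GPU throughput is an assumed number you would replace with your own benchmark:

```python
# Back-of-envelope sketch, not a benchmark: the per-GPU throughput below is
# an assumption to be replaced with a measured value for your hardware/model.
import math

def gpus_needed(n_streams, fps_per_stream, gpu_fps):
    """Number of GPUs needed if one GPU sustains `gpu_fps` detections/sec."""
    required_fps = n_streams * fps_per_stream
    return math.ceil(required_fps / gpu_fps)

# e.g. assume a single mid-range GPU runs YOLOv3 at ~40 FPS (batch size 1)
print(gpus_needed(n_streams=10, fps_per_stream=10, gpu_fps=40))  # -> 3
```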
In order to get an R/Shiny forecasting app ready for production, I am concerned about speeding up the model tuning process. I am already using parallel processing on Windows 10. There are other possible improvements to go for, such as Kaggle kernels or Amazon AWS, but I would prefer to try some open-source improvements first. A C++ developer recently told me that it is worth running resource-intensive tasks on Linux/Ubuntu machines. I was thinking about virtual …
I am building something very similar to this BigQuery ML example project. My system is different in two ways. Firstly, it will need several thousand time series, so I would prefer to use the multiple-series feature rather than maintaining thousands of individual models. Secondly, the data is more unpredictable in the long run (rather than periodic or seasonal), so it needs retraining quite often, with only local trends being detected. The data is actually monitoring voltages in battery-operated devices, which usually …
I'm a beginner in ML and I have a large dataset with 15 features and 6M rows, so it becomes challenging to work on it locally. I can train one model locally, but when I perform hyperparameter tuning and cross-validation on my MacBook Pro it runs out of memory and lacks the processing speed and capacity. I tried Spark, but that gives poor results, so I would prefer the native Python ecosystem of pandas and sklearn. So I want to …
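If staying within pandas/sklearn is the priority, a common compromise is to downcast to float32 and run the hyperparameter search on a stratified subsample, then refit the winning configuration on the full data. A sketch under those assumptions (the file name, estimator and parameter grid are placeholders):

```python
# Sketch assuming tuning on a stratified subsample is acceptable; the file
# name, estimator and parameter grid are placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

df = pd.read_csv("data.csv")                          # hypothetical file
X = df.drop(columns=["target"]).astype(np.float32)    # downcast to halve memory
y = df["target"]

# Tune on ~10% of the rows, then refit the best parameters on the full data.
X_sub, _, y_sub, _ = train_test_split(X, y, train_size=0.1, stratify=y,
                                      random_state=0)
search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1),
    param_distributions={"n_estimators": [100, 300], "max_depth": [8, 16, None]},
    n_iter=5, cv=3, random_state=0)
search.fit(X_sub, y_sub)
best_model = search.best_estimator_.fit(X, y)
```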
I know that Spark is fully integrated with Scala. Its use case is specifically large data sets. Which other tools have good Scala support? Is Scala best suited for larger data sets, or is it also suited for smaller data sets?
I've got some big datasets of images (a few million each), and I would like to cluster them according to their visual similarity. I've extracted a feature vector for each image; in practice, the space of feature representations is what I would like to cluster. However, I've had some difficulties because of the following constraints: there's no way I can know the number of clusters in advance; the clustering algorithm must be stable (it must not be order-dependent); I …
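One family of methods that does not require choosing the number of clusters up front is density-based clustering. A hedged sketch, assuming the feature vectors fit in memory after dimensionality reduction and that scikit-learn ≥ 1.3 is available for its HDBSCAN implementation; the path and parameters are placeholders:

```python
# Sketch of one density-based option that does not need k in advance.
# Assumes feature vectors are stored as an (N x D) float array and that
# sklearn.cluster.HDBSCAN (scikit-learn >= 1.3) is available.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import HDBSCAN

features = np.load("features.npy")                       # hypothetical path
reduced = PCA(n_components=50).fit_transform(features)   # cheaper distances
labels = HDBSCAN(min_cluster_size=100).fit_predict(reduced)
# label -1 marks points HDBSCAN leaves unclustered (noise)
```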
I'm working on a problem where I need to cluster user types on a scale in an unsupervised manner. I've been looking at basics like KNN and k-means, but I found them hard to scale, as these methods are quite resource-intensive. What are some highly scalable clustering methods that have a low ("big O") complexity?
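Two scikit-learn options that scale roughly linearly in the number of samples are MiniBatchKMeans and Birch; a small sketch with placeholder data:

```python
# Sketch of two scikit-learn clusterers that scale roughly linearly in the
# number of samples; X here is random placeholder data standing in for the
# real user-feature matrix.
import numpy as np
from sklearn.cluster import MiniBatchKMeans, Birch

X = np.random.rand(100_000, 20).astype(np.float32)  # placeholder data

# MiniBatchKMeans: processes small random batches, one O(n) pass over the data.
mbk_labels = MiniBatchKMeans(n_clusters=8, batch_size=10_000,
                             random_state=0).fit_predict(X)

# Birch: builds a CF-tree in a single pass, also suitable for streaming data.
birch_labels = Birch(n_clusters=8).fit_predict(X)
```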
Vocabulary. Face detection: finding all faces in an image. Face representation: the simplest way to represent a face is as an image (pixels / colour values); this is not very space-efficient and likely makes follow-up tasks hard. Face embeddings are another representation: in this case a face is a point on the unit sphere in $\mathbb{R}^{128}$, IIRC. Face verification: given two face representations, deciding whether they are the same. Question: I was just wondering how identifying a person with …
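For concreteness, face verification on such embeddings usually reduces to a distance threshold; a minimal sketch, where the threshold value is an assumption that would need calibrating on labelled pairs:

```python
# Hedged sketch of face verification on unit-sphere embeddings: two faces are
# declared "the same" when their embeddings are close enough. The threshold
# value is an assumption and must be calibrated on labelled pairs.
import numpy as np

def verify(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.6) -> bool:
    """emb_a, emb_b: L2-normalised 128-d embeddings of two face crops."""
    distance = np.linalg.norm(emb_a - emb_b)   # Euclidean distance on the sphere
    return distance < threshold
```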
Background: I would like to build a face recognition model for registration and login for some kind of service, for example using this approach (CNN + SVM). When a new user registers for the service, images of his/her face are recorded and the machine learning model is trained on these images. Then, when a person requests the service, the model recognises whether this person is a member or not. Question: But when a new user comes …
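As a reference point, the CNN + SVM approach described above usually amounts to training a classifier on CNN embeddings of the registered faces, which is also why adding a user forces a refit. A hedged sketch, where `embed()` is a hypothetical wrapper around whatever CNN is used:

```python
# Sketch of the CNN + SVM idea: a pretrained CNN turns each face image into a
# fixed-length embedding and an SVM is (re)trained on the embeddings of all
# registered users. `embed()` is a hypothetical function, not a real API.
import numpy as np
from sklearn.svm import SVC

def train_membership_model(face_images, user_ids, embed):
    """face_images: list of face crops; user_ids: matching user labels."""
    X = np.stack([embed(img) for img in face_images])   # CNN embeddings
    clf = SVC(kernel="linear", probability=True)
    clf.fit(X, user_ids)
    return clf   # note: adding a new user currently means refitting the SVM
```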
Although it might sound like a pure techie question, I would like to know which approaches you usually try out for very data-science-like processes when you need to speed things up (given that data retrieval is not a problem and that the data also fits in memory, etc.). Some of those could be the following, but I would like to receive feedback about any others: good practices such as always using NumPy when possible for numeric operations, and …
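As a tiny illustration of the NumPy practice mentioned above, replacing a Python-level loop with a vectorised call is often the single biggest win:

```python
# Tiny example of the "use NumPy for numeric operations" practice: the
# vectorised version avoids the Python-level loop entirely.
import numpy as np

x = np.random.rand(1_000_000)

# Pure-Python loop over a generator
total_loop = sum(v * v for v in x)

# Vectorised NumPy equivalent, typically orders of magnitude faster
total_np = np.dot(x, x)
```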
I have to predict next-minute traffic for multiple cities (100+). I am thinking of using an LSTM. My main concern is how to scale with the number of cities: how does an LSTM learn the different traffic volumes and other related features of all the cities in order to predict the next state? What should the network architecture be for such cases? I was thinking of the following process: normalise the data with a city-specific min/max scaler, feed sliding-window data (t_1 to t_60) …
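A minimal sketch of the pipeline described above (per-city min/max scaling, 60-step sliding windows, one shared LSTM); the shapes and layer sizes are assumptions, not recommendations:

```python
# Sketch with assumed shapes: per-city MinMax scaling, 60-step sliding
# windows, and one shared LSTM. Layer sizes are placeholders.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import tensorflow as tf

WINDOW = 60

def make_windows(series):
    """series: 1-D array of per-minute traffic -> (samples, 60, 1) X and targets y."""
    X, y = [], []
    for i in range(len(series) - WINDOW):
        X.append(series[i:i + WINDOW])
        y.append(series[i + WINDOW])
    return np.array(X)[..., None], np.array(y)

def prepare_city(series):
    scaler = MinMaxScaler()                                  # city-specific scaler
    scaled = scaler.fit_transform(series.reshape(-1, 1)).ravel()
    return make_windows(scaled), scaler

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, 1)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```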
Lots of people use the term big data in a rather commercial way, as a means of indicating that large datasets are involved in the computation, and therefore that potential solutions must have good performance. Of course, big data always carries associated terms, like scalability and efficiency, but what exactly defines a problem as a big data problem? Does the computation have to be related to some set of specific purposes, like data mining/information retrieval, or could an algorithm for general …
According to Wikipedia, "the distance matrix of size $\frac{(n^2-n)}{2}$ can be materialized to avoid distance recomputations, but this needs $O(n^2)$ memory, whereas a non-matrix based implementation of DBSCAN only needs $O(n)$ memory." $\frac{(n^2-n)}{2}$ is simply the number of entries in the triangular part of the distance matrix. However, it says that a non-matrix-based implementation only requires $O(n)$ memory. How does that work? Regardless of what data structure you use, don't you always need the $\frac{(n^2-n)}{2}$ distance values? That would still be $O(n^2)$ space complexity, no? Is there …
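The usual answer is that the distances are recomputed on demand from a spatial index rather than stored: only one point's eps-neighbourhood lives in memory at a time. A rough sketch of that idea (placeholder data, not a full DBSCAN implementation):

```python
# Sketch of the non-matrix idea: distances are recomputed on demand from a
# spatial index, so at any moment only one point's eps-neighbourhood is held
# in memory, never the full (n^2 - n)/2 pairwise matrix.
import numpy as np
from sklearn.neighbors import KDTree

X = np.random.rand(10_000, 3)      # placeholder data
eps = 0.05
tree = KDTree(X)                   # O(n)-memory spatial index

for i in range(len(X)):
    neighbours = tree.query_radius(X[i:i + 1], r=eps)[0]  # indices within eps
    # ... DBSCAN's core-point test / cluster expansion would use `neighbours`
    # here, then the array is discarded before moving on to the next point.
```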
Applying density-based clustering (DBSCAN) on $50k$ data points with about $2k$-$4k$ features, I achieve the desired results. However, scaling this to $10$ million data points requires a creatively efficient implementation, since DBSCAN needs $O(n^2)$ memory to materialise the distance matrix and crushes my memory. There must be some efficient sampling-based method to overcome this, ideally something similar to MinHash, but I'm not sure how to approach this, or whether there exists a solution that can work with the existing sklearn …
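One sampling-based workaround (not MinHash itself, but in the same spirit of trading exactness for memory) is to run DBSCAN on a random subsample and then assign every remaining point to the cluster of its nearest non-noise sample point. A hedged sketch using existing sklearn components; the parameter values are placeholders:

```python
# Hedged sampling-based sketch: cluster a random subsample with DBSCAN, then
# label every point by its nearest non-noise sample point. X is assumed to be
# an (n_samples, n_features) array; eps/min_samples are placeholders.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def dbscan_on_sample(X, sample_size=100_000, eps=0.5, min_samples=10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    sample = X[idx]
    sample_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(sample)

    # Assign all points to the cluster of their nearest non-noise sample point.
    keep = sample_labels != -1
    nn = NearestNeighbors(n_neighbors=1).fit(sample[keep])
    _, nearest = nn.kneighbors(X)
    return sample_labels[keep][nearest.ravel()]
```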
This is an issue for all data scientists who have worked with this stack: Python, scikit-learn, scipy.stats, matplotlib, etc. We are looking for ways to make a project already implemented in the aforementioned stack scale to very large datasets with the minimum amount of work. Counterexamples would be rewriting everything in the TensorFlow framework or using industry tools that are unrelated to Python.
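Within that constraint, one low-effort route is out-of-core learning: stream the data in chunks with pandas and use one of sklearn's estimators that support partial_fit. A sketch with placeholder file and column names:

```python
# Sketch of one low-effort option that stays inside the pandas/sklearn stack:
# stream the data in chunks and train incrementally with partial_fit.
# The file name and column names are placeholders.
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = [0, 1]                                     # must be known up front

for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    X = chunk.drop(columns=["target"]).to_numpy()
    y = chunk["target"].to_numpy()
    clf.partial_fit(X, y, classes=classes)           # one incremental pass
```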
At the moment we use different methods for record-linking locations across different datasets. Theoretically, given two locations, we can produce a prediction of how well they match (i.e. whether they are the same). This is based not only on address data (street, house number, zip, city, country, latitude, longitude) but also on the name, the type of establishment and other properties such as phone number. Since most features are prone to fuzzy errors (different spellings, writing styles, formatting, human entry errors, null values (absent)), this …
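A common way to make the fuzziness explicit is to turn every candidate pair of records into a small vector of similarity features (string similarity on names/addresses, geographic distance, etc.) and let a classifier score the pair. A sketch under an assumed record schema:

```python
# Hedged sketch of turning a candidate pair of location records into a small
# feature vector for a matching classifier; the record keys are assumptions.
import math
from difflib import SequenceMatcher

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))

def pair_features(a, b):
    """a, b: dicts with 'name', 'street', 'lat', 'lon' keys (hypothetical schema)."""
    return [
        SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio(),
        SequenceMatcher(None, a["street"].lower(), b["street"].lower()).ratio(),
        haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]),
    ]
# These feature vectors can then be fed to any binary classifier that
# predicts "same location" vs "different location".
```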
I have a data set that keeps track of who referred someone to a program, and it includes the geo coordinates of both parties for each record. What would be the best way to visualize this kind of data set? The visualization should also be able to use the geo coordinates to place these entities on a map to form clusters, or to superimpose them on a real map. I am interested in an algorithm and/or a library that will help …
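One possible library for this is folium, which renders Leaflet maps from Python; a hedged sketch that draws each referral as a line between the two parties, with the column names assumed:

```python
# Hedged sketch using folium (one possible library): each referral becomes a
# line from referrer to referred person on a real map. The file and column
# names are assumptions about the data set described above.
import folium
import pandas as pd

df = pd.read_csv("referrals.csv")            # hypothetical file
m = folium.Map(location=[df["referrer_lat"].mean(), df["referrer_lon"].mean()],
               zoom_start=10)

for _, row in df.iterrows():
    folium.PolyLine([(row["referrer_lat"], row["referrer_lon"]),
                     (row["referred_lat"], row["referred_lon"])],
                    weight=1).add_to(m)

m.save("referrals_map.html")
```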
My situation is that I have many thousands of devices, each of which has its own specific LSTM model for anomaly prediction. These devices behave wildly differently, so I don't think there is any way to have a shared global model, unfortunately. Periodically I will update each device model with the new data from that device; so maybe once per day I will load an additional daily batch of readings and use the properties of stateful LSTM training to update …
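The daily update step can stay quite small if each model is stored on disk and warm-started on the new batch. A sketch assuming Keras stateful LSTMs saved per device; the paths, shapes and window-building step are placeholders:

```python
# Sketch of a per-device daily update, assuming each device's model is a saved
# Keras stateful LSTM; model_path, X_new and y_new are placeholders.
import tensorflow as tf

def update_device_model(model_path, X_new, y_new, epochs=1):
    """Continue training one device's model on its latest daily batch."""
    model = tf.keras.models.load_model(model_path)
    model.fit(X_new, y_new, epochs=epochs, batch_size=1, shuffle=False)
    # Clear the recurrent state carried by stateful LSTM layers after the pass.
    for layer in model.layers:
        if getattr(layer, "stateful", False):
            layer.reset_states()
    model.save(model_path)
```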
I am trying to build a price recommendation solution for clients in a scalable manner. I have two choices, as below. Professional service: statistician involvement to build a regression model, or any other kind of predictive model, that fits a specific client's data and can be used. Issue: in the long run there will be scalability issues, as one analyst cannot simultaneously build models for the hundreds of clients who want to come on board and use this service. Hiring 1 …