I am new to data engineering and wanted to know: what is the best way to store more than 3000 GB of data for further processing and analysis? I am specifically looking for open-source resources. I have explored many data formats for storage. The dataset I want to store is heart-rate pulse data generated by a sensor.
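A minimal sketch of the kind of layout I have been experimenting with, assuming pandas with pyarrow and columnar Parquet files partitioned by date (the column names and output path are hypothetical):

```python
import pandas as pd

# Hypothetical batch of sensor readings.
readings = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=1000, freq="s"),
    "sensor_id": ["hr-001"] * 1000,
    "pulse_bpm": 70,
})
readings["date"] = readings["timestamp"].dt.date.astype(str)

# Columnar, compressed, splittable files; partitioning by date keeps each
# file small enough for downstream processing.
readings.to_parquet(
    "hr_data/",              # hypothetical output directory
    engine="pyarrow",
    partition_cols=["date"],
    compression="snappy",
)
```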
When using the xgboost.train() function, all threads are used. I would like to use a specific number of them. Unfortunately, this function accepts neither the nthread nor the n_jobs parameter. How can I control the number of threads being used? Thanks. Edit: It seems that I found a solution. In contrast with how one provides the nthread (or n_jobs) parameter to XGBClassifier or XGBRegressor, by passing it directly to the constructor as xgb.XGBRegressor(nthread=n), then as indicated on …
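A minimal sketch of what I think that solution looks like, assuming the thread count is passed inside the params dictionary given to xgb.train() rather than as a keyword argument (the toy data is only there to make the example self-contained):

```python
import numpy as np
import xgboost as xgb

# Toy data just to make the example runnable.
X = np.random.rand(100, 5)
y = np.random.rand(100)
dtrain = xgb.DMatrix(X, label=y)

# The thread count is controlled through the params dict,
# not through a keyword argument of xgb.train() itself.
params = {
    "objective": "reg:squarederror",
    "nthread": 2,  # limit training to 2 threads
}
booster = xgb.train(params, dtrain, num_boost_round=10)
```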
One can rely on continuous wavelets to build a multi-resolution analysis that is equivariant ("covariant") under the action of a discrete subgroup of translations. When not downsampled, the multi-resolution analysis of a 1D signal can be seen as an n × m matrix of coefficients, where n is the number of octaves one wants to capture and m is the number of translated wavelets considered at each octave. Equivariance to translation in this case means that a certain translation of the …
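As a concrete illustration of that n × m coefficient matrix, here is a minimal sketch using PyWavelets; the choice of the Morlet wavelet and of one dyadic scale per octave is just an assumption for the example:

```python
import numpy as np
import pywt

# A toy 1D signal.
t = np.linspace(0, 1, 1024)
signal = np.sin(2 * np.pi * 12 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

# One scale per octave (dyadic scales); with no downsampling,
# every translation of the signal is kept.
scales = 2.0 ** np.arange(1, 8)        # n = 7 octaves
coeffs, freqs = pywt.cwt(signal, scales, "morl")

# coeffs has shape (n, m): one row per octave, one column per translation.
print(coeffs.shape)                     # (7, 1024)
```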
I currently have $1700+$ CSV files. Each of them is in the same format and structure, give or take a row or possibly a column at the end. Each CSV is $\approx 3.8$ MB. I need to perform a transformation on each file: extract one data set, perform a summation and column select, then store it inside a folder; extract an array (no need for column names here), then store it inside a folder. From an algorithmic POV, is it better to …
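A minimal sketch of one way to loop over the files with pandas (the paths, the "value" column, and the two output folders are hypothetical placeholders):

```python
from pathlib import Path
import numpy as np
import pandas as pd

src = Path("csv_in")                 # hypothetical input folder
out_summary = Path("out_summary")    # hypothetical output folders
out_arrays = Path("out_arrays")
out_summary.mkdir(exist_ok=True)
out_arrays.mkdir(exist_ok=True)

for csv_path in sorted(src.glob("*.csv")):
    df = pd.read_csv(csv_path)

    # 1) Column select + summation, stored as one small CSV per input file.
    summary = df[["value"]].sum()            # "value" is a placeholder column
    summary.to_csv(out_summary / csv_path.name)

    # 2) Raw array without column names, stored as one .npy per input file.
    arr = df.to_numpy()
    np.save(out_arrays / (csv_path.stem + ".npy"), arr)
```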
I'm playing around with the use of deep learning on images and have done quite a few projects: colorizing black-and-white images, for example, or fixing old damaged photos. Today I want to tackle a new problem concerning the conversion of sketches into realistic-looking images, as shown in the figure. This could be done in various ways. I want to know whether it is correct to think of CycleGANs for this task, since in truth they are …
I'm wondering if anyone can provide some input on improving the speed of a pandas calculation. What I am trying to obtain is a summation over IDs in one table (the player table) based on each row of a second table (keyed by UUID). Functionally, each row needs to sum the totals of the player-table rows contained in its Active column and assign the UUID as the index of that row. My initial thought was to loop row …
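A minimal sketch of the non-loop approach I would try, assuming the Active column holds a list of player IDs per UUID (the table and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical player table: one score per player ID.
players = pd.DataFrame(
    {"player_id": ["p1", "p2", "p3"], "score": [10, 20, 30]}
).set_index("player_id")

# Hypothetical UUID table: each row lists the active player IDs.
uuids = pd.DataFrame(
    {"uuid": ["a", "b"], "Active": [["p1", "p3"], ["p2"]]}
)

# Explode the lists, look up the scores, and sum back per UUID,
# so no Python-level row loop is needed.
exploded = uuids.explode("Active")
exploded["score"] = exploded["Active"].map(players["score"])
result = exploded.groupby("uuid")["score"].sum()
print(result)   # a -> 40, b -> 20
```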
I'm using Flask, where I load some pre-trained machine learning models once. I'm also using Gunicorn, usually with 2 or 4 workers, to handle parallel requests. Every request contains some texts that I want to analyze. I'll explain my problem with an example: My Flask server with Gunicorn and 2 workers is up and loads my models once for every worker. Then I send two parallel requests. The first will run analysis on the 1st worker with 500 texts and …
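A minimal sketch of the setup described, assuming the model is loaded at module import time so that each Gunicorn worker holds its own copy (the model file and its predict() API are hypothetical):

```python
# app.py -- run with e.g.: gunicorn -w 2 app:app
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Loaded once per worker process, at import time.
with open("model.pkl", "rb") as f:          # hypothetical model file
    model = pickle.load(f)

@app.route("/analyze", methods=["POST"])
def analyze():
    texts = request.get_json()["texts"]      # list of strings to analyze
    predictions = model.predict(texts)       # hypothetical model API
    return jsonify({"predictions": list(predictions)})
```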
I have a bunch of .txt and .srt files extracted from a MOOC website; they are the scripts of the videos. I would like to segment the scripts into parts such that each part falls into one of the following categories: MainConceptDescription -> explanation of the main concept(s); SubConceptDescription -> explanation of a subconcept related to the main concept; Methodology / Technique -> to achieve something, what should one do; Summary -> summary of the discussed material or of the whole course; Application -> practical advice …
The patch for the Meltdown vulnerability disables speculative execution, which will impact all processing activities. The degree of impact is highly dependent on the type of processing being done. Is there hard data or experience showing, in measurable terms, how machine learning and data processing will be impacted?
I'm trying to run some analysis with some big datasets (e.g. 400k rows × 400 columns) in R (e.g. using neural networks and recommendation systems). But it's taking too long to process the data (with huge matrices, e.g. 400k rows × 400k columns). What are some free/cheap ways to improve R performance? I'm open to package or web-service suggestions (other options are welcome).
Another post where I don't know enough terminology to describe things efficiently. In the comments, please suggest some tags and keywords I can add to this post to make it better. Say I have a 2D data structure where 'orientation' doesn't matter. An example I ran into: the state of a 2048 game. In terms of symmetry groups this would be D4 / D8, except that an operation doesn't yield an identical state, it just yields another state that has …
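A minimal sketch of what I mean by the 8 equivalent orientations of a 2048 board, using numpy (the board contents are just an example):

```python
import numpy as np

# A toy 4x4 "2048" board.
board = np.array([
    [2, 0, 0, 4],
    [0, 2, 0, 0],
    [0, 0, 8, 0],
    [0, 0, 0, 2],
])

# The dihedral group D4 has 8 elements: 4 rotations and their mirror images.
rotations = [np.rot90(board, k) for k in range(4)]
reflections = [np.fliplr(r) for r in rotations]
symmetries = rotations + reflections

# Each element is a different array, but all 8 represent "the same" position
# once orientation is ignored.
print(len({s.tobytes() for s in symmetries}))   # up to 8 distinct states
```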
I was looking for a cheat sheet of UNIX commands, specifically those usable for data science. I mean an introduction to very basic commands (starting from cd, ls, pwd, up to some still simple but data-relevant ones, e.g. wc, a few simple things with pipes, ssh, maybe s3cmd, etc.), with some minimalistic examples. I couldn't find one; or at least, nothing close to the Git Cheat Sheet, the Regular Expressions Cheat Sheet, or the series of R Cheat Sheets. I am pretty …
I have heard about many tools and frameworks that help people process their data in a big-data environment. One is called Hadoop, and the other is the NoSQL concept. What is the difference between them in terms of processing? Are they complementary?
I am currently working on a multi-class classification problem with a large training set. However, it has some specific characteristics, which led me to experiment with it, resulting in a few versions of the training set (as a result of re-sampling, removing observations, etc.). I want to pre-process the data, that is, scale, center, and impute (not much imputation, though) values. This is the point where I've started to get confused. I've been taught that you should always …
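A minimal sketch of how I currently set up the pre-processing, assuming scikit-learn, with the transformations fit on the training portion only (the estimator choice and the toy data are just placeholders):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for one version of the training set.
X = np.random.rand(200, 10)
X[np.random.rand(200, 10) < 0.05] = np.nan   # a few missing values
y = np.random.randint(0, 3, size=200)        # multi-class target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Imputation and scaling are fit on the training split only; the same
# fitted transformations are then applied to the held-out data.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),              # centers and scales
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```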
PROC means data=d mean; var a; class b; var a; run; I want to run PROC MEANS for the continuous variable a: 1) overall and 2) by class. But it runs by class only. How can I make the procedure also compute the overall statistics for a? P.S. SAS warning: Analysis variable "a" was defined in a previous statement, duplicate definition will be ignored.
I am working on a research project that deals with American military casualties during WWII. Specifically, I am attempting to construct a count of casualties for each service at the county level. There are two sources of data here, each presenting its own challenges. 1. Army and Air Force data. The National Archives hosts lists of Army and Air Force servicemen killed in action by state and county. There are .gif images of the report available online. Here is a …
First, I think it's worth stating what I mean by replication and reproducibility: Replication of analysis A means an exact copy of all inputs and processes is supplied and results in identical outputs in analysis B. Reproducibility of analysis A means inputs, processes, and outputs that are semantically identical to those of analysis A, without access to the exact inputs and processes. Putting aside how easy it might be to replicate a given build, especially an ad-hoc one, to …