I am new to data engineering and wanted to know: what is the best way to store more than 3000 GB of data for further processing and analysis? I am specifically looking for open-source resources. I have explored many data formats for storage. The dataset I want to store is heart-rate pulse data generated by a sensor.
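A minimal sketch of the kind of layout I have been experimenting with, assuming pandas with pyarrow and columnar Parquet files partitioned by date (the column names and output path are hypothetical):

```python
import pandas as pd

# Hypothetical batch of sensor readings.
readings = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=1000, freq="s"),
    "sensor_id": ["hr-001"] * 1000,
    "pulse_bpm": 70,
})
readings["date"] = readings["timestamp"].dt.date.astype(str)

# Columnar, compressed, splittable files; partitioning by date keeps each
# file small enough for downstream processing.
readings.to_parquet(
    "hr_data/",              # hypothetical output directory
    engine="pyarrow",
    partition_cols=["date"],
    compression="snappy",
)
```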
When using the xgboost.train() function, all threads are used. I would like to use a specific number of them. Unfortunately, this function accepts neither the nthread nor the n_jobs parameter. How can I control the number of threads being used? Thanks. Edit: It seems that I found a solution. In contrast with how one provides the nthread (or n_jobs) parameter to XGBClassifier or XGBRegressor, by passing it directly to the constructor as xgb.XGBRegressor(nthread=n), then as indicated on …
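A minimal sketch of what I think that solution looks like, assuming the thread count is passed inside the params dictionary given to xgb.train() rather than as a keyword argument (the toy data is only there to make the example self-contained):

```python
import numpy as np
import xgboost as xgb

# Toy data just to make the example runnable.
X = np.random.rand(100, 5)
y = np.random.rand(100)
dtrain = xgb.DMatrix(X, label=y)

# The thread count is controlled through the params dict,
# not through a keyword argument of xgb.train() itself.
params = {
    "objective": "reg:squarederror",
    "nthread": 2,  # limit training to 2 threads
}
booster = xgb.train(params, dtrain, num_boost_round=10)
```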
One can rely on continuous wavelets to build a multi-resolution analysis that is equivariant ("covariant") under the action of a discrete subgroup of translations. When not downsampled, the multi-resolution analysis of a 1D signal can be seen as an n × m matrix of coefficients, where n is the number of octaves one wants to capture and m is the number of translated wavelets considered at each octave. Equivariance to translation in this case means that a certain translation of the …
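As a concrete illustration of that n × m coefficient matrix, here is a minimal sketch using PyWavelets; the choice of the Morlet wavelet and of one dyadic scale per octave is just an assumption for the example:

```python
import numpy as np
import pywt

# A toy 1D signal.
t = np.linspace(0, 1, 1024)
signal = np.sin(2 * np.pi * 12 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

# One scale per octave (dyadic scales); with no downsampling,
# every translation of the signal is kept.
scales = 2.0 ** np.arange(1, 8)        # n = 7 octaves
coeffs, freqs = pywt.cwt(signal, scales, "morl")

# coeffs has shape (n, m): one row per octave, one column per translation.
print(coeffs.shape)                     # (7, 1024)
```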
I currently have $1700+$ CSV files. Each of them is in the same format and structure, give or take a row or possibly a column at the end. Each CSV is $\approx 3.8$ MB. I need to perform a transformation on each file: extract one data set, perform a summation and column select, then store it inside a folder; extract an array (no need for column names here), then store it inside a folder. From an algorithmic POV, is it better to …
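A minimal sketch of one way to loop over the files with pandas (the paths, the "value" column, and the two output folders are hypothetical placeholders):

```python
from pathlib import Path
import numpy as np
import pandas as pd

src = Path("csv_in")                 # hypothetical input folder
out_summary = Path("out_summary")    # hypothetical output folders
out_arrays = Path("out_arrays")
out_summary.mkdir(exist_ok=True)
out_arrays.mkdir(exist_ok=True)

for csv_path in sorted(src.glob("*.csv")):
    df = pd.read_csv(csv_path)

    # 1) Column select + summation, stored as one small CSV per input file.
    summary = df[["value"]].sum()            # "value" is a placeholder column
    summary.to_csv(out_summary / csv_path.name)

    # 2) Raw array without column names, stored as one .npy per input file.
    arr = df.to_numpy()
    np.save(out_arrays / (csv_path.stem + ".npy"), arr)
```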
I'm playing around with the use of deep learning on images and have done quite a few projects: colorizing black-and-white images, for example, or fixing old damaged photos. Today I want to tackle a new problem concerning the conversion of sketches into realistic-looking images, as shown in the figure. This could be done in various ways. I want to know whether it is correct to think of CycleGANs for this task, since in truth they are …
I'm wondering if anyone can provide some input on improving the speed of a pandas calculation. What I am trying to obtain is a summation over IDs in one table (the player table) based on each row of a second table (keyed by UUID). Functionally, each row needs to sum the totals of the player-table rows contained in its Active column and assign the UUID as the index of that row. My initial thought was to loop row …
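A minimal sketch of the non-loop approach I would try, assuming the Active column holds a list of player IDs per UUID (the table and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical player table: one score per player ID.
players = pd.DataFrame(
    {"player_id": ["p1", "p2", "p3"], "score": [10, 20, 30]}
).set_index("player_id")

# Hypothetical UUID table: each row lists the active player IDs.
uuids = pd.DataFrame(
    {"uuid": ["a", "b"], "Active": [["p1", "p3"], ["p2"]]}
)

# Explode the lists, look up the scores, and sum back per UUID,
# so no Python-level row loop is needed.
exploded = uuids.explode("Active")
exploded["score"] = exploded["Active"].map(players["score"])
result = exploded.groupby("uuid")["score"].sum()
print(result)   # a -> 40, b -> 20
```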
I'm using Flask, where I load some pre-trained machine learning models once. I'm also using Gunicorn, usually with 2 or 4 workers, to handle parallel requests. Every request contains some texts that I want to analyze. I'll explain my problem with an example: My Flask server with Gunicorn and 2 workers is up and loads my models once for every worker. Then I send two parallel requests. The first will run analysis on the 1st worker with 500 texts and …
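A minimal sketch of the setup described, assuming the model is loaded at module import time so that each Gunicorn worker holds its own copy (the model file and its predict() API are hypothetical):

```python
# app.py -- run with e.g.: gunicorn -w 2 app:app
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Loaded once per worker process, at import time.
with open("model.pkl", "rb") as f:          # hypothetical model file
    model = pickle.load(f)

@app.route("/analyze", methods=["POST"])
def analyze():
    texts = request.get_json()["texts"]      # list of strings to analyze
    predictions = model.predict(texts)       # hypothetical model API
    return jsonify({"predictions": list(predictions)})
```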
I have a bunch of .txt and .srt files extracted from a MOOC website; they are the scripts of the videos. I would like to segment the scripts into parts such that each part falls into one of the following categories: MainConceptDescription -> explanation of the main concept(s); SubConceptDescription -> explanation of a subconcept related to the main concept; Methodology / Technique -> to achieve something, what should one do; Summary -> summary of the discussed material or of the whole course; Application -> practical advice …
The patch for the Meltdown vulnerability disables speculative execution, which will impact all processing activities. The degree of impact is highly dependent on the type of processing being done. Is there hard data or experience showing, in measurable terms, how machine learning and data processing will be impacted?
I'm trying to run some analysis with some big datasets (e.g. 400k rows × 400 columns) in R (e.g. using neural networks and recommendation systems). But it's taking too long to process the data (with huge matrices, e.g. 400k rows × 400k columns). What are some free/cheap ways to improve R performance? I'm open to package or web-service suggestions (other options are welcome).
Another post where I don't know enough terminology to describe things efficiently. In the comments, please suggest some tags and keywords I can add to this post to make it better. Say I have a 2D data structure where 'orientation' doesn't matter. An example I ran into: the state of a 2048 game. In terms of symmetry groups this would be D4 / D8, except that an operation doesn't yield an identical state, it just yields another state that has …
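A minimal sketch of what I mean by the 8 equivalent orientations of a 2048 board, using numpy (the board contents are just an example):

```python
import numpy as np

# A toy 4x4 "2048" board.
board = np.array([
    [2, 0, 0, 4],
    [0, 2, 0, 0],
    [0, 0, 8, 0],
    [0, 0, 0, 2],
])

# The dihedral group D4 has 8 elements: 4 rotations and their mirror images.
rotations = [np.rot90(board, k) for k in range(4)]
reflections = [np.fliplr(r) for r in rotations]
symmetries = rotations + reflections

# Each element is a different array, but all 8 represent "the same" position
# once orientation is ignored.
print(len({s.tobytes() for s in symmetries}))   # up to 8 distinct states
```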
I was looking for a cheat sheet of UNIX commands, specifically those usable for data science. I mean an introduction to very basic commands (starting from cd, ls, pwd, up to some still simple but data-relevant ones, e.g. wc, a few simple things with pipes, ssh, maybe s3cmd, etc.), with some minimalistic examples. I couldn't find one; or at least, nothing close to the Git Cheat Sheet, the Regular Expressions Cheat Sheet, or the series of R Cheat Sheets. I am pretty …
I have heard about many tools and frameworks that help people process their data in a big-data environment. One is called Hadoop, and the other is the NoSQL concept. What is the difference between them in terms of processing? Are they complementary?
I am currently working on a multi-class classification problem with a large training set. However, it has some specific characteristics, which led me to experiment with it, resulting in a few versions of the training set (as a result of re-sampling, removing observations, etc.). I want to pre-process the data, that is, scale, center, and impute (not much imputation, though) values. This is the point where I've started to get confused. I've been taught that you should always …
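A minimal sketch of how I currently set up the pre-processing, assuming scikit-learn, with the transformations fit on the training portion only (the estimator choice and the toy data are just placeholders):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for one version of the training set.
X = np.random.rand(200, 10)
X[np.random.rand(200, 10) < 0.05] = np.nan   # a few missing values
y = np.random.randint(0, 3, size=200)        # multi-class target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Imputation and scaling are fit on the training split only; the same
# fitted transformations are then applied to the held-out data.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),              # centers and scales
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```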
PROC means data=d mean; var a; class b; var a; run; I want to run PROC MEANS for the continuous variable a: 1) overall and 2) by class. But it runs by class only. How can I make the procedure also compute the overall statistics for a? P.S. SAS warning: Analysis variable "a" was defined in a previous statement, duplicate definition will be ignored.
I am working on a research project that deals with American military casualties during WWII. Specifically, I am attempting to construct a count of casualties for each service at the county level. There are two sources of data here, each presenting its own challenges. 1. Army and Air Force data. The National Archives hosts lists of Army and Air Force servicemen killed in action by state and county. There are .gif images of the report available online. Here is a …
First, I think it's worth stating what I mean by replication and reproducibility: Replication of analysis A means an exact copy of all inputs and processes is supplied and results in identical outputs in analysis B. Reproducibility of analysis A means inputs, processes, and outputs that are semantically identical to those of analysis A, without access to the exact inputs and processes. Putting aside how easy it might be to replicate a given build, especially an ad-hoc one, to …