I am working with a very large dataset that would benefit from training continuation with the xgb_model parameter in xgb.train(). The label (Y) of the dataset has 4 classes and is highly imbalanced, so I would like to generate per-label PR curves to evaluate the model's performance, and would thus need to treat each class as its own binary problem using a one-vs-rest classifier. After a lot of reading I haven't found an equivalent to sklearn's OneVsRestClassifier in …
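For the PR-curve part, no one-vs-rest wrapper is strictly needed: any multiclass probability matrix can be sliced per class. A minimal sketch with sklearn only (LogisticRegression stands in for the booster trained via xgb.train(); the dataset is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.preprocessing import label_binarize

# Synthetic 4-class imbalanced stand-in for the real data
X, y = make_classification(n_samples=2000, n_classes=4, n_informative=8,
                           weights=[0.7, 0.15, 0.1, 0.05], random_state=0)

# Any classifier exposing per-class probabilities works here; an xgboost
# booster's predict() with multi:softprob output slots in the same way
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)                    # shape (n_samples, 4)
y_bin = label_binarize(y, classes=[0, 1, 2, 3])  # one binary column per class

# One PR curve / average precision per class, treating each as one-vs-rest
ap_per_class = {}
for k in range(4):
    precision, recall, _ = precision_recall_curve(y_bin[:, k], proba[:, k])
    ap_per_class[k] = average_precision_score(y_bin[:, k], proba[:, k])
```

Each `(precision, recall)` pair can be plotted directly; no separate one-vs-rest training run is required just for evaluation.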
I learned that tools like Pachyderm version-control data, but I cannot see any difference between that tool and Git. I learned from this post that:

- It holds all your data in a central, accessible location
- It updates all dependent data sets when data is added to or changed in a data set
- It can run any transformation, as long as it runs in Docker, and accepts a file as input and outputs a file as result
- It versions all …
I am training a machine learning model (i.e., a classifier) on a large dataset. I know that I can get the same results using less data (about 30% of it), but I would like to avoid the trial-and-error process of finding the 'right' amount of data to retain. Of course I could write a script that automatically tries different thresholds, but I was wondering if there is any principled way of doing this. It seems strange that …
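One principled tool for exactly this question is a learning curve: score the model at increasing training-set sizes and stop where the curve flattens. A small sketch with sklearn (synthetic data; the estimator is a placeholder for the real classifier):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1500, random_state=0)

# Cross-validated score at 10%, 32.5%, ..., 100% of the training data;
# the point where validation score stops improving is the "right" amount
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

mean_val = val_scores.mean(axis=1)
```

Plotting `sizes` against `mean_val` replaces the manual threshold search with one systematic pass.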
I'm trying to fake data for a coffee shop. I have two features: age and menu. The menu includes various types of drinks, such as coffee [latte, espresso, mocca, etc.], tea [milktea, lemontea], and milk [freshmilk, matchamilk, etc.]. What I'm trying to do is fake the menu based on age: for example, if the age is higher than 15, 80% of those people will order coffee, chosen randomly from the list of coffees [latte, espresso, mocca, etc.], …
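The age-conditional rule above maps directly onto weighted category sampling. A minimal sketch with the stdlib only (the 80/10/10 split and the item lists are the hypothetical values from the question):

```python
import random

random.seed(0)  # reproducible demo

coffee = ["latte", "espresso", "mocca"]
tea = ["milktea", "lemontea"]
milk = ["freshmilk", "matchamilk"]

def fake_order(age):
    """Pick a drink conditioned on age: 80% coffee for customers over 15
    (10% tea, 10% milk as an assumed split), uniform categories otherwise."""
    if age > 15:
        category = random.choices([coffee, tea, milk], weights=[0.8, 0.1, 0.1])[0]
    else:
        category = random.choice([coffee, tea, milk])
    return random.choice(category)  # uniform pick within the category

orders = [fake_order(25) for _ in range(1000)]
```

Over many samples, roughly 80% of the over-15 orders land in the coffee list, with each coffee equally likely.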
I tried to use the OMP algorithm available in scikit-learn. My total data size, including both the target signal and the dictionary, is about 1 GB. However, when I ran the code, it exited with a memory error. The machine has 16 GB of RAM, so I don't think this should have happened. I added some logging to see where the error occurred and found that the data loaded completely into numpy arrays; it was the algorithm itself that caused the error. Can someone help me with this …
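One avenue worth checking: scikit-learn also exposes a Gram-matrix variant of OMP, which works on the precomputed (n_atoms × n_atoms) Gram matrix rather than the full sample-sized arrays, and can lower the solver's peak memory when there are many samples. A small sketch on synthetic data (sizes are illustrative, not the real 1 GB problem):

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp_gram

rng = np.random.default_rng(0)
D = rng.standard_normal((1000, 100))   # dictionary: n_samples x n_atoms
D /= np.linalg.norm(D, axis=0)         # OMP expects unit-norm atoms
y = D[:, :5] @ rng.standard_normal(5)  # signal built from 5 atoms

# Precompute Gram and D^T y once; the solver then only works with
# n_atoms-sized objects instead of the full signal-length arrays
gram = D.T @ D
Dy = D.T @ y
coef = orthogonal_mp_gram(gram, Dy, n_nonzero_coefs=5)
```

Whether this fixes the crash depends on where the allocation actually fails, but it removes the solver's largest temporaries.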
I am redesigning some classical algorithms for the Hadoop/MapReduce framework. I was wondering if there is any established approach for denoting Big-O-style expressions to measure time complexity. For example, hypothetically, a simple average calculation of n (= 1 billion) numbers is an O(n) + C operation using a simple for loop, or O(log n)? I am assuming division to be a constant-time operation for the sake of simplicity. If I break this massively parallelizable algorithm down for MapReduce, by dividing data over …
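The usual accounting splits the work per phase: with m mappers each handling n/m numbers, the map phase is O(n/m) per machine, and combining the m partial results is O(m) sequentially, or O(log m) depth if the reduction is a balanced tree. A toy sketch of the average in that shape (plain Python standing in for a real MapReduce job):

```python
from functools import reduce

def mapper(chunk):
    # Each of the m mappers does O(n/m) local work: a (sum, count) pair
    return (sum(chunk), len(chunk))

def reducer(a, b):
    # Combining two partials is O(1); O(m) total across partials,
    # or O(log m) depth when organized as a balanced reduction tree
    return (a[0] + b[0], a[1] + b[1])

data = list(range(1, 101))
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]  # 4 "machines"
total, count = reduce(reducer, map(mapper, chunks))
average = total / count
```

So the overall cost is typically written as O(n/m + log m) in a work/depth style, plus communication terms that pure Big-O on a single machine has no slot for.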
Even with the naked eye we can tell that in region 5161 the network usage is high, so that is the anomaly in my case. Why, then, do we want to apply k-means and other machine learning algorithms to find anomalies in our data?
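The short answer is scale and automation: eyeballing works for one obvious spike in one series, but not for thousands of series or subtle anomalies. As an illustration of how k-means turns the visual judgment into a rule, here is a sketch on synthetic usage data with an injected spike (one common convention: points landing in a tiny cluster are flagged as anomalous):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
usage = rng.normal(50, 5, size=500)
usage[300:310] += 60            # an injected spike, like the one visible by eye

X = usage.reshape(-1, 1)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The spike points separate into their own small cluster; flagging the
# smallest cluster finds the anomaly with no human looking at the plot
counts = np.bincount(km.labels_)
anomalous_cluster = int(np.argmin(counts))
anomalies = np.where(km.labels_ == anomalous_cluster)[0]
```

The same code runs unchanged on the next million samples, which is the point of using an algorithm at all.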
I have been handed about 40 GB of CSV files that I need to turn into a database. The files are arranged in a directory structure that uses location within that structure to encode the relationships between the different CSV files:

/base
    ancillary_information.csv
    /Run1
        /Scenario A
            one.csv
            two.csv
            ...
        /Scenario B
            one.csv
            ...
    /Run2
        /Scenario C
            one.csv
            ...
        /Scenario D
            one.csv
            ...
    /Run3
        ...

Each CSV file is for a single …
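One common pattern for this layout: walk the tree, parse the run and scenario out of each file's path, and store them as columns so the directory structure survives as queryable data. A minimal sketch with the stdlib and SQLite (the two data columns are a placeholder for the real, unknown CSV schema; it builds its own tiny demo tree):

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

def ingest(base_dir, db_path=":memory:"):
    """Load base/Run*/Scenario*/<file>.csv into one SQLite table, with the
    run and scenario names (taken from the path) as extra columns."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS measurements "
                "(run TEXT, scenario TEXT, source_file TEXT, c1 TEXT, c2 TEXT)")
    for f in sorted(Path(base_dir).glob("*/*/*.csv")):
        run, scenario = f.parts[-3], f.parts[-2]   # path encodes the relation
        with open(f, newline="") as fh:
            for row in csv.reader(fh):
                con.execute("INSERT INTO measurements VALUES (?,?,?,?,?)",
                            (run, scenario, f.name, *row[:2]))
    con.commit()
    return con

# Tiny demo tree mirroring the layout in the question
base = Path(tempfile.mkdtemp())
d = base / "Run1" / "Scenario A"
d.mkdir(parents=True)
(d / "one.csv").write_text("x1,y1\nx2,y2\n")

con = ingest(base)
rows = con.execute("SELECT run, scenario, c1 FROM measurements").fetchall()
```

For 40 GB, the same loop works unchanged; only the INSERTs would be batched (executemany) for speed.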
I need to test an algorithm that computes a function on a dataframe, where in each execution I drop a column and compute the function. This is an example in Python pyspark, but without using rdd:

```python
df2581 = spark.sparkContext.parallelize([Row(a=1, b=3, c=5, d=7, e=9)]).toDF()
df2581.show()

wo = df2581.rdd.flatMap(lambda x: x[1:]).map(lambda a: print(type(a)))
wo.collect()

def f(x):
    list3 = []
    index = 0
    list2 = x
    for j in x:
        list = array(x)
        list.remove(list[index])
        list3 = list.copy()
        index += 1
    return list3

colu = df2581.columns

def add(x, y):
    return x + y
```
…
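The drop-one-column-per-run loop itself needs no RDDs at all. A pandas sketch of the shape (in pyspark the same idea is `df2581.drop(col)` inside the loop, since DataFrame.drop returns a new dataframe without that column; `f` here is a placeholder metric):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [3], "c": [5], "d": [7], "e": [9]})

def f(frame):
    # placeholder for the real function computed on the reduced dataframe
    return int(frame.to_numpy().sum())

# One execution per column: compute f on the dataframe minus that column
results = {col: f(df.drop(columns=col)) for col in df.columns}
```

Each entry of `results` is the function's value with exactly one column removed, which is the leave-one-out pattern the question describes.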
The dataset I am currently working on has more than 100 CSV files, each more than 250 MB in size. These files contain time series data captured at different locations, and all of them have the same features as columns. As I understand it, I must combine these into one single CSV file to use the data in a CNN, RNN, or any other network, and the result is expected to be more than 20 GB. But this is an unacceptable …
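The combining step is actually unnecessary: training loops only need batches of arrays, so a generator that yields one file's worth of data at a time avoids ever materializing the 20 GB table (this is essentially what tf.data and similar input pipelines do). A sketch that demonstrates the pattern on two tiny stand-in files:

```python
import glob
import os
import tempfile
import numpy as np
import pandas as pd

def csv_batches(pattern, feature_cols):
    """Yield one (n_rows, n_features) float32 array per CSV file;
    the files are never concatenated into a single giant table."""
    for path in sorted(glob.glob(pattern)):
        df = pd.read_csv(path, usecols=feature_cols)
        yield np.asarray(df[feature_cols], dtype=np.float32)

# Demo on two small stand-in files (the real ~250 MB files stream the same way)
tmp = tempfile.mkdtemp()
cols = ["sensor1", "sensor2"]
for name in ("loc_a.csv", "loc_b.csv"):
    pd.DataFrame({"sensor1": [1.0, 2.0], "sensor2": [3.0, 4.0]}).to_csv(
        os.path.join(tmp, name), index=False)

batches = list(csv_batches(os.path.join(tmp, "*.csv"), cols))
```

Peak memory is then bounded by the largest single file, not the whole dataset; the generator plugs straight into a training loop or into `tf.data.Dataset.from_generator`.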
I have a big dataset of fake transactions for a company. Each row contains the username, credit card number, time, device used, and amount of money in the transaction. I need to classify each transaction as either malicious or not malicious, and I am at a loss for where to start. Doing it by hand would be silly. I was thinking of possibly checking how often a credit card is used, whether it is consistently used at a certain …
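With no labels to start from, one common first step is unsupervised anomaly detection on engineered numeric features (amount, hour of day, card-usage frequency, and so on). A hedged sketch with sklearn's IsolationForest on synthetic stand-in features:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical per-transaction features: amount, hour of day,
# number of transactions by this card in the last 24 h
normal = np.column_stack([rng.normal(40, 10, 500),
                          rng.integers(8, 22, 500),
                          rng.integers(1, 4, 500)]).astype(float)
suspicious = np.array([[900.0, 3, 25]])   # huge amount, 3 a.m., heavy card reuse
X = np.vstack([normal, suspicious])

# No labels needed; predict() returns -1 for rows the forest isolates easily
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = iso.predict(X)
```

The flagged rows become candidates for manual review, and once some are confirmed, they can seed a supervised classifier.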
I am trying to find outlier explanations using sensitivity analysis. Consider that my dataset contains 19 input values and 1 output value (so overall there are 20 columns, and all values are numerical). I have already built a prediction model, and I consider the values with high prediction errors to be outliers/anomalies. I have done the sensitivity analysis for individual input values, but in the dataset some values are correlated with other input values, e.g. …
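One way to respect the correlations is to first group the strongly correlated inputs and then perturb each group together rather than each column alone. A small sketch of the grouping step with pandas (synthetic data; the 0.8 threshold is an assumption to tune):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.9 + rng.normal(scale=0.1, size=300),  # strongly tied to x1
    "x3": rng.normal(size=300),                        # independent input
})

# Pairs of inputs whose |correlation| exceeds the threshold should be
# perturbed jointly in the sensitivity analysis, not one at a time
corr = df.corr().abs()
threshold = 0.8
groups = [{a, b} for a in corr.columns for b in corr.columns
          if a < b and corr.loc[a, b] > threshold]
```

A per-column perturbation of `x1` alone would create an (x1, x2) combination the model never saw, so the grouped perturbation gives a more honest sensitivity estimate.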
I am facing an issue with an Excel file. I have a sheet with 2 columns: Column A: time, incrementing per second; Column B: a particular value from a machine sensor. The problem I am facing is that when the machine is stopped (not in motion), the depth increment stops for that period and no entries are made in the sheet, and once the machine starts moving it again adds entries from the starting …
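If the goal is to restore the missing per-second rows, pandas can rebuild the full one-second index and fill the stopped periods. A sketch on an in-memory stand-in (for the real file, `pd.read_excel(path)` would produce the same two-column frame):

```python
import pandas as pd

# Stand-in for the sheet: a gap where the machine was stopped (seconds 2-4)
df = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01 00:00:00", "2024-01-01 00:00:01",
                            "2024-01-01 00:00:05"]),
    "sensor": [10.0, 11.0, 12.0],
}).set_index("time")

# Rebuild the complete per-second index, then carry the last reading
# forward through the stopped period (fillna(0) or interpolate() are
# alternatives, depending on what "stopped" should mean in the data)
full = pd.date_range(df.index.min(), df.index.max(), freq="s")
filled = df.reindex(full).ffill()
```

After this, every second between the first and last timestamp has exactly one row, so downstream per-second processing sees no gaps.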
I have a CSV file and would like to make the following modification to it:

```python
df = pandas.read_csv('some_file.csv')
df.index = df.index.map(lambda x: x[:-1])
df.to_csv('some_file.csv')
```

This takes the index, removes the last character, and then saves the file again. I have multiple problems with this solution, since my CSV is quite large (around 500 GB). First of all, reading and then writing seems not very efficient, since every line will be fully rewritten, which is not necessary, right? Furthermore, due to …
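At 500 GB, streaming the file line by line keeps memory use constant and never builds a DataFrame at all. A sketch with the stdlib only (it assumes a header row and that the index field contains no embedded commas or quotes; it builds its own tiny demo file):

```python
import os
import tempfile

def strip_last_index_char(src, dst):
    """Rewrite the CSV one line at a time, trimming the last character of
    the index field (the text before the first comma) on every data row."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.write(fin.readline())             # header passes through unchanged
        for line in fin:
            idx, rest = line.split(",", 1)     # assumes no commas inside the index
            fout.write(idx[:-1] + "," + rest)

# Demo on a small stand-in file
tmp = tempfile.mkdtemp()
src, dst = os.path.join(tmp, "in.csv"), os.path.join(tmp, "out.csv")
with open(src, "w") as f:
    f.write("id,value\nrow1x,10\nrow2y,20\n")
strip_last_index_char(src, dst)
with open(dst) as f:
    out = f.read()
```

Note that every line still has to be written once; no format that stores rows inline can shorten a field in place without rewriting what follows.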
NOTE: I'm not sure if this is the right forum for this question; if not, please advise.

Context: I am collecting a huge amount of data using an Android app placed in a vehicle. I collect data at ~1-second intervals for about 2 hours, which gives almost 7200 data points. These are the parameters:

- Timestamp (milliseconds)
- Latitude
- Longitude
- Speed
- Acceleration

Now I was looking at ways to simplify this data, as processing and rendering …
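A first simplification that often suffices is temporal downsampling: average each 10-second window, turning 7200 points into 720. A sketch with pandas on synthetic stand-in data shaped like the parameters above:

```python
import numpy as np
import pandas as pd

# Stand-in for the ~7200 one-per-second samples from the app
n = 7200
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=n, freq="s"),
    "lat": np.linspace(12.90, 13.00, n),      # hypothetical straight track
    "lon": np.linspace(77.50, 77.60, n),
    "speed": rng.normal(30, 5, n),
}).set_index("timestamp")

# One averaged row per 10-second window: 7200 points -> 720
reduced = df.resample("10s").mean()
```

If the track's shape matters more than uniform spacing, a line-simplification algorithm such as Ramer-Douglas-Peucker keeps the geometrically important points instead of averaging blindly.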
I am trying to import a data frame into Spark using Python's pyspark module. For this, I used a Jupyter Notebook and executed the code shown in the screenshot below. After that, I want to run this from CMD, so I save my Python code in a text file as test.py. Then I run that file in CMD using the command python test.py, as in the screenshot below. My task previously worked, but after 3 …
We have a production table that contains a bucket of customer data. A customer could be the same customer/person at location A and at location B; the records differ in how the name is spelled, in address disparities (lane vs. ln), and ultimately in the customer ID (PK/UID). We have built a query that pulls in the customer data, loads it into a staging table, and runs a similarity-coefficient library to check each record in the staging table …
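For illustration, the kind of pairwise check such a library performs can be sketched with the stdlib's difflib (the records and the 0.85 cutoff are hypothetical; production matching would normalize abbreviations like "lane"/"ln" first):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """0..1 character-level ratio; a cheap stand-in for the real
    similarity-coefficient library used against the staging table."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# The same person at two locations, spelled slightly differently
rec_a = "John Smith, 12 Maple Lane"
rec_b = "Jon Smith, 12 Maple Ln"

score = similarity(rec_a, rec_b)
is_duplicate = score > 0.85   # assumed threshold; tune against known pairs
```

In practice the threshold is calibrated on a sample of hand-labeled duplicate/non-duplicate pairs, since the right cutoff depends heavily on the data.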
So right now I am trying to create multiple variables from training data, and in the process I have run into an error:

Error in eval(predvars, data, env) : object '1.band1' not found

which is produced by these lines:

```r
for (i in 1:length(data_split)) {
  assign(paste("fit.lda", i, sep = ""),
         train(class ~ ., data = data_split[i], method = "lda",
               metric = metric, trControl = control))
}
```

Is it something that I did wrong, or is it something that can be fixed with another methodology?

EDIT: My dataset is a data frame which was created by …
I am sorry ahead of time if this seems like a basic question, but I had difficulty finding resources online that address it. In PyMC3, when building a basic model with a few variables, it is easy to define each on its own, like alpha = pm.Normal('alpha', mu=0, sd=1), and manually combine them. However, what are the standard approaches when one is dealing with dozens/hundreds of variables, each needing a prior? I see that the shape argument is helpful in defining …
I have a dataset consisting of sensor recordings of human movement. There are 22 classes of movement, like sitting or walking, and 19 sensor values. Each recording of a movement has about 1000 lines, contained in a CSV file. My problem: I don't know how to present those recordings to a neural network (TensorFlow) so that it can be trained on the movement classes and predict what was done in a recording from those 19,000 values (1000 lines × 19 sensors). I don't …
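The standard presentation is a 3-D tensor of shape (recordings, timesteps, sensors), with one class label per recording; sequence models then consume each recording as one sample. A sketch with numpy on synthetic stand-in data (5 recordings instead of the full dataset; labels are arbitrary class IDs from 0-21):

```python
import numpy as np

# Hypothetical stand-in: 5 recordings, each 1000 timesteps x 19 sensors,
# mirroring the one-CSV-per-recording layout in the question
rng = np.random.default_rng(0)
recordings = [rng.normal(size=(1000, 19)) for _ in range(5)]
labels = np.array([0, 3, 7, 3, 21])     # one movement class per recording

# RNNs/CNNs for time series expect (batch, timesteps, features)
X = np.stack(recordings)                # shape (5, 1000, 19)
y = labels
```

In Keras terms, `X` would feed a model whose first layer declares `input_shape=(1000, 19)` (e.g. an LSTM or 1-D convolution); recordings of unequal length are padded or truncated to a common timestep count first.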