I am working with a very large dataset that would benefit from training continuation with the xgb_model parameter in xgb.train(). The label (Y) of the dataset has 4 classes and is highly imbalanced, so I would like to generate per-label PR curves to evaluate the model's performance, and would thus need to treat each class as its own binary problem using a one-vs-rest classifier. After a lot of reading I haven't found an equivalent to sklearn's OneVsRestClassifier in …
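For the PR-curve part, no one-vs-rest wrapper is strictly needed: any multiclass probability matrix can be sliced per class. A minimal sketch with sklearn only (LogisticRegression stands in for the booster trained via xgb.train(); the dataset is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.preprocessing import label_binarize

# Synthetic 4-class imbalanced stand-in for the real data
X, y = make_classification(n_samples=2000, n_classes=4, n_informative=8,
                           weights=[0.7, 0.15, 0.1, 0.05], random_state=0)

# Any classifier exposing per-class probabilities works here; an xgboost
# booster's predict() with multi:softprob output slots in the same way
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)                    # shape (n_samples, 4)
y_bin = label_binarize(y, classes=[0, 1, 2, 3])  # one binary column per class

# One PR curve / average precision per class, treating each as one-vs-rest
ap_per_class = {}
for k in range(4):
    precision, recall, _ = precision_recall_curve(y_bin[:, k], proba[:, k])
    ap_per_class[k] = average_precision_score(y_bin[:, k], proba[:, k])
```

Each `(precision, recall)` pair can be plotted directly; no separate one-vs-rest training run is required just for evaluation.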
I learned that tools like Pachyderm version-control data, but I cannot see any difference between that tool and Git. I learned from this post that:

- It holds all your data in a central, accessible location
- It updates all dependent data sets when data is added to or changed in a data set
- It can run any transformation, as long as it runs in Docker, and accepts a file as input and outputs a file as result
- It versions all …
I am training a machine learning model (i.e., a classifier) on a large dataset. I know that I can get the same results using less data (about 30% of it), but I would like to avoid the trial-and-error process of finding the 'right' amount of data to retain. Of course I could write a script that automatically tries different thresholds, but I was wondering if there is any principled way of doing this. It seems strange that …
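One principled tool for exactly this question is a learning curve: score the model at increasing training-set sizes and stop where the curve flattens. A small sketch with sklearn (synthetic data; the estimator is a placeholder for the real classifier):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1500, random_state=0)

# Cross-validated score at 10%, 32.5%, ..., 100% of the training data;
# the point where validation score stops improving is the "right" amount
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

mean_val = val_scores.mean(axis=1)
```

Plotting `sizes` against `mean_val` replaces the manual threshold search with one systematic pass.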
I'm trying to fake data for a coffee shop. I have two features: age and menu. The menu includes various types of drinks, such as coffee [latte, espresso, mocca, etc.], tea [milktea, lemontea], and milk [freshmilk, matchamilk, etc.]. What I'm trying to do is fake the menu based on age: for example, if the age is higher than 15, 80% of those people will order coffee, chosen randomly from the list of coffees [latte, espresso, mocca, etc.], …
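The age-conditional rule above maps directly onto weighted category sampling. A minimal sketch with the stdlib only (the 80/10/10 split and the item lists are the hypothetical values from the question):

```python
import random

random.seed(0)  # reproducible demo

coffee = ["latte", "espresso", "mocca"]
tea = ["milktea", "lemontea"]
milk = ["freshmilk", "matchamilk"]

def fake_order(age):
    """Pick a drink conditioned on age: 80% coffee for customers over 15
    (10% tea, 10% milk as an assumed split), uniform categories otherwise."""
    if age > 15:
        category = random.choices([coffee, tea, milk], weights=[0.8, 0.1, 0.1])[0]
    else:
        category = random.choice([coffee, tea, milk])
    return random.choice(category)  # uniform pick within the category

orders = [fake_order(25) for _ in range(1000)]
```

Over many samples, roughly 80% of the over-15 orders land in the coffee list, with each coffee equally likely.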
I tried to use the OMP algorithm available in scikit-learn. My total data size, including both the target signal and the dictionary, is about 1 GB. However, when I ran the code, it exited with a memory error. The machine has 16 GB of RAM, so I don't think this should have happened. I added some logging to see where the error occurred and found that the data loaded completely into numpy arrays; it was the algorithm itself that caused the error. Can someone help me with this …
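One avenue worth checking: scikit-learn also exposes a Gram-matrix variant of OMP, which works on the precomputed (n_atoms × n_atoms) Gram matrix rather than the full sample-sized arrays, and can lower the solver's peak memory when there are many samples. A small sketch on synthetic data (sizes are illustrative, not the real 1 GB problem):

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp_gram

rng = np.random.default_rng(0)
D = rng.standard_normal((1000, 100))   # dictionary: n_samples x n_atoms
D /= np.linalg.norm(D, axis=0)         # OMP expects unit-norm atoms
y = D[:, :5] @ rng.standard_normal(5)  # signal built from 5 atoms

# Precompute Gram and D^T y once; the solver then only works with
# n_atoms-sized objects instead of the full signal-length arrays
gram = D.T @ D
Dy = D.T @ y
coef = orthogonal_mp_gram(gram, Dy, n_nonzero_coefs=5)
```

Whether this fixes the crash depends on where the allocation actually fails, but it removes the solver's largest temporaries.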
I am redesigning some classical algorithms for the Hadoop/MapReduce framework. I was wondering if there is any established approach for denoting Big-O-style expressions to measure time complexity. For example, hypothetically, a simple average calculation of n (= 1 billion) numbers is an O(n) + C operation using a simple for loop, or O(log n)? I am assuming division to be a constant-time operation for the sake of simplicity. If I break this massively parallelizable algorithm down for MapReduce, by dividing data over …
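The usual accounting splits the work per phase: with m mappers each handling n/m numbers, the map phase is O(n/m) per machine, and combining the m partial results is O(m) sequentially, or O(log m) depth if the reduction is a balanced tree. A toy sketch of the average in that shape (plain Python standing in for a real MapReduce job):

```python
from functools import reduce

def mapper(chunk):
    # Each of the m mappers does O(n/m) local work: a (sum, count) pair
    return (sum(chunk), len(chunk))

def reducer(a, b):
    # Combining two partials is O(1); O(m) total across partials,
    # or O(log m) depth when organized as a balanced reduction tree
    return (a[0] + b[0], a[1] + b[1])

data = list(range(1, 101))
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]  # 4 "machines"
total, count = reduce(reducer, map(mapper, chunks))
average = total / count
```

So the overall cost is typically written as O(n/m + log m) in a work/depth style, plus communication terms that pure Big-O on a single machine has no slot for.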
Even with the naked eye we can tell that in region 5161 the network usage is high, so that is the anomaly in my case. Why, then, do we want to apply k-means and other machine learning algorithms to find anomalies in our data?
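The short answer is scale and automation: eyeballing works for one obvious spike in one series, but not for thousands of series or subtle anomalies. As an illustration of how k-means turns the visual judgment into a rule, here is a sketch on synthetic usage data with an injected spike (one common convention: points landing in a tiny cluster are flagged as anomalous):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
usage = rng.normal(50, 5, size=500)
usage[300:310] += 60            # an injected spike, like the one visible by eye

X = usage.reshape(-1, 1)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# The spike points separate into their own small cluster; flagging the
# smallest cluster finds the anomaly with no human looking at the plot
counts = np.bincount(km.labels_)
anomalous_cluster = int(np.argmin(counts))
anomalies = np.where(km.labels_ == anomalous_cluster)[0]
```

The same code runs unchanged on the next million samples, which is the point of using an algorithm at all.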
I have been handed about 40 GB of CSV files that I need to turn into a database. The files are arranged in a directory structure that uses location within that structure to encode the relationships between the different CSV files:

/base
    ancillary_information.csv
    /Run1
        /Scenario A
            one.csv
            two.csv
            ...
        /Scenario B
            one.csv
            ...
    /Run2
        /Scenario C
            one.csv
            ...
        /Scenario D
            one.csv
            ...
    /Run3
        ...

Each CSV file is for a single …
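One common pattern for this layout: walk the tree, parse the run and scenario out of each file's path, and store them as columns so the directory structure survives as queryable data. A minimal sketch with the stdlib and SQLite (the two data columns are a placeholder for the real, unknown CSV schema; it builds its own tiny demo tree):

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

def ingest(base_dir, db_path=":memory:"):
    """Load base/Run*/Scenario*/<file>.csv into one SQLite table, with the
    run and scenario names (taken from the path) as extra columns."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS measurements "
                "(run TEXT, scenario TEXT, source_file TEXT, c1 TEXT, c2 TEXT)")
    for f in sorted(Path(base_dir).glob("*/*/*.csv")):
        run, scenario = f.parts[-3], f.parts[-2]   # path encodes the relation
        with open(f, newline="") as fh:
            for row in csv.reader(fh):
                con.execute("INSERT INTO measurements VALUES (?,?,?,?,?)",
                            (run, scenario, f.name, *row[:2]))
    con.commit()
    return con

# Tiny demo tree mirroring the layout in the question
base = Path(tempfile.mkdtemp())
d = base / "Run1" / "Scenario A"
d.mkdir(parents=True)
(d / "one.csv").write_text("x1,y1\nx2,y2\n")

con = ingest(base)
rows = con.execute("SELECT run, scenario, c1 FROM measurements").fetchall()
```

For 40 GB, the same loop works unchanged; only the INSERTs would be batched (executemany) for speed.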
I need to test an algorithm that computes a function on a dataframe, where in each execution I drop a column and compute the function. This is an example in Python pyspark, but without using rdd:

```python
df2581 = spark.sparkContext.parallelize([Row(a=1, b=3, c=5, d=7, e=9)]).toDF()
df2581.show()

wo = df2581.rdd.flatMap(lambda x: x[1:]).map(lambda a: print(type(a)))
wo.collect()

def f(x):
    list3 = []
    index = 0
    list2 = x
    for j in x:
        list = array(x)
        list.remove(list[index])
        list3 = list.copy()
        index += 1
    return list3

colu = df2581.columns

def add(x, y):
    return x + y
```
…
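The drop-one-column-per-run loop itself needs no RDDs at all. A pandas sketch of the shape (in pyspark the same idea is `df2581.drop(col)` inside the loop, since DataFrame.drop returns a new dataframe without that column; `f` here is a placeholder metric):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [3], "c": [5], "d": [7], "e": [9]})

def f(frame):
    # placeholder for the real function computed on the reduced dataframe
    return int(frame.to_numpy().sum())

# One execution per column: compute f on the dataframe minus that column
results = {col: f(df.drop(columns=col)) for col in df.columns}
```

Each entry of `results` is the function's value with exactly one column removed, which is the leave-one-out pattern the question describes.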
The dataset I am currently working on has more than 100 CSV files, each more than 250 MB in size. These files contain time series data captured at different locations, and all of them have the same features as columns. As I understand it, I must combine these into one single CSV file to use the data in a CNN, RNN, or any other network, and the result is expected to be more than 20 GB. But this is an unacceptable …
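The combining step is actually unnecessary: training loops only need batches of arrays, so a generator that yields one file's worth of data at a time avoids ever materializing the 20 GB table (this is essentially what tf.data and similar input pipelines do). A sketch that demonstrates the pattern on two tiny stand-in files:

```python
import glob
import os
import tempfile
import numpy as np
import pandas as pd

def csv_batches(pattern, feature_cols):
    """Yield one (n_rows, n_features) float32 array per CSV file;
    the files are never concatenated into a single giant table."""
    for path in sorted(glob.glob(pattern)):
        df = pd.read_csv(path, usecols=feature_cols)
        yield np.asarray(df[feature_cols], dtype=np.float32)

# Demo on two small stand-in files (the real ~250 MB files stream the same way)
tmp = tempfile.mkdtemp()
cols = ["sensor1", "sensor2"]
for name in ("loc_a.csv", "loc_b.csv"):
    pd.DataFrame({"sensor1": [1.0, 2.0], "sensor2": [3.0, 4.0]}).to_csv(
        os.path.join(tmp, name), index=False)

batches = list(csv_batches(os.path.join(tmp, "*.csv"), cols))
```

Peak memory is then bounded by the largest single file, not the whole dataset; the generator plugs straight into a training loop or into `tf.data.Dataset.from_generator`.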
I have a big dataset of fake transactions for a company. Each row contains the username, credit card number, time, device used, and amount of money in the transaction. I need to classify each transaction as either malicious or not malicious, and I am at a loss for where to start. Doing it by hand would be silly. I was thinking of possibly checking how often a credit card is used, whether it is consistently used at a certain …
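With no labels to start from, one common first step is unsupervised anomaly detection on engineered numeric features (amount, hour of day, card-usage frequency, and so on). A hedged sketch with sklearn's IsolationForest on synthetic stand-in features:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical per-transaction features: amount, hour of day,
# number of transactions by this card in the last 24 h
normal = np.column_stack([rng.normal(40, 10, 500),
                          rng.integers(8, 22, 500),
                          rng.integers(1, 4, 500)]).astype(float)
suspicious = np.array([[900.0, 3, 25]])   # huge amount, 3 a.m., heavy card reuse
X = np.vstack([normal, suspicious])

# No labels needed; predict() returns -1 for rows the forest isolates easily
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = iso.predict(X)
```

The flagged rows become candidates for manual review, and once some are confirmed, they can seed a supervised classifier.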
I am trying to find outlier explanations using sensitivity analysis. Consider that my dataset contains 19 input values and 1 output value (so overall there are 20 columns, and all values are numerical). I have already built a prediction model, and I consider the values with high prediction errors to be outliers/anomalies. I have done the sensitivity analysis for individual input values, but in the dataset some values are correlated with other input values, e.g. …
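One way to respect the correlations is to first group the strongly correlated inputs and then perturb each group together rather than each column alone. A small sketch of the grouping step with pandas (synthetic data; the 0.8 threshold is an assumption to tune):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.9 + rng.normal(scale=0.1, size=300),  # strongly tied to x1
    "x3": rng.normal(size=300),                        # independent input
})

# Pairs of inputs whose |correlation| exceeds the threshold should be
# perturbed jointly in the sensitivity analysis, not one at a time
corr = df.corr().abs()
threshold = 0.8
groups = [{a, b} for a in corr.columns for b in corr.columns
          if a < b and corr.loc[a, b] > threshold]
```

A per-column perturbation of `x1` alone would create an (x1, x2) combination the model never saw, so the grouped perturbation gives a more honest sensitivity estimate.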
I am facing an issue with an Excel file. I have a sheet with 2 columns: Column A: time, incrementing per second; Column B: a particular value from a machine sensor. The problem I am facing is that when the machine is stopped (not in motion), the depth increment stops for that period and no entries are made in the sheet, and once the machine starts moving it again adds entries from the starting …
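If the goal is to restore the missing per-second rows, pandas can rebuild the full one-second index and fill the stopped periods. A sketch on an in-memory stand-in (for the real file, `pd.read_excel(path)` would produce the same two-column frame):

```python
import pandas as pd

# Stand-in for the sheet: a gap where the machine was stopped (seconds 2-4)
df = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01 00:00:00", "2024-01-01 00:00:01",
                            "2024-01-01 00:00:05"]),
    "sensor": [10.0, 11.0, 12.0],
}).set_index("time")

# Rebuild the complete per-second index, then carry the last reading
# forward through the stopped period (fillna(0) or interpolate() are
# alternatives, depending on what "stopped" should mean in the data)
full = pd.date_range(df.index.min(), df.index.max(), freq="s")
filled = df.reindex(full).ffill()
```

After this, every second between the first and last timestamp has exactly one row, so downstream per-second processing sees no gaps.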
I have a CSV file and would like to make the following modification to it:

```python
df = pandas.read_csv('some_file.csv')
df.index = df.index.map(lambda x: x[:-1])
df.to_csv('some_file.csv')
```

This takes the index, removes the last character, and then saves the file again. I have multiple problems with this solution, since my CSV is quite large (around 500 GB). First of all, reading and then writing seems not very efficient, since every line will be fully rewritten, which is not necessary, right? Furthermore, due to …
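At 500 GB, streaming the file line by line keeps memory use constant and never builds a DataFrame at all. A sketch with the stdlib only (it assumes a header row and that the index field contains no embedded commas or quotes; it builds its own tiny demo file):

```python
import os
import tempfile

def strip_last_index_char(src, dst):
    """Rewrite the CSV one line at a time, trimming the last character of
    the index field (the text before the first comma) on every data row."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.write(fin.readline())             # header passes through unchanged
        for line in fin:
            idx, rest = line.split(",", 1)     # assumes no commas inside the index
            fout.write(idx[:-1] + "," + rest)

# Demo on a small stand-in file
tmp = tempfile.mkdtemp()
src, dst = os.path.join(tmp, "in.csv"), os.path.join(tmp, "out.csv")
with open(src, "w") as f:
    f.write("id,value\nrow1x,10\nrow2y,20\n")
strip_last_index_char(src, dst)
with open(dst) as f:
    out = f.read()
```

Note that every line still has to be written once; no format that stores rows inline can shorten a field in place without rewriting what follows.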
NOTE: I'm not sure if this is the right forum for this question; if not, please advise.

Context: I am collecting a huge amount of data using an Android app placed in a vehicle. I collect data at ~1-second intervals for about 2 hours, which gives almost 7200 data points. These are the parameters:

- Timestamp (milliseconds)
- Latitude
- Longitude
- Speed
- Acceleration

Now I was looking at ways to simplify this data, as processing and rendering …
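A first simplification that often suffices is temporal downsampling: average each 10-second window, turning 7200 points into 720. A sketch with pandas on synthetic stand-in data shaped like the parameters above:

```python
import numpy as np
import pandas as pd

# Stand-in for the ~7200 one-per-second samples from the app
n = 7200
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=n, freq="s"),
    "lat": np.linspace(12.90, 13.00, n),      # hypothetical straight track
    "lon": np.linspace(77.50, 77.60, n),
    "speed": rng.normal(30, 5, n),
}).set_index("timestamp")

# One averaged row per 10-second window: 7200 points -> 720
reduced = df.resample("10s").mean()
```

If the track's shape matters more than uniform spacing, a line-simplification algorithm such as Ramer-Douglas-Peucker keeps the geometrically important points instead of averaging blindly.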
I am trying to import a data frame into Spark using Python's pyspark module. For this, I used a Jupyter Notebook and executed the code shown in the screenshot below. After that, I want to run this from CMD, so I save my Python code in a text file as test.py. Then I run that file in CMD using the command python test.py, as in the screenshot below. My task previously worked, but after 3 …
We have a production table that contains a bucket of customer data. A customer could be the same customer/person at location A and at location B; the records differ in how the name is spelled, in address disparities (lane vs. ln), and ultimately in the customer ID (PK/UID). We have built a query that pulls in the customer data, loads it into a staging table, and runs a similarity-coefficient library to check each record in the staging table …
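For illustration, the kind of pairwise check such a library performs can be sketched with the stdlib's difflib (the records and the 0.85 cutoff are hypothetical; production matching would normalize abbreviations like "lane"/"ln" first):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """0..1 character-level ratio; a cheap stand-in for the real
    similarity-coefficient library used against the staging table."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# The same person at two locations, spelled slightly differently
rec_a = "John Smith, 12 Maple Lane"
rec_b = "Jon Smith, 12 Maple Ln"

score = similarity(rec_a, rec_b)
is_duplicate = score > 0.85   # assumed threshold; tune against known pairs
```

In practice the threshold is calibrated on a sample of hand-labeled duplicate/non-duplicate pairs, since the right cutoff depends heavily on the data.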
So right now I am trying to create multiple variables from training data, and in the process I have run into an error:

Error in eval(predvars, data, env) : object '1.band1' not found

which is produced by these lines:

```r
for (i in 1:length(data_split)) {
  assign(paste("fit.lda", i, sep = ""),
         train(class ~ ., data = data_split[i], method = "lda",
               metric = metric, trControl = control))
}
```

Is it something that I did wrong, or is it something that can be fixed with another methodology?

EDIT: My dataset is a data frame which was created by …
I am sorry ahead of time if this seems like a basic question, but I had difficulty finding resources online that address it. In PyMC3, when building a basic model with a few variables, it is easy to define each on its own, like alpha = pm.Normal('alpha', mu=0, sd=1), and manually combine them. However, what are the standard approaches when one is dealing with dozens/hundreds of variables, each needing a prior? I see that the shape argument is helpful in defining …
I have a dataset consisting of sensor recordings of human movement. There are 22 classes of movement, like sitting or walking, and 19 sensor values. Each recording of a movement has about 1000 lines, contained in a CSV file. My problem: I don't know how to present those recordings to a neural network (TensorFlow) so that it can be trained on the movement classes and predict what was done in a recording from those 19,000 values (1000 lines × 19 sensors). I don't …
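The standard presentation is a 3-D tensor of shape (recordings, timesteps, sensors), with one class label per recording; sequence models then consume each recording as one sample. A sketch with numpy on synthetic stand-in data (5 recordings instead of the full dataset; labels are arbitrary class IDs from 0-21):

```python
import numpy as np

# Hypothetical stand-in: 5 recordings, each 1000 timesteps x 19 sensors,
# mirroring the one-CSV-per-recording layout in the question
rng = np.random.default_rng(0)
recordings = [rng.normal(size=(1000, 19)) for _ in range(5)]
labels = np.array([0, 3, 7, 3, 21])     # one movement class per recording

# RNNs/CNNs for time series expect (batch, timesteps, features)
X = np.stack(recordings)                # shape (5, 1000, 19)
y = labels
```

In Keras terms, `X` would feed a model whose first layer declares `input_shape=(1000, 19)` (e.g. an LSTM or 1-D convolution); recordings of unequal length are padded or truncated to a common timestep count first.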