I learned that tools like Pachyderm version-control data, but I cannot see any difference between that tool and Git. I learned from this post that: it holds all your data in a central, accessible location; it updates all dependent data sets when data is added to or changed in a data set; it can run any transformation, as long as it runs in a Docker container and accepts a file as input and outputs a file as a result; it versions all …
I hope it's allowed to ask here, but I am looking for a dataset (the format is not that important) that is similar to SQuAD but also contains false answers to the questions. I want to use it to fine-tune GPT-3, and all I find is either multiple-choice questions based on a text, but with no distractors, or classical quizzes that have no context before each question. I have code that generates distractors, and I can just plug …
I have a small data set and I want to assess the effect of a certain type of case on the overall model performance. For example, is the model biased against people of a certain age group? Using a single train-test split, the number of cases of a particular type becomes quite small, and I suspect findings may be due to randomness. Would it make sense in this scenario to use multiple train-test splits, compute the average performances, and …
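A minimal sketch of that repeated-split idea in R, assuming the data live in a data frame df and using a hypothetical fit_and_score() routine (not from the question) that trains on one split and returns a single performance number:

    set.seed(42)
    n_repeats <- 30
    scores <- replicate(n_repeats, {
      # A fresh random 80/20 split on every repeat.
      idx   <- sample(nrow(df), size = floor(0.8 * nrow(df)))
      train <- df[idx, ]
      test  <- df[-idx, ]
      fit_and_score(train, test)  # hypothetical: returns one number per split
    })
    mean(scores)  # average performance across splits
    sd(scores)    # spread: how much randomness moves the estimate

The standard deviation across repeats gives a rough sense of whether a difference seen for one subgroup is larger than split-to-split noise.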
In a data classification problem (with supervised learning), what should the ideal difference between the training set accuracy and the testing set accuracy be? What should the ideal range be? Is a difference of 5% between the training and testing set accuracies okay? Or does it signify overfitting?
I have no problem importing Excel-formatted data into R/RStudio and using all the other R packages that I use. But when I want to use the glmnet package to develop a regularization model, I invariably run into the following error (after specifying my regularization model and attempting to run it): Error in storage.mode(y) <- "double": (list) object cannot be coerced to type 'double'. Here is what I have already tried to resolve this: de-format the numbers in Excel (no …
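This error typically means glmnet() was handed a data frame (which is a list underneath) where it expects a numeric matrix x and a numeric vector y. A minimal sketch, assuming the imported data sit in a data frame df with a hypothetical response column named outcome:

    library(glmnet)

    # glmnet() wants a numeric matrix x and a numeric vector y, not a
    # data frame; model.matrix() also expands any factor columns.
    x <- model.matrix(outcome ~ ., data = df)[, -1]  # drop the intercept column
    y <- as.numeric(df$outcome)

    fit <- glmnet(x, y, alpha = 1)  # e.g. a lasso fit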
This is something I have been wondering about for ages, but I have never been able to get an answer. I am trying to understand how to make a data frame in R where each element of the data frame is itself a vector or a matrix. For example, let's say we have a regular vector V whose elements are real numbers. Then to access any number we would use V[3], which would give the third element of said vector. Now I want to …
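One way to get this in R is a list column: a data-frame column that holds arbitrary objects, one per row. A small sketch with made-up contents:

    # Wrapping with I() stops data.frame()/assignment from flattening the list.
    df <- data.frame(id = 1:2)
    df$payload <- I(list(c(1.5, 2.5, 3.5),        # a vector in row 1
                         matrix(1:4, nrow = 2)))  # a matrix in row 2

    df$payload[[1]]        # the whole vector stored in row 1
    df$payload[[2]][2, 1]  # element [2, 1] of the matrix stored in row 2

Note the double brackets: df$payload[1] returns a one-element list, while df$payload[[1]] returns the stored object itself.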
I am trying to build a predictive churn model that will identify customers who are likely to churn. I am defining a churned user as someone who hasn't transacted within 60 days; 90% of all transactions occur within 60 days of one another, so this feels reasonable. However, I have very limited behavioural data. I have a record of each user's transactions and I have access to Google Analytics (GA). GA does not, however, allow me to track the specific …
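For what it's worth, the 60-day rule is straightforward to turn into a label. A sketch in R, assuming a transaction log txns with hypothetical columns user_id and txn_date (a Date), and an assumed snapshot date:

    # Last transaction per user (txns and its column names are assumptions).
    last_txn <- aggregate(txn_date ~ user_id, data = txns, FUN = max)

    # A user is labelled churned if their last transaction falls more than
    # 60 days before the chosen snapshot date.
    snapshot <- as.Date("2023-01-01")  # hypothetical scoring date
    last_txn$churned <- as.numeric(snapshot - last_txn$txn_date) > 60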
I need advice on the best way to represent the data below so it can be fed into an ML algorithm (yet to be decided on). This is from the online order-processing domain. An order consists of a variable number of items. Each item can be located in several different warehouses; again, this is a variable number. The entire order, with multiple items and multiple warehouses per item, needs to be processed as one training sample. The goal is to …
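Since the goal is cut off above, this is only one common option rather than a recommendation: flatten each order into fixed-size slots with zero padding. A sketch in R with hypothetical caps and a hypothetical encoder:

    # Assumed caps: up to 5 items per order, up to 3 warehouses per item.
    max_items <- 5
    max_wh    <- 3

    # `order` is a list with one numeric vector of warehouse values per item.
    # Unused item or warehouse slots stay 0 (padding).
    encode_order <- function(order) {
      slots <- matrix(0, nrow = max_items, ncol = max_wh)
      for (i in seq_len(min(length(order), max_items))) {
        wh <- order[[i]][seq_len(min(length(order[[i]]), max_wh))]
        slots[i, seq_along(wh)] <- wh
      }
      as.vector(slots)  # one fixed-length vector = one training sample
    }

    encode_order(list(c(1, 2), c(3)))  # two items, padded to length 15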
    device   date    act power 1   react power 2
    ---------------------------------------------
    M1       05-02   2             3
    M2       05-02   4             2
    M3       05-02   3             4
    M1       06-02   1             2
    M2       07-02   3             4
    ---------------------------------------------
                     need sum      need sum

Say that I only need the sums for M1 and M2 from that table. How could I add a variable that contains the sum of power, grouped by date and device? I don't know if it is desirable to have something like this? Or how …
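A base-R sketch, using a hypothetical data frame that mirrors the table (the column names are my own):

    readings <- data.frame(
      device        = c("M1", "M2", "M3", "M1", "M2"),
      date          = c("05-02", "05-02", "05-02", "06-02", "07-02"),
      act_power_1   = c(2, 4, 3, 1, 3),
      react_power_2 = c(3, 2, 4, 2, 4)
    )

    # Keep only M1 and M2, then total both power columns per device and date.
    m12 <- readings[readings$device %in% c("M1", "M2"), ]
    aggregate(cbind(act_power_1, react_power_2) ~ device + date,
              data = m12, FUN = sum)

    # To add the group sum as a new column without collapsing rows, ave() works:
    m12$act_sum <- ave(m12$act_power_1, m12$device, m12$date, FUN = sum)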
Hi there, my question is about how to read ECDF graphs. I am still quite unsure what the jumps/zig-zags in the graph mean and what is happening when there is a horizontal line, and so on. I would be happy if someone could explain to me how I am supposed to read this graph and what information I can get from it. Thank you.
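Not a comment on the specific graph, but a tiny R example shows the general reading: the ECDF jumps by 1/n at each observed value and runs flat between observations.

    x <- c(1, 2, 2, 5)  # hypothetical sample of n = 4 values
    F <- ecdf(x)

    F(2)    # 0.75: three of four values are <= 2 (the tied 2s double the jump)
    F(4)    # still 0.75: a flat stretch means no observations between 2 and 5
    plot(F) # staircase: vertical jumps at data points, horizontal runs between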
I have a data set with date features like 01/01/2019 and I would like to use KNN. However, I cannot find a good transformation for dates that gives a meaningful distance result for the last feature. For example:

    f1 | 1  | 2 | 3  | 4 | 01/01/2019
    f2 | 10 | 3 | 12 | 1 | 14/01/2019

Does anyone have any recommendations?
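One simple option (an assumption about what counts as meaningful here) is to convert each date to a numeric day offset, so distance on that feature is literally the number of days apart:

    dates <- c("01/01/2019", "14/01/2019")

    # Days since 1970-01-01; subtracting the minimum gives 0 and 13, so
    # Euclidean distance on this feature counts days between rows.
    days <- as.numeric(as.Date(dates, format = "%d/%m/%Y"))
    days - min(days)

Like the other features, the result should still be rescaled before KNN (e.g. with scale()), or the day counts will dominate the distance.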
I am creating a bootcamp for data analysts, and for two days I have been looking for a good dataset, fit for commercial use, that I can use to create Tableau and Power BI tutorials. Even on Kaggle some datasets are licensed as CC0, but when you track back to the company the data was scraped from, it states that the data shouldn't be used commercially (e.g. the Zomato dataset). Are there any good data sources which I can use …
I am using ggplot to compare 114 unique studies for a particular variable I'm interested in. This is what I have used: ggplot(steps, aes(x = factor(edu))) + geom_bar(aes(y = ..count.., group = id_study)) + facet_wrap(~id_study). Whilst this works, all 114 studies are plotted on one page and the formatting is all squashed. How do I split this over 4x4 pages? Many thanks, S. Edit: As there are 114 unique studies, I have 5 pages in total. 1) ggplot(steps, aes(x=factor(edu))) + …
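A sketch using the ggforce package (my suggestion, not something named in the question), whose facet_wrap_paginate() draws a fixed grid of panels per page:

    library(ggplot2)
    library(ggforce)  # provides facet_wrap_paginate()

    n_pages <- ceiling(114 / 16)  # 4 x 4 = 16 panels per page
    for (i in seq_len(n_pages)) {
      p <- ggplot(steps, aes(x = factor(edu))) +
        geom_bar(aes(y = ..count.., group = id_study)) +
        facet_wrap_paginate(~id_study, ncol = 4, nrow = 4, page = i)
      print(p)  # or ggsave(sprintf("page_%02d.png", i), p)
    }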
Hi Data Science Community! I am a new data intern and I have been stuck on this question for a while. Here is a sample dataset I am working with:

    Customer | Manufacturer A Spending | Manufacturer B Spending | Manufacturer A Cost per Product (CPP) | Manufacturer B Cost per Product (CPP) | Product Cost Difference (B-A) | Product Cost Difference in %
    1        | 400000                  | 360000                  | 44                                    | 45                                    | 1                             | 1/45
    2        | 300000                  | 310000                  | 23                                    | 21                                    | -2                            | -2/21
    3        | 100000                  | 106000                  | 1.4                                   | 1.6                                   | 0.2                           | 0.2/1.6

I …
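Reading the last two columns: the difference is Manufacturer B's CPP minus Manufacturer A's, and the percentage divides that difference by B's CPP. A worked check for customer 3 (my arithmetic, inferred from the table):

    cpp_a <- 1.4
    cpp_b <- 1.6

    diff <- cpp_b - cpp_a  # 0.2, matching the "(B-A)" column
    diff / cpp_b           # 0.125, i.e. the 0.2/1.6 entry, or 12.5%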
I am using two binary classifiers and measuring their accuracy over a dataset, where accuracy is defined as the ratio of correct to incorrect predictions. Do I need to take accuracies sampled over multiple experiments and use them as the data for a t-test? Can someone please explain? Also, what will the result of the t-test convey? Thanks in advance.
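A minimal sketch of that setup in R, with hypothetical per-experiment accuracies; the pairing assumes both classifiers were evaluated on the same splits:

    # Made-up accuracies from repeated experiments (same splits for both models).
    acc_a <- c(0.81, 0.79, 0.84, 0.80, 0.82)
    acc_b <- c(0.78, 0.77, 0.80, 0.79, 0.76)

    # The p-value answers: how likely is a mean difference this large if the
    # two classifiers actually perform the same?
    t.test(acc_a, acc_b, paired = TRUE)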
I am trying to build a model that will predict the communication loss of a wireless device. For now I am using RandomForestClassifier, with Device and Location as the features. I am getting both the train score and the test score as 99%, so I am pretty sure the model is giving a biased result. One of the reasons might be that records of communication-loss events are very rare compared to records with no communication loss. Some …
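A quick way to see whether imbalance alone explains the 99% is to compare against the majority-class baseline; here sketched in R with made-up label counts:

    # Hypothetical label distribution: 990 "no loss" records, 10 "loss" records.
    labels <- c(rep("no_loss", 990), rep("loss", 10))

    # A model that always predicts "no_loss" already scores 99% accuracy, so
    # accuracy near this baseline says nothing about detecting actual losses.
    max(table(labels)) / length(labels)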
I am trying to get a dataset on electrolytic capacitor ageing, and I have not been able to find one that shows the ripple current and the voltage, which I need in order to calculate the Equivalent Series Resistance (a nice parameter for checking degradation). I have looked on the typical sites (Kaggle, data.world, …) but found none. Could someone recommend a site? Thank you!
I am working on an algorithm whose results are being compared with another model's using a confidence interval at 90%. Can this be called a statistical test? I read an article that talked about a statistical test with some confidence level. Is a confidence level the same thing as a confidence interval in statistical tests?
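For intuition, R's t.test() reports both views at once: the p-value (the test) and the matching interval, where a 90% interval corresponds to a test at alpha = 0.10. Hypothetical numbers:

    a <- c(0.81, 0.79, 0.84, 0.80)  # made-up scores, model A
    b <- c(0.78, 0.77, 0.80, 0.79)  # made-up scores, model B

    # One call returns the test (p-value) and the 90% confidence interval for
    # the mean difference; if the interval excludes 0, the test at
    # alpha = 0.10 rejects "no difference".
    t.test(a, b, conf.level = 0.90)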
I have some data like:

           hr1  hr2  hr3  hr4  hr5  hr6  hr7
    usr1   1    0    0    0    0    0    0
    usr2   0    1    1    0    0    0    0
    usr3   0    1    0    0    0    0    0
    usr4   1    0    0    0    0    0    0
    usr5   1    1    1    1    1    1    1

How do I categorize this data into bins like hr1-hr3 and hr4-hr7, or any other bins?
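One straightforward reading of "binning" here is to sum each group of hour columns per user. A base-R sketch, rebuilding the table above as a data frame (names assumed from the layout):

    usage <- data.frame(
      hr1 = c(1, 0, 0, 1, 1), hr2 = c(0, 1, 1, 0, 1), hr3 = c(0, 1, 0, 0, 1),
      hr4 = c(0, 0, 0, 0, 1), hr5 = c(0, 0, 0, 0, 1), hr6 = c(0, 0, 0, 0, 1),
      hr7 = c(0, 0, 0, 0, 1),
      row.names = c("usr1", "usr2", "usr3", "usr4", "usr5")
    )

    # Collapse the hourly columns into two bins by summing within each group.
    usage$hr1_3 <- rowSums(usage[, c("hr1", "hr2", "hr3")])
    usage$hr4_7 <- rowSums(usage[, c("hr4", "hr5", "hr6", "hr7")])

Any other bin boundaries work the same way: pick the column subset and rowSums() it.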