I learned that tools like Pachyderm version-control data, but I cannot see any difference between that tool and Git. I learned from this post that: it holds all your data in a central, accessible location; it updates all dependent data sets when data is added to or changed in a data set; it can run any transformation, as long as it runs in a Docker container and accepts a file as input and outputs a file as a result; it versions all …
I hope it's allowed to ask here, but I am looking for a dataset (the format is not that important) that is similar to SQuAD but also contains false answers to the questions. I want to use it to fine-tune GPT-3, and all I find is either multiple-choice questions based on a text, but with no distractors, or classical quizzes that have no context before each question. I have code that generates distractors, and I can just plug …
I have a small data set and I want to assess the effect of a certain type of case on the overall model performance. For example, is the model biased against people of a certain age group? Using a single train-test split, the number of cases of a particular type becomes quite small, and I suspect findings may be due to randomness. Would it make sense in this scenario to use multiple train-test splits, compute the average performances, and …
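A minimal sketch of that repeated-split idea in R, assuming the data live in a data frame df and using a hypothetical fit_and_score() routine (not from the question) that trains on one split and returns a single performance number:

    set.seed(42)
    n_repeats <- 30
    scores <- replicate(n_repeats, {
      # A fresh random 80/20 split on every repeat.
      idx   <- sample(nrow(df), size = floor(0.8 * nrow(df)))
      train <- df[idx, ]
      test  <- df[-idx, ]
      fit_and_score(train, test)  # hypothetical: returns one number per split
    })
    mean(scores)  # average performance across splits
    sd(scores)    # spread: how much randomness moves the estimate

The standard deviation across repeats gives a rough sense of whether a difference seen for one subgroup is larger than split-to-split noise.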
In a data classification problem (with supervised learning), what should the ideal difference between the training set accuracy and the testing set accuracy be? What should the ideal range be? Is a difference of 5% between the training and testing set accuracies okay? Or does it signify overfitting?
I have no problem importing Excel-formatted data into R/RStudio and using all the other R packages that I use. But when I want to use the glmnet package to develop a regularization model, I invariably run into the following error (after specifying my regularization model and attempting to run it): Error in storage.mode(y) <- "double": (list) object cannot be coerced to type 'double'. Here is what I have already tried to resolve this: de-format the numbers in Excel (no …
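This error typically means glmnet() was handed a data frame (which is a list underneath) where it expects a numeric matrix x and a numeric vector y. A minimal sketch, assuming the imported data sit in a data frame df with a hypothetical response column named outcome:

    library(glmnet)

    # glmnet() wants a numeric matrix x and a numeric vector y, not a
    # data frame; model.matrix() also expands any factor columns.
    x <- model.matrix(outcome ~ ., data = df)[, -1]  # drop the intercept column
    y <- as.numeric(df$outcome)

    fit <- glmnet(x, y, alpha = 1)  # e.g. a lasso fit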
This is something I have been wondering about for ages, but I have never been able to get an answer. I am trying to understand how to make a data frame in R where each element of the data frame is itself a vector or a matrix. For example, let's say we have a regular vector V whose elements are real numbers. Then to access any number we would use V[3], which would give the third element of said vector. Now I want to …
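One way to get this in R is a list column: a data-frame column that holds arbitrary objects, one per row. A small sketch with made-up contents:

    # Wrapping with I() stops data.frame()/assignment from flattening the list.
    df <- data.frame(id = 1:2)
    df$payload <- I(list(c(1.5, 2.5, 3.5),        # a vector in row 1
                         matrix(1:4, nrow = 2)))  # a matrix in row 2

    df$payload[[1]]        # the whole vector stored in row 1
    df$payload[[2]][2, 1]  # element [2, 1] of the matrix stored in row 2

Note the double brackets: df$payload[1] returns a one-element list, while df$payload[[1]] returns the stored object itself.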
I am trying to build a predictive churn model that will identify customers who are likely to churn. I am defining a churned user as someone who hasn't transacted within 60 days; 90% of all transactions occur within 60 days of one another, so this feels reasonable. However, I have very limited behavioural data. I have a record of each user's transactions and I have access to Google Analytics (GA). GA does not, however, allow me to track the specific …
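For what it's worth, the 60-day rule is straightforward to turn into a label. A sketch in R, assuming a transaction log txns with hypothetical columns user_id and txn_date (a Date), and an assumed snapshot date:

    # Last transaction per user (txns and its column names are assumptions).
    last_txn <- aggregate(txn_date ~ user_id, data = txns, FUN = max)

    # A user is labelled churned if their last transaction falls more than
    # 60 days before the chosen snapshot date.
    snapshot <- as.Date("2023-01-01")  # hypothetical scoring date
    last_txn$churned <- as.numeric(snapshot - last_txn$txn_date) > 60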
I need advice on the best way to represent the data below so it can be fed into an ML algorithm (yet to be decided on). This is from the online order-processing domain. An order consists of a variable number of items. Each item can be located in several different warehouses; again, this is a variable number. The entire order, with multiple items and multiple warehouses per item, needs to be processed as one training sample. The goal is to …
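Since the goal is cut off above, this is only one common option rather than a recommendation: flatten each order into fixed-size slots with zero padding. A sketch in R with hypothetical caps and a hypothetical encoder:

    # Assumed caps: up to 5 items per order, up to 3 warehouses per item.
    max_items <- 5
    max_wh    <- 3

    # `order` is a list with one numeric vector of warehouse values per item.
    # Unused item or warehouse slots stay 0 (padding).
    encode_order <- function(order) {
      slots <- matrix(0, nrow = max_items, ncol = max_wh)
      for (i in seq_len(min(length(order), max_items))) {
        wh <- order[[i]][seq_len(min(length(order[[i]]), max_wh))]
        slots[i, seq_along(wh)] <- wh
      }
      as.vector(slots)  # one fixed-length vector = one training sample
    }

    encode_order(list(c(1, 2), c(3)))  # two items, padded to length 15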
    device   date    act power 1   react power 2
    ---------------------------------------------
    M1       05-02   2             3
    M2       05-02   4             2
    M3       05-02   3             4
    M1       06-02   1             2
    M2       07-02   3             4
    ---------------------------------------------
                     need sum      need sum

Say that I only need the sums for M1 and M2 from that table. How could I add a variable that contains the sum of power, grouped by date and device? I don't know if it is desirable to have something like this? Or how …
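A base-R sketch, using a hypothetical data frame that mirrors the table (the column names are my own):

    readings <- data.frame(
      device        = c("M1", "M2", "M3", "M1", "M2"),
      date          = c("05-02", "05-02", "05-02", "06-02", "07-02"),
      act_power_1   = c(2, 4, 3, 1, 3),
      react_power_2 = c(3, 2, 4, 2, 4)
    )

    # Keep only M1 and M2, then total both power columns per device and date.
    m12 <- readings[readings$device %in% c("M1", "M2"), ]
    aggregate(cbind(act_power_1, react_power_2) ~ device + date,
              data = m12, FUN = sum)

    # To add the group sum as a new column without collapsing rows, ave() works:
    m12$act_sum <- ave(m12$act_power_1, m12$device, m12$date, FUN = sum)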
Hi there, my question is about how to read ECDF graphs. I am still quite unsure what the jumps/zig-zags in the graph mean and what is happening when there is a horizontal line, and so on. I would be happy if someone could explain to me how I am supposed to read this graph and what information I can get from it. Thank you.
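Not a comment on the specific graph, but a tiny R example shows the general reading: the ECDF jumps by 1/n at each observed value and runs flat between observations.

    x <- c(1, 2, 2, 5)  # hypothetical sample of n = 4 values
    F <- ecdf(x)

    F(2)    # 0.75: three of four values are <= 2 (the tied 2s double the jump)
    F(4)    # still 0.75: a flat stretch means no observations between 2 and 5
    plot(F) # staircase: vertical jumps at data points, horizontal runs between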
I have a data set with date features like 01/01/2019 and I would like to use KNN. However, I cannot find a good transformation for dates that gives a meaningful distance result for the last feature. For example:

    f1 | 1  | 2 | 3  | 4 | 01/01/2019
    f2 | 10 | 3 | 12 | 1 | 14/01/2019

Does anyone have any recommendations?
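One simple option (an assumption about what counts as meaningful here) is to convert each date to a numeric day offset, so distance on that feature is literally the number of days apart:

    dates <- c("01/01/2019", "14/01/2019")

    # Days since 1970-01-01; subtracting the minimum gives 0 and 13, so
    # Euclidean distance on this feature counts days between rows.
    days <- as.numeric(as.Date(dates, format = "%d/%m/%Y"))
    days - min(days)

Like the other features, the result should still be rescaled before KNN (e.g. with scale()), or the day counts will dominate the distance.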
I am creating a bootcamp for data analysts, and for two days I have been looking for a good dataset, fit for commercial use, that I can use to create Tableau and Power BI tutorials. Even on Kaggle some datasets are licensed as CC0, but when you track back to the company the data was scraped from, it states that the data shouldn't be used commercially (e.g. the Zomato dataset). Are there any good data sources which I can use …
I am using ggplot to compare 114 unique studies for a particular variable I'm interested in. This is what I have used: ggplot(steps, aes(x = factor(edu))) + geom_bar(aes(y = ..count.., group = id_study)) + facet_wrap(~id_study). Whilst this works, all 114 studies are plotted on one page and the formatting is all squashed. How do I split this over 4x4 pages? Many thanks, S. Edit: As there are 114 unique studies, I have 5 pages in total. 1) ggplot(steps, aes(x=factor(edu))) + …
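A sketch using the ggforce package (my suggestion, not something named in the question), whose facet_wrap_paginate() draws a fixed grid of panels per page:

    library(ggplot2)
    library(ggforce)  # provides facet_wrap_paginate()

    n_pages <- ceiling(114 / 16)  # 4 x 4 = 16 panels per page
    for (i in seq_len(n_pages)) {
      p <- ggplot(steps, aes(x = factor(edu))) +
        geom_bar(aes(y = ..count.., group = id_study)) +
        facet_wrap_paginate(~id_study, ncol = 4, nrow = 4, page = i)
      print(p)  # or ggsave(sprintf("page_%02d.png", i), p)
    }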
Hi Data Science Community! I am a new data intern and I have been stuck on this question for a while. Here is a sample dataset I am working with:

    Customer | Manufacturer A Spending | Manufacturer B Spending | Manufacturer A Cost per Product (CPP) | Manufacturer B Cost per Product (CPP) | Product Cost Difference (B-A) | Product Cost Difference in %
    1        | 400000                  | 360000                  | 44                                    | 45                                    | 1                             | 1/45
    2        | 300000                  | 310000                  | 23                                    | 21                                    | -2                            | -2/21
    3        | 100000                  | 106000                  | 1.4                                   | 1.6                                   | 0.2                           | 0.2/1.6

I …
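Reading the last two columns: the difference is Manufacturer B's CPP minus Manufacturer A's, and the percentage divides that difference by B's CPP. A worked check for customer 3 (my arithmetic, inferred from the table):

    cpp_a <- 1.4
    cpp_b <- 1.6

    diff <- cpp_b - cpp_a  # 0.2, matching the "(B-A)" column
    diff / cpp_b           # 0.125, i.e. the 0.2/1.6 entry, or 12.5%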
I am using two binary classifiers and measuring their accuracy over a dataset, where accuracy is defined as the ratio of correct to incorrect predictions. Do I need to take accuracies sampled over multiple experiments and use them as the data for a t-test? Can someone please explain? Also, what will the result of the t-test convey? Thanks in advance.
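A minimal sketch of that setup in R, with hypothetical per-experiment accuracies; the pairing assumes both classifiers were evaluated on the same splits:

    # Made-up accuracies from repeated experiments (same splits for both models).
    acc_a <- c(0.81, 0.79, 0.84, 0.80, 0.82)
    acc_b <- c(0.78, 0.77, 0.80, 0.79, 0.76)

    # The p-value answers: how likely is a mean difference this large if the
    # two classifiers actually perform the same?
    t.test(acc_a, acc_b, paired = TRUE)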
I am trying to build a model that will predict the communication loss of a wireless device. For now I am using RandomForestClassifier, with Device and Location as the features. I am getting both the train score and the test score as 99%, so I am pretty sure the model is giving a biased result. One of the reasons might be that records of communication-loss events are very rare compared to records with no communication loss. Some …
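A quick way to see whether imbalance alone explains the 99% is to compare against the majority-class baseline; here sketched in R with made-up label counts:

    # Hypothetical label distribution: 990 "no loss" records, 10 "loss" records.
    labels <- c(rep("no_loss", 990), rep("loss", 10))

    # A model that always predicts "no_loss" already scores 99% accuracy, so
    # accuracy near this baseline says nothing about detecting actual losses.
    max(table(labels)) / length(labels)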
I am trying to get a dataset on electrolytic capacitor ageing, and I have not been able to find one that shows the ripple current and the voltage, which I need in order to calculate the Equivalent Series Resistance (a nice parameter for checking degradation). I have looked on the typical sites (Kaggle, data.world, …) but found none. Could someone recommend a site? Thank you!
I am working on an algorithm whose results are being compared with another model's using a confidence interval at 90%. Can this be called a statistical test? I read an article that talked about a statistical test with some confidence level. Is a confidence level the same thing as a confidence interval in statistical tests?
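For intuition, R's t.test() reports both views at once: the p-value (the test) and the matching interval, where a 90% interval corresponds to a test at alpha = 0.10. Hypothetical numbers:

    a <- c(0.81, 0.79, 0.84, 0.80)  # made-up scores, model A
    b <- c(0.78, 0.77, 0.80, 0.79)  # made-up scores, model B

    # One call returns the test (p-value) and the 90% confidence interval for
    # the mean difference; if the interval excludes 0, the test at
    # alpha = 0.10 rejects "no difference".
    t.test(a, b, conf.level = 0.90)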
I have some data like:

           hr1  hr2  hr3  hr4  hr5  hr6  hr7
    usr1   1    0    0    0    0    0    0
    usr2   0    1    1    0    0    0    0
    usr3   0    1    0    0    0    0    0
    usr4   1    0    0    0    0    0    0
    usr5   1    1    1    1    1    1    1

How do I categorize this data into bins like hr1-hr3 and hr4-hr7, or any other bins?
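One straightforward reading of "binning" here is to sum each group of hour columns per user. A base-R sketch, rebuilding the table above as a data frame (names assumed from the layout):

    usage <- data.frame(
      hr1 = c(1, 0, 0, 1, 1), hr2 = c(0, 1, 1, 0, 1), hr3 = c(0, 1, 0, 0, 1),
      hr4 = c(0, 0, 0, 0, 1), hr5 = c(0, 0, 0, 0, 1), hr6 = c(0, 0, 0, 0, 1),
      hr7 = c(0, 0, 0, 0, 1),
      row.names = c("usr1", "usr2", "usr3", "usr4", "usr5")
    )

    # Collapse the hourly columns into two bins by summing within each group.
    usage$hr1_3 <- rowSums(usage[, c("hr1", "hr2", "hr3")])
    usage$hr4_7 <- rowSums(usage[, c("hr4", "hr5", "hr6", "hr7")])

Any other bin boundaries work the same way: pick the column subset and rowSums() it.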