How do you use the KS test in a data science report?

I'm writing a data science report, and I want to find an existing distribution that fits my sample. I got a good-looking fit, but when I use the KS test to check the model I get a low p-value, 1.2e-4, so I should definitely reject the model. But no matter what distribution/model you use to fit a sample, you cannot expect a perfect result, especially when working with a huge amount of data. So what is the KS test supposed to do in a data science report? …
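For context, a minimal sketch of how one might run the one-sample KS test with scipy; the sample and the lognormal candidate here are hypothetical, chosen just to illustrate the large-n effect the question describes:

    import numpy as np
    from scipy import stats

    # Hypothetical sample; in practice this would be the data in the report.
    rng = np.random.default_rng(0)
    sample = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)

    # Fit a candidate distribution, then test the fitted model with KS.
    shape, loc, scale = stats.lognorm.fit(sample)
    stat, pval = stats.kstest(sample, 'lognorm', args=(shape, loc, scale))
    print(f"KS statistic = {stat:.4f}, p-value = {pval:.2e}")

    # Caveats: with large n even negligible deviations yield tiny p-values,
    # so the KS statistic itself (an effect size) is often more informative
    # in a report; also, estimating the parameters from the same sample
    # invalidates the standard KS p-value (see the Lilliefors correction).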
Category: Data Science

How to automatically segment multidimensional data?

How to partition the multidimensional time-series data in the figure below into segments using an unsupervised algorithm, so that the information within the same segment remains consistent while the information in adjacent segments differs? Note that the algorithm should be adaptive, because we do not know in advance how many segments each time series should be divided into. The data can be found here.
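Not part of the question, but one common unsupervised approach to this is change-point detection. A minimal sketch with the ruptures package, where the penalized PELT search chooses the number of segments itself (the signal and the penalty value are assumptions):

    import numpy as np
    import ruptures as rpt  # pip install ruptures

    # Synthetic multidimensional signal with 3 segments of differing means.
    rng = np.random.default_rng(0)
    signal = np.concatenate([
        rng.normal(0.0, 1.0, size=(100, 4)),
        rng.normal(3.0, 1.0, size=(80, 4)),
        rng.normal(-2.0, 1.0, size=(120, 4)),
    ])

    # PELT with an RBF cost handles multivariate data; the penalty `pen`
    # trades off fit vs. number of change points, so no fixed K is required.
    algo = rpt.Pelt(model="rbf", min_size=10).fit(signal)
    breakpoints = algo.predict(pen=10)
    print(breakpoints)  # indices where segments end, e.g. [100, 180, 300]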
Category: Data Science

Multiple regression with non-normal data in Minitab - help

I am aiming to assess the effect of BMI (continuous) on certain biomarkers (also continuous) whilst adjusting for several relevant variables (a mix of categorical and continuous) using multiple regression. My data are non-normal, which I believe violates one of the key assumptions of multiple linear regression. Whilst I think the regression can still be performed, I think non-normality affects the significance testing, which is an issue for me. I think I can transform the data and then perform the regression, but I'm not sure …
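The question is about Minitab, but as an illustration, here is a minimal sketch of the transform-then-regress idea in Python with statsmodels; the dataframe, variable names, and log transform are all hypothetical:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data: biomarker response, BMI exposure, plus covariates.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "bmi": rng.normal(27, 4, 300),
        "age": rng.integers(20, 70, 300),
        "sex": rng.choice(["F", "M"], 300),
    })
    df["biomarker"] = np.exp(0.05 * df["bmi"] + rng.normal(0, 0.5, 300))

    # Log-transform a right-skewed response, then fit the adjusted model;
    # C() marks categorical covariates. Note that what matters for the
    # significance tests is (approximate) normality of the residuals,
    # not of the raw data.
    model = smf.ols("np.log(biomarker) ~ bmi + age + C(sex)", data=df).fit()
    print(model.summary())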
Category: Data Science

Do non-parametric models always overfit without regularization?

Let's scope this to just classification. It's clear that if you fully grow a decision tree with no regularization (e.g. no max depth, no pruning), it will overfit the training data, achieving perfect training accuracy even when the Bayes error is nonzero*. Is this universally true for all non-parametric methods? *Assuming the model has access to the "right" features.
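As a quick illustration of the premise (not from the question), an unregularized scikit-learn tree memorizing noisy labels:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # flip_y injects label noise, so the Bayes error is strictly positive.
    X, y = make_classification(n_samples=2000, flip_y=0.2, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # No max depth, no pruning: the tree is grown until the leaves are pure.
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    print("train accuracy:", tree.score(X_tr, y_tr))  # ~1.0: memorized noise
    print("test accuracy:", tree.score(X_te, y_te))   # noticeably lower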
Category: Data Science

Pass a variable-length argument list to mstats.kruskalwallis

I am trying to run the Kruskal-Wallis test on multiple columns of my data; for that I wrote a function:

    var = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
           'n','o','p','q','r','s','t','u','v','w','x','y','z']

    def kruskawallis_test(column):
        k_test = train.loc[:, [column, 'SalePrice']]
        x = pd.pivot_table(k_test, index=k_test.index,
                           values='SalePrice', columns=column)
        for i in range(x.shape[1]):
            var[i] = x.iloc[:, i]
            var[i] = var[i][~var[i].isnull()].tolist()
        H, pval = mstats.kruskalwallis(var[0], var[1], var[2], var[3])
        return pval

The problem I am facing is that every column has a different number of groups, so var[0], var[1], var[2], var[3] will not be correct for every column. As far as I know, mstats.kruskalwallis() takes one input vector per group, containing the values of that group from a particular column. …
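One way to handle a variable number of groups (a minimal sketch, assuming a dataframe train with a SalePrice column as in the question) is to build a list with one entry per group and unpack it with *:

    from scipy.stats import mstats

    def kruskawallis_test(train, column, target='SalePrice'):
        # One list of target values per group, however many groups there are.
        groups = [grp[target].dropna().tolist()
                  for _, grp in train.groupby(column)]
        # * unpacks the variable-length list into separate arguments.
        H, pval = mstats.kruskalwallis(*groups)
        return pval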
Category: Data Science

What are the main differences between parametric and non-parametric machine learning algorithms?

I am interested in parametric and non-parametric machine learning algorithms, their advantages and disadvantages, and their main differences with respect to computational complexity. In particular, I am interested in the parametric Gaussian Mixture Model (GMM) and the non-parametric kernel density estimation (KDE). I found that if a "small" number of data points is used, then parametric methods (like GMM fitted with EM) are the better choice, but if the number of data points grows much larger, then non-parametric algorithms are better. …
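For concreteness, a minimal sketch contrasting the two on the same synthetic sample with scikit-learn (the data and bandwidth are assumptions):

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.neighbors import KernelDensity

    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(-2, 0.5, 500),
                           rng.normal(2, 1.0, 500)])[:, None]

    # Parametric: a fixed number of parameters (2 components), fit by EM.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

    # Non-parametric: the "model" is the data itself plus a bandwidth,
    # so evaluation cost grows with the number of training points.
    kde = KernelDensity(bandwidth=0.3).fit(data)

    grid = np.linspace(-5, 5, 200)[:, None]
    gmm_logdens = gmm.score_samples(grid)   # log-density under the GMM
    kde_logdens = kde.score_samples(grid)   # log-density under the KDE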
Category: Data Science

Books about statistical inference

I'm currently taking a course, "Introduction to Machine Learning", which covers the following topics: linear regression, overfitting, classification problems, parametric & non-parametric models, Bayesian & non-Bayesian models, generative classification, neural networks, SVM, boosting & bagging, and unsupervised learning. I've asked the course staff for some reading material about those subjects, but I would like to hear some more recommendations for books (or any other material) that give more intuition about the listed topics to start with, and also some books …
Category: Data Science

Should features be correlated or uncorrelated for classification?

I have seen researchers use Pearson's correlation coefficient to find the relevant features, i.e. to keep the features that have a high correlation value with the target. The implication is that correlated features contribute more information for determining the target in classification problems, whereas we remove the features which are redundant or have a negligible correlation value. Q1) Should features highly correlated with the target variable be included in or removed from classification problems? Is there a …
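A minimal sketch of the selection heuristic the question describes (the dataframe df, the target column 'y', and both thresholds are hypothetical):

    import pandas as pd

    # Hypothetical dataframe `df` of numeric features plus a target 'y'.
    # 1) Relevance: keep features strongly correlated with the target.
    relevance = df.drop(columns='y').corrwith(df['y']).abs()
    relevant = relevance[relevance > 0.1].index.tolist()

    # 2) Redundancy: among the relevant features, drop one of any pair
    #    that is highly correlated with another kept feature.
    corr = df[relevant].corr().abs()
    to_drop = set()
    for i, a in enumerate(relevant):
        for b in relevant[i + 1:]:
            if a not in to_drop and corr.loc[a, b] > 0.9:
                to_drop.add(b)
    selected = [f for f in relevant if f not in to_drop]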
Category: Data Science

Logic behind the Statement on Non-Parametric models

I am currently reading 'Mastering Machine Learning with scikit-learn', 2nd Edition, from Packt. In the Lazy Learning and Non-Parametric Models topic of Chapter 3, Classification and Regression with k-Nearest Neighbors, there is a paragraph stating: "Non-parametric models can be useful when training data is abundant and you have little prior knowledge about the relationship between the response and the explanatory variables. kNN makes only one assumption: instances that are near each other are likely to have similar values of the response variable." …
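The single assumption the book mentions is easy to see in code; a minimal scikit-learn sketch (not from the book, data synthetic):

    from sklearn.datasets import make_moons
    from sklearn.neighbors import KNeighborsClassifier

    # kNN stores the training set (lazy learning) and predicts by majority
    # vote among the k nearest stored instances: no functional form and no
    # fixed parameter vector, just "near points have similar responses".
    X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
    print(knn.predict([[0.5, 0.25]]))  # label taken from the 5 nearest neighbors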
Category: Data Science

Good introductory reference for Bayesian Non-parametric (Dirichlet Process / Chinese Restaurant Process)

I am looking for a recommendation for basic introductory material on Bayesian non-parametric methods, specifically the Dirichlet Process / Chinese Restaurant Process. I am looking for material that covers the modeling part as well as the inference part from the ground up. Most of the material I found on the internet is slightly advanced and skips the inference part, which is usually harder to grasp.
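Not a reading recommendation, but a minimal simulation of the CRP seating rule (customer i joins table k with probability proportional to its size, or opens a new table with probability proportional to alpha) can help with the modeling intuition; this sketch is not from the question:

    import numpy as np

    def chinese_restaurant_process(n, alpha, rng=None):
        """Simulate CRP(alpha) table assignments for n customers."""
        rng = rng or np.random.default_rng()
        tables = []        # current table sizes ("rich get richer")
        assignments = []
        for i in range(n):
            # Join table k with prob n_k/(i+alpha), new table with alpha/(i+alpha).
            probs = np.array(tables + [alpha], dtype=float) / (i + alpha)
            choice = rng.choice(len(probs), p=probs)
            if choice == len(tables):
                tables.append(1)       # open a new table
            else:
                tables[choice] += 1    # join an existing table
            assignments.append(choice)
        return assignments

    print(chinese_restaurant_process(10, alpha=1.0))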
Category: Data Science

Linear vs non-linear regression (basic beginner)

My doubt is basically this: in linear regression, we try to fit a straight line or a curve to a given training set. Now, I believe that whenever the number of features (independent variables) increases, the number of parameters also increases, and hence computing these parameters becomes computationally expensive. So I guess that's the reason we move to non-linear regression!? Is my understanding right? And my next doubt is about overfitting in linear regression, where we say that the model memorizes. What I understand is that the parameters …
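To make the "memorizing" part concrete, a small sketch (not from the question) comparing a straight-line fit to a high-degree polynomial fit, which is still linear regression, just on non-linear features:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 1, 20))[:, None]
    y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 20)

    line = LinearRegression().fit(X, y)          # few parameters: underfits
    poly15 = make_pipeline(PolynomialFeatures(15),
                           LinearRegression()).fit(X, y)  # many parameters

    # Training R^2: the degree-15 model is near-perfect because it has
    # memorized the noise, not because it generalizes better.
    print(line.score(X, y), poly15.score(X, y))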
Category: Data Science

About confidence/prediction intervals: parametric methods VS non-parametric (via bootstrap) methods

Regarding the methodology to find confidence and/or prediction intervals in, let's say, a regression problem, I know of two main options:

1. Checking normality of the distribution of the estimates/predictions, and applying well-known Gaussian methods to find those intervals if the distribution is Gaussian.
2. Applying non-parametric methodologies like bootstrapping, so we do not need to assume/check/care whether our distribution is normal.

With this in mind, I would basically always go for the second one because: it is meant to be generic, as …
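For reference, a minimal percentile-bootstrap sketch for option 2, written for a generic statistic; nothing here is specific to the question's regression setting:

    import numpy as np

    def bootstrap_ci(data, stat=np.mean, n_boot=10_000, alpha=0.05, rng=None):
        """Percentile-bootstrap confidence interval for an arbitrary statistic."""
        rng = rng or np.random.default_rng()
        # Resample with replacement and recompute the statistic each time.
        boots = np.array([
            stat(rng.choice(data, size=len(data), replace=True))
            for _ in range(n_boot)
        ])
        return np.quantile(boots, [alpha / 2, 1 - alpha / 2])

    rng = np.random.default_rng(0)
    data = rng.exponential(scale=2.0, size=500)   # clearly non-Gaussian sample
    print(bootstrap_ci(data, rng=rng))            # 95% CI for the mean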
Category: Data Science

Non-parametric regression on set of time series: One model for each or one for all series?

Let's say I have a set of 1D time series whose values have been sampled at equidistant time steps with timestamps $1, 2, 3, \dots$; they all have the same length and are somewhat similar in shape. I want to apply non-parametric regression (e.g. with Gaussian Processes or kernel regression) to the time series in order to infer values for timestamps that lie between the sample timestamps (e.g. $5.3$). The obvious way of doing this would be to simply build a regression model for …
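A minimal sketch of the per-series option with scikit-learn's GP regressor; the kernel choice and the synthetic series are assumptions:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    t = np.arange(1.0, 21.0)[:, None]     # equidistant timestamps 1..20
    y = np.sin(t / 3.0).ravel()           # one hypothetical series

    # One model per series: fit on the sampled timestamps, then query
    # in-between timestamps such as 5.3.
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(),
                                  normalize_y=True).fit(t, y)
    mean, std = gp.predict(np.array([[5.3]]), return_std=True)
    print(mean[0], std[0])   # interpolated value with its uncertainty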
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about the web, Android, data science, new techniques, and Linux security.