Let's say I have 10,000 training points, 100,000,000 points to impute, and 5-10 prediction variables/parameters, all numeric (for now). The target variable is numeric and skew-normal with outliers. I want to use SVM, but I'm new to it, so I would appreciate any opinions.
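A minimal sketch of what that setup could look like with scikit-learn's SVR, assuming the labelled rows and the rows to impute live in arrays called X_train, y_train and X_missing (all names and data below are illustrative, not from the question):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 7))                        # 5-10 numeric predictors
y_train = np.expm1(X_train[:, 0]) + rng.normal(size=10_000)   # skewed numeric target
X_missing = rng.normal(size=(100_000, 7))                     # stand-in for the rows to impute

# Scaling matters for RBF kernels; the epsilon-insensitive loss is fairly robust to outliers.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
model.fit(X_train, y_train)
y_imputed = model.predict(X_missing)   # with 1e8 rows this would be done in chunks
```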
I'm writing a data science report, and I want to find an existing distribution to fit the sample. I got a good-looking result, but when I used the KS test to check the fit, I got a low p-value (1.2e-4), so I should definitely reject the model. But whatever distribution/model you use to fit a sample, you cannot expect a perfect result, especially when working with a huge amount of data. So what does the KS test do in a data science report? …
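For reference, a hedged sketch of the workflow being described, fitting a candidate distribution and running the one-sample KS test with scipy (the lognormal choice and the synthetic sample are assumptions for illustration):

```python
import numpy as np
from scipy import stats

sample = stats.lognorm.rvs(s=0.6, size=50_000, random_state=0)

params = stats.lognorm.fit(sample)                          # maximum-likelihood fit
D, p_value = stats.kstest(sample, "lognorm", args=params)   # one-sample KS test against the fit
print(D, p_value)   # with tens of thousands of points, tiny deviations already give tiny p-values
```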
Which nonparametric outlier detection method do you suggest for detecting the outliers (red points) in these plots? I have tested the standard-deviation rule, the IQR rule, etc., but with no good results. It is just one vector containing both normal points and outliers. Thanks for your help.
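As a point of comparison, here is a small sketch of two common nonparametric options on a single numeric vector, a robust MAD rule and scikit-learn's LocalOutlierFactor (the vector and the thresholds are illustrative assumptions, not the data from the plots):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), [8.0, 9.5, -7.2]])   # bulk plus a few outliers

# Median-absolute-deviation rule: flag points far from the median in robust units.
med = np.median(x)
mad = np.median(np.abs(x - med))
robust_z = 0.6745 * (x - med) / mad
mad_outliers = np.abs(robust_z) > 3.5

# Density-based alternative that makes no distributional assumption at all.
lof = LocalOutlierFactor(n_neighbors=20)
lof_outliers = lof.fit_predict(x.reshape(-1, 1)) == -1
```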
How can I partition the multidimensional time-series data in the figure below into segments using an unsupervised algorithm, so that the information within a segment remains consistent while the information in adjacent segments differs? Note that the algorithm should be adaptive, because we do not know how many segments each time series should be divided into. The data can be found here.
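One possible approach, sketched under the assumption that a change-point library such as `ruptures` is acceptable: PELT with a penalty term picks the number of segments on its own rather than requiring it up front (the synthetic signal below merely stands in for the data in the figure):

```python
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(0)
# Stand-in for the multidimensional series: 1000 time steps, 4 channels, 3 regimes.
signal = np.concatenate([
    rng.normal(0, 1, (400, 4)),
    rng.normal(3, 1, (350, 4)),
    rng.normal(-2, 1, (250, 4)),
])

algo = rpt.Pelt(model="rbf", min_size=20).fit(signal)
breakpoints = algo.predict(pen=10)   # the penalty, not a preset count, controls how many segments appear
print(breakpoints)
```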
I am aiming to assess the effect of BMI (continuous) on certain biomarkers (also continuous) while adjusting for several relevant variables (mixed categorical and continuous) using multiple regression. My data are non-normal, which I believe violates one of the key assumptions of multiple linear regression. While I think the regression can still be performed, I think the non-normality affects the significance testing, which is an issue for me. I think I can transform the data and then perform the regression, but I'm not sure …
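A hedged sketch of the transform-then-regress idea with statsmodels; the variable names (biomarker, bmi, age, sex) and the log transform are assumptions for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "bmi": rng.normal(27, 4, n),
    "age": rng.normal(50, 10, n),
    "sex": rng.choice(["F", "M"], n),
})
df["biomarker"] = np.exp(0.05 * df["bmi"] + 0.01 * df["age"] + rng.normal(0, 0.3, n))

# Log-transforming a right-skewed outcome often brings the residuals closer to normal,
# which is what the significance tests actually rely on.
model = smf.ols("np.log(biomarker) ~ bmi + age + C(sex)", data=df).fit()
print(model.summary())
```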
Let's scope this to just classification. It's clear that if you fully grow out a decision tree with no regularization (e.g. no max depth limit, no pruning), it will overfit the training data, driving the training error to zero even past the Bayes error*. Is this universally true for all non-parametric methods? *Assuming the model has access to the "right" features.
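A quick illustration of the premise on a synthetic dataset (purely illustrative): an unregularized tree memorizes the training set while test accuracy stays near the noise floor of the problem:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=None, ccp_alpha=0.0, random_state=0).fit(X_tr, y_tr)
print(tree.score(X_tr, y_tr))   # 1.0: every training point is fit, including the flipped labels
print(tree.score(X_te, y_te))   # noticeably lower, reflecting the injected label noise
```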
I am trying to run a Kruskal-Wallis test on multiple columns of my data, and for that I wrote a function:

    import pandas as pd
    from scipy.stats import mstats

    var = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
           'n','o','p','q','r','s','t','u','v','w','x','y','z']

    def kruskawallis_test(column):
        # keep only the grouping column and the target
        k_test = train.loc[:, [column, 'SalePrice']]
        # pivot so that each group level becomes its own column of SalePrice values
        x = pd.pivot_table(k_test, index=k_test.index, values='SalePrice', columns=column)
        for i in range(x.shape[1]):
            var[i] = x.iloc[:, i]
            var[i] = var[i][~var[i].isnull()].tolist()
        H, pval = mstats.kruskalwallis(var[0], var[1], var[2], var[3])
        return pval

The problem I am facing is that every column has a different number of groups, so var[0], var[1], var[2], var[3] will not be correct for every column. As far as I know, mstats.kruskalwallis() takes one input vector per group, each containing the values to be compared from a particular column. …
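One possible way around the fixed var[0]..var[3] indexing, sketched under the assumption that `train` is the asker's DataFrame with a 'SalePrice' column: build one list of values per group and unpack them with `*`, so the number of groups can vary by column:

```python
from scipy.stats import mstats

def kruskal_by_column(column):
    # one list of SalePrice values per level of the grouping column
    groups = [g['SalePrice'].dropna().tolist() for _, g in train.groupby(column)]
    H, pval = mstats.kruskalwallis(*groups)   # accepts any number of group vectors
    return pval
```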
I am interested in parametric and non-parametric machine learning algorithms, their advantages and disadvantages, and also their main differences regarding computational complexity. In particular I am interested in the parametric Gaussian Mixture Model (GMM) and non-parametric kernel density estimation (KDE). I found out that if a "small" number of data points is used, then parametric methods (like GMM/EM) are the better choice, but if the number of data points grows much larger, then non-parametric algorithms are better. …
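A small sketch comparing the two on the same 1-D sample; both expose a log-density that can be evaluated on a grid (the component count and bandwidth here are illustrative assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 400), rng.normal(3, 1.0, 600)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)   # fixed number of parameters
kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(x)   # effective complexity grows with the data

grid = np.linspace(-5, 7, 200).reshape(-1, 1)
gmm_logdens = gmm.score_samples(grid)
kde_logdens = kde.score_samples(grid)
```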
I'm currently taking a course, "Introduction to Machine Learning", which covers the following topics: linear regression, overfitting, classification problems, parametric & non-parametric models, Bayesian & non-Bayesian models, generative classification, neural networks, SVM, boosting & bagging, and unsupervised learning. I've asked the course staff for some reading material on those subjects, but I would like to hear some more recommendations for books (or any other material) that give more intuition about the listed topics to start with, and also some books …
I have seen researchers using Pearson's correlation coefficient to find the relevant features -- keeping the features that have a high correlation value with the target. The implication is that the correlated features contribute more information for predicting the target in classification problems, whereas we remove features that are redundant or have a negligible correlation value. Q1) Should features that are highly correlated with the target variable be included in or removed from classification problems? Is there a …
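For concreteness, a hedged sketch of that screening step, ranking features by their absolute Pearson correlation with the target (the dataset is only a stand-in; with a binary target this is the point-biserial correlation):

```python
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame   # numeric features plus a binary 'target' column

corr_with_target = df.corr()["target"].drop("target")
print(corr_with_target.abs().sort_values(ascending=False).head(10))
```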
I am currently reading 'Mastering Machine Learning with scikit-learn', 2nd edition, by Packt. In the 'Lazy Learning and Non-Parametric Models' topic of Chapter 3, 'Classification and Regression with k-Nearest Neighbors', there is a paragraph stating: Non-parametric models can be useful when training data is abundant and you have little prior knowledge about the relationship between the response and the explanatory variables. kNN makes only one assumption: instances that are near each other are likely to have similar values of the response variable. …
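A minimal sketch of that single assumption in action, using scikit-learn's kNN classifier on a synthetic dataset (everything here is illustrative): the "training" step only stores the data, and predictions come from the labels of nearby points:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)   # no parametric form is fitted
print(knn.score(X_te, y_te))
```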
I am looking for a recommendation for basic introductory material on Bayesian non-parametric methods, specifically the Dirichlet Process / Chinese Restaurant Process. I am looking for material that covers the modeling part as well as the inference part from the ground up. Most of the material I found on the internet is slightly advanced and skips the inference part, which is usually harder to grasp.
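For orientation while reading, here is a short sketch of the Chinese Restaurant Process generative story (my own illustrative code, not taken from any particular reference): each customer joins an existing table with probability proportional to its size, or opens a new table with probability proportional to alpha:

```python
import numpy as np

def crp_sample(n_customers, alpha, seed=0):
    rng = np.random.default_rng(seed)
    tables = []          # current table sizes
    assignments = []
    for _ in range(n_customers):
        probs = np.array(tables + [alpha], dtype=float)
        probs /= probs.sum()
        choice = rng.choice(len(probs), p=probs)
        if choice == len(tables):   # open a new table
            tables.append(1)
        else:
            tables[choice] += 1
        assignments.append(choice)
    return assignments

print(crp_sample(20, alpha=1.0))   # the number of tables (clusters) grows adaptively with the data
```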
So my doubt is basically about linear regression: we try to fit a straight line or a curve to a given training set. Now, I believe that whenever the number of features (independent variables) increases, the number of parameters also increases, and hence computing these parameters is computationally expensive. So I guess that's the reason we move to non-linear models!? Is my understanding right? And my next doubt is, in overfitting for linear regression, we say that the model memorizes. What I understand is that the parameters …
Regarding the methodology for finding confidence and/or prediction intervals in, let's say, a regression problem, I know of two main options: checking normality of the estimates/predictions distribution and applying the well-known Gaussian methods to find those intervals if the distribution is Gaussian; or applying non-parametric methodologies like bootstrapping, so that we do not need to assume/check/care whether our distribution is normal. With this in mind, I would basically always go for the second one because: it is meant to be generic, as …
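A hedged sketch of the second option, a residual bootstrap for a prediction interval around a linear model, with no normality assumption anywhere (the model choice, sample, and interval level are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.standard_t(df=3, size=200)        # deliberately heavy-tailed noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)
x_new = np.array([[5.0]])

boot_preds = []
for _ in range(2000):
    idx = rng.integers(0, len(y), len(y))                 # resample (x, y) pairs
    m = LinearRegression().fit(X[idx], y[idx])
    boot_preds.append(m.predict(x_new)[0] + rng.choice(residuals))   # add a resampled residual

lower, upper = np.percentile(boot_preds, [2.5, 97.5])     # ~95% prediction interval at x_new
```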
Let's say I have a set of 1D time series whose values have been sampled at equidistant time steps with timestamps $1, 2, 3, \ldots$; they all have the same length and are somewhat similar in shape. I want to apply non-parametric regression (e.g. with Gaussian Processes or kernel regression) to the time series in order to infer values for timestamps that lie between the sample timestamps (e.g. $5.3$). The obvious way of doing this would be to simply build a regression model for …
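For the per-series option, a minimal sketch with scikit-learn's Gaussian Process regressor, querying a timestamp between samples such as $5.3$ (the kernel choice and the synthetic series are assumptions):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

t = np.arange(1, 21, dtype=float).reshape(-1, 1)            # timestamps 1, 2, ..., 20
y = np.sin(0.5 * t[:, 0]) + 0.05 * np.random.default_rng(0).normal(size=20)

kernel = 1.0 * RBF(length_scale=2.0) + WhiteKernel(noise_level=1e-3)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(t, y)

mean, std = gp.predict(np.array([[5.3]]), return_std=True)  # interpolated value plus uncertainty
```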