How to interpret a specific feature importance?

Apologies for a very case-specific question. I have a dataset of genes, with which I am using machine learning to predict if a gene causes a disease. One of the features I have is a beta value (which is the effect size of the gene's impact on the disease), and I'm not sure how best to interpret and use this feature. I condense the beta values from the variant level to the gene level, so a gene is left …
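A minimal sketch of one way to condense variant-level beta values to a single gene-level feature, assuming a hypothetical pandas DataFrame with gene, variant, and beta columns (names and values are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical variant-level table: one row per variant with its effect size (beta).
variants = pd.DataFrame({
    "gene":    ["GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_B"],
    "variant": ["v1", "v2", "v3", "v4", "v5"],
    "beta":    [0.8, -0.3, 1.2, 0.4, -0.1],
})

# Collapse to one row per gene; several summaries are plausible,
# e.g. the mean effect or the largest absolute effect across variants.
gene_level = variants.groupby("gene")["beta"].agg(
    beta_mean="mean",
    beta_max_abs=lambda s: s.abs().max(),
)
print(gene_level)
```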
Category: Data Science

Merging two datasets with different features for machine learning prediction

I'm trying to create a model which predicts real estate prices with XGBoost in machine learning. My question is: can I combine two datasets to do it? First dataset: 13 features. Second dataset: 100 features. The difference between the two datasets is that the first dataset contains real estate transactions from 2018 to 2021 with features like area and region, and the second also contains transactions, but from 2011 to 2016 and with more features like …
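One hedged way to combine the two periods, assuming both are pandas DataFrames: either restrict to the columns they share, or concatenate everything and let the extra columns be missing for the rows that lack them (the frames below are tiny stand-ins, not the real data):

```python
import pandas as pd

# Tiny stand-ins for the two transaction datasets.
recent = pd.DataFrame({"area": [80, 120], "region": ["A", "B"], "price": [200000, 350000]})
older  = pd.DataFrame({"area": [60], "region": ["A"], "rooms": [3], "price": [150000]})

# Option 1: keep only the columns both datasets share.
shared = recent.columns.intersection(older.columns)
combined_shared = pd.concat([recent[shared], older[shared]], ignore_index=True)

# Option 2: keep all columns; features absent from one dataset become NaN,
# which XGBoost can treat as missing values.
combined_all = pd.concat([recent, older], ignore_index=True, sort=False)

print(combined_shared.shape, combined_all.shape)
```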
Category: Data Science

Queries regarding feature importance for categorical features

Queries regarding feature importance for categorical features: Context: I have almost 185 categorical features, and these categorical features have 1, 2, 3, 4, or sometimes 8 categories, with nulls as well. I need to select the top 60 features for my model. I also understand that features need to be selected based on business importance OR feature importance from a random forest / decision tree. Queries: I have plotted histograms for each feature (value count vs category) to analyse. What …
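As an illustration of the random-forest route mentioned above, a sketch that one-hot encodes made-up categorical columns, sums the dummy-column importances back to the original features, and keeps the top k (all data and names here are invented):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Invented categorical data with nulls, standing in for the ~185 real features.
X = pd.DataFrame({f"feat{i}": rng.choice(["a", "b", "c", None], size=500) for i in range(10)})
y = rng.integers(0, 2, size=500)

# One-hot encode; dummy_na=True gives nulls their own indicator column.
X_enc = pd.get_dummies(X, dummy_na=True)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_enc, y)

# Importance of an original feature = sum of the importances of its dummy columns.
imp = pd.Series(rf.feature_importances_, index=X_enc.columns)
per_feature = imp.groupby(lambda c: c.split("_")[0]).sum().sort_values(ascending=False)

top_features = per_feature.head(60)  # top 60 original features (only 10 exist in this toy example)
print(top_features)
```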
Category: Data Science

Using PCA as features for production

I struggle with figuring out how to proceed with taking PCA into production in order to test my models with unknown samples. I'm using both a One-Hot-Encoding and a TF-IDF encoding to classify my elements with various models, mainly KNN. I know I can use the pretrained One-Hot-Encoder and the TF-IDF encoder to encode the new elements so they match the final feature vector. Since these feature vectors become very large, I use a PCA in …
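A sketch of the usual pattern: fit the encoders and the dimensionality reduction once inside a single pipeline, persist the fitted pipeline, and only call transform/predict on new samples. TruncatedSVD is used here instead of plain PCA because it accepts the sparse TF-IDF output directly (that substitution is this sketch's assumption, not the question's setup):

```python
from joblib import dump, load
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier

train_texts = ["red shirt cotton", "blue jeans denim", "green shirt cotton"]
train_labels = [0, 1, 0]

# Fit the whole preprocessing + model chain once, on training data only.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),  # PCA-like reduction for sparse input
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])
pipe.fit(train_texts, train_labels)

# Persist the fitted pipeline for production ...
dump(pipe, "model.joblib")

# ... and later apply the *same* fitted components to unseen samples.
loaded = load("model.joblib")
print(loaded.predict(["blue denim jacket"]))
```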
Category: Data Science

Understanding stepAIC output

I am using the stepAIC function in R to do a bi-directional (forward and backward) stepwise regression. I do not understand what each return value from the function means. The output is:

         Df  Sum of Sq     RSS      AIC
<none>                  350.71  -5406.0
- aaa     1      0.283  350.99  -5405.9
- bbb     1      0.339  351.05  -5405.4
- ccc     1      0.982  351.69  -5400.5
- ddd     1      0.989  351.70  -5400.5

Question: Are the values listed under Df, Sum of Sq, RSS, and AIC the values …
Category: Data Science

Why is XGBClassifier in Python outputting different feature importance values with the same data across different repetitions?

I am fitting an XGBClassifier to a small dataset (32 subjects) and find that if I loop through the code 10 times, the feature importances (gain) assigned to the features in the model vary slightly. I am using the same hyperparameter values between each iteration, and have subsample and colsample set to the default of 1 to prevent any random variation between executions. I am using the scikit-learn-style feature_importances_ attribute to extract the values from the fitted model. Any …
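A hedged sketch of one thing worth checking: pin random_state and single-threading, re-fit in a loop, and compare the stored importances. Whether this removes all run-to-run variation depends on the XGBoost version and tree method, so treat it as an experiment rather than a guaranteed fix:

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(32, 5))          # toy stand-in for the 32-subject dataset
y = rng.integers(0, 2, size=32)

importances = []
for _ in range(10):
    model = XGBClassifier(
        n_estimators=50,
        subsample=1.0,
        colsample_bytree=1.0,
        random_state=0,      # fix the seed for any internal randomness
        n_jobs=1,            # rule out nondeterminism from multithreading
        importance_type="gain",
    )
    model.fit(X, y)
    importances.append(model.feature_importances_)

# If the fits are fully deterministic, every run's importances should match.
print(np.allclose(importances[0], importances[-1]))
```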
Category: Data Science

Why is my regression model always dominated by one feature?

I am working on a financial prediction problem, which means it is a time series prediction problem. I have three features with high pairwise correlation (each pair's correlation is about 0.6), and I fit a linear regression. I assumed the coefficients would be similar across these three features, but I get a coefficient vector like this: [0.01, 0.15, 0.01], which means the second feature has the biggest coefficient (the features are normalized) and dominates the prediction result. I don't …
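A small illustration (with made-up data, not the question's) of why strongly correlated regressors can end up with very uneven coefficients, and how a ridge penalty tends to spread the weight more evenly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 80  # deliberately small sample so the instability is visible

# Three correlated features built from one shared latent signal.
latent = rng.normal(size=n)
X = np.column_stack([latent + rng.normal(scale=0.7, size=n) for _ in range(3)])
y = latent + rng.normal(scale=0.5, size=n)

# Refit on two random halves: OLS can split the shared signal very unevenly,
# and the split changes from sample to sample; ridge shrinks toward an even split.
for seed in (1, 2):
    idx = np.random.default_rng(seed).permutation(n)[: n // 2]
    print("OLS  :", np.round(LinearRegression().fit(X[idx], y[idx]).coef_, 2))
    print("Ridge:", np.round(Ridge(alpha=5.0).fit(X[idx], y[idx]).coef_, 2))
```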
Category: Data Science

What is the Purpose of Feature Selection?

I have a small medical dataset (200 samples) that contains only 6 cases of the condition I am trying to predict using machine learning. So far, the dataset is not proving useful for predicting the target variable and is resulting in models with 0% recall and precision, probably due to how small the dataset is. However, in order to learn from the dataset, I applied feature selection techniques to deduce which features are useful in predicting the target variable and …
Category: Data Science

Feature creation ideas for propensity models?

I'm working on a propensity model, predicting whether customers will buy or not. While doing exploratory data analysis, I found that customers have a buying pattern: most customers repeat the purchase at a specified time interval. For example, some customers repeat purchases every four quarters, some every 8 or 12 quarters, etc. I have the purchase dates for these customers. What is the most useful feature I can create to capture this pattern in the data? I'm predicting whether in the next quarter …
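One hedged sketch of features that encode such a cycle, assuming a hypothetical purchase table with customer_id and purchase_date columns: recency (days since last purchase), the customer's typical gap between purchases, and how far along the current cycle they are.

```python
import pandas as pd

# Hypothetical purchase history: one row per purchase.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "purchase_date": pd.to_datetime(
        ["2020-01-05", "2021-01-10", "2022-01-08", "2021-06-01", "2022-06-03"]
    ),
})
as_of = pd.Timestamp("2022-07-01")  # end of the observation window

g = purchases.sort_values("purchase_date").groupby("customer_id")["purchase_date"]
features = pd.DataFrame({
    "days_since_last": (as_of - g.max()).dt.days,                          # recency
    "median_interval_days": g.apply(lambda s: s.diff().dt.days.median()),  # typical cycle length
    "n_purchases": g.count(),
})
# >1 means the customer is already past their usual repurchase point.
features["cycle_position"] = features["days_since_last"] / features["median_interval_days"]
print(features)
```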
Category: Data Science

Feature Selection on Aggregated Target Data

I have a question about feature selection on a dataset where the target variable is aggregated as the sum of different data points. I want to predict the number of sales depending on a variety of features such as: week, price per unit, store country, store city, 2-3 other categorical meta-data fields, and other features. I am aware that this data should be interpreted as a time series, but because of the lack of available historical data, no model can compete with the naive …
Category: Data Science

How to use scikit-learn to extract features from text when I only have positive and unlabeled data?

I'm looking for something similar to this https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py but instead of positive and negative examples, I have positive examples and a bunch of unlabeled data that will contain some positive examples but is mostly negative. I'm planning on using this in a pipeline to transform text data into a vector, then feeding it into a classifier using https://pulearn.github.io/pulearn/doc/pulearn/ The issue is that I'm not sure of the best way to build the preprocessing stage where I transform the raw text data into …
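A hedged sketch of the preprocessing idea: fit a TfidfVectorizer on all documents (positive and unlabeled) and hand the resulting matrix to a PU wrapper. The ElkanotoPuClassifier import, its estimator/hold_out_ratio signature, and the 1 / -1 label convention are assumptions based on the pulearn documentation and should be double-checked there:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from pulearn import ElkanotoPuClassifier  # assumed import path; verify against the pulearn docs

# Toy corpus: 20 known positives, 20 unlabeled documents.
texts = [f"positive topic document number {i}" for i in range(20)] + \
        [f"unlabeled mixed document number {i}" for i in range(20)]
y = np.array([1] * 20 + [-1] * 20)   # assumed convention: 1 = positive, -1 = unlabeled

# Fit the vectorizer on the full corpus so new documents are projected
# into the same feature space at prediction time.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()   # dense copy, in case the PU wrapper needs it

base = LogisticRegression(max_iter=1000)        # any classifier with probability outputs
pu = ElkanotoPuClassifier(estimator=base, hold_out_ratio=0.2)
pu.fit(X, y)

# New text goes through the already-fitted vectorizer only (no refit).
X_new = vectorizer.transform(["a brand new document"]).toarray()
print(pu.predict(X_new))
```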
Category: Data Science

Using cross validation score to perform feature selection

So to perform my feature selection I ran cross validation over and over again, each time trying different subsets of my attributes, and repeated this until I got the best cross validation score I could get. Is this alright to do, or am I creating a major bias? I suspect that this could cause a bias and possibly result in data leakage, because I am probably learning something about my test set by doing this, but how bad of a …
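For context, a sketch of the usual leak-free pattern: put the selector inside a Pipeline so it is re-fitted on the training part of every fold, and let an outer cross_val_score provide the performance estimate (synthetic data, illustrative k):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

# The selector lives inside the pipeline, so each CV fold chooses its features
# using only that fold's training data; the held-out fold never influences the choice.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```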
Category: Data Science

Feature selection for regression

Suppose I have a response variable y and a set of feature variables (x1, x2 ... xn). I wish to find which of x1...xn are the best features for y in a regression problem (the relationship might not be linear). Is there any way I can do this kind of feature selection without using any correlation measure or regression function in the process (i.e. I cannot use any filter or wrapper methods)?
Category: Data Science

How To Develop Cluster Models Where the Clusters Occur Along Subsets of Dimensions in Multidimensional Data?

I have been exploring clustering algorithms (K-Means, K-Medoids, Ward Agglomerative, Gaussian Mixture Modeling, BIRCH, DBSCAN, OPTICS, Common Nearest-Neighbour Clustering) with multidimensional data. I believe that the clusters in my data occur across different subsets of the features rather than occurring across all features, and I believe that this impacts the performance of the clustering algorithms. To illustrate, below is Python code for a simulated dataset: ## Simulate a dataset. import numpy as np, matplotlib.pyplot as plt from sklearn.cluster import KMeans …
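Since the simulation code in the excerpt is cut off, here is a stand-in (not the author's code) for the kind of data described: two clusters that separate only along a couple of features, with the remaining dimensions being pure noise, plus a quick check of how KMeans fares on all features versus the relevant subspace:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n = 300

# Two groups that differ ONLY in features 0-1; the other 20 features are noise.
labels = rng.integers(0, 2, size=n)
informative = rng.normal(loc=labels[:, None] * 3.0, scale=1.0, size=(n, 2))
noise = rng.normal(size=(n, 20))
X = np.hstack([informative, noise])

km_all = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
km_sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[:, :2])

# Clustering on the relevant subspace typically recovers the true grouping better,
# because the noise dimensions dilute the distances used on the full feature set.
print("all features    :", adjusted_rand_score(labels, km_all.labels_))
print("subspace (0, 1) :", adjusted_rand_score(labels, km_sub.labels_))
```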
Category: Data Science

scikit-learn OMP memory error

I tried to use the OMP (Orthogonal Matching Pursuit) algorithm available in scikit-learn. My total data size, which includes both the target signal and the dictionary, is about 1 GB. However, when I ran the code, it exited with a memory error. The machine has 16 GB of RAM, so I don't think this should have happened. I added some logging to see where the error occurred and found that the data was loaded completely into NumPy arrays, and it was the algorithm itself that caused the error. Can someone help me with this …
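A hedged sketch of memory-conscious settings to try with scikit-learn's OrthogonalMatchingPursuit: load the arrays as float32, cap the number of nonzero coefficients, and disable the precomputed Gram matrix (whether this is enough depends on the actual dictionary shape; the data below is made up):

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)

# Made-up dictionary (n_samples x n_atoms) and target signal; real data would be
# loaded from disk, ideally already as float32 to roughly halve the memory footprint.
D = rng.normal(size=(5000, 2000)).astype(np.float32)
y = rng.normal(size=5000).astype(np.float32)

omp = OrthogonalMatchingPursuit(
    n_nonzero_coefs=50,   # cap the sparsity instead of using the 10%-of-features default
    precompute=False,     # skip building the (n_atoms x n_atoms) Gram matrix up front
)
omp.fit(D, y)
print(np.count_nonzero(omp.coef_))
```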
Category: Data Science
