How to interpret a specific feature importance?

Apologies for a very case-specific question. I have a dataset of genes, with which I am using machine learning to predict if a gene causes a disease. One of the features I have is a beta value (which is the effect size of the gene's impact on the disease), and I'm not sure how best to interpret and use this feature. I condense the beta values from the variant level to the gene level, so a gene is left …
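A minimal sketch of one way to condense variant-level beta values to a single gene-level feature, assuming a hypothetical pandas DataFrame with gene, variant, and beta columns (names and values are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical variant-level table: one row per variant with its effect size (beta).
variants = pd.DataFrame({
    "gene":    ["GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_B"],
    "variant": ["v1", "v2", "v3", "v4", "v5"],
    "beta":    [0.8, -0.3, 1.2, 0.4, -0.1],
})

# Collapse to one row per gene; several summaries are plausible,
# e.g. the mean effect or the largest absolute effect across variants.
gene_level = variants.groupby("gene")["beta"].agg(
    beta_mean="mean",
    beta_max_abs=lambda s: s.abs().max(),
)
print(gene_level)
```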
Category: Data Science

Merging two datasets with different features for machine learning prediction

I'm trying to create a model which predicts real estate prices with XGBoost in machine learning. My question is: can I combine two datasets to do it? First dataset: 13 features. Second dataset: 100 features. The difference between the two datasets is that the first dataset contains real estate transactions from 2018 to 2021 with features like area and region, and the second also contains transactions, but from 2011 to 2016 and with more features like …
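One hedged way to combine the two periods, assuming both are pandas DataFrames: either restrict to the columns they share, or concatenate everything and let the extra columns be missing for the rows that lack them (the frames below are tiny stand-ins, not the real data):

```python
import pandas as pd

# Tiny stand-ins for the two transaction datasets.
recent = pd.DataFrame({"area": [80, 120], "region": ["A", "B"], "price": [200000, 350000]})
older  = pd.DataFrame({"area": [60], "region": ["A"], "rooms": [3], "price": [150000]})

# Option 1: keep only the columns both datasets share.
shared = recent.columns.intersection(older.columns)
combined_shared = pd.concat([recent[shared], older[shared]], ignore_index=True)

# Option 2: keep all columns; features absent from one dataset become NaN,
# which XGBoost can treat as missing values.
combined_all = pd.concat([recent, older], ignore_index=True, sort=False)

print(combined_shared.shape, combined_all.shape)
```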
Category: Data Science

Queries regarding feature importance for categorical features

Queries regarding feature importance for categorical features: Context: I have almost 185 categorical features, and these categorical features have 1, 2, 3, 4, or sometimes 8 categories, with nulls as well. I need to select the top 60 features for my model. I also understand that features need to be selected based on business importance OR feature importance from a random forest / decision tree. Queries: I have plotted histograms for each feature (value count vs category) to analyse. What …
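As an illustration of the random-forest route mentioned above, a sketch that one-hot encodes made-up categorical columns, sums the dummy-column importances back to the original features, and keeps the top k (all data and names here are invented):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Invented categorical data with nulls, standing in for the ~185 real features.
X = pd.DataFrame({f"feat{i}": rng.choice(["a", "b", "c", None], size=500) for i in range(10)})
y = rng.integers(0, 2, size=500)

# One-hot encode; dummy_na=True gives nulls their own indicator column.
X_enc = pd.get_dummies(X, dummy_na=True)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_enc, y)

# Importance of an original feature = sum of the importances of its dummy columns.
imp = pd.Series(rf.feature_importances_, index=X_enc.columns)
per_feature = imp.groupby(lambda c: c.split("_")[0]).sum().sort_values(ascending=False)

top_features = per_feature.head(60)  # top 60 original features (only 10 exist in this toy example)
print(top_features)
```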
Category: Data Science

Using PCA as features for production

I struggle with figuring out how to proceed with taking PCA into production in order to test my models with unknown samples. I'm using both a One-Hot-Encoding and a TF-IDF encoding to classify my elements with various models, mainly KNN. I know I can use the pretrained One-Hot-Encoder and the TF-IDF encoder to encode the new elements so they match the final feature vector. Since these feature vectors become very large, I use a PCA in …
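A sketch of the usual pattern: fit the encoders and the dimensionality reduction once inside a single pipeline, persist the fitted pipeline, and only call transform/predict on new samples. TruncatedSVD is used here instead of plain PCA because it accepts the sparse TF-IDF output directly (that substitution is this sketch's assumption, not the question's setup):

```python
from joblib import dump, load
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier

train_texts = ["red shirt cotton", "blue jeans denim", "green shirt cotton"]
train_labels = [0, 1, 0]

# Fit the whole preprocessing + model chain once, on training data only.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),  # PCA-like reduction for sparse input
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])
pipe.fit(train_texts, train_labels)

# Persist the fitted pipeline for production ...
dump(pipe, "model.joblib")

# ... and later apply the *same* fitted components to unseen samples.
loaded = load("model.joblib")
print(loaded.predict(["blue denim jacket"]))
```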
Category: Data Science

Understanding stepAIC output

I am using the stepAIC function in R to do a bi-directional (forward and backward) stepwise regression. I do not understand what each return value from the function means. The output is:

         Df  Sum of Sq     RSS      AIC
<none>                  350.71  -5406.0
- aaa     1      0.283  350.99  -5405.9
- bbb     1      0.339  351.05  -5405.4
- ccc     1      0.982  351.69  -5400.5
- ddd     1      0.989  351.70  -5400.5

Question: Are the values listed under Df, Sum of Sq, RSS, and AIC the values …
Category: Data Science

Why is XGBClassifier in Python outputting different feature importance values with the same data across different repetitions?

I am fitting an XGBClassifier to a small dataset (32 subjects) and find that if I loop through the code 10 times, the feature importances (gain) assigned to the features in the model vary slightly. I am using the same hyperparameter values between each iteration, and have subsample and colsample set to the default of 1 to prevent any random variation between executions. I am using the scikit-learn-style feature_importances_ attribute to extract the values from the fitted model. Any …
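A hedged sketch of one thing worth checking: pin random_state and single-threading, re-fit in a loop, and compare the stored importances. Whether this removes all run-to-run variation depends on the XGBoost version and tree method, so treat it as an experiment rather than a guaranteed fix:

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(32, 5))          # toy stand-in for the 32-subject dataset
y = rng.integers(0, 2, size=32)

importances = []
for _ in range(10):
    model = XGBClassifier(
        n_estimators=50,
        subsample=1.0,
        colsample_bytree=1.0,
        random_state=0,      # fix the seed for any internal randomness
        n_jobs=1,            # rule out nondeterminism from multithreading
        importance_type="gain",
    )
    model.fit(X, y)
    importances.append(model.feature_importances_)

# If the fits are fully deterministic, every run's importances should match.
print(np.allclose(importances[0], importances[-1]))
```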
Category: Data Science

Why is my regression model always dominated by one feature?

I am working on a financial prediction problem, which means it is a time series prediction problem. I have three features with high pairwise correlation (each pair's correlation is about 0.6), and I fit a linear regression. I assumed the coefficients would be similar across these three features, but I get a coefficient vector like this: [0.01, 0.15, 0.01], which means the second feature has the biggest coefficient (the features are normalized) and dominates the prediction result. I don't …
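A small illustration (with made-up data, not the question's) of why strongly correlated regressors can end up with very uneven coefficients, and how a ridge penalty tends to spread the weight more evenly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 80  # deliberately small sample so the instability is visible

# Three correlated features built from one shared latent signal.
latent = rng.normal(size=n)
X = np.column_stack([latent + rng.normal(scale=0.7, size=n) for _ in range(3)])
y = latent + rng.normal(scale=0.5, size=n)

# Refit on two random halves: OLS can split the shared signal very unevenly,
# and the split changes from sample to sample; ridge shrinks toward an even split.
for seed in (1, 2):
    idx = np.random.default_rng(seed).permutation(n)[: n // 2]
    print("OLS  :", np.round(LinearRegression().fit(X[idx], y[idx]).coef_, 2))
    print("Ridge:", np.round(Ridge(alpha=5.0).fit(X[idx], y[idx]).coef_, 2))
```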
Category: Data Science

What is the Purpose of Feature Selection?

I have a small medical dataset (200 samples) that contains only 6 cases of the condition I am trying to predict using machine learning. So far, the dataset is not proving useful for predicting the target variable and is resulting in models with 0% recall and precision, probably due to how small the dataset is. However, in order to learn from the dataset, I applied feature selection techniques to deduce which features are useful in predicting the target variable and …
Category: Data Science

Feature creation ideas for propensity models?

I'm working on a propensity model, predicting whether customers will buy or not. While doing exploratory data analysis, I found that customers have a buying pattern: most customers repeat the purchase at a specified time interval. For example, some customers repeat purchases every four quarters, some every 8 or 12 quarters, etc. I have the purchase dates for these customers. What is the most useful feature I can create to capture this pattern in the data? I'm predicting whether in the next quarter …
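One hedged sketch of features that encode such a cycle, assuming a hypothetical purchase table with customer_id and purchase_date columns: recency (days since last purchase), the customer's typical gap between purchases, and how far along the current cycle they are.

```python
import pandas as pd

# Hypothetical purchase history: one row per purchase.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "purchase_date": pd.to_datetime(
        ["2020-01-05", "2021-01-10", "2022-01-08", "2021-06-01", "2022-06-03"]
    ),
})
as_of = pd.Timestamp("2022-07-01")  # end of the observation window

g = purchases.sort_values("purchase_date").groupby("customer_id")["purchase_date"]
features = pd.DataFrame({
    "days_since_last": (as_of - g.max()).dt.days,                          # recency
    "median_interval_days": g.apply(lambda s: s.diff().dt.days.median()),  # typical cycle length
    "n_purchases": g.count(),
})
# >1 means the customer is already past their usual repurchase point.
features["cycle_position"] = features["days_since_last"] / features["median_interval_days"]
print(features)
```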
Category: Data Science

Feature Selection on Aggregated Target Data

I have a question about feature selection on a dataset where the target variable is aggregated as the sum of different data points. I want to predict the number of sales depending on a variety of features such as: week, price per unit, store country, store city, 2-3 other categorical meta-data fields, and other features. I am aware that this data should be interpreted as a time series, but because of the lack of available historical data, no model can compete with the naive …
Category: Data Science

How to use scikit-learn to extract features from text when I only have positive and unlabeled data?

I'm looking for something similar to this https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py but instead of positive and negative examples, I have positive examples and a bunch of unlabeled data that will contain some positive examples but is mostly negative. I'm planning on using this in a pipeline to transform text data into a vector, then feeding it into a classifier using https://pulearn.github.io/pulearn/doc/pulearn/ The issue is that I'm not sure of the best way to build the preprocessing stage where I transform the raw text data into …
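A hedged sketch of the preprocessing idea: fit a TfidfVectorizer on all documents (positive and unlabeled) and hand the resulting matrix to a PU wrapper. The ElkanotoPuClassifier import, its estimator/hold_out_ratio signature, and the 1 / -1 label convention are assumptions based on the pulearn documentation and should be double-checked there:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from pulearn import ElkanotoPuClassifier  # assumed import path; verify against the pulearn docs

# Toy corpus: 20 known positives, 20 unlabeled documents.
texts = [f"positive topic document number {i}" for i in range(20)] + \
        [f"unlabeled mixed document number {i}" for i in range(20)]
y = np.array([1] * 20 + [-1] * 20)   # assumed convention: 1 = positive, -1 = unlabeled

# Fit the vectorizer on the full corpus so new documents are projected
# into the same feature space at prediction time.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()   # dense copy, in case the PU wrapper needs it

base = LogisticRegression(max_iter=1000)        # any classifier with probability outputs
pu = ElkanotoPuClassifier(estimator=base, hold_out_ratio=0.2)
pu.fit(X, y)

# New text goes through the already-fitted vectorizer only (no refit).
X_new = vectorizer.transform(["a brand new document"]).toarray()
print(pu.predict(X_new))
```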
Category: Data Science

Using cross validation score to perform feature selection

So to perform my feature selection I ran cross validation over and over again, each time trying different subsets of my attributes, and repeated this until I got the best cross validation score I could get. Is this alright to do, or am I creating a major bias? I suspect that this could cause a bias and possibly result in data leakage, because I am probably learning something about my test set by doing this, but how bad of a …
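For context, a sketch of the usual leak-free pattern: put the selector inside a Pipeline so it is re-fitted on the training part of every fold, and let an outer cross_val_score provide the performance estimate (synthetic data, illustrative k):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

# The selector lives inside the pipeline, so each CV fold chooses its features
# using only that fold's training data; the held-out fold never influences the choice.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```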
Category: Data Science

Feature selection for regression

Suppose I have a response variable y and a set of feature variables (x1, x2 ... xn). I wish to find which of x1...xn are the best features for y in a regression problem (the relationship might not be linear). Is there any way I can do this kind of feature selection without using any correlation measure or regression function in the process (i.e. I cannot use any filter or wrapper methods)?
Category: Data Science

How To Develop Cluster Models Where the Clusters Occur Along Subsets of Dimensions in Multidimensional Data?

I have been exploring clustering algorithms (K-Means, K-Medoids, Ward Agglomerative, Gaussian Mixture Modeling, BIRCH, DBSCAN, OPTICS, Common Nearest-Neighbour Clustering) with multidimensional data. I believe that the clusters in my data occur across different subsets of the features rather than occurring across all features, and I believe that this impacts the performance of the clustering algorithms. To illustrate, below is Python code for a simulated dataset: ## Simulate a dataset. import numpy as np, matplotlib.pyplot as plt from sklearn.cluster import KMeans …
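Since the simulation code in the excerpt is cut off, here is a stand-in (not the author's code) for the kind of data described: two clusters that separate only along a couple of features, with the remaining dimensions being pure noise, plus a quick check of how KMeans fares on all features versus the relevant subspace:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n = 300

# Two groups that differ ONLY in features 0-1; the other 20 features are noise.
labels = rng.integers(0, 2, size=n)
informative = rng.normal(loc=labels[:, None] * 3.0, scale=1.0, size=(n, 2))
noise = rng.normal(size=(n, 20))
X = np.hstack([informative, noise])

km_all = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
km_sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[:, :2])

# Clustering on the relevant subspace typically recovers the true grouping better,
# because the noise dimensions dilute the distances used on the full feature set.
print("all features    :", adjusted_rand_score(labels, km_all.labels_))
print("subspace (0, 1) :", adjusted_rand_score(labels, km_sub.labels_))
```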
Category: Data Science

scikit-learn OMP memory error

I tried to use the OMP (Orthogonal Matching Pursuit) algorithm available in scikit-learn. My total data size, which includes both the target signal and the dictionary, is about 1 GB. However, when I ran the code, it exited with a memory error. The machine has 16 GB of RAM, so I don't think this should have happened. I added some logging to see where the error occurred and found that the data was loaded completely into NumPy arrays, and it was the algorithm itself that caused the error. Can someone help me with this …
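A hedged sketch of memory-conscious settings to try with scikit-learn's OrthogonalMatchingPursuit: load the arrays as float32, cap the number of nonzero coefficients, and disable the precomputed Gram matrix (whether this is enough depends on the actual dictionary shape; the data below is made up):

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)

# Made-up dictionary (n_samples x n_atoms) and target signal; real data would be
# loaded from disk, ideally already as float32 to roughly halve the memory footprint.
D = rng.normal(size=(5000, 2000)).astype(np.float32)
y = rng.normal(size=5000).astype(np.float32)

omp = OrthogonalMatchingPursuit(
    n_nonzero_coefs=50,   # cap the sparsity instead of using the 10%-of-features default
    precompute=False,     # skip building the (n_atoms x n_atoms) Gram matrix up front
)
omp.fit(D, y)
print(np.count_nonzero(omp.coef_))
```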
Category: Data Science
