Using PCA features in production

I'm struggling to figure out how to take PCA into production so that I can test my models on unknown samples. I'm using both a one-hot encoding and a TF-IDF encoding to classify my elements with various models, mainly KNN. I know I can use the pretrained one-hot encoder and the TF-IDF encoder to encode new elements so that they match the final feature vector. Since these feature vectors become very large, I use a PCA in …
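A minimal sketch of one way to do this, with hypothetical columns color and text and TruncatedSVD standing in for PCA (it accepts the sparse TF-IDF matrix directly): fit every transformer on the training data only, persist the fitted objects, and reuse them unchanged to transform unseen samples.

```python
import joblib
import pandas as pd
from scipy.sparse import hstack
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD  # PCA-like, works on sparse TF-IDF
from sklearn.neighbors import KNeighborsClassifier

# toy training data with hypothetical columns
train = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "text":  ["small gadget", "large widget", "small widget"],
    "label": [0, 1, 0],
})

ohe = OneHotEncoder(handle_unknown="ignore")
tfidf = TfidfVectorizer()
X_train = hstack([ohe.fit_transform(train[["color"]]),
                  tfidf.fit_transform(train["text"])])

svd = TruncatedSVD(n_components=2, random_state=0)   # dimensionality reduction
Z_train = svd.fit_transform(X_train)

knn = KNeighborsClassifier(n_neighbors=1).fit(Z_train, train["label"])
joblib.dump((ohe, tfidf, svd, knn), "model.joblib")   # ship the fitted objects

# at inference time: transform (never fit) the new sample with the same objects
ohe, tfidf, svd, knn = joblib.load("model.joblib")
new = pd.DataFrame({"color": ["blue"], "text": ["large gadget"]})
X_new = hstack([ohe.transform(new[["color"]]), tfidf.transform(new["text"])])
print(knn.predict(svd.transform(X_new)))
```

The key point is that `fit_transform` is only ever called during training; at inference time the stored encoders, the reducer and the KNN model only `transform` and `predict`.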
Category: Data Science

Heuristics and methods to speed up searches over subsets of a big set (probably combinatorially NP-hard)

I have a reasonably sized set of size N (say 10,000 objects) in which I am searching for groups of compatible elements. That is, I have a function y = f(x_1, x_2, x_3, ..., x_n) returning a boolean 0/1 answer for whether the n elements are compatible. We are interested in executing this search over every subset of N with fewer than 8 elements, which is obviously NP-hard or close to it. Even for a pairwise search over an n-element set we have …
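One common heuristic, sketched below under the assumption that compatibility is hereditary (a group can only be compatible if every smaller sub-group is compatible): grow candidate groups level by level, Apriori-style, so incompatible sub-groups prune the vast majority of larger subsets before f is ever evaluated on them. The compatibility rule and object list here are toy stand-ins.

```python
def is_compatible(group):                      # placeholder for f(x_1, ..., x_n)
    return sum(group) % 3 != 0                 # arbitrary toy rule

objects = list(range(30))                      # stand-in for the 10,000 objects
MAX_SIZE = 7                                   # subsets "smaller than 8 elements"

levels = {1: {frozenset([x]) for x in objects if is_compatible((x,))}}
for k in range(2, MAX_SIZE + 1):
    prev = levels[k - 1]
    candidates = set()
    for group in prev:
        for x in objects:
            if x not in group:
                cand = group | {x}
                # prune: every (k-1)-subset must already be known compatible
                if all(cand - {y} in prev for y in cand):
                    candidates.add(frozenset(cand))
    levels[k] = {c for c in candidates if is_compatible(tuple(c))}
    if not levels[k]:
        break

print({k: len(v) for k, v in levels.items()})
```

If compatibility is not hereditary this pruning is not exact, but it can still act as a cheap filter before an exhaustive check of the remaining candidates.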
Category: Data Science

Reduce the number of vectors in a dataset while achieving the "same average dimensions result"?

Edit for re-opening the question; I'll try to answer the questions asked by @user2974951: I have large user-preference statistics for trichotomous data sets. You can visualize each data trio as a 3D vector with X, Y and Z values. All vectors satisfy X + Y + Z = 1 because of the trichotomous shape of the data I'm using. The data can also be visualized as points in an equilateral triangle. I have many tests, each with a …
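If the goal is a smaller set of representative vectors whose weighted mean matches the original data, one option (a sketch under that assumption, not necessarily what the question is ultimately after) is to cluster the simplex points and keep the centroids together with their cluster sizes as weights; the weighted average of k-means centroids equals the original mean by construction.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
raw = rng.random((5000, 3))
points = raw / raw.sum(axis=1, keepdims=True)      # project onto X + Y + Z = 1

k = 20                                             # reduced number of vectors
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
weights = np.bincount(km.labels_, minlength=k)     # how many points each centroid replaces
centroids = km.cluster_centers_                    # still lie on the simplex

original_mean = points.mean(axis=0)
reduced_mean = np.average(centroids, axis=0, weights=weights)
print(np.allclose(original_mean, reduced_mean))    # True: the average is preserved
```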
Category: Data Science

Feature reduction by removing certain columns in dataframe

I am working on an emotion recognition model with the IEMOCAP dataset. For feature extraction, I take the mel-spectrogram, convert it into a NumPy array, and then turn the array into a dataframe of spectrogram features. The generated dataframe has a shape of 2380 rows × 11761 columns; for example, row 262 contains values such as 0.036491, 0.037793, 0.041035, 0.044644, 0.047210, 0.048467, 0.049556, 0.052137, ..., with 0.0 in the trailing columns …
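A minimal sketch of one simple reduction step, using a small stand-in array instead of the real 2380 × 11761 dataframe: drop the (near-)constant columns, such as the all-zero trailing columns, with scikit-learn's VarianceThreshold.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 50)))       # stand-in for the spectrogram features
df[49] = 0.0                                   # an all-zero column, like the trailing 0.0 columns

selector = VarianceThreshold(threshold=1e-6)   # remove (near-)constant columns
reduced = selector.fit_transform(df)
kept_columns = df.columns[selector.get_support()]

print(df.shape, "->", reduced.shape)
print("dropped:", sorted(set(df.columns) - set(kept_columns)))
```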
Category: Data Science

How should I encode 'dynamic' features (with multiple instances) along with 'static' features (single instances)?

Suppose I have to predict whether a certain product from an assembly line in a factory will be scrap. This product has, let's say, 'static' data such as a certain shape, a certain vendor, etc. It can also have 'dynamic' data, meaning it can have, for example, one or more sets of measurements (pressures, temperatures, etc.) from the production processes. How should I treat these 'dynamic' features? Somehow it doesn't seem right to repeat the 'static' data for all 'dynamic' events. …
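One common pattern, sketched with hypothetical column names (product_id, pressure, temperature): aggregate the variable number of 'dynamic' measurement events per product into fixed-size statistics and join them onto the single 'static' row, instead of repeating the static data for every event.

```python
import pandas as pd

static = pd.DataFrame({
    "product_id": [1, 2],
    "shape": ["round", "square"],
    "vendor": ["A", "B"],
})

dynamic = pd.DataFrame({                            # one row per measurement event
    "product_id":  [1, 1, 1, 2, 2],
    "pressure":    [2.1, 2.3, 2.2, 1.8, 1.9],
    "temperature": [80, 82, 81, 75, 77],
})

# collapse the variable-length event history into fixed-size per-product statistics
agg = dynamic.groupby("product_id").agg(["mean", "std", "min", "max", "count"])
agg.columns = ["_".join(col) for col in agg.columns]   # flatten MultiIndex names
agg = agg.reset_index()

features = static.merge(agg, on="product_id", how="left")
print(features)
```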
Category: Data Science

Information compression for variable input size

Is there a way to compress information from inputs of variable size? An autoencoder requires standardized input sizes. Although I can add masks to the cost function and add dummy features to standardize the input/output size, I am hesitant because of the potential drawbacks. The input structures I am interested in are graphs and images. If input sizes and shapes vary too much, padding, resizing and rescaling do not work.
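For images, one workaround (a PyTorch sketch with arbitrary layer choices, not a full autoencoder) is an encoder whose adaptive pooling layer collapses any spatial size to a fixed grid, so inputs of different shapes map to a code of the same length without padding or resizing; for graphs, global pooling over node embeddings in a graph neural network plays the same role.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((4, 4)),              # any HxW -> fixed 4x4 feature map
    nn.Flatten(),
    nn.Linear(32 * 4 * 4, 64),                 # fixed-length 64-dim code
)

for h, w in [(37, 53), (128, 96), (224, 224)]: # variable input sizes
    x = torch.randn(1, 3, h, w)
    print(x.shape, "->", encoder(x).shape)     # always torch.Size([1, 64])
```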
Category: Data Science

Correlation Matrix for non-numeric features

Currently, I have a dataset with numeric as well as non-numeric attributes. I am trying to remove the redundant features in the dataset using the R programming language. Note: the non-numeric attributes cannot be turned into binary. The caret R package provides findCorrelation, which analyzes a correlation matrix of your data's attributes and reports the attributes that can be removed. However, it only works on numeric values of 'x'. I have been unable to find a package which does this for non-numeric attributes. Is …
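The question asks for R, but as a sketch of the underlying idea (shown here in Python): build an association matrix for categorical columns with pairwise Cramér's V and drop one column of any highly associated pair, analogous to what caret::findCorrelation does for numeric correlations. The column names and data are made up.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V association between two categorical series."""
    table = pd.crosstab(x, y)
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

df = pd.DataFrame({
    "color":  ["red", "red", "blue", "blue", "green", "green"] * 10,
    "size":   ["S", "S", "M", "M", "L", "L"] * 10,      # redundant with color
    "region": ["N", "S", "E", "W", "N", "S"] * 10,
})

cols = df.columns
assoc = pd.DataFrame([[cramers_v(df[a], df[b]) for b in cols] for a in cols],
                     index=cols, columns=cols)
print(assoc.round(2))    # 'size' vs 'color' is ~1.0, so one of them can be removed
```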
Category: Data Science

Does it make sense to randomly select features as a baseline?

In my paper, I report that the classification accuracy is $x\%$ when using the top N features. My supervisor thinks that we should also report the classification accuracy when using N randomly selected features, to show that the initial feature selection technique makes an actual difference. Does this make sense? I've argued that no one cares about randomly selected features, so this addition doesn't make sense. It's quite obvious that randomly selecting features will provide a worse classification accuracy …
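A minimal sketch of the comparison the supervisor is asking for, on toy data with my own choice of selector and model: measure cross-validated accuracy with the top N features and with N randomly chosen features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           random_state=0)
N = 10
model = LogisticRegression(max_iter=1000)

top = SelectKBest(f_classif, k=N).fit(X, y).get_support(indices=True)
rng = np.random.default_rng(0)
rand = rng.choice(X.shape[1], size=N, replace=False)   # the random baseline

for name, idx in [("top-N", top), ("random-N", rand)]:
    acc = cross_val_score(model, X[:, idx], y, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```

In a real experiment the selection step would go inside the cross-validation loop to avoid leakage; this only shows the shape of the baseline comparison.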
Category: Data Science

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.