I am struggling to figure out how to take PCA into production in order to test my models on unknown samples. I'm using both a one-hot encoding and a TF-IDF encoding to classify my elements with various models, mainly KNN. I know I can use the pretrained one-hot encoder and the TF-IDF encoder to encode new elements so that they match the final feature vector. Since these feature vectors become very large, I use a PCA in …
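For reference, a minimal sketch of the setup I have in mind, using scikit-learn (the column names "category" and "text", the data frames `train_df`/`new_df`, and the component count are hypothetical placeholders): fit the encoders, the reduction and the model once as a single pipeline, persist it, and reuse the fitted objects for new samples.

```python
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# "category" and "text" are hypothetical column names; train_df / new_df are
# hypothetical DataFrames standing in for the training and production data.
encode = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["category"]),
    ("tfidf", TfidfVectorizer(), "text"),
])

pipe = Pipeline([
    ("encode", encode),
    # TruncatedSVD stands in for PCA here because the encoded matrix is sparse.
    ("reduce", TruncatedSVD(n_components=100)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

pipe.fit(train_df, train_labels)          # fit encoders + reduction + model once
joblib.dump(pipe, "knn_pipeline.joblib")  # persist the whole fitted pipeline

# In production: reload and predict; the new samples are encoded and projected
# with the parameters learned at training time.
pipe = joblib.load("knn_pipeline.joblib")
predictions = pipe.predict(new_df)
```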
I have a reasonably sized set of N objects (say 10,000) in which I am searching for groups of compatible elements. That is, I have a function y = f(x_1, x_2, x_3, ..., x_n) returning a boolean 0/1 answer indicating whether the n elements are compatible. We are interested in running this search over every subset of N with fewer than 8 elements, which is obviously NP-hard or close to it. Even for a pairwise search over an n-element set we have …
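For concreteness, a minimal brute-force sketch of the search I mean (assuming `f` is the black-box compatibility test described above) looks like this; it is exactly this enumeration that becomes infeasible as the subset size grows.

```python
from itertools import combinations

def compatible_groups(elements, f, max_size=7):
    """Yield every subset of 2..max_size elements for which f says 'compatible'.

    Brute force: for N = 10,000 the pairwise level alone is ~5e7 calls to f,
    and the number of subsets explodes combinatorially for larger sizes.
    """
    for k in range(2, max_size + 1):
        for subset in combinations(elements, k):
            if f(*subset):
                yield subset
```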
Edit for re-opening the question; I'll try to answer the questions asked by @user2974951: I have large user preference statistics for trichotomous data sets. You can visualize each data trio as a 3D vector with X, Y and Z values. All vectors satisfy X + Y + Z = 1 because of the trichotomous nature of the data I'm using. The data can also be visualized as points in an equilateral triangle. I have many tests, each with a …
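To make the data shape concrete, here is a minimal sketch with synthetic trios (the real values come from my user preference statistics, not from a random generator like this):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the preference trios: normalize random positives so
# that every row satisfies X + Y + Z = 1 (points on the 2-simplex).
raw = rng.random((1000, 3))
trios = raw / raw.sum(axis=1, keepdims=True)

# Barycentric-to-Cartesian mapping: each trio becomes a point inside an
# equilateral triangle with vertices (0, 0), (1, 0) and (0.5, sqrt(3)/2).
tri_x = trios[:, 1] + 0.5 * trios[:, 2]
tri_y = (np.sqrt(3) / 2) * trios[:, 2]
```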
I am working on an emotion recognition model with the IEMOCAP dataset. For feature extraction, I take the mel-spectrogram, convert it into a NumPy array, and then convert the array into a data frame of spectrogram features. The generated dataframe has a shape of 2380 rows x 11761 columns; for example, row 262 starts with the values 0.036491, 0.037793, 0.041035, 0.044644, 0.047210, 0.048467, 0.049556, 0.052137, ... and ends with 0.0 …
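For reference, a minimal sketch of the feature-extraction step that produces this kind of very wide data frame (assuming librosa, hypothetical wav file names, and clips trimmed/padded to a common length so every row has the same width):

```python
import librosa
import pandas as pd

def mel_row(path, sr=16000, seconds=4, n_mels=40):
    """Load one utterance and return its mel-spectrogram flattened to one row."""
    y, sr = librosa.load(path, sr=sr)
    y = librosa.util.fix_length(y, size=sr * seconds)   # equal length -> equal row width
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel).flatten()

paths = ["Ses01F_impro01_F000.wav"]                      # hypothetical IEMOCAP file names
features = pd.DataFrame([mel_row(p) for p in paths])
print(features.shape)                                    # (n_utterances, n_mels * n_frames)
```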
Suppose I have to predict whether a certain product from an assembly line in a factory will be scrap. This product has, let's say, 'static' data such as a certain shape, a certain vendor, etc. It can also have 'dynamic' data, meaning for example one or more sets of measurements (pressures, temperatures, etc.) from production processes. How should I treat these 'dynamic' features? Somehow it doesn't seem right to repeat the 'static' data for all 'dynamic' events. …
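One option I am considering, shown here as a minimal sketch with tiny hypothetical tables: aggregate the dynamic events to one row per product and join the result onto the static features, instead of repeating the static data for every event.

```python
import pandas as pd

# Hypothetical example data: one row per product (static) and one row per
# measurement event (dynamic).
static_df = pd.DataFrame({"product_id": [1, 2], "shape": ["A", "B"], "vendor": ["X", "Y"]})
dynamic_df = pd.DataFrame({
    "product_id":  [1, 1, 1, 2, 2],
    "pressure":    [1.0, 1.2, 0.9, 2.1, 2.0],
    "temperature": [80, 82, 79, 95, 96],
})

# Summarize the dynamic events per product, then merge onto the static data.
agg = dynamic_df.groupby("product_id").agg(
    pressure_mean=("pressure", "mean"),
    pressure_max=("pressure", "max"),
    temperature_mean=("temperature", "mean"),
    n_events=("pressure", "size"),
).reset_index()

features = static_df.merge(agg, on="product_id", how="left")
print(features)
```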
Is there a way to compress information from inputs of variable size? An autoencoder requires standardized input sizes. Although I can add masks to the cost function and add dummy features to standardize the input/output size, I am hesitant because of the potential drawbacks. The input structures I am interested in are graphs and images. If input sizes and shapes vary too much, padding, resizing and rescaling do not work.
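By masks on the cost function I mean something like this minimal NumPy sketch: pad variable-length inputs to a common size and exclude the padded positions from the reconstruction error, so the autoencoder is not pulled toward reconstructing dummy values.

```python
import numpy as np

def pad_and_mask(samples, size):
    """Zero-pad each 1-D sample to `size` and return the corresponding mask."""
    batch = np.zeros((len(samples), size))
    mask = np.zeros((len(samples), size))
    for i, s in enumerate(samples):
        batch[i, : len(s)] = s
        mask[i, : len(s)] = 1.0
    return batch, mask

def masked_mse(x, x_hat, mask):
    """Mean squared reconstruction error over the real (unmasked) entries only."""
    return np.sum(mask * (x - x_hat) ** 2) / np.sum(mask)

samples = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])]
x, mask = pad_and_mask(samples, size=4)
print(masked_mse(x, x * 0.9, mask))   # toy "reconstruction"
```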
Currently, I have a dataset with numeric as well as non-numeric attributes. I am trying to remove the redundant features in the dataset using the R programming language. Note: the non-numeric attributes cannot be turned into binary. The caret R package provides findCorrelation, which analyzes a correlation matrix of your data's attributes and reports on attributes that can be removed. However, it only works on numeric values of 'x'. I have been unable to find a package which does this for non-numeric attributes. Is …
In my paper, I report that the classification accuracy is $x\%$ when using the top $N$ features. My supervisor thinks that we should also capture the classification accuracy when using $N$ randomly selected features, to show that the initial feature selection technique makes an actual difference. Does this make sense? I've argued that no one cares about randomly selected features, so this addition doesn't make sense. It's quite obvious that randomly selecting features will provide a worse classification accuracy …
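If we did add the comparison, it would be something like this minimal sketch (scikit-learn, a stand-in dataset and classifier rather than my actual ones, and the random choice averaged over several draws):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset
N = 10
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Accuracy with the top-N features chosen by a univariate score.
top = make_pipeline(SelectKBest(f_classif, k=N), clf)
acc_top = cross_val_score(top, X, y, cv=5).mean()

# Accuracy with N randomly selected features, averaged over several draws.
rng = np.random.default_rng(0)
random_scores = []
for _ in range(20):
    cols = rng.choice(X.shape[1], size=N, replace=False)
    random_scores.append(cross_val_score(clf, X[:, cols], y, cv=5).mean())

print(f"top-{N}: {acc_top:.3f}, random-{N}: {np.mean(random_scores):.3f}")
```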
I'm curious to know whether feature selection and/or feature reduction techniques exist that are linear in the number of data points $n$ and in the number of dimensions $d$. References and source code are very welcome.