I want to understand which algorithm is most effective for splitting. I have ids of users and I want to split them into 2 groups. I currently have 2 options. Modulo approach: say we place all even ids into one group and all odd ids into the other. Pros: for any sequence we get a uniform distribution of users, so for any day or hour, users who registered during that time will be divided equally between the 2 …
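A minimal sketch of the two options, assuming integer user ids; the md5-based variant and the salt string "exp1" are illustrative choices, not a prescribed method:

```python
import hashlib

def modulo_group(user_id: int) -> int:
    # Even ids go to group 0, odd ids to group 1.
    return user_id % 2

def hash_group(user_id: int, salt: str = "exp1") -> int:
    # A salted hash decorrelates the split from id parity and from
    # any other experiment that also splits on the same ids.
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 2

print(modulo_group(42), hash_group(42))
```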
I am building a model that uses TF-IDF NLP features in Spark MLlib. The TF-IDF HashingTF function in MLlib uses the 'hashing trick' to efficiently allocate terms to features. My question is: does the hashing trick work as an effective form of dimensionality reduction? Since I can choose the number of features generated by HashingTF, can I choose a relatively small number of features (say, only 512 or 1024 features) and be confident that the allocation will retain …
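As a rough sketch of where the feature count enters the pipeline, here is a toy example using the DataFrame-based pyspark.ml API (the RDD-based mllib HashingTF takes the number of features the same way); the data is made up:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0, "the quick brown fox"),
                            (1, "the lazy dog")], ["id", "text"])

tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)

# numFeatures controls the hash space; smaller values save memory
# but raise the chance that distinct terms collide into one index.
tf = HashingTF(inputCol="words", outputCol="rawFeatures",
               numFeatures=512).transform(tokens)

idf_model = IDF(inputCol="rawFeatures", outputCol="features").fit(tf)
idf_model.transform(tf).select("features").show(truncate=False)
```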
How should I choose n_features for FeatureHasher in scikit-learn? Assume I have 1000 categories in the feature "case" and I would like to hash them.
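A sketch of the usual heuristic: pick a power of two comfortably above the category count so collisions stay rare. The rows and the choice of 2**12 below are illustrative:

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical rows with a single categorical feature "case".
rows = [{"case": f"category_{i % 1000}"} for i in range(5000)]

# 2**12 = 4096 slots for ~1000 categories keeps collisions rare
# while staying far smaller than one-hot encoding a huge vocabulary.
hasher = FeatureHasher(n_features=2**12, input_type="dict")
X = hasher.transform(rows)
print(X.shape)  # (5000, 4096), a scipy.sparse CSR matrix
```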
I have a huge data set with one of the columns named 'mail_id'. The mail_id values come in a rather cryptic format, as shown below:

mail_id
DQ/4I+GIOz2ZoIiK0Lg0AkwnI35XotghgUK/MYc101I=
BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
EHNBRbi6i9KO6cMHsuDPFjZVp2cY3RH+BiOKwPwzLQs=
K0y/NW59TJkYc5y0HUwDeAXrewYT0JQlkcozz0s2V5Q=
UGATDXARg7jMEInKH7oXgty2nwxnwD2l0OW/8Nsa0MI=
qE9zgWiITYA97RUiN4X/t9IVWLViLz+lKijaYegyBiQ=
BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
4+EEK8RbNYwuFCHznY9XSRCV4Yek60bHVgnP3jtjjzk=

After doing a lot of analysis on my data, I have found that I cannot drop this feature from my model, so I have to convert it into something meaningful. Can anyone please explain how to do this efficiently?
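Two common ways to turn an opaque id column into model input are frequency encoding and the hashing trick; a toy sketch with pandas and scikit-learn, where the column handling is an assumption about your data layout:

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Hypothetical frame holding the opaque mail_id column.
df = pd.DataFrame({"mail_id": [
    "DQ/4I+GIOz2ZoIiK0Lg0AkwnI35XotghgUK/MYc101I=",
    "BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=",
    "BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=",
]})

# Option 1: frequency encoding - repeated ids carry a count signal.
df["mail_id_count"] = df["mail_id"].map(df["mail_id"].value_counts())

# Option 2: hashing trick - map each id into a fixed sparse space.
hasher = FeatureHasher(n_features=2**10, input_type="string")
X = hasher.transform([[mid] for mid in df["mail_id"]])
print(df["mail_id_count"].tolist(), X.shape)
```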
I would like to perform appearance-based hashing for similarity detection. I have 500 photos for each of my categories, but I only want to keep the 100 of them that are most distinct. How should I go about this? Are there already well-known baselines for this? I have a preference for Python and PyTorch, or bash. Also, what are the other known methods for this task? I was thinking I could also run resnet50 on the images, extract a 2048 …
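One possible baseline along the lines you describe: extract 2048-d resnet50 embeddings and greedily keep the 100 most mutually distant images (farthest-point sampling). The paths are placeholders, and the weights API assumes torchvision >= 0.13:

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Placeholder paths; swap in your own 500 images per category.
paths = [f"category_a/img_{i}.jpg" for i in range(500)]

# Drop the classifier head to get 2048-d embeddings.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406],
                                    [0.229, 0.224, 0.225])])

with torch.no_grad():
    embs = torch.stack([
        backbone(preprocess(Image.open(p).convert("RGB")).unsqueeze(0)).squeeze(0)
        for p in paths])

# Greedy farthest-point sampling: repeatedly add the image farthest
# from everything already kept, until 100 images are selected.
dists = torch.cdist(embs, embs)
keep = [0]
while len(keep) < 100:
    min_d = dists[:, keep].min(dim=1).values
    keep.append(int(min_d.argmax()))
print(keep[:10])
```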
I'm experiencing an issue with a RandomizedSearchCV grid that is not able to evaluate all of the fits. 50 of the 100 fits I'm calling do not get scored (score=nan), so I'm worried I'm wasting a lot of time trying to run the grid search. I'm wondering how to troubleshoot this; I haven't found anything in the past few days and I'm hopeful that the community can help me squash this bug. Now, the details: I have constructed an XGBClassifier model …
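A standard first troubleshooting step is to rerun with error_score="raise", so scikit-learn surfaces the exception behind the nan scores instead of silently recording them; the data and search space below are hypothetical stand-ins:

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Hypothetical search space; score=nan usually means some sampled
# parameter combinations raise inside fit and get scored as nan.
params = {"max_depth": randint(2, 10),
          "learning_rate": uniform(0.01, 0.3)}

search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss"),
    params,
    n_iter=10,
    error_score="raise",  # re-raise instead of recording score=nan
    random_state=0,
)
search.fit(X, y)
```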
Field-aware factorization machines (FFMs) have proved to be useful in click-through-rate prediction tasks. One of their strengths comes from the hashing trick (feature hashing). When one uses the hashing trick from scikit-learn, one ends up with a sparse matrix. How can one then work with such a sparse matrix to still implement field-aware factorization machines? scikit-learn does not have an implementation of FFM. EDIT 1: I definitely want to perform feature hashing (the hashing trick) in order to be able to scale FFM …
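One workable route is to keep the field structure explicit: hash each field into its own index and emit libffm-format lines ("label field:index:value"), which FFM libraries such as xLearn consume. The rows and field names here are made up:

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical toy rows; FFM needs a field id alongside each hashed
# feature index, which a single flat hashed matrix throws away.
rows = [{"user": "u1", "ad": "a9", "site": "s3"},
        {"user": "u2", "ad": "a9", "site": "s7"}]
fields = ["user", "ad", "site"]

hasher = FeatureHasher(n_features=2**18, input_type="string")

def to_libffm(row, label):
    # libffm text format: "label field:index:value ..."
    parts = [str(label)]
    for field_id, field in enumerate(fields):
        idx = hasher.transform([[f"{field}={row[field]}"]]).indices[0]
        parts.append(f"{field_id}:{idx}:1")
    return " ".join(parts)

for row in rows:
    print(to_libffm(row, 1))  # feed these lines to an FFM trainer
```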