A/B testing split algorithm

I want to understand which algorithm is most effective for splitting. I have user ids and I want to split them into 2 groups. I am considering 2 variants. Modulo approach: place all even ids into one group and all odd ids into the other. Pros: for any sequence we get a uniform distribution of users, so for any day or hour, users who registered during that time will be equally divided between the 2 …
Category: Data Science
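The two variants can be sketched side by side. The hash-based alternative, a common replacement for plain modulo, salts the id with an experiment name so that successive experiments produce independent splits; the salt string and the choice of MD5 below are illustrative assumptions, not part of the question:

```python
import hashlib

def assign_modulo(user_id: int) -> str:
    # Modulo split: even ids -> group A, odd ids -> group B.
    return "A" if user_id % 2 == 0 else "B"

def assign_hashed(user_id: int, salt: str = "experiment-1") -> str:
    # Salted-hash split: deterministic for a given (salt, id) pair,
    # but a new salt yields a split independent of previous experiments.
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"
```

The modulo split is trivially uniform but reuses the same partition for every experiment; the hashed split stays close to 50/50 for any id sequence while letting each experiment draw its own partition.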

Hashing trick for dimensionality reduction

I am building a model that uses TF-IDF NLP features in Spark MLlib. The HashingTF function in MLlib uses the 'hashing trick' to efficiently allocate terms to features. My question is: does the hashing trick work as an effective form of dimensionality reduction? Since I can choose the number of features generated by HashingTF, can I choose a relatively small number of features (say, only 512 or 1024 features) and be confident that the allocation will retain …
Category: Data Science
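Spark aside, the effect of a small feature count can be sketched with scikit-learn's HashingVectorizer, which implements the same hashing trick: with only 512 columns, distinct terms may collide into the same column, so the reduction is lossy hashing rather than a learned projection. The toy documents are made up for illustration:

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the quick brown fox", "the lazy dog sleeps"]

# Same idea as MLlib's HashingTF: each term is hashed into one of
# n_features columns; unrelated terms can land in the same column.
vec = HashingVectorizer(n_features=512, alternate_sign=False, norm=None)
X = vec.transform(docs)  # sparse matrix of shape (2, 512)
```

Whether 512 features is enough depends on vocabulary size: collisions grow with the ratio of distinct terms to columns, and unlike PCA or an embedding, colliding terms are merged arbitrarily.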

Using scikit-learn FeatureHasher

I have a huge data set with one of the columns named 'mail_id'. The mail_id is given in an opaque format, as shown below:

mail_id
DQ/4I+GIOz2ZoIiK0Lg0AkwnI35XotghgUK/MYc101I=
BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
EHNBRbi6i9KO6cMHsuDPFjZVp2cY3RH+BiOKwPwzLQs=
K0y/NW59TJkYc5y0HUwDeAXrewYT0JQlkcozz0s2V5Q=
UGATDXARg7jMEInKH7oXgty2nwxnwD2l0OW/8Nsa0MI=
qE9zgWiITYA97RUiN4X/t9IVWLViLz+lKijaYegyBiQ=
BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
4+EEK8RbNYwuFCHznY9XSRCV4Yek60bHVgnP3jtjjzk=

After doing a lot of analysis of my data, I have found that I cannot drop this feature from my model, so I have to convert it into something meaningful. Can anyone please explain how to do this efficiently?
Category: Data Science
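A minimal sketch of one common approach: treat each opaque id as a categorical string and map it into a fixed-width sparse vector with scikit-learn's FeatureHasher. The n_features=16 value is an arbitrary choice for illustration; a real pipeline would use a larger width:

```python
from sklearn.feature_extraction import FeatureHasher

# A few of the opaque ids from the question.
ids = [
    "DQ/4I+GIOz2ZoIiK0Lg0AkwnI35XotghgUK/MYc101I=",
    "BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=",
    "BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=",
]

# input_type="string": each sample is a list of string-valued features.
hasher = FeatureHasher(n_features=16, input_type="string")
X = hasher.transform([[mid] for mid in ids])  # sparse, shape (3, 16)
```

Identical ids always hash to the same column, so repeated mail_ids (like the duplicate above) remain informative to the model without ever decoding what the id means.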

Appearance-based hashing for similarity detection for picking the 100 most distinct images out of 500 images

I would like to perform appearance-based hashing for similarity detection. I have 500 photos for each of my categories, but I only want to keep the 100 most distinct ones. How should I go about this? Are there well-known baselines for this task? I prefer Python with PyTorch, or bash. Also, what other methods are known for this task? I was thinking I could also run ResNet-50 on the images and extract a 2048 …
Category: Data Science
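One way to follow the ResNet-50 idea in the question: given one embedding per image, greedily pick the subset whose members are farthest apart (farthest-point sampling). The random features below stand in for real embeddings, and the function name is illustrative, not a standard API:

```python
import numpy as np

def select_most_distinct(features: np.ndarray, k: int) -> list[int]:
    # Greedy farthest-point sampling: start from image 0, then
    # repeatedly add the image farthest from the current selection.
    selected = [0]
    # Distance of every image to its nearest selected image so far.
    dists = np.linalg.norm(features - features[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(features - features[nxt], axis=1))
    return selected

# Stand-in for 500 ResNet-50 embeddings (2048-dim each).
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 2048))
keep = select_most_distinct(feats, 100)
```

Perceptual hashes (average hash, pHash) are a cheaper baseline for near-duplicate removal, while embedding-based selection like this captures higher-level appearance differences.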

RandomizedSearchCV() not scoring all fits

I'm experiencing an issue with a RandomizedSearchCV grid that is not able to evaluate all of the fits. 50 of the 100 fits I'm calling do not get scored (score=nan), so I'm worried I'm wasting a lot of time running the grid search. I'm wondering how to troubleshoot this; I haven't found anything in the past few days, so I'm hopeful that the community can help me squash this bug. Now, the details: I have constructed an XGBClassifier model …
Category: Data Science
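When half the fits come back with score=nan, a common first step is to rerun with error_score="raise", which makes scikit-learn surface the exception from a failing fit (often an invalid parameter combination) instead of silently recording nan. The sketch below uses LogisticRegression as a stand-in for the XGBClassifier so it stays self-contained, and the parameter grid is made up:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# error_score="raise" turns silent nan scores into visible tracebacks,
# pointing at the exact parameter combination that fails.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=200),
    param_distributions={"C": [0.1, 1.0]},
    n_iter=2,
    error_score="raise",
    random_state=0,
)
search.fit(X, y)
```

Once the offending combination is known, the nan scores can also be inspected after the fact in search.cv_results_["mean_test_score"].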

How to use hashing trick with field-aware factorization machines

Field-aware factorization machines (FFM) have proved to be useful in click-through rate prediction tasks. One of their strengths comes from the hashing trick (feature hashing). When one uses the hashing trick from scikit-learn, one ends up with a sparse matrix. How can one then work with such a sparse matrix and still implement field-aware factorization machines? scikit-learn does not have an implementation of FFM. EDIT 1: I definitely want to perform feature hashing (the hashing trick) in order to be able to scale FFM …
Category: Data Science
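The sticking point is that a flat hashed matrix loses the field information FFM needs: an FFM consumes (field, feature_index, value) triples, not a single sparse matrix. A sketch of one workaround, assuming invented field names and sizes for illustration, is to hash each field with its own FeatureHasher and keep the field id alongside the hashed column:

```python
from sklearn.feature_extraction import FeatureHasher

rows = [
    {"ad_id": "a1", "site": "news.example"},
    {"ad_id": "a2", "site": "sports.example"},
]
fields = ["ad_id", "site"]
n_features = 2 ** 4  # per-field hash width, tiny for illustration

# One hasher per field so the field id survives the hashing trick.
# alternate_sign=False keeps categorical values at +1.
hashers = {
    f: FeatureHasher(n_features=n_features, input_type="string",
                     alternate_sign=False)
    for f in fields
}

def to_ffm_row(row: dict) -> list[tuple[int, int, float]]:
    # Emit libffm-style (field, feature_index, value) triples.
    triples = []
    for field_idx, f in enumerate(fields):
        col = int(hashers[f].transform([[row[f]]]).nonzero()[1][0])
        triples.append((field_idx, col, 1.0))
    return triples
```

These triples can then be written out in libffm text format ("field:index:value") for an external FFM implementation such as libffm or xlearn.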

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.