A/B testing split algorithm

I want to understand which algorithm is most effective for splitting. I have user ids and I want to split them into 2 groups. I am considering 2 variants. Modulo approach: place all even ids into one group and all odd ids into the other. Pros: for any sequence we get a uniform distribution of users, so for any day or hour, users who registered during that time will be equally divided between the 2 …
Category: Data Science
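The two variants can be sketched side by side. The hash-based alternative, a common replacement for plain modulo, salts the id with an experiment name so that successive experiments produce independent splits; the salt string and the choice of MD5 below are illustrative assumptions, not part of the question:

```python
import hashlib

def assign_modulo(user_id: int) -> str:
    # Modulo split: even ids -> group A, odd ids -> group B.
    return "A" if user_id % 2 == 0 else "B"

def assign_hashed(user_id: int, salt: str = "experiment-1") -> str:
    # Salted-hash split: deterministic for a given (salt, id) pair,
    # but a new salt yields a split independent of previous experiments.
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"
```

The modulo split is trivially uniform but reuses the same partition for every experiment; the hashed split stays close to 50/50 for any id sequence while letting each experiment draw its own partition.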

Hashing trick for dimensionality reduction

I am building a model that uses TF-IDF NLP features in Spark MLlib. The HashingTF function in MLlib uses the 'hashing trick' to efficiently allocate terms to features. My question is: does the hashing trick work as an effective form of dimensionality reduction? Since I can choose the number of features generated by HashingTF, can I choose a relatively small number of features (say, only 512 or 1024 features) and be confident that the allocation will retain …
Category: Data Science
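Spark aside, the effect of a small feature count can be sketched with scikit-learn's HashingVectorizer, which implements the same hashing trick: with only 512 columns, distinct terms may collide into the same column, so the reduction is lossy hashing rather than a learned projection. The toy documents are made up for illustration:

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the quick brown fox", "the lazy dog sleeps"]

# Same idea as MLlib's HashingTF: each term is hashed into one of
# n_features columns; unrelated terms can land in the same column.
vec = HashingVectorizer(n_features=512, alternate_sign=False, norm=None)
X = vec.transform(docs)  # sparse matrix of shape (2, 512)
```

Whether 512 features is enough depends on vocabulary size: collisions grow with the ratio of distinct terms to columns, and unlike PCA or an embedding, colliding terms are merged arbitrarily.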

Using scikit-learn FeatureHasher

I have a huge data set with one of the columns named 'mail_id'. The mail_id is given in an opaque format, as shown below:

mail_id
DQ/4I+GIOz2ZoIiK0Lg0AkwnI35XotghgUK/MYc101I=
BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
EHNBRbi6i9KO6cMHsuDPFjZVp2cY3RH+BiOKwPwzLQs=
K0y/NW59TJkYc5y0HUwDeAXrewYT0JQlkcozz0s2V5Q=
UGATDXARg7jMEInKH7oXgty2nwxnwD2l0OW/8Nsa0MI=
qE9zgWiITYA97RUiN4X/t9IVWLViLz+lKijaYegyBiQ=
BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=
4+EEK8RbNYwuFCHznY9XSRCV4Yek60bHVgnP3jtjjzk=

After doing a lot of analysis of my data, I have found that I cannot drop this feature from my model, so I have to convert it into something meaningful. Can anyone please explain how to do this efficiently?
Category: Data Science
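A minimal sketch of one common approach: treat each opaque id as a categorical string and map it into a fixed-width sparse vector with scikit-learn's FeatureHasher. The n_features=16 value is an arbitrary choice for illustration; a real pipeline would use a larger width:

```python
from sklearn.feature_extraction import FeatureHasher

# A few of the opaque ids from the question.
ids = [
    "DQ/4I+GIOz2ZoIiK0Lg0AkwnI35XotghgUK/MYc101I=",
    "BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=",
    "BL3z4RtiyfIDydaRYWX2ZXL6IX10QH1yG5ak1s/8Lls=",
]

# input_type="string": each sample is a list of string-valued features.
hasher = FeatureHasher(n_features=16, input_type="string")
X = hasher.transform([[mid] for mid in ids])  # sparse, shape (3, 16)
```

Identical ids always hash to the same column, so repeated mail_ids (like the duplicate above) remain informative to the model without ever decoding what the id means.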

Appearance-based hashing for similarity detection for picking the 100 most distinct images out of 500 images

I would like to perform appearance-based hashing for similarity detection. I have 500 photos for each of my categories, but I only want to keep the 100 most distinct ones. How should I go about this? Are there well-known baselines for this task? I prefer Python with PyTorch, or bash. Also, what other methods are known for this task? I was thinking I could also run ResNet-50 on the images and extract a 2048 …
Category: Data Science
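One way to follow the ResNet-50 idea in the question: given one embedding per image, greedily pick the subset whose members are farthest apart (farthest-point sampling). The random features below stand in for real embeddings, and the function name is illustrative, not a standard API:

```python
import numpy as np

def select_most_distinct(features: np.ndarray, k: int) -> list[int]:
    # Greedy farthest-point sampling: start from image 0, then
    # repeatedly add the image farthest from the current selection.
    selected = [0]
    # Distance of every image to its nearest selected image so far.
    dists = np.linalg.norm(features - features[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(features - features[nxt], axis=1))
    return selected

# Stand-in for 500 ResNet-50 embeddings (2048-dim each).
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 2048))
keep = select_most_distinct(feats, 100)
```

Perceptual hashes (average hash, pHash) are a cheaper baseline for near-duplicate removal, while embedding-based selection like this captures higher-level appearance differences.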

RandomizedSearchCV() not scoring all fits

I'm experiencing an issue with a RandomizedSearchCV grid that is not able to evaluate all of the fits. 50 of the 100 fits I'm calling do not get scored (score=nan), so I'm worried I'm wasting a lot of time running the grid search. I'm wondering how to troubleshoot this; I haven't found anything in the past few days, so I'm hopeful that the community can help me squash this bug. Now, the details: I have constructed an XGBClassifier model …
Category: Data Science
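When half the fits come back with score=nan, a common first step is to rerun with error_score="raise", which makes scikit-learn surface the exception from a failing fit (often an invalid parameter combination) instead of silently recording nan. The sketch below uses LogisticRegression as a stand-in for the XGBClassifier so it stays self-contained, and the parameter grid is made up:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# error_score="raise" turns silent nan scores into visible tracebacks,
# pointing at the exact parameter combination that fails.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=200),
    param_distributions={"C": [0.1, 1.0]},
    n_iter=2,
    error_score="raise",
    random_state=0,
)
search.fit(X, y)
```

Once the offending combination is known, the nan scores can also be inspected after the fact in search.cv_results_["mean_test_score"].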

How to use hashing trick with field-aware factorization machines

Field-aware factorization machines (FFM) have proved to be useful in click-through rate prediction tasks. One of their strengths comes from the hashing trick (feature hashing). When one uses the hashing trick from scikit-learn, one ends up with a sparse matrix. How can one then work with such a sparse matrix and still implement field-aware factorization machines? scikit-learn does not have an implementation of FFM. EDIT 1: I definitely want to perform feature hashing (the hashing trick) in order to be able to scale FFM …
Category: Data Science
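The sticking point is that a flat hashed matrix loses the field information FFM needs: an FFM consumes (field, feature_index, value) triples, not a single sparse matrix. A sketch of one workaround, assuming invented field names and sizes for illustration, is to hash each field with its own FeatureHasher and keep the field id alongside the hashed column:

```python
from sklearn.feature_extraction import FeatureHasher

rows = [
    {"ad_id": "a1", "site": "news.example"},
    {"ad_id": "a2", "site": "sports.example"},
]
fields = ["ad_id", "site"]
n_features = 2 ** 4  # per-field hash width, tiny for illustration

# One hasher per field so the field id survives the hashing trick.
# alternate_sign=False keeps categorical values at +1.
hashers = {
    f: FeatureHasher(n_features=n_features, input_type="string",
                     alternate_sign=False)
    for f in fields
}

def to_ffm_row(row: dict) -> list[tuple[int, int, float]]:
    # Emit libffm-style (field, feature_index, value) triples.
    triples = []
    for field_idx, f in enumerate(fields):
        col = int(hashers[f].transform([[row[f]]]).nonzero()[1][0])
        triples.append((field_idx, col, 1.0))
    return triples
```

These triples can then be written out in libffm text format ("field:index:value") for an external FFM implementation such as libffm or xlearn.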

About

Geeks Mental is a community that publishes articles and tutorials about Web, Android, Data Science, new techniques and Linux security.