IterativeImputer Evaluation

I am having a hard time evaluating my imputation model. I used an iterative imputer to fill in the missing values in all four columns, with a random forest as the estimator. Here is my code for imputing:

    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # enables the still-experimental IterativeImputer
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.impute import IterativeImputer

    imp_mean = IterativeImputer(estimator=RandomForestRegressor(), random_state=0)
    imp_mean.fit(my_data)
    my_data_filled = pd.DataFrame(imp_mean.transform(my_data), columns=my_data.columns)
    my_data_filled.head()

My problem is how I can evaluate my model. How can I know whether the filled values are right? I used a describe function before …
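One common way to check an imputer, sketched below under the assumption that my_data is an all-numeric DataFrame: hide a random sample of the observed values, re-run the imputer, and compare its predictions against the held-out ground truth. The 10% masking fraction is an arbitrary illustrative choice, and imp_mean is the imputer defined above.

    import numpy as np

    rng = np.random.default_rng(0)

    # Positions of cells that are actually observed.
    observed = np.argwhere(my_data.notna().to_numpy())

    # Hide 10% of the observed cells.
    hidden = observed[rng.choice(len(observed), size=len(observed) // 10, replace=False)]
    values = my_data.to_numpy().astype(float)
    truth = values[hidden[:, 0], hidden[:, 1]].copy()
    values[hidden[:, 0], hidden[:, 1]] = np.nan

    # Refit on the masked data and predict the hidden cells.
    filled = imp_mean.fit_transform(values)
    preds = filled[hidden[:, 0], hidden[:, 1]]

    rmse = np.sqrt(np.mean((preds - truth) ** 2))
    print(f"held-out RMSE: {rmse:.4f}")

Repeating this with several random masks gives a rough error distribution rather than a single number.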
Category: Data Science

Correlation of Wikipedia translated pages vs number of in-links looks weird (scatterplot)?

I'm trying to find a correlation measure for the number of Wikipedia pages an entity (an article) has been translated into versus the number of links that point to that page (both measures can indicate the popularity of a page). For instance, I have rows like:

    Work, links, wikipediaTranslatedPages
    The name of the rose, 500, 53

I used a scatterplot, but it looks weird. Is it wrong?
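With heavily skewed count data, most points bunch up near the origin and the scatterplot looks degenerate even when a relationship exists. A sketch of re-plotting on log-log axes, using hypothetical rows in the format from the question:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical data in the Work, links, wikipediaTranslatedPages format.
    df = pd.DataFrame(
        {"links": [500, 12, 300, 45, 2100], "wikipediaTranslatedPages": [53, 4, 40, 9, 87]},
        index=["The name of the rose", "Work B", "Work C", "Work D", "Work E"],
    )

    ax = df.plot.scatter(x="links", y="wikipediaTranslatedPages")
    ax.set_xscale("log")  # log scales spread out skewed counts
    ax.set_yscale("log")
    plt.show()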
Category: Data Science

What correlation measure for Wikipedia translated pages vs number of in-links?

I'm trying to find a correlation measure for the number of Wikipedia pages an entity (an article) has been translated into versus the number of links that point to that page (both measures can indicate the popularity of a page). Is it possible to correlate them? For instance, I have rows like:

    Work, links, wikipediaTranslatedPages
    The name of the rose, 500, 53
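Since both variables are skewed counts, a rank-based measure such as Spearman's rho is a common choice over Pearson's r. A minimal sketch with scipy, on hypothetical rows in the format from the question:

    import pandas as pd
    from scipy.stats import spearmanr

    # Hypothetical Work, links, wikipediaTranslatedPages rows.
    df = pd.DataFrame(
        {"links": [500, 12, 300, 45, 2100], "wikipediaTranslatedPages": [53, 4, 40, 9, 87]}
    )

    # Spearman compares ranks, so it tolerates skew and outliers
    # better than Pearson on raw counts.
    rho, p_value = spearmanr(df["links"], df["wikipediaTranslatedPages"])
    print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")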
Category: Data Science

Minimum number of features for Naïve Bayes model

I keep reading that Naive Bayes needs fewer features than many other ML algorithms. But what's the minimum number of features you actually need to get good results (90% accuracy) with a Naive Bayes model? I know there is no objective answer -- it depends on your exact features and what in particular you are trying to learn -- but I'm looking for a ballpark number. I'm asking because I have a dataset with …
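There is no universal threshold, but it can be measured empirically on a given dataset by scoring a Naive Bayes model while varying how many features are kept. A sketch of that idea using scikit-learn's SelectKBest and a stand-in dataset (substitute your own X and y):

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import make_pipeline

    # Stand-in dataset with 30 features; replace with your own data.
    X, y = load_breast_cancer(return_X_y=True)

    # Score the model while keeping only the k best features, for growing k.
    for k in (1, 2, 5, 10, 20, 30):
        model = make_pipeline(SelectKBest(f_classif, k=k), GaussianNB())
        score = cross_val_score(model, X, y, cv=5).mean()
        print(f"{k:>2} features: {score:.3f} accuracy")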
Category: Data Science

How can I use the Wikipedia2vec model for embedding my articles' named entities when 40% of the entities are not in Wikipedia?

I have news articles in my dataset containing named entities. I want to use the Wikipedia2vec model to encode the articles' named entities, but some of the entities (around 40%) in my dataset's articles are not present in Wikipedia. How can I use the Wikipedia2vec model to embed my articles' named entities efficiently, given the article text?
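One common workaround, sketched below, is to fall back to averaging the word vectors of the entity's surface form whenever the entity itself is missing from the pretrained vocabulary. The model path is a placeholder, and the calls assume the official wikipedia2vec Python package:

    from typing import Optional

    import numpy as np
    from wikipedia2vec import Wikipedia2Vec

    # Placeholder path to a pretrained wikipedia2vec model file.
    model = Wikipedia2Vec.load("enwiki_20180420_300d.pkl")

    def embed_entity(mention: str) -> Optional[np.ndarray]:
        # Use the entity vector when the entity is in the vocabulary.
        try:
            return model.get_entity_vector(mention)
        except KeyError:
            pass
        # Fallback: average the word vectors of the surface-form tokens.
        vectors = []
        for token in mention.lower().split():
            try:
                vectors.append(model.get_word_vector(token))
            except KeyError:
                continue
        return np.mean(vectors, axis=0) if vectors else None

A further refinement would be to average the vectors of context words around the mention when even the surface-form tokens are out of vocabulary.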
Category: Data Science

Can a dataset built upon another have more restrictive license?

I found a dataset built on top of a Wikipedia dump, which is distributed through the Huggingface Datasets library. The Wikipedia dump is licensed under CC BY-SA and Huggingface Datasets is licensed under Apache-2.0, but no license is specified for the dataset I want to use. My question is: can the dataset be put under a more restrictive license? Or can I assume that it has the same license as the Wikipedia dump?
Category: Data Science

Search for similar wikipedia articles based on a set of keywords

I want to answer two questions: (1) Which Wikipedia articles could be interesting to me based on a list of keywords generated from the search terms I normally use in Google (obtained via Google Takeout)? (2) Which Wikipedia articles could be interesting to me based on what is not on that list of keywords? I am looking for how to do a context search on Wikipedia …
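As a starting point for the keyword-driven part, the public MediaWiki search API returns articles matching a query term. A minimal sketch with requests; the keyword list is a placeholder for whatever the Google Takeout export yields:

    import requests

    keywords = ["information retrieval", "medieval literature"]  # placeholder

    for keyword in keywords:
        response = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={
                "action": "query",
                "list": "search",
                "srsearch": keyword,
                "srlimit": 5,
                "format": "json",
            },
            timeout=10,
        )
        # Hits arrive ranked by relevance; each carries the article title.
        for hit in response.json()["query"]["search"]:
            print(keyword, "->", hit["title"])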
Topic: wikipedia api nlp
Category: Data Science

doc2vec - paragraph or article as document

I'm trying to train a doc2vec model on the German Wikipedia corpus. While looking for best practices, I've found different ways of creating the training data. Should I split every Wikipedia article by its natural paragraphs into several documents, or use one article as one document to train my model? EDIT: Is there an estimate of how many words per document work well for doc2vec?
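Both granularities map onto the same gensim API; what changes is only what gets wrapped in a TaggedDocument. A minimal sketch with placeholder German text, assuming gensim 4.x:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from gensim.utils import simple_preprocess

    article_id = "Der_Name_der_Rose"  # placeholder article
    paragraphs = [
        "Der Name der Rose ist ein Roman von Umberto Eco.",
        "Die Handlung spielt in einer Benediktinerabtei.",
    ]

    # Option A: one document per natural paragraph.
    per_paragraph = [
        TaggedDocument(simple_preprocess(p), [f"{article_id}_p{i}"])
        for i, p in enumerate(paragraphs)
    ]

    # Option B: one document per article.
    per_article = [TaggedDocument(simple_preprocess(" ".join(paragraphs)), [article_id])]

    model = Doc2Vec(per_paragraph, vector_size=100, min_count=1, epochs=20)
    print(model.dv[f"{article_id}_p0"][:5])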
Category: Data Science
