I am having a hard time evaluating my imputation model. I used an iterative imputer to fill in the missing values in all four columns, with a random forest as the estimator inside the iterative imputer. Here is my imputation code:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (needed to enable IterativeImputer)
from sklearn.impute import IterativeImputer

imp_mean = IterativeImputer(estimator=RandomForestRegressor(), random_state=0)
imp_mean.fit(my_data)
my_data_filled = pd.DataFrame(imp_mean.transform(my_data))
my_data_filled.head()
```

My problem is how to evaluate this model: how can I know whether the filled-in values are right? I used the describe function before …
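One way to sanity-check such an imputer (a sketch, reusing `my_data` and `imp_mean` from the question; the column name `"target_col"` and the 10% masking fraction are placeholders, and the column is assumed numeric) is to hide a random subset of values that are actually observed, re-impute, and compare the imputed values with the held-out originals:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Rows where the column of interest is actually observed.
known = my_data[my_data["target_col"].notna()]
mask_idx = rng.choice(known.index, size=int(0.1 * len(known)), replace=False)

held_out = my_data.loc[mask_idx, "target_col"]        # ground truth we are about to hide
masked = my_data.copy()
masked.loc[mask_idx, "target_col"] = np.nan           # artificially remove known values

imputed = pd.DataFrame(imp_mean.fit_transform(masked),
                       index=masked.index, columns=masked.columns)

# Error between imputed values and the values we hid.
rmse = np.sqrt(mean_squared_error(held_out, imputed.loc[mask_idx, "target_col"]))
print(f"held-out RMSE for target_col: {rmse:.3f}")
```

Repeating this for each of the four columns (and over several random masks) gives a per-column error estimate, which is more informative than comparing summary statistics from describe().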
I'm trying to find a correlation measure for the number of Wikipedia pages an entity (an article) has been translated into versus the number of links that point to that page (both measures can indicate the popularity of a page). Is it possible to correlate them? For instance, I have rows like:

Work, links, wikipediaTranslatedPages
The name of the rose, 500, 53

I tried a scatter plot, but it looks weird. Is that the wrong approach?
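A minimal sketch of how one might measure the relationship, assuming the data sits in a CSV with the three columns shown above (the file name `wiki_popularity.csv` is made up): because both columns are skewed count data, Spearman's rank correlation is often a more robust choice than Pearson.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

# Hypothetical file with columns: Work, links, wikipediaTranslatedPages
df = pd.read_csv("wiki_popularity.csv")

pearson_r, pearson_p = pearsonr(df["links"], df["wikipediaTranslatedPages"])
spearman_r, spearman_p = spearmanr(df["links"], df["wikipediaTranslatedPages"])

print(f"Pearson  r   = {pearson_r:.3f} (p = {pearson_p:.3g})")
print(f"Spearman rho = {spearman_r:.3f} (p = {spearman_p:.3g})")
```

For heavy-tailed counts like these, plotting both axes on a log scale usually makes the scatter plot easier to read as well.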
I keep reading that Naive Bayes needs fewer features than many other ML algorithms. But what is the minimum number of features you actually need to get good results (say, 90% accuracy) with a Naive Bayes model? I know there is no objective answer -- it depends on your exact features and what in particular you are trying to learn -- but I'm looking for a numerical ballpark. I'm asking because I have a dataset with …
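No number can be given without seeing the data, but one can get a ballpark empirically (a sketch; the 20 newsgroups corpus and chi-squared feature selection below are stand-ins, since the actual dataset isn't shown): sweep the number of selected features and watch where cross-validated accuracy levels off.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))

# Accuracy as a function of the number of selected features.
for k in (10, 100, 1_000, 10_000):
    pipe = make_pipeline(TfidfVectorizer(), SelectKBest(chi2, k=k), MultinomialNB())
    scores = cross_val_score(pipe, data.data, data.target, cv=3)
    print(f"k = {k:>6}: mean CV accuracy = {scores.mean():.3f}")
```

The k at which the curve flattens is the kind of ballpark the question asks about, but it is specific to the dataset used.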
I have news articles in my dataset containing named entities. I want to use the Wikipedia2vec model to encode the articles' named entities, but around 40% of the entities in our articles are not present in Wikipedia. How can I use the Wikipedia2vec model to embed my articles' named entities efficiently, with the help of the article text?
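A sketch of one common fallback, assuming a pretrained wikipedia2vec model file (the path below is a placeholder): look the mention up with `get_entity_vector`, and if it is missing, average the word vectors of the mention's surface form, skipping out-of-vocabulary tokens.

```python
import numpy as np
from wikipedia2vec import Wikipedia2Vec

# Placeholder path to a pretrained wikipedia2vec model file.
wiki2vec = Wikipedia2Vec.load("enwiki_20180420_300d.pkl")

def embed_entity(mention):
    """Embed a named-entity mention, falling back to averaged word vectors if the entity is unknown."""
    try:
        return wiki2vec.get_entity_vector(mention)            # exact entity lookup
    except KeyError:
        pass
    word_vecs = []
    for token in mention.lower().split():
        try:
            word_vecs.append(wiki2vec.get_word_vector(token))  # per-word fallback
        except KeyError:
            continue                                           # token not in the vocabulary either
    return np.mean(word_vecs, axis=0) if word_vecs else None

print(embed_entity("The Name of the Rose"))
```

For mentions that still come back as None, averaging the word vectors of the surrounding sentence in the article is a common next fallback, which is one way to bring in "the help of the article".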
I found a dataset built on top of a Wikipedia dump that comes with the Huggingface Datasets library. The Wikipedia dump is licensed under CC BY-SA and the Huggingface Datasets library is licensed under Apache-2.0, but there is no license specified for the dataset I want to use. My question is: can the dataset be put under a more restrictive license, or can I assume that it has the same license as the Wikipedia dump?
I want to answer two questions:
1. Which Wikipedia articles could be interesting to me, based on a list of keywords generated from the search terms I normally use in Google (obtained via Google Takeout)?
2. Which Wikipedia articles could be interesting to me, based on what is not on that list of keywords?

I am looking for how to do a context search on Wikipedia …
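As a starting point (a sketch, not a full context search): the MediaWiki search API can be queried per keyword to surface candidate articles; ranking those candidates against the whole keyword list, or against its complement, would be a separate step. The keyword list below is made up.

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def search_wikipedia(keyword, limit=5):
    """Return titles of Wikipedia articles matching a keyword via the MediaWiki search API."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": keyword,
        "srlimit": limit,
        "format": "json",
    }
    response = requests.get(API_URL, params=params, timeout=10)
    response.raise_for_status()
    return [hit["title"] for hit in response.json()["query"]["search"]]

keywords = ["doc2vec", "medieval literature"]   # placeholder keywords from Google Takeout
for kw in keywords:
    print(kw, "->", search_wikipedia(kw))
```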
I'm trying to train a doc2vec model on the German Wikipedia corpus. While looking for best practices I've found different ways to create the training data: should I split every Wikipedia article into several documents, one per natural paragraph, or use one article as one document to train my model? EDIT: Is there an estimate of how many words per document work well for doc2vec?
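A minimal sketch of the article-per-document variant with gensim, assuming `articles` is an iterable of (title, text) pairs already extracted from the German dump (the extraction step is omitted); switching to paragraph-per-document only changes how the `TaggedDocument` objects are built.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# `articles` is assumed to be an iterable of (title, text) pairs from the German dump.
corpus = [
    TaggedDocument(words=simple_preprocess(text), tags=[title])
    for title, text in articles
]

model = Doc2Vec(vector_size=300, window=8, min_count=5, workers=4, epochs=10)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Paragraph-per-document instead: tag each paragraph separately, e.g.
# TaggedDocument(simple_preprocess(p), tags=[f"{title}_{i}"])
# for i, p in enumerate(text.split("\n\n")).
```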
I can see many Wikipedia dumps out there. I am looking for a Wikipedia-derived corpus in which every line is one sentence, without any Wikipedia meta tags.
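If no ready-made corpus fits, one can be built from an already-cleaned dump; a sketch, assuming the `wikimedia/wikipedia` dataset on the Hugging Face Hub (the snapshot name is an assumption) and NLTK's Punkt sentence splitter, writing one sentence per line.

```python
import nltk
from datasets import load_dataset

nltk.download("punkt")       # Punkt sentence splitter (older NLTK versions)
nltk.download("punkt_tab")   # newer NLTK versions ship the model under this name

# Pre-cleaned Wikipedia snapshot (markup already stripped); repo and snapshot names are assumptions.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

with open("wiki_sentences.txt", "w", encoding="utf-8") as out:
    for article in wiki:
        for sentence in nltk.sent_tokenize(article["text"]):
            sentence = " ".join(sentence.split())   # collapse internal newlines/whitespace
            if sentence:
                out.write(sentence + "\n")
```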