Corpus development for plagiarism detection

There are many simple plagiarism detection algorithms that work against search engines such as Google. I want an index of a corpus of the whole internet to serve as a back-end database for my plagiarism detection software. What would be the right approach to building such a database? Are there any open-source or collaboratively maintained live repositories?
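For context, a minimal sketch of one such simple algorithm: comparing word n-grams (shingles) between two texts and scoring the overlap with Jaccard similarity. The texts and the shingle size below are invented for illustration, not taken from any particular tool.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Made-up example: one substituted word in an otherwise copied sentence.
source = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown fox leaps over the lazy dog"

score = jaccard(shingles(source), shingles(suspect))
print(f"similarity: {score:.2f}")  # prints "similarity: 0.40"
```

Real systems add normalization, stemming, and hashing of shingles (e.g. MinHash) so comparisons scale, but the overlap idea is the same.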

I have read somewhere that, instead of keeping a local database of the entire internet, one can build an index over it and use that for faster search.

I know Elasticsearch seems usable for this. Has anyone tried it before?
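On the indexing point: the core structure behind search engines such as Elasticsearch is an inverted index, which maps each term to the documents containing it, so a query touches only short posting lists instead of scanning the whole corpus. A toy sketch in plain Python (the documents are invented; Elasticsearch does this at scale with analyzers, scoring, and sharding on top):

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: term -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "plagiarism detection with shingles",
    2: "inverted index for fast search",
    3: "detection of duplicates with an inverted index",
}
index = build_index(docs)
print(search(index, "inverted index"))  # documents 2 and 3
```

The same lookup against a full-text index is what makes "index it, don't store it all" workable: you keep the index local and fetch or re-crawl matching pages only when a candidate match needs verification.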

Topic: crawling, python

Category: Data Science


"I want to have a local database of a corpus of the whole internet"

Are you Google? If not, storage might be an issue ;)

The PAN series has run various tasks related to plagiarism detection in the past: https://pan.webis.de/tasks.html#task-originality. I think they provide annotated datasets, and they used to offer a live search engine.
