Corpus development for plagiarism detection
There are many simple plagiarism detection algorithms that work against search engines like Google. I want to have an index of a corpus of the whole internet to serve as a back-end database for my plagiarism detection software. What should the approach be to build such a database? Are there any open-source or collaborative live repositories?
Somewhere I read that instead of keeping a local database of the entire internet, one can build an index over it and use that for faster search.
I know Elasticsearch seems usable for this. Has anyone tried it before?
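For context, the kind of comparison such a backend would need to support is a shingle-based overlap check, which is a common baseline for plagiarism detection. Below is a minimal, self-contained sketch (all names and the sample texts are illustrative, not part of any particular library):

```python
def shingles(text, k=3):
    """Split text into overlapping word k-grams ("shingles")."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Illustrative documents: a source text and a near-copy.
doc = "the quick brown fox jumps over the lazy dog"
copy = "a quick brown fox jumps over the lazy cat"

sim = jaccard(shingles(doc), shingles(copy))
print(round(sim, 3))  # high overlap suggests possible plagiarism
```

A search engine such as Elasticsearch would take over the expensive part here: instead of comparing a suspect document against every stored document, you index the corpus once and retrieve only the few candidates that share terms or shingles, then score those candidates with a similarity measure like the one above.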
Category: Data Science