Finding similar articles in realtime

I want to build a large, searchable database of documents (news articles) so that, when a new article is added, I can quickly find the X most similar articles to it. What is the right technology, algorithm, or Python framework for this?

Topic similar-documents search-engine python similarity algorithms

Category Data Science


Elasticsearch is the right tool if you don't want to code this yourself. You need an index that can efficiently retrieve pieces of text from a large database, and SQL isn't particularly good at that. Elasticsearch is also quite user-friendly, so installing and using it won't be overkill. You might discover along the way that finding the most similar articles isn't that easy, and that Elasticsearch is of great help.
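As a rough illustration, Elasticsearch has a `more_like_this` query that takes a piece of text and returns the documents most similar to it. The sketch below only builds the request body for such a query. The helper name `build_similar_articles_query`, the index name `articles`, the field names `title` and `body`, and the local URL are all assumptions made for the example, and the tuning values are starting points rather than recommendations.

```python
from typing import Any, Dict, List


def build_similar_articles_query(
    article_text: str,
    fields: List[str],
    top_k: int = 10,
) -> Dict[str, Any]:
    """Build a search body asking for the top_k documents most like article_text.

    Hypothetical helper: it only assembles a dict and does not contact a server.
    """
    return {
        "size": top_k,  # how many similar articles to return
        "query": {
            "more_like_this": {
                "fields": fields,        # text fields to compare on (assumed names)
                "like": article_text,    # the new article's text
                "min_term_freq": 1,      # keep terms that occur only once in the input
                "min_doc_freq": 1,       # keep rare terms; useful while the index is small
                "max_query_terms": 25,   # cap on the number of terms selected for the query
            }
        },
    }


body = build_similar_articles_query(
    "Central bank raises interest rates to curb inflation",
    fields=["title", "body"],
    top_k=5,
)

# With a running cluster and the official Python client installed, the body
# could be sent roughly like this (index name and URL are assumptions):
#
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   response = es.search(index="articles", query=body["query"], size=body["size"])
#   similar = response["hits"]["hits"]
```

Keeping the query construction separate from the client call makes it easy to inspect and adjust the parameters before running anything against a real index.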

Here is the documentation for a Python client:
