Plagiarism detection with Python
Background
Using Python, I need to score the existence of a quote of around 2-7 words in a longer text. The quote doesn't have to match the text exactly, but the words that do match should appear in the same order.
For example, given the following long text:
The most beautiful things in the world cannot be seen or touched, they are felt with the heart
The following quotes should be scored high (say, above 80 / 100):
The beautiful thing in our world
World cannot see
They feel with the heart
These quotes are not exact matches, but they preserve the word order of the text.
On the other hand, these quotes should be scored lower (say, below 50 / 100):
The beautiful heart cannot be felt or seen
They are the most seen in the world
These words don't even appear on this text
All the words of the first two appear in the text, but not in the same order, and the third does not appear in the text at all.
The problem
This task cannot be accomplished by simply checking the existence of each word in the text. I don't know which algorithm fits best for this task.
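To illustrate why a word-existence check is not enough, here is a minimal sketch using only the standard library. The helper names (tokenize, word_presence_score) and the crude regex tokenizer are assumptions made for this illustration, not part of any library.

```python
import re

TEXT = ("The most beautiful things in the world cannot be seen or touched, "
        "they are felt with the heart")


def tokenize(s):
    # Assumption: lowercase and keep runs of letters/apostrophes as words.
    return re.findall(r"[a-z']+", s.lower())


def word_presence_score(quote, text):
    """Percentage of quote words found anywhere in the text (order ignored)."""
    text_words = set(tokenize(text))
    quote_words = tokenize(quote)
    if not quote_words:
        return 0.0
    hits = sum(word in text_words for word in quote_words)
    return 100.0 * hits / len(quote_words)


ordered = "The beautiful thing in our world"             # order preserved
shuffled = "The beautiful heart cannot be felt or seen"  # order broken

print(word_presence_score(ordered, TEXT))   # about 66.7 ("thing" and "our" miss)
print(word_presence_score(shuffled, TEXT))  # 100.0
```

The shuffled quote gets the higher score because every one of its words occurs somewhere in the text, which is the opposite of the ranking I want.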
What I have tried
Most of the functions in fuzzywuzzy (partial_token_sort_ratio, token_sort_ratio, etc.) scored the latter quotes higher. partial_ratio did score the former quotes higher, but the quote "These words don't even appear on this text" got 52 / 100, which is unreasonably high.
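A possible explanation: the token_sort functions sort the words before comparing, which discards word order, while partial_ratio compares characters rather than words, so shared letters and spaces can raise the score of an unrelated quote. The sketch below mimics that kind of character-level window matching with difflib from the standard library. It is a simplified stand-in, not fuzzywuzzy's implementation, and window_ratio is a hypothetical helper, so its scores will not necessarily equal the ones reported above.

```python
from difflib import SequenceMatcher

TEXT = ("The most beautiful things in the world cannot be seen or touched, "
        "they are felt with the heart")


def window_ratio(quote, text):
    """Best character-level similarity (0-100) between the quote and any
    slice of the text that has the same length as the quote."""
    quote, text = quote.lower(), text.lower()
    if len(quote) > len(text):
        quote, text = text, quote  # always slide the shorter over the longer
    if not quote:
        return 0.0
    best = 0.0
    for start in range(len(text) - len(quote) + 1):
        window = text[start:start + len(quote)]
        best = max(best, SequenceMatcher(None, quote, window).ratio())
    return 100.0 * best


print(window_ratio("felt with the heart", TEXT))  # 100.0 (exact substring)
print(window_ratio("These words don't even appear on this text", TEXT))
```

The second call returns a score strictly between 0 and 100 even though the quote shares no words with the text, because individual characters still line up.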
My question
How can I use Python to score the existence of short quotes in longer texts, as described above?