Plagiarism detection with Python

Background

Using Python, I need to score how well a short quote, of around 2-7 words, appears in a longer text. The quote doesn't have to match the text exactly, but the similar words should appear in the same order.

For example, given the following long text:

The most beautiful things in the world cannot be seen or touched, they are felt with the heart

The following quotes should be scored high (say, above 80 / 100):

The beautiful thing in our world

World cannot see

They feel with the heart

These quotes are not exact matches, but they preserve the word order of the text.

On the other hand, these quotes should be scored lower (say, below 50 / 100):

The beautiful heart cannot be felt or seen

They are the most seen in the world

These words don't even appear on this text

The words of the first two appear entirely in the text, but do not preserve its order.

The problem

This task cannot be accomplished by simply checking the existence of each word in the text. I don't know which algorithm fits best for this task.

What I have tried

Most of the functions in fuzzywuzzy (partial_token_sort_ratio, token_sort_ratio, etc.) scored the latter quotes higher. partial_ratio did score the earlier quotes higher, but the quote

These words don't even appear on this text

got 52 / 100, which is unreasonably high.

My question

How can I use Python to score the existence of short quotes in longer texts, as described above?

Topic fuzzy-logic nlp python

Category Data Science


Python's fuzzywuzzy uses Levenshtein distance, which looks at character-level differences.
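To make "character-level" concrete, here is a minimal pure-Python sketch of the classic Levenshtein edit distance. It is an illustration only, not fuzzywuzzy's actual code, and the function name levenshtein is chosen here for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("see", "seen"))        # 1: one inserted character
print(levenshtein("kitten", "sitting"))  # 3
```

The distance counts edits to individual characters and has no notion of words, so a score built on it says nothing directly about whether whole words appear in the same order.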

You have to explore other approaches to text similarity. Look for algorithms that weight n-gram differences nonlinearly, such as Q-gram.
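As a rough sketch of the q-gram idea, the code below compares the counts of overlapping character q-grams in two strings and slides a window of the quote's length (in words) over the text. The names qgram_profile, qgram_similarity and best_window_score, the choice q=3, and the normalization to a 0..1 score are all assumptions made for this example, not part of any library.

```python
from collections import Counter


def qgram_profile(s: str, q: int = 3) -> Counter:
    """Count every overlapping substring of length q in s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    """1.0 for identical q-gram profiles, 0.0 when no q-gram is shared."""
    pa, pb = qgram_profile(a, q), qgram_profile(b, q)
    total = sum(pa.values()) + sum(pb.values())
    if total == 0:  # both strings are shorter than q
        return 1.0 if a == b else 0.0
    # q-gram distance: sum of absolute differences between the two profiles
    distance = sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())
    return 1.0 - distance / total


def best_window_score(quote: str, text: str, q: int = 3) -> float:
    """Best q-gram similarity between the quote and any run of
    consecutive text words with the same word count as the quote."""
    quote_words = quote.lower().split()
    text_words = text.lower().split()
    n = len(quote_words)
    normalized_quote = " ".join(quote_words)
    starts = range(max(1, len(text_words) - n + 1))
    return max(
        qgram_similarity(normalized_quote, " ".join(text_words[i:i + n]), q)
        for i in starts
    )


text = ("The most beautiful things in the world cannot be seen or touched, "
        "they are felt with the heart")
for quote in ("The beautiful thing in our world",
              "The beautiful heart cannot be felt or seen",
              "These words don't even appear on this text"):
    print(f"{best_window_score(quote, text):.2f}  {quote}")
```

Because q-grams that span word boundaries only match when neighbouring words agree, this is somewhat sensitive to word order, but only locally. Whether it separates the example quotes well enough would need to be checked against the thresholds in the question.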

The python-string-similarity repo has implementations of many text similarity algorithms.
