Learn (common) grammar / pattern from set of sample strings?
So I currently have a text pattern detection challenge to solve at work. I am trying to make an outlier detection algorithm for a database, for string columns.
For example let's say I have the following list of strings:
[abc123, jkj577, lkj123, uio324, 123123]
I want to develop an algorithm that would detect common patterns in the list of strings, and then indicate which strings are not in this format. In the example above, I would like this algorithm to detect the following regular expression:
[a-z]{3}\d{3}
given that the majority of the entries in the list obey this pattern, except the last one, which should be marked as an outlier.
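To make the desired output concrete, here is a minimal Python sketch of the check I would like to end up with. The pattern is hard-coded here (learning it automatically is the actual problem), and the variable names are just placeholders for illustration:

```python
import re

strings = ["abc123", "jkj577", "lkj123", "uio324", "123123"]

# The pattern the algorithm should ideally discover on its own
# (hard-coded here only to illustrate the expected result).
pattern = re.compile(r"[a-z]{3}\d{3}")

# A string is an outlier if it does not match the pattern in full.
outliers = [s for s in strings if not pattern.fullmatch(s)]
print(outliers)
```

With the list above, only the last entry fails the full match.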
The first idea that came to my mind was to use a genetic algorithm to find the regular expression pattern, where the fitness function is the number of entries in the list that match the pattern. I haven't worked out the details (crossover function, etc.), and there is already a difficulty: the pattern .* matches everything, and hence will always maximize the fitness function.
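A rough sketch of that naive fitness function shows the issue. The function name `fitness` and the use of `re.fullmatch` are only my assumptions about how it could be written, not a worked-out design:

```python
import re

def fitness(candidate, strings):
    """Naive fitness: number of strings that fully match the candidate regex."""
    regex = re.compile(candidate)
    return sum(1 for s in strings if regex.fullmatch(s))

strings = ["abc123", "jkj577", "lkj123", "uio324", "123123"]

specific_score = fitness(r"[a-z]{3}\d{3}", strings)  # the pattern I actually want
trivial_score = fitness(r".*", strings)              # the degenerate catch-all

print(specific_score, trivial_score)
```

The catch-all pattern scores at least as high as any other candidate, so a fitness based on match count alone cannot prefer the specific pattern.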
Anybody already worked on a similar problem? What are my options here? Thank you!
Topic grammar-inference pattern-recognition text-mining
Category Data Science