Learn (common) grammar / pattern from set of sample strings?

Question

Learn (common) grammar / pattern from set of sample strings?

2020年7月28日 11:39

So I currently have a text pattern detection challenge to solve at work. I am trying to make an outlier detection algorithm for a database, for string columns.

For example let's say I have the following list of strings:

[abc123, jkj577, lkj123, uio324, 123123]

I want to develop an algorithm that would detect common patterns in the list of strings, and the indicate which strings are not in this format. For example, in the example above, I would like this algorithm to detect the following regular expression:

r[a-z]{3}\d{3}

given that the majority of the entries in the list obey this pattern, except the last one, which should be marked as an outlier.

The first idea that come to my mind was to use a genetic algorithm to find the regular expression pattern, where the fitness function is the number of entries on the list that match the pattern. I haven't worked out the details (crossvers function, etc..), and there is already the difficulty in the sense that the pattern .* will match everything, hence will always maximize the fitness function.

Anybody already worked on a similar problem? What are my options here? Thank you!

Topic grammar-inference pattern-recognition text-mining

Category Data Science

Erwan · Accepted Answer · 2020年7月28日 10:34

Here are a few ideas:

If the number of strings is not too high, you could consider taking a formal approach and use a finite automata determinization algorithm (I'm very rusty about this stuff but I clearly remember that there is such a thing). The idea is to start from a big automaton made of the union of all the strings, then use the algorithm to find the deterministic automaton, which can then be converted to a regular expression.

A more data-science-y idea is to use character-based similarity/distance measures between all the pairs of strings. Then it should be possible to identify outliers, maybe through clustering based on the distance. Typical character-based measures: Jaro-Winckler, Levenshtein edit distance.

Finally an original (but possibly bad) idea would be to try to train a (character-based) language model on the strings (assuming there are sufficiently many of them). Given an input string, the language model gives you a probability that this string belongs to the "language", so an outlier could be detected by its low probability.

[addition following OP's comment]

Language modeling is normally used for representing the valid sentences in a language, e.g. English, based on the likelihood of sequences of words in this language. It's trained from a large number of correct sentences so that it can estimate the probability of the $n$-grams of words in this language. This is a common NLP task (example) but in your case you would use characters instead of words and strings instead of sentences, so there would be a small adaptation compared to the examples you'll find.

Nikos M. · Accepted Answer · 2020年7月28日 07:08

The problem you face is part of what is called in literature grammar learning or grammar inference which is part of both Natural Language Processing and Machine Learning and in general is a very difficult problem.

However for certain cases like regular grammars/languages (ie learning regular expressions / DFA learning) there are satisfactory solutions up to limitations.

A survey and references on grammar inference and inference of regular grammars:

Learning DFA from Simple Examples

Efficient learning of DFA is a challenging research problem in grammatical inference. It is known that both exact and approximate (in the PAC sense) identifiability of DFA is hard. Pitt, in his seminal paper posed the following open research problem:“Are DFA PAC-identifiable if examples are drawn from the uniform distribution, or some other known simple distribution?”. We demonstrate that the class of simple DFA (i.e., DFA whose canonical representations have logarithmic Kolmogorov complexity) is efficiently PAC learnable under the Solomonoff Levin universal distribution. We prove that if the examples are sampled at random according to the universal distribution by a teacher that is knowledgeable about the target concept, the entire class of DFA is efficiently PAC learnable under the universal distribution. Thus, we show that DFA are efficiently learnable under the PACS model. Further, we prove that any concept that is learnable under Gold’s model for learning from characteristic samples, Goldman and Mathias’ polynomial teachability model, and the model for learning from example based queries is also learnable under the PACS model

An $O(n^2)$ Algorithm for Constructing Minimal Cover Automata for Finite Languages

Cover automata were introduced in [1] as an ecient representation of finite languages. In [1], an algorithm was given to transforma DFA that accepts a finite language to a minimal deterministic finite cover automaton (DFCA) with the time complexity $O(n^4)$, where $n$ is the number of states of the given DFA. In this paper, we introduce a new efficient transformation algorithm with the time complexity $O(n^2)$, which is a significant improvement from the previous algorithm.

There are even libraries implementing algorithms for grammar-inference and DFA learning:

source: stackoverflow

Learn (common) grammar / pattern from set of sample strings?

About