Relational Data Mining without ILP

I have a huge dataset from a relational database for which I need to build a classification model. Normally I would use Inductive Logic Programming (ILP) for this, but due to special circumstances I can't.

The other way to tackle this would be to aggregate values across the foreign-key relations. However, some nominal attributes have thousands of important and distinct values (e.g. a patient related to several distinct drug prescriptions). So I can't do that without creating a new attribute for each distinct value of the nominal attribute, and most of those new columns would be NULL.
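To make the dilemma concrete, here is a toy sketch (Python, with made-up patient and drug names) of the two options I see: pivoting the one-to-many relation into one indicator column per distinct drug, versus collapsing it into a few aggregate features.

```python
from collections import Counter

# Hypothetical one-to-many relation: patient -> prescribed drugs.
prescriptions = {
    "patient_1": ["aspirin", "metformin", "aspirin"],
    "patient_2": ["ibuprofen"],
}

# Option 1 -- pivoting: one indicator column per distinct drug.
# With thousands of distinct drugs this explodes into thousands of
# columns, most of them 0/NULL for any given patient.
all_drugs = sorted({d for drugs in prescriptions.values() for d in drugs})
pivoted = {
    pid: {drug: int(drug in drugs) for drug in all_drugs}
    for pid, drugs in prescriptions.items()
}

# Option 2 -- aggregation: a small, fixed number of features per patient,
# but it throws away which specific drugs were prescribed.
aggregated = {
    pid: {
        "n_prescriptions": len(drugs),
        "n_distinct_drugs": len(set(drugs)),
        "most_common_drug": Counter(drugs).most_common(1)[0][0],
    }
    for pid, drugs in prescriptions.items()
}
```

Neither option is satisfying: the first creates thousands of sparse columns, the second loses exactly the distinctions that matter.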

Is there any non-ILP algorithm that allows me to mine relational databases without resorting to techniques like pivoting, which would create thousands of new columns?

Tags: classification, relational-dbms, data-mining

Category: Data Science


First, some caveats

I'm not sure why you can't use your preferred programming (sub-)paradigm, Inductive Logic Programming (ILP), or what it is that you're trying to classify. More detail would probably lead to a much better answer, especially since it's a little unusual to select a classification algorithm on the basis of the programming paradigm with which it's associated. If your real-world example is confidential, simply make up a fictional-but-analogous one.

Big Data Classification without ILP

Having said that, after ruling out ILP we have 4 other logic programming paradigms in our consideration set:

  1. Abductive
  2. Answer Set
  3. Constraint
  4. Functional

in addition to the dozens of paradigms and sub-paradigms outside of logic programming.

Within Functional Logic Programming, for instance, there exist extensions of ILP called Inductive Functional Logic Programming (IFLP), based on inversion narrowing (i.e. inverting the narrowing mechanism). This approach overcomes several limitations of ILP, is (according to some scholars, at least) just as suitable in terms of representation, and has the benefit of allowing problems to be expressed in a more natural way.

Without knowing more about the specifics of your database and the barriers you face to using ILP, I can't know if this solves your problem or suffers from the same problems. As such, I'll throw out a completely different approach as well.

ILP is contrasted with "classical" or "propositional" approaches to data mining. Those approaches include the meat and bones of machine learning: decision trees, neural networks, regression, bagging, and other statistical methods. Rather than giving up on these approaches because of the size of your data, you can join the ranks of the many data scientists, Big Data engineers, and statisticians who use High Performance Computing (HPC) to employ these methods on massive data sets. (There are also sampling and other statistical techniques you may choose to use to reduce the computational resources and time required to analyze the Big Data in your relational database.)
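As a sketch of the sampling route (my own illustration, not tied to any particular library): reservoir sampling draws a uniform random subset from a table far too large to hold in memory, in a single pass, after which any propositional learner can be applied to the subset.

```python
import random

def reservoir_sample(rows, k, seed=42):
    """Uniformly sample k rows from an arbitrarily long iterable in one
    pass, using O(k) memory -- handy when the table won't fit in RAM."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            # Replace an existing element with decreasing probability k/(i+1),
            # which keeps every row equally likely to end up in the sample.
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = row
    return sample

# rows could just as well be a streaming cursor over a database table.
subset = reservoir_sample(range(1_000_000), k=1_000)
```

The fixed seed is only there to make the sketch reproducible; in practice you would let the generator vary.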

HPC includes things like using multiple CPU cores, scaling up your analysis with elastic use of servers with high memory and large numbers of fast CPU cores, using high-performance data-warehouse appliances, employing clusters or other forms of parallel computing, and so on. I'm not sure which language or statistical suite you're using to analyze your data, but as an example, this CRAN Task View lists many HPC resources for the R language that would let you scale up a propositional algorithm.
