Mimic a Mahout-like system

I have a data set in Excel format in which each row holds an account name, a reported symptom, a determined root cause, and a date in month-year format. I am trying to implement a Mahout-like system that determines which symptoms an account is likely to report, using something like user-based similarity. Technically, I am hoping to adapt a recommender system into a deterministic system that picks out the probable symptoms an account may report. Instead of ratings, I can use the frequency with which each account reports each symptom. Is it possible to build such a system with a programming language or any other software?

Here is an example:

Account : X Symptoms : AB, AD, AB, AB

Account : Y Symptoms : AE, AE, AB, AB, EA

For the sake of this example, let's assume that all the dates are this month.

Output: Account : X Symptom : AE

Here both accounts have reported AB 2 or more times. I could fix such a number as a threshold when looking for probable symptoms.

Topic apache-mahout similarity recommender-system

Category Data Science


This looks to me like the plain old recommendation problem. The Accounts are the USERS and the Symptoms are the ITEMS. Each time an Account shows a particular Symptom, your system increments a count value.

This creates the following dataset:

ACCOUNT, SYMPTOM, COUNT
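
As a rough illustration of the counting step, here is a minimal Python sketch. The rows are the ones from the example in the question; the names `rows` and `count_symptoms` are made up for this illustration, and reading the rows out of the Excel file is left out.

```python
from collections import Counter

# Hypothetical (account, symptom) rows, copied from the example in the question.
rows = [
    ("X", "AB"), ("X", "AD"), ("X", "AB"), ("X", "AB"),
    ("Y", "AE"), ("Y", "AE"), ("Y", "AB"), ("Y", "AB"), ("Y", "EA"),
]

def count_symptoms(rows):
    """Return a mapping {(account, symptom): count}."""
    return Counter(rows)

counts = count_symptoms(rows)

# Print one ACCOUNT, SYMPTOM, COUNT line per distinct pair.
for (account, symptom), n in sorted(counts.items()):
    print(account, symptom, n, sep=", ")
```

With the example data this gives a count of 3 for (X, AB) and 2 for both (Y, AB) and (Y, AE).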

Now you can use any sort of recommender system (Mahout is only one option; have you seen MyMediaLite?), or you can even implement your own.

Let's reuse your ideas:

  • You'd like to use a user-based similarity.

  • If an Account has shown the same symptom 2 or more times, it seems to be important.

So you can filter out the (Account, Symptom) pairs with a count below 2, and from the rest create the following datasets:

  • User, Item dataset:

ACCOUNT, SYMPTOM

  • Table with a single column containing all Users:

ACCOUNT

  • Table with a single column containing all Items:

SYMPTOM
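
The filtering step and the three tables could be sketched like this in Python. The function name `build_datasets` and the `min_count` parameter are assumptions made for the illustration, and the counts are the ones that follow from the example in the question.

```python
def build_datasets(counts, min_count=2):
    """Keep pairs seen at least min_count times and derive the three tables."""
    pairs = sorted(pair for pair, n in counts.items() if n >= min_count)
    accounts = sorted({account for account, _ in pairs})
    symptoms = sorted({symptom for _, symptom in pairs})
    return pairs, accounts, symptoms

# Counts per (account, symptom) pair from the example in the question.
counts = {
    ("X", "AB"): 3, ("X", "AD"): 1,
    ("Y", "AB"): 2, ("Y", "AE"): 2, ("Y", "EA"): 1,
}

pairs, accounts, symptoms = build_datasets(counts)
print(pairs)     # ACCOUNT, SYMPTOM dataset
print(accounts)  # all Users that survive the filter
print(symptoms)  # all Items that survive the filter
```

Note that an account or symptom that never reaches the threshold disappears from the tables entirely, which may or may not be what you want.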

Now you can use the User-KNN algorithm from MyMediaLite directly.

Once the recommender model is trained, you can pass any ACCOUNT as input and it will give you a ranked list of the SYMPTOMS most likely to appear.
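
If you would rather implement it yourself, the idea can be sketched in a few lines of Python. This is not MyMediaLite's implementation, only a from-scratch illustration of user-based nearest neighbours on the filtered ACCOUNT, SYMPTOM pairs; the cosine similarity on symptom sets, the function name `recommend`, and the parameter `k` are all choices made for this example.

```python
from collections import defaultdict
from math import sqrt

def recommend(pairs, account, k=5):
    """Rank symptoms the account has not shown, using its k most similar accounts."""
    by_account = defaultdict(set)
    for a, s in pairs:
        by_account[a].add(s)
    own = by_account.get(account, set())

    # Cosine similarity between binary symptom sets; skip accounts with no overlap.
    neighbours = []
    for other, items in by_account.items():
        if other == account:
            continue
        overlap = len(own & items)
        if overlap:
            neighbours.append((overlap / sqrt(len(own) * len(items)), other))
    neighbours.sort(reverse=True)

    # Each neighbour votes for its symptoms, weighted by its similarity.
    scores = defaultdict(float)
    for similarity, other in neighbours[:k]:
        for symptom in by_account[other] - own:
            scores[symptom] += similarity
    return sorted(scores, key=lambda s: (-scores[s], s))

# Filtered pairs from the example in the question (count >= 2).
pairs = [("X", "AB"), ("Y", "AB"), ("Y", "AE")]
print(recommend(pairs, "X"))  # ['AE']
```

On the example data this reproduces the output asked for in the question: X and Y share AB, so Y's other frequent symptom, AE, is suggested for X.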

Note: Initially, ignore the time. Later you could use it to partition your data into past and future and evaluate the recommendations in a more realistic way. ;-)
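
A past/future evaluation along those lines might look like the Python sketch below. Everything in it is illustrative: the rows and dates are invented, months are assumed to be stored as zero-padded "YYYY-MM" strings so that they compare correctly as text, and the hit rate is just one simple metric you could pick.

```python
def split_by_month(rows, cutoff):
    """Split (account, symptom, 'YYYY-MM') rows into past (< cutoff) and future."""
    past = [(a, s) for a, s, month in rows if month < cutoff]
    future = [(a, s) for a, s, month in rows if month >= cutoff]
    return past, future

def hit_rate(recommendations, future):
    """Share of future accounts for which a recommended symptom really appeared."""
    seen = {}
    for account, symptom in future:
        seen.setdefault(account, set()).add(symptom)
    if not seen:
        return 0.0
    hits = sum(
        1 for account, symptoms in seen.items()
        if symptoms & set(recommendations.get(account, []))
    )
    return hits / len(seen)

# Invented rows for illustration only.
rows = [
    ("X", "AB", "2023-01"), ("X", "AB", "2023-02"), ("Y", "AB", "2023-01"),
    ("Y", "AE", "2023-02"), ("X", "AE", "2023-03"), ("Y", "AD", "2023-03"),
]
past, future = split_by_month(rows, "2023-03")

# Hypothetical output of a model trained on `past` only.
recommendations = {"X": ["AE"], "Y": ["EA"]}
print(hit_rate(recommendations, future))  # 0.5
```

The model is trained only on the past rows, and the future rows are used solely to check whether the suggested symptoms actually showed up.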
