Classifying transactions as malicious

I have a big data set of fake transactions for a company. Each row contains the username, credit card number, time, device used, and amount of money in the transaction. I need to classify each transaction as either malicious or not malicious and I am lost for ideas on where to start. Doing it by hand would be silly.

I was thinking that possible starting places would be checking how often a credit card is used, whether it is consistently used at a certain time, or whether it is used from lots of different devices (iOS and Android, for example). I'm still fairly new to all this and to ML. Is there an ML algorithm that is well suited to this problem?

Also, side question: what would be a good place to host the 600 or so GB of data for cheaps?

Thanks

Topic classification bigdata

Category Data Science


This problem is commonly known as credit card fraud detection.

Several classification algorithms aim to tackle this problem.

Given what you know about the dataset, a decision tree can be used to separate malicious transactions from non-malicious ones. This paper is a nice resource for building intuition about fraud detection and about how basic classification algorithms such as decision trees and SVMs are applied to the problem.

Several other papers tackle this problem with algorithms such as neural networks, logistic regression, and genetic algorithms. However, the paper that uses decision trees is a nice place to start learning.
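Before training any classifier, the raw rows usually need to be turned into per-card features like the ones suggested in the question (usage frequency, time of day, number of devices). Here is a minimal sketch using only the standard library. The row layout, the sample values, and the feature names are assumptions made for illustration, not something taken from your data:

```python
from collections import defaultdict
from datetime import datetime


def card_features(transactions):
    """Aggregate per-card features from (card, iso_timestamp, device, amount) rows."""
    by_card = defaultdict(list)
    for card, ts, device, amount in transactions:
        by_card[card].append((datetime.fromisoformat(ts), device, amount))

    features = {}
    for card, rows in by_card.items():
        hours = [t.hour for t, _, _ in rows]
        features[card] = {
            "n_transactions": len(rows),
            # Number of distinct devices seen for this card.
            "n_devices": len({d for _, d, _ in rows}),
            # Naive spread of usage hours; does not wrap around midnight.
            "hour_range": max(hours) - min(hours),
            "mean_amount": sum(a for _, _, a in rows) / len(rows),
        }
    return features


# Made-up sample rows, purely for illustration.
sample = [
    ("4111-0001", "2017-03-01T09:15:00", "iOS", 20.0),
    ("4111-0001", "2017-03-02T09:40:00", "iOS", 30.0),
    ("4111-0002", "2017-03-01T03:05:00", "iOS", 500.0),
    ("4111-0002", "2017-03-01T23:50:00", "Android", 700.0),
]
features = card_features(sample)
```

The resulting feature rows could then be handed to a tree-based classifier, for example scikit-learn's DecisionTreeClassifier, provided you have labels to train on.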

what would be a good place to host the 600 or so GB of data for cheaps?

AWS S3 would be a nice, cheap way to do that. It also integrates nicely with Redshift, in case you want to run complex analytics on the data.


XGBoost has a parameter named scale_pos_weight for dealing with imbalanced classification problems. It controls the balance of positive and negative weights. See the parameter documentation for further details: http://xgboost.readthedocs.io/en/latest/parameter.html
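A common starting value for scale_pos_weight, suggested in the XGBoost parameter documentation, is the number of negative examples divided by the number of positive ones. Here is a rough sketch of computing it. The helper function and the label counts are made up for illustration, and the snippet only builds a parameter dictionary rather than training a model:

```python
def scale_pos_weight(labels):
    """Return (number of negatives) / (number of positives) for 0/1 labels."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0:
        raise ValueError("no positive (fraud) examples in labels")
    return neg / pos


# Hypothetical labels: 990 legitimate transactions, 10 fraudulent ones.
labels = [0] * 990 + [1] * 10
weight = scale_pos_weight(labels)

# Parameters that could be passed to XGBoost's training API.
params = {"objective": "binary:logistic", "scale_pos_weight": weight}
```

Treat this ratio as a starting point and tune it against a validation set.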


A rule-based classifier is generally better suited to this problem, since most of your features are going to contain discrete values.

So decision trees, boosting, and random forests should do the job for you.

One thing you should always keep in mind is how you are going to evaluate your model. For fraud detection, prioritize keeping false negatives (fraud that goes undetected) as close to zero as you can, which means optimizing for recall on the fraud class rather than plain accuracy. A false positive is usually fine, but a missed fraudulent transaction is dangerous.
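To make the evaluation point concrete, here is a small sketch that counts false negatives and false positives at two decision thresholds. The labels and scores are invented for illustration, and the helper names are my own:

```python
def predict(scores, threshold):
    """Flag a transaction as fraud (1) when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 as fraud."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def recall(tp, fn):
    """Fraction of actual fraud cases that were caught."""
    return tp / (tp + fn)


# Made-up ground truth and model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]

default_counts = confusion_counts(y_true, predict(scores, 0.5))
lowered_counts = confusion_counts(y_true, predict(scores, 0.25))
```

In this toy data, lowering the threshold from 0.5 to 0.25 removes the one false negative at the cost of one extra false positive, which is the trade-off described above.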
