Fraud risk propagation in a large-scale network

What's the best approach to graph analytics and risk propagation in Python for a network in which accounts are connected through relationships, a few of the accounts are marked as bad, and the rest are unknown?

I tried using networkx, but it seems to run forever. I have about 8 million edges and 40K nodes.

Topic networkx graphs python

Category Data Science


As Victor proposed, you probably need graph convolutional networks. 40K nodes is borderline too much to fit in memory, so you could consider GraphSAGE-like approaches, which sample subgraphs around target nodes and then run some sort of GCN or GAT (graph attention network) on them. You could use a library like DGL or PyTorch Geometric for that.
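To make the subgraph sampling idea concrete, here is a minimal pure-Python sketch of fixed-fanout neighbor sampling in the spirit of GraphSAGE. It is not the DGL or PyTorch Geometric API; the function names (`build_adjacency`, `sample_subgraph`), the `fanouts` parameter, and the toy account IDs are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency list from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return {node: sorted(nbrs) for node, nbrs in adj.items()}


def sample_subgraph(adj, targets, fanouts, seed=0):
    """Sample a multi-hop neighborhood around the target nodes.

    fanouts[i] is the maximum number of neighbors kept per node at hop i.
    Returns the set of sampled nodes and the list of sampled (src, dst) edges.
    """
    rng = random.Random(seed)
    nodes = set(targets)
    frontier = list(targets)
    sampled_edges = []
    for fanout in fanouts:
        next_frontier = []
        for node in frontier:
            nbrs = adj.get(node, [])
            # Keep all neighbors if there are few, otherwise a random subset.
            picked = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
            for nbr in picked:
                sampled_edges.append((nbr, node))  # message flows nbr -> node
                if nbr not in nodes:
                    nodes.add(nbr)
                    next_frontier.append(nbr)  # only expand newly seen nodes
        frontier = next_frontier
    return nodes, sampled_edges


# Toy graph with hypothetical account IDs, not real data.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (2, 6), (6, 7)]
adj = build_adjacency(edges)
nodes, sub_edges = sample_subgraph(adj, targets=[0], fanouts=[2, 2])
```

The point of the fanout cap is that the cost per target node is bounded by the product of the fanouts rather than by the full neighborhood size, which matters when some accounts have thousands of connections. A GCN or GAT would then be run only on the sampled nodes and edges.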

Another notable approach is DeepWalk, which generates an embedding for each node from its neighborhood. As a plus, it preserves locality in the embedding. The minus is that, in my experience, it does not scale so well, but you can give it a try.
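For reference, the first stage of DeepWalk is generating truncated random walks over the graph; the walks are then treated like sentences and fed to a skip-gram model (gensim's Word2Vec is one common choice) to learn the embeddings. Below is a rough sketch of the walk generation only. The function name `random_walks`, its parameters, and the toy adjacency list are assumptions for illustration.

```python
import random


def random_walks(adj, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks starting from every node.

    adj maps each node to a list of its neighbors.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a fresh random order each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = adj[walk[-1]]
                if not nbrs:  # isolated node: the walk cannot continue
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks


# Toy graph with hypothetical account IDs; node 4 is isolated.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2], 4: []}
walks = random_walks(adj, walk_length=5, walks_per_node=3)
```

Because nodes that co-occur in many walks end up with similar embeddings, accounts close to known bad accounts tend to land near them in the embedding space, which is the locality property mentioned above.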


You could try applying a graph convolutional network to do some semi-supervised learning. See Kipf and Welling's paper "Semi-Supervised Classification with Graph Convolutional Networks". It probably depends on how unbalanced your dataset is, though. If the dataset is too large, you could take a sample of it and train the GCN on that subset. I'd try to find some exemplar data points and create a training set from them.
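The core of the GCN in that paper is a propagation step that mixes each node's features with those of its neighbors using the symmetrically normalized adjacency matrix with self-loops, D^{-1/2} (A + I) D^{-1/2}. The sketch below applies one such step in plain Python, leaving out the learned weight matrix and the nonlinearity. The function name `gcn_propagate` and the toy data are assumptions; at 8 million edges you would use sparse matrices or a GNN library rather than Python loops.

```python
import math
from collections import defaultdict


def gcn_propagate(edges, features, num_nodes):
    """Apply one step of D^{-1/2} (A + I) D^{-1/2} X.

    edges: undirected (u, v) pairs with node IDs in range(num_nodes).
    features: list of per-node feature lists, all of the same length.
    """
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    for n in range(num_nodes):
        nbrs[n].add(n)  # self-loop, the "+ I" term
    deg = {n: len(nbrs[n]) for n in range(num_nodes)}

    out = []
    for i in range(num_nodes):
        dim = len(features[i])
        row = [0.0] * dim
        for j in nbrs[i]:
            weight = 1.0 / math.sqrt(deg[i] * deg[j])
            for k in range(dim):
                row[k] += weight * features[j][k]
        out.append(row)
    return out


# Toy path graph 0 - 1 - 2; node 0 is a known bad account (feature 1.0).
edges = [(0, 1), (1, 2)]
features = [[1.0], [0.0], [0.0]]
scores = gcn_propagate(edges, features, num_nodes=3)
```

After one step the direct neighbor of the bad account picks up a nonzero score while the node two hops away is still at zero; stacking two layers (or applying the step twice) lets the signal reach two-hop neighbors.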
