Handling IP addresses as features when creating a machine learning model

I'm working on an ML model for fraud detection, and two of the features I have are sender_IP_address and receiver_IP_address.

I think these are very important features that cannot be ignored. My question is: how should I handle this kind of feature?

My dataset has around 100k rows and 80 columns.

I know that an IP address is categorical data and that I could use OneHotEncoder (for example), but among those 100k rows I have around 70k unique IP addresses (a single IP address can occur anywhere from 1 to 800 times). If I encode them, I will have 70k additional features for training and too much variance in the data. The IP categories will also be heavily imbalanced: about 80% of the IPs occur only once, and about 20% occur more than once (some as many as 300 times).

I have read that an IP address can be treated as numerical data, but I do not know whether that is legitimate. For example, the IP address 46.242.124.174 would be split into 4 columns/features, each holding one number, in this case 46 | 242 | 124 | 174. Is this a correct approach?

Also, is there any meaningful relationship between sender_IP_address and receiver_IP_address? For example:

sender_IP_address: 46.242.124.174
receiver_IP_address: 225.242.12.174

The two IP addresses share some numbers (242 and 174). Does that mean anything or not?



You can use Python's built-in ipaddress module.

By using the library, you can convert IP addresses into integer values using:

>>> import ipaddress
>>> str(ipaddress.IPv4Address('192.168.0.1'))
'192.168.0.1'
>>> int(ipaddress.IPv4Address('192.168.0.1'))
3232235521

One-hot encoding is not going to work at this scale. 70,000 features present problems for some algorithms, not just in performance but in accuracy. With the information spread across 70,000 sparse columns, the IP features can drown out all the other features, and it becomes hard to learn anything about an individual IP, since most of them appear only once. And obviously there are billions of potential IP addresses, so most addresses seen at prediction time will not have appeared in training.

Treating octets as numbers is not meaningful. There is no ordinal meaning to them; 46.* is not closer to 47.* than it is to 250.*.

IPs are almost ID-like information, and an ID has no meaningful content other than its uniqueness. But not quite: there is some structure in IP addresses. You could use the address class (A/B/C) to mask off the host portion and keep only the network part as the feature. There would be far fewer distinct values, and they would be a little more meaningful as 'categories'.
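
The masking idea above could be sketched roughly as follows, using only the standard ipaddress module. The function name network_label, the optional fixed prefix length, and the choice to leave class D/E addresses unmasked are assumptions made for illustration; classful boundaries are also only a coarse heuristic, since real allocations no longer follow them.

```python
import ipaddress
from typing import Optional


def classful_prefix_len(ip: str) -> int:
    """Return the historical classful prefix length for an IPv4 address."""
    first_octet = int(ipaddress.IPv4Address(ip)) >> 24
    if first_octet < 128:   # class A
        return 8
    if first_octet < 192:   # class B
        return 16
    if first_octet < 224:   # class C
        return 24
    return 32               # class D/E (multicast/reserved): assumed left unmasked


def network_label(ip: str, prefix_len: Optional[int] = None) -> str:
    """Zero the host bits of `ip` and return the network as a category label."""
    if prefix_len is None:
        prefix_len = classful_prefix_len(ip)
    # strict=False allows a host address and zeroes the host bits for us.
    return str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))


print(network_label("46.242.124.174"))      # 46.0.0.0/8
print(network_label("192.168.0.1"))         # 192.168.0.0/24
print(network_label("46.242.124.174", 16))  # 46.242.0.0/16
```

The resulting labels can then be treated as an ordinary categorical column with far fewer distinct values than the raw addresses.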

However, your best bet is probably to gather some side information from the IP, such as its geolocation, and use that instead: what country or region the address comes from. The IP itself doesn't mean much.
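
To illustrate the shape of such a side-information feature, here is a minimal sketch. The lookup table, its network ranges, and the country codes "AA" and "BB" are entirely made up; in practice the table would be replaced by a real geolocation database or service, which is not shown here. The function name lookup_country is likewise hypothetical.

```python
import ipaddress

# Hypothetical toy table: these ranges and codes are NOT real geolocation data.
TOY_GEO_TABLE = {
    ipaddress.ip_network("46.242.0.0/16"): "AA",
    ipaddress.ip_network("192.168.0.0/16"): "BB",
}


def lookup_country(ip: str, table=TOY_GEO_TABLE, default: str = "unknown") -> str:
    """Map an IP address to a coarse region label, or `default` if not found."""
    address = ipaddress.ip_address(ip)
    for network, country in table.items():
        if address in network:
            return country
    return default


print(lookup_country("46.242.124.174"))  # AA
print(lookup_country("8.8.8.8"))         # unknown
```

A column of such labels has only a handful of distinct values, so it can be one-hot encoded without the blow-up that raw IP addresses cause.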
