Handling IP addresses as features when building a machine learning model
I'm working on an ML model for fraud detection, and two of my features are sender_IP_address and receiver_IP_address.
I think these are very important features that cannot be ignored. My question is: how should I handle this kind of feature?
My dataset has around 100k rows and 80 columns.
I know that an IP address is categorical data and that I can use OneHotEncoder (for example), but among those 100k rows I have around 70k unique IP addresses (one IP address can occur anywhere from 1 to 800 times). If I one-hot encode it, I will add about 70k features for training, and I will have too much variance in the data. I will also have a big imbalance in the IP categories, because about 80% of the IPs occur only once and 20% occur more than once (some even 300 times).
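To show how I am measuring this, here is a minimal sketch using only the standard library. The `ips` list is a tiny made-up sample standing in for my real column, so the numbers it produces are purely illustrative.

```python
from collections import Counter

# Hypothetical sample standing in for the real sender_IP_address column.
ips = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"]

# How often each distinct address occurs.
counts = Counter(ips)

# Number of columns one-hot encoding would create for this feature.
n_unique = len(counts)

# Addresses seen exactly once, and their share of all distinct addresses.
singletons = sum(1 for c in counts.values() if c == 1)
share_singletons = singletons / n_unique

print(f"{n_unique} unique IPs, {share_singletons:.0%} occur only once")
```

On my real data the same counts give roughly 70k unique values, most of them singletons, which is why one-hot encoding looks impractical to me.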
I have read that an IP address can instead be treated as numerical data, but I do not know whether that is legitimate. For example, the IP address 46.242.124.174 would be split into 4 columns/features, each holding one number, in this case 46 | 242 | 124 | 174. Is this a correct approach?
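To make the idea concrete, this is roughly the split I have in mind. It is only a sketch, assuming every address is IPv4; the helper name `split_ipv4_octets` and the `_octetN` column names are my own invention, and whether the resulting numbers are meaningful to a model is exactly what I am asking.

```python
import ipaddress

def split_ipv4_octets(ip: str) -> list[int]:
    """Return the four octets of an IPv4 address as integers."""
    # IPv4Address validates the string and raises ValueError on bad input.
    addr = ipaddress.IPv4Address(ip)
    # .packed is the 4-byte representation; iterating bytes yields ints.
    return list(addr.packed)

# Hypothetical rows using the example addresses from this question.
rows = [
    {"sender_IP_address": "46.242.124.174",
     "receiver_IP_address": "225.242.12.174"},
]

features = []
for row in rows:
    out = {}
    for col in ("sender_IP_address", "receiver_IP_address"):
        # One numeric column per octet, e.g. sender_IP_address_octet1.
        for i, octet in enumerate(split_ipv4_octets(row[col]), start=1):
            out[f"{col}_octet{i}"] = octet
    features.append(out)

print(features[0])
```

This would turn the two IP columns into 8 numeric columns instead of about 70k one-hot columns.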
Also, is there any relationship between sender_IP_address and receiver_IP_address that I could use? For example:
sender_IP_address: 46.242.124.174, receiver_IP_address: 225.242.12.174
The two IP addresses share some numbers in the same positions (242 and 174). Does that mean anything or not?
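In case it helps to see what I mean by "same numbers", here is a small sketch of the comparison. The function name `matching_octet_positions` is hypothetical, and the code only reports which positions match; it does not assume that a match carries any signal.

```python
def matching_octet_positions(ip_a: str, ip_b: str) -> list[int]:
    """Return the 1-based positions where two dotted IPv4 strings have equal octets."""
    a = ip_a.split(".")
    b = ip_b.split(".")
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x == y]

matches = matching_octet_positions("46.242.124.174", "225.242.12.174")
print(matches)  # positions 2 and 4 match for the example pair
```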