Should hexadecimal addresses of a dataset be cleaned?

Question


Namrouch

April 24, 2022 21:00

I am working on fraud detection on blockchains. To be more specific, I fetched a large number of transactions that took place on the blockchain, labeled them as spam / non-spam using an appropriate API, and now I will train a model to detect fraud using SVM, etc.

My question is about the preparation of the data. The fields I have are: hash, nonce, transaction_index, from_address, to_address, ...

The fields from/to_address are hexadecimal fields like 0x5e14d30d2155c0cdd65044d7e0f296373f3e92f65ebd

My question is, how should I format this data? Should I delete this field? (I do not think so, since it is very relevant to the problem at hand.) I can't find an appropriate encoding, either.

Topic dataframe classification python

Category Data Science

Brian Spiering · Accepted Answer · April 24, 2022 21:00

It is fine to leave the "from/to_address" in the model. It would be useful to choose an algorithm that learns to weight the feature appropriately.

In its current hexadecimal format, the field would be treated as a string by most machine learning libraries, and most algorithms cannot consume strings directly. It might be useful to use feature hashing to encode it into numerical values, which are amenable to most machine learning algorithms.
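As a rough illustration of the feature hashing idea, here is a minimal sketch using only the Python standard library. The function names, the bucket count of 32, the choice of SHA-256, and the lower-casing of addresses are all assumptions made for this example, not something from the question; the second address is made up.

```python
import hashlib
from typing import List, Tuple

N_BUCKETS = 32  # assumed size for the sketch; a real dataset would likely need far more


def hash_address(address: str, field: str, n_buckets: int = N_BUCKETS) -> Tuple[int, float]:
    """Map one address to a (column index, sign) pair with the hashing trick."""
    # Assumption: treat addresses case-insensitively, so mixed-case and
    # lower-case spellings of the same address land in the same bucket.
    token = f"{field}={address.lower()}".encode("utf-8")
    digest = hashlib.sha256(token).digest()
    # First 8 bytes choose the bucket, the next byte chooses the sign.
    index = int.from_bytes(digest[:8], "big") % n_buckets
    sign = 1.0 if digest[8] % 2 == 0 else -1.0
    return index, sign


def encode_transaction(from_address: str, to_address: str,
                       n_buckets: int = N_BUCKETS) -> List[float]:
    """Turn the two address fields into one fixed-length numeric vector."""
    row = [0.0] * n_buckets
    # Prefixing with the field name keeps "sent from X" distinct from "sent to X".
    for field, address in (("from", from_address), ("to", to_address)):
        index, sign = hash_address(address, field, n_buckets)
        row[index] += sign
    return row


# Usage: the first address is the one from the question, the second is hypothetical.
vector = encode_transaction(
    "0x5e14d30d2155c0cdd65044d7e0f296373f3e92f65ebd",
    "0x00000000000000000000000000000000deadbeef",
)
print(len(vector))
```

The resulting vector has a fixed width no matter how many distinct addresses appear, at the cost of occasional collisions between unrelated addresses. If scikit-learn is available, its `FeatureHasher` class in `sklearn.feature_extraction` does the same job and returns a sparse matrix, which is probably the more practical choice for a large dataset.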
