Converting an array to a true/false matrix

I have a data set where each record is a JSON document with a label and an array of signals. The signals vary from record to record:

{
    label:bad,
    id: 0009,
    signals:[high_debt_ratio, no_job] 
},

{
    label:good,
    id: 0002,
    signals:[high_debt_ratio, great_credit, no_id_match] 
},

{
    label:good,
    id: 0003,
    signals:[low_debt_ratio, great_credit] 
},

{
    label:bad,
    id: 0001,
    signals:[high_risk_loc, high_debt_ratio, no_job, no_id_match] 
}

I want to convert this to a matrix that looks like this:

id label high_risk_loc high_debt_ratio no_job great_credit no_id_match low_debt_ratio
0009 bad false true true false false false
0002 good false true false true true false
0003 good false false false true false true
0001 bad true true true false true false

I wrote a function for this, but it seems like a common thing to do. Is there a Python library (pandas, scikit-learn, etc.) that does this for you? I'd rather use something from a package, but I'm not sure what to search for.

Topic matrix scikit-learn pandas

Category Data Science


Something more intuitive, using sklearn's MultiLabelBinarizer, though slightly more verbose:

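import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# df is assumed to already hold the records, with a list-valued 'signals' column
mlb = MultiLabelBinarizer()
mlb.fit(df['signals'])
new_col_names = mlb.classes_

# New DataFrame containing 0/1 values of the signals
# (index=df.index keeps the rows aligned for the concat below)
signals_df = pd.DataFrame(mlb.transform(df['signals']), columns=new_col_names, index=df.index)

# Concatenate with original DataFrame
pd.concat([df, signals_df], axis=1)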
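
For a self-contained check, the snippet below applies the same MultiLabelBinarizer steps to the four records from the question and casts the 0/1 output to booleans. It is only a minimal sketch: the `records` and `result` names are introduced here for illustration, and it assumes pandas and scikit-learn are installed.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# The four records from the question, written as valid Python dicts.
records = [
    {"label": "bad", "id": "0009", "signals": ["high_debt_ratio", "no_job"]},
    {"label": "good", "id": "0002", "signals": ["high_debt_ratio", "great_credit", "no_id_match"]},
    {"label": "good", "id": "0003", "signals": ["low_debt_ratio", "great_credit"]},
    {"label": "bad", "id": "0001", "signals": ["high_risk_loc", "high_debt_ratio", "no_job", "no_id_match"]},
]
df = pd.DataFrame(records)

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 array with one column per distinct signal;
# mlb.classes_ holds the matching column names.
signals_df = pd.DataFrame(
    mlb.fit_transform(df["signals"]),
    columns=mlb.classes_,
    index=df.index,
).astype(bool)

# Replace the list column with the boolean indicator columns.
result = pd.concat([df.drop(columns="signals"), signals_df], axis=1)
print(result)
```

Casting with `.astype(bool)` is optional; leave it out if 0/1 integers are what the downstream model expects.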

You can try this one-line solution:

import pandas as pd

d = pd.json_normalize([{
    "label":"bad",
    "id": "0009",
    "signals":["high_debt_ratio", "no_job"] 
},

{
    "label":"good",
    "id": "0002",
    "signals":["high_debt_ratio", "great_credit", "no_id_match"] 
},

{
    "label":"good",
    "id": "0003",
    "signals":["low_debt_ratio", "great_credit"] 
},

{
    "label":"bad",
    "id": "0001",
    "signals":["high_risk_loc", "high_debt_ratio", "no_job", "no_id_match"] 
}])

d.merge(d.signals.str.join("|").str.get_dummies(sep="|"), left_index=True, right_index=True)

The output is the original DataFrame with one additional 0/1 column per signal.
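
To see what the join-then-get_dummies step does on its own, here is a short sketch using two of the question's records. The variable names are illustrative, and the approach relies on the separator `|` never appearing inside a signal name.

```python
import pandas as pd

# Two of the question's records, enough to show the mechanics.
d = pd.DataFrame({
    "id": ["0009", "0003"],
    "signals": [["high_debt_ratio", "no_job"], ["low_debt_ratio", "great_credit"]],
})

# Step 1: collapse each list into one delimited string.
joined = d["signals"].str.join("|")

# Step 2: split on the delimiter and build one indicator column per distinct
# value, then cast to booleans to get the true/false matrix from the question.
flags = joined.str.get_dummies(sep="|").astype(bool)

result = d.drop(columns="signals").join(flags)
print(result)
```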


First, read your JSON data into a pandas DataFrame with json_normalize:

import pandas as pd

df = pd.json_normalize(records)  # records: placeholder for your list of JSON documents (dicts)

Your DataFrame will look like this:

  label    id                                            signals
0   bad  0009                          [high_debt_ratio, no_job]
1  good  0002       [high_debt_ratio, great_credit, no_id_match]
2  good  0003                     [low_debt_ratio, great_credit]
3   bad  0001  [high_risk_loc, high_debt_ratio, no_job, no_id...

Next, collect the unique values found in the signals column and loop over them. For each value, add a column that is True where the record's signals list contains it and False otherwise. Finally, drop the signals column:

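# df.signals.sum() concatenates all the lists; set() keeps the unique values
# (sets are unordered, so the column order may differ between runs)
for signal in set(df.signals.sum()):
    # True where this row's list contains the signal, False otherwise
    df[signal] = df.signals.apply(lambda x: signal in x)
df.drop('signals', axis=1, inplace=True)
print(df)

Output: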
  label    id  low_debt_ratio  great_credit  high_debt_ratio  no_id_match  high_risk_loc  no_job
0   bad  0009           False         False             True        False          False    True
1  good  0002           False          True             True         True          False   False
2  good  0003            True          True            False        False          False   False
3   bad  0001           False         False             True         True           True    True
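
The loop above can also be wrapped in a small helper that sorts the signal names, so the column order is the same on every run and the input DataFrame is left unmodified. This is only a sketch; `signals_to_flags` is a hypothetical name, not a pandas function.

```python
import pandas as pd


def signals_to_flags(df: pd.DataFrame, column: str = "signals") -> pd.DataFrame:
    """Return a copy of df with `column` replaced by one boolean column per value."""
    # Collect every distinct value across all lists, sorted for a stable order.
    names = sorted({value for values in df[column] for value in values})
    out = df.drop(columns=column)
    for name in names:
        # True where this row's list contains the value, False otherwise.
        out[name] = df[column].apply(lambda values, name=name: name in values)
    return out


df = pd.DataFrame({
    "label": ["bad", "good"],
    "id": ["0009", "0003"],
    "signals": [["high_debt_ratio", "no_job"], ["low_debt_ratio", "great_credit"]],
})
print(signals_to_flags(df))
```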
