Ordering a material science dataset (properties names, properties scalars, formulas)

Question

James Arten

May 28, 2022, 14:04

I'm working with a materials science dataset that is organized like this:

Chemical_Formula      Property_name            Property_Scalar

    He                Electrical conduc.          1
    NO_2              Resistance                  50
    CuO3              Hardness
    ...               ...                        ...
    CuO3              Fluorescence                300
    He                Toxicity                    39
    NO2               Hardness                    80
    ...               ...                         ...

As you can see, it is really messy: the same chemical formula appears more than once throughout the dataset, each time paired with a different property. My question is: how can I easily split the dataset into smaller ones, so that every formula is grouped with its descriptors in ORDER? I really need help with this, thank you. (I used fictional names and values, just to explain my problem.)

I'm on Jupyter Notebook and I'm using Pandas.

Edit, to make the question clearer:

My goal is to plot histograms of, for example, the number of materials vs. conductivity at different temperatures (100 K, 200 K, 300 K). So I need both the conductivity and the temperature for each material, laid out so they can be compared directly. For example, I guess a more convenient layout would be:

Chemical formula     Conductivity      Temperature

      He                 5                  10K
      NO_2               7                  59K
      CuO_3              10                 300K
      ...                ...                ...
      He                 14                 100K
      NO_2               5                  70K
      ...                ...                ...

Topic jupyter pandas python

Category Data Science

lytseeker · Accepted Answer · December 21, 2020, 14:20

Given that your DataFrame is:

import pandas as pd

df2 = pd.DataFrame({
    "Chemical_Formula":["He", "NO_2", "CuO3", "CuO3", "He", "NO2"],
    "Property_name":["Electrical conduc.", "Resistance", "Hardness", "Fluorescence", "Toxicity", "Hardness"],
    "Property_Scalar":[1, 50, 10, 300, 39, 80]
})

	Chemical_Formula	Property_name	Property_Scalar
0	He	Electrical conduc.	1
1	NO_2	Resistance	50
2	CuO3	Hardness	10
3	CuO3	Fluorescence	300
4	He	Toxicity	39
5	NO2	Hardness	80

You can use pivot to "unmelt" it into a wide format, with one row per formula and one column per property:

df3 = df2.pivot(index="Chemical_Formula", columns="Property_name")

Chemical_Formula	('Property_Scalar', 'Electrical conduc.')	('Property_Scalar', 'Fluorescence')	('Property_Scalar', 'Hardness')	('Property_Scalar', 'Resistance')	('Property_Scalar', 'Toxicity')
CuO3	nan	300	10	nan	nan
He	1	nan	nan	nan	39
NO2	nan	nan	80	nan	nan
NO_2	nan	nan	nan	50	nan
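The tuple-style column labels in the output above appear because no values column was named in the pivot call. As a minimal sketch (not part of the original answer), passing `values="Property_Scalar"` gives plain column labels. If the real dataset repeats the same formula/property pair, `pivot` raises a `ValueError`; `pivot_table` with an aggregation function is one way around that. The names `flat` and `wide` and the choice of `"mean"` are assumptions made for illustration only.

```python
import pandas as pd

# Same illustrative data as in the answer above.
df2 = pd.DataFrame({
    "Chemical_Formula": ["He", "NO_2", "CuO3", "CuO3", "He", "NO2"],
    "Property_name": ["Electrical conduc.", "Resistance", "Hardness",
                      "Fluorescence", "Toxicity", "Hardness"],
    "Property_Scalar": [1, 50, 10, 300, 39, 80],
})

# Naming the values column yields plain property names as column labels
# instead of ('Property_Scalar', <property>) tuples.
flat = df2.pivot(index="Chemical_Formula", columns="Property_name",
                 values="Property_Scalar")

# pivot_table accepts repeated (formula, property) pairs and aggregates them.
# "mean" is an assumed choice; pick whatever suits the measurements.
wide = df2.pivot_table(index="Chemical_Formula", columns="Property_name",
                       values="Property_Scalar", aggfunc="mean")
```

Note that "NO_2" and "NO2" stay on separate rows, since they are different strings; normalizing the formula spelling first would merge them.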

From there, you can drop the columns you don't need and plot the rest.
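A rough sketch of that last step, assuming a flat wide table as produced by `pivot` with `values="Property_Scalar"`: select one property, drop the formulas that have no value for it, and count materials per value range. The variable names and the bin edges are arbitrary illustrative choices, not something taken from the question.

```python
import numpy as np
import pandas as pd

# Same illustrative data as in the answer above.
df2 = pd.DataFrame({
    "Chemical_Formula": ["He", "NO_2", "CuO3", "CuO3", "He", "NO2"],
    "Property_name": ["Electrical conduc.", "Resistance", "Hardness",
                      "Fluorescence", "Toxicity", "Hardness"],
    "Property_Scalar": [1, 50, 10, 300, 39, 80],
})
wide = df2.pivot(index="Chemical_Formula", columns="Property_name",
                 values="Property_Scalar")

# Keep a single property and discard formulas with no value for it.
hardness = wide["Hardness"].dropna()

# Histogram counts: number of materials per value range.
# The bin edges below are arbitrary and only for illustration.
counts, edges = np.histogram(hardness, bins=[0, 50, 100])

# With matplotlib installed, hardness.plot.hist(bins=edges) would draw
# the same histogram inside the notebook.
```

If the dataset also carries a temperature for each measurement, the same selection could presumably be repeated per temperature (for example after a `groupby` on that column) to get one histogram for each of 100 K, 200 K, and 300 K.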
