How to reshape or clean data to be able to visualize it with violin plots?

My end goal is to visualize some data using a violin plot or something similar using Python.

I have the following data in a file (test.csv). The first column is a list of species. The other columns give the abundance of each species at a certain altitude (e.g. how abundant is species A at altitude 1000, 2000?). (Ignoring units for now.) How can I plot this as a violin plot (or something similar)?

test.csv

species,1000,2000,3000,4000,5000,6000,7000
species_A,0.5,0.5,,,2,1,2
species_B,0.5,1,0.5,0.5,1,1,10
species_C,1,1,10,3,15,4,5
species_D,15,3,2,1,0.5,1,3

The Python code I tried so far is below. This does not work because it only plots the distribution of altitudes, which is the same for all species (because they were all sampled from the same set of altitudes).

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

file = "test.csv"
df = pd.read_csv(file)

# convert columns to list
colnames = list(df.columns)
colnames.remove("species")

# Transform the data so that I have a dataframe with only three columns: species, Altitude, and Count
df = pd.melt(df, id_vars=["species"], value_vars=colnames, value_name="Count", var_name="Altitude")
df.species = df.species.astype("category")
df.Altitude = df.Altitude.astype("int")

# Plot the data
sns.violinplot(x="species", y="Altitude", data=df)
plt.title("Abundance of Species at Various Altitudes")
plt.grid(alpha=0.5, ls="--")
plt.xticks(rotation=90)

# show graph
plt.show()



You can make the "ungrouped" dataframe by reindexing on a repeated index:

df_2d = df.loc[df.index.repeat(
    df["Count"].fillna(0).astype(int)
)]

There should be a more direct way to generate a plot, but I don't know it. That your altitudes are discretized might not help.
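To make the reindexing idea concrete, here is a minimal sketch that applies it to the sample data from the question. The CSV is inlined through `io.StringIO` only so the snippet runs on its own, and the names `CSV_TEXT`, `wide`, `repeats`, and `rows_per_species` are illustrative, not part of the original code. One caveat worth knowing: `astype(int)` truncates, so a `Count` of 0.5 contributes no rows at all.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained.
CSV_TEXT = """species,1000,2000,3000,4000,5000,6000,7000
species_A,0.5,0.5,,,2,1,2
species_B,0.5,1,0.5,0.5,1,1,10
species_C,1,1,10,3,15,4,5
species_D,15,3,2,1,0.5,1,3
"""

wide = pd.read_csv(io.StringIO(CSV_TEXT))

# Long format: one row per (species, Altitude) pair with its Count.
df = wide.melt(id_vars="species", var_name="Altitude", value_name="Count")
df["Altitude"] = df["Altitude"].astype(int)

# Repeat each row Count times. NaN becomes 0 repeats, and astype(int)
# truncates, so a Count of 0.5 also contributes no rows.
repeats = df["Count"].fillna(0).astype(int)
df_2d = df.loc[df.index.repeat(repeats)].reset_index(drop=True)

# Number of expanded rows per species (sum of the truncated counts).
rows_per_species = df_2d.groupby("species").size()
print(rows_per_species)
```

Passing `df_2d` instead of `df` to `sns.violinplot(x="species", y="Altitude", data=df_2d)` should then weight each altitude by its abundance, since every altitude appears once per counted unit.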


I ended up creating a new Pandas DataFrame using the code below. I was hoping for something simpler or more elegant.

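import numpy as np
import pandas as pd

# Create a new dataframe with one row per unit of (rounded-up) Count
rows = []
for _, sp in df.iterrows():
    count = 0 if np.isnan(sp["Count"]) else int(np.ceil(sp["Count"]))
    rows.extend([{"species": sp["species"], "Altitude": sp["Altitude"]}] * count)
# DataFrame.append was removed in pandas 2.0, so build the frame once from the list
df_2d = pd.DataFrame(rows, columns=["species", "Altitude"])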
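A possibly simpler variant of the same idea, as a sketch only: round the counts up with `np.ceil` and repeat the rows in one vectorized step instead of looping. The helper names `expand_counts` and `plot_violin` are made up for illustration, and the small input frame just reuses a few values from the question's table.

```python
import numpy as np
import pandas as pd


def expand_counts(long_df):
    """Repeat each row ceil(Count) times; rows with a missing Count are dropped."""
    repeats = np.ceil(long_df["Count"].fillna(0)).astype(int)
    expanded = long_df.loc[long_df.index.repeat(repeats), ["species", "Altitude"]]
    return expanded.reset_index(drop=True)


def plot_violin(expanded):
    """Draw one violin per species from the expanded frame."""
    # Plotting libraries are imported here so the reshaping works without them.
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.violinplot(x="species", y="Altitude", data=expanded)
    plt.title("Abundance of Species at Various Altitudes")
    plt.xticks(rotation=90)
    plt.show()


# A few rows of the melted data from the question (species, Altitude, Count).
df = pd.DataFrame(
    {
        "species": ["species_A", "species_A", "species_A", "species_B"],
        "Altitude": [1000, 3000, 5000, 7000],
        "Count": [0.5, np.nan, 2.0, 10.0],
    }
)

df_2d = expand_counts(df)
print(df_2d["species"].value_counts())
# plot_violin(df_2d)  # uncomment to draw the violins
```

Rounding up keeps the 0.5 abundances visible as a single row each, which matches the loop above; whether that or truncation is the better choice depends on what the fractional counts mean in the data.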
