This is a simple solution to my question. It only deals with two models and two variables, but you could easily extend it with lists of the classifier names and the metrics you want to analyze. For my purposes, I just change the values of COI, ROI_1, and ROI_2 as needed.
NOTE: This solution is also generalizable.
How? Just change the values of COI, ROI_1, and ROI_2 and load any chosen dataset with df = pandas.read_csv("FILENAME.csv", ...). If you want another visualization, just change the pyplot settings near the end.
The key was indexing the original DataFrame with .loc["SOMESTRING"] and assigning the result to a new DataFrame. .loc is an indexer rather than a method: it returns only the rows whose index label matches the given string, so every other row is left out of the new DataFrame.
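To make the row selection concrete, here is a minimal sketch on a made-up table. The classifier names and scores are invented for illustration, and it assumes (as in my results file) that each classifier appears on several rows, so the index contains repeated labels.

```python
import pandas

# Hypothetical data: two classifiers, three runs each
toy = pandas.DataFrame(
    {
        "Area_under_PRC": [0.91, 0.88, 0.93, 0.71, 0.69, 0.75],
        "F_measure": [0.85, 0.84, 0.86, 0.66, 0.61, 0.70],
    },
    index=["ModelA", "ModelA", "ModelA", "ModelB", "ModelB", "ModelB"],
)

# .loc with a row label keeps every row carrying that label
only_a = toy.loc["ModelA"]
print(only_a.shape)  # (3, 2)

# A row label and a column label can be given in one step
prc_a = toy.loc["ModelA", "Area_under_PRC"]
print(prc_a.tolist())  # [0.91, 0.88, 0.93]
```

Note that if a label occurs only once, .loc["LABEL"] returns a Series instead of a DataFrame, so the column-dropping step in my script would not apply as written.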
Remember, however, to include index_col=0 when you read the file, OR use some other method to set the index of the DataFrame. Without this, your row labels will just be integers from 0 to MAX_INDEX, and .loc cannot look rows up by classifier name.
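A small sketch of the difference, using an in-memory CSV so it runs without any file. The column name Key_Scheme and the values are assumptions made up for this example, not taken from my actual results file.

```python
import io
import pandas

# Hypothetical CSV contents; the first column holds the classifier names
csv_text = """Key_Scheme,Area_under_PRC
ModelA,0.91
ModelA,0.88
ModelB,0.71
"""

# Without index_col, pandas numbers the rows 0..N-1
plain = pandas.read_csv(io.StringIO(csv_text))
print(plain.index.tolist())  # [0, 1, 2]

# With index_col=0, the first column becomes the row labels
labeled = pandas.read_csv(io.StringIO(csv_text), index_col=0)
print(labeled.index.tolist())  # ['ModelA', 'ModelA', 'ModelB']

# The "other method": set the index after reading
also_labeled = plain.set_index("Key_Scheme")
print(also_labeled.loc["ModelA", "Area_under_PRC"].tolist())  # [0.91, 0.88]
```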
# Written: April 4, 2019
import pandas # for loading and manipulating the data
from matplotlib import pyplot # for visualizations
from scipy.stats import ks_2samp # for 2-sample Kolmogorov-Smirnov test
import os # for deleting CSV files
# Function that drops, in place, every column whose name does not contain stringOfInterest
def removeColumns(DataFrame, typeArray, stringOfInterest):
    for columnName in typeArray:
        if stringOfInterest not in columnName:
            DataFrame.drop(columnName, axis=1, inplace=True)
# Get the whole DataFrame
df = pandas.read_csv("ExperimentResultsCondensed.csv", index_col=0)
dfCopy = df.copy() # a real copy; plain assignment would only add a second name for df
# Specified metrics and models for comparison
COI = "Area_under_PRC"
ROI_1 = "weka.classifiers.meta.AdaBoostM1[DecisionTable]"
ROI_2 = "weka.classifiers.meta.AdaBoostM1[DecisionStump]"
# Lists of the column headers and row labels in the DataFrame
# `rows` may act strangely
headers = list(df.columns)
rows = list(df.index)
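# keep only the rows of each model of interest
# (.copy() so the in-place column drop below works on an independent DataFrame)
df1 = dfCopy.loc[ROI_1].copy()
df2 = dfCopy.loc[ROI_2].copy()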
# remove irrelevant columns
removeColumns(df1, headers, COI)
removeColumns(df2, headers, COI)
# Make CSV files
# header=False so the column name is not read back as a data value below
df1.to_csv(ROI_1 + "-" + COI + ".csv", index=False, header=False)
df2.to_csv(ROI_2 + "-" + COI + ".csv", index=False, header=False)
results = pandas.DataFrame()
# Read CSV files
# The CSV files can hold any metric/measure; the one named in COI is used here
results[ROI_1] = pandas.read_csv(str(ROI_1 + "-" + COI + ".csv"), header=None).values[:, 0]
results[ROI_2] = pandas.read_csv(str(ROI_2 + "-" + COI + ".csv"), header=None).values[:, 0]
# Kolmogorov-Smirnov test, since the samples are non-Gaussian, independent, and have different variances
# Test configurations
value, pvalue = ks_2samp(results[ROI_1], results[ROI_2])
# Significance level (corresponds to a 95% confidence level)
alpha = 0.05
# Output the results
print('\n')
print('\033[1m' + '>>>TEST STATISTIC: ')
print(value)
print(">>>P-VALUE: ")
print(pvalue)
if pvalue > alpha:
    print('\t>>Samples are likely drawn from the same distributions (fail to reject H0 - NOT SIGNIFICANT)')
else:
    print('\t>>Samples are likely drawn from different distributions (reject H0 - SIGNIFICANT)')
# Plot files
df1.plot.density()
pyplot.xlabel(str(COI + " Values"))
pyplot.ylabel(str("Density"))
pyplot.title(str(COI + " Density Distribution of " + ROI_1))
pyplot.show()
df2.plot.density()
pyplot.xlabel(str(COI + " Values"))
pyplot.ylabel(str("Density"))
pyplot.title(str(COI + " Density Distribution of " + ROI_2))
pyplot.show()
# Delete Files
os.remove(str(ROI_1 + "-" + COI + ".csv"))
os.remove(str(ROI_2 + "-" + COI + ".csv"))
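
For the list-based version mentioned at the top, a rough sketch could look like the following. It is not what I actually ran: compare_all is a hypothetical helper (not part of pandas or SciPy), the toy data is invented, and it assumes the DataFrame is indexed by classifier name with one row per run. It also selects the samples directly with .loc instead of going through temporary CSV files.

```python
from itertools import combinations

import pandas
from scipy.stats import ks_2samp


def compare_all(df, classifiers, metrics, alpha=0.05):
    """Run a two-sample KS test for every classifier pair and every metric.

    Hypothetical helper; assumes df is indexed by classifier name.
    """
    records = []
    for metric in metrics:
        for first, second in combinations(classifiers, 2):
            # A list of labels makes .loc return a Series even for one row
            sample_1 = df.loc[[first], metric]
            sample_2 = df.loc[[second], metric]
            statistic, pvalue = ks_2samp(sample_1, sample_2)
            records.append(
                {
                    "metric": metric,
                    "model_1": first,
                    "model_2": second,
                    "statistic": statistic,
                    "pvalue": pvalue,
                    # Same decision rule as above: reject H0 unless pvalue > alpha
                    "significant": pvalue <= alpha,
                }
            )
    return pandas.DataFrame(records)


# Invented scores: five runs for each of two made-up classifiers
toy = pandas.DataFrame(
    {"Area_under_PRC": [0.91, 0.88, 0.93, 0.90, 0.92, 0.71, 0.69, 0.75, 0.70, 0.72]},
    index=["ModelA"] * 5 + ["ModelB"] * 5,
)
report = compare_all(toy, ["ModelA", "ModelB"], ["Area_under_PRC"])
print(report)
```

Each row of the returned DataFrame is one comparison, so the result can be filtered or written to CSV like any other table.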