Outlier Elimination in Spark With InterQuartileRange Results in Error
I have the following function, which is supposed to identify the outliers in a given dataset:

    import org.apache.spark.sql.DataFrame

    def interQuartileRangeFiltering(df: DataFrame): DataFrame = {
      def inner(cols: List[String], acc: DataFrame): DataFrame = cols match {
        case Nil => acc
        case column :: xs =>
          // Exact quartiles (relative error 0.0)
          val quantiles = acc.stat.approxQuantile(column, Array(0.25, 0.75), 0.0) // TODO: values should come from config
          println(s"$column ${quantiles.size}")
          val q1 = quantiles(0)
          val q3 = quantiles(1)
          val iqr = q3 - q1 // IQR is Q3 - Q1
          val lowerRange = q1 - 1.5 * iqr
          val upperRange = q3 + 1.5 * iqr
          // Keep the rows that fall outside the IQR fences for this column
          val filtered = acc.filter(s"$column < $lowerRange or $column > $upperRange")
          inner(xs, filtered)
      }
      inner(df.columns.toList, df)
    }

    val outlierDF = interQuartileRangeFiltering(incomingDF)

However, a few features in incomingDF are categorical, that is, binary columns with a value of 0 or 1. If I include them, I get the following error:

    housing_median_age 2
    inland 2
    island 2
    population 2
    total_bedrooms 2
    near_bay 2
    near_ocean 2
    median_house_value 0
    java.lang.ArrayIndexOutOfBoundsException: 0
      at inner$1(<console>:75)
      at interQuartileRangeFiltering(<console>:83)
      ... 54 elided

I have a few questions on how to deal with outliers for data that is either 0 or 1. I could exclude those columns from the IQR calculation, which seems like a reasonable approach. But if I exclude them, how do I join the resulting DataFrame (after running it through the recursive function above) back with the one-hot encoded columns?
For example, if the original DataFrame (incomingDF in this case) contains 10000 rows and ends up with around 9000 rows after outlier detection, the excluded one-hot encoded columns still have 10000 rows. How would I merge these two DataFrames? This is confusing to me.
Could someone please help me find a way out?
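
For reference, one possible way to avoid the merge altogether, as a minimal sketch rather than a definitive implementation: compute the quartiles only for the continuous columns, but apply each filter to the full DataFrame, so the 0/1 columns stay attached to the rows that survive. The helper name `iqrFilterOn` and the `continuousCols` parameter are hypothetical and not part of the code above. The sketch assumes the goal is to keep the rows inside the IQR fences, and it skips a column when `approxQuantile` returns no values instead of indexing into the empty array.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: `continuousCols` names the columns to run IQR on.
// All other columns (e.g. the 0/1 one-hot columns) are carried along untouched.
def iqrFilterOn(df: DataFrame, continuousCols: Seq[String]): DataFrame =
  continuousCols.foldLeft(df) { (acc, column) =>
    acc.stat.approxQuantile(column, Array(0.25, 0.75), 0.0) match {
      case Array(q1, q3) =>
        val iqr = q3 - q1
        // Filter whole rows, so every column keeps the same row count.
        acc.filter(col(column).between(q1 - 1.5 * iqr, q3 + 1.5 * iqr))
      case _ =>
        // No quantiles came back (for example, no rows left); leave acc as-is.
        acc
    }
  }
```

It could be called as `iqrFilterOn(incomingDF, Seq("population", "total_bedrooms"))`, where the column list is only an example. Because rows are dropped from the whole DataFrame, the one-hot encoded columns end up with the same number of rows as the filtered continuous columns.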
