PySpark: How do I specify dropna axis in PySpark transformation?
I would like to drop columns that contain all null values using dropna(). With Pandas you can do this by setting the keyword argument axis='columns' in dropna(). Here is an example in a GitHub post.
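For reference, a minimal sketch of the Pandas behavior described above (the data here is illustrative, mirroring the dataframe further down):

```python
import numpy as np
import pandas as pd

# Drop columns whose values are all NaN/null.
df = pd.DataFrame({
    'furniture': [np.nan, np.nan, np.nan],
    'myid': ['1-12', '0-11', '2-12'],
})
df = df.dropna(axis='columns', how='all')  # 'furniture' is removed
print(list(df.columns))  # → ['myid']
```

With how='all' a column is dropped only when every value is missing; the default how='any' would also drop columns containing a single NaN.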
How do I do this in PySpark? dropna() is available as a transformation in PySpark, but axis is not an available keyword argument.

Note: I do not want to transpose my dataframe for this to work.
How would I drop the furniture column from this dataframe?
import numpy as np
import pandas as pd

data_2 = {
    'furniture': [np.nan, np.nan, np.nan],
    'myid': ['1-12', '0-11', '2-12'],
    'clothing': ['pants', 'shoes', 'socks'],
}
df_1 = pd.DataFrame(data_2)
ddf_1 = spark.createDataFrame(df_1)  # assumes an active SparkSession `spark`
ddf_1.show()
Topic pyspark python data-cleaning
Category Data Science