
Tuesday, November 13, 2018

Pandas UDF for PySpark, handling missing data


Problem statement:
  You have a DataFrame with a string column, but some of the values are the empty string. You need to apply the OneHotEncoder, which fails on empty strings.

Solution:
  Use a Pandas UDF to translate the empty strings into another constant string.

First, consider the function to apply the OneHotEncoder:

from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
# ...
def one_hot_encode(_df, input_column, output_column):
    # Map each distinct string value to a numeric index; skip rows with invalid values
    indexer = StringIndexer(inputCol=input_column, outputCol=input_column + "_indexed", handleInvalid='skip')
    _model = indexer.fit(_df)
    _td = _model.transform(_df)
    # One-hot encode the numeric indices into a sparse vector column
    encoder = OneHotEncoder(inputCol=input_column + "_indexed", outputCol=output_column, dropLast=True)
    _df2 = encoder.transform(_td)
    return _df2
Now the interesting part. This is the Pandas UDF function:
from pyspark.sql.functions import pandas_udf
#...
# Use pandas_udf to define a Pandas UDF
# Input and output are both a pandas.Series of strings
@pandas_udf('string')
def pandas_not_null(s):
    # Replace nulls and empty strings with sentinel values the encoder can accept
    return s.fillna("_NO_₦Ӑ_").replace('', '_NO_ӖӍΡṬΫ_')

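Because Spark hands the column to the UDF as a plain pandas.Series, you can test the cleaning logic in pure pandas without a Spark session. Here is a minimal sketch of what `pandas_not_null` does, using ASCII sentinel strings (the sentinel names are illustrative, not the ones from the UDF above):

```python
import pandas as pd

# Toy Series mimicking the problem column: a value, an empty string, and a null
s = pd.Series(["blue", "", None])

# Same logic as pandas_not_null: fill nulls first, then replace empty strings
cleaned = s.fillna("_NO_NA_").replace("", "_NO_EMPTY_")

print(cleaned.tolist())  # ['blue', '_NO_EMPTY_', '_NO_NA_']
```

Note the order matters: `fillna` handles the nulls first, so `replace` only has to deal with the empty strings.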
And now you can create a new column and apply the OneHotEncoder:

dataframe = dataframe.withColumn('ACOLUMN_not_null', pandas_not_null('ACOLUMN'))
dataframe = one_hot_encode(dataframe, "ACOLUMN_not_null", "ACOLUMN_one_hot")

For more information, see https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html and http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html#pandas.Series.str.replace.

This is the exception you get if you don't replace the empty string:

   File "/Users/vivaomengo/anaconda/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Users/vivaomengo/anaconda/lib/python3.6/site-packages/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Cannot have an empty string for name.'

