FB_init
Tuesday, November 13, 2018
Pandas UDF for PySpark, handling missing data
Problem statement:
You have a DataFrame in which one column holds string values, and some of those values are the empty string. You need to apply the OneHotEncoder to that column, but it fails when one of the categories is the empty string.
Solution:
Use a Pandas UDF to translate the empty strings into another constant string.
First, consider the function to apply the OneHotEncoder:
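A minimal sketch of such a function, assuming the Spark 3.x `OneHotEncoder` (which accepts `inputCol`/`outputCol`) and a `StringIndexer` in front of it to turn the strings into category indices. The function name, the parameter names, and the `_index` suffix are hypothetical and chosen for illustration only.

```python
def one_hot_encode(df, input_col, output_col):
    """Return df with output_col holding the one-hot vector for input_col.

    Hypothetical helper: the name and signature are illustrative.
    """
    # PySpark is imported inside the function so that this sketch can be
    # loaded on a machine without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import OneHotEncoder, StringIndexer

    index_col = input_col + "_index"  # temporary column for category indices

    # StringIndexer maps each distinct string to a numeric index;
    # OneHotEncoder then turns that index into a sparse vector.
    indexer = StringIndexer(inputCol=input_col, outputCol=index_col)
    encoder = OneHotEncoder(inputCol=index_col, outputCol=output_col)

    # Pipeline.fit handles both estimator and transformer stages, so the
    # same code copes with the encoder being either one.
    model = Pipeline(stages=[indexer, encoder]).fit(df)
    return model.transform(df).drop(index_col)
```

Wrapping both stages in a `Pipeline` keeps the fit and transform steps in one place and avoids calling `fit` on each stage by hand.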
Now the interesting part. This is the Pandas UDF function:
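A sketch of what such a UDF could look like, assuming PySpark with `pyarrow` installed and a pandas version whose `Series.str.replace` accepts the `regex` argument. The placeholder value `"EMPTY"`, the pattern constant, and the function names are assumptions made for illustration; pick a placeholder that does not already occur in your data.

```python
EMPTY_PLACEHOLDER = "EMPTY"  # hypothetical constant; any non-empty label works
EMPTY_PATTERN = r"^$"        # regex that matches only a completely empty string


def make_replace_empty_udf(placeholder=EMPTY_PLACEHOLDER):
    """Build a scalar Pandas UDF that maps '' to the given placeholder."""
    # Imported lazily so the sketch can be loaded without Spark or pandas.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def replace_empty(values: pd.Series) -> pd.Series:
        # Vectorized replacement over a whole batch of rows at once.
        # Non-empty strings are returned unchanged; nulls stay null.
        return values.str.replace(EMPTY_PATTERN, placeholder, regex=True)

    return replace_empty
```

Calling `make_replace_empty_udf()` returns a column function that can be used like any other Spark SQL function. Because the UDF receives a whole `pandas.Series` per batch, the replacement runs as one vectorized pandas call rather than once per row.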
And now you can create a new column and apply the OneHotEncoder:
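The two steps can be sketched as one small function. Here `replace_empty` stands in for the Pandas UDF and `encode` for the OneHotEncoder function described above; both are passed in as arguments, and the `_cleaned` and `_onehot` column suffixes are hypothetical names chosen for this example.

```python
def encode_with_placeholder(df, column, replace_empty, encode):
    """Clean the empty strings in `column`, then one-hot encode the result.

    replace_empty: column function (for example a Pandas UDF) mapping '' to a
                   placeholder string.
    encode:        callable (df, input_col, output_col) -> DataFrame that
                   applies the OneHotEncoder.
    """
    cleaned_col = column + "_cleaned"  # hypothetical name for the new column
    encoded_col = column + "_onehot"   # hypothetical name for the vector column

    # Step 1: add a new column in which empty strings have been replaced.
    cleaned = df.withColumn(cleaned_col, replace_empty(df[column]))

    # Step 2: encode the cleaned column instead of the original one.
    return encode(cleaned, cleaned_col, encoded_col)
```

Keeping the original column untouched and encoding the cleaned copy makes it easy to check afterwards which rows originally held the empty string.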
For more information, see the Databricks post introducing vectorized UDFs for PySpark (https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html) and the pandas documentation for Series.str.replace (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html#pandas.Series.str.replace).
This is the exception you get if you don't replace the empty string:
  File "/Users/vivaomengo/anaconda/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Users/vivaomengo/anaconda/lib/python3.6/site-packages/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Cannot have an empty string for name.'