Sunday, November 18, 2018
PySpark: references to variable number of columns in UDF
Problem statement:
Suppose that you want to create a column in a DataFrame based on many existing columns, but you don't know in advance how many columns there will be, perhaps because the list of columns is supplied by the user or by another program.
This is how you can do it:
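A minimal sketch, assuming the selected columns are numeric and the new column should hold their sum; the DataFrame, column names and the aggregation are placeholders. The trick is to pass the columns as a single struct, so the same UDF works for any number of columns:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)], ["a", "b", "c"])

# The list of columns is only known at run time (from the user or another program).
input_cols = ["a", "b", "c"]

@udf(returnType=DoubleType())
def row_sum(row):
    # 'row' is a single struct, so the UDF does not care how many columns it holds.
    return float(sum(row))

df = df.withColumn("total", row_sum(struct(*[df[c] for c in input_cols])))
df.show()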
Saturday, November 17, 2018
PySpark: Concatenate two DataFrame columns using UDF
Problem Statement:
Using PySpark, you have a DataFrame with two columns that hold vectors of floats, and you want to create a new column containing the concatenation of those two vectors.
This is how you can do it:
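A minimal sketch, assuming the two columns hold pyspark.ml.linalg vectors; the DataFrame and column names are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0]), Vectors.dense([3.0])),
     (Vectors.dense([4.0, 5.0]), Vectors.dense([6.0]))],
    ["vec1", "vec2"])

@udf(returnType=VectorUDT())
def concat_vectors(v1, v2):
    # Concatenate the values of both vectors into a single dense vector.
    return Vectors.dense(v1.toArray().tolist() + v2.toArray().tolist())

df = df.withColumn("vec_concat", concat_vectors(df["vec1"], df["vec2"]))
df.show(truncate=False)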
Thursday, November 15, 2018
PySpark, NLP and Pandas UDF
Problem statement:
Assume that your PySpark DataFrame has a column with text, and that you want to apply NLP to vectorize this text, creating a new column.
This is how to do it using @pandas_udf.
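A minimal sketch of such a Pandas UDF (Spark 3 type-hint syntax), assuming spaCy and an English model with word vectors, for example en_core_web_md, are installed; the function and placeholder strings are only illustrative:

import pandas as pd
import spacy
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, DoubleType

nlp = spacy.load("en_core_web_md")

@pandas_udf(ArrayType(DoubleType()))
def vectorize_text(s: pd.Series) -> pd.Series:
    # Fill in missing data before calling the NLP pipeline.
    s = s.fillna("_NO_NA_").replace('', '_NO_EMPTY_')
    # nlp(astring) parses the text; doc.vector is its document vector.
    return s.apply(lambda astring: nlp(astring).vector.tolist())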
spaCy is the NLP library used (see https://spacy.io/api/doc). nlp(astring) is the call that vectorizes the text. The s.fillna("_NO_NA_").replace('', '_NO_EMPTY_') expression fills in missing data.
Now you can create a new column in the DataFrame by calling the function.
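For example, assuming the text column is called "text" (the name is a placeholder):

df = df.withColumn("text_vector", vectorize_text(df["text"]))
df.show(truncate=False)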
For more information on Pandas UDF see
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
Tuesday, November 13, 2018
Pandas UDF for PySpark, handling missing data
Problem statement:
You have a DataFrame with a column of string values, but some of the values are the empty string. You need to apply the OneHotEncoder, but it does not accept the empty string.
Solution:
Use a Pandas UDF to translate the empty strings into another constant string.
First, consider the function to apply the OneHotEncoder:
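A minimal sketch of one way to write it, using StringIndexer followed by OneHotEncoder (Spark 3 API); the function and column names are placeholders:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

def one_hot_encode(df, input_col, output_col):
    # Index the string column, then one-hot encode the resulting indices.
    indexer = StringIndexer(inputCol=input_col, outputCol=input_col + "_idx")
    encoder = OneHotEncoder(inputCols=[input_col + "_idx"], outputCols=[output_col])
    return Pipeline(stages=[indexer, encoder]).fit(df).transform(df)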
Now the interesting part. This is the Pandas UDF function:
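A minimal sketch, where "_EMPTY_" is an arbitrary placeholder value:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def fill_empty(s: pd.Series) -> pd.Series:
    # Replace values that are exactly the empty string with a constant.
    return s.str.replace(r"^$", "_EMPTY_", regex=True)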
And now you can create a new column and apply the OneHotEncoder:
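Putting it together, assuming the string column is called "category" (the names are placeholders):

df = df.withColumn("category_filled", fill_empty(df["category"]))
df = one_hot_encode(df, "category_filled", "category_vec")
df.show(truncate=False)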
For more information, see https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html and http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html#pandas.Series.str.replace.
This is the exception you get if you don't replace the empty string:
File "/Users/vivaomengo/anaconda/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/Users/vivaomengo/anaconda/lib/python3.6/site-packages/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Cannot have an empty string for name.'