
Sunday, November 18, 2018

PySpark: references to variable number of columns in UDF


Problem statement:

  Suppose that you want to create a column in a DataFrame based on many existing columns, but you don't know in advance how many columns there will be, perhaps because the list of columns is supplied by the user or by another program.

This is how you can do it:

Saturday, November 17, 2018

PySpark: Concatenate two DataFrame columns using UDF


Problem Statement:
  Using PySpark, you have a DataFrame with two columns that contain vectors of floats, and you want to create a new column that holds the concatenation of the two.

This is how you can do it:

Thursday, November 15, 2018

PySpark, NLP and Pandas UDF


Problem statement:
  Assume that your PySpark DataFrame has a column with text, and that you want to apply NLP to vectorize this text, creating a new column.

This is how to do it using @pandas_udf.



spaCy is the NLP library used (see https://spacy.io/api/doc). nlp(astring) is the call that vectorizes the text. The s.fillna("_NO_₦Ӑ_").replace('', '_NO_ӖӍΡṬΫ_') expression fills in missing data, using constants unlikely to collide with real text.

Now you can create a new column in the dataframe calling the function.


For more information on Pandas UDF see
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html

Tuesday, November 13, 2018

Pandas UDF for PySpark, handling missing data


Problem statement:
  You have a DataFrame in which one column has string values, but some of them are the empty string. You need to apply the OneHotEncoder, but it does not accept the empty string.

Solution:
  Use a Pandas UDF to translate the empty strings into another constant string.

First, consider the function to apply the OneHotEncoder:

Now the interesting part. This is the Pandas UDF function:

And now you can create a new column and apply the OneHotEncoder:


For more information, see https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html and http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html#pandas.Series.str.replace.

This is the exception you get if you don't replace the empty string:

   File "/Users/vivaomengo/anaconda/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/Users/vivaomengo/anaconda/lib/python3.6/site-packages/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Cannot have an empty string for name.'