FB_init

Thursday, November 15, 2018

PySpark, NLP and Pandas UDF


Problem statement:
  Suppose a DataFrame in PySpark has a column with text, and you want to apply NLP to vectorize that text, creating a new column.

This is how to do it using @pandas_udf.


import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
import spacy
#...
# nlp = spacy.load('en_core_web_lg')
nlp = spacy.load('en_core_web_sm')
#...
# Use pandas_udf to define a Pandas UDF
@pandas_udf('array<double>', PandasUDFType.SCALAR)
def pandas_nlp(s):
    # The input is a pandas.Series of strings; the output is a pandas.Series of arrays of double.
    return s.fillna("_NO_₦Ӑ_").replace('', '_NO_ӖӍΡṬΫ_').transform(lambda x: nlp(x).vector.tolist())

spaCy is the NLP library used (see https://spacy.io/api/doc). nlp(astring) is the call that vectorizes the text, returning the document's vector. The s.fillna("_NO_₦Ӑ_").replace('', '_NO_ӖӍΡṬΫ_') chain replaces null and empty strings with sentinel tokens, so that nlp() never receives missing input.
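The null/empty handling can be illustrated without Spark or spaCy. The sketch below runs the same fillna/replace/transform chain on a plain pandas.Series, with a hypothetical fake_vectorize stub standing in for nlp(x).vector.tolist() (here it just maps each token to its length, so the stub and sentinel names are illustrative, not part of the original code):

```python
import pandas as pd

# Hypothetical stand-in for spaCy's nlp(x).vector.tolist():
# maps each token to its length, just to produce a list of doubles.
def fake_vectorize(text):
    return [float(len(tok)) for tok in text.split()]

s = pd.Series(["big data", "", None])

# Same chain as in the UDF body: fill nulls and empty strings
# with sentinel tokens, then vectorize each entry.
vectors = (s.fillna("_NO_NA_")
            .replace("", "_NO_EMPTY_")
            .transform(fake_vectorize))

print(vectors.tolist())
# → [[3.0, 4.0], [10.0], [7.0]]
```

Without the sentinels, nlp("") would produce a degenerate vector and nlp(None) would raise an error, which is why the chain runs before the transform.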

Now you can create a new column in the DataFrame by calling the function:

dataframe = dataframe.withColumn('description_vec', pandas_nlp('description'))

For more information on Pandas UDF see
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
