FB_init
Thursday, November 15, 2018
PySpark, NLP and Pandas UDF
Problem statement:
Assume that your PySpark DataFrame has a column of text, and that you want to apply NLP to vectorize this text into a new column.
This is how to do it using @pandas_udf.
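spaCy is the NLP library used (see https://spacy.io/api/doc). nlp(astring) is the call that processes the text into a spaCy Doc, whose vector attribute holds the vector for that text. The s.fillna("_NO_₦Ӑ_").replace('', '_NO_ӖӍΡṬΫ_') expression replaces missing values and empty strings with placeholder tokens before the text reaches spaCy.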
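As a rough sketch of the per-batch logic described above, here it is written as a plain pandas function so it can be tried without a Spark cluster. A pandas UDF receives each batch of the column as a pandas Series, so the body would look much the same. The names fill_missing, vectorize_series, NO_NA and NO_EMPTY are hypothetical, and nlp is assumed to be any callable that returns an object with a NumPy-style vector attribute, as a loaded spaCy pipeline does.

```python
import pandas as pd

# Placeholder tokens for missing and empty text (same strings as in the post).
NO_NA = "_NO_₦Ӑ_"
NO_EMPTY = "_NO_ӖӍΡṬΫ_"


def fill_missing(s: pd.Series) -> pd.Series:
    """Replace nulls and empty strings so every row holds a non-empty string."""
    return s.fillna(NO_NA).replace('', NO_EMPTY)


def vectorize_series(s: pd.Series, nlp) -> pd.Series:
    """Return one list of floats per input string.

    `nlp` is assumed to behave like a loaded spaCy pipeline:
    nlp(astring) returns an object whose .vector is a NumPy array.
    """
    cleaned = fill_missing(s)
    return cleaned.apply(lambda astring: nlp(astring).vector.tolist())
```

With spaCy installed, something like nlp = spacy.load("en_core_web_md") could supply the callable. The model name is an assumption; any model that ships word vectors should do.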
Now you can create a new column in the DataFrame by calling the function.
For more information on Pandas UDFs, see
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html