FB_init

Thursday, November 15, 2018

PySpark, NLP and Pandas UDF


Problem statement:
  Suppose a DataFrame in PySpark has a column with text, and you want to apply NLP to vectorize that text, creating a new column.

This is how to do it using @pandas_udf.


import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
import spacy
#...
# nlp = spacy.load('en_core_web_lg')
nlp = spacy.load('en_core_web_sm')
#...
# Use pandas_udf to define a Pandas UDF
@pandas_udf('array<double>', PandasUDFType.SCALAR)
def pandas_nlp(s):
    # The input is a pandas.Series of strings; the output is a pandas.Series of arrays of double.
    return s.fillna("_NO_₦Ӑ_").replace('', '_NO_ӖӍΡṬΫ_').transform(lambda x: nlp(x).vector.tolist())

spaCy is the NLP library used (see https://spacy.io/api/doc). nlp(astring) is the call that vectorizes the text, returning the document's vector. The s.fillna("_NO_₦Ӑ_").replace('', '_NO_ӖӍΡṬΫ_') chain replaces null and empty strings with sentinel tokens, so that nlp() never receives missing input.
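The null/empty handling can be illustrated without Spark or spaCy. The sketch below runs the same fillna/replace/transform chain on a plain pandas.Series, with a hypothetical fake_vectorize stub standing in for nlp(x).vector.tolist() (here it just maps each token to its length, so the stub and sentinel names are illustrative, not part of the original code):

```python
import pandas as pd

# Hypothetical stand-in for spaCy's nlp(x).vector.tolist():
# maps each token to its length, just to produce a list of doubles.
def fake_vectorize(text):
    return [float(len(tok)) for tok in text.split()]

s = pd.Series(["big data", "", None])

# Same chain as in the UDF body: fill nulls and empty strings
# with sentinel tokens, then vectorize each entry.
vectors = (s.fillna("_NO_NA_")
            .replace("", "_NO_EMPTY_")
            .transform(fake_vectorize))

print(vectors.tolist())
# → [[3.0, 4.0], [10.0], [7.0]]
```

Without the sentinels, nlp("") would produce a degenerate vector and nlp(None) would raise an error, which is why the chain runs before the transform.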

Now you can create a new column in the DataFrame by calling the function:

dataframe = dataframe.withColumn('description_vec', pandas_nlp('description'))

For more information on Pandas UDF see
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
