Problem statement:
Assume that your PySpark DataFrame has a column containing text, and that you want to apply NLP to vectorize that text into a new column.
This is how to do it using @pandas_udf.
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
import spacy

# ...
# nlp = spacy.load('en_core_web_lg')
nlp = spacy.load('en_core_web_sm')
# ...

# Use pandas_udf to define a Pandas UDF.
# The input is a pandas.Series of strings; the output is a pandas.Series of arrays of double.
@pandas_udf('array<double>', PandasUDFType.SCALAR)
def pandas_nlp(s):
    return s.fillna("_NO_₦Ӑ_").replace('', '_NO_ӖӍΡṬΫ_').transform(lambda x: nlp(x).vector.tolist())
spaCy is the NLP library used here (see https://spacy.io/api/doc); calling nlp() on a string returns a Doc whose .vector attribute is the vector for that text. The s.fillna("_NO_₦Ӑ_").replace('', '_NO_ӖӍΡṬΫ_') chain replaces missing values and empty strings with placeholder tokens, so that spaCy always receives a non-empty string.
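To see what that chain does, you can run the same expressions on a plain pandas Series outside Spark, which mimics one batch that Spark hands to the UDF (a minimal sketch; the sample strings are made up):

import pandas as pd
import spacy

nlp = spacy.load('en_core_web_sm')

# A sample batch with a regular string, a missing value, and an empty string.
s = pd.Series(["A fast red fox.", None, ""])

vectors = (
    s.fillna("_NO_₦Ӑ_")                             # nulls become a placeholder token
     .replace('', '_NO_ӖӍΡṬΫ_')                      # empty strings become another placeholder
     .transform(lambda x: nlp(x).vector.tolist())    # one spaCy vector per row
)
print(vectors)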
Now you can create a new column in the DataFrame by calling the function.
dataframe = dataframe.withColumn('description_vec', pandas_nlp('description'))
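As a rough end-to-end check, you can build a toy DataFrame, apply the UDF defined above, and inspect the schema and result (a sketch; the sample rows and the SparkSession setup are assumptions, not part of the original snippet):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data: a text column with a regular value, a null, and an empty string.
dataframe = spark.createDataFrame(
    [(1, "A fast red fox."), (2, None), (3, "")],
    ["id", "description"],
)

dataframe = dataframe.withColumn('description_vec', pandas_nlp('description'))
dataframe.printSchema()          # description_vec is array<double>
dataframe.show(truncate=40)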
For more information on Pandas UDFs, see
https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html
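Note that in Spark 3.0 and later, PandasUDFType is deprecated in favor of Python type hints; the same UDF can be written roughly as follows (a sketch, assuming Spark 3.x):

import pandas as pd
from pyspark.sql.functions import pandas_udf
import spacy

nlp = spacy.load('en_core_web_sm')

# The SCALAR evaluation type is inferred from the Series -> Series type hints.
@pandas_udf('array<double>')
def pandas_nlp(s: pd.Series) -> pd.Series:
    return s.fillna("_NO_₦Ӑ_").replace('', '_NO_ӖӍΡṬΫ_').transform(lambda x: nlp(x).vector.tolist())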