Problem Statement:
Using PySpark, suppose you have a DataFrame with two columns holding vectors of floats, and you want to create a new column containing the concatenation of those two vectors.
This is how you can do it:
import numpy as np
import pyspark.sql.functions as f
import pyspark.sql.types as t

# ...

def udf_concat_vec(a, b):
    # a and b are pyspark.ml.linalg vectors (e.g. SparseVector);
    # convert each to a dense NumPy array and concatenate them
    return np.concatenate((a.toArray(), b.toArray())).tolist()

# register the UDF with an array-of-floats return type
# (f.udf is the equivalent in newer Spark versions)
my_udf_concat_vec = f.UserDefinedFunction(udf_concat_vec, t.ArrayType(t.FloatType()))

df2 = df.withColumn("togetherAB", my_udf_concat_vec('columnA', 'columnB'))
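For example, applied to a small DataFrame with two dense-vector columns (the session setup and toy data below are illustrative, not from the original post):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# toy DataFrame: each row has two vector columns
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0]), Vectors.dense([3.0])),
     (Vectors.dense([4.0, 5.0]), Vectors.dense([6.0]))],
    ["columnA", "columnB"])

df2 = df.withColumn("togetherAB", my_udf_concat_vec('columnA', 'columnB'))
df2.show(truncate=False)  # togetherAB is an array<float>, e.g. [1.0, 2.0, 3.0]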
1 comment:
Hi Gustavo,
Thank you for your blog, it has helped me with many of my problems.
When concatenating numerical columns in PySpark I use functions.concat() from pyspark.sql.
Wouldn't that perhaps be more efficient than a user-defined function? There is also functions.concat_ws() for text.
Thanks in advance,
Ferran
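A note on that suggestion: functions.concat() operates on string columns and, since Spark 2.4, on array columns, but it does not accept Spark ML vector columns such as SparseVector, which is why the UDF above converts the vectors to arrays first. If the columns were already array-typed rather than vectors, the native function would likely be more efficient than a Python UDF; a minimal sketch, assuming Spark 2.4+ and that columnA and columnB are array<float>:

# assumes Spark 2.4+ and array-typed columns (not ML vectors)
df2 = df.withColumn("togetherAB", f.concat(f.col('columnA'), f.col('columnB')))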