Saturday, November 17, 2018

PySpark: Concatenate two DataFrame columns using UDF


Problem Statement:
  Using PySpark, you have a DataFrame with two columns that hold vectors of floats, and you want to create a new column containing the concatenation of those two vectors.

This is how you can do it:

import numpy as np
import pyspark.sql.functions as f
import pyspark.sql.types as t
# ...
def udf_concat_vec(a, b):
    # a and b are of type SparseVector
    return np.concatenate((a.toArray(), b.toArray())).tolist()

my_udf_concat_vec = f.UserDefinedFunction(udf_concat_vec, t.ArrayType(t.FloatType()))
df2 = df.withColumn("togetherAB", my_udf_concat_vec('columnA', 'columnB'))
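Outside of Spark, the per-row work the UDF does is just NumPy concatenation. A minimal sketch of that core step, using hand-picked dense arrays standing in for what SparseVector.toArray() would return (the values here are illustrative, not from any real DataFrame):

```python
import numpy as np

# Dense arrays, as SparseVector.toArray() would produce (example values)
a = np.array([1.0, 0.0, 3.0])
b = np.array([0.0, 5.0])

# The same operation the UDF performs for each row
together = np.concatenate((a, b)).tolist()
print(together)  # [1.0, 0.0, 3.0, 0.0, 5.0]
```

The .tolist() call matters: the UDF must return a plain Python list for Spark to map it onto the declared ArrayType, since a NumPy array is not a type Spark's serializer accepts.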

1 comment:

Anonymous said...

Hi Gustavo,

Thank you for your blog, it has helped in many of my problems.
When concatenating numerical columns in pyspark I use:

functions.concat() from pyspark.sql.

Wouldn't that perhaps be more efficient than a user-defined function? There is also functions.concat_ws() for text.

Thanks in advance,

Ferran