pyspark.sql.functions.collect_set

pyspark.sql.functions.collect_set(col)

Aggregate function: collects the values of a column into an array with duplicate elements eliminated.

Note

The function is non-deterministic: the order of the collected results depends on the order of the rows, which may itself be non-deterministic after a shuffle.

>>> from pyspark.sql.functions import collect_set
>>> df2 = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
>>> df2.agg(collect_set('age')).collect()
[Row(collect_set(age)=[5, 2])]

New in version 1.6.