pyspark.sql.DataFrame.sample

DataFrame.sample(withReplacement=None, fraction=None, seed=None)

Returns a sampled subset of this DataFrame.

Parameters
  • withReplacement – Sample with replacement or not (default False).

  • fraction – Fraction of rows to generate, range [0.0, 1.0].

  • seed – Seed for sampling (default a random seed).

Note

The returned sample is not guaranteed to contain exactly the specified fraction of the rows in the given DataFrame.
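The variation in the row count can be illustrated without Spark. The following is a minimal pure-Python sketch, assuming that sampling without replacement keeps each row independently with probability equal to fraction (Bernoulli sampling). The helper name bernoulli_sample is hypothetical and is not part of the PySpark API; it is a simplified model, not Spark's implementation.

```python
import random


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper for illustration only; not a PySpark function.
    """
    rng = random.Random(seed)
    # random() returns a float in [0.0, 1.0), so fraction=1.0 keeps every
    # row and fraction=0.0 keeps none.
    return [row for row in rows if rng.random() < fraction]


rows = list(range(10))

# The number of rows kept depends on the seed, so the counts are spread
# around fraction * len(rows) rather than all being equal to it.
counts = [len(bernoulli_sample(rows, 0.5, seed=s)) for s in range(5)]
print(counts)

# Reusing a seed reproduces the same sample.
print(bernoulli_sample(rows, 0.5, seed=3) == bernoulli_sample(rows, 0.5, seed=3))
```

Under this model the expected count is fraction times the number of rows, but any single run can land above or below it, which is consistent with the counts shown in the examples further down.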

Note

fraction is required; withReplacement and seed are optional.

>>> df = spark.range(10)
>>> df.sample(0.5, 3).count()
7
>>> df.sample(fraction=0.5, seed=3).count()
7
>>> df.sample(withReplacement=True, fraction=0.5, seed=3).count()
1
>>> df.sample(1.0).count()
10
>>> df.sample(fraction=1.0).count()
10
>>> df.sample(False, fraction=1.0).count()
10

New in version 1.3.