DataFrame.
summary
Computes specified statistics for numeric and string columns. Available statistics are: - count - mean - stddev - min - max - arbitrary approximate percentiles specified as a percentage (eg, 75%)
If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
Note
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
DataFrame
>>> df.summary().show() +-------+------------------+-----+ |summary| age| name| +-------+------------------+-----+ | count| 2| 2| | mean| 3.5| null| | stddev|2.1213203435596424| null| | min| 2|Alice| | 25%| 2| null| | 50%| 2| null| | 75%| 5| null| | max| 5| Bob| +-------+------------------+-----+
>>> df.summary("count", "min", "25%", "75%", "max").show() +-------+---+-----+ |summary|age| name| +-------+---+-----+ | count| 2| 2| | min| 2|Alice| | 25%| 2| null| | 75%| 5| null| | max| 5| Bob| +-------+---+-----+
To do a summary for specific columns first select them:
>>> df.select("age", "name").summary("count").show() +-------+---+----+ |summary|age|name| +-------+---+----+ | count| 2| 2| +-------+---+----+
See also describe for basic statistics.
New in version 2.3.0.