SparkSession.
createDataFrame
Creates a DataFrame from an RDD, a list or a pandas.DataFrame.
DataFrame
RDD
pandas.DataFrame
When schema is a list of column names, the type of each column will be inferred from data.
schema
data
When schema is None, it will try to infer the schema (column names and types) from data, which should be an RDD of either Row, namedtuple, or dict.
None
Row
namedtuple
dict
When schema is pyspark.sql.types.DataType or a datatype string, it must match the real data, or an exception will be thrown at runtime. If the given schema is not pyspark.sql.types.StructType, it will be wrapped into a pyspark.sql.types.StructType as its only field, and the field name will be “value”. Each record will also be wrapped into a tuple, which can be converted to row later.
pyspark.sql.types.DataType
pyspark.sql.types.StructType
If schema inference is needed, samplingRatio is used to determined the ratio of rows used for schema inference. The first row will be used if samplingRatio is None.
samplingRatio
data – an RDD of any kind of SQL data representation (e.g. row, tuple, int, boolean, etc.), list, or pandas.DataFrame.
list
schema – a pyspark.sql.types.DataType or a datatype string or a list of column names, default is None. The data type string format equals to pyspark.sql.types.DataType.simpleString, except that top level struct type can omit the struct<> and atomic types use typeName() as their format, e.g. use byte instead of tinyint for pyspark.sql.types.ByteType. We can also use int as a short name for IntegerType.
pyspark.sql.types.DataType.simpleString
struct<>
typeName()
byte
tinyint
pyspark.sql.types.ByteType
int
IntegerType
samplingRatio – the sample ratio of rows used for inferring
verifySchema – verify data types of every row against schema.
Changed in version 2.1: Added verifySchema.
Note
Usage with spark.sql.execution.arrow.pyspark.enabled=True is experimental.
When Arrow optimization is enabled, strings inside Pandas DataFrame in Python 2 are converted into bytes as they are bytes in Python 2 whereas regular strings are left as strings. When using strings in Python 2, use unicode u”“ as Python standard practice.
>>> l = [('Alice', 1)] >>> spark.createDataFrame(l).collect() [Row(_1='Alice', _2=1)] >>> spark.createDataFrame(l, ['name', 'age']).collect() [Row(name='Alice', age=1)]
>>> d = [{'name': 'Alice', 'age': 1}] >>> spark.createDataFrame(d).collect() [Row(age=1, name='Alice')]
>>> rdd = sc.parallelize(l) >>> spark.createDataFrame(rdd).collect() [Row(_1='Alice', _2=1)] >>> df = spark.createDataFrame(rdd, ['name', 'age']) >>> df.collect() [Row(name='Alice', age=1)]
>>> from pyspark.sql import Row >>> Person = Row('name', 'age') >>> person = rdd.map(lambda r: Person(*r)) >>> df2 = spark.createDataFrame(person) >>> df2.collect() [Row(name='Alice', age=1)]
>>> from pyspark.sql.types import * >>> schema = StructType([ ... StructField("name", StringType(), True), ... StructField("age", IntegerType(), True)]) >>> df3 = spark.createDataFrame(rdd, schema) >>> df3.collect() [Row(name='Alice', age=1)]
>>> spark.createDataFrame(df.toPandas()).collect() [Row(name='Alice', age=1)] >>> spark.createDataFrame(pandas.DataFrame([[1, 2]])).collect() [Row(0=1, 1=2)]
>>> spark.createDataFrame(rdd, "a: string, b: int").collect() [Row(a='Alice', b=1)] >>> rdd = rdd.map(lambda row: row[1]) >>> spark.createDataFrame(rdd, "int").collect() [Row(value=1)] >>> spark.createDataFrame(rdd, "boolean").collect() Traceback (most recent call last): ... Py4JJavaError: ...
New in version 2.0.