pyspark.sql.streaming.DataStreamWriter.start

DataStreamWriter.start(path=None, format=None, outputMode=None, partitionBy=None, queryName=None, **options)[source]

Streams the contents of the DataFrame to a data source.

The data source is specified by the format and a set of options. If format is not specified, the default data source configured by spark.sql.sources.default will be used.
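
The configured default can be checked through the session configuration. A minimal sketch, assuming an active SparkSession named spark:

>>> spark.conf.get('spark.sql.sources.default')  # 'parquet' unless overridden
'parquet'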

Note: Evolving.

Parameters
  • path – the path in a Hadoop-supported file system

  • format – the format used to save

  • outputMode – specifies how data of a streaming DataFrame/Dataset is written to a streaming sink.

    • append: Only the new rows in the streaming DataFrame/Dataset will be written to the sink

    • complete: All the rows in the streaming DataFrame/Dataset will be written to the sink every time there are updates.

    • update: Only the rows that were updated in the streaming DataFrame/Dataset will be written to the sink every time there are updates. If the query doesn’t contain aggregations, it will be equivalent to append mode.

  • partitionBy – names of partitioning columns

  • queryName – unique name for the query

  • options – All other string options. You may want to provide a checkpointLocation for most streams; however, it is not required for a memory sink (see the sketch after this list).
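
For durable sinks such as files, a checkpoint location is usually supplied through options. A minimal sketch, assuming sdf is a streaming DataFrame as in the examples below, with hypothetical /tmp paths:

>>> sq = (sdf.writeStream.format('parquet')
...       .option('checkpointLocation', '/tmp/ckpt/file_query')  # hypothetical path
...       .start(path='/tmp/out/file_query'))  # hypothetical path
>>> sq.stop()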

>>> sq = sdf.writeStream.format('memory').queryName('this_query').start()
>>> sq.isActive
True
>>> sq.name
'this_query'
>>> sq.stop()
>>> sq.isActive
False
>>> sq = sdf.writeStream.trigger(processingTime='5 seconds').start(
...     queryName='that_query', outputMode="append", format='memory')
>>> sq.name
'that_query'
>>> sq.isActive
True
>>> sq.stop()
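
The complete and update modes generally require an aggregation in the query. A minimal sketch, assuming sdf has a column named value (hypothetical schema):

>>> agg = sdf.groupBy('value').count()
>>> sq = agg.writeStream.outputMode('complete').format('memory').queryName(
...     'agg_query').start()
>>> sq.isActive
True
>>> sq.stop()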

New in version 2.0.