14 changes: 1 addition & 13 deletions dev/sparktestsupport/modules.py
@@ -389,19 +389,7 @@ def __hash__(self):
"python/pyspark/sql"
],
python_test_goals=[
"pyspark.sql.types",
Contributor Author (@tdas), May 31, 2018:
This is only to speed up local testing. Will remove this.

"pyspark.sql.context",
"pyspark.sql.session",
"pyspark.sql.conf",
"pyspark.sql.catalog",
"pyspark.sql.column",
"pyspark.sql.dataframe",
"pyspark.sql.group",
"pyspark.sql.functions",
"pyspark.sql.readwriter",
"pyspark.sql.streaming",
"pyspark.sql.udf",
"pyspark.sql.window",

"pyspark.sql.tests",
]
)
164 changes: 164 additions & 0 deletions python/pyspark/sql/streaming.py
@@ -30,6 +30,7 @@
from pyspark.sql.readwriter import OptionUtils, to_str
from pyspark.sql.types import *
from pyspark.sql.utils import StreamingQueryException
from abc import ABCMeta, abstractmethod
Member:
@tdas, Seems not used.

Contributor Author:
done.


__all__ = ["StreamingQuery", "StreamingQueryManager", "DataStreamReader", "DataStreamWriter"]

@@ -843,6 +844,169 @@ def trigger(self, processingTime=None, once=None, continuous=None):
self._jwrite = self._jwrite.trigger(jTrigger)
return self

def foreach(self, f):
"""
Sets the output of the streaming query to be processed using the provided writer ``f``.
This is often used to write the output of a streaming query to arbitrary storage systems.
The processing logic can be specified in two ways.

#. A **function** that takes a row as input.
Member:
Seems this variant is specific to Python. I thought we should better match how we support it on the Scala side.

Contributor Author:
This is a superset of what we support in Scala. Python users are more likely to use simple lambdas instead of defining classes. But they may also want to write transactional stuff in Python with open and close methods. Hence, providing both alternatives seems to be a good idea.

Member:
(including the response to #21477 (comment)) I kind of agree that it's an okay idea, but I think we usually provide consistent API support unless something is language specific (for example, ContextManager or decorators in Python, etc.).

Just for clarification, does the Scala side support a function-only variant too?

Also, I know the attribute-checking way is more of a "Pythonic" way, but I am seeing the documentation has already diverged between Scala and Python. It costs maintenance overhead on the other hand.

Member (@HyukjinKwon), Jun 7, 2018:
I mean, we could maybe consider the other ways, but wouldn't it be better to have the consistent support as the primary, and then see if the other ways are really requested by users? I think we could still incrementally add the attribute-checking way or the lambda (or function, to be more correct) way later (but we can't do it the other way around).

Contributor Author:
Python APIs anyway have slight divergences from Scala/Java APIs in order to provide a better experience for Python users. For example, StreamingQuery.lastProgress returns an object of type StreamingQueryProgress in Java/Scala but returns a dict in Python, because Python users are more used to dealing with dicts, and Java/Scala users (typed languages) are more comfortable with structures. Similarly, in DataFrame.select, you can refer to columns in Scala using $"columnName", but in Python you can refer to them as df.columnName. If we strictly adhere to pure consistency, then we cannot make it convenient for users in different languages. And ultimately convenience is what matters for the user experience. So it's okay to have a superset of supported types in Python compared to Java/Scala.

Personally, I think we should also add the lambda variant to Scala as well. But that decision for Scala is independent of this PR, as there is enough justification for adding the lambda variant for Python.

Member (@HyukjinKwon), Jun 8, 2018:
I believe $"columnName" is more of a language-specific feature in Scala, and I think df.columnName is language-specific to Python.

> And ultimately convenience is what matters for the user experience.

Thing is, it sounded to me like we are kind of prejudging it. We can't revert it back easily once we go this way.

> I think we should also add the lambda variant to Scala as well.

+1

I am okay, but I hope this won't usually be done next time.

This is a simple way to express your processing logic. Note that this does
not allow you to deduplicate generated data when failures cause reprocessing of
some input data. That would require you to specify the processing logic in the next
way.

#. An **object** with a ``process`` method and optional ``open`` and ``close`` methods.
Member:
@tdas, wouldn't it be better to just have a ForeachWriter class?

Contributor Author:
I discussed this with @marmbrus. If there is a ForeachWriter class in Python, then users will have to additionally import it. That's just another overhead that can be avoided by allowing any class with the appropriate methods. One less step for Python users.

The object can have the following methods.

* ``open(partition_id, epoch_id)``: *Optional* method that initializes the processing
(for example, open a connection, start a transaction, etc). Additionally, you can
use the `partition_id` and `epoch_id` to deduplicate regenerated data
(discussed later).

* ``process(row)``: *Non-optional* method that processes each :class:`Row`.

* ``close(error)``: *Optional* method that finalizes and cleans up (for example,
close connection, commit transaction, etc.) after all rows have been processed.

The object will be used by Spark in the following way.

* A single copy of this object is responsible for all the data generated by a
single task in a query. In other words, one instance is responsible for
processing one partition of the data generated in a distributed manner.

* This object must be serializable because each task will get a fresh
serialized-deserializedcopy of the provided object. Hence, it is strongly
Contributor:
nit: deserialized copy (space)

Contributor Author:
done

recommended that any initialization for writing data (e.g. opening a
Contributor:
> any initialization for writing data (e.g. opening a connection or starting a transaction) be done open after the open(...) method has been called

"be done open" seems a bit odd. If we can polish the sentence it would be better.

Contributor Author:
done

connection or starting a transaction) be done open after the `open(...)`
method has been called, which signifies that the task is ready to generate data.

* The lifecycle of the methods are as follows.

For each partition with ``partition_id``:

... For each batch/epoch of streaming data with ``epoch_id``:

....... Method ``open(partitionId, epochId)`` is called.

....... If ``open(...)`` returns true, for each row in the partition and
batch/epoch, method ``process(row)`` is called.

....... Method ``close(errorOrNull)`` is called with error (if any) seen while
processing rows.

Important points to note:

* The `partitionId` and `epochId` can be used to deduplicate generated data when
failures cause reprocessing of some input data. This depends on the execution
mode of the query. If the streaming query is being executed in the micro-batch
mode, then every partition represented by a unique tuple (partition_id, epoch_id)
is guaranteed to have the same data. Hence, (partition_id, epoch_id) can be used
to deduplicate and/or transactionally commit data and achieve exactly-once
guarantees. However, if the streaming query is being executed in the continuous
mode, then this guarantee does not hold and therefore should not be used for
deduplication.

* The ``close()`` method (if exists) is will be called if `open()` method exists and
Member:
nit: remove is

Contributor Author:
done

returns successfully (irrespective of the return value), except if the Python
crashes in the middle.

.. note:: Evolving.

>>> # Print every row using a function
>>> writer = sdf.writeStream.foreach(lambda x: print(x))
>>> # Print every row using an object with process() method
>>> class RowPrinter:
... def open(self, partition_id, epoch_id):
... print("Opened %d, %d" % (partition_id, epoch_id))
... return True
... def process(self, row):
... print(row)
... def close(self, error):
... print("Closed with error: %s" % str(error))
...
>>> writer = sdf.writeStream.foreach(RowPrinter())
"""

from pyspark.rdd import _wrap_function
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
from pyspark.taskcontext import TaskContext

if callable(f):
"""
The provided object is a callable function that is supposed to be called on each row.
Construct a function that takes an iterator and calls the provided function on each row.
"""
Member:
I believe we should better turn this into comment form:

# The ...
# Construct ...

def func_without_process(_, iterator):
for x in iterator:
f(x)
return iter([])

func = func_without_process

else:
"""
The provided object is not a callable function. Then it is expected to have a
'process(row)' method, and optional 'open(partition_id, epoch_id)' and
'close(error)' methods.
"""
Member:
ditto


if not hasattr(f, 'process'):
raise Exception(
"Provided object is neither callable nor does it have a 'process' method")

if not callable(getattr(f, 'process')):
raise Exception("Attribute 'process' in provided object is not callable")

open_exists = False
Member:
ditto for

if open_exists

Contributor:
We might be able to extract this code block into a function, since the logic for checking open and close is exactly the same.
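A minimal sketch of the extraction the Contributor suggests (hypothetical helper name _check_optional_method, not part of this diff):

def _check_optional_method(obj, name):
    # Returns True if `name` exists on `obj`; raises if it exists but is not callable.
    if not hasattr(obj, name):
        return False
    if not callable(getattr(obj, name)):
        raise Exception("Attribute '%s' in provided object is not callable" % name)
    return True

open_exists = _check_optional_method(f, 'open')
close_exists = _check_optional_method(f, 'close')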

if hasattr(f, 'open'):
if not callable(getattr(f, 'open')):
raise Exception("Attribute 'open' in provided object is not callable")
else:
open_exists = True

close_exists = False
Member:
maybe

close_exists = hasattr(f, "close")
if close_exists:
    ...

if hasattr(f, "close"):
Member:
Also, for completeness, we should check the argument specification too ... although it's a bit too much. I felt like I had to mention it at least.

Contributor Author:
It's all dynamically typed, right? I don't think there is a good way to check whether the function accepts the right argument types without actually calling the function (please correct me if I am wrong). And this is a standard problem in Python that everybody using Python is used to (i.e. runtime exceptions when using incorrectly typed parameters). Also, no other operations that take lambdas check the types and counts. So I think we should just be consistent and let that be.

Member (@HyukjinKwon), Jun 8, 2018:
We can check the argspec at least, I believe. But, yup, I am fine with it as is.
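For reference, the argspec check mentioned above could look roughly like the sketch below, using the standard inspect module (illustrative only; as discussed, the PR intentionally leaves argument counts unchecked):

import inspect

def _accepts_n_args(method, n):
    # Rough arity check for a bound method; ignores defaults and keyword-only arguments.
    spec = inspect.getfullargspec(method)  # inspect.getargspec on Python 2
    positional = [a for a in spec.args if a != 'self']
    return len(positional) == n or spec.varargs is not None

# e.g. _accepts_n_args(f.open, 2), _accepts_n_args(f.process, 1), _accepts_n_args(f.close, 1)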

if not callable(getattr(f, 'close')):
raise Exception("Attribute 'close' in provided object is not callable")
else:
close_exists = True

def func_with_open_process_close(partition_id, iterator):
epoch_id = TaskContext.get().getLocalProperty('streaming.sql.batchId')
if epoch_id:
epoch_id = int(epoch_id)
else:
raise Exception("Could not get batch id from TaskContext")

should_process = True
if open_exists:
should_process = f.open(partition_id, epoch_id)

def call_close_if_needed(error):
if open_exists and close_exists:
f.close(error)
try:
if should_process:
for x in iterator:
f.process(x)
except Exception as ex:
call_close_if_needed(ex)
Member:
Should it be put in finally?

Contributor:
Putting this in finally sounds like a better pattern to me too, and then we may be able to inline call_close_if_needed.

Contributor Author:
That is tricky because I have to pass the exception and rethrow the error, if any. Alternative code is:

err = None
...
except Exception as ex:
    err = ex
finally:
    call_close(err)
    if err is not None:
        raise err

Member (@HyukjinKwon), Jun 7, 2018:
Ah, sure. I rushed to read. Seems fine for now.

Contributor:
Either is fine for me.
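Spelled out in full, the finally-based variant discussed above might look like the sketch below (illustrative only; it reuses f, open_exists, close_exists and TaskContext from the surrounding code, and the PR keeps the explicit except/raise form instead):

def func_with_open_process_close(partition_id, iterator):
    # Presence check for the batch id is omitted here for brevity.
    epoch_id = int(TaskContext.get().getLocalProperty('streaming.sql.batchId'))

    should_process = True
    if open_exists:
        should_process = f.open(partition_id, epoch_id)

    error = None
    try:
        if should_process:
            for x in iterator:
                f.process(x)
    except Exception as ex:
        error = ex
    finally:
        if open_exists and close_exists:
            f.close(error)
    if error is not None:
        raise error
    return iter([])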

raise ex

call_close_if_needed(None)
return iter([])

func = func_with_open_process_close

serializer = AutoBatchedSerializer(PickleSerializer())
wrapped_func = _wrap_function(self._spark._sc, func, serializer, serializer)
jForeachWriter = \
self._spark._sc._jvm.org.apache.spark.sql.execution.python.PythonForeachWriter(
wrapped_func, self._df._jdf.schema())
self._jwrite.foreach(jForeachWriter)
return self

@ignore_unicode_prefix
@since(2.0)
def start(self, path=None, format=None, outputMode=None, partitionBy=None, queryName=None,
160 changes: 159 additions & 1 deletion python/pyspark/sql/tests.py
@@ -296,6 +296,7 @@ def tearDown(self):
# tear down test_bucketed_write state
self.spark.sql("DROP TABLE IF EXISTS pyspark_bucket")

'''
Contributor Author (@tdas), May 31, 2018:
This is only to speed up local testing. Will remove this.

def test_row_should_be_read_only(self):
row = Row(a=1, b=2)
self.assertEqual(1, row.a)
@@ -1884,7 +1885,164 @@ def test_query_manager_await_termination(self):
finally:
q.stop()
shutil.rmtree(tmpPath)
'''

class ForeachWriterTester:

def __init__(self, spark):
self.spark = spark
self.input_dir = tempfile.mkdtemp()
self.open_events_dir = tempfile.mkdtemp()
self.process_events_dir = tempfile.mkdtemp()
self.close_events_dir = tempfile.mkdtemp()

def write_open_event(self, partitionId, epochId):
self._write_event(
self.open_events_dir,
{'partition': partitionId, 'epoch': epochId})

def write_process_event(self, row):
self._write_event(self.process_events_dir, {'value': 'text'})

def write_close_event(self, error):
self._write_event(self.close_events_dir, {'error': str(error)})

def write_input_file(self):
self._write_event(self.input_dir, "text")

def open_events(self):
return self._read_events(self.open_events_dir, 'partition INT, epoch INT')

def process_events(self):
return self._read_events(self.process_events_dir, 'value STRING')

def close_events(self):
return self._read_events(self.close_events_dir, 'error STRING')

def run_streaming_query_on_writer(self, writer, num_files):
try:
sdf = self.spark.readStream.format('text').load(self.input_dir)
sq = sdf.writeStream.foreach(writer).start()
for i in range(num_files):
self.write_input_file()
sq.processAllAvailable()
sq.stop()
finally:
self.stop_all()

def _read_events(self, dir, json):
rows = self.spark.read.schema(json).json(dir).collect()
dicts = [row.asDict() for row in rows]
return dicts

def _write_event(self, dir, event):
import random
file = open(os.path.join(dir, str(random.randint(0, 100000))), 'w')
Member:
nit: file shadows the builtin function file ..

Contributor:
It might be more convenient to use a with statement, and to rename file to f or fw or so. Please ignore if there's a specific reason not to use a with statement.

Member:
Why not use a uuid in the file name?

file.write("%s\n" % str(event))
file.close()
Member:
should it be put in finally too?
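Taken together, the suggestions above might look like the sketch below (illustrative only): renaming file to fw avoids shadowing the builtin, a uuid-based name avoids collisions, and the with statement addresses the finally concern by closing the file even if the write fails.

import os
import uuid

def _write_event(self, dir, event):
    # One uniquely named file per event; the with block guarantees the file is closed.
    with open(os.path.join(dir, str(uuid.uuid4())), 'w') as fw:
        fw.write("%s\n" % str(event))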


def stop_all(self):
for q in self.spark._wrapped.streams.active:
q.stop()

def __getstate__(self):
return (self.open_events_dir, self.process_events_dir, self.close_events_dir)

def __setstate__(self, state):
self.open_events_dir, self.process_events_dir, self.close_events_dir = state

def test_streaming_foreach_with_simple_function(self):
tester = self.ForeachWriterTester(self.spark)

def foreach_func(row):
tester.write_process_event(row)

tester.run_streaming_query_on_writer(foreach_func, 2)
self.assertEqual(len(tester.process_events()), 2)

def test_streaming_foreach_with_basic_open_process_close(self):
tester = self.ForeachWriterTester(self.spark)

class ForeachWriter:
def open(self, partitionId, epochId):
tester.write_open_event(partitionId, epochId)
return True

def process(self, row):
tester.write_process_event(row)

def close(self, error):
tester.write_close_event(error)

tester.run_streaming_query_on_writer(ForeachWriter(), 2)

open_events = tester.open_events()
self.assertEqual(len(open_events), 2)
self.assertSetEqual(set([e['epoch'] for e in open_events]), {0, 1})

self.assertEqual(len(tester.process_events()), 2)

close_events = tester.close_events()
self.assertEqual(len(close_events), 2)
self.assertSetEqual(set([e['error'] for e in close_events]), {'None'})

def test_streaming_foreach_with_open_returning_false(self):
tester = self.ForeachWriterTester(self.spark)

class ForeachWriter:
def open(self, partitionId, epochId):
tester.write_open_event(partitionId, epochId)
return False

def process(self, row):
tester.write_process_event(row)

def close(self, error):
tester.write_close_event(error)

tester.run_streaming_query_on_writer(ForeachWriter(), 2)

self.assertEqual(len(tester.open_events()), 2)
self.assertEqual(len(tester.process_events()), 0) # no row was processed
Member:
two spaces before #.

Contributor Author:
Is that a PEP 8 rule or Spark style?

Member:
PEP 8 asks for (at least) 2 spaces before inline comments (https://www.python.org/dev/peps/pep-0008/#inline-comments). So the previous style is totally fine, but I have been doing this with fewer spaces since I find there's no point in adding more spaces unless it looks prettier or follows a format.

(This was just a tiny nit to me. I am okay with such comments of mine being ignored.)

close_events = tester.close_events()
self.assertEqual(len(close_events), 2)
self.assertSetEqual(set([e['error'] for e in close_events]), {'None'})

def test_streaming_foreach_with_process_throwing_error(self):
from pyspark.sql.utils import StreamingQueryException

tester = self.ForeachWriterTester(self.spark)

class ForeachWriter:
def open(self, partitionId, epochId):
tester.write_open_event(partitionId, epochId)
return True

def process(self, row):
raise Exception("test error")

def close(self, error):
tester.write_close_event(error)

try:
sdf = self.spark.readStream.format('text').load(tester.input_dir)
sq = sdf.writeStream.foreach(ForeachWriter()).start()
tester.write_input_file()
sq.processAllAvailable()
self.fail("bad writer should fail the query") # this is not expected
except StreamingQueryException as e:
# self.assertTrue("test error" in e.desc) # this is expected
Member:
Eh, shall we uncomment this?

Contributor Author:
This cannot be done because somehow e.desc does not have the original underlying error when it's a StreamingQueryException. I think this is a separate bug that needs to be fixed independently of this PR.

pass
finally:
tester.stop_all()

self.assertEqual(len(tester.open_events()), 1)
self.assertEqual(len(tester.process_events()), 0) # no row was processed
close_events = tester.close_events()
self.assertEqual(len(close_events), 1)
# self.assertTrue("test error" in e[0]['error'])
Member:
ditto


'''
Contributor:
Leaving a marker as well: "This is only to speed up local testing. Will remove this."

Contributor Author:
Yes, I will after I finish adding all the tests.

def test_help_command(self):
# Regression test for SPARK-5464
rdd = self.sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}'])
@@ -5391,7 +5549,7 @@ def test_invalid_args(self):
AnalysisException,
'mixture.*aggregate function.*group aggregate pandas UDF'):
df.groupby(df.id).agg(mean_udf(df.v), mean(df.v)).collect()

'''
if __name__ == "__main__":
from pyspark.sql.tests import *
if xmlrunner:
Expand Down
4 changes: 2 additions & 2 deletions python/pyspark/tests.py
@@ -581,9 +581,9 @@ def test_get_local_property(self):
self.sc.setLocalProperty(key, value)
try:
rdd = self.sc.parallelize(range(1), 1)
prop1 = rdd.map(lambda x: TaskContext.get().getLocalProperty(key)).collect()[0]
prop1 = rdd.map(lambda _: TaskContext.get().getLocalProperty(key)).collect()[0]
self.assertEqual(prop1, value)
prop2 = rdd.map(lambda x: TaskContext.get().getLocalProperty("otherkey")).collect()[0]
prop2 = rdd.map(lambda _: TaskContext.get().getLocalProperty("otherkey")).collect()[0]
self.assertTrue(prop2 is None)
finally:
self.sc.setLocalProperty(key, None)