[SPARK-18857][SQL] Don't use Iterator.duplicate for incrementalCollect in Thrift Server (#16440)
Changes from 1 commit
```diff
@@ -50,8 +50,8 @@ private[hive] class SparkExecuteStatementOperation(
   with Logging {

   private var result: DataFrame = _
+  private var resultList: Option[Array[org.apache.spark.sql.Row]] = _
   private var iter: Iterator[SparkRow] = _
-  private var iterHeader: Iterator[SparkRow] = _
   private var dataTypes: Array[DataType] = _
   private var statementId: String = _
```
```diff
@@ -103,6 +103,10 @@ private[hive] class SparkExecuteStatementOperation(
     }
   }

+  private def useIncrementalCollect: Boolean = {
+    sqlContext.getConf("spark.sql.thriftServer.incrementalCollect", "false").toBoolean
+  }
+
   def getNextRowSet(order: FetchOrientation, maxRowsL: Long): RowSet = {
     validateDefaultFetchOrientation(order)
     assertState(OperationState.FINISHED)
```
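(Aside: for context, a minimal sketch of how a session might enable the incremental mode this helper reads; the `sqlContext` in scope and the `setConf` call are standard Spark 2.x API, but this snippet is illustrative, not part of the PR.)

```scala
// Sketch: turning on incremental collect so the Thrift Server streams results
// with toLocalIterator instead of materializing them all with collect().
sqlContext.setConf("spark.sql.thriftServer.incrementalCollect", "true")
```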
```diff
@@ -111,9 +115,15 @@ private[hive] class SparkExecuteStatementOperation(

     // Reset iter to header when fetching start from first row
     if (order.equals(FetchOrientation.FETCH_FIRST)) {
-      val (ita, itb) = iterHeader.duplicate
-      iter = ita
-      iterHeader = itb
+      iter = if (useIncrementalCollect) {
+        resultList = None
+        result.toLocalIterator.asScala
+      } else {
+        if (resultList.isEmpty) {
```
Member: I agree that this makes the implicit buffering explicit. So, if an iterator is duplicated into A and B, and all of A is consumed, then B will internally buffer everything from A so it can be replayed? And in our case we know that A will be entirely consumed? Then these are basically the same, yes. But does that solve the problem? This now always stores the whole result set locally. Is this avoiding a second whole copy of it? What if you always just return `result.collect().iterator` here? Is the problem the re-collecting of the result every time?
Member (Author): Yes, that is what happens with `Iterator.duplicate`. And the whole-result storing happens at line 122 and lines 245-246.
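(Aside: the buffering behavior under discussion is easy to see in plain Scala, no Spark required; a minimal sketch:)

```scala
// Iterator.duplicate hands back two iterators over one source. Once A is fully
// consumed, B serves every element from an internal buffer, so the entire
// sequence ends up held in memory anyway.
val (a, b) = Iterator.range(0, 5).duplicate
println(a.toList) // List(0, 1, 2, 3, 4): consuming A forces buffering for B
println(b.toList) // List(0, 1, 2, 3, 4): replayed from the internal buffer
```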
Member: I suppose I'm asking why this is an improvement, because in the new version you also buffer the whole result into memory locally.
Member (Author): Correct. There are two cases, and this PR targets … First of all, before SPARK-16563, …
Member: OK, I believe I get it now. I see your approach and it makes sense. I'm just wondering whether it's actually simpler to avoid keeping a reference to the whole data set, or whether that would defeat the purpose.
Member (Author): When … However, if users execute a query with …
```diff
+          resultList = Some(result.collect())
+        }
+        resultList.get.iterator
+      }
     }

     if (!iter.hasNext) {
```
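(Aside: to make the trade-off discussed above concrete, here is a standalone sketch, plain Scala with no Spark; `runQuery` is a hypothetical stand-in for executing the statement. It contrasts how a FETCH_FIRST rewind behaves on the two paths.)

```scala
object FetchFirstSketch extends App {
  var executions = 0
  def runQuery(): Iterator[Int] = { executions += 1; Iterator(1, 2, 3) }

  // Incremental path: nothing is cached, so rewinding to the first row
  // means executing the query again.
  var iter: Iterator[Int] = runQuery()
  while (iter.hasNext) iter.next()          // client consumes everything
  iter = runQuery()                         // FETCH_FIRST => second execution
  println(s"incremental: ${iter.toList}, executions = $executions") // executions = 2

  // Cached path: collect once into an array; rewinding is just asking
  // the array for a fresh iterator.
  executions = 0
  val cached: Array[Int] = runQuery().toArray // the one and only execution
  iter = cached.iterator
  while (iter.hasNext) iter.next()
  iter = cached.iterator                    // FETCH_FIRST => served from cache
  println(s"cached: ${iter.toList}, executions = $executions") // executions = 1
}
```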
|
|
```diff
@@ -227,17 +237,14 @@ private[hive] class SparkExecuteStatementOperation(
       }
       HiveThriftServer2.listener.onStatementParsed(statementId, result.queryExecution.toString())
       iter = {
-        val useIncrementalCollect =
-          sqlContext.getConf("spark.sql.thriftServer.incrementalCollect", "false").toBoolean
         if (useIncrementalCollect) {
+          resultList = None
           result.toLocalIterator.asScala
         } else {
-          result.collect().iterator
+          resultList = Some(result.collect())
+          resultList.get.iterator
         }
       }
-      val (itra, itrb) = iter.duplicate
-      iterHeader = itra
-      iter = itrb
       dataTypes = result.queryExecution.analyzed.output.map(_.dataType).toArray
     } catch {
       case e: HiveSQLException =>
```
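(Aside: for readers comparing the two collection strategies in the hunk above, a rough spark-shell sketch; it assumes a `SparkSession` named `spark` and the Scala 2.11-era `JavaConverters` used in this codebase.)

```scala
import scala.collection.JavaConverters._

val df = spark.range(0, 1000000).toDF("id")

// collect() materializes every row on the driver at once: rewinding is cheap
// afterwards, but the full result set must fit in driver memory.
val rows = df.collect()

// toLocalIterator streams one partition at a time to the driver, bounding
// memory use; iterating again requires asking for a new local iterator.
val it = df.toLocalIterator.asScala
println(it.take(5).toList)
```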
Member: Write `SparkRow` for consistency? And init to `None` explicitly?

Member (Author): Thank you for the review, @srowen. Sure!
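(Aside: after applying both suggestions, the field would presumably read as below, using the file's existing `SparkRow` alias for `org.apache.spark.sql.Row`.)

```scala
// resultList declared with the SparkRow alias and an explicit None initializer.
private var resultList: Option[Array[SparkRow]] = None
```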