[SPARK-24324][PYTHON] Pandas Grouped Map UDF should assign result columns by name #21427
Changes from 12 commits
File: SQLConf.scala

```
@@ -1161,6 +1161,16 @@ object SQLConf {
    .booleanConf
    .createWithDefault(true)

  val PANDAS_GROUPED_MAP_ASSIGN_COLUMNS_BY_POSITION =
    buildConf("spark.sql.execution.pandas.groupedMap.assignColumnsByPosition")
      .internal()
      .doc("When true, a grouped map Pandas UDF will assign columns from the returned " +
        "Pandas DataFrame based on position, regardless of column label type. When false, " +
        "columns will be looked up by name if labeled with a string and fall back to " +
        "position if not.")
      .booleanConf
      .createWithDefault(false)

  val REPLACE_EXCEPT_WITH_FILTER = buildConf("spark.sql.optimizer.replaceExceptWithFilter")
    .internal()
    .doc("When true, the apply function of the rule verifies whether the right node of the" +
```
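To ground what the two modes mean, here is a minimal pandas-only sketch of the difference; the variable names and data are illustrative, not from the PR:

```python
import pandas as pd

# Schema declared for the UDF's return value, e.g. "id long, value double".
expected_columns = ["id", "value"]

# A UDF body might build its result with columns in a different order.
result = pd.DataFrame({"value": [1.5, 2.5], "id": [1, 2]})

# assignColumnsByPosition=true: trust positional order, so the "value"
# data would be written into the "id" slot and vice versa.
by_position = [result.iloc[:, i] for i in range(len(expected_columns))]

# assignColumnsByPosition=false (the new default): look columns up by
# name when the labels are strings, so "id" matches "id".
by_name = [result[name] for name in expected_columns]
```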
```
@@ -1647,6 +1657,9 @@ class SQLConf extends Serializable with Logging {

  def pandasRespectSessionTimeZone: Boolean = getConf(PANDAS_RESPECT_SESSION_LOCAL_TIMEZONE)

  def pandasGroupedMapAssignColumnsByPosition: Boolean =
    getConf(SQLConf.PANDAS_GROUPED_MAP_ASSIGN_COLUMNS_BY_POSITION)

  def replaceExceptWithFilter: Boolean = getConf(REPLACE_EXCEPT_WITH_FILTER)

  def decimalOperationsAllowPrecisionLoss: Boolean = getConf(DECIMAL_OPERATIONS_ALLOW_PREC_LOSS)
```
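As an end-to-end illustration of the behavior this getter gates, a sketch against the Spark 2.x grouped map API; it assumes an active `spark` session, and the data and UDF are made up:

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def center(pdf):
    # Columns are returned as (v, id), not in declared schema order.
    # With by-name assignment (the new default) they still land in the
    # right output columns; with assignColumnsByPosition=true they would
    # be matched purely by position.
    pdf = pdf.assign(v=pdf.v - pdf.v.mean())
    return pdf[["v", "id"]]

df.groupby("id").apply(center).show()
```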
File: ArrowUtils.scala

```
@@ -23,6 +23,7 @@
import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.types.{DateUnit, FloatingPointPrecision, TimeUnit}
import org.apache.arrow.vector.types.pojo.{ArrowType, Field, FieldType, Schema}

import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types._

object ArrowUtils {

@@ -120,4 +121,19 @@ object ArrowUtils {
      StructField(field.getName, dt, field.isNullable)
    })
  }

  /** Return Map with conf settings to be used in ArrowPythonRunner */
  def getPythonRunnerConfMap(conf: SQLConf): Map[String, String] = {
```
Contributor: Edit: Actually, nvm
```
    val timeZoneConf = if (conf.pandasRespectSessionTimeZone) {
      Seq(SQLConf.SESSION_LOCAL_TIMEZONE.key -> conf.sessionLocalTimeZone)
    } else {
      Nil
    }
    val pandasColsByPosition = if (conf.pandasGroupedMapAssignColumnsByPosition) {
```
Contributor: Can we do:

Member (Author): I think it's better to just omit the config for the default case; that way it's easier to process in the worker.

Contributor: I'm sorry, can you explain why it's easier to process in the worker? I think we just need to remove the default value here? Also, one thing that's not great about omitting the conf for the default case is that you need to put the default value in two places (both Python and Java).
```
      Seq(SQLConf.PANDAS_GROUPED_MAP_ASSIGN_COLUMNS_BY_POSITION.key -> "true")
    } else {
      Nil
    }
    Map(timeZoneConf ++ pandasColsByPosition: _*)
  }
}
```
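On the Python worker side, these entries presumably arrive as string key/value pairs; a hypothetical sketch of consuming them (the helper name and `runner_conf` dict are illustrative; the fallback mirrors the Scala `createWithDefault(false)`):

```python
def assign_cols_by_position(runner_conf):
    # Because the Scala side omits the key for the default case, a
    # missing entry means "assign by name" (i.e. the false default).
    key = "spark.sql.execution.pandas.groupedMap.assignColumnsByPosition"
    return runner_conf.get(key, "false").lower() == "true"

assert assign_cols_by_position({}) is False
assert assign_cols_by_position(
    {"spark.sql.execution.pandas.groupedMap.assignColumnsByPosition": "true"}) is True
```

This is the trade-off debated in the thread above: omitting the default keeps the payload minimal, at the cost of encoding the default value on both the JVM and Python sides.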
Comment: I think we want to be a little more careful here; for example, a KeyError in to_arrow_type could lead to unexpected behavior. How about something like this:

Reply: This seems OK to me since it's basically the same, but I don't think we need to worry about to_arrow_type throwing a KeyError. Is there any particular reason you suggested handling position like this? To me it seems better to look up by column labels, as it is currently.

Reply: I think result.iloc[:, i] and result[result.columns[i]] are the same; you don't have to change it if you prefer result.columns[i]. I agree that to_arrow_type doesn't throw KeyError, but in general I feel it's more robust not to assume the implementation details of to_arrow_type. I think the code is more concise and readable with if/else too (compared to except KeyError).
to_arrow_typedoesn't throwKeyError, but in general I feel it's more robust not to assume the implementation detail ofto_arrow_type. I think the code is more concise and readable with if/else too (comparing to except KeyError)