Commit 032d179
[SPARK-31945][SQL][PYSPARK] Enable cache for the same Python function
### What changes were proposed in this pull request?
This PR proposes to make `PythonFunction` holds `Seq[Byte]` instead of `Array[Byte]` to be able to compare if the byte array has the same values for the cache manager.
### Why are the changes needed?
Currently the cache manager doesn't use the cache for `udf` if the `udf` is created again even if the functions is the same.
```py
>>> func = lambda x: x
>>> df = spark.range(1)
>>> df.select(udf(func)("id")).cache()
```
```py
>>> df.select(udf(func)("id")).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#14 AS <lambda>(id)#12]
+- BatchEvalPython [<lambda>(id#0L)], [pythonUDF0#14]
+- *(1) Range (0, 1, step=1, splits=12)
```
This is because `PythonFunction` holds `Array[Byte]`, and `equals` method of array equals only when the both array is the same instance.
### Does this PR introduce _any_ user-facing change?
Yes, if the user reuse the Python function for the UDF, the cache manager will detect the same function and use the cache for it.
### How was this patch tested?
I added a test case and manually.
```py
>>> df.select(udf(func)("id")).explain()
== Physical Plan ==
InMemoryTableScan [<lambda>(id)#12]
+- InMemoryRelation [<lambda>(id)#12], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(2) Project [pythonUDF0#5 AS <lambda>(id)#3]
+- BatchEvalPython [<lambda>(id#0L)], [pythonUDF0#5]
+- *(1) Range (0, 1, step=1, splits=12)
```
Closes apache#28774 from ueshin/issues/SPARK-31945/udf_cache.
Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>1 parent e14029b commit 032d179
4 files changed
Lines changed: 25 additions & 4 deletions
File tree
- core/src/main/scala/org/apache/spark/api/python
- python/pyspark/sql/tests
- sql/core/src/main/scala/org/apache/spark/sql/execution/python
Lines changed: 14 additions & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
74 | 74 | | |
75 | 75 | | |
76 | 76 | | |
77 | | - | |
| 77 | + | |
78 | 78 | | |
79 | 79 | | |
80 | 80 | | |
81 | 81 | | |
82 | 82 | | |
83 | | - | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
84 | 96 | | |
85 | 97 | | |
86 | 98 | | |
| |||
Lines changed: 1 addition & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
613 | 613 | | |
614 | 614 | | |
615 | 615 | | |
616 | | - | |
| 616 | + | |
617 | 617 | | |
618 | 618 | | |
619 | 619 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
642 | 642 | | |
643 | 643 | | |
644 | 644 | | |
| 645 | + | |
| 646 | + | |
| 647 | + | |
| 648 | + | |
| 649 | + | |
| 650 | + | |
| 651 | + | |
| 652 | + | |
| 653 | + | |
645 | 654 | | |
646 | 655 | | |
647 | 656 | | |
| |||
Lines changed: 1 addition & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
104 | 104 | | |
105 | 105 | | |
106 | 106 | | |
107 | | - | |
| 107 | + | |
108 | 108 | | |
109 | 109 | | |
110 | 110 | | |
| |||
0 commit comments