[SPARK-13185][SQL] Improve the performance of DateTimeUtils by reusing TimeZone and Calendar objects #11071
Test build #50744 has finished for PR 11071 at commit
Test build #50746 has finished for PR 11071 at commit
@transient lazy val defaultTimeZone = TimeZone.getDefault

// Reuse the TimeZone object as it is expensive to create in each method call.
final val timeZones = new ConcurrentHashMap[String, TimeZone]
This map could be quite big, because the string varies. Actually ZoneInfoFile does provide a cache for different IDs. Let's find out whether the boost you mentioned comes from reusing TimeZone or Calendar instances.
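One rough way to separate the two effects is to time each operation on its own. The sketch below is only an illustration: the object name `ReuseTiming` and the helper `time` are hypothetical, the loop is a crude wall-clock measurement rather than a proper benchmark harness, and JIT warm-up or dead-code elimination can distort the numbers.

```scala
import java.util.{GregorianCalendar, TimeZone}

// Hypothetical micro-benchmark sketch; not part of the PR.
object ReuseTiming {
  // Runs `body` n times and returns the elapsed wall-clock time in nanoseconds.
  def time(n: Int)(body: => Unit): Long = {
    val start = System.nanoTime()
    var i = 0
    while (i < n) {
      body
      i += 1
    }
    System.nanoTime() - start
  }

  def main(args: Array[String]): Unit = {
    val n = 1000000
    val cachedZone = TimeZone.getTimeZone("UTC")
    val cachedCalendar = new GregorianCalendar(cachedZone)

    // Cost of looking the zone up on every call.
    val zoneLookup = time(n) { TimeZone.getTimeZone("UTC") }
    // Cost of building a new calendar on every call, with the zone already cached.
    val calendarCreate = time(n) { new GregorianCalendar(cachedZone) }
    // Cost of resetting one reused calendar instead of building a new one.
    val calendarReuse = time(n) { cachedCalendar.clear() }

    println(s"TimeZone.getTimeZone:  ${zoneLookup / 1000000} ms")
    println(s"new GregorianCalendar: ${calendarCreate / 1000000} ms")
    println(s"reused Calendar.clear: ${calendarReuse / 1000000} ms")
  }
}
```

If the first number dominates, the gain comes mostly from caching TimeZone; if the second does, it comes mostly from reusing Calendar.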
By using this map we can skip a lot of calls to getTimeZone, which is a synchronized method, so it's true that a ConcurrentHashMap can help improve performance. Do we need to add a transient?
Added transient. The total available timezone IDs should be limited.
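The caching idea discussed in this thread can be sketched as follows. This is a minimal illustration rather than the code in the PR: the object name `TimeZoneCache` and its methods are hypothetical, and `computeIfAbsent` assumes Java 8 or later.

```scala
import java.util.TimeZone
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

// Hypothetical helper; the names here are not from the PR.
object TimeZoneCache {
  // Maps a zone ID string to the TimeZone created for it.
  private val timeZones = new ConcurrentHashMap[String, TimeZone]

  // Falls back to TimeZone.getTimeZone only on a cache miss.
  private val loader = new JFunction[String, TimeZone] {
    override def apply(id: String): TimeZone = TimeZone.getTimeZone(id)
  }

  // Returns the cached TimeZone for `id`, creating it on first use.
  def get(id: String): TimeZone = timeZones.computeIfAbsent(id, loader)

  // Number of distinct ID strings cached so far.
  def size: Int = timeZones.size
}
```

Two caveats with this sketch: the map grows with every distinct ID string it sees, which is the size concern raised above, and every caller shares the same mutable TimeZone instance, so callers must not modify it.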
retest this please
Test build #50794 has finished for PR 11071 at commit
Test build #50797 has finished for PR 11071 at commit
transient is wrong for a static member. What are you trying to do, avoid a large cache? There are ways to do this, but it's not this. But how many timezones could there be? And are these not already cached?
The map key could be like "UTC+01:00", "America/Los_Angeles", "PST", etc. They are already cached in
I have a sub query like this
By reusing
Test FAILed.
@adrian-wang but the
@srowen you are right.
It is expensive to create Java TimeZone and Calendar objects in each method of DateTimeUtils. We can reuse the objects to improve performance. In one of my SQL queries, which calls StringToDate many times, the duration of the stage improved from 1.6 minutes to 1.2 minutes.
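As an illustration of reusing a Calendar, here is a minimal sketch that keeps one instance per thread, since Calendar is not thread-safe. It is an assumption about how such reuse could look, not the implementation in this PR; `CalendarReuse` and `millisAtUtcMidnight` are hypothetical names.

```scala
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Hypothetical helper; not the DateTimeUtils code from the PR.
object CalendarReuse {
  private val utc = TimeZone.getTimeZone("UTC")

  // One Calendar per thread, created lazily on first access.
  private val threadLocalCalendar = new ThreadLocal[Calendar] {
    override def initialValue(): Calendar = new GregorianCalendar(utc)
  }

  // Milliseconds since the epoch for midnight UTC on the given date (month is 1-based).
  def millisAtUtcMidnight(year: Int, month: Int, day: Int): Long = {
    val calendar = threadLocalCalendar.get()
    calendar.clear() // reset every field left over from the previous call
    calendar.set(year, month - 1, day, 0, 0, 0)
    calendar.getTimeInMillis
  }
}
```

Calling `clear()` before each use matters here: without it, fields set by an earlier call (milliseconds, for example) would leak into the next result.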