Conversation

@carsonwang
Contributor

It is expensive to create Java TimeZone and Calendar objects in every method of DateTimeUtils. We can reuse these objects to improve performance. In one of my SQL queries, which calls stringToDate many times, the duration of the stage improved from 1.6 minutes to 1.2 minutes.

@SparkQA

SparkQA commented Feb 4, 2016

Test build #50744 has finished for PR 11071 at commit cb9b157.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2016

Test build #50746 has finished for PR 11071 at commit 0ab90ed.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@transient lazy val defaultTimeZone = TimeZone.getDefault

// Reuse the TimeZone object as it is expensive to create in each method call.
final val timeZones = new ConcurrentHashMap[String, TimeZone]
Contributor

This map could grow quite big, because the string keys vary. Actually, ZoneInfoFile already provides a cache for the different IDs. Let's find out whether the boost you mentioned comes from reusing TimeZone instances or from reusing Calendar instances.

Contributor

By using this map we can skip a lot of calls to getTimeZone, which is a synchronized method, so a ConcurrentHashMap can indeed help performance. Do we need to add @transient?
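The caching pattern under discussion can be sketched as follows. This is a minimal, self-contained sketch, not the PR's actual code; the class name `TimeZoneCache` and method name `getTimeZoneCached` are illustrative:

```java
import java.util.TimeZone;
import java.util.concurrent.ConcurrentHashMap;

public final class TimeZoneCache {
    // Cache TimeZone instances so hot paths avoid the synchronized
    // TimeZone.getTimeZone lookup on every call.
    private static final ConcurrentHashMap<String, TimeZone> TIME_ZONES =
        new ConcurrentHashMap<>();

    // Hypothetical helper: the synchronized lookup only runs on a cache
    // miss; subsequent calls for the same ID are lock-free map reads.
    public static TimeZone getTimeZoneCached(String id) {
        return TIME_ZONES.computeIfAbsent(id, TimeZone::getTimeZone);
    }
}
```

Because computeIfAbsent stores the created value, repeated calls with the same ID return the same TimeZone instance without entering the synchronized method again.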

Contributor Author

Added @transient. The total number of available time zone IDs should be limited.

@carsonwang
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 5, 2016

Test build #50794 has finished for PR 11071 at commit 72b31c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 5, 2016

Test build #50797 has finished for PR 11071 at commit 72b31c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Feb 5, 2016

@transient is wrong for a static member. What are you trying to do, avoid a large cache? There are ways to do that, but this is not it. Besides, how many time zones could there be? And are these not already cached?

@adrian-wang
Contributor

The map keys could be values like "UTC+01:00", "America/Los_Angeles", "PST", etc. They are already cached in getTimeZone, but the method itself is synchronized.

@carsonwang
Contributor Author

I have a subquery like this:

SELECT a, b, c FROM table UV WHERE (datediff(UV.visitDate, '1997-01-01') >= 0 AND datediff(UV.visitDate, '2015-01-01') <= 0)

When profiling this stage with Spark 1.6, I noticed a lot of time was consumed by DateTimeUtils.stringToDate. In particular, TimeZone.getTimeZone and Calendar.getInstance are extremely slow. The table stores visitDate as a String and has 3 billion records, which means 3 billion Calendar and TimeZone objects are created.

TimeZone.getTimeZone is a synchronized method and will block other threads calling it. #10994 fixed this for DateTimeUtils.stringToDate, but DateTimeUtils.stringToTimestamp has the same issue, so I tried to cache the TimeZone objects in a map. The total number of available TimeZones should be limited.

By reusing the Calendar object instead of creating one in each method call, I see an even bigger performance improvement. Creating 20 million Calendar objects takes more than 20 seconds on my machine, so we will benefit from reusing it.
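Since Calendar is mutable and not thread-safe, per-thread reuse is the usual way to avoid this allocation cost. A sketch under those assumptions (the class name `CalendarHolder` and helper `getCleared` are mine, not the PR's code):

```java
import java.util.Calendar;
import java.util.TimeZone;

public final class CalendarHolder {
    // Calendar is mutable and not thread-safe, so each thread reuses its
    // own instance instead of allocating a fresh one per call.
    private static final ThreadLocal<Calendar> CALENDAR =
        ThreadLocal.withInitial(() ->
            Calendar.getInstance(TimeZone.getTimeZone("UTC")));

    // Hypothetical helper: clear() resets all fields so state from the
    // previous use does not leak into the next computation.
    public static Calendar getCleared() {
        Calendar c = CALENDAR.get();
        c.clear();
        return c;
    }
}
```

Each thread pays the Calendar.getInstance cost once; every later call is a ThreadLocal lookup plus a clear().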

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50804/
Test FAILed.

@srowen
Member

srowen commented Feb 6, 2016

@adrian-wang but the @transient was applied to the whole reference to the map, which doesn't make sense. The way to avoid a big cache is with weak keys, which has nothing to do with transient or serialization; this is not an instance field anyway.
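A stdlib-only sketch of the weak-keyed cache idea (the names here are illustrative; in practice a purpose-built cache such as Guava's CacheBuilder with weakKeys() would be more idiomatic, and note the tradeoff that the synchronizedMap wrapper reintroduces some of the locking this PR is trying to avoid):

```java
import java.util.Collections;
import java.util.Map;
import java.util.TimeZone;
import java.util.WeakHashMap;

public final class WeakTimeZoneCache {
    // WeakHashMap drops entries once their keys are no longer strongly
    // reachable, so the cache cannot grow without bound. Caveats: the
    // wrapper is synchronized, and String literals are interned and thus
    // strongly held, so literal keys will in practice never be evicted.
    private static final Map<String, TimeZone> CACHE =
        Collections.synchronizedMap(new WeakHashMap<>());

    public static TimeZone get(String id) {
        return CACHE.computeIfAbsent(id, TimeZone::getTimeZone);
    }
}
```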

@adrian-wang
Contributor

@srowen you are right.
