Conversation

@carsonwang
Contributor

It is expensive to create Java TimeZone and Calendar objects in every method of DateTimeUtils. We can reuse these objects to improve performance. In one of my SQL queries, which calls stringToDate many times, the duration of the stage improved from 1.6 minutes to 1.2 minutes.

@SparkQA

SparkQA commented Feb 4, 2016

Test build #50744 has finished for PR 11071 at commit cb9b157.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2016

Test build #50746 has finished for PR 11071 at commit 0ab90ed.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@transient lazy val defaultTimeZone = TimeZone.getDefault

// Reuse the TimeZone object as it is expensive to create in each method call.
final val timeZones = new ConcurrentHashMap[String, TimeZone]
Contributor

This map could grow quite big, because the string keys vary. Actually, ZoneInfoFile already provides a cache for the different IDs. Let's find out whether the boost you mentioned comes from reusing TimeZone instances or from reusing Calendar instances.

Contributor

By using this map we can skip a lot of calls to getTimeZone, which is a synchronized method, so a ConcurrentHashMap can indeed help performance. Do we need to add @transient?
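The caching pattern under discussion can be sketched as follows. This is a minimal, self-contained sketch, not the PR's actual code; the class name `TimeZoneCache` and method name `getTimeZoneCached` are illustrative:

```java
import java.util.TimeZone;
import java.util.concurrent.ConcurrentHashMap;

public final class TimeZoneCache {
    // Cache TimeZone instances so hot paths avoid the synchronized
    // TimeZone.getTimeZone lookup on every call.
    private static final ConcurrentHashMap<String, TimeZone> TIME_ZONES =
        new ConcurrentHashMap<>();

    // Hypothetical helper: the synchronized lookup only runs on a cache
    // miss; subsequent calls for the same ID are lock-free map reads.
    public static TimeZone getTimeZoneCached(String id) {
        return TIME_ZONES.computeIfAbsent(id, TimeZone::getTimeZone);
    }
}
```

Because computeIfAbsent stores the created value, repeated calls with the same ID return the same TimeZone instance without entering the synchronized method again.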

Contributor Author

Added @transient. The total number of available time zone IDs should be limited.

@carsonwang
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 5, 2016

Test build #50794 has finished for PR 11071 at commit 72b31c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 5, 2016

Test build #50797 has finished for PR 11071 at commit 72b31c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Feb 5, 2016

@transient is wrong for a static member. What are you trying to do, avoid a large cache? There are ways to do that, but this is not it. Besides, how many time zones could there be? And are these not already cached?

@adrian-wang
Contributor

The map keys could be values like "UTC+01:00", "America/Los_Angeles", "PST", etc. They are already cached in getTimeZone, but the method itself is synchronized.

@carsonwang
Contributor Author

I have a subquery like this:

SELECT a, b, c FROM table UV WHERE (datediff(UV.visitDate, '1997-01-01') >= 0 AND datediff(UV.visitDate, '2015-01-01') <= 0)

When profiling this stage with Spark 1.6, I noticed a lot of time was consumed by DateTimeUtils.stringToDate. In particular, TimeZone.getTimeZone and Calendar.getInstance are extremely slow. The table stores visitDate as a String and has 3 billion records, which means 3 billion Calendar and TimeZone objects are created.

TimeZone.getTimeZone is a synchronized method and will block other threads calling it. #10994 fixed this for DateTimeUtils.stringToDate, but DateTimeUtils.stringToTimestamp has the same issue, so I tried to cache the TimeZone objects in a map. The total number of available TimeZones should be limited.

By reusing the Calendar object instead of creating one in each method call, I see an even bigger performance improvement. Creating 20 million Calendar objects takes more than 20 seconds on my machine, so we will benefit from reusing it.
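Since Calendar is mutable and not thread-safe, per-thread reuse is the usual way to avoid this allocation cost. A sketch under those assumptions (the class name `CalendarHolder` and helper `getCleared` are mine, not the PR's code):

```java
import java.util.Calendar;
import java.util.TimeZone;

public final class CalendarHolder {
    // Calendar is mutable and not thread-safe, so each thread reuses its
    // own instance instead of allocating a fresh one per call.
    private static final ThreadLocal<Calendar> CALENDAR =
        ThreadLocal.withInitial(() ->
            Calendar.getInstance(TimeZone.getTimeZone("UTC")));

    // Hypothetical helper: clear() resets all fields so state from the
    // previous use does not leak into the next computation.
    public static Calendar getCleared() {
        Calendar c = CALENDAR.get();
        c.clear();
        return c;
    }
}
```

Each thread pays the Calendar.getInstance cost once; every later call is a ThreadLocal lookup plus a clear().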

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50804/
Test FAILed.

@srowen
Member

srowen commented Feb 6, 2016

@adrian-wang but the @transient was applied to the whole reference to the map, which doesn't make sense. The way to avoid a big cache is with weak keys, which has nothing to do with transient or serialization; this is not an instance field anyway.
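A stdlib-only sketch of the weak-keyed cache idea (the names here are illustrative; in practice a purpose-built cache such as Guava's CacheBuilder with weakKeys() would be more idiomatic, and note the tradeoff that the synchronizedMap wrapper reintroduces some of the locking this PR is trying to avoid):

```java
import java.util.Collections;
import java.util.Map;
import java.util.TimeZone;
import java.util.WeakHashMap;

public final class WeakTimeZoneCache {
    // WeakHashMap drops entries once their keys are no longer strongly
    // reachable, so the cache cannot grow without bound. Caveats: the
    // wrapper is synchronized, and String literals are interned and thus
    // strongly held, so literal keys will in practice never be evicted.
    private static final Map<String, TimeZone> CACHE =
        Collections.synchronizedMap(new WeakHashMap<>());

    public static TimeZone get(String id) {
        return CACHE.computeIfAbsent(id, TimeZone::getTimeZone);
    }
}
```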

@adrian-wang
Contributor

@srowen you are right.
