Conversation

@klinvill commented Jan 30, 2017

The contribution is my original work and I license the work to the project under the project’s open source license.

Note: the Teradata JDBC connector limits the row size to 64K. The default string data type equivalent I used is a 255 character/byte VARCHAR, so at 255 bytes per string column this effectively limits the maximum number of string columns to roughly 250 when using the Teradata JDBC connector (64K / 255 ≈ 257, less per-row overhead).

What changes were proposed in this pull request?

Added a teradataDialect for JDBC connections to Teradata. The Teradata dialect uses VARCHAR(255) in place of TEXT for string data types, and CHAR(1) in place of BIT(1) for boolean data types.
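For context, a minimal sketch of what such a dialect can look like, pieced together from the snippets quoted later in this thread (the merged code may differ in detail):

  import java.sql.Types

  import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
  import org.apache.spark.sql.types._

  // Sketch of a Teradata dialect, based on the review snippets below.
  case object TeradataDialect extends JdbcDialect {

    // Handle only Teradata JDBC URLs.
    override def canHandle(url: String): Boolean =
      url.startsWith("jdbc:teradata")

    // Teradata has no TEXT or BIT(1), so map Spark's StringType and
    // BooleanType to VARCHAR(255) and CHAR(1) respectively.
    override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
      case StringType => Some(JdbcType("VARCHAR(255)", Types.VARCHAR))
      case BooleanType => Some(JdbcType("CHAR(1)", Types.CHAR))
      case _ => None
    }
  }

A dialect like this gets picked up once registered through JdbcDialects.registerDialect.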

How was this patch tested?

I added two unit tests to double-check that the types get set correctly for a Teradata JDBC URL. I also ran a couple of manual tests to make sure the JDBC connector worked with Teradata and that an error was thrown if a row could potentially exceed 64K (this error comes from the Teradata JDBC connector, not from the Spark code). I did not check how string columns longer than 255 characters are handled.
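A rough sketch of how such a type check can be written (hypothetical test body and URL; the actual tests may be phrased differently):

  import org.apache.spark.sql.jdbc.JdbcDialects
  import org.apache.spark.sql.types.{BooleanType, StringType}

  // Resolve the dialect from a Teradata JDBC URL and confirm the
  // overridden type mappings. The URL is a placeholder.
  val dialect = JdbcDialects.get("jdbc:teradata://localhost/database=test")
  assert(dialect.getJDBCType(StringType)
    .map(_.databaseTypeDefinition).contains("VARCHAR(255)"))
  assert(dialect.getJDBCType(BooleanType)
    .map(_.databaseTypeDefinition).contains("CHAR(1)"))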

@gatorsmile (Member)

Are we able to find a Docker image for Teradata Express? For example, we did that for Postgres.

Otherwise, it is hard for us to verify this.

@klinvill (Author)

Unfortunately I don't think there's a Docker image for Teradata available yet. They do have a VM version and an AMI. Would either of those be sufficient?

@gatorsmile (Member)

I do not have a way to plug a VM into our test framework. How are you testing this?

@klinvill (Author)

I just tested it manually against a Teradata instance I have running. I didn't test it extensively, other than making sure that a write to a Teradata table using a string data type worked correctly for smaller strings (<255 characters).
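For reference, a manual check along those lines might look like the following, assuming a SparkSession named spark (e.g. in spark-shell); the host, database, table, and credentials are placeholders:

  // Write a DataFrame with a short string column to Teradata over JDBC.
  val df = spark.createDataFrame(Seq((1, "short string"))).toDF("id", "note")

  df.write
    .format("jdbc")
    .option("url", "jdbc:teradata://td-host/DATABASE=test")
    .option("dbtable", "test.spark_strings")
    .option("user", "dbc")
    .option("password", "dbc")
    .option("driver", "com.teradata.jdbc.TeraDriver")
    .save()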

case StringType => Some(JdbcType("VARCHAR(255)", java.sql.Types.VARCHAR))
case BooleanType => Option(JdbcType("CHAR(1)", java.sql.Types.CHAR))
case _ => None
}
@dongjoon-hyun (Member) commented Jan 31, 2017

Hi, @klinvill.
According to the description and initial PR in SPARK-15648, Teradata didn't support LIMIT queries at that time.
Does it support LIMIT now?

@klinvill (Author)

Hi @dongjoon-hyun,
Teradata still doesn't support LIMIT (it uses TOP instead), but the Spark code that originally used LIMIT has been changed to use "WHERE 1=0" instead.

/**
   * Get the SQL query that should be used to find if the given table exists. Dialects can
   * override this method to return a query that works best in a particular database.
   * @param table  The name of the table.
   * @return The SQL query to use for checking the table.
   */
  def getTableExistsQuery(table: String): String = {
    s"SELECT * FROM $table WHERE 1=0"
  }

  /**
   * The SQL query that should be used to discover the schema of a table. It only needs to
   * ensure that the result set has the same schema as the table, such as by calling
   * "SELECT * ...". Dialects can override this method to return a query that works best in a
   * particular database.
   * @param table The name of the table.
   * @return The SQL query to use for discovering the schema.
   */
  @Since("2.1.0")
  def getSchemaQuery(table: String): String = {
    s"SELECT * FROM $table WHERE 1=0"
  }

Member:

+1


package org.apache.spark.sql.jdbc

import java.sql.Types
Member:

A blank line is needed here. You can run the following command to check this and to confirm the fix.

$ dev/lint-scala

@klinvill (Author)

Thanks! Fixed in latest commit.

case StringType => Some(JdbcType("VARCHAR(255)", java.sql.Types.VARCHAR))
case BooleanType => Option(JdbcType("CHAR(1)", java.sql.Types.CHAR))
case _ => None
}
Member:

Could you verify whether we need to override the following together?

  override def quoteIdentifier(colName: String): String = ...
  override def getTableExistsQuery(table: String): String = ...
  override def isCascadingTruncateTable(): Option[Boolean] = ...

Member:

What about isCascadingTruncateTable? Could you check whether Teradata truncates cascadingly by default for the TRUNCATE TABLE statement?

@klinvill (Author)

quoteIdentifier and getTableExistsQuery will both work for Teradata. Teradata does not cascade by default, but it also doesn't have a TRUNCATE TABLE command (DELETE is used instead), so any commands that use TRUNCATE TABLE will fail.
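Given that, if the dialect were to make the truncate behavior explicit, the override could look like this sketch (not necessarily what was merged):

  import org.apache.spark.sql.jdbc.JdbcDialect

  // Sketch: Teradata has no TRUNCATE TABLE (DELETE is used instead) and
  // does not cascade by default, so a dialect could report that explicitly.
  case object TeradataDialectSketch extends JdbcDialect {
    override def canHandle(url: String): Boolean =
      url.startsWith("jdbc:teradata")
    override def isCascadingTruncateTable(): Option[Boolean] = Some(false)
  }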

@dongjoon-hyun (Member)

BTW, @klinvill.
Do you use a real instance? Could you advise how others like me can verify your PR on Teradata? Could we use Teradata Express or the AWS Marketplace?

@klinvill (Author) commented Feb 1, 2017

@dongjoon-hyun Yup, I was using a real instance for testing. The best way to test without a real instance is probably the Teradata Express VM: http://downloads.teradata.com/download/database/teradata-express-for-vmware-player. You can also build an instance from an AMI, but the AMI is fairly expensive, so I'd recommend the Express VM instead. Unfortunately there's currently no dockerized version available.

@klinvill (Author)

Hi @dongjoon-hyun @gatorsmile, just circling back. Is it going to be impractical to check this PR against a VM rather than a Docker image?

@gatorsmile (Member)

ok to test

@SparkQA commented May 23, 2017

Test build #77257 has finished for PR 16746 at commit 91e12e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

LGTM

@gatorsmile (Member)

Thanks! Merging to master.

@asfgit closed this in 4816c2e on May 23, 2017
@klinvill (Author)

Thanks for the help and review!

lycplus pushed a commit to lycplus/spark that referenced this pull request on May 24, 2017.

Author: Kirby Linvill <[email protected]>
Author: klinvill <[email protected]>

Closes apache#16746 from klinvill/master.