Teradata QueryGrid: Teradata Database-to-Presto documentation

Introduction

Teradata® QueryGrid™: Teradata Database-to-Presto (henceforth known as “T2P”) provides a SQL interface to query Presto from the Teradata Database. Presto is a SQL query engine for big data that provides access to many data sources, including Hadoop ecosystem databases like Hive and traditional databases like MySQL and PostgreSQL.

You can issue queries from Teradata with T2P that do the following:

  • Import data from Presto tables into an existing table in the Teradata Database.
  • Export data from Teradata and store it in Hadoop or one of the databases that Presto has a connector for.
  • Select data from Presto tables and join it with existing Teradata data.

Prerequisites

T2P is compatible with versions of the Teradata database in the 15.00 and 15.10 releases. The minimum version for the 15.00 release is 15.00.04. The minimum version for the 15.10 release is 15.10.01.01. The Teradata database must have Java 8 installed. Additionally, there are recommended changes to the Teradata JVM properties that greatly improve performance, as described here: JVM Configuration. Changing the JVM properties on the Teradata database requires a database restart.

Only Presto version 0.141t is supported with this version of T2P.

These installation instructions use the presto-admin utility, described here. The administrator needs sudo and ssh access to all of the Presto nodes to install a new connector with presto-admin.

Installation

To install T2P, you must install a connector on Presto and install a driver on Teradata. To be able to run queries, you must create one or more T2P foreign servers on Teradata.

Before starting installation, upload teradata-to-presto-driver.zip to both the Presto and Teradata clusters and unzip it:

unzip teradata-to-presto-driver.zip -d teradata-to-presto-driver
cd teradata-to-presto-driver
The archive contains the following:
  • the Teradata driver jar file, teradata-to-presto-driver.jar
  • the RPM file to be installed on the Presto cluster, teradata-to-presto-connector-*.noarch.rpm
  • the sample properties files to be copied to the Presto cluster, properties/querygrid.properties
  • the install script for the driver on Teradata, install_t2p.bteq
  • the uninstall script for the driver on Teradata, uninstall_t2p.bteq
  • this documentation

Installation on Presto

First, install a connector on Presto. These instructions use the presto-admin tool for easier management of Presto.

In order to install T2P on Presto:

  1. Install the teradata-to-presto-connector RPM:

    sudo ./presto-admin package install teradata-to-presto-connector-*.noarch.rpm
    
  1. Copy properties/querygrid.properties from the package zip file to /etc/opt/prestoadmin/connectors.

  2. In /etc/opt/prestoadmin/connectors/querygrid.properties, replace <PRESTO-HOST> with the actual Presto coordinator hostname or IP reachable by Teradata. If bynet is configured, make sure to use the bynet IP or a hostname that resolves to the bynet IP so that network traffic uses the fastest possible connection.

  3. Deploy the T2P connector configuration using presto-admin:

    sudo ./presto-admin connector add querygrid
    
  4. Restart Presto:

    sudo ./presto-admin server restart
    
  5. Use the Presto CLI to ensure that the T2P connector is installed:

./presto --server localhost:8080 --catalog system --schema default
presto:default> show catalogs;
  Catalog
-----------
 ...
 querygrid
 ...

Note

If running the CLI on any node but the Presto coordinator, substitute the Presto coordinator hostname/IP for “localhost” in the above command.

Installation on Teradata

Privileges and Security

T2P uses two Teradata accounts: a DBA user and a user that will run queries on the foreign server.

The DBA user (henceforth known as dba_user) is usually dbc or a user with similar permissions, and it installs the driver in the syslib database.

Grant the following permissions to dba_user:

GRANT EXECUTE PROCEDURE ON sqlj TO dba_user;
GRANT CREATE EXTERNAL PROCEDURE ON syslib TO dba_user WITH GRANT OPTION;
GRANT CREATE FUNCTION ON syslib TO dba_user;

The user that will be running queries (henceforth known as presto_user) can either be an existing user or a new user. That user must have sufficient spool space for the query workload: the spool space must be at least as large as the amount of data being fetched from Presto.

Grant the following permissions to presto_user:

GRANT SELECT ON td_server_db TO presto_user;
GRANT CREATE SERVER ON td_server_db TO presto_user;
GRANT DROP SERVER ON td_server_db TO presto_user;

JVM Configuration

T2P requires additional Java Virtual Machine (JVM) options. For best performance, it is recommended to use the Java G1 garbage collector and increase the JVM heap size to 30GB. Note that you need to have 30GB of memory free on the Teradata machine to do this; if not, you will need to set -Xms and -Xmx to a lower value. In order to set these JVM options, run the following commands on any Teradata node:

cufconfig -o | grep JVMOptions > old_jvmoptions.txt
echo "JVMOptions: -Xms30g -Xmx30g -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+ExplicitGCInvokesConcurrent" > /tmp/jre_config;
cufconfig -f /tmp/jre_config;
tpareset 'JVMOptions has been changed'

Note

cufconfig -f overwrites the existing JVM properties. To list the existing properties, run cufconfig -o.

Driver Installation

To install the driver, execute the following commands as dba_user, replacing ${DBA_USER} with the DBA user, ${PRESTO_USER} with the user that will query from the foreign server, and ${PATH_TO_T2P_DRIVER_JAR} with the local path to the teradata-to-presto-driver.jar file:

DATABASE syslib;

-- Execute following commands as an admin user (${DBA_USER})
-- Grant rights...
GRANT EXECUTE PROCEDURE ON sqlj TO ${DBA_USER};
GRANT CREATE EXTERNAL PROCEDURE ON syslib TO ${DBA_USER} WITH GRANT OPTION;
GRANT CREATE FUNCTION ON syslib TO ${DBA_USER};

GRANT SELECT ON td_server_db TO ${PRESTO_USER};
GRANT CREATE SERVER ON td_server_db TO ${PRESTO_USER};
GRANT DROP SERVER ON td_server_db TO ${PRESTO_USER};

-- Upload driver...
call sqlj.install_jar('CJ!${PATH_TO_T2P_DRIVER_JAR}', 'presto_teradata_driver_t2p', 0);

-- Create transport functions...
CREATE FUNCTION teradata_to_presto_import()
    RETURNS TABLE VARYING USING FUNCTION td_to_presto_import_contract
    LANGUAGE JAVA NO SQL
    PARAMETER STYLE SQLTABLE
    READS EXTERNAL DATA
    EXTERNAL NAME 'presto_teradata_driver_t2p:com.teradata.ptl.T2PImportDriver.execute';

GRANT EXECUTE FUNCTION ON syslib.teradata_to_presto_import TO ${PRESTO_USER};
GRANT SELECT ON syslib.teradata_to_presto_import TO ${PRESTO_USER};

CREATE FUNCTION teradata_to_presto_export()
    RETURNS TABLE VARYING USING FUNCTION td_to_presto_export_contract
    LANGUAGE JAVA NO SQL
    PARAMETER STYLE SQLTABLE
    MODIFIES EXTERNAL DATA
    EXTERNAL NAME 'presto_teradata_driver_t2p:com.teradata.ptl.T2PExportDriver.execute';

GRANT EXECUTE FUNCTION ON syslib.teradata_to_presto_export TO ${PRESTO_USER};
GRANT SELECT ON syslib.teradata_to_presto_export TO ${PRESTO_USER};

In the installer package, there is a script called install_t2p.bteq containing these commands. In that script, ${DBA_USER} and ${PRESTO_USER} must be set to the appropriate user names.

Foreign Server Creation

Once the driver is installed, create one or more foreign servers to be able to query data from Presto.

The CREATE FOREIGN SERVER command has the following syntax:

CREATE FOREIGN SERVER ${FOREIGN_SERVER_NAME}
[EXTERNAL SECURITY INVOKER TRUSTED ${AUTH_OBJECT}]
USING
  PARAM1('value1')
  PARAM2('value2')
  ...
  PARAMN('valuen')
DO IMPORT WITH syslib.teradata_to_presto_import,
DO EXPORT WITH syslib.teradata_to_presto_export;

${FOREIGN_SERVER_NAME} must be a valid object name in Teradata. PARAM1 to PARAMN are name value pairs (NVPs). Some NVPs are required to create a functional foreign server – see Required Name Value Pairs – while others are necessary for connecting to particular Presto servers – see Optional Name Value Pairs. Other NVPs can be used to tune T2P performance given your expected workload – see Performance Tuning Name Value Pairs.

EXTERNAL SECURITY INVOKER TRUSTED ${AUTH_OBJECT} is used only for Kerberos authentication: for a detailed description of how to configure Kerberos, see Kerberos Support.

If the Presto host is located on 192.0.2.3, a minimal foreign server definition is as follows:

CREATE FOREIGN SERVER presto_server
USING
  HOSTTYPE('NVP')
  TemporaryTableCatalog('querygrid')
  DB_HOST('192.0.2.3')
DO IMPORT WITH syslib.teradata_to_presto_import,
DO EXPORT WITH syslib.teradata_to_presto_export;

See the sections below for more information on the required and optional configuration parameters.

Required Name Value Pairs

The following required name value pairs (NVPs) must be added to the USING clause. NVPs are specified with the form NVP_NAME(‘value’) – e.g. HOSTTYPE(‘NVP’).

HOSTTYPE
For T2P, this is the string ‘NVP’.
TemporaryTableCatalog
The name of the Presto connector installed for T2P. In most cases, this will be ‘querygrid’, unless in connector_installation a name other than querygrid.properties was used for the configuration file.
DB_HOST
The IP address or hostname of the Presto coordinator, accessible from the Teradata hosts. This value must match the host provided in the data-exchange.coordinator.host-and-port property in the Presto configuration in connector_installation.

Optional Name Value Pairs

Additional name value pairs (NVPs) may be needed to connect the Teradata and Presto servers. These NVPs can be specified after the required NVPs with the form NVP_NAME(‘value’) – e.g. DB_PORT(‘8081’).

DB_PORT
The port on which the Presto coordinator is running, accessible from the Teradata hosts. This value must match the port number provided in http-server.http.port (or http-server.https.port, for Kerberos). If not specified, the default is 8080.
TRANSPORT_COORDINATOR_PORT
The port used by T2P to transfer data. This value must match the port number provided in the data-exchange.server.port property in the Presto configuration. If not specified, the default is 21337.
DB_NAME
The Presto catalog that the Teradata foreign server will query. To determine the possible catalogs, run SHOW CATALOGS on Presto; to add new catalogs, see the Presto connector documentation. If not specified, the default is hive.
DB_USER
The user that will execute queries on Presto. This user can be used for workload management – see the Presto queue documentation for more details. If not specified, the default is qg_t2p.
dbname
The schema to query in the catalog specified by DB_NAME. When set, it is not possible to query from any other schema. Without setting this parameter, you will need to explicitly specify the schema every time you run a query.
MaximumTextLength
The maximum length of text in Unicode characters that is transferred. If a VARCHAR or TEXT value is longer than MaximumTextLength, it will be silently truncated. When importing VARCHAR columns into Teradata, their size is MaximumTextLength. The maximum row size in Teradata is approximately 64K, so for tables with many VARCHAR columns, it may be necessary to decrease this value. If not specified, the default is 4096 Unicode characters.
MaximumBinaryLength
The maximum length of binary data in bytes that is transferred. If binary data is longer than MaximumBinaryLength, an error is thrown. If not specified, the default is 4096 bytes.

Querying Presto via a Foreign Server

While logged into Teradata as presto_user, or some other user that has access to the foreign server you created, you can issues the following commands to access data in Presto:

To see the list of schemas (databases) of a foreign server:

HELP FOREIGN SERVER presto_server;

To see the list of tables in a schema (database), e.g. db1:

HELP FOREIGN DATABASE db1@presto_server;

The Hive connector has a schema named default. When referencing this schema in any SQL statement, it needs to be quoted because default is a keyword:

HELP FOREIGN DATABASE "default"@presto_server;

To see the list of columns in a table from a foreign server:

HELP FOREIGN TABLE db1.region@presto_server;

Note

If the schema name has been provided during foreign server creation with the dbname name value pair, then you must not prefix the table name with the schema name.

To select a table from Presto:

SELECT * from db1.region@presto_server;

To insert into a table in Presto:

INSERT INTO db1.region@presto_server SELECT * from region_on_teradata_table;

To run a pass-through query directly on Presto:

SELECT *
FROM FOREIGN TABLE(
   SELECT count(*) as c
   FROM nation
)@presto_server AS presto_query

When running a pass-through query, the syntax of the query is not checked at all, but the query is passed directly to Presto.

Drop Foreign Servers

To remove a foreign server, execute the following command as presto_user, or some user with DROP FOREIGN SERVER privileges on the server:

DROP FOREIGN SERVER ${FOREIGN_SERVER_NAME};

Uninstall Driver

The driver can be uninstalled only after dropping all of the foreign servers. To uninstall the driver, execute the following command as dba_user:

DATABASE syslib;

DROP FUNCTION teradata_to_presto_import;
DROP FUNCTION teradata_to_presto_export;

CALL sqlj.remove_jar('presto_teradata_driver_t2p',0);

In the installer package there is a script called uninstall_t2p.bteq that contains these commands.

After uninstalling T2P, it is recommended that you restart your Teradata Database to clean up resources allocated by the driver. Restarting Teradata is necessary before reinstalling or upgrading T2P. To restart Teradata, issue the following command:

tpareset "QueryGrid driver has been uninstalled."

Configuration

Performance Tuning Name Value Pairs

There are additional name value pairs that can be added to the CREATE FOREIGN SERVER definition to tune performance for a given workload. They must be added to the CREATE FOREIGN SERVER definition after the required NVPs with the form NVP_NAME(‘value’) – e.g. PrestoWriterCount(‘2’).

PrestoWriterCount

The number of concurrent Presto result writer threads per node. If not specified, the default is 8.

If your workload mainly consists of running large queries with low concurrency, you will more fully use your cluster if it is set to a higher value. However, if your workload contains concurrent queries, you may need to lower it.

For example, if your Presto cluster has 64 cores per machine, to fully utilize all of those cores with 4 concurrent queries, you need to set PrestoWriterCount to at least 16. However, a PrestoWriterCount that is too high can cause excessive CPU utilization. If Presto starts to fail with timeout related issues, try to decrease PrestoWriterCount.

When increasing PrestoWriterCount, you must also change the node-scheduler settings in Presto’s config.properties file:

node-scheduler.max-splits-per-node=MAX_SPLITS_PER_NODE
node-scheduler.max-pending-splits-per-node-per-task=MAX_SPLITS_PER_TASK_PER_NODE

Where MAX_SPLITS_PER_TASK_PER_NODE must be greater than or equal to PrestoWriterCount and MAX_SPLITS_PER_NODE must be greater than or equal to PrestoWriterCount multiplied by the number of queries that are intended to run concurrently. This Presto configuration change requires a Presto restart, which might require a downtime window for the Presto cluster.

Compression

The type of compression used for data transfer between Presto and Teradata. If not specified, the data is not compressed.

If the network bandwidth is the bottleneck, data compression may increase throughput at the cost of increased CPU utilization. The possible values for COMPRESSION are LZ4 and SNAPPY. Both algorithms are focused on compression and decompression speed. Compression ratio and compression speed may vary for different queries and data sets.

ReadTimeout

The maximum amount of time spent waiting for query results. It is recommended to increase this value for queries that require complex computation before returning any data. Timeouts must have both a number and a unit: for example, 10s, 5m, or 2h.

SELECT *
FROM FOREIGN TABLE(
  SELECT count(*)
  FROM some_enormous_table
)@foreign_server AS table;

If not specified, the default is 2h.

WriteTimeout

The maximum amount of time spent waiting for consumer to consume the results. It is recommended to increase this value for queries that require complex computation after consuming the data. Timeouts must have both a number and a unit: for example, 10s, 5m, or 2h.

SELECT fibonacci(bigint_column)
FROM large_table@foreign_server;

If not specified, the default is 1h.

QueryTimeout

The maximum amount of time spent waiting for the remote query to finish after data transfer is finished. It is recommended to increase this value for INSERT INTO queries which insert the data into slow consuming connectors. Timeouts must have both a number and a unit: for example, 10s, 5m, or 2h.

INSERT INTO "slow_consuming_connector"."slow_consuming_table"
SELECT * FROM large_table;

If not specified, the default is 1h.

Data Exchange Configuration

The following properties control the data exchange. They can be set for both Presto and Teradata:

  • data-exchange.server.buffer-size - query results write buffer size. The buffer is shared between all the queries that are in progress. If not specified, the default is 2GB.
  • data-exchange.server.port - data transfer server port. If not specified, the default is 21337.
  • data-exchange.client.buffer-size - query results read buffer size. The buffer is shared between all the queries that are in progress. If not specified, the default is 2GB.

Note

Buffer sizes must be specified in human readable data size notation: 100MB, 1GB, 1TB

The data-exchange.server.buffer-size and data-exchange.client.buffer-size properties may need to be changed if latency in the system causes the buffers to fill up. This will cause increased memory usage by T2P.

The data-exchange.server-port property must be changed if the port 21337 is not free. If the port on Presto is changed, TRANSPORT_COORDINATOR_PORT in the foreign server definition, as described in Optional Name Value Pairs, must also be changed to match.

To change a property for Presto, edit the T2P connector configuration file (/etc/opt/prestoadmin/connectors/querygrid.properties). To apply your changes, distribute new properties file to all Presto nodes and restart the Presto server:

sudo ./presto-admin connector add querygrid
sudo ./presto-admin server restart

For Teradata, the data exchange properties are shared across all T2P foreign servers. In order to change a static property for Teradata, you must set a JVM system property in the following format: -Dt2p.property.name=value. Properties for the Teradata JVM can be set using the cufconfig utility. To apply your changes, you must restart the Teradata database using the tpareset command.

Kerberos Support

It is possible to create a foreign server that connects to Presto via Kerberos. Before doing so, you must set up the Presto server to authenticate via Kerberos (see the Presto docs: http://teradata.github.io/presto/docs/current/security/server.html). A few notes about how to set up Kerberos on the Presto side:

  • The Presto coordinator principal (specified by the property http.server.authentication.krb5.service-name) must be HTTP.

  • The SSL keystore (specified by the property http-server.https.keystore.path) must contain as a common name the value below <presto_master_ip>. <presto_master_ip> can be a hostname, but that hostname must be in the /etc/hosts file of all of the Teradata nodes. When specifying a common name (CN) that is an IP address, it is necessary to also specify the subjectAltName extension. For example:

    keytool -genkeypair \
        -alias presto \
        -keyalg RSA \
        -keystore /etc/presto/keystore.jks \
        -keypass password \
        -storepass password \
        -dname "CN=<ip address>, OU=, O=, L=, S=, C=" -ext san=ip:<ip address>
    
  • It is possible to add multiple entries to one keystore by running the above command with different aliases, common names, and IPs. You should add all of the IP addresses by which it is possible to connect to Presto from Teradata (including bynet addresses).

To set up Kerberos on the Teradata side:

  1. Install version 1.4.3 or later of the krb5 and krb5-client packages on all of the nodes in the Teradata cluster. The krb5 packages are included on the standard Teradata Database Linux operating system DVD.

  2. Copy the krb5.conf file from your Presto cluster to /etc/krb5.conf on all of the nodes in the Teradata cluster.

  3. Add the hostname for the machine running the KDC to the /etc/hosts file on all of the nodes in the Teradata cluster.

  4. Install the appropriate Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files for your Java 8 JRE. Download them from here: http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html, and follow the README to install.

  5. Create an authorization object. This authorization object will store credentials for the remote Kerberos system to be used during execution of queries by the foreign server. For more information on authorization objects, see the Teradata database documentation. For example:

    CREATE AUTHORIZATION remote_user1 AS INVOKER TRUSTED
    USER '<kerberos_principal>' PASSWORD '<kerberos_password>';
    
  6. Create the foreign server:

    -- Execute following commands as presto_user
    CREATE FOREIGN SERVER ${FOREIGN_SERVER_NAME}
    EXTERNAL SECURITY INVOKER TRUSTED ${AUTH_OBJECT}
    USING
      HOSTTYPE('NVP')
      TemporaryTableCatalog('querygrid')
      DB_HOST('${PRESTO_HOST}')
    
      DB_PORT('${PRESTO_HTTPS_PORT}')
      security('kerberos')
      SSLCertificate('${CERT_PATH}')
    DO IMPORT WITH syslib.teradata_to_presto_import,
    DO EXPORT WITH syslib.teradata_to_presto_export;
    

${PRESTO_HTTPS_PORT} should be set to the port specified by the Presto property http-server.https.port., and you must copy the SSL certificate keystore specified by the Presto property http-server.https.keystore.path to ${CERT_PATH} on all of the Teradata nodes. Make sure that the file is readable by the user tdatuser; other than that, keep the permissions on this file as restrictive as possible. As stated above, ${PRESTO_HOST}, which can be either an IP address or a hostname, needs to be added to the keystore on Presto. ${AUTH_OBJECT} is the authorization object created above. The user presto_user must have access to that object.

Additional name value pairs as described in Optional Name Value Pairs or Performance Tuning Name Value Pairs may also be added to a server configured with Kerberos.

Technical Specification

Type Support

The following table refers to type handling/conversion between the Presto query engine and Teradata. Type handling can differ between Presto connectors or even between storage types within a connector. Please refer to the Presto documentation for specific connectors for more details.

Type in Presto Type in Teradata Notes
BOOLEAN BYTEINT  
BIGINT BIGINT  
DOUBLE REAL  
DECIMAL(X,Y) DECIMAL(X,Y) Max precision in both Presto and Teradata is 38; scale <= precision
VARCHAR VARCHAR Values exceeding MaximumTextLength (by default 4096 characters) will be truncated
DATE DATE  
TIMESTAMP TIMESTAMP  
VARBINARY VARBYTE Error thrown on values exceeding MaximumBinaryLength (by default 4096 bytes)
JSON VARCHAR  
ARRAY VARCHAR Serialized to/from json
MAP VARCHAR Serialized to/from json
ROW VARCHAR Serialized to json in import path. Writing ROW values to Presto is not supported

There is limited support for CHAR. CHAR columns can be exported from Teradata to Presto VARCHAR columns, since Presto does not have the CHAR type. Presto VARCHAR columns can be inserted into Teradata CHAR columns.

Note

If using the Hive connector for Presto, the underlying Hive datatype when inserting into a VARCHAR column must be either VARCHAR or STRING, because for Hive, Presto does not support inserting into VARCHAR(X) or CHAR(X) columns.

Known Issues/Limitations

  • Complex Presto types – ARRAY, MAP, and ROW – are translated to JSON and are visible in Teradata as strings. Large, complex objects may not be able to be selected, because their JSON string representation may exceed the Teradata maximum row length of 64KB.

  • Native Presto queries that have aggregations without explicitly specified aliases fail with the error, “Presto query failed: Column name not specified at position 1”.

    SELECT *
    FROM FOREIGN TABLE(
       SELECT count(*)
       FROM nation
    )@presto_server AS presto_query
    

    Provide an alias explicitly to make it work:

    SELECT *
    FROM FOREIGN TABLE(
       SELECT count(*) as some_explicit_alias
       FROM nation
    )@presto_server AS presto_query
    
  • Currently, T2P is limited to 20 concurrent queries. However, if your queries are long running or take a lot of resources (e.g. if they have a big result set), it is recommended to run fewer than 20 queries. Otherwise, you may get a Teradata error:

    [Error 7583] [SQLState HY000] The secure mode processes had a set up error.
    
  • High presto_writer_count can cause high CPU utilization. If the Presto starts to fail with the timeout related issue try to decrease presto_writer_count.

  • If you get the error, “Another instance of the Querygrid driver is still running.”, please restart your Teradata database with the tpareset command.