Query data stored in object storage

Author: Adam Tworkiewicz
Last updated: September 7th, 2021

Overview

Native Object Storage (NOS) is a Vantage feature that allows you to query data stored in files in object storage such as AWS S3, Google GCS, Azure Blob or on-prem implementations. It’s useful in scenarios where you want to explore data without building a data pipeline to bring it into Vantage.

Prerequisites

You need access to a Teradata Vantage instance. NOS is enabled in all Vantage editions from Vantage Express through Developer, DYI to Vantage as a Service starting from version 17.10.

If you need a test instance of Vantage, you can provision one for free at https://clearscape.teradata.com.

Explore data with NOS

Currently, NOS supports CSV, JSON (as array or new-line delimited), and Parquet data formats.

Let’s say you have a dataset stored as CSV files in an S3 bucket. You want to explore the dataset before you decide if you want to bring it into Vantage. For this scenario, we are going to use a public dataset published by Teradata that contains river flow data collected by the U.S. Geological Survey. The bucket is at https://td-usgs-public.s3.amazonaws.com/.

Let’s first have a look at sample CSV data. We take the first 10 rows that Vantage will fetch from the bucket:

SELECT
  TOP 10 *
FROM (
	LOCATION='/s3/td-usgs-public.s3.amazonaws.com/CSVDATA/'
) AS d;

Here is what I’ve got:

GageHeight2 Flow   site_no datetime         Precipitation GageHeight
----------- ----- -------- ---------------- ------------- -----------
10.9        15300 09380000 2018-06-28 00:30 671           9.80
10.8        14500 09380000 2018-06-28 01:00 673           9.64
10.7        14100 09380000 2018-06-28 01:15 672           9.56
11.0        16200 09380000 2018-06-27 00:00 669           9.97
10.9        15700 09380000 2018-06-27 00:30 668           9.88
10.8        15400 09380000 2018-06-27 00:45 672           9.82
10.8        15100 09380000 2018-06-27 01:00 672           9.77
10.8        14700 09380000 2018-06-27 01:15 672           9.68
10.9        16000 09380000 2018-06-27 00:15 668           9.93
10.8        14900 09380000 2018-06-28 00:45 672           9.72

We have got plenty of numbers, but what do they mean? To answer this question, we will ask Vantage to detect the schema of the CSV files:

SELECT
  *
FROM (
	LOCATION='/s3/td-usgs-public.s3.amazonaws.com/CSVDATA/'
	RETURNTYPE='NOSREAD_SCHEMA'
) AS d;

Vantage will now fetch a data sample to analyze the schema and return results:

Name            Datatype                            FileType  Location
--------------- ----------------------------------- --------- -------------------------------------------------------------------
GageHeight2     decimal(3,2)                        csv       /S3/s3.amazonaws.com/td-usgs-public/CSVDATA/09513780/2018/06/27.csv
Flow            decimal(3,2)                        csv       /S3/s3.amazonaws.com/td-usgs-public/CSVDATA/09513780/2018/06/27.csv
site_no         int                                 csv       /S3/s3.amazonaws.com/td-usgs-public/CSVDATA/09513780/2018/06/27.csv
datetime        TIMESTAMP(0) FORMAT'Y4-MM-DDBHH:MI' csv       /S3/s3.amazonaws.com/td-usgs-public/CSVDATA/09513780/2018/06/27.csv
Precipitation   decimal(3,2)                        csv       /S3/s3.amazonaws.com/td-usgs-public/CSVDATA/09513780/2018/06/27.csv
GageHeight      decimal(3,2)                        csv       /S3/s3.amazonaws.com/td-usgs-public/CSVDATA/09513780/2018/06/27.csv

We see that the CSV files have 6 columns. For each column, we get the name, the datatype and the file coordinates that were used to infer the schema.

Query data with NOS

Now that we know the schema, we can work with the dataset as if it was a regular SQL table. To prove the point, let’s try to do some data aggregation. Let’s get an average temperature per site for sites that collect temperatures.

SELECT
  site_no Site_no, AVG(Flow) Avg_Flow
FROM (
  LOCATION='/s3/td-usgs-public.s3.amazonaws.com/CSVDATA/'
) AS d
GROUP BY
  site_no
HAVING
  Avg_Flow IS NOT NULL;

Result:

Site_no  Avg_Flow
-------- ---------
09380000 11
09423560 73
09424900 93
09429070 81

To register your ad hoc exploratory activity as a permanent source, create it as a foreign table:

-- If you are running this sample as dbc user you will not have permissions
-- to create a table in dbc database. Instead, create a new database and use
-- the newly create database to create a foreign table.

CREATE DATABASE Riverflow
  AS PERMANENT = 60e6, -- 60MB
  SPOOL = 120e6; -- 120MB

-- change current database to Riverflow
DATABASE Riverflow;

CREATE FOREIGN TABLE riverflow
  USING ( LOCATION('/s3/td-usgs-public.s3.amazonaws.com/CSVDATA/') );

SELECT top 10 * FROM riverflow;

Result:

Location                                                            GageHeight2 Flow site_no datetime            Precipitation GageHeight
------------------------------------------------------------------- ----------- ---- ------- ------------------- ------------- ----------
/S3/s3.amazonaws.com/td-usgs-public/CSVDATA/09429070/2018/07/02.csv null        null 9429070 2018-07-02 14:40:00 1.21          null
/S3/s3.amazonaws.com/td-usgs-public/CSVDATA/09400815/2018/07/10.csv null        0.00 9400815 2018-07-10 00:30:00 0.00          -0.01
/S3/s3.amazonaws.com/td-usgs-public/CSVDATA/09400815/2018/07/10.csv null        0.00 9400815 2018-07-10 00:45:00 0.00          -0.01
/S3/s3.amazonaws.com/td-usgs-public/CSVDATA/09400815/2018/07/10.csv null        0.00 9400815 2018-07-10 01:00:00 0.00          -0.01
/S3/s3.amazonaws.com/td-usgs-public/CSVDATA/09400815/2018/07/10.csv null        0.00 9400815 2018-07-10 00:15:00 0.00          -0.01
/S3/s3.amazonaws.com/td-usgs-public/CSVDATA/09429070/2018/07/02.csv null        null 9429070 2018-07-02 14:38:00 1.06          null

This time, the SELECT statement looks like a regular select against an in-database table. If you require subsecond response time when querying the data, there is an easy way to bring the CSV data into Vantage to speed things up. Read on to find out how.

Load data from NOS into Vantage

Querying object storage takes time. What if you decided that the data looks interesting and you want to do some more analysis with a solution that will you quicker answers? The good news is that data returned with NOS can be used as a source for CREATE TABLE statements. Assuming you have CREATE TABLE privilege, you will be able to run:

This query assumes you created database Riverflow and a foreign table called riverflow in the previous step.

-- This query assumes you created database `Riverflow`
-- and a foreign table called `riverflow` in the previous step.

CREATE MULTISET TABLE riverflow_native (site_no, Flow, GageHeight, datetime)
AS (
  SELECT site_no, Flow, GageHeight, datetime FROM riverflow
) WITH DATA
NO PRIMARY INDEX;

SELECT TOP 10 * FROM riverflow_native;

Result:

site_no   Flow  GageHeight  datetime
-------  -----  ----------  -------------------
9400815    .00        -.01  2018-07-10 00:30:00
9400815    .00        -.01  2018-07-10 01:00:00
9400815    .00        -.01  2018-07-10 01:15:00
9400815    .00        -.01  2018-07-10 01:30:00
9400815    .00        -.01  2018-07-10 02:00:00
9400815    .00        -.01  2018-07-10 02:15:00
9400815    .00        -.01  2018-07-10 01:45:00
9400815    .00        -.01  2018-07-10 00:45:00
9400815    .00        -.01  2018-07-10 00:15:00
9400815    .00        -.01  2018-07-10 00:00:00

This time, the SELECT query returned in less than a second. Vantage didn’t have to fetch the data from NOS. Instead, it answered using data that was already on its nodes.

Access private buckets

So far, we have used a public bucket. What if you have a private bucket? How do you tell Vantage what credentials it should use?

It is possible to inline your credentials directly into your query:

SELECT
  TOP 10 *
FROM (
  LOCATION='/s3/td-usgs-public.s3.amazonaws.com/CSVDATA/'
  AUTHORIZATION='{"ACCESS_ID":"","ACCESS_KEY":""}'
) AS d;

Entering these credentials all the time can be tedious and less secure. In Vantage, you can create an authorization object that will serve as a container for your credentials:

CREATE AUTHORIZATION aws_authorization
  USER 'YOUR-ACCESS-KEY-ID'
  PASSWORD 'YOUR-SECRET-ACCESS-KEY';

You can then reference your authorization object when you create a foreign table:

CREATE FOREIGN TABLE riverflow
, EXTERNAL SECURITY aws_authorization
USING ( LOCATION('/s3/td-usgs-public.s3.amazonaws.com/CSVDATA/') );

Export data from Vantage to object storage

So far, we have talked about reading and importing data from object storage. Wouldn’t it be nice if we had a way to use SQL to export data from Vantage to object storage? This is exactly what WRITE_NOS function is for. Let’s say we want to export data from riverflow_native table to object storage. You can do so with the following query:

SELECT * FROM WRITE_NOS (
  ON ( SELECT * FROM riverflow_native )
  PARTITION BY site_no ORDER BY site_no
  USING
    LOCATION('YOUR-OBJECT-STORE-URI')
    AUTHORIZATION(aws_authorization)
    STOREDAS('PARQUET')
    COMPRESSION('SNAPPY')
    NAMING('RANGE')
    INCLUDE_ORDERING('TRUE')
) AS d;

Here, we instruct Vantage to take data from riverflow_native and save it in YOUR-OBJECT-STORE-URI bucket using parquet format. The data will be split into files by site_no attribute. The files will be compressed.

Summary

In this quick start we have learned how to read data from object storage using Native Object Storage (NOS) functionality in Vantage. NOS supports reading and importing data stored in CSV, JSON and Parquet formats. NOS can also export data from Vantage to object storage.

Query data stored in object storage

Overview

Prerequisites

Explore data with NOS

Query data with NOS

Load data from NOS into Vantage

Access private buckets

Export data from Vantage to object storage

Summary

Further reading