Use Airbyte to load data from external sources to Teradata Vantage
Author: Krutik Pathak
Last updated: June 9th, 2023
Overview
This tutorial showcases how to use Airbyte (an open-source Extract Load Transform tool) with Teradata Vantage. We work with a very simple end-to-end setup to load data from Google Sheets to Teradata Vantage using Airbyte.
-
Source: Google Sheets
-
Destination: Teradata Vantage
Prerequisites
-
Access to a Teradata Vantage Instance. This will be defined as the destination of the Airbyte connection. You will need a database
Host
,Username
, andPassword
for Airbyte’s configuration.If you need a test instance of Vantage, you can provision one for free at https://clearscape.teradata.com. -
Docker Compose to run Airbyte Open Source locally. Docker Compose comes with Docker Desktop. Please refer to docker docs for more details.
-
Data from the source system. In this case, we use a sample spreadsheet from google sheets. The sample data is a breakdown of payrate by employee type.
-
Google Cloud Platform API enabled for your personal or organizational account. You’ll need to authenticate your Google account via OAuth or via Service Account Key Authenticator. In this example, we use Service Account Key Authenticator.
Launch Airbyte Open Source
-
Clone the Airbyte Open Source repository and go to the airbyte directory.
git clone https://github.com/airbytehq/airbyte.git cd airbyte
Make Sure to have Docker Desktop running before running the shell script run-ab-platform
.
-
Run the shell script
run-ab-platform
as./run-ab-platform.sh
You can run the above commands with
git bash
in Windows. Please refer to the Airbyte Local Deployment for more details. -
Log in to the web app http://localhost:8000/ by entering the default credentials found in the
.env
file included in the repository.BASIC_AUTH_USERNAME=airbyte BASIC_AUTH_PASSWORD=password
When logging in for the first time, Airbyte will prompt you to provide your email address and specify your preferences for product improvements. Enter your preferences and click on "Get started."
Once Airbyte Open Source is launched you will see a connections dashboard. If you launched Airbyte Open Source for the first time, it would not show any connections.
Airbyte Configuration
Setting the Source Connection
-
You can either click "Create your first connection" or click on the top right corner to initiate the new connection workflow on Airbyte’s Connections dashboard.
-
Airbyte will ask you for the Source, you can select from an existing source (if you have set it up already) or you can set up a new source, in this case we select
Google Sheets
. -
For authentication we are using
Service Account Key Authentication
which uses a service account key in JSON format. Toggle from the defaultOAuth
toService Account Key Authentication
. To authenticate your Google account via Service Account Key Authentication, enter your Google Cloud service account key in JSON format.
Make sure the Service Account has the Project Viewer permission. If your spreadsheet is viewable by anyone with its link, no further action is needed. If not, give your Service account access to your spreadsheet. -
Add the link to the source spreadsheet as
Spreadsheet Link
.
For more details, please refer Setting Google Sheets as Source Connector in Airbyte Open Source |
-
Click Set up source, if the configuration is correct, you will get the message
All connection tests passed!
Setting the Destination Connection
-
Assuming you want to create a fresh new connection with
Teradata Vantage
, SelectTeradata Vantage
as the destination type under the "Set up the destination" section. -
Add the
Host
,User
, andPassword
. These are the same as theHost
,Username
, andPassword
respectively, used by your Clearscape Analytics Environment. -
Provide a default schema name appropriate to your specific context. Here we have provided
gsheet_airbyte_td
.
If you do not provide a |
-
Click Set up destination, if the configuration is correct, you will get the message
All connection tests passed!
You might get a configuration check failed error. Make sure your Teradata Vantage instance is running properly before making a connection through Airbyte. |
Configuring Data Sync
A namespace is a group of streams (tables) in a source or destination. A schema in a relational database system is an example of a namespace. In a source, the namespace is the location from where the data is replicated to the destination. In a destination, the namespace is the location where the replicated data is stored in the destination. For more details please refer to Airbyte Namespace.
In our example the destination is a database, so the namespace is the default schema gsheet_airbyte_td
we defined when we configured the destination. The stream name is a table that is mirroring the name of the spreadsheet in the source, which is sample_employee_payrate
in this case. Since we are using the single spreadsheet connector, it only supports one stream (the active spreadsheet).
Other type of sources and destinations might have a different layout. In this example, Google sheets, as source, does not support a namespace.
In our example, we have used <destination schema>
as the Namespace of the destination, this is the default namespace assigned by Airbyte based on the Default Schema
we declared in the destination settings. The database gsheet_airbyte_td
will be created in our Teradata Vantage Instance.
We use the term "schema", as it is the term used by Airbyte. In a Teradata context the term "database" is the equivalent. |
Replication Frequency
It shows how often data should sync to destination. You can select every hour, 2 hours, 3 hours etc. In our case we used every 24 hours.
You can also use a Cron expression to specify the time when the sync should run. In the example below, we set the Cron expression to run the sync on every Wednesday at 12:43 PM (US/Pacific) time.
Data Sync Validation
Airbyte tracks synchronization attempts in the "Sync History" section of the Status
tab.
Next, you can go to the ClearScape Analytics Experience and run a Jupyter notebook, notebooks in ClearScape Analytics Experience are configured to run Teradata SQL queries, to verify if the database gsheet_airbyte_td
, streams (tables) and complete data is present.
%connect local
SELECT DatabaseName, TableName, CreateTimeStamp, LastAlterTimeStamp
FROM DBC.TablesV
WHERE DatabaseName = 'gsheet_airbyte_td'
ORDER BY TableName;
DATABASE gsheet_airbyte_td;
SELECT * FROM _airbyte_raw_sample_employee_payrate;
The stream (table) name in destination is prefixed with _airbyte_raw_
because Normalization and Transformation are not supported for this connection, and we only have the raw table. Each stream (table) contains 3 columns:
-
_airbyte_ab_id
: a uuid assigned by Airbyte to each event that is processed. The column type in Teradata isVARCHAR(256)
. -
_airbyte_emitted_at
: a timestamp representing when the event was pulled from the data source. The column type in Teradata isTIMESTAMP(6)
. -
_airbyte_data
: a json blob representing the event data. The column type in Teradata isJSON
.
Here in the _airbyte_data
column, we see 9 rows, the same as we have in the source Google sheet, and the data is in JSON format which can be transformed further as needed.
Close and delete the connection
-
You can close the connection in Airbyte by disabling the connection. This will stop the data sync process.
-
You can also delete the connection.
Summary
This tutorial demonstrated how to extract data from a source system like Google sheets and use the Airbyte ELT tool to load the data into the Teradata Vantage Instance. We saw the end-to-end data flow and complete configuration steps for running Airbyte Open Source locally, and configuring the source and destination connections. We also discussed about the available data sync configurations based on replication frequency. We validated the results in the destination using Cloudscape Analytics Experience and finally we saw the methods to pause and delete the Airbyte connection.