S3 to Redshift

Table of Contents

1 Introduction
2 Prerequisites
3 Load data from S3 to Redshift into Redshift example with AWS CLI
4 Conclusion

Introduction

This tutorial shows you how to create a Redshift cluster resource, connect to Amazon Redshift, load sample data from S3 to Redshift into Redshift, and run queries with data usage command line tools.

You can use SQL Workbench or Amazon Redshift Query Editor v2.0 (web-based analyst workbench). In this tutorial, we choose to load sample data from an Amazon S3 bucket to Amazon Redshift using the PLSQL command-line tool.

psql is a terminal-based front-end to PostgreSQL. It enables you to type in queries interactively, issue them to PostgreSQL, and see the query results. Alternatively, input can be from a file or command line arguments. In addition, psql provides several meta-commands and various shell-like features to facilitate writing scripts and automating a wide variety of tasks.

Prerequisites

Before starting, you should have the following prerequisites configured

An AWS account
AWS CLI on your computer

Load data from S3 to Redshift into Redshift example with AWS CLI

Install PSQL on MacOS
Creating a data warehouse with Amazon Redshift using AWS CLI
Connect to the Redshift cluster using PSQL
Create Redshift cluster tables using PSQL
Redshift default role setting uses AWS Console manage
Loading sample data from S3 to Redshift with PSQL
Delete the sample cluster using AWS CLI

Install PSQL on MacOS

We can choose a version from PostgresSQL page or execute the following command on MacOS

brew install postgresql

Creating a data warehouse with Amazon Redshift using AWS CLI

Before you begin, If you have not installed the AWS CLI, see Setting up the Amazon Redshift CLI. This tutorial uses the us-east-1 region.

Creating a data warehouse with Amazon Redshift use AWS CLI

Now we’re ready to launch a cluster by using the AWS CLI.

The create-cluster the command has a large number of parameters. For this tutorial, you will use the parameter values that are described in the following table. Before you create a cluster in a production environment, we recommend that you review all the required and optional parameters so that your cluster configuration matches your requirements. For more information, see create-cluster

Parameter name	Parameter value for this exercise
cluster-identifier	examplecluster
master-username	awsuser
master-user-password	Awsuser123
node-type	dc2.large
cluster-type	single-node

Run the following command to create a cluster.

aws redshift create-cluster --cluster-identifier examplecluster --master-username awsuser --master-user-password Awsuser123 --node-type dc2.large --cluster-type single-node

This command returns the following result.

The cluster creation process will take several minutes to complete. To check the status, enter the following command.

aws redshift describe-clusters --cluster-identifier examplecluster | grep ClusterStatus

When the ClusterStatus field changes from creating to available, the cluster is ready for use.

Connect to the Redshift cluster using PSQL

Run the following command to connect to the Redshift cluster.

psql -h examplecluster.ccfmryooawwy.us-east-1.redshift.amazonaws.com -U awsuser -d dev -p 5439

You must explicitly grant inbound access to your client to connect to the cluster. When you created a cluster in the previous step, because you did not specify a security group, you associated the default cluster security group with the cluster.

The default cluster security group contains no rules to authorize any inbound traffic to the cluster. To access the new cluster, you must add rules for inbound traffic, which are called ingress rules, to the cluster security group. If you are accessing your cluster from the Internet, you will need to authorize a Classless Inter-Domain Routing IP (CIDR/IP) address range.

#get VpcSecurityGroupId
aws redshift describe-clusters --cluster-identifier examplecluster | grep VpcSecurityGroupId

Connect to the Redshift cluster use PSQL

Run the following command to enable your computer to connect to your Redshift cluster. Then login into your cluster using psql.

#allow connect to cluster from my computer
aws ec2 authorize-security-group-ingress --group-id sg-083f2ca0560111a3b --protocol tcp --port 5439 --cidr 111.111.111.111/32

This command returns the following result.

Now test the connection by querying the system table

Create Redshift cluster tables using PSQL

In this tutorial, I use sample data from AWS. Run the following command to create Redshift tables.

create table users(
userid integer not null distkey sortkey,
username char(8),
firstname varchar(30),
lastname varchar(30),
city varchar(30),
state char(2),
email varchar(100),
phone char(14),
likesports boolean,
liketheatre boolean,
likeconcerts boolean,
likejazz boolean,
likeclassical boolean,
likeopera boolean,
likerock boolean,
likevegas boolean,
likebroadway boolean,
likemusicals boolean);                        

create table event(
eventid integer not null distkey,
venueid smallint not null,
catid smallint not null,
dateid smallint not null sortkey,
eventname varchar(200),
starttime timestamp);

create table sales(
salesid integer not null,
listid integer not null distkey,
sellerid integer not null,
buyerid integer not null,
eventid integer not null,
dateid smallint not null sortkey,
qtysold smallint not null,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp);

This command returns the following result.

Test by querying the public.sales table as follows

select * from public.sales;

Redshift default role setting uses AWS Console manage

Before you can load data from Amazon S3, you must first create an IAM role with the necessary permissions and attach it to your cluster. To do this refer to AWS document

Redshift default role setting use AWS Console manage

Loading sample data from S3 to Redshift with PSQL

Use the COPY command to load large datasets from Amazon S3 into Amazon Redshift. For more information about COPY syntax, see COPY in the Amazon Redshift Database Developer Guide.

Run the following SQL commands in PSQL to load data from S3 to Redshift

COPY users 
FROM 's3://redshift-downloads/tickit/allusers_pipe.txt' 
DELIMITER '|' 
TIMEFORMAT 'YYYY-MM-DD HH:MI:SS'
IGNOREHEADER 1 
REGION 'us-east-1'
IAM_ROLE default;                    
                    
COPY event
FROM 's3://redshift-downloads/tickit/allevents_pipe.txt' 
DELIMITER '|' 
TIMEFORMAT 'YYYY-MM-DD HH:MI:SS'
IGNOREHEADER 1 
REGION 'us-east-1'
IAM_ROLE default;

COPY sales
FROM 's3://redshift-downloads/tickit/sales_tab.txt' 
DELIMITER '\t' 
TIMEFORMAT 'MM/DD/YYYY HH:MI:SS'
IGNOREHEADER 1 
REGION 'us-east-1'
IAM_ROLE default;

Loading sample data from S3 to Redshift with PSQL

After loading data, try some example queries.

\timing

SELECT firstname, lastname, total_quantity 
FROM   (SELECT buyerid, sum(qtysold) total_quantity
        FROM  sales
        GROUP BY buyerid
        ORDER BY total_quantity desc limit 10) Q, users
WHERE Q.buyerid = userid
ORDER BY Q.total_quantity desc;

Now that you’ve loaded data into Redshift.

Delete the sample cluster using AWS CLI

When you delete a cluster, you must decide whether to create a final snapshot. Because this is an exercise and your test cluster should not have any important data in it, you can skip the final snapshot.

To delete your cluster, enter the following command.

aws redshift delete-cluster –cluster-identifier examplecluster –skip-final-cluster-snapshot

Congratulations! You successfully launched, authorized access to, connected to, and terminated a cluster.

Conclusion

These steps provide an example of loading data from S3 to Redshift into Redshift with the PSQL tool. The specific configuration details may vary depending on your environment and setup. It’s recommended to consult the relevant documentation from AWS for detailed instructions on setting up. I hope will this be helpful. Thank you for reading the DevopsRoles page!

Refer

https://docs.aws.amazon.com/redshift/latest/mgmt/getting-started-cli.html#getting-started-create-sample-db-cli

https://docs.aws.amazon.com/redshift/latest/gsg/new-user-serverless.html

DevopsRoles.com

Devops Tutorial

Introduction

Prerequisites

Load data from S3 to Redshift into Redshift example with AWS CLI

Install PSQL on MacOS

Creating a data warehouse with Amazon Redshift using AWS CLI

Connect to the Redshift cluster using PSQL

Create Redshift cluster tables using PSQL

Redshift default role setting uses AWS Console manage

Loading sample data from S3 to Redshift with PSQL

Delete the sample cluster using AWS CLI

Conclusion

About Dang Nhu Hieu

Leave a Reply Cancel reply

Introduction

Prerequisites

Load data from S3 to Redshift into Redshift example with AWS CLI

Install PSQL on MacOS

Creating a data warehouse with Amazon Redshift using AWS CLI

Connect to the Redshift cluster using PSQL

Create Redshift cluster tables using PSQL

Redshift default role setting uses AWS Console manage

Loading sample data from S3 to Redshift with PSQL

Delete the sample cluster using AWS CLI

Conclusion

Related Posts

Leave a Reply Cancel reply