Bobcares

AWS DMS: Migrate data to Amazon S3 – How to do


Wondering how to use AWS DMS to migrate data to Amazon S3? We can help you!

Here at Bobcares, we often handle requests to migrate data using AWS DMS for our customers using AWS as a part of our Server Management Services.

Today let’s see how our Support Engineers do this for our customers.

Using AWS DMS: Migrate data to Amazon S3

Here we will be migrating data in Apache Parquet (.parquet) format to Amazon Simple Storage Service (Amazon S3).

We can migrate data to an S3 bucket in Apache Parquet format if we use AWS DMS replication engine version 3.1.3 or later. The default Parquet version is Parquet 1.0.

Following are the steps that our Support Engineers use for the migration:

1. First we have to create a target Amazon S3 endpoint from the AWS DMS console.

2. Then add an extra connection attribute (ECA) using the following:

dataFormat=parquet;

We should also check the other extra connection attributes that we can use for storing Parquet objects in an S3 target.

Or, create a target Amazon S3 endpoint using the following create-endpoint command in the AWS Command Line Interface (AWS CLI):

aws dms create-endpoint --endpoint-identifier s3-target-parquet --engine-name s3 --endpoint-type target --s3-settings '{"ServiceAccessRoleArn": "<IAM role ARN for S3 endpoint>", "BucketName": "<S3 bucket name to migrate to>", "DataFormat": "parquet"}'
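Quoting the `--s3-settings` JSON by hand is error-prone (every string value must be double-quoted). A short Python sketch can build and validate the payload before handing it to the CLI; the role ARN and bucket name below are placeholder values, not details from this article:

```python
import json

# Placeholder values -- substitute your own IAM role ARN and bucket name.
s3_settings = {
    "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access",
    "BucketName": "my-dms-target-bucket",
    "DataFormat": "parquet",
}

# json.dumps guarantees valid JSON, so the CLI never sees bad quoting.
settings_json = json.dumps(s3_settings)

command = (
    "aws dms create-endpoint "
    "--endpoint-identifier s3-target-parquet "
    "--engine-name s3 --endpoint-type target "
    f"--s3-settings '{settings_json}'"
)
print(command)
```

Running the printed command still requires valid AWS credentials and an existing IAM role with write access to the bucket.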

3. After that, we can use the following extra connection attribute to specify the Parquet version of the output file:

parquetVersion=PARQUET_2_0;
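Extra connection attributes are combined into a single semicolon-separated string, so the two attributes from steps 2 and 3 can be set together on the endpoint as:

parquetVersion=PARQUET_2_0;dataFormat=parquet;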

4. Then run the describe-endpoints command to confirm that the S3 endpoint we created has the S3 setting DataFormat, or the extra connection attribute dataFormat, set to "parquet".

To check the S3 setting DataFormat, we can use the following command:

aws dms describe-endpoints --filters Name=endpoint-arn,Values=<S3 target endpoint ARN> --query "Endpoints[].S3Settings.DataFormat"
[
    "parquet"
]

5. If the value of the DataFormat parameter is CSV instead, then we must recreate the endpoint.
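The check in steps 4 and 5 can be scripted. The sketch below parses the JSON that describe-endpoints prints (the sample strings mirror the output shown above) and flags an endpoint that must be recreated; `needs_recreation` is a hypothetical helper name, not part of any AWS tooling:

```python
import json

def needs_recreation(describe_output: str) -> bool:
    """Return True if the S3 endpoint's DataFormat is not 'parquet'.

    `describe_output` is the JSON list printed by:
    aws dms describe-endpoints --filters Name=endpoint-arn,Values=<ARN> \
        --query "Endpoints[].S3Settings.DataFormat"
    """
    formats = json.loads(describe_output)
    # An empty list means no matching endpoint; treat that as a problem too.
    return not formats or any(f != "parquet" for f in formats)

# A correctly configured endpoint, as in the output shown above:
print(needs_recreation('["parquet"]'))  # False -> endpoint is fine
print(needs_recreation('["csv"]'))      # True  -> recreate the endpoint
```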

6. After we get the output in Parquet format, we can parse the output file by installing the parquet-cli command line tool:

pip install parquet-cli --user

7. Then, inspect the file format:

parq LOAD00000001.parquet 
 # Metadata 
 <pyarrow._parquet.FileMetaData object at 0x10e948aa0>
  created_by: AWS
  num_columns: 2
  num_rows: 2
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 169

8. Finally, we can print the file content:

parq LOAD00000001.parquet --head
   i        c
0  1  insert1
1  2  insert2


Conclusion

To conclude, we saw the steps that our Support Techs follow to migrate data to Amazon S3.
