Amazon S3 is a popular cloud storage service that provides a scalable and secure storage option for large files. It is commonly used to store and distribute files, including data in the CSV (Comma Separated Values) format. In this tutorial, we will show you how to read CSV files from an S3 bucket using Python.
Prerequisites:
- An AWS account and an S3 bucket
- AWS CLI and the boto3 library installed on your system (the later examples also use pandas, and PyYAML if you load credentials from a YAML file)
- Python 3 installed
Step 1: Set up the AWS CLI
To use the AWS CLI, you need to configure it with your AWS credentials. To do this, run the following command:
aws configure
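You will be prompted to enter your AWS access key ID, secret access key, default region name, and default output format. The credentials saved by this command are also picked up automatically by boto3, so a client created without explicit keys will use them.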
Alternatively, if you prefer to keep settings in a config file, create a Config class that reads the credentials from a YAML file and exposes each setting you need as a property:
import yaml

class Config:
    def __init__(self, configyamlfile):
        self.settings = dict()
        with open(configyamlfile, "r") as stream:
            try:
                # safe_load parses plain YAML without executing arbitrary tags
                self.settings = yaml.safe_load(stream)
            except yaml.YAMLError as exception:
                print(exception)

    # Expose each setting as a read-only property
    @property
    def User(self):
        return self.settings["user"]
Step 2: Install the boto3 library
To read files from an S3 bucket, we will use boto3, the AWS SDK for Python. If it is not already installed, run the following command:
pip install boto3
Step 3: Read the CSV file from the S3 bucket
We will create a class, CsvReader, that reads CSV files from S3. The class can support different reading modes, such as reading an individual CSV file, reading a zip archive containing multiple CSV files, or processing a local CSV file downloaded in a previous run. We start by importing the necessary libraries and initializing a boto3 session and a boto3 client.
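import boto3

class CsvReader:
    def __init__(self, config):
        self.config = config
        # ACCESS_KEY, SECRET_ACCESS_KEY and REGION are properties on Config
        self.session = boto3.Session(
            aws_access_key_id=self.config.ACCESS_KEY,
            aws_secret_access_key=self.config.SECRET_ACCESS_KEY,
            region_name=self.config.REGION)
        self.client = self.session.client('s3')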
Next, we list the files under a folder in the bucket using a prefix and process them one by one. For each file, we build a local file name from the key and download the file:
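# A single call returns at most 1,000 keys; list_objects_v2 is the current API
response = self.client.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
for content in response.get('Contents', []):
    s3filepath = content.get('Key')
    # Use the last path segment of the key as the local file name
    name = s3filepath.rsplit('/', 1)[-1]
    if not name:
        continue  # skip "folder" placeholder keys that end with '/'
    self.client.download_file(bucket_name, s3filepath, name)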
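A single listing call returns at most 1,000 keys, so a larger folder needs pagination. The sketch below is one way to handle that, assuming a boto3 S3 client is passed in. The names download_prefix and dest_dir are hypothetical and not part of the original class.

```python
import os


def download_prefix(client, bucket_name, prefix, dest_dir="."):
    """Download every object under `prefix`; return the local paths."""
    os.makedirs(dest_dir, exist_ok=True)
    local_paths = []
    # The paginator issues follow-up requests until all keys are listed
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for content in page.get("Contents", []):
            key = content["Key"]
            if key.endswith("/"):
                continue  # skip "folder" placeholder keys
            local_path = os.path.join(dest_dir, os.path.basename(key))
            client.download_file(bucket_name, key, local_path)
            local_paths.append(local_path)
    return local_paths
```

With a real client this could be called as download_prefix(boto3.client('s3'), bucket_name, prefix). Note that files with the same base name in different subfolders would overwrite each other locally.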
Finally, we can load the file into a pandas DataFrame for analysis or visualization. The call below passes two extra read_csv parameters: on_bad_lines='warn' reports malformed rows and skips them instead of raising an error, and nrows=3 reads only the first three rows. I will discuss these in the next article. For a plain read, just pass the file name.
import pandas as pd
df = pd.read_csv(file, on_bad_lines='warn', nrows=3)
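For a single file of moderate size, the download step can be skipped by handing the object body to pandas directly. This is a minimal sketch, assuming a boto3 S3 client is passed in; read_csv_from_s3 is a hypothetical helper name.

```python
import io

import pandas as pd


def read_csv_from_s3(client, bucket_name, key, **read_csv_kwargs):
    """Load one CSV object from S3 into a DataFrame without a local copy."""
    response = client.get_object(Bucket=bucket_name, Key=key)
    # Body is a streaming object; read it fully and wrap it in a buffer
    buffer = io.BytesIO(response["Body"].read())
    return pd.read_csv(buffer, **read_csv_kwargs)
```

Any read_csv parameter, such as nrows, can be forwarded through the keyword arguments. The whole object is held in memory, so this suits files that comfortably fit in RAM.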
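What if you need to process zip files in S3 buckets? If the archive is large (> 2 GB), download it locally like any other file and process it from disk.
If the zip file is a reasonable size, you can read it into memory and process the CSV files inside it without saving the archive itself to disk. In the code below, each member is extracted to the current working directory before it is processed.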
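import io
from zipfile import ZipFile

s3 = self.session.resource("s3")
bucket = s3.Bucket(BUCKET_NAME)
obj = bucket.Object(prefix)  # prefix must be the full key of the zip file
# Read the whole archive into an in-memory buffer
with io.BytesIO(obj.get()["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)
    # Read the file as a zipfile and process the members
    with ZipFile(tf, mode='r') as zipf:
        for subfile in zipf.namelist():
            # Strip the 4-character ".csv" extension
            name = subfile[:-4]
            # Extract the member to the current working directory
            filepath = zipf.extract(subfile)
            print(f"processing {subfile} at {filepath}")
            self.process_csvfile(filepath, name)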
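The loop above still writes each member to disk through extract. If the members should stay in memory as well, they can be opened straight from the archive. The following is a sketch under that assumption: iter_csv_members is a hypothetical name, it takes the raw bytes of the zip (for example the result of obj.get()["Body"].read()), and it assumes the CSV files are UTF-8 encoded.

```python
import io
from zipfile import ZipFile


def iter_csv_members(zip_bytes):
    """Yield (name, text_stream) for each .csv member of a zip archive."""
    with ZipFile(io.BytesIO(zip_bytes), mode="r") as zipf:
        for member in zipf.namelist():
            if not member.lower().endswith(".csv"):
                continue  # skip directories and non-CSV members
            name = member[:-4]  # drop the ".csv" extension
            with zipf.open(member) as raw:
                # Decode to text (UTF-8 assumed) so CSV readers can consume it
                yield name, io.TextIOWrapper(raw, encoding="utf-8", newline="")
```

Each stream can be passed to pd.read_csv(stream) in place of a file path. A stream is only valid until the loop moves on to the next member, so it should be consumed inside the loop body.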
In conclusion, reading CSV files from an S3 bucket in Python is a simple process that can be accomplished using the boto3 library. With just a few lines of code, you can retrieve and process data stored in an S3 bucket, making it a convenient and scalable option for data storage and distribution.