- Prerequisites
 - Overview
 - Architecture
 - System Setup
 - DAG Explanation and Scheduling
 - System Benefits
 - Conclusion
 
- System OS: Ubuntu 22.04 or any Linux distro that can run Docker Desktop without issues
 - Python Version: 3.9
 - Reddit App API Credentials
 - AWS Account with appropriate permissions for S3, Glue, Athena, and Redshift
 - Dedicated Docker Hardware Specs:
   - RAM: >=5GB
   - Swapfile: >=2GB
   - CPU: >=4 cores
 
 
This system consists of multiple pipelines, each with its own functions:

- `etl_reddit_pipeline`:
  - Extract data from Reddit using its API (a minimal extraction sketch follows this list).
  - Store the Reddit data, transformed (timestamped) by Airflow, in the output folder.
  - Upload the output folder data to an S3 bucket from Airflow.
  - Transform the data using AWS Glue (locally using the AWS Glue Docker container or on the cloud using an AWS Glue job).
  - Catalog the transformed data using the AWS Glue crawler.
  - Load the transformed data into Amazon Redshift, Amazon Athena, Power BI, Amazon QuickSight, Tableau, and Looker Studio for analytics and querying.
- `manage_s3_logs_pipeline`:
  - Delete Airflow DAG run logs that are older than 30 days from the Airflow S3 logs bucket.
- `manage_output_files_pipeline`:
  - Delete Reddit data files in the output folder and the shared_folder that are older than 30 days.
  - Copy the newer Reddit data files from the output folder to the shared_folder (we assume it is a folder shared between multiple computers).
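
For illustration, here is a minimal sketch of the extraction step. It assumes the PRAW client library and placeholder credentials, subreddit, and fields; the project's actual extraction task may collect different data or use a different client.

```python
# Minimal sketch of the Reddit extraction step (assumes the PRAW library;
# the subreddit and fields below are placeholders, not the project's exact ones).
from pathlib import Path

import pandas as pd
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # from your Reddit app credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit_etl_pipeline",
)

posts = []
for submission in reddit.subreddit("dataengineering").top(limit=100):
    posts.append(
        {
            "id": submission.id,
            "title": submission.title,
            "score": submission.score,
            "num_comments": submission.num_comments,
            "created_utc": submission.created_utc,
        }
    )

# Airflow later prefixes the CSV filename with a timestamp before it lands
# in the output folder.
Path("output").mkdir(exist_ok=True)
pd.DataFrame(posts).to_csv("output/reddit_posts.csv", index=False)
```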
 
- Reddit API: Source of the data
 - Apache Airflow & Celery: Orchestrates the ETL process and manages task distribution
 - PostgreSQL: Temporary storage and metadata management
 - Amazon S3: Raw data storage and Airflow logs storage
 - AWS Glue: Data cataloging and ETL jobs
 - Amazon Athena: SQL-based data transformation and analytics
 - Amazon Redshift: Data warehousing and analytics
 
- After creating the IAM user (any username you want), add an access key to it, then export and save the credentials, as we will need them later.

- Set up the Reddit data S3 bucket with any name you want (remember the name, as we will need it later) with the required folders inside it (a scripted boto3 alternative for this step and the next is sketched after this setup list).

- Set up the Airflow logs S3 bucket with any name you want (remember the name, as we will need it later) with the required folders inside it.

- Clone this repo, then execute `setup_folders.bash`.

- Inside `remote_glue_reddit_job.py`, set the `aws_s3_raw_file_name` variable to any filename you want; do the same for the `aws_s3_transformed_file_name` variable (remember their names, as we will need them later).

- Set up the AWS Glue ETL job using `remote_glue_reddit_job.py`. Remember the name of the job, as we will need it later, and do not run the job after saving it.

- Set up the AWS Glue crawler: point its data source location to the transformed folder and set the database into which it will dump the output data catalog. Remember the name of the crawler, as we will need it later, and do not run the crawler.

- If you haven't installed the AWS CLI, install it in your home folder using the "AWS CLI Installation Guide", then configure it with the `aws configure` command. Enter the access key ID of the IAM user you created previously, then the secret key, the region, and finally the output type (make it JSON).

- In the `airflow.cfg` file, write the path of the Airflow log folder in the `remote_base_log_folder` variables in the logging and core sections, but keep all the lines commented out.

- In the `.env` file, write the path to your `.aws` folder in `AWS_CREDINTIALS_PATH`, write your AWS profile name in `AWS_PROFILE`, and finally write the path to your Docker socket (usually `/var/run/docker.sock`).

- In the `config.conf` file, set the filename of the Reddit CSV file inside the output folder in `reddit_output_file_name` (it will be prefixed with a timestamp by the Reddit DAG). Write your Reddit API keys, AWS access key ID, secret key, and region. Then fill in your AWS Reddit data bucket name in `aws_bucket_name`, the Glue job name, the Glue crawler name, the name of the raw file in the `raw` S3 folder that will be preprocessed by the Glue job (must match `aws_s3_raw_file_name` in `remote_glue_reddit_job.py`), the name of the transformed file in the `transformed` S3 folder (must match `aws_s3_transformed_file_name` in `remote_glue_reddit_job.py`), the Airflow log bucket, the Airflow log key (folder location) inside the log bucket, and finally whether to use the local Glue script by setting `use_local_glue_transform_script` to either true or false.
- (Optional) Set up a Python 3.9 virtual environment named `.venv` inside the project folder using `venv`, then install the dependencies from `requirements-dev.txt` with `pip` inside the virtual environment.

- Start the Docker containers with `docker compose up -d --build`.

- If the Airflow web server didn't start up correctly the first time (which often happens), turn off the entire container stack from the Docker GUI and start it again from the start button, or simply rebuild with `docker compose up -d --build`.
- When you log in to the Airflow server with the username and password "admin", you will find that all the DAGs are paused by default. That is because `is_paused_upon_creation` is set to true on all DAGs in the `dags` folder, and the DAGs get created (and therefore paused) on the first successful launch of the Airflow server. This is good, because we need to set up one last thing on the server.

- Click on the admin button and you will see a dropdown menu. Select Connections and set up a new one using the plus button. The connection type is AWS (Amazon Web Services); write the ID of the connection (it must match `remote_log_conn_id` in the `airflow.cfg` file), then enter the access key ID and secret key of the IAM user we created earlier. Test the connection; if a green bar appears at the top of the page, the connection is good to go and you can save it. (This connection ships the Airflow DAG run logs to our Airflow logs S3 bucket, achieving log shipping without writing a single line of code.)

- Turn off the Docker container stack again from the GUI, then uncomment all the lines in `airflow.cfg`, and lastly turn the container stack back on.

- Congratulations, the project is now set up and ready to go. First, enable the `etl_reddit_pipeline` DAG using the toggle right next to it, then watch its progress by clicking on it. All the tasks should complete successfully if everything is set up correctly. After the DAG finishes successfully, you can enable the remaining DAGs, `manage_output_files_pipeline` and `manage_s3_logs_pipeline`.
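
If you prefer scripting the bucket creation steps above instead of clicking through the AWS console, here is a hedged boto3 sketch. The region, bucket names, and folder prefixes are placeholders; substitute the names you chose in your own setup.

```python
# Sketch only: verify credentials and create the two buckets with folder
# prefixes. Bucket/folder names below are placeholders -- use your own.
import boto3

REGION = "us-east-1"                      # assumption: adjust to your region
REDDIT_BUCKET = "my-reddit-data-bucket"   # placeholder name
LOGS_BUCKET = "my-airflow-logs-bucket"    # placeholder name

session = boto3.Session()                 # uses the profile from `aws configure`
print(session.client("sts").get_caller_identity()["Arn"])  # sanity-check credentials

s3 = session.client("s3", region_name=REGION)
for bucket in (REDDIT_BUCKET, LOGS_BUCKET):
    if REGION == "us-east-1":
        s3.create_bucket(Bucket=bucket)   # us-east-1 rejects a LocationConstraint
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": REGION},
        )

# "Folders" in S3 are just empty keys ending in a slash.
for prefix in ("raw/", "transformed/"):
    s3.put_object(Bucket=REDDIT_BUCKET, Key=prefix)
```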

 
- Description: This DAG (`etl_reddit_pipeline`) is responsible for extracting data from Reddit using its API, transforming the data, and storing it in an S3 bucket. The data is then further processed by an AWS Glue job and cataloged by the AWS Glue crawler. Finally, the transformed data is loaded into Amazon Redshift, Amazon Athena, Power BI, Amazon QuickSight, Tableau, and Looker Studio for analytics and querying. A rough skeleton of this DAG is sketched after the task list below.
 - Schedule: Runs daily at midnight UTC.
 - Tasks:
- Extract Data: Extracts data from Reddit using its API.
 - Transform Data: Transforms the extracted data and stores it in the output folder.
 - Upload to S3: Uploads the transformed data to an S3 bucket.
 - Run Glue Job: Runs the AWS Glue job to further process the data.
 - Run Glue Crawler: Runs the AWS Glue crawler to catalog the transformed data.
 - Load to Redshift/Athena: Loads the transformed data into Amazon Redshift and Amazon Athena for analytics and querying.
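
For orientation, here is a rough skeleton of what a DAG with this task chain can look like. The operator choice and the callables are assumptions for illustration, not the project's exact implementation.

```python
# Sketch of an etl_reddit_pipeline-style DAG. The callables are hypothetical
# placeholders for the project's actual task functions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_reddit_data(**context): ...
def transform_reddit_data(**context): ...
def upload_to_s3(**context): ...
def run_glue_job(**context): ...
def run_glue_crawler(**context): ...

with DAG(
    dag_id="etl_reddit_pipeline",
    schedule_interval="@daily",          # midnight UTC
    start_date=datetime(2024, 1, 1),
    catchup=False,
    is_paused_upon_creation=True,        # DAGs start paused, as noted in the setup
) as dag:
    extract = PythonOperator(task_id="extract_data", python_callable=extract_reddit_data)
    transform = PythonOperator(task_id="transform_data", python_callable=transform_reddit_data)
    upload = PythonOperator(task_id="upload_to_s3", python_callable=upload_to_s3)
    glue_job = PythonOperator(task_id="run_glue_job", python_callable=run_glue_job)
    glue_crawler = PythonOperator(task_id="run_glue_crawler", python_callable=run_glue_crawler)

    extract >> transform >> upload >> glue_job >> glue_crawler
```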
 
 
- Description: This DAG (`manage_output_files_pipeline`) is responsible for managing the output files generated by the `etl_reddit_pipeline`. It deletes Reddit data files in the output folder and the shared_folder that are older than 30 days and copies the newer Reddit data files from the output folder to the shared_folder; a minimal sketch of this housekeeping follows the task list below.
 - Schedule: Runs daily at 1:00 AM UTC.
 - Tasks:
- Delete Old Files: Deletes Reddit data files in the output folder and the shared_folder that are older than 30 days.
 - Copy New Files: Copies the newer Reddit data files from the output folder to the shared_folder.
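
A minimal sketch of this housekeeping logic, using plain `pathlib` and `shutil` with placeholder paths; the DAG's actual task code may differ.

```python
# Sketch of the output-folder housekeeping: delete CSVs older than 30 days
# and copy newer ones to the shared folder. Paths are placeholders.
import shutil
import time
from pathlib import Path

OUTPUT_DIR = Path("output")              # placeholder path
SHARED_DIR = Path("shared_folder")       # placeholder path
MAX_AGE_SECONDS = 30 * 24 * 60 * 60      # 30 days

now = time.time()

# Delete files older than 30 days in both folders.
for folder in (OUTPUT_DIR, SHARED_DIR):
    for csv_file in folder.glob("*.csv"):
        if now - csv_file.stat().st_mtime > MAX_AGE_SECONDS:
            csv_file.unlink()

# Copy the remaining (newer) files from the output folder to the shared folder.
SHARED_DIR.mkdir(exist_ok=True)
for csv_file in OUTPUT_DIR.glob("*.csv"):
    destination = SHARED_DIR / csv_file.name
    if not destination.exists():
        shutil.copy2(csv_file, destination)
```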
 
 
- Description: This DAG (`manage_s3_logs_pipeline`) is responsible for managing the Airflow DAG run logs stored in the S3 bucket. It deletes Airflow DAG run logs that are older than 30 days; a boto3 sketch of this cleanup follows the task list below.
 - Schedule: Runs daily at 1:00 AM UTC.
 - Tasks:
- Delete Old Logs: Deletes Airflow DAG run logs that are older than 30 days from the S3 bucket.
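
A hedged boto3 sketch of this cleanup, with placeholder bucket and prefix names; the DAG's actual implementation may differ.

```python
# Sketch of the S3 log cleanup: delete Airflow log objects older than 30 days.
# Bucket and prefix names are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

LOG_BUCKET = "my-airflow-logs-bucket"    # placeholder name
LOG_PREFIX = "dag-logs/"                 # placeholder key/folder inside the bucket
cutoff = datetime.now(timezone.utc) - timedelta(days=30)

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

stale_keys = []
for page in paginator.paginate(Bucket=LOG_BUCKET, Prefix=LOG_PREFIX):
    for obj in page.get("Contents", []):
        if obj["LastModified"] < cutoff:
            stale_keys.append({"Key": obj["Key"]})

# delete_objects accepts at most 1000 keys per request.
for i in range(0, len(stale_keys), 1000):
    s3.delete_objects(Bucket=LOG_BUCKET, Delete={"Objects": stale_keys[i:i + 1000]})
```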
 
 
- Fully automated Reddit ETL pipeline that not only automates the fine-tuning and transformation of the data to the desired specs but also automates the launch of the Glue crawler to refresh our AWS Glue data catalog with the new data, so that on a daily basis AWS Athena, AWS Redshift, Tableau, Power BI, and more can access the latest data without having to do much of anything beyond what the `etl_reddit_pipeline` DAG already does.
 - Fully automated Airflow logs management: by default, DAG run logs older than 30 days are deleted (`manage_s3_logs_pipeline`).
 - Fully automated management of the CSV files, with timestamps to indicate when they entered pipeline processing. This is useful for debugging; for instance, if the data on a specific date caused issues while running through our pipeline, we know which version of the data caused them, can include it in our debugging sessions, and can hopefully replicate and solve the issues.
 - Automatically copies the CSV files from the output folder to the shared_folder (assuming it's a folder shared between multiple computers) and deletes files that are older than 30 days in both the output folder and the shared_folder.
 - Data versioning, as we copy multiple versions of the data from different dates from the output folder to the shared_folder.
 - The ability to run the Glue transformation locally, given you have enough hardware to handle the amount of data to be processed, as processing data on the cloud with AWS Glue can get quite expensive as the data grows.
 
This project demonstrates a robust and fully automated data pipeline that integrates various technologies such as Reddit API, Apache Airflow, Celery, PostgreSQL, Amazon S3, AWS Glue, Amazon Athena, and Amazon Redshift. The pipeline efficiently handles data extraction, transformation, and loading processes, ensuring that the latest data is always available for analytics and querying. The automated management of logs and output files further enhances the system's reliability and maintainability. Overall, this project showcases the power of combining cloud services and orchestration tools to build scalable and efficient data pipelines.





