Transport uses NYC Open Data to predict the best method of transportation to get from point A to point B based on user preferences. It considers not only factors like traffic, bus routes, and subway maintenance, but also environmental factors like air quality and carbon footprint. The goal of the project is to make transportation recommendations that weigh both convenience and environmental impact.
Transport is deployed in an Amazon EKS cluster, where each namespace represents a step in the data pipeline. Although the pipeline itself is language agnostic, the vast majority of the code is written in Python, my preferred language for data engineering and data science. Below is a quick list of each step and what is deployed in each namespace.
Data comes into the application in two forms:
1. Real-time data: subway trip data, traffic data.
2. Batch data: taxi trip data, subway utilization, subway stops and routes.
For a list of all datasets and APIs used in this project, please refer to
the data section.
Once we have identified the data sources, the next step is to perform
ETL on this data.
To extract the data, we make requests to each API endpoint; the endpoints
used in this project can be found in their respective folders.
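Below is a minimal sketch of what one of these extract requests might look like, assuming a Socrata-style NYC Open Data endpoint. The dataset ID, limit, and field handling are placeholders for illustration, not the project's actual endpoints.

```python
# Minimal extract sketch against a Socrata-style NYC Open Data endpoint.
# The dataset ID below is a placeholder, not one of the repo's real datasets.
import requests

BASE_URL = "https://data.cityofnewyork.us/resource"
DATASET_ID = "example-dataset-id"  # placeholder: replace with a real dataset ID


def extract(dataset_id: str, limit: int = 1000) -> list[dict]:
    """Fetch up to `limit` records from an NYC Open Data endpoint as JSON."""
    response = requests.get(
        f"{BASE_URL}/{dataset_id}.json",
        params={"$limit": limit},  # SODA API paging parameter
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    records = extract(DATASET_ID)
    print(f"Fetched {len(records)} records")
```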
As an orchestration tool for batch data, we use [Apache Airflow](https://airflow.apache.org/). Airflow is a powerful ETL tool that works well in Kubernetes deployments through its KubernetesPodOperator. Airflow uses a PostgreSQL database to store its operational metadata. We build and deploy Airflow from scratch, but keep in mind that there are many managed Airflow offerings that can be used interchangeably with less setup effort.
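As a rough illustration of how a batch job can be wired up this way, here is a minimal DAG sketch assuming Airflow 2.x with the cncf.kubernetes provider installed. The DAG id, namespace, image, and schedule are placeholders, and the operator's import path can differ between provider versions.

```python
# A minimal batch ETL DAG sketch using KubernetesPodOperator.
# All names (DAG id, namespace, image) are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="taxi_trip_batch_etl",      # placeholder DAG id
    schedule="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    extract_taxi_trips = KubernetesPodOperator(
        task_id="extract_taxi_trips",
        name="extract-taxi-trips",
        namespace="batch-etl",          # placeholder namespace
        image="transport/etl:latest",   # placeholder container image
        cmds=["python", "extract.py"],
        get_logs=True,
    )
```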
For streaming real-time data, we use Apache Kafka. Kafka works with producers, which send data to a Kafka cluster, and consumers, which read that data by subscribing to a Kafka topic. Please refer to the architecture diagram below for an example.
[Kafka Real-Time Streaming](https://github.com/enisaras/Transport/blob/main/diagrams/MTAKafkaExample.png)
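To make the producer/consumer flow concrete, here is a minimal sketch using the kafka-python client. The broker address and topic name are placeholders for the brokers and topics this project actually uses.

```python
# Minimal Kafka producer/consumer sketch with kafka-python.
# "localhost:9092" and "subway-trips" are placeholders, not the real cluster/topic.
import json

from kafka import KafkaConsumer, KafkaProducer

TOPIC = "subway-trips"  # placeholder topic name

# Producer: serialize each record as JSON and send it to the topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)
producer.send(TOPIC, {"route_id": "A", "arrival_time": "2023-01-01T12:00:00"})
producer.flush()

# Consumer: subscribe to the topic and deserialize each message back into a dict.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
```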
