Learning to use Spark Hadoop Pig technologies in GCP with Python. Doing some benchmarks on the PageRank algorithm

grallm/m2-spark-hadoop-pig

 
 


PageRank on Google Cloud Platform using Pig and Spark

Malo GRALL

Alex MAINGUY

Mathis ROCHER

Method

Small dataset

Pig vs Spark PageRank algorithm - Small dataset (in ms)

Pig vs Spark PageRank algorithm - Table Small dataset

Big dataset

We expected the differences between Pig and Spark to be more visible on the big dataset, but we did not set up optimal partitioning, so they do not show up clearly in these results.
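To illustrate what setting up partitioning could look like on the Spark side, here is a minimal PySpark sketch. It is not the code used for the benchmarks above. The function names, `num_partitions`, `iterations`, and `damping` are hypothetical choices, and `sc` is assumed to be an existing `SparkContext`. The idea is to hash-partition and cache the link table once, and to use the same partition count in every `join` and `reduceByKey`, so that the link table should not need to be reshuffled on each iteration.

```python
from operator import add


def contributions(neighbors, rank):
    """Split a page's rank evenly among the pages it links to."""
    neighbors = list(neighbors)
    share = rank / len(neighbors)
    return [(dest, share) for dest in neighbors]


def pagerank(sc, edges, iterations=3, num_partitions=8, damping=0.85):
    """Sketch of PageRank over (source, destination) pairs.

    Assumptions: `sc` is an existing pyspark SparkContext and
    `num_partitions` is a placeholder to tune for the cluster.
    """
    # Group outgoing links per page, hash-partitioned into a fixed
    # number of partitions, and cache them: they are reused each iteration.
    links = sc.parallelize(edges).distinct().groupByKey(num_partitions).cache()

    # mapValues keeps the partitioner of `links`.
    ranks = links.mapValues(lambda _: 1.0)

    for _ in range(iterations):
        # Join on the same partition count, then spread each page's rank.
        contribs = links.join(ranks, num_partitions).flatMap(
            lambda kv: contributions(kv[1][0], kv[1][1])
        )
        # Sum the contributions per page and apply the damping factor.
        ranks = contribs.reduceByKey(add, num_partitions).mapValues(
            lambda total: (1 - damping) + damping * total
        )
    return ranks
```

In this formulation, pages without incoming links drop out of `ranks` after the first iteration. Whether the fixed partitioning actually reduces run time depends on the cluster and the dataset, so it would need to be measured.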

Pig vs Spark PageRank algorithm - Big dataset (in ms)

Pig vs Spark PageRank algorithm - Table Big dataset

Problems

To get clearer results, we gathered the Spark results and saved them in a separate file instead of printing them in the terminal with the Cloud Logging for Python feature of GCP.
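As an illustration of collecting timings in a separate file, the following sketch uses only the Python standard library. It does not use Cloud Logging and it is not the script from this repository. The function name `record_run`, the file layout, and the column names are assumptions made for the example.

```python
import csv
import time
from pathlib import Path


def record_run(path, engine, dataset, job):
    """Time `job()` and append one result row to a CSV file.

    `path`, `engine`, and `dataset` are hypothetical labels chosen
    for this sketch; `job` is any zero-argument callable.
    """
    start = time.perf_counter()
    job()
    elapsed_ms = (time.perf_counter() - start) * 1000

    path = Path(path)
    write_header = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            # Write the header only when the file is created.
            writer.writerow(["engine", "dataset", "elapsed_ms"])
        writer.writerow([engine, dataset, f"{elapsed_ms:.0f}"])
    return elapsed_ms
```

A call such as `record_run("results.csv", "spark", "small", run_job)` would append one row per run, where `run_job` stands in for whatever function launches the computation. This keeps the measurements separate from the terminal output.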

With Pig, we had trouble debugging because the logs were not easily accessible in GCP. The Logging menu had some logs, but they only contained the terminal output.
