Skip to content
This repository was archived by the owner on Feb 8, 2026. It is now read-only.

FlorentTariolle/Metal-Mining

Repository files navigation

PAO Metal Mining

Large-scale analysis of heavy metal lyrics using Data Science and Natural Language Processing (NLP) techniques.

Note: "PAO" is a French acronym for "Projet d'Approfondissement et d'Ouverture" (Deepening and Opening Project), which is a school project at INSA Rouen.

Objective

Examine vocabulary, themes, emotions, and stylistic evolution of metal music through a large and representative corpus of songs.

Project Structure

The project is organized into modular packages:

data_collection/

  • scraper.py : Distributed scraping of lyrics via DarkLyrics (quarter system for parallel work)
  • metallum_webscraper.py : Scraping of musical genres via Metal Archives (Selenium)
  • merge_progress_files.py : Merging of progress files and automatic language detection
  • recalculate_languages.py : Language detection recalculation utilities

data_loading/

  • dataset_loader.py : Loading and parsing of the main dataset JSON file
  • filters.py : Data filtering functions (language, genre, etc.)

metrics/

  • song_metrics.py : Main function for computing all song-level metrics
  • swear_words.py : Swear word ratio calculation
  • readability.py : Coleman–Liau Readability Index computation
  • metalness.py : Metalness metric (word specificity to Metal vs Non-Metal corpus)
  • metallitude.py : Metallitude metric (cross TF-IDF)
  • sentiment.py : Sentiment analysis using VADER

clustering/

  • artists_tf_idf.py : TF-IDF vectorization for artists
  • albums_tf_idf.py : TF-IDF vectorization for albums
  • clustering_MRSW.py : Clustering based on Metalness, Readability, Swear words
  • albums_clustering.py : Album-level clustering analysis
  • cluster_labeling.py : Automatic cluster labeling and interpretation

visualization/

  • metrics_plots.py : General metric visualization functions
  • sentiment_plots.py : Sentiment analysis visualizations
  • wordcloud.py : Word cloud generation
  • distribution.py : Data distribution plots

aggregation/

  • aggregation.py : Aggregation functions (by artist, by year, etc.)

scripts/

Main analysis scripts:

  • metrics_analysis.py : Swear words, readability, and "stupidity curve" analysis
  • sentiment_analysis.py : Sentiment analysis and emotional valence
  • metalness_computation.py : Metalness metric computation
  • metallitude_computation.py : Metallitude metric computation
  • dataset_distribution.py : Dataset statistics and distribution analysis
  • top_romanian_artists.py : Analysis of Romanian metal artists
  • clean_genres.py : Genre data cleaning utility

utils/

  • metalness_loader.py : Loading precomputed metalness data
  • extract_top_bands.py : Utility for extracting top bands from dataset

resources/

  • swear_words_eng.txt : English swear words dictionary for profanity ratio
  • stopwords_eng.txt : English stopwords list for text filtering
  • vader_lexicon.txt : VADER sentiment lexicon

Data Files

  • data/dataset.json : Main dataset with lyrics, metadata, and detected language
  • data/bands_genres_cleaned.json : Musical genres by band
  • data/artists_list.json : List of artists for scraping
  • cache/ : Cached intermediate data files (hedonometer, lyrics data, etc.)
  • output_data/ : Generated output files (cluster labels, metrics, word rankings)
  • output_pics/ : Generated visualization images

Installation

Install required dependencies:

pip install -r requirements.txt

Usage

Main analysis scripts are located in the scripts/ directory. Run them from the project root:

# Metrics analysis (swear words, readability)
python scripts/metrics_analysis.py [--json data/dataset.json] [--sample N]

# Sentiment analysis
python scripts/sentiment_analysis.py

# Metalness computation
python scripts/metalness_computation.py

# Metallitude computation
python scripts/metallitude_computation.py

Most scripts support command-line arguments for customizing input paths and parameters. Use --help for detailed options.

Main Metrics

  1. Swear word ratio : Proportion of profane words
  2. Coleman–Liau Readability Index : Text complexity
  3. Metalness : Specificity of a word to the Metal corpus vs Non-Metal (log-ratio + sigmoid)
  4. Metallitude : Cross TF-IDF (TF on Metal, IDF on Non-Metal)

Key Results

  • Clear distinction between "extreme" metal (Death/Black) and "melodic" metal (Power/Heavy) via multivariate clustering
  • Inverse correlation between Metallitude and Happiness (emotional valence)
  • Temporal evolution of vocabulary and profanity since the 1970s

Detailed Documentation

For complete details on methodology, results, visualizations, and conclusions, see REPORT.md.

Credits

Rayen, Mathis, Nizar, and Florent

Sources

About

Large-scale analysis of metal lyrics combining custom lexical metrics, sentiment scoring, and unsupervised clustering to reveal subgenre structure

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages