Large-scale analysis of heavy metal lyrics using Data Science and Natural Language Processing (NLP) techniques.
Note: "PAO" is a French acronym for "Projet d'Approfondissement et d'Ouverture" (Deepening and Opening Project), which is a school project at INSA Rouen.
Examine vocabulary, themes, emotions, and stylistic evolution of metal music through a large and representative corpus of songs.
The project is organized into modular packages:
- scraper.py: Distributed scraping of lyrics via DarkLyrics (quarter system for parallel work)
- metallum_webscraper.py: Scraping of musical genres via Metal Archives (Selenium)
- merge_progress_files.py: Merging of progress files and automatic language detection
- recalculate_languages.py: Language detection recalculation utilities
- dataset_loader.py: Loading and parsing of the main dataset JSON file
- filters.py: Data filtering functions (language, genre, etc.)
- song_metrics.py: Main function for computing all song-level metrics
- swear_words.py: Swear word ratio calculation
- readability.py: Coleman–Liau Readability Index computation
- metalness.py: Metalness metric (word specificity to Metal vs Non-Metal corpus)
- metallitude.py: Metallitude metric (cross TF-IDF)
- sentiment.py: Sentiment analysis using VADER
- artists_tf_idf.py: TF-IDF vectorization for artists
- albums_tf_idf.py: TF-IDF vectorization for albums
- clustering_MRSW.py: Clustering based on Metalness, Readability, Swear words
- albums_clustering.py: Album-level clustering analysis
- cluster_labeling.py: Automatic cluster labeling and interpretation
- metrics_plots.py: General metric visualization functions
- sentiment_plots.py: Sentiment analysis visualizations
- wordcloud.py: Word cloud generation
- distribution.py: Data distribution plots
- aggregation.py: Aggregation functions (by artist, by year, etc.)
Main analysis scripts:
- metrics_analysis.py: Swear words, readability, and "stupidity curve" analysis
- sentiment_analysis.py: Sentiment analysis and emotional valence
- metalness_computation.py: Metalness metric computation
- metallitude_computation.py: Metallitude metric computation
- dataset_distribution.py: Dataset statistics and distribution analysis
- top_romanian_artists.py: Analysis of Romanian metal artists
- clean_genres.py: Genre data cleaning utility
- metalness_loader.py: Loading precomputed metalness data
- extract_top_bands.py: Utility for extracting top bands from dataset
- swear_words_eng.txt: English swear words dictionary for profanity ratio
- stopwords_eng.txt: English stopwords list for text filtering
- vader_lexicon.txt: VADER sentiment lexicon
- data/dataset.json: Main dataset with lyrics, metadata, and detected language
- data/bands_genres_cleaned.json: Musical genres by band
- data/artists_list.json: List of artists for scraping
- cache/: Cached intermediate data files (hedonometer, lyrics data, etc.)
- output_data/: Generated output files (cluster labels, metrics, word rankings)
- output_pics/: Generated visualization images
Install required dependencies:
pip install -r requirements.txt
Main analysis scripts are located in the scripts/ directory. Run them from the project root:
# Metrics analysis (swear words, readability)
python scripts/metrics_analysis.py [--json data/dataset.json] [--sample N]
# Sentiment analysis
python scripts/sentiment_analysis.py
# Metalness computation
python scripts/metalness_computation.py
# Metallitude computation
python scripts/metallitude_computation.py
Most scripts support command-line arguments for customizing input paths and parameters. Use --help for detailed options.
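For readers adding a new script, the snippet below is a minimal sketch of how the `--json` and `--sample N` options shown above could be declared with `argparse`. The `build_parser` helper, the default path, and the reading of `--sample` as "limit the run to N songs" are assumptions made for illustration; the actual scripts may define their options differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options shown for metrics_analysis.py."""
    parser = argparse.ArgumentParser(
        description="Metrics analysis (swear words, readability)"
    )
    # Assumption: the dataset path defaults to the file listed in this README.
    parser.add_argument(
        "--json",
        default="data/dataset.json",
        help="path to the dataset JSON file",
    )
    # Assumption: --sample restricts the analysis to N songs.
    parser.add_argument(
        "--sample",
        type=int,
        default=None,
        metavar="N",
        help="analyze only N songs (default: all)",
    )
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(["--sample", "100"])
print(f"dataset={args.json} sample={args.sample}")
```

With a parser built this way, `--help` is generated automatically from the `help` strings.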
- Swear word ratio: Proportion of profane words
- Coleman–Liau Readability Index: Text complexity
- Metalness: Specificity of a word to the Metal corpus vs Non-Metal (log-ratio + sigmoid)
- Metallitude: Cross TF-IDF (TF on Metal, IDF on Non-Metal)
- Clear distinction between "extreme" metal (Death/Black) and "melodic" metal (Power/Heavy) via multivariate clustering
- Inverse correlation between Metallitude and Happiness (emotional valence)
- Temporal evolution of vocabulary and profanity since the 1970s
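To make the four metrics listed above more concrete, here is a minimal sketch of one way each could be computed. It is not the project's implementation: the function names, the tokenizer, the add-one smoothing in Metalness, the smoothed IDF in Metallitude, and the choice to count non-empty lines as sentences are all assumptions made for illustration. Only the Coleman–Liau coefficients are the standard published ones.

```python
import math
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (simplifying assumption)."""
    return WORD_RE.findall(text.lower())


def swear_ratio(text, swear_words):
    """Proportion of tokens that appear in the swear-word set."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t in swear_words for t in tokens) / len(tokens)


def coleman_liau(text):
    """Coleman-Liau index: 0.0588*L - 0.296*S - 15.8, where L and S are
    the numbers of letters and sentences per 100 words."""
    words = text.split()
    if not words:
        return 0.0
    letters = sum(ch.isalpha() for ch in text)
    # Assumption: lyrics lack punctuation, so each non-empty line is a sentence.
    sentences = sum(1 for line in text.splitlines() if line.strip())
    L = letters / len(words) * 100
    S = sentences / len(words) * 100
    return 0.0588 * L - 0.296 * S - 15.8


def metalness(word, metal_counts, other_counts):
    """Sigmoid of the log-ratio of relative frequencies (Metal vs Non-Metal).
    Both arguments are Counters; add-one smoothing is an assumption."""
    vocab = len(set(metal_counts) | set(other_counts))
    p_metal = (metal_counts[word] + 1) / (sum(metal_counts.values()) + vocab)
    p_other = (other_counts[word] + 1) / (sum(other_counts.values()) + vocab)
    log_ratio = math.log(p_metal / p_other)
    return 1 / (1 + math.exp(-log_ratio))


def metallitude(word, metal_counts, other_docs):
    """Cross TF-IDF: TF from the Metal corpus, IDF from Non-Metal documents.
    other_docs is a list of token sets; the smoothed IDF is an assumption."""
    tf = metal_counts[word] / sum(metal_counts.values())
    df = sum(1 for doc in other_docs if word in doc)
    idf = math.log((1 + len(other_docs)) / (1 + df)) + 1
    return tf * idf


# Toy corpora, for illustration only.
metal = Counter(tokenize("blood fire death blood"))
other = Counter(tokenize("love baby love fire"))
other_docs = [{"love", "baby"}, {"love", "fire"}]

print(metalness("blood", metal, other))
print(metallitude("blood", metal, other_docs))
```

In this sketch a word that is equally frequent in both corpora gets a Metalness of 0.5, while words more typical of the Metal corpus move toward 1 and words more typical of the Non-Metal corpus move toward 0.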
For complete details on methodology, results, visualizations, and conclusions, see REPORT.md.
Rayen, Mathis, Nizar, and Florent