Spotify Popularity Inference

Overview

Predicting Spotify track popularity from various audio features (danceability, energy, etc.).

Goals

  • Explore relationships between audio features and popularity.
  • Build predictive models and assess performance.
  • Document insights and caveats (e.g., confounding, platform effects).

Dataset

13k+ tracks with popularity scores and acoustic features. Source: Kaggle — Song Popularity Dataset.

Dataset Features Info

  • Name: Track title (excluded).
  • Popularity: Integer score from 0 to 100.
  • Acousticness: Confidence (0.0–1.0) that the track is acoustic.
  • Danceability: Suitability for dancing (rhythm stability and beat strength).
  • Duration (ms): Track duration in milliseconds.
  • Energy: Proxy for intensity and activity (0.0–1.0).
  • Instrumentalness: Likelihood the track contains no vocals (0.0–1.0).
  • Key: Estimated pitch class (0=C, …, 11=B).
  • Liveness: Probability the track was performed live (0.0–1.0).
  • Loudness (dB): Overall loudness in decibels.
  • Mode: Musical modality (1=major, 0=minor).
  • Speechiness: Presence of spoken words (0.0–1.0).
  • Tempo (BPM): Estimated beats per minute.
  • Time signature: Estimated time signature.
  • Valence: Musical positiveness (0.0–1.0).

Methods

We use supervised statistical learning to identify which audio features are most strongly associated with song popularity, our response variable Y.

Data Cleanup

  • Drop song_name (non‑predictive identifier).
  • Encode audio_mode as binary (1 = major, 0 = minor); keep the other 12 covariates continuous.
  • No missing values detected.
  • Scale song_popularity from 0–100 to 0–1 by dividing by 100 (retains percentage interpretability).
  • Since the response is bounded, prepare for a GLM; prefer beta regression over plain OLS.
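Beta regression is defined for responses strictly between 0 and 1, while the scaled popularity reaches both endpoints (the summary statistics show a minimum of 0 and a maximum of 1). This write-up does not say how those endpoints were handled. The sketch below assumes one common adjustment, (y * (n - 1) + 0.5) / n, which may or may not be what was actually used. It is written in Python for illustration only (the analysis itself was run in R), and both function names are hypothetical.

```python
def scale_popularity(score):
    """Map an integer popularity score (0-100) onto the 0-1 scale."""
    return score / 100.0


def squeeze_to_open_interval(y, n):
    """Pull y in [0, 1] strictly inside (0, 1); n is the sample size.

    Assumed adjustment, not confirmed by the write-up.
    """
    return (y * (n - 1) + 0.5) / n


# 5,493 + 9,431 rows, per the audio_mode counts reported in this write-up
n_rows = 14924
scaled = [scale_popularity(s) for s in (0, 52, 100)]
squeezed = [squeeze_to_open_interval(y, n_rows) for y in scaled]
print(squeezed)
```

With a sample this large the adjustment moves each value by less than 0.0001, so interior scores are essentially unchanged while 0 and 1 become valid inputs.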

Dataset head (first 5 rows)

song_popularity  song_duration_ms  acousticness  danceability  energy  instrumentalness  key  liveness  loudness  audio_mode  speechiness  tempo    time_signature  audio_valence
0.73             262333            0.005520      0.496         0.682   2.94e-05          8    0.0589    -4.095    1           0.0294       167.060  4               0.474
0.66             216933            0.010300      0.542         0.853   0.00e+00          3    0.1080    -6.407    0           0.0498       105.256  4               0.370
0.76             231733            0.008170      0.737         0.463   4.47e-01          0    0.2550    -7.828    1           0.0792       123.881  4               0.324
0.74             216933            0.026400      0.451         0.970   3.55e-03          0    0.1020    -4.938    1           0.1070       122.444  4               0.198
0.56             223826            0.000954      0.447         0.766   0.00e+00          10   0.1130    -5.065    1           0.0313       172.011  4               0.574

Data Visualization

  • Inspect distributions of popularity and key audio features.
  • Correlation heatmap and pairwise plots to gauge linear associations.
  • Initial residual diagnostics to check variance and functional form.
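Each cell of a correlation heatmap is a Pearson correlation between two columns. As a minimal sketch of that calculation (Python for illustration; the function name is hypothetical), the example below uses the energy and loudness values from the five-row dataset head. Five rows are far too few to say anything about the full dataset; the point is only to show what the heatmap computes.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Values copied from the first five rows of the dataset head.
energy = [0.682, 0.853, 0.463, 0.970, 0.766]
loudness = [-4.095, -6.407, -7.828, -4.938, -5.065]
r = pearson(energy, loudness)
print(f"energy vs. loudness (5 rows): r = {r:.2f}")
```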

Summary statistics

Feature           Min       Q1        Median    Mean      Q3        Max
song_popularity   0.0000    0.3700    0.5200    0.4875    0.6325    1.0000
song_duration_ms  12000     183937    211840    218949    244722    1799346
acousticness      0.000001  0.023600  0.139000  0.270472  0.458250  0.996000
danceability      0.0000    0.5240    0.6360    0.6246    0.7400    0.9870
energy            0.00107   0.49600   0.67200   0.63973   0.81800   0.99900
instrumentalness  0.000000  0.000000  0.000021  0.092079  0.005112  0.997000
key               0.000     2.000     5.000     5.302     8.000     11.000
liveness          0.0109    0.0930    0.1220    0.1804    0.2240    0.9860
loudness (dB)     -38.768   -9.390    -6.751    -7.678    -4.991    1.585
speechiness       0.0000    0.0372    0.0541    0.0994    0.1130    0.9410
tempo (BPM)       0.00      98.12     120.02    121.10    139.94    242.32
time_signature    0.000     4.000     4.000     3.953     4.000     5.000
audio_valence     0.000     0.332     0.527     0.527     0.728     0.984

audio_mode (binary) counts: 0 = 5,493; 1 = 9,431. Statistics above omit audio_mode since it’s categorical.


[Figures: Feature vs. Popularity Scatterplots; Correlation Matrix]

Fitting Linear Regression

  • Baseline OLS; compare with a logit transform of popularity.
  • Check multicollinearity (e.g., VIF) and remove/retain predictors accordingly.
  • Use cross‑validation to assess out‑of‑sample error.
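The logit comparison and the cross-validation step can be sketched as follows. This is an illustrative Python outline, not the R code behind the output in this section. The clipping constant eps, the fold count, and the seed are all assumptions, and the function names are hypothetical.

```python
import math
import random


def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 (eps is an assumed choice)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Map a value on the log-odds scale back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def k_fold_indices(n, k, seed=0):
    """Shuffle row indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


folds = k_fold_indices(n=10, k=5)
for held_out in folds:
    train = [i for fold in folds if fold is not held_out for i in fold]
    # Fit on `train`, predict the rows in `held_out`, back-transform the
    # predictions with inv_logit, then accumulate squared error on the
    # original 0-1 scale so both models are scored in the same units.
```

Scoring on the original scale matters because errors measured on the log-odds scale are not comparable with errors from the untransformed OLS fit.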
Call:
lm(formula = song_popularity ~ ., data = song.data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.57788 -0.11272  0.03017  0.14619  0.49295

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       5.691e-01  2.791e-02  20.391  < 2e-16 ***
song_duration_ms -2.630e-08  2.706e-08  -0.972 0.331038
acousticness     -2.954e-02  7.903e-03  -3.738 0.000186 ***
danceability      6.503e-02  1.246e-02   5.218 1.83e-07 ***
energy           -9.077e-02  1.464e-02  -6.199 5.83e-10 ***
instrumentalness -7.128e-02  7.711e-03  -9.244  < 2e-16 ***
key               1.426e-05  4.667e-04   0.031 0.975621
liveness         -4.122e-02  1.169e-02  -3.525 0.000424 ***
loudness          4.162e-03  7.133e-04   5.835 5.50e-09 ***
audio_mode1       5.656e-03  3.520e-03   1.607 0.108093
speechiness      -3.301e-02  1.667e-02  -1.981 0.047619 *
tempo            -1.493e-04  5.879e-05  -2.540 0.011095 *
time_signature    1.107e-02  5.320e-03   2.080 0.037561 *
audio_valence    -5.777e-02  7.917e-03  -7.297 3.09e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2015 on 14910 degrees of freedom
Multiple R-squared:  0.02341, Adjusted R-squared:  0.02255
F-statistic: 27.49 on 13 and 14910 DF,  p-value: < 2.2e-16


[Figures: Residuals vs. Fitted; Density Contours]

studentized Breusch-Pagan test
data:  full_model
BP = 102.81, df = 13, p-value = 4.719e-16

Feature           VIF
song_duration_ms  1.045922
acousticness      2.040434
danceability      1.425841
energy            3.875072
instrumentalness  1.263800
key               1.033466
liveness          1.057927
loudness          3.020238
audio_mode        1.059304
speechiness       1.094461
tempo             1.071715
time_signature    1.043724
audio_valence     1.414254
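A VIF can be read as 1 / (1 - R²), where R² comes from regressing that predictor on all the others. Inverting the formula shows how much of each predictor is explained by the rest. The short Python sketch below (illustrative; the helper name is hypothetical) does this for the three largest VIFs in the table.

```python
def r_squared_from_vif(vif):
    """Invert VIF = 1 / (1 - R^2) to recover the auxiliary R^2."""
    return 1.0 - 1.0 / vif


# The three largest VIFs reported for the full model.
vifs = {"energy": 3.875072, "loudness": 3.020238, "acousticness": 2.040434}
aux_r2 = {name: r_squared_from_vif(v) for name, v in vifs.items()}
for name, r2 in aux_r2.items():
    print(f"{name}: about {100 * r2:.0f}% of its variance is explained by the other predictors")
```

Energy is the most entangled predictor, at roughly 74%, yet every VIF stays below the commonly quoted rule-of-thumb cutoffs of 5 or 10.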

[Figures: Residuals (Transformed); Density (Transformed)]

studentized Breusch-Pagan test
data:  transformed.model
BP = 59.472, df = 13, p-value = 6.528e-08

Fitting Beta Regression

  • Model popularity (scaled to 0–1) with beta regression.
  • Compare full vs. reduced specifications for interpretability.
  • Report pseudo R² and cross‑validated performance alongside OLS baselines.
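For readers new to beta regression, the sketch below shows the mean and precision parametrization that the R betareg package uses: a response with mean mu and precision phi follows Beta(mu * phi, (1 - mu) * phi), and the logit link ties mu to the linear predictor. The Python is illustrative only and the function names are hypothetical; it evaluates the log-likelihood that maximum likelihood would optimize, without fitting anything.

```python
import math


def inv_logit(eta):
    """Logit link inverse: linear predictor -> mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def beta_logpdf(y, mu, phi):
    """Log-density of Beta(mu*phi, (1-mu)*phi) at y, with 0 < y < 1."""
    a = mu * phi
    b = (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))


def beta_variance(mu, phi):
    """Variance implied by the mean/precision parametrization."""
    return mu * (1.0 - mu) / (1.0 + phi)


def log_likelihood(ys, etas, phi):
    """Sum of log-densities for responses ys with linear predictors etas."""
    return sum(beta_logpdf(y, inv_logit(eta), phi) for y, eta in zip(ys, etas))
```

A larger phi means less spread around the mean, which is why betareg reports it as a precision parameter.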

Call:
betareg(formula = song_popularity ~ ., data = song.data)

Quantile residuals:
    Min      1Q  Median      3Q     Max
-5.4128 -0.2457  0.2438  0.6481  5.3982

Coefficients (mean model with logit link):
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)       1.232e-01  1.315e-01   0.937 0.348767
song_duration_ms -4.375e-07  1.279e-07  -3.421 0.000623 ***
acousticness     -2.355e-02  3.717e-02  -0.633 0.526416
danceability      3.138e-01  5.864e-02   5.350 8.78e-08 ***
energy           -3.089e-01  6.889e-02  -4.484 7.32e-06 ***
instrumentalness -3.617e-01  3.645e-02  -9.922  < 2e-16 ***
key              -2.578e-03  2.195e-03  -1.174 0.240254
liveness         -1.158e-01  5.505e-02  -2.103 0.035452 *
loudness          1.069e-02  3.360e-03   3.182 0.001462 **
audio_mode1       2.148e-02  1.656e-02   1.297 0.194543
speechiness      -2.918e-01  7.848e-02  -3.718 0.000201 ***
tempo            -7.231e-04  2.767e-04  -2.614 0.008959 **
time_signature    4.393e-02  2.508e-02   1.752 0.079855 .
audio_valence    -2.306e-01  3.725e-02  -6.190 6.03e-10 ***

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)
(phi)  3.28336    0.03366   97.54   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Type of estimator: ML (maximum likelihood)
Log-likelihood: 1246 on 15 Df
Pseudo R-squared: 0.01217
Number of iterations: 30 (BFGS) + 2 (Fisher scoring)
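Because the mean model uses a logit link, each coefficient acts on the log-odds of the expected popularity, mu / (1 - mu). Exponentiating a coefficient times a change in the predictor gives the multiplicative change in those odds, holding the other predictors fixed. The Python sketch below is illustrative. It applies this to three full-model coefficients for a 0.1 increase, an arbitrary step chosen here because these features range from 0 to 1.

```python
import math

# Full-model coefficients copied from the betareg output above.
coefs = {
    "danceability": 3.138e-01,
    "instrumentalness": -3.617e-01,
    "audio_valence": -2.306e-01,
}
delta = 0.1  # assumed step size, for illustration

odds_ratios = {name: math.exp(beta * delta) for name, beta in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: +{delta} multiplies the odds of popularity by {ratio:.3f}")
```

A 0.1 rise in danceability corresponds to odds about 3% higher, while the same rise in instrumentalness corresponds to odds about 3.5% lower. These are associations in observational data, not causal effects.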


Likelihood ratio test

Model 1: song_popularity ~ . - acousticness - key - audio_mode - time_signature - liveness
Model 2: song_popularity ~ .
  #Df LogLik Df  Chisq Pr(>Chisq)
1  10 1240.5
2  15 1246.4  5 11.827    0.03724 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
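The likelihood ratio statistic is twice the gap between the two log-likelihoods, compared against a chi-square distribution whose degrees of freedom equal the number of dropped parameters (15 - 10 = 5). The Python sketch below reproduces the reported p-value with the standard library only. The chi2_sf helper is a hypothetical name, and it relies on the closed-form recurrence for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        k, q = 2, math.exp(-half)             # exact tail for df = 2
    else:
        k, q = 1, math.erfc(math.sqrt(half))  # exact tail for df = 1
    while k < df:
        # Step from df = k to df = k + 2.
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


ll_reduced, ll_full = 1240.5, 1246.4
df_diff = 15 - 10
lr_from_rounded = 2.0 * (ll_full - ll_reduced)  # about 11.8 from the rounded values
p_value = chi2_sf(11.827, df_diff)              # uses the reported statistic
print(f"LR (rounded inputs) = {lr_from_rounded:.1f}, p = {p_value:.4f}")
```

At the 5% level the test rejects the reduced model, so the five dropped predictors jointly carry some signal. Keeping the reduced model is therefore a trade of a small amount of fit for interpretability.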


Call:
betareg(formula = song_popularity ~ . - acousticness - key - audio_mode - 
    time_signature - liveness, data = song.data)

Quantile residuals:
    Min      1Q  Median      3Q     Max
-5.3549 -0.2479  0.2450  0.6510  5.4122

Coefficients (mean model with logit link):
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)       2.424e-01  8.201e-02   2.956 0.003120 **
song_duration_ms -4.397e-07  1.275e-07  -3.449 0.000562 ***
danceability      3.425e-01  5.623e-02   6.091 1.12e-09 ***
energy           -2.942e-01  5.948e-02  -4.946 7.58e-07 ***
instrumentalness -3.653e-01  3.638e-02 -10.041  < 2e-16 ***
loudness          1.062e-02  3.355e-03   3.166 0.001547 **
speechiness      -3.199e-01  7.770e-02  -4.118 3.82e-05 ***
tempo            -6.811e-04  2.759e-04  -2.468 0.013569 *
audio_valence    -2.348e-01  3.650e-02  -6.434 1.24e-10 ***

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)
(phi)  3.28074    0.03363   97.55   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Type of estimator: ML (maximum likelihood)
Log-likelihood: 1241 on 10 Df
Pseudo R-squared: 0.01167
Number of iterations: 24 (BFGS) + 3 (Fisher scoring)


Feature           VIF
song_duration_ms  1.039005
danceability      1.309506
energy            2.885457
instrumentalness  1.257748
loudness          3.004867
speechiness       1.072703
tempo             1.065073
audio_valence     1.357336

"MSE of Full Model: 0.0418216517896375"
"MSE of Reduced Model: 0.0418503691186738"

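To put the two MSE values in more familiar units, the Python sketch below (illustrative) takes square roots and rescales to popularity points. It assumes the MSEs were computed on the 0-1 popularity scale, which the write-up does not state explicitly.

```python
import math

# MSE values copied from the output above (assumed to be on the 0-1 scale).
mse = {"full": 0.0418216517896375, "reduced": 0.0418503691186738}
rmse = {name: math.sqrt(value) for name, value in mse.items()}
for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.4f}, about {100 * value:.1f} popularity points")
```

Under that assumption both models miss by roughly 20 points on the 0-100 scale, and the gap between them is well under a tenth of a point, so dropping the five predictors costs almost nothing in predictive accuracy.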
Results

[Figure: Results]

Drake Career Analysis

Overview

Story-driven exploration of a decade-plus career: releases, collaborations, and evolving audio profiles. Minimal, skimmable summary below; see materials for the full narrative with visuals.

Goals

  • Curate key milestones and summarize quantitative trends.
  • Present findings in a digestible, self-contained report and notebook.

Methods

Part One: Motivations

Part Two: Visualization of Drake's popularity over the past decade

Part Three: Clustering Drake's music by song attributes

  • k-means on standardized audio features (k = 4) to identify style groups.
  • Elbow/Knee methods to choose k; inspect cluster profiles (danceability, energy, valence).
  • Compare cluster distribution across eras and top-streamed tracks.
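The clustering steps above can be sketched in a few dozen lines. This is a minimal Python illustration using only the standard library, not the project's actual code. The toy rows are made up, the function names are hypothetical, and the random initialization and seed are assumptions.

```python
import math
import random


def standardize(rows):
    """Z-score every column so no single feature dominates the distances."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for a constant column to avoid dividing by zero.
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(rows, centroids):
    """Index of the nearest centroid for every row."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(r, centroids[j]))
            for r in rows]


def kmeans(rows, k, iters=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids, within-cluster SS)."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    for _ in range(iters):
        labels = assign(rows, centroids)
        updated = []
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                updated.append([sum(c) / len(members) for c in zip(*members)])
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    labels = assign(rows, centroids)
    wss = sum(sq_dist(r, centroids[lab]) for r, lab in zip(rows, labels))
    return labels, centroids, wss


# Made-up (danceability, energy, valence) rows, for illustration only.
toy = [[0.80, 0.70, 0.60], [0.82, 0.68, 0.65], [0.78, 0.72, 0.58],
       [0.40, 0.30, 0.20], [0.42, 0.33, 0.18], [0.38, 0.28, 0.22]]
z = standardize(toy)
# Elbow check: within-cluster sum of squares for each candidate k.
wss_by_k = {k: kmeans(z, k)[2] for k in range(1, 5)}
labels, _, _ = kmeans(z, 2)
```

Plotting wss_by_k against k and looking for the point where the curve flattens is the elbow heuristic. On real data it is worth rerunning with several seeds, since Lloyd's algorithm only finds a local optimum.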

Part Four: Old vs. new Drake preferences

[Figures: Career Term Frequency; Career Word Cloud]

[Figures: Album Term Frequency; Album Word Cloud]

Part Five: Sentiment analysis on rising popularity

[Figures: Positive Word Cloud; Negative Word Cloud]

[Figure: Sentiment Scores (YouTube/Reddit)]

Results

[Figure: Results]

Highlights notable phases in the discography and shifting audio profiles; discusses limitations and data caveats.

Note: Both are course projects: STA 141A (Spotify track popularity inference) and STA 141B (Drake music career analysis).