- scraped headlines with `politics` tag on [substack.com](https://substack.com/discover/category/politics/all) and [medium.com](https://medium.com/tag/politics/archive/2021/12/01)
- identified the 150 most popular headlines per month from each platform (June 2021 to Dec 2021; 7 months)
- popularity based on likes/comments (sorted by likes then comments)
- 1050 medium, 1050 substack headlines (2100 total)
- fitted logistic regression (L2 regularization) in sklearn with default hyperparameters
- 5-fold cross validation (prediction accuracy)
- see also unpopular headlines [[220117_163635 medium vs substack - unpopular 150 monthly june 2021 to dec 2021]]
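A minimal sketch of the setup above (toy headlines stand in for the scraped data; column names are assumptions):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy stand-in for the scraped data: label 1 = substack, 0 = medium
df = pd.DataFrame({
    "headline": [
        "Why the latest senate vote matters more than you think",
        "A quiet week in national politics",
        "The filibuster debate, explained",
        "How local elections shape federal policy",
        "Inside the newest campaign finance scandal",
        "What the polls really tell us this cycle",
        "Redistricting and the fight for fair maps",
        "Five takeaways from last night's debate",
        "The supreme court term in review",
        "Why turnout decides everything",
    ],
    "label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Bag-of-words text features + logistic regression
# (sklearn's LogisticRegression applies L2 regularization by default)
pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validated prediction accuracy
scores = cross_val_score(pipe, df["headline"], df["label"], cv=5, scoring="accuracy")
print(round(scores.mean(), 4))
```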
Goal: Can we train a model to classify whether a headline came from substack or medium?
- outcome: substack (coded 1) or medium (coded 0)
- features/predictors: headline text, headline length, polarity, subjectivity, [[VADER]] sentiment
# Model 1 (includes headline text)
- 8 input columns (headline text, headline length, polarity [-1, 1], subjectivity [0, 1], 4 [[VADER]] scores)
- the 8 columns were transformed into 3345 features (mostly from vectorizing the headline text)
- mean prediction accuracy: 70.76% (tuning the hyperparameters or using stronger models would likely push this higher)
Top 50 text features plus the 7 non-text features (polarity, subjectivity, headline length, 4 [[VADER]])
- positive coefs: more on substack
- negative coefs: more on medium
![[feature_imp__popular150monthlyjune__headline-headline_len-polarity_subjectivity 2.png|700]]
# Model 2 (excludes headline text)
- 7 input columns/features (headline length, polarity [-1, 1], subjectivity [0, 1], 4 [[VADER]])
- mean prediction accuracy: 56.67%
![[feature_imp__popular150monthlyjune__headline_len-polarity_subjectivity 1.png|700]]
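The text-free variant is the same pipeline without the vectorizer; a sketch with synthetic numeric predictors standing in for the 7 real columns (scaling is optional but sensible when headline length sits next to bounded sentiment scores):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 40
# Synthetic predictors standing in for the 7 real columns
X = pd.DataFrame({
    "headline_len": rng.integers(20, 90, n),
    "polarity": rng.uniform(-1, 1, n),
    "subjectivity": rng.uniform(0, 1, n),
    "vader_compound": rng.uniform(-1, 1, n),
})
y = np.tile([0, 1], n // 2)  # balanced labels, as in the real dataset

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(round(scores.mean(), 4))
```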