- scraped headlines with `politics` tag on [substack.com](https://substack.com/discover/category/politics/all) and [medium.com](https://medium.com/tag/politics/archive/2021/12/01)
- identified the 150 most popular headlines per month from each platform (June 2021 to Dec 2021; 7 months)
- popularity based on likes/comments (sorted by likes then comments)
- 1050 medium, 1050 substack headlines (2100 total)
- fitted logistic regression (L2 regularization) in sklearn with default hyperparameters
- 5-fold cross validation (prediction accuracy)
- see also unpopular headlines [[220117_163635 medium vs substack - unpopular 150 monthly june 2021 to dec 2021]]
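A minimal sketch of the setup above (toy headlines stand in for the scraped data; column names are assumptions):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy stand-in for the scraped data: label 1 = substack, 0 = medium
df = pd.DataFrame({
    "headline": [
        "Why the latest senate vote matters more than you think",
        "A quiet week in national politics",
        "The filibuster debate, explained",
        "How local elections shape federal policy",
        "Inside the newest campaign finance scandal",
        "What the polls really tell us this cycle",
        "Redistricting and the fight for fair maps",
        "Five takeaways from last night's debate",
        "The supreme court term in review",
        "Why turnout decides everything",
    ],
    "label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Bag-of-words text features + logistic regression
# (sklearn's LogisticRegression applies L2 regularization by default)
pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validated prediction accuracy
scores = cross_val_score(pipe, df["headline"], df["label"], cv=5, scoring="accuracy")
print(round(scores.mean(), 4))
```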
Goal: Can we train a model to classify whether a headline came from substack or medium?
- outcome: substack (coded 1) or medium (coded 0)
- features/predictors: headline text, headline length, polarity, subjectivity, [[VADER]] sentiment
# Model 1 (includes headline text)
- 8 input columns (headline text, headline length, polarity [-1, 1], subjectivity [0, 1], 4 [[VADER]] scores)
- the 8 columns were transformed into 3345 features (mostly from vectorizing the headline text)
- mean prediction accuracy: 70.76% (tuning the hyperparameters or using stronger models would likely push this higher)
Top 50 text features plus the 7 non-text features (polarity, subjectivity, headline length, 4 [[VADER]])
- positive coefs: more on substack
- negative coefs: more on medium
![[feature_imp__popular150monthlyjune__headline-headline_len-polarity_subjectivity 2.png|700]]
# Model 2 (excludes headline text)
- 7 input columns/features (headline length, polarity [-1, 1], subjectivity [0, 1], 4 [[VADER]])
- mean prediction accuracy: 56.67%
![[feature_imp__popular150monthlyjune__headline_len-polarity_subjectivity 1.png|700]]
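The text-free variant is the same pipeline without the vectorizer; a sketch with synthetic numeric predictors standing in for the 7 real columns (scaling is optional but sensible when headline length sits next to bounded sentiment scores):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 40
# Synthetic predictors standing in for the 7 real columns
X = pd.DataFrame({
    "headline_len": rng.integers(20, 90, n),
    "polarity": rng.uniform(-1, 1, n),
    "subjectivity": rng.uniform(0, 1, n),
    "vader_compound": rng.uniform(-1, 1, n),
})
y = np.tile([0, 1], n // 2)  # balanced labels, as in the real dataset

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(round(scores.mean(), 4))
```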