- randomly selected 1 control user from each block (5423 blocks in total)
- all tweets (including quotes, replies, retweets) from Oct 4 to Oct 24
- ~2 million unique tweets/documents
- excluded tokens/words that occur in fewer than 100 documents
- excluded tokens/words that occur in >60% of documents
- represented documents in two ways: [[bags of words|bag-of-words]] counts and [[term frequency-inverse document frequency]] (TF-IDF) weights
- used [[latent Dirichlet allocation]] (LDA) and [[non-negative matrix factorization]] (NMF)
- requested 64 topics (only 16 are shown in the plots below); see the pipeline sketch after this list
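A minimal sketch of the pipeline above, assuming scikit-learn; the function names, the `random_state`, and the NMF `nndsvd` init are illustrative choices, not from the original analysis code:

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def fit_topic_models(tweets, n_topics=64, min_docs=100, max_doc_frac=0.6):
    """Fit LDA and NMF on bag-of-words and TF-IDF representations.

    `tweets` is one string per document (~2 million tweets here). Tokens in
    fewer than `min_docs` documents or in more than `max_doc_frac` of all
    documents are excluded, matching the cutoffs in the list above.
    """
    bow = CountVectorizer(min_df=min_docs, max_df=max_doc_frac)
    X_bow = bow.fit_transform(tweets)

    tfidf = TfidfVectorizer(min_df=min_docs, max_df=max_doc_frac)
    X_tfidf = tfidf.fit_transform(tweets)

    models = {
        "bow_lda": LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X_bow),
        "bow_nmf": NMF(n_components=n_topics, init="nndsvd", random_state=0).fit(X_bow),
        "tfidf_lda": LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X_tfidf),
        "tfidf_nmf": NMF(n_components=n_topics, init="nndsvd", random_state=0).fit(X_tfidf),
    }
    return bow, tfidf, models


def top_words(model, feature_names, n_topics=16, n_words=10):
    """Print the top-weighted words for the first `n_topics` topics
    (only 16 of the 64 requested topics appear in the plots below)."""
    for k, weights in enumerate(model.components_[:n_topics]):
        top = [feature_names[i] for i in weights.argsort()[::-1][:n_words]]
        print(f"topic {k:2d}: {' '.join(top)}")
```

Usage would look like `bow, tfidf, models = fit_topic_models(tweets)` followed by `top_words(models["bow_lda"], bow.get_feature_names_out())`.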
# Bag-of-words model
## LDA
![[lda.png|900]]
## NMF
![[nmf_64.png|900]]
# TF-IDF model
Weird: the TF-IDF models perform worse than the simple [[bags of words|bag-of-words]] models. Maybe because of the token exclusions (infrequent and frequent tokens)? For LDA specifically there is another suspect: its generative model assumes integer word counts, so fitting it on TF-IDF weights already violates its assumptions, though that would not explain the NMF results. One way to probe the exclusion hypothesis is sketched below.
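A sketch of that probe, under the same assumptions as the pipeline sketch above (it reuses `tweets`, `tfidf`, and `top_words` from there): refit TF-IDF without the document-frequency cutoffs and compare vocabularies and topics.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# No min_df/max_df cutoffs: keep the full vocabulary.
tfidf_full = TfidfVectorizer()
X_tfidf_full = tfidf_full.fit_transform(tweets)

print("vocabulary with cutoffs:   ", len(tfidf.get_feature_names_out()))
print("vocabulary without cutoffs:", len(tfidf_full.get_feature_names_out()))

# Refit NMF on the unfiltered TF-IDF matrix and eyeball the topics.
nmf_full = NMF(n_components=64, init="nndsvd", random_state=0).fit(X_tfidf_full)
top_words(nmf_full, tfidf_full.get_feature_names_out())
```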
## LDA
![[tfidf_lda.png|900]]
## NMF
![[tfidf_nmf_64.png|900]]