- randomly selected 1 control user from each block (5423 blocks in total)
- all tweets (including quotes, replies, retweets) from Oct 4 to Oct 24
- ~2 million unique tweets/documents
- excluded tokens/words that occur in fewer than 100 documents
- excluded tokens/words that occur in >60% of documents
- represented words in two ways: [[bags of words|bag of words]], [[term frequency-inverse document frequency]] (TF-IDF)
- used [[latent Dirichlet allocation]] (LDA) and [[non-negative matrix factorization]] (NMF)
- requested 64 topics (only 16 shown in the plots below; see the pipeline sketch at the end of this note)

# Bag-of-words model

## LDA

![[lda.png|900]]

## NMF

![[nmf_64.png|900]]

# TF-IDF model

Weird: worse performance than the simple [[bags of words|bag-of-words]] models. Maybe because of the token exclusions (infrequent and frequent tokens)? Another plausible factor: LDA's generative model assumes raw term counts, so fitting it on TF-IDF weights is theoretically shaky, whereas NMF is the factorization more commonly paired with TF-IDF.

## LDA

![[tfidf_lda.png|900]]

## NMF

![[tfidf_nmf_64.png|900]]
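# Pipeline sketch

For reference, a minimal sketch of what the pipeline above could look like, assuming scikit-learn (the note does not say which library was actually used). `load_tweets()` is a hypothetical placeholder for the corpus; only the document-frequency cutoffs (min 100 documents, max 60%) and the 64 requested topics come from the note itself.

```python
# Minimal sketch of the topic-modeling pipeline, assuming scikit-learn.
import numpy as np
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = load_tweets()  # hypothetical loader: the ~2M tweet texts, one string each

# Bag-of-words counts, dropping tokens in <100 documents or >60% of documents
bow = CountVectorizer(min_df=100, max_df=0.6)
X_bow = bow.fit_transform(docs)

# TF-IDF weights with the same document-frequency cutoffs
tfidf = TfidfVectorizer(min_df=100, max_df=0.6)
X_tfidf = tfidf.fit_transform(docs)

# 64 topics from each factorization; LDA expects raw counts, while NMF is
# the one conventionally fit on TF-IDF
lda = LatentDirichletAllocation(n_components=64, random_state=0)
nmf = NMF(n_components=64, random_state=0)
doc_topics_lda = lda.fit_transform(X_bow)
doc_topics_nmf = nmf.fit_transform(X_tfidf)

# Top 10 words per topic, for plots like the ones embedded above
# (only the first 16 of the 64 topics, matching the figures)
terms = bow.get_feature_names_out()
for k, weights in enumerate(lda.components_[:16]):
    top = terms[np.argsort(weights)[::-1][:10]]
    print(f"topic {k}: {' '.join(top)}")
```

The same top-words loop works for `nmf.components_` with `tfidf.get_feature_names_out()`; the four model/representation combinations (BoW vs. TF-IDF, LDA vs. NMF) correspond to the four plots above.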