![[correlations 2.jpg]]

## batches 10, 11 (min cluster size = 4000)

```r
cols2use <- c("verified", "followers_count", "friends_count", "listed_count",
              "favourites_count", "statuses_count", "name_len", "name_alpha_pct",
              "name_upper_pct", "screen_name_len", "screen_name_alpha_pct",
              "screen_name_upper_pct", "description_len", "description_alpha_pct",
              "description_upper_pct", "days_since_create")

clusters
    1     2     3     4     5     6     7
 4022  4014  4024 26160 39846 10722 13976

    clust clust_control_treat     N
 1:     1                   1  2011
 2:     1                   2  2011
 3:     2                   3  2007
 4:     2                   4  2007
 5:     3                   5  2012
 6:     3                   6  2012
 7:     4                   7 13080
 8:     4                   8 13080
 9:     5                   9 19923
10:     5                  10 19923
11:     6                  11  5361
12:     6                  12  5361
13:     7                  13  6988
14:     7                  14  6988
```

![[Pasted image 20210417142926.png]]
![[Pasted image 20210417143253.png]]

## batches 12, 13 (min cluster size = 4000)

- winsorize `c(0.05, 0.95)` all features before normalizing `[0, 1]`
- uploaded batch 13

```r
cols2use <- c("followers_count", "friends_count", "listed_count",
              "favourites_count", "statuses_count", "name_len", "name_alpha_pct",
              "screen_name_len", "screen_name_alpha_pct", "screen_name_upper_pct",
              "description_len", "description_alpha_pct", "description_upper_pct",
              "days_since_create")

    clust clust_control_treat     N
 1:     1                   1 20800
 2:     1                   2 20801
 3:     2                   3  4026
 4:     2                   4  4027
 5:     3                   5  3178
 6:     3                   6  3179
 7:     4                   7  8323
 8:     4                   8  8323
 9:     5                   9 15053
10:     5                  10 15054
```

![[Pasted image 20210417173302.png]]
![[Pasted image 20210417173346.png]]

## batches 14, 15 (min cluster size = 4000)

- winsorize `c(0.05, 0.95)` all features before normalizing `[0, 1]`
- uploaded batch 15

```r
cols2use <- c("followers_count", "friends_count", "listed_count",
              "favourites_count", "statuses_count", "name_len", "screen_name_len",
              "screen_name_alpha_pct", "screen_name_upper_pct", "description_len",
              "description_alpha_pct", "description_upper_pct", "days_since_create")

clusters
    1     2     3     4     5
31059 10213  6077 13678 41737

    clust clust_control_treat     N
 1:     1                   1 15529
 2:     1                   2 15530
 3:     2                   3  5106
 4:     2                   4  5107
 5:     3                   5  3038
 6:     3                   6  3039
 7:     4                   7  6839
 8:     4                   8  6839
 9:     5                   9 20868
10:     5                  10 20869
```

![[Pasted image 20210417181034.png]]
![[Pasted image 20210417181045.png]]

## batches 16, 17 (min cluster size = 1500)

- features: botometer metrics
- only 3803 users checked against botometer
- uploaded batches 16 and 17
- clust 1: bots?
- clust 2: non-bots?

```r
cols2use <- c('bot_cap_eng','bot_cap_uni','bot_eng_astroturf','bot_eng_fake_follower',
              'bot_eng_financial','bot_eng_other','bot_eng_overall','bot_eng_self_declared',
              'bot_eng_spammer','bot_uni_astroturf','bot_uni_fake_follower','bot_uni_financial',
              'bot_uni_other','bot_uni_overall','bot_uni_self_declared','bot_uni_spammer')

   clust clust_control_treat     N
1:    NA                  NA 98961 # not checked yet
2:     1                   1   750
3:     1                   2   750
4:     2                   3  1151
5:     2                   4  1152
```

![[Pasted image 20210418214229.png]]

## batches 18, 19 (cluster size = 1335)

- features: botometer metrics
- only 3803 users checked against botometer
- removed bots "manually"

```r
# r manually subset users
dt2 <- dt2[bot_cap_eng < 0.6 & bot_cap_uni < 0.6 &
           bot_eng_self_declared < 0.3 & bot_uni_self_declared < 0.5]

   clust clust_control_treat      N
1:    NA                  NA 101429
2:     1                   1    667
3:     1                   2    668
```

![[Pasted image 20210418224657.png]]

## batches 20, 21 (min cluster size = 4000)

- winsorize `c(0.05, 0.95)` all features before normalizing `[0, 1]`

```r
# features chosen based on significant regression coefs
cols2use <- c('followers_count', 'friends_count', 'favourites_count',
              'screen_name_len', 'days_since_create')

    clust clust_control_treat    N
 1:     1                   1 2609
 2:     1                   2 2609
 3:     2                   3 5234
 4:     2                   4 5234
 5:     3                   5 5183
 6:     3                   6 5184
 7:     4                   7 8786
 8:     4                   8 8787
 9:     5                   9 4940
10:     5                  10 4940
11:     6                  11 3965
12:     6                  12 3965
13:     7                  13 3359
14:     7                  14 3359
15:     8                  15 7524
16:     8                  16 7525
17:     9                  17 6527
18:     9                  18 6527
19:    10                  19 3253
20:    10                  20 3254
```

![[Pasted image 20210419132132.png]]
![[Pasted image 20210419132111.png]]

# batches 22, 23 (min clust size = 1000)

- features: botometer metrics
- only 3803 users checked against botometer

```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial',
              'bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')

   clust clust_control_treat     N
1:    NA                  NA 98961
2:     1                   1   500
3:     1                   2   500
4:     2                   3  1401
5:     2                   4  1402
```

![[Pasted image 20210419163315.png]]

# batches 24, 25 (min clust size = 1300)

- features: botometer metrics
- 7660 checked

```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial',
              'bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')

   clust clust_control_treat     N
1:    NA                  NA 95104
2:     1                   1  1847
3:     1                   2  1847
4:     2                   3  1147
5:     2                   4  1147
6:     3                   5   836
7:     3                   6   836
```

![[Pasted image 20210420215015.png]]

# batches 26, 27 (min clust size = 900)

- features: botometer metrics
- 7660 checked

```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial',
              'bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')

   clust clust_control_treat     N
1:    NA                  NA 95104
2:     1                   1  1159
3:     1                   2  1159
4:     2                   3  1510
5:     2                   4  1511
6:     3                   5   450 # highest so far
7:     3                   6   450 # highest so far
8:     4                   7   710
9:     4                   8   711
```

![[Pasted image 20210420215533.png]]

# batches 28, 29 (min clust size = 15000)

- winsorize `c(0.05, 0.95)` all features before normalizing `[0, 1]`

```r
cols2use <- c("followers_count", "friends_count", "listed_count",
              "favourites_count", "statuses_count", "name_len", "name_alpha_pct",
              "name_upper_pct", "screen_name_len", "screen_name_alpha_pct",
              "screen_name_upper_pct", "description_len", "description_alpha_pct",
              "description_upper_pct", "days_since_create")

   clust clust_control_treat     N
1:     1                   1  7976
2:     1                   2  7976
3:     2                   3 43406
4:     2                   4 43406
```

![[Pasted image 20210421093039.png]]
![[Pasted image 20210421093027.png]]

# batches 30, 31 (min clust size = 700)

- 2980 users matched the criteria below

```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial',
              'bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')
dt2 <- dt2[bot_cap_eng %between% c(0.45, 0.78)]

   clust clust_control_treat     N
1:    NA                  NA 99784
2:     1                   1  1126
3:     1                   2  1127
4:     2                   3   363
5:     2                   4   364
```

![[Pasted image 20210421094546.png]]

# batches 32, 33 (min clust size = 700)

- 1306 users matched criteria below

```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial',
              'bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')
dt2 <- dt2[bot_eng_astroturf > 0.45]

   clust clust_control_treat      N
1:    NA                  NA 101458
2:     1                   1    653
3:     1                   2    653
```

![[Pasted image 20210421095347.png]]

# batches 34, 35 (min clust size = 400)

- 435 users matched criteria below

```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial',
              'bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')
dt2 <- dt2[bot_eng_fake_follower > 0.5]

   clust clust_control_treat      N
1:    NA                  NA 102329
2:     1                   1    217
3:     1                   2    218
```

![[Pasted image 20210421095822.png]]

# batches 36, 37 (min clust size = 800)

- 1205 users matched criteria below

```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial',
              'bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')
dt2 <- dt2[bot_eng_overall > 0.5]

   clust clust_control_treat      N
1:    NA                  NA 101559
2:     1                   1    602
3:     1                   2    603
```

![[Pasted image 20210421100158.png]]

# batches 38, 39 (min clust size = 300)

- 358 users matched criteria below

```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial',
              'bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')
dt2 <- dt2[bot_eng_self_declared > 0.6]

   clust clust_control_treat      N
1:    NA                  NA 102406
2:     1                   1    179
3:     1                   2    179
```

![[Pasted image 20210421100632.png]]

# batches 40, 41 (min clust size = 300)

- 323 users matched criteria below

```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial',
              'bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')
dt2 <- dt2[bot_eng_spammer > 0.3]

   clust clust_control_treat      N
1:    NA                  NA 102441
2:     1                   1    161
3:     1                   2    162
```

![[Pasted image 20210421101215.png]]

# batches 42, 43

- manual bin `bot_cap_eng` into 25 equally-sized bins `bot_cap_eng_bin`

```r
    bot_cap_eng_bin        mi        ma        me       med    n
 1:               0 0.0000000 0.0000000 0.0000000 0.0000000 1129
 2:               1 0.2263183 0.2263183 0.2263183 0.2263183 1627
 3:               2 0.2520420 0.2520420 0.2520420 0.2520420 1725
 4:               3 0.2787260 0.2787260 0.2787260 0.2787260 1641
 5:               4 0.3061881 0.3061881 0.3061881 0.3061881 1541
 6:               5 0.3342309 0.3342309 0.3342309 0.3342309 1482
 7:               6 0.3626456 0.3626456 0.3626456 0.3626456 1487
 8:               7 0.3912159 0.4197222 0.4048158 0.3912159 2509
 9:               8 0.4479466 0.4479466 0.4479466 0.4479466 1136
10:               9 0.4756770 0.5027123 0.4885207 0.4756770 2086
11:              10 0.5288665 0.5539726 0.5416210 0.5539726 1807
12:              11 0.5778862 0.6004883 0.5888219 0.5778862 1732
13:              12 0.6216870 0.6414184 0.6315163 0.6216870 1624
14:              13 0.6596465 0.6763619 0.6679046 0.6596465 1510
15:              14 0.6915800 0.7176911 0.7049036 0.7053378 2099
16:              15 0.7287106 0.7384783 0.7334737 0.7287106 1335
17:              16 0.7470840 0.7611879 0.7540342 0.7546219 1810
18:              17 0.7668770 0.7759887 0.7714559 0.7717813 1774
19:              18 0.7795815 0.7874002 0.7834972 0.7826358 1983
20:              19 0.7892295 0.7930870 0.7911184 0.7907587 1559
21:              20 0.7939576 0.7965991 0.7954241 0.7957282 1644
22:              21 0.7966222 0.7970759 0.7968202 0.7968285 1746
23:              22 0.7971037 0.8036707 0.7992922 0.7989890 1813
24:              23 0.8054784 0.8395864 0.8200870 0.8199149 1783
25:              24 0.8457544 1.0000000 0.8864360 0.8829633 1625
    bot_cap_eng_bin        mi        ma        me       med    n
```

![[Pasted image 20210510110644.png]]

# batches 44, 45

- manual bin `bot_eng_fake_follower` into 25 equally-sized bins `bot_eng_fake_follower_bin`

```r
    cluster   mi   ma        me  med    n
 1:       0 0.00 0.00 0.0000000 0.00  887
 2:       1 0.01 0.01 0.0100000 0.01 1531
 3:       2 0.02 0.02 0.0200000 0.02 1797
 4:       3 0.03 0.03 0.0300000 0.03 1882
 5:       4 0.04 0.04 0.0400000 0.04 1913
 6:       5 0.05 0.05 0.0500000 0.05 1886
 7:       6 0.06 0.06 0.0600000 0.06 1695
 8:       7 0.07 0.07 0.0700000 0.07 1632
 9:       8 0.08 0.08 0.0800000 0.08 1471
10:       9 0.09 0.09 0.0900000 0.09 1445
11:      10 0.10 0.11 0.1049785 0.10 2561
12:      11 0.12 0.12 0.1200000 0.12 1111
13:      12 0.13 0.14 0.1347621 0.13 2039
14:      13 0.15 0.16 0.1550524 0.16 1910
15:      14 0.17 0.17 0.1700000 0.17  899
16:      15 0.18 0.20 0.1898462 0.19 2470
17:      16 0.21 0.22 0.2150906 0.22 1546
18:      17 0.23 0.24 0.2349749 0.23 1395
19:      18 0.25 0.27 0.2598353 0.26 1943
20:      19 0.28 0.30 0.2901061 0.29 1603
21:      20 0.31 0.34 0.3241068 0.32 1836
22:      21 0.35 0.38 0.3641765 0.36 1518
23:      22 0.39 0.45 0.4169846 0.42 1887
24:      23 0.46 0.56 0.5036695 0.50 1661
25:      24 0.57 1.00 0.7066543 0.67 1689
    cluster   mi   ma        me  med    n
```

![[Pasted image 20210510111518.png]]

# batches 46, 47

- manual bin `bot_eng_astroturf` into 25 equally-sized bins `bot_eng_astroturf_bin`

```r
    cluster   mi   ma         me  med    n
 1:       0 0.00 0.00 0.00000000 0.00  474
 2:       1 0.01 0.02 0.01585458 0.02 1843
 3:       2 0.03 0.03 0.03000000 0.03 1309
 4:       3 0.04 0.05 0.04520683 0.05 3046
 5:       4 0.06 0.06 0.06000000 0.06 1712
 6:       5 0.07 0.07 0.07000000 0.07 1616
 7:       6 0.08 0.08 0.08000000 0.08 1651
 8:       7 0.09 0.09 0.09000000 0.09 1646
 9:       8 0.10 0.10 0.10000000 0.10 1505
10:       9 0.11 0.11 0.11000000 0.11 1429
11:      10 0.12 0.13 0.12481365 0.12 2549
12:      11 0.14 0.14 0.14000000 0.14 1180
13:      12 0.15 0.16 0.15481517 0.15 2083
14:      13 0.17 0.17 0.17000000 0.17  900
15:      14 0.18 0.20 0.18965347 0.19 2424
16:      15 0.21 0.22 0.21480684 0.21 1346
17:      16 0.23 0.25 0.23962878 0.24 1751
18:      17 0.26 0.28 0.26985180 0.27 1552
19:      18 0.29 0.33 0.30933198 0.31 1976
20:      19 0.34 0.38 0.35909091 0.36 1606
21:      20 0.39 0.45 0.41912556 0.42 1784
22:      21 0.46 0.53 0.49366983 0.49 1684
23:      22 0.54 0.63 0.58303452 0.58 1651
24:      23 0.64 0.77 0.70164937 0.70 1734
25:      24 0.78 1.00 0.87140661 0.87 1756
    cluster   mi   ma         me  med    n
```

![[Pasted image 20210510111924.png]]

# batches 48, 49

- manual bin `statuses_count` into 10 equally-sized bins `statuses_count_bin`

```r
    cluster    mi      ma          me      med     n
 1:       1     1     388    163.2539    149.0 10261
 2:       2   389    1194    753.7337    738.0 10232
 3:       3  1195    2508   1800.6692   1780.0 10248
 4:       4  2509    4545   3461.6201   3439.0 10248
 5:       5  4546    7703   6009.0025   5948.0 10249
 6:       6  7704   12558   9970.8473   9895.0 10248
 7:       7 12559   20627  16215.4793  16015.0 10247
 8:       8 20628   35770  27306.7512  26883.5 10248
 9:       9 35771   71960  50526.9134  48912.0 10248
10:      10 71963 2721215 164694.3694 121011.0 10249
```

![[Pasted image 20210510113505.png]]

# batches 50, 51

- manual bin `friend_follow_ratio` into 20 equally-sized bins `friend_follow_ratio_bin`

```r
    cluster    mi      ma    me   med    n
 1:       1  0.00    0.14  0.05  0.05 5408
 2:       2  0.14    0.40  0.27  0.26 5409
 3:       3  0.40    0.66  0.53  0.53 5408
 4:       4  0.66    0.87  0.77  0.78 5409
 5:       5  0.87    0.99  0.94  0.94 5408
 6:       6  0.99    1.07  1.03  1.03 5409
 7:       7  1.07    1.20  1.13  1.12 5408
 8:       8  1.20    1.39  1.29  1.29 5409
 9:       9  1.39    1.62  1.50  1.50 5408
10:      10  1.62    1.89  1.75  1.74 5409
11:      11  1.89    2.19  2.03  2.02 5408
12:      12  2.19    2.60  2.39  2.39 5409
13:      13  2.60    3.11  2.85  2.84 5408
14:      14  3.11    3.80  3.44  3.43 5437
15:      15  3.80    4.75  4.25  4.23 5400
16:      16  4.75    6.13  5.39  5.34 5389
17:      17  6.13    8.50  7.20  7.11 5409
18:      18  8.51   14.00 10.84 10.62 5465
19:      19 14.03 6855.50 31.47 21.56 5352
```

![[Pasted image 20210510131239.png]]

# batch 52

- manual bin `bot_eng_fake_follower` into 120 equally-sized bins `bot_eng_fake_follower_bin`
- each bin has about 400 users

# batch 53

- manual bin `favourites_count` into 99 equally-sized bins
- each bin has about 1038 users

# batches 54, 55

- see [[20210514_232619 ML models to predict match rate]]
- manual bin `match_rate_pred` into 10 equally-sized bins
- batch 54: each bin has about 10k users
- batch 55 is 54 split into random halves

![[Pasted image 20210515000851.png]]

# batches 56, 57

- see [[20210514_232619 ML models to predict match rate]]
- manual bin `match_rate_pred` into 10 equally-sized bins
- batch 56: each bin has about 10k users
- batch 57 is 56 split into random halves

```r
# match rate by cluster
    cluster    mi    ma    mu   med     n
 1:       1 0.190 0.402 0.378 0.383 10277
 2:       2 0.402 0.427 0.415 0.416 10276
 3:       3 0.427 0.448 0.438 0.438 10276
 4:       4 0.448 0.468 0.458 0.458 10277
 5:       5 0.468 0.487 0.477 0.477 10276
 6:       6 0.487 0.508 0.497 0.497 10276
 7:       7 0.508 0.531 0.519 0.519 10277
 8:       8 0.531 0.562 0.546 0.545 10276
 9:       9 0.562 0.608 0.583 0.581 10276
10:      10 0.608 0.905 0.660 0.648 10277
```

![[Pasted image 20210516003008.png]]

# batches 58, 59

- newly scraped users!
- `"../data/users/users_tweet_count_2021-05-17_subset.csv"`
- manual bin `n_tweets` into 2 equally-sized bins

# batches 60, 61

- all users

# batch 62

- 24k users (out of 185k): shared > 3 but <= 30 newsguard links tweets

# batch 63

- 19k users
- min tweets per day 5
- min active days 8 (out of 10 to 12 days)
- 5k removed from batch 62 for "inactivity"
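
Sketch (not the code actually used for these batches): one possible way to do the "manual bin into N equally-sized bins" step and the "split into random halves" step from the later batches, in base R. Assumptions: the toy `users` data frame and the helpers `equal_bins()` and `random_halves()` are hypothetical, and ties are broken by position so bin sizes differ by at most 1. The tables above show uneven `n` where values are tied, so the original binning probably kept tied values together instead.

```r
# Hypothetical toy data; the real notes bin user-level features such as statuses_count.
set.seed(1)
users <- data.frame(statuses_count = rpois(1000, lambda = 50) * 37L)

# Rank-based binning (assumption): ties are broken by position,
# so every bin gets the same number of users, up to a difference of 1.
equal_bins <- function(x, n_bins) {
  r <- rank(x, ties.method = "first")
  as.integer(ceiling(r * n_bins / length(x)))
}

# Randomly assign the users inside each bin to half 1 or half 2.
random_halves <- function(bin) {
  ave(seq_along(bin), bin, FUN = function(i) sample(rep_len(1:2, length(i))))
}

users$clust <- equal_bins(users$statuses_count, n_bins = 10)
users$half  <- random_halves(users$clust)

# Same numbering pattern as the tables above: clust k maps to 2k - 1 and 2k.
users$clust_control_treat <- (users$clust - 1L) * 2L + users$half

print(table(users$clust_control_treat))
```

With `ties.method = "first"` two users with the same value can land in adjacent bins; swapping in a quantile-based `cut()` would keep ties together at the cost of uneven bin sizes.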