![[correlations 2.jpg]]
## batches 10, 11 (min cluster size = 4000)
```r
cols2use <- c("verified", "followers_count", "friends_count", "listed_count", "favourites_count", "statuses_count", "name_len", "name_alpha_pct", "name_upper_pct", "screen_name_len", "screen_name_alpha_pct", "screen_name_upper_pct", "description_len", "description_alpha_pct", "description_upper_pct", "days_since_create")
clusters
1 2 3 4 5 6 7
4022 4014 4024 26160 39846 10722 13976
clust clust_control_treat N
1: 1 1 2011
2: 1 2 2011
3: 2 3 2007
4: 2 4 2007
5: 3 5 2012
6: 3 6 2012
7: 4 7 13080
8: 4 8 13080
9: 5 9 19923
10: 5 10 19923
11: 6 11 5361
12: 6 12 5361
13: 7 13 6988
14: 7 14 6988
```
![[Pasted image 20210417142926.png]]
![[Pasted image 20210417143253.png]]
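The `clust_control_treat` ids in the table above look like each cluster was split into two random halves, with cluster k's halves numbered 2k - 1 and 2k. A minimal base R sketch of that kind of split, on made-up cluster labels (the toy data and the `half` variable are assumptions, not the code used for these batches):

```r
set.seed(1)

# Toy cluster labels (made up): cluster 1 has 5 users, cluster 2 has 4
clust <- c(rep(1, 5), rep(2, 4))

# Within each cluster, shuffle a balanced vector of half labels (1 or 2).
# Starting the pattern with 2 gives the extra user of an odd-sized cluster
# to the second half.
half <- ave(seq_along(clust), clust, FUN = function(i) {
  labels <- rep_len(2:1, length(i))
  labels[sample.int(length(i))]
})

# Halves of cluster k get ids 2k - 1 and 2k
clust_control_treat <- 2 * (clust - 1) + half
table(clust, clust_control_treat)
```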
## batches 12, 13 (min cluster size = 4000)
- winsorize `c(0.05, 0.95)` all features before normalizing `[0, 1]`
- uploaded batch 13
```r
cols2use <- c("followers_count", "friends_count", "listed_count", "favourites_count", "statuses_count", "name_len", "name_alpha_pct", "screen_name_len", "screen_name_alpha_pct", "screen_name_upper_pct", "description_len", "description_alpha_pct", "description_upper_pct", "days_since_create")
clust clust_control_treat N
1: 1 1 20800
2: 1 2 20801
3: 2 3 4026
4: 2 4 4027
5: 3 5 3178
6: 3 6 3179
7: 4 7 8323
8: 4 8 8323
9: 5 9 15053
10: 5 10 15054
```
![[Pasted image 20210417173302.png]]
![[Pasted image 20210417173346.png]]
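A rough base R sketch of the preprocessing noted above (winsorize at the 5th/95th percentiles, then min-max scale to `[0, 1]`). The helpers `winsorize()` and `normalize01()` and the toy data are hypothetical and only illustrate the idea; the actual pipeline may differ:

```r
# Hypothetical helper: clamp values to the given lower/upper quantiles
winsorize <- function(x, probs = c(0.05, 0.95)) {
  q <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

# Hypothetical helper: min-max scale to [0, 1]
normalize01 <- function(x) {
  r <- range(x, na.rm = TRUE)
  if (r[1] == r[2]) return(rep(0, length(x)))  # constant column: avoid 0/0
  (x - r[1]) / (r[2] - r[1])
}

# Toy data with one extreme outlier (made up)
toy <- data.frame(
  followers_count = c(1:99, 1e6),
  friends_count   = seq(10, 1000, by = 10)
)
cols2use <- c("followers_count", "friends_count")

# Winsorize first so the outlier does not squash everything else toward 0
toy_scaled <- toy
toy_scaled[cols2use] <- lapply(toy[cols2use], function(x) normalize01(winsorize(x)))
summary(toy_scaled)
```

Winsorizing before scaling matters here because count features such as `followers_count` are heavy-tailed; without it a few huge accounts would compress most users into a narrow band near 0.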
## batches 14, 15 (min cluster size = 4000)
- winsorize `c(0.05, 0.95)` all features before normalizing `[0, 1]`
- uploaded batch 15
```r
cols2use <- c("followers_count", "friends_count", "listed_count", "favourites_count", "statuses_count", "name_len", "screen_name_len", "screen_name_alpha_pct", "screen_name_upper_pct", "description_len", "description_alpha_pct", "description_upper_pct", "days_since_create")
clusters
1 2 3 4 5
31059 10213 6077 13678 41737
clust clust_control_treat N
1: 1 1 15529
2: 1 2 15530
3: 2 3 5106
4: 2 4 5107
5: 3 5 3038
6: 3 6 3039
7: 4 7 6839
8: 4 8 6839
9: 5 9 20868
10: 5 10 20869
```
![[Pasted image 20210417181034.png]]
![[Pasted image 20210417181045.png]]
## batches 16, 17 (min cluster size = 1500)
- features: botometer metrics
- only 3803 users checked against botometer
- uploaded batches 16 and 17
- clust 1: bots?
- clust 2: non-bots?
```r
cols2use <- c('bot_cap_eng','bot_cap_uni','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial','bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer','bot_uni_astroturf','bot_uni_fake_follower','bot_uni_financial','bot_uni_other','bot_uni_overall','bot_uni_self_declared','bot_uni_spammer')
clust clust_control_treat N
1: NA NA 98961 # not checked yet
2: 1 1 750
3: 1 2 750
4: 2 3 1151
5: 2 4 1152
```
![[Pasted image 20210418214229.png]]
## batches 18, 19 (cluster size = 1335)
- features: botometer metrics
- only 3803 users checked against botometer
- removed bots "manually"
```r
# manually subset users: keep only accounts below these botometer thresholds
dt2 <- dt2[bot_cap_eng < 0.6 & bot_cap_uni < 0.6 & bot_eng_self_declared < 0.3 & bot_uni_self_declared < 0.5]
clust clust_control_treat N
1: NA NA 101429
2: 1 1 667
3: 1 2 668
```
![[Pasted image 20210418224657.png]]
## batches 20, 21 (min cluster size = 4000)
- winsorize `c(0.05, 0.95)` all features before normalizing `[0, 1]`
```r
# features chosen based on significant regression coefs
cols2use <- c('followers_count', 'friends_count', 'favourites_count', 'screen_name_len', 'days_since_create')
clust clust_control_treat N
1: 1 1 2609
2: 1 2 2609
3: 2 3 5234
4: 2 4 5234
5: 3 5 5183
6: 3 6 5184
7: 4 7 8786
8: 4 8 8787
9: 5 9 4940
10: 5 10 4940
11: 6 11 3965
12: 6 12 3965
13: 7 13 3359
14: 7 14 3359
15: 8 15 7524
16: 8 16 7525
17: 9 17 6527
18: 9 18 6527
19: 10 19 3253
20: 10 20 3254
```
![[Pasted image 20210419132132.png]]
![[Pasted image 20210419132111.png]]
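One way the "significant regression coefs" selection could look, sketched in base R on simulated data. The outcome name `match_rate`, the 0.05 cutoff, and the toy data frame are assumptions; the notes do not say which outcome or threshold was used:

```r
set.seed(7)
n <- 500

# Simulated features (made up); only two of them drive the outcome
toy <- data.frame(
  followers_count = rnorm(n),
  friends_count   = rnorm(n),
  name_len        = rnorm(n)
)
toy$match_rate <- 0.5 + 0.3 * toy$followers_count -
  0.2 * toy$friends_count + rnorm(n, sd = 0.1)

# Regress the outcome on all candidate features
fit <- lm(match_rate ~ ., data = toy)

# Keep features whose coefficient p-value is below the cutoff (intercept dropped)
pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]
cols2use <- names(pvals)[pvals < 0.05]
cols2use
```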
# batches 22, 23 (min clust size = 1000)
- features: botometer metrics
- only 3803 users checked against botometer
```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial','bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')
clust clust_control_treat N
1: NA NA 98961
2: 1 1 500
3: 1 2 500
4: 2 3 1401
5: 2 4 1402
```
![[Pasted image 20210419163315.png]]
# batches 24, 25 (min clust size = 1300)
- features: botometer metrics
- 7660 checked
```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial','bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')
clust clust_control_treat N
1: NA NA 95104
2: 1 1 1847
3: 1 2 1847
4: 2 3 1147
5: 2 4 1147
6: 3 5 836
7: 3 6 836
```
![[Pasted image 20210420215015.png]]
# batches 26, 27 (min clust size = 900)
- features: botometer metrics
- 7660 checked
```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial','bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')
clust clust_control_treat N
1: NA NA 95104
2: 1 1 1159
3: 1 2 1159
4: 2 3 1510
5: 2 4 1511
6: 3 5 450 # highest so far
7: 3 6 450 # highest so far
8: 4 7 710
9: 4 8 711
```
![[Pasted image 20210420215533.png]]
# batches 28, 29 (min clust size = 15000)
- winsorize `c(0.05, 0.95)` all features before normalizing `[0, 1]`
```r
cols2use <- c("followers_count", "friends_count", "listed_count", "favourites_count", "statuses_count", "name_len", "name_alpha_pct", "name_upper_pct", "screen_name_len", "screen_name_alpha_pct", "screen_name_upper_pct", "description_len", "description_alpha_pct", "description_upper_pct", "days_since_create")
clust clust_control_treat N
1: 1 1 7976
2: 1 2 7976
3: 2 3 43406
4: 2 4 43406
```
![[Pasted image 20210421093039.png]]
![[Pasted image 20210421093027.png]]
# batches 30, 31 (min clust size = 700)
- 2980 users matched the criteria below
```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial','bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')
dt2 <- dt2[bot_cap_eng %between% c(0.45, 0.78)]
clust clust_control_treat N
1: NA NA 99784
2: 1 1 1126
3: 1 2 1127
4: 2 3 363
5: 2 4 364
```
![[Pasted image 20210421094546.png]]
# batches 32, 33 (min clust size = 700)
- 1306 users matched criteria below
```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial','bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')
dt2 <- dt2[bot_eng_astroturf > 0.45]
clust clust_control_treat N
1: NA NA 101458
2: 1 1 653
3: 1 2 653
```
![[Pasted image 20210421095347.png]]
# batches 34, 35 (min clust size = 400)
- 435 users matched criteria below
```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial','bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')
dt2 <- dt2[bot_eng_fake_follower > 0.5]
clust clust_control_treat N
1: NA NA 102329
2: 1 1 217
3: 1 2 218
```
![[Pasted image 20210421095822.png]]
# batches 36, 37 (min clust size = 800)
- 1205 users matched criteria below
```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial','bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')
dt2 <- dt2[bot_eng_overall > 0.5]
clust clust_control_treat N
1: NA NA 101559
2: 1 1 602
3: 1 2 603
```
![[Pasted image 20210421100158.png]]
# batches 38, 39 (min clust size = 300)
- 358 users matched criteria below
```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial','bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')
dt2 <- dt2[bot_eng_self_declared > 0.6]
clust clust_control_treat N
1: NA NA 102406
2: 1 1 179
3: 1 2 179
```
![[Pasted image 20210421100632.png]]
# batches 40, 41 (min clust size = 300)
- 323 users matched criteria below
```r
cols2use <- c('bot_cap_eng','bot_eng_astroturf','bot_eng_fake_follower','bot_eng_financial','bot_eng_other','bot_eng_overall','bot_eng_self_declared','bot_eng_spammer')
dt2 <- dt2[bot_eng_spammer > 0.3]
clust clust_control_treat N
1: NA NA 102441
2: 1 1 161
3: 1 2 162
```
![[Pasted image 20210421101215.png]]
# batches 42, 43
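- manual bin `bot_cap_eng` into 25 roughly equal-sized bins `bot_cap_eng_bin` (sizes range from 1129 to 2509 in the table below, likely because tied scores stay in the same bin)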
```r
bot_cap_eng_bin mi ma me med n
1: 0 0.0000000 0.0000000 0.0000000 0.0000000 1129
2: 1 0.2263183 0.2263183 0.2263183 0.2263183 1627
3: 2 0.2520420 0.2520420 0.2520420 0.2520420 1725
4: 3 0.2787260 0.2787260 0.2787260 0.2787260 1641
5: 4 0.3061881 0.3061881 0.3061881 0.3061881 1541
6: 5 0.3342309 0.3342309 0.3342309 0.3342309 1482
7: 6 0.3626456 0.3626456 0.3626456 0.3626456 1487
8: 7 0.3912159 0.4197222 0.4048158 0.3912159 2509
9: 8 0.4479466 0.4479466 0.4479466 0.4479466 1136
10: 9 0.4756770 0.5027123 0.4885207 0.4756770 2086
11: 10 0.5288665 0.5539726 0.5416210 0.5539726 1807
12: 11 0.5778862 0.6004883 0.5888219 0.5778862 1732
13: 12 0.6216870 0.6414184 0.6315163 0.6216870 1624
14: 13 0.6596465 0.6763619 0.6679046 0.6596465 1510
15: 14 0.6915800 0.7176911 0.7049036 0.7053378 2099
16: 15 0.7287106 0.7384783 0.7334737 0.7287106 1335
17: 16 0.7470840 0.7611879 0.7540342 0.7546219 1810
18: 17 0.7668770 0.7759887 0.7714559 0.7717813 1774
19: 18 0.7795815 0.7874002 0.7834972 0.7826358 1983
20: 19 0.7892295 0.7930870 0.7911184 0.7907587 1559
21: 20 0.7939576 0.7965991 0.7954241 0.7957282 1644
22: 21 0.7966222 0.7970759 0.7968202 0.7968285 1746
23: 22 0.7971037 0.8036707 0.7992922 0.7989890 1813
24: 23 0.8054784 0.8395864 0.8200870 0.8199149 1783
25: 24 0.8457544 1.0000000 0.8864360 0.8829633 1625
bot_cap_eng_bin mi ma me med n
```
![[Pasted image 20210510110644.png]]
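A sketch of quantile binning that would produce a summary like the one above, in base R. It assumes bins are cut at empirical quantiles so tied scores never straddle two bins, which is one explanation for the uneven `n`; the toy scores, `score_bin`, and `summ` are made up, and bins here are numbered from 1 rather than 0:

```r
set.seed(42)

# Toy scores rounded to two decimals, so many values are tied (made up)
score <- round(rbeta(5000, 2, 5), 2)
n_bins <- 25

# Breaks at empirical quantiles; type = 1 returns observed values, and
# unique() drops duplicated breaks caused by ties
breaks <- unique(quantile(score, probs = seq(0, 1, length.out = n_bins + 1),
                          type = 1, names = FALSE))
score_bin <- cut(score, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Per-bin min, max, mean, median, and count (same columns as the table above)
summ <- do.call(rbind, lapply(split(score, score_bin), function(v) {
  data.frame(mi = min(v), ma = max(v), me = mean(v), med = median(v), n = length(v))
}))
summ
```

With heavily tied scores the number of bins can end up below `n_bins`, and bin sizes drift away from `length(score) / n_bins`.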
# batches 44, 45
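- manual bin `bot_eng_fake_follower` into 25 roughly equal-sized bins `bot_eng_fake_follower_bin` (sizes range from 887 to 2561 in the table below, likely because tied scores stay in the same bin)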
```r
cluster mi ma me med n
1: 0 0.00 0.00 0.0000000 0.00 887
2: 1 0.01 0.01 0.0100000 0.01 1531
3: 2 0.02 0.02 0.0200000 0.02 1797
4: 3 0.03 0.03 0.0300000 0.03 1882
5: 4 0.04 0.04 0.0400000 0.04 1913
6: 5 0.05 0.05 0.0500000 0.05 1886
7: 6 0.06 0.06 0.0600000 0.06 1695
8: 7 0.07 0.07 0.0700000 0.07 1632
9: 8 0.08 0.08 0.0800000 0.08 1471
10: 9 0.09 0.09 0.0900000 0.09 1445
11: 10 0.10 0.11 0.1049785 0.10 2561
12: 11 0.12 0.12 0.1200000 0.12 1111
13: 12 0.13 0.14 0.1347621 0.13 2039
14: 13 0.15 0.16 0.1550524 0.16 1910
15: 14 0.17 0.17 0.1700000 0.17 899
16: 15 0.18 0.20 0.1898462 0.19 2470
17: 16 0.21 0.22 0.2150906 0.22 1546
18: 17 0.23 0.24 0.2349749 0.23 1395
19: 18 0.25 0.27 0.2598353 0.26 1943
20: 19 0.28 0.30 0.2901061 0.29 1603
21: 20 0.31 0.34 0.3241068 0.32 1836
22: 21 0.35 0.38 0.3641765 0.36 1518
23: 22 0.39 0.45 0.4169846 0.42 1887
24: 23 0.46 0.56 0.5036695 0.50 1661
25: 24 0.57 1.00 0.7066543 0.67 1689
cluster mi ma me med n
```
![[Pasted image 20210510111518.png]]
# batches 46, 47
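- manual bin `bot_eng_astroturf` into 25 roughly equal-sized bins `bot_eng_astroturf_bin` (sizes range from 474 to 3046 in the table below, likely because tied scores stay in the same bin)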
```r
cluster mi ma me med n
1: 0 0.00 0.00 0.00000000 0.00 474
2: 1 0.01 0.02 0.01585458 0.02 1843
3: 2 0.03 0.03 0.03000000 0.03 1309
4: 3 0.04 0.05 0.04520683 0.05 3046
5: 4 0.06 0.06 0.06000000 0.06 1712
6: 5 0.07 0.07 0.07000000 0.07 1616
7: 6 0.08 0.08 0.08000000 0.08 1651
8: 7 0.09 0.09 0.09000000 0.09 1646
9: 8 0.10 0.10 0.10000000 0.10 1505
10: 9 0.11 0.11 0.11000000 0.11 1429
11: 10 0.12 0.13 0.12481365 0.12 2549
12: 11 0.14 0.14 0.14000000 0.14 1180
13: 12 0.15 0.16 0.15481517 0.15 2083
14: 13 0.17 0.17 0.17000000 0.17 900
15: 14 0.18 0.20 0.18965347 0.19 2424
16: 15 0.21 0.22 0.21480684 0.21 1346
17: 16 0.23 0.25 0.23962878 0.24 1751
18: 17 0.26 0.28 0.26985180 0.27 1552
19: 18 0.29 0.33 0.30933198 0.31 1976
20: 19 0.34 0.38 0.35909091 0.36 1606
21: 20 0.39 0.45 0.41912556 0.42 1784
22: 21 0.46 0.53 0.49366983 0.49 1684
23: 22 0.54 0.63 0.58303452 0.58 1651
24: 23 0.64 0.77 0.70164937 0.70 1734
25: 24 0.78 1.00 0.87140661 0.87 1756
cluster mi ma me med n
```
![[Pasted image 20210510111924.png]]
# batches 48, 49
- manual bin `statuses_count` into 10 equally-sized bins `statuses_count_bin`
```r
cluster mi ma me med n
1: 1 1 388 163.2539 149.0 10261
2: 2 389 1194 753.7337 738.0 10232
3: 3 1195 2508 1800.6692 1780.0 10248
4: 4 2509 4545 3461.6201 3439.0 10248
5: 5 4546 7703 6009.0025 5948.0 10249
6: 6 7704 12558 9970.8473 9895.0 10248
7: 7 12559 20627 16215.4793 16015.0 10247
8: 8 20628 35770 27306.7512 26883.5 10248
9: 9 35771 71960 50526.9134 48912.0 10248
10: 10 71963 2721215 164694.3694 121011.0 10249
```
![[Pasted image 20210510113505.png]]
# batches 50, 51
- manual bin `friend_follow_ratio` into 20 equally-sized bins `friend_follow_ratio_bin` (the table below lists 19 bins)
```r
cluster mi ma me med n
1: 1 0.00 0.14 0.05 0.05 5408
2: 2 0.14 0.40 0.27 0.26 5409
3: 3 0.40 0.66 0.53 0.53 5408
4: 4 0.66 0.87 0.77 0.78 5409
5: 5 0.87 0.99 0.94 0.94 5408
6: 6 0.99 1.07 1.03 1.03 5409
7: 7 1.07 1.20 1.13 1.12 5408
8: 8 1.20 1.39 1.29 1.29 5409
9: 9 1.39 1.62 1.50 1.50 5408
10: 10 1.62 1.89 1.75 1.74 5409
11: 11 1.89 2.19 2.03 2.02 5408
12: 12 2.19 2.60 2.39 2.39 5409
13: 13 2.60 3.11 2.85 2.84 5408
14: 14 3.11 3.80 3.44 3.43 5437
15: 15 3.80 4.75 4.25 4.23 5400
16: 16 4.75 6.13 5.39 5.34 5389
17: 17 6.13 8.50 7.20 7.11 5409
18: 18 8.51 14.00 10.84 10.62 5465
19: 19 14.03 6855.50 31.47 21.56 5352
```
![[Pasted image 20210510131239.png]]
# batch 52
- manual bin `bot_eng_fake_follower` into 120 equally-sized bins `bot_eng_fake_follower_bin`
- each bin has about 400 users
# batch 53
- manual bin `favourites_count` into 99 equally-sized bins
- each bin has about 1038 users
# batches 54, 55
- see [[20210514_232619 ML models to predict match rate]]
- manual bin `match_rate_pred` into 10 equally-sized bins
- batch 54: each bin has about 10k users
- batch 55 is 54 split into random halves
![[Pasted image 20210515000851.png]]
# batches 56, 57
- see [[20210514_232619 ML models to predict match rate]]
- manual bin `match_rate_pred` into 10 equally-sized bins
- batch 56: each bin has about 10k users
- batch 57 is 56 split into random halves
```r
# match rate by cluster
cluster mi ma mu med n
1: 1 0.190 0.402 0.378 0.383 10277
2: 2 0.402 0.427 0.415 0.416 10276
3: 3 0.427 0.448 0.438 0.438 10276
4: 4 0.448 0.468 0.458 0.458 10277
5: 5 0.468 0.487 0.477 0.477 10276
6: 6 0.487 0.508 0.497 0.497 10276
7: 7 0.508 0.531 0.519 0.519 10277
8: 8 0.531 0.562 0.546 0.545 10276
9: 9 0.562 0.608 0.583 0.581 10276
10: 10 0.608 0.905 0.660 0.648 10277
```
![[Pasted image 20210516003008.png]]
# batches 58, 59
- newly scraped users!
- `"../data/users/users_tweet_count_2021-05-17_subset.csv"`
- manual bin `n_tweets` into 2 equally-sized bins
# batches 60, 61
- all users
# batch 62
- 24k users (out of 185k) who shared > 3 but <= 30 tweets with NewsGuard links
# batch 63
- 19k users
- min tweets per day: 5
- min active days: 8 (out of 10 to 12 days)
- 5k removed from batch 62 for "inactivity"
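A possible reading of the activity filter, sketched in base R. It assumes "tweets per day" is the mean daily count over the observation window and an "active day" is any day with at least one tweet; both definitions, the `daily` layout, and the toy users are assumptions, since the notes do not spell them out:

```r
# Toy per-user daily tweet counts, one row per user per day (made up)
daily <- data.frame(
  user_id  = rep(c("a", "b", "c"), each = 10),
  day      = rep(1:10, times = 3),
  n_tweets = c(rep(6, 10),                # a: 6 tweets every day
               rep(20, 5), rep(0, 5),     # b: high volume, only 5 active days
               rep(1, 10))                # c: active every day, 1 tweet per day
)

min_tweets_per_day <- 5
min_active_days <- 8

# Per-user mean daily tweets and number of days with any tweet
tweets_per_day <- tapply(daily$n_tweets, daily$user_id, mean)
active_days <- tapply(daily$n_tweets, daily$user_id, function(v) sum(v > 0))

# Keep users that pass both thresholds
keep <- names(tweets_per_day)[tweets_per_day >= min_tweets_per_day &
                                active_days >= min_active_days]
keep
```

Under these assumptions user `b` is dropped despite a high average, because the tweets are concentrated in too few days.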