220707_150753 exploratory moderation grf causal forest HTE analysis

- used generalized random forests (GRFs) to find heterogeneous treatment effects
- days/DVs chosen based on [[220518_170142 campaign 3 daily analysis and results|earlier results]]
- fixed parameters
    - winsorize: 0.95
    - cluster on block

# summary of results

Performed GRFs on select days (those with the strongest treatment effects, i.e., the treatment group shared less bad stuff). Tried it on four different days/DVs (see the 4 sections below).

Overall, no significant heterogeneous treatment effects (HTE) in any of the analyses.

But if we look at which individual covariates (of 13) are most important and most likely to moderate treatment effects, it's usually the outcome itself measured pre-campaign (`t0`), total activity during the campaign, or activity prior to the campaign.

**But surprisingly, the effects are generally negative (see figures below): users with greater activity or who shared more bad stuff (x-axes) tend to have a greater (more negative) treatment effect (y-axes). Most of the figures below show the same trend**.

How to perform these analyses across days? Analyses done on single days probably don't have enough statistical power to detect HTEs, so it would be great if we could somehow aggregate across days...

# fc summed badness, threshold 50, 2021-10-23

- file: `../data-v09-badness-daily/dv_fc_badness_threshold50_day2021-10-23.csv`
- somewhat significant condition effect (see quasipoisson model below)

```r
> m <- feglm(t1 ~ conditionC * t0LC | block, dt1, family = "quasipoisson")
NOTE: 1,530 fixed-effects (6,782 observations) removed because of only 0 outcomes.
> m
GLM estimation, family = quasipoisson, Dep. Var.: t1
Observations: 20,968
Fixed-effects: block: 3,889
Standard-errors: Clustered (block)
                 Estimate Std. Error  t value  Pr(>|t|)
conditionC      -0.092554   0.035737 -2.58983 0.0096381 **
t0LC             0.140713   0.010101 13.93079 < 2.2e-16 ***
conditionC:t0LC  0.029397   0.012812  2.29443 0.0218191 *
```

**Overall**, no heterogeneous treatment effects (HTE).

```r
Best linear fit using forest predictions (on held-out data)
as well as the mean forest prediction as regressors, along
with one-sided heteroskedasticity-robust (HC3) SEs:

                               Estimate Std. Error t value   Pr(>t)
mean.forest.prediction         1.019108   0.432859  2.3544 0.009281 **
differential.forest.prediction 0.086326   0.380972  0.2266 0.410371    # not significant - no HTE
```

covariate importance

```r
# cov variable importance (sorted most to least important)
                covariate        imp
 1:                    t0 0.24765623
 2:        total_activity 0.23195134
 3:           total_rt_t0 0.12609527
 4:        statuses_count 0.08421867
 5:      favourites_count 0.06184099
 6:         friends_count 0.04580157
 7:       description_len 0.04376029
 8:       followers_count 0.04265408
 9:   friend_follow_ratio 0.03835385
10:     days_since_create 0.02967993
11: description_alpha_pct 0.01878629
12:        name_alpha_pct 0.01462584
13:              name_len 0.01457565
```

Stronger treatment effect for users with higher `t0` values (the outcome, but measured before campaign).

![[t0.png]]

![[1657922619.png]]

![[1657922750.png]]

### but when `t0` is the ONLY covariate used in the analysis (instead of 13 covariates)

We see potentially opposite effects (more bad stuff shared pre-campaign is associated with weaker treatment effect) relative to when we used all 13 covariates - consistent with the positive `conditionC:t0LC` linear interaction effect (p = .02), but the figure below looks more like a null interaction effect?

![[1658170379.png]]

# fc count badness, threshold 65, 2021-10-23

- file: `../data-v09-badness-daily/dv_fc_badness_threshold65_day2021-10-23.csv`

```r
GLM estimation, family = quasipoisson, Dep. Var.: t1
Observations: 18,496
Fixed-effects: block: 3,399
Standard-errors: Clustered (block)
                 Estimate Std. Error  t value  Pr(>|t|)
conditionC      -0.093240   0.041005 -2.27386  0.023037 *
t0LC             0.481618   0.024707 19.49328 < 2.2e-16 ***
conditionC:t0LC  0.053403   0.025872  2.06410  0.039084 *
```

No HTE.

```r
Best linear fit using forest predictions (on held-out data)
as well as the mean forest prediction as regressors, along
with one-sided heteroskedasticity-robust (HC3) SEs:

                               Estimate Std. Error t value  Pr(>t)
mean.forest.prediction          1.00553    0.43871  2.2921 0.01096 *
differential.forest.prediction  0.12120    0.40697  0.2978 0.38293
```

covariate importance

```r
                covariate        imp
 1:                    t0 0.26697278
 2:        total_activity 0.21717985
 3:           total_rt_t0 0.12697095
 4:        statuses_count 0.08707942
 5:      favourites_count 0.05294394
 6:         friends_count 0.05156918
 7:   friend_follow_ratio 0.04555643
 8:       followers_count 0.03920187
 9:       description_len 0.03177402
10:     days_since_create 0.03137276
11: description_alpha_pct 0.01883286
12:              name_len 0.01658380
13:        name_alpha_pct 0.01396215
```

![[1657923587.png]]

![[1657923611.png]]

# mbfc_min sum badness, threshold 80, 2021-10-22

- file: `../data-v09-badness-daily/dv_mbfc_min_badness_threshold80_day2021-10-22.csv`

```r
> m <- feglm(t1 ~ conditionC * t0LC | block, dt1, family = "quasipoisson")
NOTE: 202 fixed-effects (702 observations) removed because of only 0 outcomes.
> m
GLM estimation, family = quasipoisson, Dep. Var.: t1
Observations: 27,718
Fixed-effects: block: 5,219
Standard-errors: Clustered (block)
                 Estimate Std. Error  t value   Pr(>|t|)
conditionC      -0.036149   0.015626 -2.31344 2.0737e-02 *
t0LC             0.019798   0.004617  4.28795 1.8359e-05 ***
conditionC:t0LC  0.007465   0.007078  1.05469 2.9162e-01
```

No HTE.

```r
Best linear fit using forest predictions (on held-out data)
as well as the mean forest prediction as regressors, along
with one-sided heteroskedasticity-robust (HC3) SEs:

                               Estimate Std. Error t value  Pr(>t)
mean.forest.prediction          0.97994    0.44240  2.2151 0.01338 *
differential.forest.prediction -0.10503    0.47918 -0.2192 0.58675
```

covariate importance

```
                covariate        imp
 1:        total_activity 0.21904111
 2:           total_rt_t0 0.20239302
 3:                    t0 0.11599926
 4:         friends_count 0.10135643
 5:        statuses_count 0.09968321
 6:      favourites_count 0.07836697
 7:       followers_count 0.03859189
 8:     days_since_create 0.03823481
 9:       description_len 0.02555517
10:   friend_follow_ratio 0.02341081
11:              name_len 0.02074606
12: description_alpha_pct 0.01976041
13:        name_alpha_pct 0.01686086
```

![[1657924283.png]]

# afm_min fraction badness, threshold 80, 2021-10-22

- file: `../data-v09-badness-daily/dv_afm_min_badness_threshold80_day2021-10-21.csv`

```r
OLS estimation, Dep. Var.: t1
Observations: 28,674
Fixed-effects: block: 5,421
Standard-errors: Clustered (block)
                 Estimate Std. Error  t value  Pr(>|t|)
conditionC      -0.002031   0.000831 -2.44366  0.014571 *
t0LC             0.687980   0.012675 54.27834 < 2.2e-16 ***
conditionC:t0LC -0.037243   0.019566 -1.90348  0.057031 .
```

No HTE

```r
Best linear fit using forest predictions (on held-out data)
as well as the mean forest prediction as regressors, along
with one-sided heteroskedasticity-robust (HC3) SEs:

                               Estimate Std. Error t value  Pr(>t)
mean.forest.prediction          1.01617    0.76373  1.3305 0.09168 .
differential.forest.prediction -0.27473    0.46446 -0.5915 0.72291
```

covariate importance

```r
                covariate        imp
 1:                    t0 0.20769795
 2:       followers_count 0.13051875
 3:   friend_follow_ratio 0.12117910
 4:        total_activity 0.08039265
 5:         friends_count 0.07972335
 6:     days_since_create 0.07204435
 7:      favourites_count 0.07175145
 8:           total_rt_t0 0.07051493
 9:        statuses_count 0.04879788
10:       description_len 0.04353977
11: description_alpha_pct 0.03048560
12:              name_len 0.02864845
13:        name_alpha_pct 0.01470578
```

![[1657924882.png]]

![[1657924911.png]]
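
# sketch: how the per-DV outputs could be produced

Each section above reports the same two GRF outputs: a calibration test ("Best linear fit using forest predictions...") and a covariate-importance table. The note does not include the code that produced them, so what follows is only a minimal sketch of one way to get outputs of that shape with the `grf` package. Several things here are assumptions rather than the actual analysis: the data are simulated (the real per-day CSVs are not read), only 3 of the 13 covariates are used, the helper `winsorize_upper()` is hypothetical, "winsorize: 0.95" is assumed to mean capping the outcome at its 95th percentile, and `num.trees = 500` is an arbitrary choice.

```r
# Sketch only: simulated data stand in for the real per-day CSVs.
library(grf)

set.seed(1)

# Hypothetical helper: cap the upper tail at quantile p
# (assumed meaning of the note's "winsorize: 0.95").
winsorize_upper <- function(x, p = 0.95) {
  pmin(x, quantile(x, p, na.rm = TRUE))
}

# Simulated users in blocks of 4, with a randomized binary condition.
n <- 2000
dt <- data.frame(
  block          = rep(seq_len(n / 4), each = 4),
  condition      = rbinom(n, 1, 0.5),
  t0             = rpois(n, 3),    # outcome measured pre-campaign
  total_activity = rpois(n, 20),
  total_rt_t0    = rpois(n, 5)
)
# Simulated post-campaign outcome with a small negative treatment effect.
dt$t1 <- rpois(n, lambda = exp(0.5 + 0.1 * log1p(dt$t0) - 0.1 * dt$condition))

covariates <- c("t0", "total_activity", "total_rt_t0")
X <- as.matrix(dt[, covariates])
Y <- winsorize_upper(dt$t1, 0.95)
W <- dt$condition

# Causal forest with block-level clustering ("cluster on block").
cf <- causal_forest(X, Y, W, clusters = dt$block, num.trees = 500)

# Omnibus calibration test: the differential.forest.prediction row is the
# one read as evidence for or against heterogeneity.
calib <- test_calibration(cf)
print(calib)

# Covariate importance, sorted from most to least important.
imp <- data.frame(covariate = covariates,
                  imp = as.numeric(variable_importance(cf)))
imp <- imp[order(-imp$imp), ]
print(imp)
```

In `test_calibration()` output the p-values are one-sided, which is why a negative `differential.forest.prediction` estimate (as in the mbfc_min and afm_min sections) comes with a p-value above 0.5. The importance scores describe how often the forest split on each covariate, not the direction or significance of any moderation, so they are best read alongside the figures.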