Failing to Predict across High Anxiety Levels

Author

Ian McFarlane

Exploratory analysis revealed little evidence of a relationship between high anxiety levels (8–10) and the available predictors, raising concerns about whether these classes contain enough signal to support meaningful modeling.

Mutual Information for Anxiety Level (8-10)

To evaluate this more formally, we computed the mutual information between Anxiety Level and each predictor—a measure of how much information a feature provides about the target. All values were approximately 0.01 or lower, indicating minimal predictive power.

Permutation tests confirmed these results. Only one predictor—Diet Quality—yielded a statistically significant mutual information score (MI ≈ 0.01, p < 0.01), suggesting a weak but detectable association. However, the effect is negligible in practical terms. All other predictors had MI < 0.01 and p-values > 0.10, reinforcing the conclusion that high anxiety levels cannot be reliably predicted from the available features.

Trying to Overfit a Model

As a diagnostic exercise, we intentionally overfit a decision tree model to probe for residual signal, using highly permissive parameters (mincriterion = 0.5, minsplit = 5). The outcome (Anxiety Level 8–10) was treated as categorical to increase sensitivity to subtle distinctions.

Even with relaxed constraints, the tree produced no splits—implying that no features provided even marginal discriminatory value. The confusion matrix confirms this: the model defaulted to predicting the majority class (8) for all observations, relying entirely on class frequency rather than learned structure.

Confusion Matrix and Statistics

          Reference
Prediction   8   9  10
        8  363 329 322
        9    0   0   0
        10   0   0   0

Overall Statistics
                                          
               Accuracy : 0.358           
                 95% CI : (0.3284, 0.3884)
    No Information Rate : 0.358           
    P-Value [Acc > NIR] : 0.5118          
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 8 Class: 9 Class: 10
Sensitivity             1.000   0.0000    0.0000
Specificity             0.000   1.0000    1.0000
Pos Pred Value          0.358      NaN       NaN
Neg Pred Value            NaN   0.6755    0.6824
Prevalence              0.358   0.3245    0.3176
Detection Rate          0.358   0.0000    0.0000
Detection Prevalence    1.000   0.0000    0.0000
Balanced Accuracy       0.500   0.5000    0.5000

Conclusion

The consistent failure of both information-theoretic and model-based methods confirms that anxiety levels 8–10 cannot be reliably distinguished from one another based on the available predictors. These observations form an indistinct cluster with no usable internal structure. Accordingly, we model them as a single high-anxiety class using binary logistic regression. For all remaining levels, we apply a separate linear regression to capture finer-grained variation. This stacked modeling approach respects the limits of the data while maximizing interpretability where possible.