Exploratory Data Analysis of Anxiety and Predictors

Author

Ian McFarlane

This exploratory data analysis examines how a range of behavioral, physiological, and demographic predictors relate to self-reported anxiety levels (1–10). The goal is to surface patterns, distributions, and potential structural distinctions that may guide future modeling decisions.

Visualizations

Single Variable Plots

To guide univariate visualizations, predictors were grouped into five structural types: Categorical (Few), Categorical (Many), Ordinal, Discrete, and Continuous. Each type was paired with a suitable plot, for example bar plots for discrete or ordinal variables and histograms with overlaid density curves for continuous ones. The histograms use 10 bins and the density curves use adjust = 1.5, which smooths sampling artifacts without distorting the overall shape. This approach balances visual clarity with representational accuracy.
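The binning and smoothing settings can be illustrated with a minimal standard-library sketch. It assumes that adjust acts as a multiplier on a rule-of-thumb kernel bandwidth (as it does in R's density function); the report does not state which plotting library was used, so the bandwidth rule, the function names, and the sample values below are all illustrative assumptions rather than the author's code.

```python
import math
import statistics


def histogram_counts(values, bins=10):
    """Count values in `bins` equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant column
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts


def rule_of_thumb_bandwidth(values):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n ** (-1/5)."""
    n = len(values)
    sd = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = min(sd, (q3 - q1) / 1.34) if q3 > q1 else sd
    return 0.9 * spread * n ** (-0.2)


def gaussian_kde(values, grid, adjust=1.5):
    """Gaussian kernel density on `grid`; `adjust` scales the bandwidth (assumed meaning)."""
    h = adjust * rule_of_thumb_bandwidth(values)
    norm = len(values) * h * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values) / norm
        for x in grid
    ]


# Made-up sleep-hour values, used only to exercise the functions.
sleep_hours = [5.1, 5.8, 6.0, 6.3, 6.5, 6.7, 6.7, 6.9, 7.2, 7.4, 7.9, 8.6]
step = 0.05
grid = [i * step for i in range(281)]  # 0.00 to 14.00
counts = histogram_counts(sleep_hours, bins=10)
density = gaussian_kde(sleep_hours, grid, adjust=1.5)
```

A larger adjust value widens each kernel, so the curve becomes flatter and less sensitive to individual observations; a value of 1 would reproduce the unadjusted rule-of-thumb bandwidth.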

Most variables are roughly uniform. Notable exceptions include Sleep Hours (approximately normal), Caffeine Intake (slightly right-skewed, with a skewness of 0.324 and a mean above the median), Physical Activity (moderately right-skewed), and Therapy Sessions (strongly right-skewed); the right skew in the last two may point toward a log transformation when modeling. Anxiety Level is also right-skewed, but it is treated as an ordinal response and should not be transformed.
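As a worked check on the skew direction, the sketch below computes the mean, standard deviation, and moment-based skewness directly from a frequency table. It uses the ten-level count table from the univariate summaries whose shape matches the right skew described for Anxiety Level; attributing that table to Anxiety Level is an assumption based on its position in the variable list, and the report's own skewness estimator may differ slightly from the simple population formula used here.

```python
import math


def moments_from_counts(counts):
    """Return (mean, population SD, moment skewness) from a {value: count} table."""
    n = sum(counts.values())
    mean = sum(v * c for v, c in counts.items()) / n
    m2 = sum(c * (v - mean) ** 2 for v, c in counts.items()) / n  # 2nd central moment
    m3 = sum(c * (v - mean) ** 3 for v, c in counts.items()) / n  # 3rd central moment
    return mean, math.sqrt(m2), m3 / m2 ** 1.5


# Counts per level 1-10, copied from the summary tables (assumed to be Anxiety Level).
anxiety_counts = {1: 1039, 2: 1756, 3: 2407, 4: 2416, 5: 1629,
                  6: 616, 7: 123, 8: 363, 9: 329, 10: 322}

mean, sd, skew = moments_from_counts(anxiety_counts)
print(f"mean={mean:.3f} sd={sd:.3f} skewness={skew:.3f}")
```

A positive skewness means the longer tail lies to the right of the bulk of the data, which is the sense in which "right-skewed" is used above.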

Frequency counts and summary statistics for each variable are listed below.

Gender
Female	Male	Other
3730	3657	3613

Binary predictors
Variable	No	Yes
Smoking	5221	5779
Family History of Anxiety	5153	5847
Dizziness	5328	5672
Medication	5334	5666
Recent Major Life Event	5377	5623

Occupation
Artist	Athlete	Chef	Doctor	Engineer	Freelancer	Lawyer	Musician	Nurse	Other	Scientist	Student	Teacher
888	822	858	842	833	838	809	892	861	840	832	878	807

Stress Level (1-10)
1	2	3	4	5	6	7	8	9	10
978	999	964	976	1037	998	1006	1326	1335	1381

Sweating Level (1-5)
1	2	3	4	5
1978	2092	2323	2279	2328

Diet Quality (1-10)
1	2	3	4	5	6	7	8	9	10
1281	1291	1254	1260	994	985	991	960	1002	982

Anxiety Level (1-10)
1	2	3	4	5	6	7	8	9	10
1039	1756	2407	2416	1629	616	123	363	329	322

Numeric predictors
Variable	Mean	Median	SD	Skewness
Alcohol Consumption (drinks/week)	9.702	10	5.69	-0.023
Therapy Sessions (per month)	2.428	2	2.183	1.035
Breathing Rate (breaths/min)	20.958	21	5.16	-0.141
Age	40.242	40	13.236	0.097
Sleep Hours	6.651	6.7	1.228	-0.224
Physical Activity (hrs/week)	2.942	2.8	1.828	0.507
Caffeine Intake (mg/day)	286.09	273	144.813	0.324
Heart Rate (bpm)	90.916	92	17.326	-0.119

Anxiety Level vs Other Variable Plots

To explore how each variable relates to the response variable Anxiety Level (1–10), we selected visualization strategies matched to variable type. Although Anxiety Level is ordinal, it can be treated as categorical or numeric depending on the plot. Heatmaps were used for ordinal and discrete predictors, boxplots for continuous predictors, and bar or density plots for categorical predictors. These choices are intended to keep the plots interpretable given the discrete nature of the anxiety ratings.
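For the ordinal and discrete pairings, a heatmap is essentially a colored cross-tabulation. The sketch below builds such a count grid and optionally rescales each anxiety level to shares, so that sparsely populated levels remain comparable. The function names and the sample values are hypothetical and are not taken from the report's data.

```python
from collections import Counter


def count_grid(predictor, anxiety, predictor_levels, anxiety_levels):
    """Rows are predictor levels, columns are anxiety levels, cells are pair counts."""
    pair_counts = Counter(zip(predictor, anxiety))
    return [[pair_counts[(p, a)] for a in anxiety_levels] for p in predictor_levels]


def column_shares(grid):
    """Rescale each column (one anxiety level) so that it sums to 1."""
    totals = [sum(col) for col in zip(*grid)]
    return [[cell / t if t else 0.0 for cell, t in zip(row, totals)] for row in grid]


# Made-up Sweating Level (1-5) and Anxiety Level (1-10) responses.
sweating = [1, 2, 2, 3, 4, 5, 5, 4]
anxiety = [1, 2, 3, 3, 8, 9, 10, 9]

grid = count_grid(sweating, anxiety, range(1, 6), range(1, 11))
shares = column_shares(grid)
```

Plotting shares rather than raw counts is one way to keep a level with few respondents from appearing uniformly pale in the heatmap.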


The visualizations suggest that anxiety levels 8–10 may form a distinct subgroup. Across many variables, these levels appear to behave differently from levels 1–7. We summarize the key differences below.

Categorical Variables

- Gender, Smoking, Dizziness, Medication, and Recent Major Life Event display internally consistent distributions within levels 1–7 and within levels 8–10, but the proportions differ markedly between these two ranges.
- Family History of Anxiety shows a clear increasing trend with Anxiety Level, which plateaus at higher levels.
- Occupation appears to have minimal association with anxiety: the distributions across occupations largely overlap.
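One way to quantify the "consistent within a range, different between ranges" pattern for a Yes/No predictor is to compute the share of "Yes" at each anxiety level and then compare the spread of those shares inside each range. This is a minimal sketch under assumed names and made-up responses; it expects at least one observation on each side of the cutoff.

```python
from collections import defaultdict


def yes_share_by_level(anxiety, answers):
    """Share of 'Yes' answers at each anxiety level."""
    totals, yes = defaultdict(int), defaultdict(int)
    for level, answer in zip(anxiety, answers):
        totals[level] += 1
        yes[level] += answer == "Yes"
    return {level: yes[level] / totals[level] for level in sorted(totals)}


def regime_summary(shares, cutoff=8):
    """Min and max per-level share below the cutoff and at or above it."""
    low = [s for level, s in shares.items() if level < cutoff]
    high = [s for level, s in shares.items() if level >= cutoff]
    return {"low_range": (min(low), max(low)), "high_range": (min(high), max(high))}


# Made-up responses for a binary predictor such as Smoking.
anxiety = [1, 1, 4, 4, 7, 7, 8, 8, 9, 9, 10, 10]
smoking = ["No", "Yes", "No", "Yes", "Yes", "No",
           "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"]

shares = yes_share_by_level(anxiety, smoking)
summary = regime_summary(shares)
```

Narrow intervals that do not overlap between the two ranges would correspond to the two-regime pattern described above; wide or overlapping intervals would not.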

Numeric and Ordinal Variables

- Age, Sleep Hours, Physical Activity, and Heart Rate show relatively stable averages within levels 1–7, then shift to consistently different values for levels 8–10, again supporting a two-regime structure.
- Caffeine Intake increases with anxiety level but levels off at the top three levels, similar to the Family History trend.
- Stress Level increases steadily with anxiety through level 6, has sparse data at level 7, and becomes uniformly high in the top three levels.
- Therapy Sessions appears bimodal with respect to anxiety. The lower anxiety group (1–7) shows a concentrated cluster around low session counts (≈1/month), while levels 8–10 show a broader, less structured spread across session counts from 4 to 9. This separation suggests two distinct behavioral regimes: one centered and structured, the other elevated and diffuse.
- Alcohol Consumption, Breathing Rate, Sweating Level, and Diet Quality don’t show clear trends across the full scale but stand out for a different reason: none of the participants at levels 8–10 fall into the “healthy” range for these variables.
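The two kinds of checks described in this list can be sketched as follows: a per-level mean for a numeric predictor, and the share of respondents within a "healthy" range for a chosen set of anxiety levels. The report does not define its healthy ranges, so the 12 to 20 breaths/min interval below is a hypothetical placeholder, as are the function names and all sample values.

```python
from collections import defaultdict
from statistics import mean


def mean_by_level(anxiety, values):
    """Average of a numeric predictor at each anxiety level."""
    groups = defaultdict(list)
    for level, value in zip(anxiety, values):
        groups[level].append(value)
    return {level: mean(groups[level]) for level in sorted(groups)}


def share_in_range(anxiety, values, low, high, levels):
    """Share of respondents at the given levels whose value lies in [low, high]."""
    selected = [v for a, v in zip(anxiety, values) if a in levels]
    if not selected:
        return None  # no respondents at those levels
    return sum(low <= v <= high for v in selected) / len(selected)


# Made-up observations.
anxiety = [2, 2, 5, 5, 8, 9, 10]
sleep_hours = [7.0, 7.4, 6.8, 7.2, 5.5, 5.9, 5.7]
breathing_rate = [14, 16, 18, 15, 24, 26, 23]

sleep_means = mean_by_level(anxiety, sleep_hours)
# Hypothetical "healthy" interval of 12-20 breaths/min (not defined in the report).
healthy_high = share_in_range(anxiety, breathing_rate, 12, 20, {8, 9, 10})
healthy_low = share_in_range(anxiety, breathing_rate, 12, 20, {2, 5})
```

A share of exactly zero for levels 8–10 alongside a clearly positive share for levels 1–7 is the pattern the last bullet describes.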

Modeling Implications

The analysis suggests a clear divide in how predictors relate to Anxiety Level: responses in the 8–10 range show different patterns than those in the 1–7 range across many variables. These differences are substantial enough to warrant treating the groups separately in modeling.

To address this, we adopt a stacked (two-stage) modeling strategy:

A logistic model first distinguishes between Low / Moderate Anxiety (levels 1–7) and High Anxiety (levels 8–10).
Then, two separate models are used:
- One to predict specific anxiety levels within the Low / Moderate group.
- Another trained only on the High Anxiety group.

This setup allows each model to focus on patterns that are internally consistent within its group, rather than forcing a single model to bridge competing dynamics. While this approach introduces additional complexity, it is expected to improve both interpretability and performance. The specific challenges and decisions for each stage will be addressed in the modeling sections that follow.
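The routing logic of this strategy can be sketched without committing to any particular model family. In the sketch below, the gate and the two second-stage models are simple stand-in functions with made-up coefficients, not fitted models; the class name, the 0.5 threshold, and the single Stress Level feature are all assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence

Features = Sequence[float]
HIGH_CUTOFF = 8  # levels 8-10 form the High Anxiety group


def to_binary_target(level: int) -> int:
    """Recode a 1-10 anxiety level as 0 (Low / Moderate) or 1 (High) for stage 1."""
    return int(level >= HIGH_CUTOFF)


def split_by_regime(X, y):
    """Split training rows so each second-stage model sees only its own group."""
    low = [(x, t) for x, t in zip(X, y) if t < HIGH_CUTOFF]
    high = [(x, t) for x, t in zip(X, y) if t >= HIGH_CUTOFF]
    return low, high


@dataclass
class TwoStageModel:
    gate: Callable[[Features], float]      # stage 1: estimated P(High Anxiety | x)
    low_model: Callable[[Features], int]   # stage 2a: predicts a level in 1-7
    high_model: Callable[[Features], int]  # stage 2b: predicts a level in 8-10
    threshold: float = 0.5

    def predict(self, x: Features) -> int:
        if self.gate(x) >= self.threshold:
            return self.high_model(x)
        return self.low_model(x)


# Stand-in components using one feature, x[0] = Stress Level (hypothetical coefficients).
model = TwoStageModel(
    gate=lambda x: 1 / (1 + math.exp(-(x[0] - 7.5))),  # logistic curve
    low_model=lambda x: min(7, max(1, round(x[0]))),
    high_model=lambda x: 9,
)

low_rows, high_rows = split_by_regime([[3], [6], [9], [10]], [2, 5, 8, 10])
```

One consequence of this design is that a misclassification at the first stage cannot be corrected at the second, so the gate's threshold is a decision worth examining on its own.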

Single Variable Plots (panels): Gender, Smoking, Family History of Anxiety, Dizziness, Medication, Recent Major Life Event, Occupation, Stress Level (1-10), Sweating Level (1-5), Diet Quality (1-10), Anxiety Level (1-10), Alcohol Consumption (drinks/week), Therapy Sessions (per month), Breathing Rate (breaths/min), Age, Sleep Hours, Physical Activity (hrs/week), Caffeine Intake (mg/day), Heart Rate (bpm).

Anxiety Level vs Other Variable Plots (panels): Anxiety Level (1-10) against each of the 18 predictors listed above.
