<- smoking_drinking %>%
smoking_cleaned mutate(
BMI = weight / ((height / 100) ^ 2),
weight_category = case_when(
> 30 ~ "Obesity",
BMI > 25 ~ "Overweight",
BMI > 18.5 ~ "Healthy",
BMI TRUE ~ "Underweight" # Catch-all for values <= 18.5
),
smoking_status = case_when(
== "1" ~ "Never Smoke",
SMK_stat_type_cd == "2" ~ "Smoked But Quit",
SMK_stat_type_cd == "3" ~ "Still Smoke",),
SMK_stat_type_cd %>%
)select(!c(DRK_YN,SMK_stat_type_cd,urine_protein,hear_left,hear_right, triglyceride,hemoglobin,urine_protein,serum_creatinine,SGOT_AST,SGOT_ALT,gamma_GTP, height, weight, waistline))
The trends, spatial difference, and associated outcome of smoking
<- read.csv(here::here('data_raw', 'by_country.csv'), skip = 1) %>%
by_country clean_names() %>%
select(c(countries_territories_and_areas, year, both_sexes_1,male_1,female_1)) %>%
rename(
smoking_rate = both_sexes_1,
male_smoking_rate = male_1,
female_smoking_rate = female_1
%>%
) filter(!rowSums(is.na(.)) > 0)
<- by_states %>%
by_states clean_names() %>%
mutate(
smoking_rate = 1-never_smoked,
smoking_rate_new = smoke_everyday + smoke_some_days
)
1. Introduction
Smoking remains a pervasive global health concern with significant implications for public well-being. According to CDC, cigarette smoking is the leading cause of preventable death in the United States, which causes more than 480,000 deaths each year in the United States. It is proven to increase the risks of lung cancer, heart disease, and strokes. As shown in the photo, there is a tragic change in the lungs after smoking. Besides, there are multiple toxic chemicals generated in the life cycle of cigarettes and associated products, such as nicotine, and butane. In this context, we are curious about whether the smoking rate has decreased as the public awareness of the risks brought by smoking cigarettes. In addition, are there significant differences in smoking rates all around the world, and across the US? At last, we would like to know if there are other associated health outcomes with smoking status.Understanding these patterns is crucial for informing public health initiatives and guiding efforts to mitigate the adverse effects associated with smoking.
With our survey, we found that the trend of smoking rate went down all over the world. In addition, the data in the US also shows a declining smoking rate all across the US. Some top states with highest smoking rate over the years are Kentucky, West Virginia, and Nevada.
There is a strong relation between systolic blood pressure (SBP) and diastolic blood pressure (DBP), as well as total cholesterol (total_chole) and low-density lipoprotein cholesterol (LDL_chole). Overall, within the dataset, there have been significantly elevated risks in overweight, blood pressure problems, and blood glucose problems in smoking groups (smoke but quit, and still smoke). Surprisingly, the still smoke group shows a better performance in those metrics than the smoke but quit group.
2. Research question
1.What countries have higher smoking rates? What regions in the US have higher tobacco consumption?
2.Does smoking rate increase or decrease over time?
3.Are those measurable health outcomes correlated with each other?
4.What measurable health outcomes are correlated with different types of smoking statuses?
3. Data sources
There are three data sets involved in our research.
The first data set is smoking_driking_dataset_Ver01.csv. It is used to explore: What measurable health outcomes are correlated with different types of smoking statuses? Are those measurable health outcomes correlated with each other? We downloaded this data from this link. This dataset is already pre-cleaned by a Kaggle user.There are 24 columns, 991346 rows of data. The original source is this link. It is collected by the National Health Insurance Service in Korea.
The second data is CDC_US_v2.csv. It is used to explore: What countries have higher smoking rates? Does smoking rate increase or decrease over time? We downloaded from this linkThis data is pre-cleaned by the github owner. This data is originally from this link. This data was collected by the CDC and describes the frequency of smoking rates of different countries.
The third data is by_country.csv. It is used to explore: what regions in the US have higher tobacco consumption? Does smoking rate increase or decrease over time? This data set is originally from this link. We cleaned this data by ourselves. This data is collected by WHO. It shows Age-standardized estimates of current tobacco use, tobacco smoking and cigarette smoking data by country. This data set contains 1478 rows and 11 columns. This estimates uses a statistical model based on a Bayesian negative binomial meta-regression is used to model prevalence of current tobacco use for each country, separately for men and women. The model has two main components: (a) adjusting for missing indicators and age groups, and (b) generating an estimate of trends over time as well as the 95% credible interval around the estimate.
4. Results
Does smoking rate increase or decrease over time? What countries have higher smoking rates?
<- by_country %>%
top10countries filter(year==2000) %>%
arrange(desc(smoking_rate)) %>%
slice(1:10)
<- by_country %>%
bottom10countries filter(year==2000) %>%
arrange(smoking_rate) %>%
slice(1:10)
There are 164 countries/territories/areas in this cleaned data sets.The estimates are in the year of 2000, 2005, 2010, 2018, 2019, 2020, 2023, and 2025. We decide to compare the year of 2000 and 2023 using a slope chart.
<- by_country %>%
by_country_slope filter(
%in% c(2000, 2023),
year %in% top10countries$countries_territories_and_areas) %>%
countries_territories_and_areas mutate(
# Reorder state variables
countries_territories_and_areas = fct_reorder2(countries_territories_and_areas,
desc(smoking_rate)),
year, # Convert year to discrete variable
year = as.factor(year),
# Make labels
label = paste(countries_territories_and_areas, ' (',
round(smoking_rate), ')'),
label_left = ifelse(year == 2000, label, NA),
label_right = ifelse(year == 2023, label, NA))
The overall trend of smoking rate in top 10 countries goes down by 20%. The top three smoking countries in 2000 are Kiribati (68%), Nauru (57%), and Greece (55%).The top three smoking countries in 2023 are Nauru(42%), Serbia(39%), and Bulgaria(38%).
ggplot(by_country_slope,
aes(
x = year,
y = smoking_rate,
group = countries_territories_and_areas)) +
geom_line(size=0.8)+
# Add 2000 labels (left side)
geom_text_repel(
aes(label = label_left),
hjust = 1, nudge_x = -0.05,
direction = 'y', segment.color = 'grey') +
# Add 2023 labels (right side)
geom_text_repel(aes(label = label_right),
hjust = 0, nudge_x = 0.05,
direction = 'y', segment.color = 'grey') +
scale_x_discrete(position = 'top') +
scale_color_manual(values = c('black')) +
# Annotate & adjust theme
labs(x = NULL,
y = 'Smoking rate (%) ',
title = 'Top 10 smoking countries (2000 - 2023)') +
theme_minimal_grid() +
theme(panel.grid = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
legend.position = 'none')
The overall trend goes down by 1-6% for the bottom 10 countries, except Oman has a slight increase in smoking rate. Ethiopia has the lowest smoking rate at 5% in 2000. Ghana has the lowest smoking rate at 2% in 2023.
<- by_country %>%
by_country_slope_bottom filter(
%in% c(2000, 2023),
year %in% bottom10countries$countries_territories_and_areas) %>%
countries_territories_and_areas mutate(
# Reorder state variables
countries_territories_and_areas = fct_reorder2(countries_territories_and_areas,
desc(smoking_rate)),
year, # Convert year to discrete variable
year = as.factor(year),
# Define line color
lineColor = if_else(
== 'Oman', 'OMAN', 'other'),
countries_territories_and_areas # Make labels
label = paste(countries_territories_and_areas, ' (',
round(smoking_rate), ')'),
label_left = ifelse(year == 2000, label, NA),
label_right = ifelse(year == 2023, label, NA))
ggplot(by_country_slope_bottom,
aes(
x = year,
y = smoking_rate,
group = countries_territories_and_areas)) +
geom_line(aes(color = lineColor),
size=0.8)+
# Add 2000 labels (left side)
geom_text_repel(
aes(label = label_left),
hjust = 1, nudge_x = -0.05,
direction = 'y', segment.color = 'grey') +
# Add 2023 labels (right side)
geom_text_repel(aes(label = label_right),
hjust = 0, nudge_x = 0.05,
direction = 'y', segment.color = 'grey') +
scale_x_discrete(position = 'top') +
scale_color_manual(values = c('red', 'black')) +
# Annotate & adjust theme
labs(x = NULL,
y = 'Smoking rate (%) ',
title = '10 countries with lowest smoking rates (2000 - 2023)') +
theme_minimal_grid() +
theme(panel.grid = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
legend.position = 'none')
What is the trends over time? What states have higher smoking rate over time?
This data set ranges from 1995 to 2010. During the 15 years, the overall trends of smoking goes down. Kentucky has the highest smoking rate in most years(0.25 to 0.33). West Virginia has an interesting trend which goes down and rises again in 15 years, showing almost no trends of going down. Nevada is the highest one in 1999 at 31 %. It should be noted that the smoking rate is the total of everyday smoking rate and some day smoking rate, without the presence of the smoked but quit.
<- by_states %>%
by_states_formatted_new group_by(year) %>%
# The * 1 makes it possible to have non-integer ranks while sliding
mutate(new_rank = rank(-smoking_rate_new, ties.method = "first"),
Value_lbl = paste0(" ",smoking_rate_new)) %>%
group_by(state) %>%
filter(new_rank <=10) %>%
ungroup()
<- by_states_formatted_new %>%
by_states_anim_new mutate(year = as.integer(year)) %>%
ggplot(aes(x = new_rank, group = state,fill = state)) +
geom_tile(aes(y = smoking_rate_new / 2,
height = smoking_rate_new),
width = 0.5, alpha = 0.8, color = NA) +
geom_text(aes(y=smoking_rate_new,label = Value_lbl, hjust=0)) +
coord_flip(clip = "off", expand = FALSE) +
geom_text(aes(y = 0, label = paste(state, " ")),
vjust = 0.2, hjust = 1) +
coord_flip(clip = 'off', expand = FALSE) +
scale_y_continuous(labels = scales::comma) +
scale_fill_viridis(discrete = TRUE) +
scale_color_viridis(discrete = TRUE) +
scale_x_reverse() +
# scale_x_continuous( limits = c(0, 0.6),)+
guides(color = FALSE) +
theme_minimal_vgrid() +
theme(
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
legend.position = "none",
legend.background = element_rect(fill = 'white'),
plot.title = element_text(
size = 22, hjust = 0.5, face = 'bold',
colour = 'grey', vjust = -1),
plot.subtitle = element_text(
size = 18, hjust = 0.5,
face = 'italic', color = 'grey'),
plot.caption = element_text(
size = 8, hjust = 0.5,
face = 'italic', color = 'grey'),
plot.margin = margin(0.5, 2, 0.5, 3, 'cm')) +
transition_time(year) +
view_follow(fixed_x = TRUE) +
labs(title = 'Year : {frame_time}',
subtitle = 'Top 10 states by smoking rate')
animate(by_states_anim_new, duration = 15, end_pause = 15,
width = 800, height = 700, res = 150,fps = 20,
renderer = magick_renderer())