The trends, spatial differences, and associated outcomes of smoking

Author

Kejia Hu, Jiaxin Wang

Published

December 6, 2023

smoking_cleaned <- smoking_drinking %>%
  mutate(
    BMI = weight / ((height / 100) ^ 2),
    weight_category = case_when(
      BMI > 30 ~ "Obesity",
      BMI > 25 ~ "Overweight",
      BMI > 18.5 ~ "Healthy",
      TRUE ~ "Underweight"  # Catch-all for values <= 18.5
      ),
        
    smoking_status = case_when(
      SMK_stat_type_cd == "1" ~ "Never Smoke",
      SMK_stat_type_cd == "2" ~ "Smoked But Quit",
      SMK_stat_type_cd == "3" ~ "Still Smoke"
    )
  ) %>%
  # Drop columns that are not used in the analysis
  select(!c(DRK_YN, SMK_stat_type_cd, urine_protein, hear_left, hear_right,
            triglyceride, hemoglobin, serum_creatinine, SGOT_AST, SGOT_ALT,
            gamma_GTP, height, weight, waistline))

by_country <- read.csv(here::here('data_raw', 'by_country.csv'), skip = 1) %>% 
    clean_names() %>% 
    select(c(countries_territories_and_areas, year, both_sexes_1,male_1,female_1)) %>% 
    rename(
     smoking_rate = both_sexes_1,
     male_smoking_rate = male_1,
     female_smoking_rate = female_1  
    ) %>%
    filter(!rowSums(is.na(.)) > 0)

by_states <- by_states %>%
    clean_names() %>% 
    mutate(
    smoking_rate = 1-never_smoked,
    smoking_rate_new = smoke_everyday + smoke_some_days
    )
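
The BMI formula and the ordering of the case_when() conditions in the cleaning code above can be checked on a few hand-made rows. This is a minimal sketch: the toy_people table and its heights and weights are made-up values for illustration, not rows from the Korean dataset.

```r
library(dplyr)

# Hypothetical people (height in cm, weight in kg); values are illustrative only
toy_people <- tibble(
  height = c(170, 160, 180, 175),
  weight = c(50, 65, 100, 70)
)

toy_people <- toy_people %>%
  mutate(
    # BMI = weight (kg) / height (m)^2
    BMI = weight / ((height / 100) ^ 2),
    # case_when() uses the first matching condition, so the largest
    # threshold must come first
    weight_category = case_when(
      BMI > 30 ~ "Obesity",
      BMI > 25 ~ "Overweight",
      BMI > 18.5 ~ "Healthy",
      TRUE ~ "Underweight"
    )
  )

toy_people
```

Because the final TRUE branch catches everything that did not match earlier, a row with a missing height or weight would also be labelled "Underweight", which is worth keeping in mind when reading the category counts.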

1. Introduction

Smoking remains a pervasive global health concern with significant implications for public well-being. According to the CDC, cigarette smoking is the leading cause of preventable death in the United States, causing more than 480,000 deaths each year. It is known to increase the risk of lung cancer, heart disease, and stroke. As shown in the photo, smoking causes tragic changes in the lungs. In addition, multiple toxic chemicals, such as nicotine and butane, are generated over the life cycle of cigarettes and associated products. In this context, we are curious whether the smoking rate has decreased as public awareness of the risks of cigarette smoking has grown. In addition, are there significant differences in smoking rates around the world and across the US? Finally, we would like to know whether other health outcomes are associated with smoking status. Understanding these patterns is crucial for informing public health initiatives and guiding efforts to mitigate the adverse effects associated with smoking.

In our analysis, we found that smoking rates declined around the world. The US data also show a declining smoking rate across the country. The states with the highest smoking rates over the years include Kentucky, West Virginia, and Nevada.

There is a strong relationship between systolic blood pressure (SBP) and diastolic blood pressure (DBP), as well as between total cholesterol (tot_chole) and low-density lipoprotein cholesterol (LDL_chole). Overall, within the dataset, the smoking groups (smoked but quit, and still smoke) show significantly elevated risks of overweight, blood pressure problems, and blood glucose problems. Surprisingly, the still smoke group performs better on those metrics than the smoked but quit group.

2. Research question

1. What countries have higher smoking rates? What regions in the US have higher tobacco consumption?
2. Does smoking rate increase or decrease over time?
3. Are those measurable health outcomes correlated with each other?
4. What measurable health outcomes are correlated with different types of smoking statuses?

3. Data sources

There are three data sets involved in our research.

The first data set is smoking_driking_dataset_Ver01.csv. We use it to explore two questions: What measurable health outcomes are correlated with different types of smoking statuses? Are those measurable health outcomes correlated with each other? We downloaded this data from this link. The dataset was pre-cleaned by a Kaggle user and contains 24 columns and 991,346 rows. The original source is this link. The data were collected by the National Health Insurance Service in Korea.

The second data set is CDC_US_v2.csv. We use it to explore: What regions in the US have higher tobacco consumption? Does smoking rate increase or decrease over time? We downloaded it from this link. The data were pre-cleaned by the GitHub owner and originally come from this link. They were collected by the CDC and describe smoking prevalence by frequency (every day, some days, never) in different US states.

The third data set is by_country.csv. We use it to explore: What countries have higher smoking rates? Does smoking rate increase or decrease over time? The data set originally comes from this link, and we cleaned it ourselves. The data were collected by the WHO and give age-standardized estimates of current tobacco use, tobacco smoking, and cigarette smoking by country. The data set contains 1478 rows and 11 columns. The estimates come from a statistical model based on a Bayesian negative binomial meta-regression, which models the prevalence of current tobacco use for each country, separately for men and women. The model has two main components: (a) adjusting for missing indicators and age groups, and (b) generating an estimate of trends over time as well as the 95% credible interval around the estimate.

4. Results

Does smoking rate increase or decrease over time? What countries have higher smoking rates?

top10countries <- by_country %>%
    filter(year==2000) %>% 
    arrange(desc(smoking_rate)) %>%
    slice(1:10)

bottom10countries <- by_country %>%
    filter(year==2000) %>% 
    arrange(smoking_rate) %>%
    slice(1:10)

There are 164 countries/territories/areas in this cleaned data set. The estimates cover the years 2000, 2005, 2010, 2018, 2019, 2020, 2023, and 2025. We compare 2000 and 2023 using a slope chart.

by_country_slope <- by_country %>%
  filter(
    year %in% c(2000, 2023),
    countries_territories_and_areas %in% top10countries$countries_territories_and_areas) %>%
  mutate(
    # Reorder countries for plotting
   countries_territories_and_areas = fct_reorder2(countries_territories_and_areas,
      year, desc(smoking_rate)),
    # Convert year to discrete variable
    year = as.factor(year),
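    # Make labels of the form "Country (rate)"; paste0() avoids the
    # extra spaces that paste() inserts between its arguments
    label = paste0(countries_territories_and_areas, ' (',
                   round(smoking_rate), ')'),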
    label_left = ifelse(year == 2000, label, NA),
    label_right = ifelse(year == 2023, label, NA))

Across the top 10 countries, the smoking rate falls by about 20% overall. The top three smoking countries in 2000 are Kiribati (68%), Nauru (57%), and Greece (55%). The top three smoking countries in 2023 are Nauru (42%), Serbia (39%), and Bulgaria (38%).
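
A decline like the one quoted above can be expressed either in percentage points or relative to the 2000 level, and the two can differ a lot. The sketch below shows one way to compute both per country. The toy_country table uses the same column names as by_country, but its two areas and their rates are made up for illustration and are not WHO estimates.

```r
library(dplyr)
library(tidyr)

# Toy data in the same shape as by_country (values are illustrative only)
toy_country <- tibble(
  countries_territories_and_areas = rep(c("Area A", "Area B"), each = 2),
  year = rep(c(2000, 2023), times = 2),
  smoking_rate = c(60, 40, 50, 35)
)

rate_change <- toy_country %>%
  # One row per area, one column per year: rate_2000, rate_2023
  pivot_wider(names_from = year, values_from = smoking_rate,
              names_prefix = "rate_") %>%
  mutate(
    # Absolute change in percentage points
    abs_change = rate_2023 - rate_2000,
    # Relative change as a percent of the 2000 level
    rel_change = 100 * (rate_2023 - rate_2000) / rate_2000
  )

rate_change
```

Applying the same steps to by_country_slope (before year is converted to a factor) would give the per-country changes behind the slope chart.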

ggplot(by_country_slope,
       aes(
           x = year,
           y = smoking_rate,
           group = countries_territories_and_areas)) +
    geom_line(size=0.8)+
    # Add 2000 labels (left side)
    geom_text_repel(
      aes(label = label_left),
      hjust = 1, nudge_x = -0.05,
      direction = 'y', segment.color = 'grey') +
    # Add 2023 labels (right side)
    geom_text_repel(aes(label = label_right),
      hjust = 0, nudge_x = 0.05,
      direction = 'y', segment.color = 'grey') +
    scale_x_discrete(position = 'top') +
    scale_color_manual(values = c('black')) +
    # Annotate & adjust theme
    labs(x = NULL,
         y = 'Smoking rate (%) ',
         title = 'Top 10 smoking countries (2000 - 2023)') +
    theme_minimal_grid() +
    theme(panel.grid  = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          legend.position = 'none')

For the 10 countries with the lowest smoking rates, the overall trend is a decline of 1-6%, except for Oman, which shows a slight increase. Ethiopia has the lowest smoking rate in 2000, at 5%. Ghana has the lowest smoking rate in 2023, at 2%.

by_country_slope_bottom <- by_country %>%
  filter(
    year %in% c(2000, 2023),
    countries_territories_and_areas %in% bottom10countries$countries_territories_and_areas) %>%
  mutate(
    # Reorder countries for plotting
   countries_territories_and_areas = fct_reorder2(countries_territories_and_areas,
      year, desc(smoking_rate)),
    # Convert year to discrete variable
    year = as.factor(year),
    # Define line color
    lineColor = if_else(
        countries_territories_and_areas == 'Oman', 'OMAN', 'other'),
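    # Make labels of the form "Country (rate)"; paste0() avoids the
    # extra spaces that paste() inserts between its arguments
    label = paste0(countries_territories_and_areas, ' (',
                   round(smoking_rate), ')'),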
    label_left = ifelse(year == 2000, label, NA),
    label_right = ifelse(year == 2023, label, NA))

ggplot(by_country_slope_bottom,
       aes(
           x = year,
           y = smoking_rate,
           group = countries_territories_and_areas)) +
    geom_line(aes(color = lineColor),
              size=0.8)+
    # Add 2000 labels (left side)
    geom_text_repel(
      aes(label = label_left),
      hjust = 1, nudge_x = -0.05,
      direction = 'y', segment.color = 'grey') +
    # Add 2023 labels (right side)
    geom_text_repel(aes(label = label_right),
      hjust = 0, nudge_x = 0.05,
      direction = 'y', segment.color = 'grey') +
    scale_x_discrete(position = 'top') +
    # Name the colors so the mapping does not depend on level order
    scale_color_manual(values = c(OMAN = 'red', other = 'black')) +
    # Annotate & adjust theme
    labs(x = NULL,
         y = 'Smoking rate (%) ',
         title = '10 countries with lowest smoking rates (2000 - 2023)') +
    theme_minimal_grid() +
    theme(panel.grid  = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          legend.position = 'none')

What are the trends over time? Which states have higher smoking rates over time?

This data set ranges from 1995 to 2010. Over these 15 years, the overall smoking trend is downward. Kentucky has the highest smoking rate in most years (0.25 to 0.33). West Virginia shows an interesting pattern: its rate falls and then rises again over the 15 years, showing almost no net decline. Nevada has the highest rate in 1999, at 31%. Note that the smoking rate here is the sum of the everyday and some-days smoking rates and excludes people who smoked but quit.
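
The ranking behind the animated bar chart can be illustrated on a small example: the rate is the sum of the everyday and some-days shares, and states are ranked within each year. The toy_states table below is a made-up stand-in for by_states with three hypothetical states, so the numbers carry no meaning beyond the illustration.

```r
library(dplyr)

# Hypothetical states and shares (illustrative values only)
toy_states <- tibble(
  state = c("S1", "S2", "S3", "S1", "S2", "S3"),
  year = c(1995, 1995, 1995, 1996, 1996, 1996),
  smoke_everyday = c(0.20, 0.18, 0.22, 0.19, 0.21, 0.17),
  smoke_some_days = c(0.05, 0.04, 0.04, 0.05, 0.04, 0.04)
)

toy_ranked <- toy_states %>%
  # Current smokers = everyday smokers + some-days smokers
  mutate(smoking_rate_new = smoke_everyday + smoke_some_days) %>%
  group_by(year) %>%
  # Rank 1 = highest rate within the year; ties broken by row order
  mutate(new_rank = rank(-smoking_rate_new, ties.method = "first")) %>%
  ungroup()

toy_ranked
```

Grouping by year before calling rank() is what makes the ranks restart each year; without it, all state-year rows would be ranked together.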

by_states_formatted_new <- by_states %>%
  group_by(year) %>%
  # Rank states within each year (rank 1 = highest smoking rate)
  mutate(new_rank = rank(-smoking_rate_new, ties.method = "first"),
         Value_lbl = paste0(" ",smoking_rate_new)) %>%
  group_by(state) %>% 
  filter(new_rank <=10) %>%
  ungroup()

by_states_anim_new <- by_states_formatted_new %>%
    mutate(year = as.integer(year)) %>%
    ggplot(aes(x = new_rank, group = state,fill = state)) +
    geom_tile(aes(y = smoking_rate_new / 2,
                  height = smoking_rate_new),
              width = 0.5, alpha = 0.8, color = NA) +
    geom_text(aes(y = smoking_rate_new, label = Value_lbl), hjust = 0) +
    geom_text(aes(y = 0, label = paste(state, " ")),
              vjust = 0.2, hjust = 1) +
    # A plot keeps only one coordinate system, so coord_flip() is added once
    coord_flip(clip = 'off', expand = FALSE) +
    scale_y_continuous(labels = scales::comma) +
    scale_fill_viridis(discrete = TRUE) +
    scale_color_viridis(discrete = TRUE) +
    scale_x_reverse() +
    # scale_x_continuous( limits = c(0, 0.6),)+
    guides(color = FALSE) +
    theme_minimal_vgrid() +
    theme(
        axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        legend.position = "none",
        legend.background = element_rect(fill = 'white'),
        plot.title = element_text(
          size = 22, hjust = 0.5, face = 'bold',
          colour = 'grey', vjust = -1),
        plot.subtitle = element_text(
          size = 18, hjust = 0.5,
          face = 'italic', color = 'grey'),
        plot.caption = element_text(
          size = 8, hjust = 0.5,
          face = 'italic', color = 'grey'),
          plot.margin = margin(0.5, 2, 0.5, 3, 'cm')) +
    transition_time(year) +
    view_follow(fixed_x = TRUE) +
    labs(title    = 'Year : {frame_time}',
         subtitle = 'Top 10 states by smoking rate')

animate(by_states_anim_new, duration = 15, end_pause = 15,
        width = 800, height = 700, res = 150,fps = 20, 
        renderer = magick_renderer())

Are those measurable health outcomes correlated with each other?

There is a strong relationship between systolic blood pressure (SBP) and diastolic blood pressure (DBP), as well as between total cholesterol (tot_chole) and low-density lipoprotein cholesterol (LDL_chole).
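
The correlation plot summarizes pairwise correlation coefficients between the numeric columns. As a rough illustration of what one cell of that matrix means, the sketch below computes a Pearson correlation with base R on simulated data. The toy_bp data frame and the linear relationship used to generate it are assumptions made for the example, not properties of the Korean dataset.

```r
set.seed(1)

# Simulated blood pressure readings (illustrative only): DBP is generated
# as a noisy linear function of SBP, so the two are positively correlated
n <- 200
SBP <- rnorm(n, mean = 120, sd = 15)
DBP <- 0.6 * SBP + rnorm(n, mean = 4, sd = 6)
toy_bp <- data.frame(SBP = SBP, DBP = DBP)

# Pearson correlation matrix, using all complete pairs of observations
corr_matrix <- cor(toy_bp, use = "pairwise.complete.obs", method = "pearson")

round(corr_matrix, 2)
```

The matrix is symmetric with ones on the diagonal, which is why a correlation plot only needs to show one triangle.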

ggcorr(smoking_cleaned, label = TRUE) +
  ggtitle("Correlation Plot of Variables") +  # Adding a title
  theme_minimal() +  # Applying a minimal theme (you can change the theme as desired)
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

What measurable health outcomes are correlated with different types of smoking statuses?

To explore the potential health outcomes associated with smoking, we use box plots to compare the smoking categories and add medical reference values as red dashed lines.

Overall, within the dataset, the smoking groups (smoked but quit, and still smoke) show significantly elevated risks of overweight, blood pressure problems, and blood glucose problems. Surprisingly, the still smoke group performs better on those metrics than the smoked but quit group.

Median body mass index is highest in the smoked but quit group, followed by the still smoke group. This points to an overweight problem associated with smoking, whether or not the person has quit.
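
The group medians read off a box plot can also be tabulated directly, and a rank-based test gives a rough check on whether the groups differ. The sketch below does both on simulated data. The toy_health table, its group means, and the choice of a Kruskal-Wallis test are assumptions for illustration; they are not part of the original analysis.

```r
library(dplyr)

set.seed(42)

# Simulated BMI values for three smoking groups (illustrative only)
toy_health <- tibble(
  smoking_status = rep(c("Never Smoke", "Smoked But Quit", "Still Smoke"),
                       each = 100),
  BMI = c(rnorm(100, mean = 23, sd = 3),
          rnorm(100, mean = 25, sd = 3),
          rnorm(100, mean = 24, sd = 3))
)

# Median and quartiles per group: the numbers a box plot draws
bmi_summary <- toy_health %>%
  group_by(smoking_status) %>%
  summarise(
    n = n(),
    q1 = quantile(BMI, 0.25, names = FALSE),
    median_BMI = median(BMI),
    q3 = quantile(BMI, 0.75, names = FALSE),
    .groups = "drop"
  )

# Kruskal-Wallis rank sum test for a difference among the groups
kw <- kruskal.test(BMI ~ smoking_status, data = toy_health)

bmi_summary
kw$p.value
```

Replacing toy_health with smoking_cleaned, and BMI with SBP, DBP, or BLDS, would give the same table for each metric discussed in this section.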

smoking_cleaned %>%
  ggplot() +
  geom_boxplot(aes(x = BMI, y=smoking_status),outlier.shape = NA, size=1) +
  labs(x = 'Body Mass Index', y = NULL) +
    geom_vline(
            xintercept = c(19,25),
            color = 'red', 
            linetype = 'dashed',
            size = 1.5,
            )+
    scale_x_continuous(
        # Set the lower & upper boundaries. Note: scale limits drop rows
        # outside this range before the box statistics are computed;
        # coord_cartesian(xlim = ...) would zoom without dropping them.
        limits = c(15, 30),
        # expansion() replaces the deprecated expand_scale()
        expand = expansion(mult = c(0, 0.05))) +
    labs(
        title = 'Higher overweight risks in smoking groups'
    )+
  theme(
    # Adjusting font sizes for various text elements
    plot.title = element_text(size = 16),  # Change plot title font size
    axis.title.x = element_text(size = 14),  # Change X-axis label font size
    axis.title.y = element_text(size = 14),  # Change Y-axis label font size
    axis.text = element_text(size = 12)  # Change axis tick label font size
    # Add other theme settings as needed
  )

A systolic blood pressure above 130 mm Hg indicates elevated health risk. The smoked but quit group has the highest median systolic blood pressure. This suggests that smoking is associated with higher systolic blood pressure.
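
Besides comparing medians, one can look at the share of each group that lies above the reference value. The sketch below computes that share for the 130 mm Hg threshold on a few hand-made readings. The toy_sbp table is hypothetical and only illustrates the calculation.

```r
library(dplyr)

# Hand-made systolic readings (mm Hg), four per group; illustrative only
toy_sbp <- tibble(
  smoking_status = rep(c("Never Smoke", "Smoked But Quit", "Still Smoke"),
                       each = 4),
  SBP = c(110, 118, 135, 122,
          128, 140, 133, 136,
          120, 131, 126, 138)
)

share_elevated <- toy_sbp %>%
  group_by(smoking_status) %>%
  # mean() of a logical vector is the proportion of TRUE values
  summarise(share_above_130 = mean(SBP > 130), .groups = "drop")

share_elevated
```

The same pattern works for the other reference lines in this section, for example DBP > 80.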

smoking_cleaned %>%
  ggplot() +
  geom_boxplot(aes(x = SBP, y=smoking_status),outlier.shape = NA, size=1) +
  labs(x = 'Systolic blood pressure (mm Hg)', y = NULL) +
    geom_vline(
            xintercept = 130,
            color = 'red', 
            linetype = 'dashed',
            size = 1.5)+
    scale_x_continuous(
        # Set the lower & upper boundaries (rows outside them are dropped)
        limits = c(75, 175),
        # expansion() replaces the deprecated expand_scale()
        expand = expansion(mult = c(0, 0.05))) +
    labs(
        title = 'Elevated systolic blood pressure in smoking groups'
    )+
  theme(
    # Adjusting font sizes for various text elements
    plot.title = element_text(size = 16),  # Change plot title font size
    axis.title.x = element_text(size = 14),  # Change X-axis label font size
    axis.title.y = element_text(size = 14),  # Change Y-axis label font size
    axis.text = element_text(size = 12)  # Change axis tick label font size
    # Add other theme settings as needed
  )

A diastolic blood pressure above 80 mm Hg indicates elevated health risk. The smoked but quit group has the highest median, followed by the still smoke group, and both approach the medical reference value, while the never smoke group stays well below it. This suggests that smoking is associated with higher diastolic blood pressure.

smoking_cleaned %>%
  ggplot() +
  geom_boxplot(aes(x = DBP, y=smoking_status),outlier.shape = NA, size=1) +
  labs(x = 'Diastolic blood pressure (mm Hg)', y = NULL) +
    geom_vline(
            xintercept = 80,
            color = 'red', 
            linetype = 'dashed',
            size = 1.5)+
    scale_x_continuous(
        # Set the lower & upper boundaries (rows outside them are dropped)
        limits = c(40, 120),
        # expansion() replaces the deprecated expand_scale()
        expand = expansion(mult = c(0, 0.05))) +
    labs(
        title = 'Elevated diastolic blood pressure in smoking groups'
    )+
  theme(
    # Adjusting font sizes for various text elements
    plot.title = element_text(size = 16),  # Change plot title font size
    axis.title.x = element_text(size = 14),  # Change X-axis label font size
    axis.title.y = element_text(size = 14),  # Change Y-axis label font size
    axis.text = element_text(size = 12)  # Change axis tick label font size
    # Add other theme settings as needed
  )

Blood glucose is elevated in the two smoking groups, but it remains within the medical reference range.

smoking_cleaned %>%
  ggplot() +
  geom_boxplot(aes(x = BLDS, y = smoking_status),outlier.shape = NA, size=1) +
  labs(x = 'Fasting blood glucose [mg/dL]', y = NULL) +
    geom_vline(
            xintercept = c(79, 110),
            color = 'red', 
            linetype = 'dashed',
            size = 1.5)+
    scale_x_continuous(
        # Set the lower & upper boundaries (rows outside them are dropped)
        limits = c(50, 140),
        # expansion() replaces the deprecated expand_scale()
        expand = expansion(mult = c(0, 0.05))) +
    #theme_minimal_vgrid(font_size = 18) +
    labs(
        title = 'Elevated blood glucose in smoking groups'
    )+
  theme(
    # Adjusting font sizes for various text elements
    plot.title = element_text(size = 16),  # Change plot title font size
    axis.title.x = element_text(size = 14),  # Change X-axis label font size
    axis.title.y = element_text(size = 14),  # Change Y-axis label font size
    axis.text = element_text(size = 12)  # Change axis tick label font size
    # Add other theme settings as needed
  )

There are no significant differences in total cholesterol among the smoking groups.

smoking_cleaned %>%
  ggplot() +
  geom_boxplot(aes(x = tot_chole, y=smoking_status),outlier.shape = NA, size=1) +
  labs(x = 'Total cholesterol [mg/dL]', y = NULL) +
    geom_vline(
            xintercept = 239,
            color = 'red', 
            linetype = 'dashed',
            size = 1.5)+
    scale_x_continuous(
        # Set the lower & upper boundaries (rows outside them are dropped)
        limits = c(50, 320),
        # expansion() replaces the deprecated expand_scale()
        expand = expansion(mult = c(0, 0.05))) +
    labs(
            title = 'No significant changes in total cholesterol'
    )+
  theme(
    # Adjusting font sizes for various text elements
    plot.title = element_text(size = 16),  # Change plot title font size
    axis.title.x = element_text(size = 14),  # Change X-axis label font size
    axis.title.y = element_text(size = 14),  # Change Y-axis label font size
    axis.text = element_text(size = 12)  # Change axis tick label font size
    # Add other theme settings as needed
  )

There are no significant differences in LDL cholesterol among the smoking groups.

smoking_cleaned %>%
  ggplot() +
  geom_boxplot(aes(x = LDL_chole, y=smoking_status),outlier.shape = NA, size=1) +
  labs(x = 'LDL cholesterol [mg/dL]', y = NULL) +
    geom_vline(
            xintercept = 160,
            color = 'red', 
            linetype = 'dashed',
            size = 1.5,
            )+
    scale_x_continuous(
        # Set the lower & upper boundaries (rows outside them are dropped)
        limits = c(0, 220),
        # expansion() replaces the deprecated expand_scale()
        expand = expansion(mult = c(0, 0.05))) +
    labs(
            title = 'No significant changes in LDL cholesterol'
    )+
  theme(
    # Adjusting font sizes for various text elements
    plot.title = element_text(size = 16),  # Change plot title font size
    axis.title.x = element_text(size = 14),  # Change X-axis label font size
    axis.title.y = element_text(size = 14),  # Change Y-axis label font size
    axis.text = element_text(size = 12)  # Change axis tick label font size
    # Add other theme settings as needed
  )

5. Conclusions

First, we found that smoking rates declined around the world. The US data also show a declining smoking rate across the country. The states with the highest smoking rates over the years include Kentucky, West Virginia, and Nevada. The South Korea health survey dataset shows a strong relationship between systolic blood pressure (SBP) and diastolic blood pressure (DBP), as well as between total cholesterol (tot_chole) and low-density lipoprotein cholesterol (LDL_chole). Finally, within that dataset, the smoking groups (smoked but quit, and still smoke) show significantly elevated risks of overweight, blood pressure problems, and blood glucose problems. Surprisingly, the still smoke group performs better on those metrics than the smoked but quit group.

In this study, smoking status is divided into only three groups; the analysis could be improved by recording how many years or how frequently each person has smoked. We could also bring in gender to see whether the associated health outcomes and trends show gender-specific patterns. Gender-specific smoking rates and smoking status are available in the by_country dataset and the South Korea health survey dataset.

6. Attribution

Kejia and Jiaxin contributed equally to this project.

# Load libraries and settings here
library(knitr)
library(cowplot)
library(viridis)
library(readxl)
library(maps)
library(sf)
library(rnaturalearth)
library(rnaturalearthdata)
library(rnaturalearthhires)
library(janitor)
library(tidyverse)
library(here)
library(ggrepel)
library(ggridges)
library(HistData)
library(GGally)
library(palmerpenguins)
library(patchwork)
library(gganimate)
library(magick)

knitr::opts_chunk$set(
  warning = FALSE,
  message = FALSE,
  comment = "#>",
  fig.path = "figs/", # Folder where rendered plots are saved
  fig.width = 7.252, # Default plot width
  fig.height = 4, # Default plot height
  fig.retina = 3 # For better plot resolution
)

# Put any other "global" settings here, e.g. a ggplot theme:
theme_set(theme_bw(base_size = 20))


# Write code below here to load any data used in project
smoking_drinking<-read.csv(here::here('data_raw',"smoking_driking_dataset_Ver01.csv"))
by_states<-read.csv(here::here('data_raw',"CDC_US_v2.csv"))

smoking_cleaned <- smoking_drinking %>%
  mutate(
    BMI = weight / ((height / 100) ^ 2),
    weight_category = case_when(
      BMI > 30 ~ "Obesity",
      BMI > 25 ~ "Overweight",
      BMI > 18.5 ~ "Healthy",
      TRUE ~ "Underweight"  # Catch-all for values <= 18.5
      ),
        
    smoking_status = case_when(
      SMK_stat_type_cd == "1" ~ "Never Smoke",
      SMK_stat_type_cd == "2" ~ "Smoked But Quit",
      SMK_stat_type_cd == "3" ~ "Still Smoke"
    )
  ) %>%
  # Drop columns that are not used in the analysis
  select(!c(DRK_YN, SMK_stat_type_cd, urine_protein, hear_left, hear_right,
            triglyceride, hemoglobin, serum_creatinine, SGOT_AST, SGOT_ALT,
            gamma_GTP, height, weight, waistline))
by_country <- read.csv(here::here('data_raw', 'by_country.csv'), skip = 1) %>% 
    clean_names() %>% 
    select(c(countries_territories_and_areas, year, both_sexes_1,male_1,female_1)) %>% 
    rename(
     smoking_rate = both_sexes_1,
     male_smoking_rate = male_1,
     female_smoking_rate = female_1  
    ) %>%
    filter(!rowSums(is.na(.)) > 0)
    
by_states <- by_states %>%
    clean_names() %>% 
    mutate(
    smoking_rate = 1-never_smoked,
    smoking_rate_new = smoke_everyday + smoke_some_days
    )
top10countries <- by_country %>%
    filter(year==2000) %>% 
    arrange(desc(smoking_rate)) %>%
    slice(1:10)
bottom10countries <- by_country %>%
    filter(year==2000) %>% 
    arrange(smoking_rate) %>%
    slice(1:10)
by_country_slope <- by_country %>%
  filter(
    year %in% c(2000, 2023),
    countries_territories_and_areas %in% top10countries$countries_territories_and_areas) %>%
  mutate(
    # Reorder countries for plotting
   countries_territories_and_areas = fct_reorder2(countries_territories_and_areas,
      year, desc(smoking_rate)),
    # Convert year to discrete variable
    year = as.factor(year),
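    # Make labels of the form "Country (rate)"; paste0() avoids the
    # extra spaces that paste() inserts between its arguments
    label = paste0(countries_territories_and_areas, ' (',
                   round(smoking_rate), ')'),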
    label_left = ifelse(year == 2000, label, NA),
    label_right = ifelse(year == 2023, label, NA))

ggplot(by_country_slope,
       aes(
           x = year,
           y = smoking_rate,
           group = countries_territories_and_areas)) +
    geom_line(size=0.8)+
    # Add 2000 labels (left side)
    geom_text_repel(
      aes(label = label_left),
      hjust = 1, nudge_x = -0.05,
      direction = 'y', segment.color = 'grey') +
    # Add 2023 labels (right side)
    geom_text_repel(aes(label = label_right),
      hjust = 0, nudge_x = 0.05,
      direction = 'y', segment.color = 'grey') +
    scale_x_discrete(position = 'top') +
    scale_color_manual(values = c('black')) +
    # Annotate & adjust theme
    labs(x = NULL,
         y = 'Smoking rate (%) ',
         title = 'Top 10 smoking countries (2000 - 2023)') +
    theme_minimal_grid() +
    theme(panel.grid  = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          legend.position = 'none')
by_country_slope_bottom <- by_country %>%
  filter(
    year %in% c(2000, 2023),
    countries_territories_and_areas %in% bottom10countries$countries_territories_and_areas) %>%
  mutate(
    # Reorder countries for plotting
   countries_territories_and_areas = fct_reorder2(countries_territories_and_areas,
      year, desc(smoking_rate)),
    # Convert year to discrete variable
    year = as.factor(year),
    # Define line color
    lineColor = if_else(
        countries_territories_and_areas == 'Oman', 'OMAN', 'other'),
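    # Make labels of the form "Country (rate)"; paste0() avoids the
    # extra spaces that paste() inserts between its arguments
    label = paste0(countries_territories_and_areas, ' (',
                   round(smoking_rate), ')'),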
    label_left = ifelse(year == 2000, label, NA),
    label_right = ifelse(year == 2023, label, NA))
ggplot(by_country_slope_bottom,
       aes(
           x = year,
           y = smoking_rate,
           group = countries_territories_and_areas)) +
    geom_line(aes(color = lineColor),
              size=0.8)+
    # Add 2000 labels (left side)
    geom_text_repel(
      aes(label = label_left),
      hjust = 1, nudge_x = -0.05,
      direction = 'y', segment.color = 'grey') +
    # Add 2023 labels (right side)
    geom_text_repel(aes(label = label_right),
      hjust = 0, nudge_x = 0.05,
      direction = 'y', segment.color = 'grey') +
    scale_x_discrete(position = 'top') +
    # Name the colors so the mapping does not depend on level order
    scale_color_manual(values = c(OMAN = 'red', other = 'black')) +
    # Annotate & adjust theme
    labs(x = NULL,
         y = 'Smoking rate (%) ',
         title = '10 countries with lowest smoking rates (2000 - 2023)') +
    theme_minimal_grid() +
    theme(panel.grid  = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          legend.position = 'none')
by_states_formatted_new <- by_states %>%
  group_by(year) %>%
  # Rank states within each year (rank 1 = highest smoking rate)
  mutate(new_rank = rank(-smoking_rate_new, ties.method = "first"),
         Value_lbl = paste0(" ",smoking_rate_new)) %>%
  group_by(state) %>% 
  filter(new_rank <=10) %>%
  ungroup()
by_states_anim_new <- by_states_formatted_new %>%
    mutate(year = as.integer(year)) %>%
    ggplot(aes(x = new_rank, group = state,fill = state)) +
    geom_tile(aes(y = smoking_rate_new / 2,
                  height = smoking_rate_new),
              width = 0.5, alpha = 0.8, color = NA) +
    geom_text(aes(y = smoking_rate_new, label = Value_lbl), hjust = 0) +
    geom_text(aes(y = 0, label = paste(state, " ")),
              vjust = 0.2, hjust = 1) +
    # A plot keeps only one coordinate system, so coord_flip() is added once
    coord_flip(clip = 'off', expand = FALSE) +
    scale_y_continuous(labels = scales::comma) +
    scale_fill_viridis(discrete = TRUE) +
    scale_color_viridis(discrete = TRUE) +
    scale_x_reverse() +
    # scale_x_continuous( limits = c(0, 0.6),)+
    guides(color = FALSE) +
    theme_minimal_vgrid() +
    theme(
        axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        legend.position = "none",
        legend.background = element_rect(fill = 'white'),
        plot.title = element_text(
          size = 22, hjust = 0.5, face = 'bold',
          colour = 'grey', vjust = -1),
        plot.subtitle = element_text(
          size = 18, hjust = 0.5,
          face = 'italic', color = 'grey'),
        plot.caption = element_text(
          size = 8, hjust = 0.5,
          face = 'italic', color = 'grey'),
          plot.margin = margin(0.5, 2, 0.5, 3, 'cm')) +
    transition_time(year) +
    view_follow(fixed_x = TRUE) +
    labs(title    = 'Year : {frame_time}',
         subtitle = 'Top 10 states by smoking rate')

animate(by_states_anim_new, duration = 15, end_pause = 15,
        width = 800, height = 700, res = 150,fps = 20, 
        renderer = magick_renderer())


ggcorr(smoking_cleaned, label = TRUE) +
  ggtitle("Correlation Plot of Variables") +  # Adding a title
  theme_minimal() +  # Applying a minimal theme (you can change the theme as desired)
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

smoking_cleaned %>%
  ggplot() +
  geom_boxplot(aes(x = BMI, y=smoking_status),outlier.shape = NA, size=1) +
  labs(x = 'Body Mass Index', y = NULL) +
    geom_vline(
            xintercept = c(19,25),
            color = 'red', 
            linetype = 'dashed',
            size = 1.5,
            )+
    scale_x_continuous(
        # Set the lower & upper boundaries. Note: scale limits drop rows
        # outside this range before the box statistics are computed;
        # coord_cartesian(xlim = ...) would zoom without dropping them.
        limits = c(15, 30),
        # expansion() replaces the deprecated expand_scale()
        expand = expansion(mult = c(0, 0.05))) +
    labs(
        title = 'Higher overweight risks in smoking groups'
    )+
  theme(
    # Adjusting font sizes for various text elements
    plot.title = element_text(size = 16),  # Change plot title font size
    axis.title.x = element_text(size = 14),  # Change X-axis label font size
    axis.title.y = element_text(size = 14),  # Change Y-axis label font size
    axis.text = element_text(size = 12)  # Change axis tick label font size
    # Add other theme settings as needed
  )
smoking_cleaned %>%
  ggplot() +
  geom_boxplot(aes(x = SBP, y=smoking_status),outlier.shape = NA, size=1) +
  labs(x = 'Systolic blood pressure (mm Hg)', y = NULL) +
    geom_vline(
            xintercept = 130,
            color = 'red', 
            linetype = 'dashed',
            size = 1.5)+
    scale_x_continuous(
        # Set the lower & upper boundaries (rows outside them are dropped)
        limits = c(75, 175),
        # expansion() replaces the deprecated expand_scale()
        expand = expansion(mult = c(0, 0.05))) +
    labs(
        title = 'Elevated systolic blood pressure in smoking groups'
    )+
  theme(
    # Adjusting font sizes for various text elements
    plot.title = element_text(size = 16),  # Change plot title font size
    axis.title.x = element_text(size = 14),  # Change X-axis label font size
    axis.title.y = element_text(size = 14),  # Change Y-axis label font size
    axis.text = element_text(size = 12)  # Change axis tick label font size
    # Add other theme settings as needed
  )

# Diastolic blood pressure by smoking status; dashed line marks the 80 mm Hg reference value
smoking_cleaned %>%
  ggplot() +
  geom_boxplot(aes(x = DBP, y = smoking_status), outlier.shape = NA, linewidth = 1) +
  geom_vline(
    xintercept = 80,
    color = 'red',
    linetype = 'dashed',
    linewidth = 1.5
  ) +
  scale_x_continuous(
    # Lower and upper axis limits; observations outside this range are
    # dropped before the boxplot statistics are computed
    limits = c(40, 120),
    # No padding on the left, 5% padding on the right
    expand = expansion(mult = c(0, 0.05))
  ) +
  labs(
    title = 'Elevated diastolic blood pressure in smoking groups',
    x = 'Diastolic blood pressure (mm Hg)',
    y = NULL
  ) +
  theme(
    plot.title = element_text(size = 16),    # Plot title font size
    axis.title.x = element_text(size = 14),  # X-axis label font size
    axis.title.y = element_text(size = 14),  # Y-axis label font size
    axis.text = element_text(size = 12)      # Axis tick label font size
  )

# Fasting blood glucose by smoking status; dashed lines mark the 79 and 110 mg/dL reference values
smoking_cleaned %>%
  ggplot() +
  geom_boxplot(aes(x = BLDS, y = smoking_status), outlier.shape = NA, linewidth = 1) +
  geom_vline(
    xintercept = c(79, 110),
    color = 'red',
    linetype = 'dashed',
    linewidth = 1.5
  ) +
  scale_x_continuous(
    # Lower and upper axis limits; observations outside this range are
    # dropped before the boxplot statistics are computed
    limits = c(50, 140),
    # No padding on the left, 5% padding on the right
    expand = expansion(mult = c(0, 0.05))
  ) +
  labs(
    title = 'Elevated blood glucose in smoking groups',
    x = 'Fasting blood glucose [mg/dL]',
    y = NULL
  ) +
  theme(
    plot.title = element_text(size = 16),    # Plot title font size
    axis.title.x = element_text(size = 14),  # X-axis label font size
    axis.title.y = element_text(size = 14),  # Y-axis label font size
    axis.text = element_text(size = 12)      # Axis tick label font size
  )

# Total cholesterol by smoking status; dashed line marks the 239 mg/dL reference value
smoking_cleaned %>%
  ggplot() +
  geom_boxplot(aes(x = tot_chole, y = smoking_status), outlier.shape = NA, linewidth = 1) +
  geom_vline(
    xintercept = 239,
    color = 'red',
    linetype = 'dashed',
    linewidth = 1.5
  ) +
  scale_x_continuous(
    # Lower and upper axis limits; observations outside this range are
    # dropped before the boxplot statistics are computed
    limits = c(50, 320),
    # No padding on the left, 5% padding on the right
    expand = expansion(mult = c(0, 0.05))
  ) +
  labs(
    title = 'No significant changes in total cholesterol',
    x = 'Total cholesterol [mg/dL]',
    y = NULL
  ) +
  theme(
    plot.title = element_text(size = 16),    # Plot title font size
    axis.title.x = element_text(size = 14),  # X-axis label font size
    axis.title.y = element_text(size = 14),  # Y-axis label font size
    axis.text = element_text(size = 12)      # Axis tick label font size
  )

# LDL cholesterol by smoking status; dashed line marks the 160 mg/dL reference value
smoking_cleaned %>%
  ggplot() +
  geom_boxplot(aes(x = LDL_chole, y = smoking_status), outlier.shape = NA, linewidth = 1) +
  geom_vline(
    xintercept = 160,
    color = 'red',
    linetype = 'dashed',
    linewidth = 1.5
  ) +
  scale_x_continuous(
    # Lower and upper axis limits; observations outside this range are
    # dropped before the boxplot statistics are computed
    limits = c(0, 220),
    # No padding on the left, 5% padding on the right
    expand = expansion(mult = c(0, 0.05))
  ) +
  labs(
    title = 'No significant changes in LDL cholesterol',
    x = 'LDL cholesterol [mg/dL]',
    y = NULL
  ) +
  theme(
    plot.title = element_text(size = 16),    # Plot title font size
    axis.title.x = element_text(size = 14),  # X-axis label font size
    axis.title.y = element_text(size = 14),  # Y-axis label font size
    axis.text = element_text(size = 12)      # Axis tick label font size
  )