The trends, spatial difference, and associated outcome of smoking

Author

Kejia Hu, Jiaxin Wang

Published

December 6, 2023

smoking_cleaned <- smoking_drinking %>%
  mutate(
    BMI = weight / ((height / 100) ^ 2),
    weight_category = case_when(
      BMI > 30 ~ "Obesity",
      BMI > 25 ~ "Overweight",
      BMI > 18.5 ~ "Healthy",
      TRUE ~ "Underweight"  # Catch-all for values <= 18.5
      ),
        
    smoking_status = case_when(
      SMK_stat_type_cd == "1" ~ "Never Smoke",
      SMK_stat_type_cd == "2" ~ "Smoked But Quit",
      SMK_stat_type_cd == "3" ~ "Still Smoke",),
    )%>% 
    select(!c(DRK_YN,SMK_stat_type_cd,urine_protein,hear_left,hear_right, triglyceride,hemoglobin,urine_protein,serum_creatinine,SGOT_AST,SGOT_ALT,gamma_GTP, height, weight, waistline))
by_country <- read.csv(here::here('data_raw', 'by_country.csv'), skip = 1) %>% 
    clean_names() %>% 
    select(c(countries_territories_and_areas, year, both_sexes_1,male_1,female_1)) %>% 
    rename(
     smoking_rate = both_sexes_1,
     male_smoking_rate = male_1,
     female_smoking_rate = female_1  
    ) %>%
    filter(!rowSums(is.na(.)) > 0)
by_states <- by_states %>%
    clean_names() %>% 
    mutate(
    smoking_rate = 1-never_smoked,
    smoking_rate_new = smoke_everyday + smoke_some_days
    )

1. Introduction

Smoking remains a pervasive global health concern with significant implications for public well-being. According to CDC, cigarette smoking is the leading cause of preventable death in the United States, which causes more than 480,000 deaths each year in the United States. It is proven to increase the risks of lung cancer, heart disease, and strokes. As shown in the photo, there is a tragic change in the lungs after smoking. Besides, there are multiple toxic chemicals generated in the life cycle of cigarettes and associated products, such as nicotine, and butane. In this context, we are curious about whether the smoking rate has decreased as the public awareness of the risks brought by smoking cigarettes. In addition, are there significant differences in smoking rates all around the world, and across the US? At last, we would like to know if there are other associated health outcomes with smoking status.Understanding these patterns is crucial for informing public health initiatives and guiding efforts to mitigate the adverse effects associated with smoking.

With our survey, we found that the trend of smoking rate went down all over the world. In addition, the data in the US also shows a declining smoking rate all across the US. Some top states with highest smoking rate over the years are Kentucky, West Virginia, and Nevada.

There is a strong relation between systolic blood pressure (SBP) and diastolic blood pressure (DBP), as well as total cholesterol (total_chole) and low-density lipoprotein cholesterol (LDL_chole). Overall, within the dataset, there have been significantly elevated risks in overweight, blood pressure problems, and blood glucose problems in smoking groups (smoke but quit, and still smoke). Surprisingly, the still smoke group shows a better performance in those metrics than the smoke but quit group.

2. Research question

1.What countries have higher smoking rates? What regions in the US have higher tobacco consumption?
2.Does smoking rate increase or decrease over time?
3.Are those measurable health outcomes correlated with each other? 
4.What measurable health outcomes are correlated with different types of smoking statuses? 

3. Data sources

There are three data sets involved in our research.

The first data set is smoking_driking_dataset_Ver01.csv. It is used to explore: What measurable health outcomes are correlated with different types of smoking statuses? Are those measurable health outcomes correlated with each other? We downloaded this data from this link. This dataset is already pre-cleaned by a Kaggle user.There are 24 columns, 991346 rows of data. The original source is this link. It is collected by the National Health Insurance Service in Korea.

The second data is CDC_US_v2.csv. It is used to explore: What countries have higher smoking rates? Does smoking rate increase or decrease over time? We downloaded from this linkThis data is pre-cleaned by the github owner. This data is originally from this link. This data was collected by the CDC and describes the frequency of smoking rates of different countries.

The third data is by_country.csv. It is used to explore: what regions in the US have higher tobacco consumption? Does smoking rate increase or decrease over time? This data set is originally from this link. We cleaned this data by ourselves. This data is collected by WHO. It shows Age-standardized estimates of current tobacco use, tobacco smoking and cigarette smoking data by country. This data set contains 1478 rows and 11 columns. This estimates uses a statistical model based on a Bayesian negative binomial meta-regression is used to model prevalence of current tobacco use for each country, separately for men and women. The model has two main components: (a) adjusting for missing indicators and age groups, and (b) generating an estimate of trends over time as well as the 95% credible interval around the estimate.

4. Results

Does smoking rate increase or decrease over time? What countries have higher smoking rates?

top10countries <- by_country %>%
    filter(year==2000) %>% 
    arrange(desc(smoking_rate)) %>%
    slice(1:10)
bottom10countries <- by_country %>%
    filter(year==2000) %>% 
    arrange(smoking_rate) %>%
    slice(1:10)

There are 164 countries/territories/areas in this cleaned data sets.The estimates are in the year of 2000, 2005, 2010, 2018, 2019, 2020, 2023, and 2025. We decide to compare the year of 2000 and 2023 using a slope chart.

by_country_slope <- by_country %>%
  filter(
    year %in% c(2000, 2023),
    countries_territories_and_areas %in% top10countries$countries_territories_and_areas) %>%
  mutate(
    # Reorder state variables
   countries_territories_and_areas = fct_reorder2(countries_territories_and_areas,
      year, desc(smoking_rate)),
    # Convert year to discrete variable
    year = as.factor(year),
    # Make labels
    label = paste(countries_territories_and_areas, ' (',
                  round(smoking_rate), ')'),
    label_left = ifelse(year == 2000, label, NA),
    label_right = ifelse(year == 2023, label, NA))

The overall trend of smoking rate in top 10 countries goes down by 20%. The top three smoking countries in 2000 are Kiribati (68%), Nauru (57%), and Greece (55%).The top three smoking countries in 2023 are Nauru(42%), Serbia(39%), and Bulgaria(38%).

ggplot(by_country_slope,
       aes(
           x = year,
           y = smoking_rate,
           group = countries_territories_and_areas)) +
    geom_line(size=0.8)+
    # Add 2000 labels (left side)
    geom_text_repel(
      aes(label = label_left),
      hjust = 1, nudge_x = -0.05,
      direction = 'y', segment.color = 'grey') +
    # Add 2023 labels (right side)
    geom_text_repel(aes(label = label_right),
      hjust = 0, nudge_x = 0.05,
      direction = 'y', segment.color = 'grey') +
    scale_x_discrete(position = 'top') +
    scale_color_manual(values = c('black')) +
    # Annotate & adjust theme
    labs(x = NULL,
         y = 'Smoking rate (%) ',
         title = 'Top 10 smoking countries (2000 - 2023)') +
    theme_minimal_grid() +
    theme(panel.grid  = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          legend.position = 'none')

The overall trend goes down by 1-6% for the bottom 10 countries, except Oman has a slight increase in smoking rate. Ethiopia has the lowest smoking rate at 5% in 2000. Ghana has the lowest smoking rate at 2% in 2023.

by_country_slope_bottom <- by_country %>%
  filter(
    year %in% c(2000, 2023),
    countries_territories_and_areas %in% bottom10countries$countries_territories_and_areas) %>%
  mutate(
    # Reorder state variables
   countries_territories_and_areas = fct_reorder2(countries_territories_and_areas,
      year, desc(smoking_rate)),
    # Convert year to discrete variable
    year = as.factor(year),
    # Define line color
    lineColor = if_else(
        countries_territories_and_areas == 'Oman', 'OMAN', 'other'),
    # Make labels
    label = paste(countries_territories_and_areas, ' (',
                  round(smoking_rate), ')'),
    label_left = ifelse(year == 2000, label, NA),
    label_right = ifelse(year == 2023, label, NA))
ggplot(by_country_slope_bottom,
       aes(
           x = year,
           y = smoking_rate,
           group = countries_territories_and_areas)) +
    geom_line(aes(color = lineColor),
              size=0.8)+
    # Add 2000 labels (left side)
    geom_text_repel(
      aes(label = label_left),
      hjust = 1, nudge_x = -0.05,
      direction = 'y', segment.color = 'grey') +
    # Add 2023 labels (right side)
    geom_text_repel(aes(label = label_right),
      hjust = 0, nudge_x = 0.05,
      direction = 'y', segment.color = 'grey') +
    scale_x_discrete(position = 'top') +
    scale_color_manual(values = c('red', 'black')) +
    # Annotate & adjust theme
    labs(x = NULL,
         y = 'Smoking rate (%) ',
         title = '10 countries with lowest smoking rates (2000 - 2023)') +
    theme_minimal_grid() +
    theme(panel.grid  = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks = element_blank(),
          legend.position = 'none')