Making maps with the CES


What you’ll learn

  • How to use parts of the urbnmapr package to make some maps
  • Using ggplot2 attributes to make your maps more readable
  • When not to use maps (or at least think carefully before doing so)

What won’t you learn? There is a lot that goes into maps: projection types (see Dan Kelley’s excellent writeup on the subject), zip code fuzziness, and loading in shapefiles from the Census, just to name a few topics. Cartographers are beings of infinite knowledge and power, and I won’t be able to teach you all of their secrets (I don’t know most of them!). But don’t let that dissuade you! If you’re passionate about maps, this will be a good place for you to start. And if you just need to plot a state-level election outcome map for a class and are panicking, perhaps this tutorial will help you too.

We’ll also be using the dataverse and survey packages in this tutorial. I will not explain them in detail here. I have discussed them (and provided additional resources) in my last post: Plotting trends over time with the CES.

Much of this tutorial uses code that is adapted from an article by Christopher Goodman and the Urban Institute’s very own tutorial. When I do this adapting, I note it in-text, but I wanted to give credit up-top too. Thank you for these incredible resources!

As always, we need to start our code by loading in packages. You’ll need the following packages to execute my code, which you can install with install.packages("name"). You’ll then load them in like so:

#### LOAD PACKAGES ####
library(dataverse) # loading data
library(urbnmapr)  # geographic information for mapping!
library(ggplot2)   # pretty plots
library(geofacet)  # tile plots
library(scales)    # easy formatting
library(dplyr)     # data manipulation
library(survey)    # survey analysis
library(tidyverse) # the holy grail: more data wrangling

library(leaflet)   # optional: for interactive maps!
library(here)

The only exception is the urbnmapr package, which you need to install from GitHub. This is easy to do! Make sure you have the devtools package installed as well and run the code below. If you are using a Mac, you may run into some problems loading the geofacet package. If this happens, follow this guidance on Stack Overflow, but apply it to the packages that did not load correctly. You may need to remove package installations and re-install them. In general, make sure you have the sf package installed prior to attempting to re-install geofacet (I sketch that sequence after the install code below).

Another note up top: my theming for plots includes fonts. If you do not have these fonts, your code will break. Simply run the code without lines mentioning fonts (I use “Open Sans Condensed Light”) and you should be good to go.
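
For example, a theming line like this one from the plotting code below:

theme_minimal(base_family = "Open Sans Condensed Light")

would simply become:

theme_minimal()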

require("devtools")
devtools::install_github("UrbanInstitute/urbnmapr")
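
And if geofacet (or another package) refused to load, the remove-and-re-install sequence I described above looks roughly like this. This is just a sketch; swap in whichever packages actually failed for you:

# remove the broken installations, then re-install,
# making sure sf goes in before geofacet
remove.packages(c("geofacet", "sf"))
install.packages("sf")
install.packages("geofacet")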

Getting the data

We’ll load in the 2020 CES data using the dataverse package.

##### LOADING DATA ####
ces2020_dataverse <- get_dataframe_by_name(
  filename = "CES20_Common_OUTPUT_vv.dta",
  dataset = "10.7910/DVN/E9N6PH",
  original = TRUE,
  .f = haven::read_dta,
  server = "dataverse.harvard.edu"
)

Now before proceeding, we’ll need to decide what questions we want to answer with maps. Picking these questions is important, and not something this tutorial can completely teach you. I’ll include a checklist that I like to use when I have an urge to make a map, but you’ll have to do some thinking on your own, too.

I’m interested in something methodological. The CES (and other surveys like it) often has multiple waves. There are many reasons for doing this. Maybe you want to see how respondents’ opinions change after an important event (an election? The release of Red (Taylor’s Version)?). Perhaps you want to interview the same respondents again and again (called a panel study). Whatever your reason, you now have one main obstacle: attrition. It’s hard to get people to take surveys. Now getting those same people to take another one? It’s a tall task! I want to calculate what percentage of respondents failed to take the CES’ post-election survey in 2020, by state. And I want to map it!

Let’s start by selecting relevant questions. We’ll need some kind of case identifier, the variable of interest, and a geography variable of some kind (to use in mapping). I am most interested in state differences, so I am selecting the inputstate variable.

I do some other data manipulation in the following code to make things run smoothly (making sure state FIPS codes are interpreted correctly and recoding the tookpost variable). I’ll explain what I’m doing in more detail in the code chunk!

# Let's select variables that we care about!
ces2020_selected <- ces2020_dataverse %>%
  select(
    
    # id for each respondent
    caseid, 
    
    # did you take the post-election survey?
    tookpost, 
    
    # state respondent is living in.
    # In this code, I am both selecting
    # the inputstate column and renaming
    # it to state_fips_rough, all in
    # one line!
    state_fips_rough = inputstate) %>%
  mutate(
    
    # tookpost is originally coded so 2 = Yes
    # and 1 = No. I am recoding it so Yes = 1
    # and No = 0. Why? When I do calculations, 
    # the average of tookpost will be a proportion
    # between 0 and 1 that I can represent as a 
    # percent (ex. "85% of respondents took the
    # post-election survey").
    tookpost_recoded = case_when(
      tookpost == 1 ~ 0,
      tookpost == 2 ~ 1
    ),
    
    # why am I changing state_fips_rough? Like
    # most coding, things were breaking which inspired
    # me to get creative. Try running this chunk without
    # this line and look at ces2020_selected. See anything
    # weird? Check before you read on!
    state_fips = as.character(sprintf("%02d", state_fips_rough))
    
    # Okay, now that you've looked, I'll spill the beans. 
    # State FIPS codes are always two digits, like 15, 50, 
    # and 02. R was reading in single digits like 02 as
    # just 2, which is a problem later on. To fix this
    # I used the function sprintf to tell R to add
    # leading zeroes to state_fips_rough until there
    # were two digits per entry. So a value of 4 would
    # turn into 04, but 36 would stay 36! I can't spend
    # more time on this now, but if people are interested
    # in this kind of data manipulation, I am happy to write
    # about it!
  )
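
If you want to see that zero-padding behavior in isolation, here’s a quick sketch you can run in your console:

# sprintf pads each whole number with leading
# zeroes until it is two digits wide
sprintf("%02d", c(2, 4, 36))
## [1] "02" "04" "36"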

Before we move forward, I want to see what sample sizes we’re working with. I can do this with the lovely dplyr and some piping.

# hey R, for the following, use ces2020_selected please!
ces2020_selected %>%
  
  # whatever calculation I tell you to do, please
  # do it grouped--- that is, only within entries
  # with the same state FIPS code.
  group_by(state_fips) %>%
  
  # could you tell me how many entries I have per
  # state?
  summarise(n = n())
## # A tibble: 51 x 2
##    state_fips     n
##    <chr>      <int>
##  1 01           947
##  2 02           115
##  3 04          1463
##  4 05           536
##  5 06          5035
##  6 08          1061
##  7 09           642
##  8 10           240
##  9 11           197
## 10 12          4615
## # ... with 41 more rows
# you'll notice that the whole table does not print here.
# To see the entire thing, print it in your R console
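
If you’d rather see every row at once, tibbles let you override their default print length:

# same summary as above, but printing all 51 rows
ces2020_selected %>%
  group_by(state_fips) %>%
  summarise(n = n()) %>%
  print(n = Inf)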

Interesting! It’s cool to see how the CES samples across the United States. Because I’m interested in just looking at the CES as is (and not using the CES to make a hypothesis about Americans writ large), I don’t need to do any other steps here. These numbers will be good to keep in mind as we calculate what percentage of respondents took the post-election survey by state. Actually, let’s go ahead and calculate that with dplyr!

post_aggregate <- ces2020_selected %>%
  group_by(state_fips) %>%
  
  # give me the mean of tookpost_recoded 
  # for each state, and make that a new
  # variable called mean_post
  summarise(mean_post = mean(tookpost_recoded))

# let's take a look!
post_aggregate
## # A tibble: 51 x 2
##    state_fips mean_post
##    <chr>          <dbl>
##  1 01             0.796
##  2 02             0.774
##  3 04             0.881
##  4 05             0.772
##  5 06             0.842
##  6 08             0.866
##  7 09             0.864
##  8 10             0.854
##  9 11             0.792
## 10 12             0.851
## # ... with 41 more rows

Now it’s time to use the urbnmapr package developed by The Urban Institute. Mapping is actually quite complicated! Even though we now have a summary table that tells us what percentage of respondents took the post-election survey for each state, R still needs to be able to draw the map! How will it know what states are shaped like? Where their boundaries run? How will it make the map pretty?

You try drawing a map of the U.S. from scratch without guidelines. It’s hard! urbnmapr helps by letting us pull from its database. Below, I use our state_fips column to pull geographic details from urbnmapr. I can do this because urbnmapr has a database with a column named state_fips (almost like I named it the same on purpose!).
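
If you want to verify that shared key yourself, you can peek at the columns urbnmapr provides before joining:

# which columns does the urbnmapr states table have?
names(urbnmapr::states)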

post_map <- left_join(post_aggregate, 
                      states, by = "state_fips")

post_map
## # A tibble: 83,933 x 10
##    state_fips mean_post  long   lat order hole  piece group state_abbv
##    <chr>          <dbl> <dbl> <dbl> <int> <lgl> <fct> <fct> <chr>     
##  1 01             0.796 -88.5  31.9     1 FALSE 1     01.1  AL        
##  2 01             0.796 -88.5  31.9     2 FALSE 1     01.1  AL        
##  3 01             0.796 -88.5  31.9     3 FALSE 1     01.1  AL        
##  4 01             0.796 -88.5  32.0     4 FALSE 1     01.1  AL        
##  5 01             0.796 -88.5  32.0     5 FALSE 1     01.1  AL        
##  6 01             0.796 -88.5  32.1     6 FALSE 1     01.1  AL        
##  7 01             0.796 -88.4  32.2     7 FALSE 1     01.1  AL        
##  8 01             0.796 -88.4  32.2     8 FALSE 1     01.1  AL        
##  9 01             0.796 -88.4  32.2     9 FALSE 1     01.1  AL        
## 10 01             0.796 -88.4  32.3    10 FALSE 1     01.1  AL        
## # ... with 83,923 more rows, and 1 more variable: state_name <chr>

When we print post_map, you’ll notice a lot of information that we didn’t have before. It looks like we have longitude and latitude, some other visual helpers for mapping, as well as helpful state abbreviations! We retain our other data (most importantly, our mean_post stat).

Believe it or not, it’s now time to map. The code for creating maps can be daunting, and I’ll present a big chunk to you with annotations. You should feel free to try this code with lines removed to see how the map would appear differently!

# Plotting 
# Code adapted from: 
# https://www.cgoodman.com/blog/archives/2018/06/16/maps-in-r-using-urbnmapr/
# I also retain some of Chris' annotations here for clarity.

ggplot() +
  
  # state map, filled by our statistic
  geom_polygon(data = post_map,
               mapping = aes(x = long, y = lat,
                             group = group,
                             
                             # I want the color 
                             # that each state
                             # has to depend
                             # on mean_post!
                             fill = mean_post)) +
 
   # add state outlines using urbnmapr
  geom_polygon(data = urbnmapr::states,
               # when we write urbnmapr::states
               # we are telling R to specifically
               # use the urbnmapr package to pull up
               # states. Some coders like using this notation
               # consistently ---whenever they use a function
               # they tell R which package it comes from.
               # I don't do that, but if your code is
               # breaking and you don't know why, try
               # calling on the package explicitly. Sometimes
               # packages have functions that are named the
               # same thing, which confuses R!
               
               mapping = aes(long, lat,group = group),
               fill = NA, color = "#ffffff", size = 0.4) +
  
  # projection
  coord_map(projection = "polyconic")+
  
  # I want to create a scale that goes from low to
  # high, where the color of the high value connotes
  # that things are good. In the United States,
  # the color green is often the "good" or "everything's
  # fine" color, so I'll use a version of that.
  
  # You can find an R Color Cheat Sheet here by Melanie Frazier: 
  # https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf
  scale_fill_gradient(low = "white", 
                      high = "seagreen4", 
                      
                      # I can define limits explicitly if I want.
                      # If you do this, always check that you are not
                       # accidentally excluding data! (I made sure all
                       # my data fit in this range reasonably.)
                      limits = c(0.75, 0.95), 
                      
                      # Right now my code is in decimals.
                      # We can use a trick we learned last tutorial
                      # to make the labels into integer percentages.
                      # Try changing the accuracy = 5L argument
                      # to learn what it does!
                      labels = percent_format(accuracy = 5L)) + 
  # Theming
  theme_minimal(base_family = "Open Sans Condensed Light")+
  theme(
    legend.position = "right",
    legend.text.align = 0,
    legend.title = element_blank(),
    plot.margin = unit(c(.5,.5,.2,.5), "cm")) +
    theme(
    axis.line = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
  ) +
  
  # some useful titles!
  labs(x = "",
       y = "",
       title = "% of Respondents Who Took the Post Election Survey",
       caption = "Author: Pia Deshpande, \n Data: 2020 Cooperative Election Study")

And ta da! A map! It seems like respondents don’t take the post-election survey as frequently in Mississippi, Arkansas, Oklahoma, Alaska, and Texas, among others. On the other hand, New Mexico and New Hampshire seem to be avid survey takers.

There are some good things about the map, and some bad things.

Let’s start with the bad. All maps take detailed data and represent it in a less detailed way so the result stays intelligible. I’m not sure exactly what percentage of Texans took the post-election survey, and to find out, I would need to pull up that table we created earlier! The color scale I picked gives me some idea of how states are doing compared to each other, but not how they are doing compared to some norm. What if I wanted to see which states were below the CES’ national average? This would not be the plot to use! Another bad thing is endemic to this type of map: I’ve talked about Texas twice now. Why? Well, it’s big on the map, and my eye wants to pay attention to it. The Northeast is completely dwarfed, which makes it hard to see trends there.

Okay, but are there good things? Yes! Using state geography has its benefits. Your readers will likely be familiar with this type of map. They’ll know where to glance for their home state, and will (hopefully) be able to locate others. My color scale progresses from white to green, which gives the effect of states with a lower response rate being kinda transparent. That’s neat! Conceptually, it makes sense that Oklahoma is fading out compared to New Mexico, which took the post-election survey at a much higher rate.

We shouldn’t just settle for this map as is! Some problems can’t be solved with maps at all (not being able to see the data exactly), and some are baked into choropleths (state geographies), but others can be solved! Let’s make a map that will help me tell the world which states are doing above, below, and around average on taking the post-election survey. I’ll do this by changing the color scale.
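
Before plotting, it’s worth checking the national average that will anchor the new scale’s midpoint. It’s a one-liner:

# the share of all CES respondents who took the post-election
# survey; this will be the stark white middle of the new scale
mean(ces2020_selected$tookpost_recoded)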

# Plotting (Code adapted from https://www.cgoodman.com/blog/archives/2018/06/16/maps-in-r-using-urbnmapr/)

ggplot() +
  # State map
  geom_polygon(data = post_map,
               mapping = aes(x = long, y = lat,
                             group = group,
                             fill = mean_post)) +
  # Add state outlines
  geom_polygon(data = urbnmapr::states,
               mapping = aes(long, lat,group = group),
               fill = NA, color = "#ffffff", size = 0.4) +
  # Projection
  coord_map(projection = "polyconic")+
  
# scale_fill_gradient2 helps us create *diverging* color scales!
# I define red as a low point, white as the middle, and green
# as the high point. 
scale_fill_gradient2(
  low = "red",
  mid = "white",
  high = "seagreen4",
  
  # I define the midpoint explicitly here! 
  # I calculate it not as the mean of all the 
  # state averages, but as the mean of all CES
  # respondents. That value will be where the
  # scale is a stark white.
  midpoint = mean(ces2020_selected$tookpost_recoded),
  limits = c(0.75, 0.95), 
  labels = percent_format(accuracy = 5L)
)+
  # Theming
  theme_minimal(base_family = "Open Sans Condensed Light")+
  theme(
    legend.position = "right",
    legend.text.align = 0,
    legend.title = element_blank(),
    plot.margin = unit(c(.5,.5,.2,.5), "cm")) +
    theme(
    axis.line = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
  ) + 
  labs(x = "",
       y = "",
       title = "% of Respondents Who Took the Post Election Survey",
       caption = "Author: Pia Deshpande, \n Data: 2020 Cooperative Election Study")

Well this looks different! States that are around average fade into the background (goodbye Oregon!), while outliers are bleeding and verdant. Mississippi really stands out here, and so does New Hampshire!

What about substantive maps?

Okay, but I need to write a paper by the end of the semester using survey questions to show my teacher how weighting works. How do I make a map for that?

Have no fear! This section will help you. Let’s map what percentage of Americans were contacted by a political campaign in 2020 and how that differs by state.

Making a substantive map means we need to use all of our normal data analysis tools. That is, when we calculate summary statistics, we should use survey weights and the survey package. It also means being well aware of sample size. For this analysis, if a state has fewer than 200 observations, I exclude it from my analysis. Since that state will still be on the map, we’ll have to decide how we depict it.

First, let’s select relevant variables from ces2020_dataverse, just like we did for the first set of maps!

# Let's try making a substantive map: what percentage of respondents were contacted
# by a political campaign in 2020?

# Let's select variables that we care about!
ces2020_substantive <- ces2020_dataverse %>%
  select(
    
    # id for each respondent
    caseid, 
    
    # weight for survey analysis!
    commonpostweight, 
    
    # state respondent is registered in
    state_fips_rough = inputstate_post, 
    
    # were you contacted by a political campaign in 2020?
    CC20_431a) %>%
  
  mutate(
    contact = case_when(
      
      # I am recoding this so a "Yes" is a 1
      # and a "No" is a 0.
      CC20_431a == 1 ~ 1,
      CC20_431a == 2 ~ 0
    ),
    state_fips = as.character(sprintf("%02d", state_fips_rough))
  ) %>%
  
  # We don't want to include someone if they did not take the post-election survey
  drop_na(commonpostweight)

Just like last time, we should look at state samples. I’m not comfortable making an inference about a state if fewer than 200 people were polled there. Let’s find which states we should cull from our analysis.

# Check sample sizes of each state
sample_cutoff <- ces2020_substantive %>%
  group_by(state_fips) %>%
  summarise(n = n()) %>%
  
  # only give me states with fewer than 200 responses
  filter(n < 200)

sample_cutoff
## # A tibble: 8 x 2
##   state_fips     n
##   <chr>      <int>
## 1 02            89
## 2 11           154
## 3 15           178
## 4 38           138
## 5 44           173
## 6 46           147
## 7 50           134
## 8 56            86

This means we are going to have to exclude the states with FIPS codes 56, 02, 50, 38, 46, 11, 44, and 15. Looking at the CES guide, these are Wyoming, Alaska, Vermont, North Dakota, South Dakota, the District of Columbia (which is not a state right now!), Rhode Island, and Hawaii, respectively. Sad to see them go, but it’s better than making unsound claims!
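
If you don’t feel like digging through the guide, you can decode the FIPS codes with the same urbnmapr table we’ve been joining against. A quick sketch:

# match each excluded FIPS code to a state name
urbnmapr::states %>%
  distinct(state_fips, state_name) %>%
  filter(state_fips %in% sample_cutoff$state_fips)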

Now to calculate some statistics. Since I went through the survey package in more detail in my previous tutorial, I won’t repeat that material here.

survey <- svydesign(ids = ~0, 
                    data = ces2020_substantive, 
                    weights = ~commonpostweight)

contact_state <- as.data.frame(svyby(~contact, 
                                     ~state_fips, 
                                     survey, 
                                     svymean, 
                                     na.rm = TRUE)) %>%
  
  # I'm telling R to keep all state_fips that were not in our
  # table of states with fewer than 200 entries!
  filter(!(state_fips %in% sample_cutoff$state_fips))

# Join with urbnmapr
contact_map <- left_join(contact_state, 
                         states, by = "state_fips") %>%
  
  
  # Spoiler! This step will let us make a tile map later on. We're specifying the dimensions of the 
  # tile here. I want my tiles to be square, but you can change this as you see fit.
  mutate(xdimension = 1, 
         ydimension = 1) 

Now to plot! Let’s first try to map a standard U.S. state geography with a sequential scale (low to high, with no diverging midpoint). There will be fewer comments this time around!

ggplot() +
  # State map
  geom_polygon(data = contact_map,
               mapping = aes(x = long, y = lat,
                             group = group,
                             
                             # our variable of interest!
                             fill = contact)) +
  # Add state outlines
  geom_polygon(data = urbnmapr::states,
               mapping = aes(long, lat,group = group),
               
               # with the color argument, I am giving
               # states a grey outline!
               fill = NA, color = "grey30", size = 0.4) +
  # Projection
  coord_map(projection = "polyconic")+
  
  # purple seems like a bipartisan color?
  scale_fill_gradient(low = "plum", 
                      high = "mediumpurple4", 
                      labels = percent_format(accuracy = 5L),
                      limits = c(0.35, 0.75),
                      
                      # This blank "" means I do not want my 
                      # legend to have a title.
                      "") + 
  
  # Theming
  theme_minimal(base_family = "Open Sans Condensed Light")+
  theme(
    legend.position = "right",
    legend.text.align = 0,
    plot.margin = unit(c(.5,.5,.2,.5), "cm")) +
  theme(
    axis.line = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
  ) + 
  theme(plot.title=element_text(family="Open Sans Condensed Bold", margin=margin(b=15)))+
  theme(plot.subtitle=element_text(family="Open Sans Condensed Light Italic"))+
  theme(plot.margin=unit(rep(0.5, 4), "cm"))+
  labs(x = "",
       y = "",
       title = "% of Respondents Contacted by a Political Campaign in 2020",
       caption = "Author: Pia Deshpande, Data: 2020 Cooperative Election Study")

There she is! The states that are missing are in stark white (notice how I did not have my scale start with white?). We see some states like Montana with a pretty high campaign contact rate, and some states like Louisiana with a pretty low one. However, there’s a problem: we used the survey package to calculate these values, and they have standard errors. There’s no good way to represent standard errors in our map as it currently exists!

Why are we worried about standard errors here? I was suspiciously silent about them with our first map. That’s because our first map was depicting something about CES respondents, which means our sample was the same as our population of interest (an incredible and rare thing). But when we analyze substantive questions, we’re usually trying to use the survey as a proxy for how a certain population behaves (in this case, the American public), which means our estimates have uncertainty to them. People who make maps and use them in their analysis are aware of this difficulty, and there are ways to overcome it that I won’t go into in this tutorial, mostly because I am not equipped to teach you something I do not know! Penn State’s Department of Geography has a good writeup on the topic of mapping uncertainty.
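
You can at least inspect that uncertainty in your console, even if the map won’t show it. svyby returns standard errors alongside its estimates (in a column it names se by default), so here’s a quick sketch to see which state estimates are shakiest:

# sort states by the standard error of their contact estimate
contact_state %>%
  arrange(desc(se)) %>%
  head()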

But for our tutorial, there is one problem I might try to solve: state geography. Shy of redrawing all the state lines, I think we should make a tile map. In a tile map, all states are represented by squares of the same size, so readers will weigh them equally as they visually process them.

# The following code is adapted from the Urban Institute's Tutorial
# https://urbaninstitute.github.io/r-at-urban/mapping.html#geom_tile()

# create a custom geofacet grid
# This was constructed by the Urban Institute! I am using it here.
# It tells R how to draw the grid, and how to name each square.
# I am grateful someone else wrote this and not me!

urban_grid <- tibble(
  row = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
          4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 
          7, 7, 8, 8, 8),
  col = c(1, 11, 6, 10, 11, 1, 2, 3, 4, 5, 6, 7, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 1, 4, 9),
  code = c("AK", "ME", "WI", "VT", "NH", "WA", "ID", "MT", "ND", "MN", "IL", "MI", "NY", "MA", "OR", "NV", "WY", "SD", "IA", "IN", "OH", "PA", "NJ", "CT", "RI", "CA", "UT", "CO", "NE", "MO", "KY", "WV", "VA", "MD", "DE", "AZ", "NM", "KS", "AR", "TN", "NC", "SC", "DC", "OK", "LA", "MS", "AL", "GA", "HI", "TX", "FL"),
  name = c("Alaska", "Maine", "Wisconsin", "Vermont", "New Hampshire", "Washington", "Idaho", "Montana", "North Dakota", "Minnesota", "Illinois", "Michigan", "New York", "Massachusetts", "Oregon", "Nevada", "Wyoming", "South Dakota", "Iowa", "Indiana", "Ohio", "Pennsylvania", "New Jersey", "Connecticut", "Rhode Island", "California", "Utah", "Colorado", "Nebraska", "Missouri", "Kentucky", "West Virginia", "Virginia", "Maryland", "Delaware", "Arizona", "New Mexico", "Kansas", "Arkansas", "Tennessee", " North Carolina", "South Carolina", " District of Columbia", "Oklahoma", "Louisiana", "Mississippi", "Alabama", "Georgia", "Hawaii", "Texas", "Florida")
)


contact_map %>%
  
  # remember when I defined xdimension and ydimension a while ago?
  # It's coming in handy here!
  ggplot(aes(x = xdimension, y = ydimension, fill = contact)) +

  # We're making a tile map!
  geom_tile() +
  
  # I am defining some display text here. I want
  # to print the value of "contact," which is the 
  # estimated percentage of Americans in a certain
  # state who were contacted by a political campaign!
  
  # However, this number is a decimal. Another way
  # of formatting percentages is by multiplying them by 100
  # and using the round function (I wanted no decimal places).
  # I then use the paste0 function to add a pretty percentage sign!
  geom_text(aes(label = paste0(round(contact*100,0), "%")),
            
            # this makes the text white!
            color = "white") +

  # Using our grid from before. Thanks again to the Urban Institute!
  # I also want to facet by state_abbv (that is, make a new
  # tile for each state abbreviation).
  facet_geo(facets = ~state_abbv, grid = urban_grid) +
  labs(title = "Percentage of Americans Contacted by a Political Campaign in 2020",
       subtitle = "Adapted from Code from the Urban Institute",
       caption = "Graph by Pia Deshpande \n Data from the 2020 CES",
       x = NULL,
       y = NULL) +
  scale_fill_gradient(
    
  # Purple all the way down!
  low = "plum",
  high = "mediumpurple4",
  
  # No legend title please
  ""
)+
  theme(plot.background = element_rect(colour = "white"),
        panel.grid = element_blank(),
        panel.grid.major = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.line = element_blank(),
        panel.spacing = unit(0L, "pt"),
        legend.position = "none",
        strip.text.x = element_text(size = 9L))

Okay! We have it! A graph-table-color thing! An abomination of nature that does fix some of our problems, but adds new ones. Readers are no longer going to see Texas as more important than Connecticut because of state size, and they can now look at the exact percentage. Data we excluded is also easy to spot! However, because this chart provides more information, it asks viewers to spend more time looking at it. This is a complicated figure—and you could argue that the color fill behind the text label doesn’t do much.
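
If you agree that the fill isn’t earning its keep, a stripped-down variant is a quick experiment. This is just a sketch reusing the objects from above (neutral tiles, darker text, no theming):

# a fill-less variant of the tile map
ggplot(contact_map, aes(x = xdimension, y = ydimension)) +
  geom_tile(fill = "grey90") +
  geom_text(aes(label = paste0(round(contact * 100, 0), "%")),
            color = "grey20") +
  facet_geo(facets = ~state_abbv, grid = urban_grid)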

All maps are trade-offs. In fact, even using a map is a pretty important choice! How should you decide?

Okay, when should I use a map?

  • I am genuinely interested in answering a geographic question, and a map would help.
  • I have thought carefully about the geography I am using and whether it is appropriate. (For example, if you are interested in studying different levels of property tax, state geography will be too broad for you; most property taxes are decided at the local level.)
  • I have thought about the trade-offs of using maps and selected the best type of map. Tile maps are great when you don’t want the size of states, countries, or territories to make readers weigh larger geographic regions more heavily than smaller ones. On the other hand, state geography is recognizable, and can help people interpret your results. You have to make some important choices!
  • I have thought about mapping uncertainty and am either comfortable with not doing it (see our first slew of maps) or have determined how I will signal uncertainty to readers.

That brings us to the end of our tutorial. As always, the whole script is below (slightly rearranged so it will hopefully run smoothly on your computer).

The whole script

knitr::opts_chunk$set(echo=TRUE, warning=FALSE, message=FALSE)
# Setting up my RmD file

library(knitr)
# A package for knitting things together!
library(htmlwidgets)
# A package to save leaflet HTML so we can render these maps in Jekyll. If you are just looking at your maps on your local PC, you won't need to do this.
# A special thanks to Rob Williams for his tutorial on how to do this: https://jayrobwilliams.com/posts/2020/09/jekyll-html

#### LOAD PACKAGES ####
library(dataverse) # loading data
library(urbnmapr)  # geographic information for mapping!
library(ggplot2)   # pretty plots
library(geofacet)  # tile plots
library(scales)    # easy formatting
library(dplyr)     # data manipulation
library(survey)    # survey analysis
library(tidyverse) # the holy grail: more data wrangling

library(leaflet)   # optional: for interactive maps!
library(here)

require("devtools")
devtools::install_github("UrbanInstitute/urbnmapr")
##### LOADING DATA ####
ces2020_dataverse <- get_dataframe_by_name(
  filename = "CES20_Common_OUTPUT_vv.dta",
  dataset = "10.7910/DVN/E9N6PH",
  original = TRUE,
  .f = haven::read_dta,
  server = "dataverse.harvard.edu"
)
# Let's select variables that we care about!
ces2020_selected <- ces2020_dataverse %>%
  select(
    
    # id for each respondent
    caseid, 
    
    # did you take the post-election survey?
    tookpost, 
    
    # state respondent is living in.
    # In this code, I am both selecting
    # the inputstate column and renaming
    # it to state_fips_rough, all in
    # one line!
    state_fips_rough = inputstate) %>%
  mutate(
    
    # tookpost is originally coded so 2 = Yes
    # and 1 = No. I am recoding it so Yes = 1
    # and No = 0. Why? When I do calculations, 
    # the average of tookpost will be a proportion
    # between 0 and 1 that I can represent as a 
    # percent (ex. "85% of respondents took the
    # post-election survey").
    tookpost_recoded = case_when(
      tookpost == 1 ~ 0,
      tookpost == 2 ~ 1
    ),
    
    # why am I changing state_fips_rough? Like
    # most coding, things were breaking which inspired
    # me to get creative. Try running this chunk without
    # this line and look at ces2020_selected. See anything
    # weird? Check before you read on!
    state_fips = as.character(sprintf("%02d", state_fips_rough))
    
    # Okay, now that you've looked, I'll spill the beans. 
    # State FIPS codes are always two digits, like 15, 50, 
    # and 02. R was reading in single digits like 02 as
    # just 2, which is a problem later on. To fix this
    # I used the function sprintf to tell R to add
    # leading zeroes to state_fips_rough until there
    # were two digits per entry. So a value of 4 would
    # turn into 04, but 36 would stay 36! I can't spend
    # more time on this now, but if people are interested
    # in this kind of data manipulation, I am happy to write
    # about it!
  )

# hey R, for the following, use ces2020_selected please!
ces2020_selected %>%
  
  # whatever calculation I tell you to do, please
  # do it grouped--- that is, only within entries
  # with the same state FIPS code.
  group_by(state_fips) %>%
  
  # could you tell me how many entries I have per
  # state?
  summarise(n = n())


# you'll notice that the whole table does not print here.
# To see the entire thing, print it in your R console
post_aggregate <- ces2020_selected %>%
  group_by(state_fips) %>%
  
  # give me the mean of tookpost_recoded 
  # for each state, and make that a new
  # variable called mean_post
  summarise(mean_post = mean(tookpost_recoded))

# let's take a look!
post_aggregate
post_map <- left_join(post_aggregate, 
                      states, by = "state_fips")

post_map
# Plotting 
# Code adapted from: 
# https://www.cgoodman.com/blog/archives/2018/06/16/maps-in-r-using-urbnmapr/
# I also retain some of Chris' annotations here for clarity.

ggplot() +
  
  # state map, filled by our statistic
  geom_polygon(data = post_map,
               mapping = aes(x = long, y = lat,
                             group = group,
                             
                             # I want the color 
                             # that each state
                             # has to depend
                             # on mean_post!
                             fill = mean_post)) +
 
   # add state outlines using urbnmapr
  geom_polygon(data = urbnmapr::states,
               # when we write urbnmapr::states
               # we are telling R to specifically
               # use the urbnmapr package to pull up
               # states. Some coders like using this notation
               # consistently ---whenever they use a function
               # they tell R which package it comes from.
               # I don't do that, but if your code is
               # breaking and you don't know why, try
               # calling on the package explicitly. Sometimes
               # packages have functions that are named the
               # same thing, which confuses R!
               
               mapping = aes(long, lat,group = group),
               fill = NA, color = "#ffffff", size = 0.4) +
  
  # projection
  coord_map(projection = "polyconic")+
  
  # I want to create a scale that goes from low to
  # high, where the color of the high value connotes
  # that things are good. In the United States,
  # the color green is often the "good" or "everything's
  # fine" color, so I'll use a version of that.
  
  # You can find an R Color Cheat Sheet here by Melanie Frazier: 
  # https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf
  scale_fill_gradient(low = "white", 
                      high = "seagreen4", 
                      
                      # I can define limits explicitly if I want.
                      # If you do this, always check that you are not
                       # accidentally excluding data! (I made sure all
                       # my data fit in this range reasonably.)
                      limits = c(0.75, 0.95), 
                      
                      # Right now my code is in decimals.
                      # We can use a trick we learned last tutorial
                      # to make the labels into integer percentages.
                      # Try changing the accuracy = 5L argument
                      # to learn what it does!
                      labels = percent_format(accuracy = 5L)) + 
  # Theming
  theme_minimal(base_family = "Open Sans Condensed Light")+
  theme(
    legend.position = "right",
    legend.text.align = 0,
    legend.title = element_blank(),
    plot.margin = unit(c(.5,.5,.2,.5), "cm")) +
    theme(
    axis.line = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
  ) +
  
  # some useful titles!
  labs(x = "",
       y = "",
       title = "% of Respondents Who Took the Post Election Survey",
       caption = "Author: Pia Deshpande, \n Data: 2020 Cooperative Election Study")

# Plotting (Code adapted from https://www.cgoodman.com/blog/archives/2018/06/16/maps-in-r-using-urbnmapr/)

ggplot() +
  # State map
  geom_polygon(data = post_map,
               mapping = aes(x = long, y = lat,
                             group = group,
                             fill = mean_post)) +
  # Add state outlines
  geom_polygon(data = urbnmapr::states,
               mapping = aes(long, lat,group = group),
               fill = NA, color = "#ffffff", size = 0.4) +
  # Projection
  coord_map(projection = "polyconic")+
  
# scale_fill_gradient2 helps us create *diverging* color scales!
# I define red as a low point, white as the middle, and green
# as the high point. 
scale_fill_gradient2(
  low = "red",
  mid = "white",
  high = "seagreen4",
  
  # I define the midpoint explicitly here! 
  # I calculate it not as the mean of all the 
  # state averages, but as the mean of all CES
  # respondents. That value will be where the
  # scale is a stark white.
  midpoint = mean(ces2020_selected$tookpost_recoded),
  limits = c(0.75, 0.95), 
  labels = percent_format(accuracy = 5L)
)+
  # Theming
  theme_minimal(base_family = "Open Sans Condensed Light")+
  theme(
    legend.position = "right",
    legend.text.align = 0,
    legend.title = element_blank(),
    plot.margin = unit(c(.5,.5,.2,.5), "cm")) +
    theme(
    axis.line = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
  ) + 
  labs(x = "",
       y = "",
       title = "% of Respondents Who Took the Post Election Survey",
       caption = "Author: Pia Deshpande, \n Data: 2020 Cooperative Election Study")

# Let's try making a substantive map: what percentage of respondents were contacted
# by a political campaign in 2020?

# Let's select variables that we care about!
ces2020_substantive <- ces2020_dataverse %>%
  select(
    
    # id for each respondent
    caseid, 
    
    # weight for survey analysis!
    commonpostweight, 
    
    # state respondent is registered in
    state_fips_rough = inputstate_post, 
    
    # were you contacted by a political campaign in 2020?
    CC20_431a) %>%
  
  mutate(
    contact = case_when(
      
      # I am recoding this so a "Yes" is a 1
      # and a "No" is a 0.
      CC20_431a == 1 ~ 1,
      CC20_431a == 2 ~ 0
    ),
    state_fips = as.character(sprintf("%02d", state_fips_rough))
  ) %>%
  
  # We don't want to include someone if they did not take the post-election survey
  drop_na(commonpostweight)


# Check sample sizes of each state
sample_cutoff <- ces2020_substantive %>%
  group_by(state_fips) %>%
  summarise(n = n()) %>%
  
  # only give me states with fewer than 200 responses
  filter(n < 200)

sample_cutoff

survey <- svydesign(ids = ~0, 
                    data = ces2020_substantive, 
                    weights = ~commonpostweight)

contact_state <- as.data.frame(svyby(~contact, 
                                     ~state_fips, 
                                     survey, 
                                     svymean, 
                                     na.rm = TRUE)) %>%
  
  # I'm telling R to keep all state_fips that were not in our
  # table of states with fewer than 200 entries!
  filter(!(state_fips %in% sample_cutoff$state_fips))

# Join with urbnmapr
contact_map <- left_join(contact_state, 
                         states, by = "state_fips") %>%
  
  
  # Spoiler! This step will let us make a tile map later on. We're specifying the dimensions of the 
  # tile here. I want my tiles to be square, but you can change this as you see fit.
  mutate(xdimension = 1, 
         ydimension = 1) 
ggplot() +
  # State map
  geom_polygon(data = contact_map,
               mapping = aes(x = long, y = lat,
                             group = group,
                             
                             # our variable of interest!
                             fill = contact)) +
  # Add state outlines
  geom_polygon(data = urbnmapr::states,
               mapping = aes(long, lat,group = group),
               
               # with the color argument, I am giving
               # states a grey outline!
               fill = NA, color = "grey30", size = 0.4) +
  # Projection
  coord_map(projection = "polyconic")+
  
  # purple seems like a bipartisan color?
  scale_fill_gradient(low = "plum", 
                      high = "mediumpurple4", 
                      labels = percent_format(accuracy = 5L),
                      limits = c(0.35, 0.75),
                      
                      # This blank "" means I do not want my 
                      # legend to have a title.
                      "") + 
  
  # Theming
  theme_minimal(base_family = "Open Sans Condensed Light")+
  theme(
    legend.position = "right",
    legend.text.align = 0,
    plot.margin = unit(c(.5,.5,.2,.5), "cm")) +
  theme(
    axis.line = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
  ) + 
  theme(plot.title=element_text(family="Open Sans Condensed Bold", margin=margin(b=15)))+
  theme(plot.subtitle=element_text(family="Open Sans Condensed Light Italic"))+
  theme(plot.margin=unit(rep(0.5, 4), "cm"))+
  labs(x = "",
       y = "",
       title = "% of Respondents Contacted by a Political Campaign in 2020",
       caption = "Author: Pia Deshpande, Data: 2020 Cooperative Election Study")
# The following code is adapted from the Urban Institute's Tutorial
# https://urbaninstitute.github.io/r-at-urban/mapping.html#geom_tile()

# create a custom geofacet grid
# This was constructed by the Urban Institute! I am using it here.
# It tells R how to draw the grid, and how to name each square.
# I am grateful someone else wrote this and not me!

urban_grid <- tibble(
  row = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
          4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 
          7, 7, 8, 8, 8),
  col = c(1, 11, 6, 10, 11, 1, 2, 3, 4, 5, 6, 7, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, 6, 7, 8, 9, 4, 5, 6, 7, 8, 1, 4, 9),
  code = c("AK", "ME", "WI", "VT", "NH", "WA", "ID", "MT", "ND", "MN", "IL", "MI", "NY", "MA", "OR", "NV", "WY", "SD", "IA", "IN", "OH", "PA", "NJ", "CT", "RI", "CA", "UT", "CO", "NE", "MO", "KY", "WV", "VA", "MD", "DE", "AZ", "NM", "KS", "AR", "TN", "NC", "SC", "DC", "OK", "LA", "MS", "AL", "GA", "HI", "TX", "FL"),
  name = c("Alaska", "Maine", "Wisconsin", "Vermont", "New Hampshire", "Washington", "Idaho", "Montana", "North Dakota", "Minnesota", "Illinois", "Michigan", "New York", "Massachusetts", "Oregon", "Nevada", "Wyoming", "South Dakota", "Iowa", "Indiana", "Ohio", "Pennsylvania", "New Jersey", "Connecticut", "Rhode Island", "California", "Utah", "Colorado", "Nebraska", "Missouri", "Kentucky", "West Virginia", "Virginia", "Maryland", "Delaware", "Arizona", "New Mexico", "Kansas", "Arkansas", "Tennessee", " North Carolina", "South Carolina", " District of Columbia", "Oklahoma", "Louisiana", "Mississippi", "Alabama", "Georgia", "Hawaii", "Texas", "Florida")
)


contact_map %>%
  
  # remember when I defined xdimension and ydimension a while ago?
  # It's coming in handy here!
  ggplot(aes(x = xdimension, y = ydimension, fill = contact)) +

  # We're making a tile map!
  geom_tile() +
  
  # I am defining some display text here. I want
  # to print the value of "contact," which is the 
  # estimated percentage of Americans in a certain
  # state who were contacted by a political campaign!
  
  # However, this number is a decimal. Another way
  # of formatting percentages is by multiplying them by 100
  # and using the round function (I wanted no decimal places).
  # I then use the paste0 function to add a pretty percentage sign!
  geom_text(aes(label = paste0(round(contact*100,0), "%")),
            
            # this makes the text white!
            color = "white") +

  # Using our grid from before. Thanks again to the Urban Institute!
  # I also want to facet by state_abbv (that is, make a new
  # tile for each state abbreviation).
  facet_geo(facets = ~state_abbv, grid = urban_grid) +
  labs(title = "Percentage of Americans Contacted by a Political Campaign in 2020",
       subtitle = "Adapted from Code from the Urban Institute",
       caption = "Graph by Pia Deshpande \n Data from the 2020 CES",
       x = NULL,
       y = NULL) +
  scale_fill_gradient(
    
  # Purple all the way down!
  low = "plum",
  high = "mediumpurple4",
  
  # No legend title please
  ""
)+
  theme(plot.background = element_rect(colour = "white"),
        panel.grid = element_blank(),
        panel.grid.major = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.line = element_blank(),
        panel.spacing = unit(0L, "pt"),
        legend.position = "none",
        strip.text.x = element_text(size = 9L))