Plotting trends over time with the CES
Published:
What you’ll learn
- How to use parts of the
survey
package (specificallysvyby
) - How to use the
dataverse
package to pull data - How to use
ggplot2
formatting to create a line chart
Sounds like a lot, but by the end of this you’ll be able to make some pretty interesting graphs with good representations of uncertainty. Who knows what other cool things you’ll find in our data?
It is also important to note what you won’t learn in this tutorial. You won’t be walked through how to combine CES data from multiple years, and the more complex data transformations sometimes required to make comparisons accross time. This tutorial focuses on starting with combined data and getting a useful plot! Others focused on data manipulation are incoming — please let us know what you’d like to see!
But before we get ahead of ourselves — packages! You’ll need the following to execute my code, which you can install with install.packages("name")
. So, if you want to install the survey
package, you would type install.packages("survey")
. You’ll then need to load them with library, like I do here:
#### LOAD PACKAGES ####
library(haven) # loading in DTA data
library(dplyr) # data transformation and column mutations
library(tidyverse) # drop_na function
library(dataverse) # pull data from Dataverse
library(survey) # use survey weights
Getting the data
Obtaining data is sometimes the toughest part of doing social science research. Luckily, the CES has you covered. If you’re using CES data, you can load in all relevant datasets with the dataverse
package in R. There is also a cumulative CES dataset created by the Shiro Kuriwaki and cumulative dataset on policy preferences by Angelo Dagonel.
For this tutorial, we will be using a cumulative file on political participation that I created with CES data. It covers the years 2008 to 2020. Data files can often be large and unwieldly, so instead of downloading onto my computer, I pull the data from Dataverse.
##### LOADING DATA ####
<- get_dataframe_by_name(
ces_participation filename = "ces_participation.tab",
dataset = "10.7910/DVN/JUX8KA",
original = TRUE,
.f = haven::read_dta,
server = "dataverse.harvard.edu"
)
All you need to pull data is the file’s name, its dataset identifier, the server you are pulling from, and instructions on how to pull the data. You can find all of this at the file’s Dataverse link. Here is the link for ces_participation project. Once you click on the link, click again on the data file you’re interested in pulling (projects often have more than one data file!). There, under the tab “Metadata,” you’ll find the information we used to fill in the above command. Here is the link to the specific file.
Now, we’ve located the file. . .but how do we know what to put in the .f argument? That argument is asking us what package and function we want to use to pull in the data. Since we are pulling in a dta file, we use haven::read_dta
. You may notice that the file actually has a .tab extension. This is because Dataverse stores large dta files as tab files, but the dataverse package recognizes its original format if you write original = TRUE,
and will pull the dta file.
Using survey weights
Weights are extremely important to most survey analysis. Though weighting algorithms can be complex, the reason for weights is simple: the people who answer our survey can be, in sum, quite different from the general population we are interested in. To fix this, we use weighting. I won’t go into more detail here. For a more detailed overview of weighting options out there, see Andrew Mercer, Arnold Lau, and Courtney Kennedy’s writeup for Pew on the topic.
Doing analysis over multiple years can sometimes make weighting difficult. This is a basic tutorial, so I will not be using any time-series adjustments on our data, and we will be using the weights available. For 2016 onwards, the CES has pre and post-election weights available, and we will use the post-election ones. For previous years, one weight (caseweight) is provided, and is used. All of these weights have been combined into one column I called weights
for this project.
To do anything with our weights, we first want to make a survey object! We can do this using the svydesign
function, specifying the dataframe name and the name of the weights column:
<- svydesign(ids = ~0,
survey data = ces_participation,
weights = ~weight)
A survey object will do all the complicated math for us if we treat it correctly. For example, I’m interested in looking at what percentage of respondents attended a political meeting in the past year, so what I really want is a crosstab by pol_meet_recode
and year
.
# Plot participation over time
<- as.data.frame(svyby(~pol_meet_recode,
pol_meet ~year,
survey, na.rm = TRUE)) svymean,
Let’s take a look at what we found.
# Plot participation over time
pol_meet
## year pol_meet_recode se
## 2008 2008 0.13610472 0.002397565
## 2010 2010 0.14410290 0.002162191
## 2012 2012 0.11841966 0.002170678
## 2014 2014 0.11515314 0.002081936
## 2016 2016 0.10642031 0.002050197
## 2018 2018 0.12255463 0.002096260
## 2020 2020 0.07589015 0.001676549
Well, it seems like there’s a notable decline in local political meeting attendance in 2020. That makes sense! We were all locked in (or, we likely should have been if we weren’t an essential worker). The other differences we see are kind of small, and I haven’t even thought about dealing with those standard errors yet . . .
It’s at this point that I usually make a graph. If you are that special kind of genius that can calculate 95% confidence intervals in your head, I applaud you. I am not you, and I hope you stay for the fun plot aesthetics if nothing else. But before making a graph, what if I’m not just interested in overall trends? I’d like to know how these trends look different for Republicans and Democrats. Have no fear! We can do that with the survey
package too; we’ll just need to subset!
<- as.data.frame(svyby(~pol_meet_recode,
pol_meet_rep ~year,
subset(survey, pid3 == 2),
na.rm = T)) %>%
svymean,
# I use the mutate function from dplyr here to make
# a new column (party) and have it contain the value
# "Republican." This way, I'll be able to plot
# the differences between the two groups.
mutate(party = "Republican")
<- as.data.frame(svyby(~pol_meet_recode,
pol_meet_dem ~year,
subset(survey, pid3 == 1),
na.rm = T)) %>%
svymean, mutate(party = "Democrat")
I am not going to print the results of pol_meet_rep
or pol_meet_dem
here, but feel free to look for yourself! You can compare this to the graph we create.
Here ends our love affair with the survey
package (at least for this tutorial). For more on the package, please see this excellent tutorial by Zachary Hertz (a Tufts alum now getting his Master’s at UChicago). There is also an example of survey analysis written up by statistician and R wizard Thomas Lumley, who authored the survey
package. For STATA users (who no doubt are feeling left out during my tutorial), see the CES’ very own Brian Schaffner’s posts on survey analysis.
Astute readers may be confused by my decision to surround our weighted crosstabs with as.data.frame
, and they have every right to be. If you just need to look at the numbers on their own, there is no need to change a crosstab’s format to a data frame, but we need to mess with the table it produces, and the data frame format is best for such chicanery. (Try taking away as.data.frame
from the previous chunk of code and completing the tutorial that way — the error messages you see will likely be instructive).
The last thing we’ll need to do before plotting is combining our two dataframes — this way, information about Republicans and Democrats is in one easy place for ggplot2
.
<- pol_meet_rep %>%
pol_meet_party bind_rows(pol_meet_dem)
Plotting things
We can start by writing a command like the one we have below. It tells the package what dataset we want it to use (pol_meet_party) and defines its aesthetic arguments: aes(x=year, y=pol_meet_recode, group = party, color = party)
.
<- ggplot(pol_meet_party,
plot_party aes(x=year,
y=pol_meet_recode,
group = party,
color = party))
Drum roll please!
plot_party
Okay. . . so not much happened. It turns out we need to tell ggplot what to put on the graph! Let’s try again, but this time tell it that we want points and lines connecting them.
<- ggplot(pol_meet_party,
plot_party aes(x=year,
y=pol_meet_recode,
group = party,
color = party)) +
geom_line() +
geom_point()
plot_party
All right! So that’s better, but not exactly where we want to be. The numbers on the left are hard to read because they’re decimals and not percentages, and there is no good representation of uncertainty on this graph! We don’t want someone to read this without knowing that surveys are fallible, do we? Let’s try sprucing this up:
<- ggplot(pol_meet_party,
plot_party aes(x=year,
y=pol_meet_recode,
group = party,
color = party)) +
geom_line() +
geom_point()+
geom_errorbar(aes(ymin=pol_meet_recode-1.96*se,
ymax=pol_meet_recode+1.96*se),
width=.2,
position=position_dodge(0.05))
plot_party
This is looking a lot better. We added 95% confidence intervals by multiplying the standard errors by 1.96 (the relevant Z-score) and adding and subtracting them from our point estimates.
But there are still some problems. . .let’s do some final changes! This time, I’ll annotate what I’m doing in my code chunk, and we’ll do more steps at once.
<- ggplot(pol_meet_party,
plot_party aes(x=year,
y=pol_meet_recode,
group = party,
color = party)) +
geom_line() +
geom_point()+
geom_errorbar(aes(ymin=pol_meet_recode-1.96*se,
ymax=pol_meet_recode+1.96*se),
width=.2,
position=position_dodge(0.05)) +
theme_minimal() +
# This will get rid of that grey background
theme(plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 1,
face = "italic"),
legend.title = element_blank()) +
# This sets up some stylistic standards!
# It makes sure our title (if we have one)
# is justified and makes our caption italic.
# It also gets rid of the title for our legend
# (we don't need it--- people will recognize
# the party names).
scale_y_continuous(labels = scales::percent_format(accuracy = 1L),
limits = c(0,0.2)) +
# This fixes our y_axis problem! It will make
# our numbers percentages with percent_format,
# and I've used accuracy = 1L to get rid of decimal
# points--we don't want to make readers assume we
# can be more exact than is true.
#
# I also manually set y-axis limits with
# limits = c(0,0.2). Note that I can't write
# c(0,20), because our data is not in percentages
# (though this command has it display that way).
# Why do I change the y-axis limits?
# Sometimes zooming into a line chart can make
# small changes look massive.
ylab("% attended a local political meeting in the past year") +
# Y-axis title!
labs(caption = "Plot: Pia Deshpande \nData: CES") +
# A handy dandy citation caption! It'll be
# italicized because of what we did earlier.
# The \n mandates a line break.
scale_color_manual(values=c("blue", "red")) +
# Republicans should probably be red
# and Democrats should probably be blue!
ggtitle("")
# You often don't need an additional title!
# But you can always fill one in if you want.
plot_party
And finally. . . here’s the result! It seems like partisans are most motivated to attend local meetings when the opposing party has presidential power. Of course, we don’t exactly know why that is yet. Maybe you’ll fnd out!
If you’re new to ggplot, this might be overwhelming. That is okay. There are plenty of resources to help you get started, and you’ll learn more techniques as you make more plots. I am certainly not a ggplot2
expert, so I’ll cite some you can refer to. Selva Prabhakaran has written a comprehensive tutorial on plotting with the package. The open source book R for Data Science has a chapter on Data visualization that uses ggplot. Here is the offical tidyverse
page for ggplot2
, which has even more resources! It’s likely you may have different aesthetic desires than I do, and that is completely okay. Go wild! I’m excited to see the plots you make.
The whole script
Here is the entirety of my script all in one place! It has a few more plots that you can look at. Try to mess with the ggplot formatting provided to see how it alters the graphs produced.
#### LOAD PACKAGES ####
library(haven) # loading in DTA data
library(dplyr) # data transformation and column mutations
library(tidyverse) # drop_na function
library(dataverse) # pull data from Dataverse
library(survey) # use survey weights
##### LOADING DATA ####
<- get_dataframe_by_name(
ces_participation filename = "ces_participation.tab",
dataset = "10.7910/DVN/JUX8KA",
original = TRUE,
.f = haven::read_dta,
server = "dataverse.harvard.edu"
)
<- svydesign(ids = ~0, data = ces_participation, weights = ~weight)
survey
# pol_meet By Party
<- as.data.frame(svyby(~pol_meet_recode,
pol_meet ~year,
survey,
svymean, na.rm = TRUE))
<- as.data.frame(svyby(~pol_meet_recode,
pol_meet_rep ~year,
subset(survey, pid3 == 2),
svymean, na.rm = T)) %>%
mutate(party = "Republican")
<- as.data.frame(svyby(~pol_meet_recode,
pol_meet_dem ~year,
subset(survey, pid3 == 1),
svymean, na.rm = T)) %>%
mutate(party = "Democrat")
<- pol_meet_rep %>%
pol_meet_party bind_rows(pol_meet_dem)
<- ggplot(pol_meet_party,
plot_party aes(x=year,
y=pol_meet_recode,
group = party,
color = party)) +
geom_line() +
geom_point()+
geom_errorbar(aes(ymin=pol_meet_recode-1.96*se,
ymax=pol_meet_recode+1.96*se),
width=.2,
position=position_dodge(0.05)) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 1,
face = "italic"),
legend.title = element_blank()) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1L)) +
ylab("% attended a local political meeting in the past year") +
labs(caption = "Plot: Pia Deshpande \nData: CES") +
scale_color_manual(values=c("blue", "red")) +
ggtitle("")
# donate_candidate by Party
<- as.data.frame(svyby(~donate_candidate_recode,
donate_candidate ~year,
survey,
svymean, na.rm = TRUE))
<- as.data.frame(svyby(~donate_candidate_recode,
donate_candidate_rep ~year,
subset(survey, pid3 == 2),
na.rm = T)) %>%
svymean, mutate(party = "Republican")
<- as.data.frame(svyby(~donate_candidate_recode,
donate_candidate_dem ~year,
subset(survey, pid3 == 1),
svymean, na.rm = T)) %>%
mutate(party = "Democrat")
<- donate_candidate_rep %>%
donate_candidate_party bind_rows(donate_candidate_dem)
<- ggplot(donate_candidate_party,
donate_party aes(x=year, y=donate_candidate_recode,
group = party,
color = party)) +
geom_line() +
geom_point()+
geom_errorbar(aes(ymin=donate_candidate_recode-1.96*se,
ymax=donate_candidate_recode+1.96*se),
width=.2,
position=position_dodge(0.05)) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 1,
face = "italic"),
legend.title = element_blank()) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1L)) +
ylab("% donated to a candidate in the past year") +
labs(caption = "Plot: Pia Deshpande \nData: CES") +
scale_color_manual(values=c("blue", "red")) +
ggtitle("")
# work_candidate by Party
<- as.data.frame(svyby(~work_candidate_recode,
work_candidate ~year,
survey, na.rm = TRUE))
svymean,
<- as.data.frame(svyby(~work_candidate_recode,
work_candidate_rep ~year,
subset(survey, pid3 == 2),
na.rm = T)) %>%
svymean, mutate(party = "Republican")
<- as.data.frame(svyby(~work_candidate_recode,
work_candidate_dem ~year, subset(survey, pid3 == 1),
na.rm = T)) %>%
svymean, mutate(party = "Democrat")
<- work_candidate_rep %>%
work_candidate_party bind_rows(work_candidate_dem)
<- ggplot(work_candidate_party, aes(x=year,
work_party y=work_candidate_recode,
group = party,
color = party)) +
geom_line() +
geom_point()+
geom_errorbar(aes(ymin=work_candidate_recode-1.96*se,
ymax=work_candidate_recode+1.96*se), width=.2,
position=position_dodge(0.05)) + # these colors are not plotting right now
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 1,
face = "italic"),
legend.title = element_blank()) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1L))+
ylab("% worked for a candidate in the past year") +
labs(caption = "Plot: Pia Deshpande \nData: CES") +
scale_color_manual(values=c("blue", "red")) +
ggtitle("")
# put_sign by Party
<- as.data.frame(svyby(~put_sign_recode,
put_sign ~year,
survey, na.rm = TRUE))
svymean,
<- as.data.frame(svyby(~put_sign_recode,
put_sign_rep ~year,
subset(survey, pid3 == 2),
na.rm = T)) %>%
svymean, mutate(party = "Republican")
<- as.data.frame(svyby(~put_sign_recode,
put_sign_dem ~year,
subset(survey, pid3 == 1),
na.rm = T)) %>%
svymean, mutate(party = "Democrat")
<- put_sign_rep %>%
put_sign_party bind_rows(put_sign_dem)
<- ggplot(put_sign_party,
put_sign_party aes(x=year,
y=put_sign_recode,
group = party,
color = party)) +
geom_line() +
geom_point()+
geom_errorbar(aes(ymin=put_sign_recode-1.96*se,
ymax=put_sign_recode+1.96*se),
width=.2,
position=position_dodge(0.05)) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 1,
face = "italic"),
legend.title = element_blank()) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1L)) +
ylab("% put political sign up in the past year") +
labs(caption = "Plot: Pia Deshpande \nData: CES") +
scale_color_manual(values=c("blue", "red")) +
ggtitle("")