This is a breakdown of my thought process and the functions behind the R package cesR. The full paper for cesR is available from SocArXiv Papers, and the full code for this project can be found on my GitHub account.
Introduction
This past summer (2020, to give a date-mark) I had the privilege of working as a co-op student with Dr. Rohan Alexander in the Toronto Data Lab at U of T. While this wasn’t the project I had set out to complete, nor the one I had proposed, it ended up being the most rewarding.
The purpose of cesR is to make accessing Canadian Election Study (CES) datasets easier for R users. It follows and was inspired by work being done in the R community through packages such as opendatatoronto and Lahman. Packages like these matter to R users because they reduce the number of steps required to load data and make more data available from within R.
cesR does this through five functions: get_ces(), get_cescodes(), get_preview(), get_question(), and get_decon().
Functions
get_ces()
When called, the get_ces() function returns a requested CES survey as a data object and prints the associated citation and the URL of the survey dataset repository to the console. The function takes one argument, a character string. Inside the body of get_ces(), each valid survey code is associated with the download URL for that survey in a GitHub repository named ces_data. If the provided character string matches a member of the built-in vector ces_codes, the associated file is downloaded to a temporary file as a compressed .zip folder using the download.file() function from the utils R package. The compressed folder is then unzipped into the inst/extdata directory of the package directory using the unzip() function from the utils R package and read into R using either the read_dta() or read_sav() function from the haven R package, depending on the file extension of the downloaded file. The resulting data frame is then assigned as a data object in the global environment using the assign() function from the base R package. The downloaded file and file directory are then removed from the computer using the unlink() function from the base R package. Finally, the recommended citation for the requested survey dataset and the URL of the survey data storage location are printed in the console.
If the provided character string does not match a member of the built-in vector, the function stops with an error and the message Error in get_ces(): Warning: Code not in table is printed in the console.
Below is an excerpt of the get_ces() code, showing the branch for the ces2019_web survey.
# 'get_ces' function, uses one variable 'srvy'
get_ces <- function(srvy){
  # if 'srvy' is in 'ces_codes' vector
  if(srvy %in% ces_codes){
    # if 'srvy' is equal to 'ces2019_web'
    if(srvy == "ces2019_web"){
      # if the file does not exist
      if(!file.exists("inst/extdata/ces2019_web/ces2019_web.dta")){
        # assign download url
        cesfile <- "https://raw.github.com/hodgettsp/ces_data/master/extdata/CES2019-web.zip"
        # create temporary file name holder with extension .zip
        hldr <- tempfile(fileext = ".zip")
        # download the file from the url and assign temporary name
        download.file(cesfile, hldr, quiet = TRUE)
        # unzip the compressed folder to the given directory
        unzip(hldr, exdir = "inst/extdata/ces2019_web")
        # assign the data file to a globally available variable
        assign("ces2019_web", haven::read_dta(hldr), envir = .GlobalEnv)
        # remove the temporary file
        unlink(hldr, recursive = TRUE)
        # remove the download directory
        unlink("inst/extdata/ces2019_web", recursive = TRUE)
        # print citation and link
        cat(ref2019web)
      }
    }
  }
  else{
    # if the provided code is not in the 'ces_codes' vector then stop process and print this message
    stop("Warning: Code not in table.")
  }
}
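The error path described above can be tried without downloading anything. The following is only a minimal sketch: check_code() is a hypothetical helper that is not part of cesR, and the three-element ces_codes vector is a shortened stand-in for the package's built-in vector.

```r
# Shortened stand-in for the package's built-in vector (assumption)
ces_codes <- c("ces2019_web", "ces2019_phone", "ces2015_web")

# Hypothetical helper that validates a survey code the way get_ces() does
check_code <- function(srvy) {
  if (!(srvy %in% ces_codes)) {
    # stop() raises an error, even though the message text says "Warning"
    stop("Warning: Code not in table.")
  }
  invisible(TRUE)
}

# A valid code passes silently
check_code("ces2019_web")

# An invalid code raises an error, which tryCatch() can intercept
msg <- tryCatch(
  check_code("ces1900"),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Because stop() signals an error rather than a warning, a caller who wants to continue after a bad code needs something like tryCatch() around the call.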
get_cescodes()
To call a CES survey, the user needs the associated survey code. The get_cescodes() function prints these survey codes. It does not take any arguments. Instead, when the function is called, it prints to the console a data frame that contains the survey codes and their associated argument calls.
# get_cescodes function
# creates two vectors: the ces survey codes and the associated quoted calls
# converts those vectors to data frames with an associated index number
# merges the two data frames by index and renames the columns
# removes the intermediate data frames and prints the merged results
# can be used to look up a survey code and the associated call
get_cescodes <- function(){
  ces1 <- c("ces2019_web", "ces2019_phone", "ces2015_web", "ces2015_phone", "ces2015_combo",
            "ces2011", "ces2008", "ces2004", "ces0411", "ces0406", "ces2000", "ces1997", "ces1993",
            "ces1988", "ces1984", "ces1974", "ces7480", "ces72_jnjl", "ces72_sep", "ces72_nov",
            "ces1968", "ces1965")
  ces2 <- c('"ces2019_web"', '"ces2019_phone"', '"ces2015_web"', '"ces2015_phone"', '"ces2015_combo"',
            '"ces2011"', '"ces2008"', '"ces2004"', '"ces0411"', '"ces0406"', '"ces2000"', '"ces1997"', '"ces1993"',
            '"ces1988"', '"ces1984"', '"ces1974"', '"ces7480"', '"ces72_jnjl"', '"ces72_sep"', '"ces72_nov"',
            '"ces1968"', '"ces1965"')
  ces1 <- data.frame(ces1)
  ces1$index <- seq.int(nrow(ces1))
  ces2 <- data.frame(ces2)
  ces2$index <- seq.int(nrow(ces2))
  ces_calltable <- merge(ces1, ces2, by = "index")
  ces_calltable <- dplyr::rename(ces_calltable, ces_survey_code = ces1, get_ces_call_char = ces2)
  rm(ces1)
  rm(ces2)
  print(ces_calltable)
}
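For comparison, a table of the same shape can be built in one step with base R. This is a sketch of an alternative, not the package's code: it uses only three of the survey codes, and it derives the quoted call strings with sprintf() instead of typing them out a second time.

```r
# Shortened list of survey codes for illustration (assumption)
codes <- c("ces2019_web", "ces2019_phone", "ces2015_web")

# Build the lookup table directly; the quoted call is derived from the code
ces_calltable <- data.frame(
  index = seq_along(codes),
  ces_survey_code = codes,
  get_ces_call_char = sprintf('"%s"', codes),
  stringsAsFactors = FALSE
)

print(ces_calltable)
```

Deriving the second column from the first means the two can never drift out of sync when a new survey code is added.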
get_preview()
Sometimes it can be helpful to have a truncated preview of a dataset to assist in exploratory analysis. Additionally, a truncated dataset provides a resource that educators can use when teaching exploratory data analysis. The get_preview() function provides such truncated versions of the CES datasets. The function takes two arguments: a character string that calls a survey, in the same style as used for the get_ces() function, and a numerical value that sets the number of rows returned. If no value is provided for the number of rows, a default of six rows is returned.
# function to call to create previews of the CES surveys
# code for the first section of the function is commented with how the function works,
# all following sections work in the same manner.
get_preview <- function(srvy, x = 6){
  # if 'srvy' is in 'ces_codes' vector
  if(srvy %in% ces_codes){
    # if 'srvy' is equal to 'ces2019_web'
    if(srvy == "ces2019_web"){
      # if the file does not exist
      if(!file.exists("inst/extdata/ces2019_web/ces2019_web.dta")){
        # assign download url
        cesfile <- "https://raw.github.com/hodgettsp/ces_data/master/extdata/CES2019-web.zip"
        # create temporary file name holder with extension .zip
        hldr <- tempfile(fileext = ".zip")
        # download the file from the url and assign temporary name
        download.file(cesfile, hldr, quiet = TRUE)
        # unzip the compressed folder to the given directory
        unzip(hldr, exdir = "inst/extdata/ces2019_web")
        # create a locally available variable
        survey_read <- haven::read_dta(hldr)
        # assign the data file to a globally available variable
        assign("ces2019_web_preview", head(labelled::to_factor(survey_read), x), envir = .GlobalEnv)
        # remove the temporary file
        unlink(hldr, recursive = TRUE)
        # remove the download directory
        unlink("inst/extdata/ces2019_web", recursive = TRUE)
        # remove the local variable
        rm(survey_read)
      }
    }
  }
  else{
    # if the provided code is not in the 'ces_codes' vector then stop process and print this message
    stop("Warning: Code not in table.")
  }
}
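The row-count argument and its default of six can be illustrated on a small made-up data frame. In this sketch, preview() and toy are hypothetical names used only for illustration; the real function also downloads the survey and converts labelled columns to factors, which is left out here.

```r
# Hypothetical stand-in for a survey: ten rows of made-up values
toy <- data.frame(id = 1:10, answer = rep(c("a", "b"), 5))

# Hypothetical helper mirroring the 'x = 6' default of get_preview()
preview <- function(df, x = 6) {
  head(df, x)
}

default_rows <- nrow(preview(toy))     # uses the default of six rows
custom_rows  <- nrow(preview(toy, 3))  # explicit row count

print(c(default_rows, custom_rows))
```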
get_question()
The get_question() function provides users with the ability to look up the survey question associated with a given column name. The function takes two arguments in the form of character strings: the name of a data object and the name of a column in that data object. The function first checks whether the given data object exists using the exists() function from the base package. If the object does not exist, the function prints a warning in the console stating Warning: Data object does not exist. If the object does exist, get_question() checks whether the given column name exists in the given data object. This is done using a combination of the hasName() function from the utils package and the get() function from the base package. The hasName() function checks whether the given column name is in the given data object. Because the arguments are given as character strings, the get() function is used to return the actual data object instead of the provided character string. Otherwise, the hasName() function would only check whether the given column name occurred in the given character string and not in the actual data object. If the column does not exist in the data object, a warning is printed in the console stating Warning: Variable is not in dataset. If the given column exists in the given data object, get_question() prints the variable label of the given column to the console using a combination of the var_label() function from the labelled package and the get() function from the base package.
As a side note, I provide a step-by-step breakdown of this function in this post.
# function to produce the column label for requested dataset and variable
# takes two parameters as character strings
# 'do' data object and 'q' question
get_question <- function(do, q){
  if(exists(do)){                                # if data object exists
    if(hasName(get(do), q)){                     # if data object has the name of the given question
      cat(labelled::var_label(get(q, get(do))))  # print out concatenation of the column label
                                                 # the get function is required because it
                                                 # returns the object from the provided character string
    }
    else{
      cat("Warning: Variable is not in dataset") # else, print this warning if question does not exist
                                                 # cat is used instead of stop because stop breaks the function
    }
  }
  else{
    cat("Warning: Data object does not exist")   # else, print this warning if data object does not exist
  }
}
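The role of get() is easiest to see on a toy object. The sketch below uses only base R and utils: toy_survey and lookup_label() are hypothetical, and the label is read from the "label" attribute directly (the attribute that labelled::var_label() reads) so that the example runs without haven or labelled installed.

```r
# Made-up data object with a variable label stored as a "label" attribute
toy_survey <- data.frame(q1 = c(1, 2, 1), q2 = c(3, 3, 4))
attr(toy_survey$q1, "label") <- "Toy question text for q1"

# hasName() on the character string itself finds nothing,
# because a string has no names; get() retrieves the real object first
name_in_string <- utils::hasName("toy_survey", "q1")
name_in_object <- utils::hasName(get("toy_survey"), "q1")

# Hypothetical helper following the same three checks as get_question(),
# but returning the text instead of printing it
lookup_label <- function(do, q) {
  if (!exists(do)) {
    return("Warning: Data object does not exist")
  }
  if (!utils::hasName(get(do), q)) {
    return("Warning: Variable is not in dataset")
  }
  attr(get(do)[[q]], "label", exact = TRUE)
}

print(lookup_label("toy_survey", "q1"))
print(lookup_label("toy_survey", "q99"))
print(lookup_label("no_such_object", "q1"))
```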
get_decon()
The last, but not least, function of the cesR package is the get_decon() function. When called, it creates a subset of the 2019 CES online survey under the name decon (demographics and economics), which provides a tool for educators teaching the analysis of large survey datasets. The get_decon() function takes no arguments. The function first checks whether an object named decon exists in the global environment using the exists() function from the base package. This prevents the decon dataset from being recreated if the object already exists. If the get_decon() function is run when an object with the name decon already exists, the function stops and prints in the console Error in get_decon() : Warning: File already exists.
If a situation arises in which the decon dataset needs to be recreated, the best course of action is to use the rm() function from the base package to remove the decon object and then run the get_decon() function again.
# function to create 'decon' dataset
# does not use any variable calls
get_decon <- function(){
  # if object does not exist in global environment
  if(!exists("decon")){
    # assign url to 'cesfile'
    cesfile <- "https://raw.github.com/hodgettsp/ces_data/master/extdata/CES2019-web.zip"
    # assign temporary file with .zip extension to placeholder variable
    hldr <- tempfile(fileext = ".zip")
    # download the file from url assigned to 'cesfile' with file extension from the temporary placeholder
    download.file(cesfile, hldr, quiet = TRUE)
    # unzip the placeholder file to given directory
    unzip(hldr, exdir = "inst/extdata/ces2019_hldr")
    # assign data file to temporary data object
    ces2019_hldr <- haven::read_dta(hldr)
    # create new data object with selected columns from temporary data object
    decon <- dplyr::select(ces2019_hldr, c(5:6, 8:10, 69, 76, 194, 223:227, 245, 250:251, 258, 123:125))
    # rename columns in new data object
    decon <- dplyr::rename(decon,
                           citizenship = 1,          # rename column 1 to citizenship
                           yob = 2,                  # rename column 2 to yob
                           gender = 3,               # rename column 3 to gender
                           province_territory = 4,   # rename column 4 to province_territory
                           education = 5,            # rename column 5 to education
                           lr_bef = 6,               # rename column 6 to lr_bef
                           lr_aft = 7,               # rename column 7 to lr_aft
                           religion = 8,             # rename column 8 to religion
                           sexuality_selected = 9,   # rename column 9 to sexuality_selected
                           sexuality_text = 10,      # rename column 10 to sexuality_text
                           language_eng = 11,        # rename column 11 to language_eng
                           language_fr = 12,         # rename column 12 to language_fr
                           language_abgl = 13,       # rename column 13 to language_abgl
                           employment = 14,          # rename column 14 to employment
                           income = 15,              # rename column 15 to income
                           income_cat = 16,          # rename column 16 to income_cat
                           marital = 17,             # rename column 17 to marital
                           econ_retro = 18,          # rename column 18 to econ_retro
                           econ_fed = 19,            # rename column 19 to econ_fed
                           econ_self = 20)           # rename column 20 to econ_self
    decon <- labelled::to_factor(decon)                                               # convert variables to factors
    decon <- dplyr::mutate(decon, lr_bef = as.character(lr_bef))                      # reassign values in lr_bef column as characters for uniting
    decon <- dplyr::mutate(decon, lr_aft = as.character(lr_aft))                      # reassign values in lr_aft column as characters for uniting
    decon <- tidyr::unite(decon, "lr", lr_bef:lr_aft, na.rm = TRUE, remove = FALSE)   # unite lr_bef and lr_aft columns into new column lr
    decon <- dplyr::mutate_if(decon, is.character, list(~dplyr::na_if(., "")))        # replaces empty cells in new lr column with NA
    assign("decon", dplyr::mutate(decon, ces_code = "ces2019_web", .before = 1), envir = .GlobalEnv)
    # remove temporary data object
    rm(ces2019_hldr)
    # remove the temporary placeholder
    unlink(hldr, recursive = TRUE, force = TRUE)
    # remove temporary directory
    unlink("inst/extdata/ces2019_hldr", recursive = TRUE, force = TRUE)
    # print out a concatenation of the survey citation
    cat("TO CITE THIS SURVEY FILE: Stephenson, Laura B; Harell, Allison; Rubenson, Daniel; Loewen, Peter John, 2020, '2019 Canadian Election Study - Online Survey',
https://doi.org/10.7910/DVN/DUS88V, Harvard Dataverse, V1\nLINK: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DUS88V")
  }
  else{
    # if the file does exist stop process and print this message
    stop("Warning: File already exists.")
  }
}
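The step that merges lr_bef and lr_aft into lr can be hard to picture, so here is a rough base R imitation of it. This is a sketch under assumptions: unite_lr() is a hypothetical helper, the input values are made up, and it only mimics the combination of tidyr::unite(..., na.rm = TRUE) followed by dplyr::na_if(., "") for two character columns.

```r
# Hypothetical helper: paste the non-missing values of two columns together
# with "_" and turn rows where both are missing into NA
unite_lr <- function(bef, aft) {
  out <- mapply(
    function(b, a) {
      vals <- c(b, a)
      paste(vals[!is.na(vals)], collapse = "_")  # "" when both are NA
    },
    bef, aft,
    USE.NAMES = FALSE
  )
  out[out == ""] <- NA_character_
  out
}

# Made-up values: a respondent usually has only one of the two columns filled
lr_bef <- c("3", NA, NA, "2")
lr_aft <- c(NA, "7", NA, "5")

lr <- unite_lr(lr_bef, lr_aft)
print(lr)
```

Rows with a value in only one column keep that value, rows with both get the two values joined by an underscore, and rows with neither become NA.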
Takeaways
Here are a few things that I learned or found useful while creating this package.
Using the normal assignment operator <- inside a function does not create a globally available object. Instead, you need to use the assign() function with the global environment as its target.
If you are going to use functions from another package in your package, use the :: method of calling the function. This way there will be no confusion between functions of the same name from different packages.
The document and roxygenise functions are your friends. I found that a lot of the issues I was having when testing cesR were because I had not run either of these functions.
The R Packages book from Hadley Wickham is one of the best resources you can find for creating an R package. I cannot recommend it enough.
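The first takeaway above, about <- versus assign(), can be checked in a few lines. This is a minimal sketch: make_local(), make_global(), local_obj, and shared_obj are hypothetical names used only for illustration.

```r
# '<-' inside a function creates an object that lives only in that call
make_local <- function() {
  local_obj <- 1:3
  invisible(NULL)
}

# assign() with envir = .GlobalEnv creates the object in the global environment
make_global <- function() {
  assign("shared_obj", 1:3, envir = .GlobalEnv)
  invisible(NULL)
}

make_local()
make_global()

local_found  <- exists("local_obj")   # not visible after the call returns
shared_found <- exists("shared_obj")  # visible from the global environment

print(c(local_found, shared_found))
```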
Installation
If you would like to use cesR, you can install the current version of this package using:
devtools::install_github("hodgettsp/cesR")