cesR: An R Package for the Canadian Election Study

This post breaks down my thought process and the functions behind the R package cesR. The full paper for cesR is available from SocArXiv Papers, and the full code for this project can be found on my GitHub account.


Introduction

This past summer (2020, to give a date-mark) I had the privilege of working as a co-op student with Dr. Rohan Alexander in the Toronto Data Lab at U of T. While this wasn’t the project I had set out to complete, nor the one I had proposed, it ended up being the most rewarding.

The purpose of cesR is to make accessing Canadian Election Study datasets easier for R users. It follows and was inspired by the work being done in the R community through such packages as the opendatatoronto package and the Lahman package. Packages such as these are important to R users as they improve the functionality of working within R by minimizing the number of steps required to load data and increasing the availability of data to R users.

cesR does this through the use of five functions: get_ces(), get_cescodes(), get_preview(), get_question(), and get_decon().



Functions

get_ces()

When called, the get_ces() function returns a requested CES survey as a data object and prints the associated citation and the URL of the survey dataset repository to the console. The function takes one argument, a character string containing a survey code. Inside get_ces(), each survey code is associated with the download URL for that survey in a GitHub repository named ces_data. If the provided character string matches a member of the built-in vector ces_codes, the associated file is downloaded as a compressed .zip folder to a temporary file using the download.file() function from the utils package. The compressed folder is then unzipped into the inst/extdata directory of the package directory using the unzip() function from the utils package, and the data is read into R using either the read_dta() or read_sav() function from the haven package, depending on the file extension of the downloaded file. The resulting data frame is assigned as a data object in the global environment using the assign() function from the base package. The downloaded file and its directory are then removed from the computer using the unlink() function from the base package. Finally, the recommended citation for the requested survey dataset and the URL of the survey data storage location are printed to the console.

If the provided character string does not have a match in the built-in vector, the function stops and an error message stating Error in get_ces(): Warning: Code not in table is printed in the console.

Below is an excerpt of the get_ces() code, showing the branch for the ces2019_web survey.

# 'get_ces' function, uses one variable 'srvy'
get_ces <- function(srvy){
  # if 'srvy' is in 'ces_codes' vector
  if(srvy %in% ces_codes){
    # if 'srvy' is equal to 'ces2019_web'
    if(srvy == "ces2019_web"){
      # if the file does not exist
      if(!file.exists("inst/extdata/ces2019_web/ces2019_web.dta")){
        # assign download url
        cesfile <- "https://raw.github.com/hodgettsp/ces_data/master/extdata/CES2019-web.zip"
        # create temporary file name holder with extension .zip
        hldr <- tempfile(fileext = ".zip")
        # download the file from the url and assign temporary name
        download.file(cesfile, hldr, quiet = TRUE)
        # unzip the compressed folder to the given directory
        unzip(hldr, exdir = "inst/extdata/ces2019_web")
        # assign the data file to a globally available variable
        assign("ces2019_web", haven::read_dta(hldr), envir = .GlobalEnv)
        # remove the temporary file
        unlink(hldr, recursive = TRUE)
        # remove the download directory
        unlink("inst/extdata/ces2019_web", recursive = TRUE)
        # print citation and link
        cat(ref2019web)
      }
    }
  }
  else{
    # if the provided code is not in the 'ces_codes' vector then stop process and print this message
    stop("Warning: Code not in table.")
  }
}
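
The check against the vector of survey codes can also be pictured as a lookup in a named vector. The following is a minimal sketch and not how cesR itself is written: ces_urls and lookup_ces_url() are hypothetical names introduced for illustration, and only the ces2019_web URL from the excerpt above is included. It also shows how a caller could catch the error raised by stop() using tryCatch().

```r
# Hypothetical lookup table: survey code -> download URL.
# Only the URL shown in the excerpt above is included.
ces_urls <- c(
  ces2019_web = "https://raw.github.com/hodgettsp/ces_data/master/extdata/CES2019-web.zip"
)

# Hypothetical helper: return the URL for a survey code,
# or stop with the same message get_ces() uses.
lookup_ces_url <- function(srvy) {
  if (!srvy %in% names(ces_urls)) {
    stop("Warning: Code not in table.")
  }
  ces_urls[[srvy]]
}

# A known code returns its URL
url_2019 <- lookup_ces_url("ces2019_web")

# An unknown code raises an error, which the caller can catch
msg <- tryCatch(
  lookup_ces_url("ces1900"),
  error = function(e) conditionMessage(e)
)
```

Because stop() signals an error rather than a warning, anything after the failed call is skipped unless the caller wraps it in tryCatch() as shown.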



get_cescodes()

To call a CES survey, the user requires an associated survey code. The get_cescodes() function provides a printout of these survey codes. The function does not take any arguments. Instead, when called, it prints to the console a data frame that contains the survey codes and their associated argument calls.

# get_cescodes function
# creates three vectors of the ces survey codes and associated calls
# converts those vectors to data frames with associated index number for call
# merges the three data frames and renames the columns
# removes the data frame items and prints merged results
# can be used to lookup a survey code and the associated calls.
get_cescodes <- function(){
  ces1 <- (c("ces2019_web", "ces2019_phone", "ces2015_web", "ces2015_phone", "ces2015_combo",
                "ces2011", "ces2008", "ces2004", "ces0411", "ces0406", "ces2000", "ces1997", "ces1993",
                "ces1988", "ces1984", "ces1974", "ces7480", "ces72_jnjl", "ces72_sep", "ces72_nov",
                "ces1968", "ces1965"))
  ces2 <- c('"ces2019_web"', '"ces2019_phone"', '"ces2015_web"', '"ces2015_phone"', '"ces2015_combo"',
                '"ces2011"', '"ces2008"', '"ces2004"', '"ces0411"', '"ces0406"', '"ces2000"', '"ces1997"', '"ces1993"',
                '"ces1988"', '"ces1984"', '"ces1974"', '"ces7480"', '"ces72_jnjl"', '"ces72_sep"', '"ces72_nov"',
                '"ces1968"', '"ces1965"')
  ces1 <- data.frame(ces1)
  ces1$index <- seq.int(nrow(ces1))
  ces2 <- data.frame(ces2)
  ces2$index <- seq.int(nrow(ces2))
  ces_calltable <- merge(ces1, ces2, by = "index")
  ces_calltable <- dplyr::rename(ces_calltable, ces_survey_code = ces1, get_ces_call_char = ces2)
  rm(ces1)
  rm(ces2)
  print(ces_calltable)
}
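
For comparison, the same three-column layout can be built in a single data.frame() call. This is only a sketch of an alternative, using a shortened list of the survey codes for illustration; it is not the code that cesR uses.

```r
# Illustrative subset of the survey codes listed above
codes <- c("ces2019_web", "ces2019_phone", "ces2015_web")

# Build the index, the code, and the quoted call string in one step
ces_calltable <- data.frame(
  index = seq_along(codes),
  ces_survey_code = codes,
  # wrap each code in double quotes, as it would be typed in get_ces()
  get_ces_call_char = sprintf('"%s"', codes),
  stringsAsFactors = FALSE
)

print(ces_calltable)
```

Building the quoted column from the unquoted one means the two columns cannot drift apart when a new survey code is added.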



get_preview()

Sometimes it can be helpful to have a truncated preview of a dataset to assist in exploratory analysis. Additionally, a truncated dataset provides a resource that can be used by educators in the teaching of exploratory data analysis. The get_preview() function provides such truncated versions of the CES datasets. The function takes two arguments: a character string to call a survey, of the same style as used for the get_ces() function, and a numerical value that sets the number of rows returned. If no value is provided for the number of rows, a default of six rows is returned.

# function to call to create previews of the CES surveys
# code for the first section of the function is commented with how the function works,
# all following sections work in the same manner.
get_preview <- function(srvy, x = 6){
  # if 'srvy' is in 'ces_codes' vector
  if(srvy %in% ces_codes){
    # if 'srvy' is equal to 'ces2019_web'
    if(srvy == "ces2019_web"){
      # if the file does not exist
      if(!file.exists("inst/extdata/ces2019_web/ces2019_web.dta")){
        # assign download url
        cesfile <- "https://raw.github.com/hodgettsp/ces_data/master/extdata/CES2019-web.zip"
        # create temporary file name holder with extension .zip
        hldr <- tempfile(fileext = ".zip")
        # download the file from the url and assign temporary name
        download.file(cesfile, hldr, quiet = TRUE)
        # unzip the compressed folder to the given directory
        unzip(hldr, exdir = "inst/extdata/ces2019_web")
        # create a locally available variable
        survey_read <- haven::read_dta(hldr)
        # assign the data file to a globally available variable
        assign("ces2019_web_preview", head(labelled::to_factor(survey_read), x), envir = .GlobalEnv)
        # remove the temporary file
        unlink(hldr, recursive = TRUE)
        # remove the download directory
        unlink("inst/extdata/ces2019_web", recursive = TRUE)
        # remove the local variable
        rm(survey_read)
      }
    }
  }
  else{
    # if the provided code is not in the 'ces_codes' vector then stop process and print this message
    stop("Warning: Code not in table.")
  }
}
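
The default of six rows comes from the default value of the second argument. A small sketch of just that part, using the built-in mtcars data in place of a CES survey and a hypothetical helper named preview_rows(); the conversion with labelled::to_factor() is left out so the example runs with base R only.

```r
# Hypothetical stand-in for the row-limiting step of get_preview():
# 'x' defaults to 6, matching the default described above.
preview_rows <- function(df, x = 6) {
  head(df, x)
}

n_default  <- nrow(preview_rows(mtcars))      # no row count given
n_explicit <- nrow(preview_rows(mtcars, 10))  # row count given
```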



get_question()

The get_question() function lets users look up the survey question associated with a given column name. The function takes two arguments in the form of character strings: the name of a data object and the name of a column in that data object. It first checks whether the given data object exists using the exists() function from the base package. If the object does not exist, the function prints a warning in the console stating Warning: Data object does not exist. If the object does exist, get_question() checks whether the given column name exists in the data object. This is done using a combination of the hasName() function from the utils package and the get() function from the base package. The hasName() function checks whether the given column name is in the given data object. Because the arguments are given as character strings, the get() function is used to return the actual data object rather than the character string that names it. Otherwise, hasName() would only check the names of the character string itself and not those of the actual data object. If the column does not exist in the data object, a warning is printed in the console stating Warning: Variable is not in dataset. If the column does exist, get_question() prints the variable label of the column to the console using a combination of the var_label() function from the labelled package and the get() function from the base package.

As a side note, I provide a step-by-step breakdown of this function in this post.

# function to produce the column label for requested dataset and variable
# takes two parameters as character strings
# 'do' data object and 'q' question
get_question <- function(do, q){
  if(exists(do)){                                                     # if data object exists
    if(hasName(get(do), q)){                                          # if data object has the name of the given question
      cat(labelled::var_label(get(q, get(do))))                       # print out concatenation of the column label
                                                                      # the get function is required because it
                                                                      # returns the object from the provided character string
    }
    else{
      cat("Warning: Variable is not in dataset")                      # else, print this warning if question does not exist
                                                                      # cat is used instead of stop because stop breaks the function
    }
  }
  else{
    cat("Warning: Data object does not exist")                        # else, print this warning if data object does not exist
  }
}
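
To see the exists(), get(), and hasName() steps in isolation, here is a base R sketch on a toy data frame. It assumes the variable label is stored in the "label" attribute of the column, which is the convention haven and labelled follow. The names toy_survey and lookup_label(), the column, and the label text are all made up for illustration, and the helper returns the text instead of printing it.

```r
# Toy data object with a variable label stored in the "label" attribute
toy_survey <- data.frame(yob = c(1980, 1992))
attr(toy_survey$yob, "label") <- "In what year were you born?"

# Hypothetical base R variant of the lookup that returns the label
lookup_label <- function(do, q) {
  if (!exists(do)) {
    return("Warning: Data object does not exist")
  }
  dat <- get(do)  # turn the name into the actual object
  if (!utils::hasName(dat, q)) {
    return("Warning: Variable is not in dataset")
  }
  attr(dat[[q]], "label", exact = TRUE)
}

found      <- lookup_label("toy_survey", "yob")
no_column  <- lookup_label("toy_survey", "age")
no_object  <- lookup_label("no_such_object", "yob")
```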



get_decon()

Last, but not least, is the get_decon() function. When called, it creates a subset of the 2019 CES online survey under the name decon (demographics and economics), which provides a tool for educators teaching the analysis of large survey datasets. The get_decon() function takes no arguments. The function first checks whether an object named decon exists in the global environment using the exists() function from the base package. This prevents the decon dataset from being recreated if the object already exists. If get_decon() is run when an object named decon already exists, the function stops and an error is printed in the console stating Error in get_decon() : Warning: File already exists. If the decon dataset needs to be recreated, the best course of action is to remove the decon object with the rm() function from the base package and then run get_decon() again.

# function to create 'decon' dataset
# takes no arguments
get_decon <- function(){
    # if object does not exist in global environment
    if(!exists("decon")){
       # assign url to 'cesfile'
       cesfile <- "https://raw.github.com/hodgettsp/ces_data/master/extdata/CES2019-web.zip"
       # assign temporary file with .zip extension to placeholder variable
       hldr <- tempfile(fileext = ".zip")
       # download the file from url assigned to 'cesfile' with file extension from the temporary placeholder
       download.file(cesfile, hldr, quiet = TRUE)
       # unzip the placeholder file to given directory
       unzip(hldr, exdir = "inst/extdata/ces2019_hldr")
       # assign data file to temporary data object
       ces2019_hldr <- haven::read_dta(hldr)
       # create new data object with selected columns from temporary data object
       decon <- dplyr::select(ces2019_hldr, c(5:6, 8:10, 69,76, 194, 223:227, 245, 250:251, 258, 123:125))
       # rename columns in new data object
       decon <- dplyr::rename(decon,
                              citizenship = 1,                                # rename column 1 to citizenship
                              yob = 2,                                        # rename column 2 to yob
                              gender = 3,                                     # rename column 3 to gender
                              province_territory = 4,                         # rename column 4 to province_territory
                              education = 5,                                  # rename column 5 to education
                              lr_bef = 6,                                     # rename column 6 to lr_bef
                              lr_aft = 7,                                     # rename column 7 to lr_aft
                              religion = 8,                                   # rename column 8 to religion
                              sexuality_selected = 9,                         # rename column 9 to sexuality_selected
                              sexuality_text = 10,                            # rename column 10 to sexuality_text
                              language_eng = 11,                              # rename column 11 to language_eng
                              language_fr = 12,                               # rename column 12 to language_fr
                              language_abgl = 13,                             # rename column 13 to language_abgl
                              employment = 14,                                # rename column 14 to employment
                              income = 15,                                    # rename column 15 to income
                              income_cat = 16,                                # rename column 16 to income_cat
                              marital = 17,                                   # rename column 17 to marital
                              econ_retro = 18,                                # rename column 18 to econ_retro
                              econ_fed = 19,                                  # rename column 19 to econ_fed
                              econ_self = 20)                                 # rename column 20 to econ_self
       decon <- labelled::to_factor(decon)                                    # convert variables to factors
       decon <- dplyr::mutate(decon, lr_bef = as.character(lr_bef))           # reassign values in lr_bef column as characters for uniting
       decon <- dplyr::mutate(decon, lr_aft = as.character(lr_aft))           # reassign values in lr_aft column as characters for uniting
       decon <- tidyr::unite(decon, "lr", lr_bef:lr_aft, na.rm = TRUE, remove = FALSE)   # unite lr_bef and lr_aft columns into new column lr
       decon <- dplyr::mutate_if(decon, is.character, list(~dplyr::na_if(., "")))        # replaces empty cells in new lr column with NA
       assign("decon", dplyr::mutate(decon, ces_code = "ces2019_web", .before = 1), envir = .GlobalEnv)
       # remove temporary data object
       rm(ces2019_hldr)
       # remove the temporary placeholder
       unlink(hldr, recursive = TRUE, force = TRUE)
       # remove temporary directory
       unlink("inst/extdata/ces2019_hldr", recursive = TRUE, force = TRUE)
       # print out a concatenation of the survey citation
       cat("TO CITE THIS SURVEY FILE: Stephenson, Laura B; Harell, Allison; Rubenson, Daniel; Loewen, Peter John, 2020, '2019 Canadian Election Study - Online Survey',
           https://doi.org/10.7910/DVN/DUS88V, Harvard Dataverse, V1\nLINK: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DUS88V")
    }
    else{
        # if the file does exist stop process and print this message
        stop("Warning: File already exists.")
    }
}
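
The step that unites lr_bef and lr_aft into lr can be hard to picture from the code alone. The sketch below mimics it in base R on made-up values. It assumes, as I understand tidyr::unite() with na.rm = TRUE and its default "_" separator, that missing values are dropped before pasting, so a row with two missing values yields an empty string that then has to be turned back into NA. unite_lr() is a hypothetical helper and not part of cesR.

```r
# Hypothetical base R imitation of uniting two columns with NAs removed
unite_lr <- function(bef, aft) {
  combined <- mapply(
    function(b, a) paste(stats::na.omit(c(b, a)), collapse = "_"),
    bef, aft,
    USE.NAMES = FALSE
  )
  # rows where both inputs were NA come out as "", so restore NA
  combined[combined == ""] <- NA_character_
  combined
}

# Made-up values: answered before, answered after, never answered
lr <- unite_lr(c("3", NA, NA), c(NA, "7", NA))
```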



Takeaways

Here are a few things that I learned or found useful while creating this package.

Using the normal assignment operator <- inside a function does not create a globally available object. Instead, you can use the assign() function with envir = .GlobalEnv.
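
A small sketch of the difference, using made-up function and object names:

```r
# Assignment with <- stays inside the function
set_local <- function() {
  local_only <- "made with <-"
  invisible(NULL)
}

# assign() with envir = .GlobalEnv creates a global object
set_global <- function() {
  assign("global_obj", "made with assign()", envir = .GlobalEnv)
  invisible(NULL)
}

set_local()
set_global()

exists("local_only")  # FALSE: the object vanished with the function call
exists("global_obj")  # TRUE: the object is in the global environment
```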

If you are going to use functions from another package in your package, use the :: method of calling the function. This way there will be no confusion between functions of the same name from different packages.
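
As a small illustration of why this matters, the sketch below deliberately masks head() with a made-up function and then shows that the :: form still reaches the version from utils. In cesR the same idea appears in calls such as haven::read_dta() and labelled::var_label().

```r
# Define a function that masks utils::head() by name (for illustration only)
head <- function(x, ...) "masked"

masked_result   <- head(1:10)            # calls the masking function above
explicit_result <- utils::head(1:10, 3)  # calls the version from utils

# Remove the masking function again
rm(head)
```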

The devtools::document() and roxygen2::roxygenise() functions are your friends. I found that a lot of the issues I was having when testing the cesR functions were because I had not run either of these functions.

The R Packages book from Hadley Wickham is one of the best resources you can find for creating an R package. I cannot recommend it enough.




Installation

If you would like to use cesR, you can install the current version of this package using:

devtools::install_github("hodgettsp/cesR")