Iterating with Election Data

Objectives

Last time, we used some “brute force” approaches to clean up the election data: lots of repetition and lots of copying and pasting. That’s not ideal. It’s not particularly readable (think about reading the same sentence over and over with one word changed) and it’s pretty error-prone. To clean that up, we will:

Develop our own custom functions and use iteration to eliminate redundancy.

You will still be working with the FEC contribution receipts for our four members of Congress, but this time we’ll do it in a more principled way.

Getting Set Up

First, join the repository. Then, we’ll have to get our credentials and log in to a new RStudio session. Finally, create a new version-controlled, Git-based project (using the link to the repo you just created).

Once you’ve done that, you’ll need to load your packages. In this case, it’s just the tidyverse.

Code

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Reading in the Data

Last time, we copied the read_delim function 4x to read in all the data:

Code

simpson_receipts <- read_delim("data/original/simpson_fec_17_26.csv",
                               delim = ",")

fulcher_receipts <- read_delim("data/original/fulcher_fec_17_26.csv",
                               delim = ",")

crapo_receipts <- read_delim("data/original/crapo_fec_17_26.csv",
                               delim = ",")

risch_receipts <- read_delim("data/original/risch_fec_17_26.csv",
                               delim = ",")

That’s not ideal. We can do the same thing with map using two commands. First, we’ll use list.files() to get a vector of filenames. Then, we’ll pass that vector to map.

Code

contribution_files <- list.files(here::here("data/original/"), pattern = "17_26.csv", full.names = TRUE)

contribution_data <- map(contribution_files,
                         function(x) read_delim(x, delim=","))

Rows: 8124 Columns: 78
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (51): committee_id, committee_name, report_type, filing_form, line_numb...
dbl  (10): report_year, image_number, link_id, file_number, contribution_rec...
lgl  (15): recipient_committee_org_type, is_individual, memo_code_full, cand...
dttm  (2): contribution_receipt_date, load_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 3732 Columns: 78
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (50): committee_id, committee_name, report_type, filing_form, line_numb...
dbl   (9): report_year, image_number, link_id, file_number, contribution_rec...
lgl  (17): contributor_prefix, recipient_committee_org_type, is_individual, ...
dttm  (2): contribution_receipt_date, load_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 4001 Columns: 78
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (51): committee_id, committee_name, report_type, filing_form, line_numb...
dbl   (9): report_year, image_number, link_id, file_number, contribution_rec...
lgl  (16): recipient_committee_org_type, is_individual, memo_code_full, cand...
dttm  (2): contribution_receipt_date, load_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 4257 Columns: 78
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (50): committee_id, committee_name, report_type, filing_form, line_numb...
dbl  (10): report_year, image_number, link_id, file_number, contribution_rec...
lgl  (16): recipient_committee_org_type, is_individual, memo_code_full, cand...
dttm  (2): contribution_receipt_date, load_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

I did a few things here. First, I used here::here() to make sure that R used the correct project directory when looking for my files. Then, I passed “17_26.csv” as the value of the pattern argument. The pattern argument accepts a regular expression, which lets me make sure that I only get the files that match our election data (this is why file naming is important!!). Because it is a regular expression, the . technically matches any single character, which is harmless here but worth knowing. Finally, I passed TRUE to the full.names argument to ensure that I get the entire path (not just the filename). Once I’ve got contribution_files, I pass it to map. I had to write out the function(x) because I wanted to pass additional arguments to read_delim.
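Since pattern is treated as a regular expression, a stricter version can escape the dot and anchor the end of the name. The sketch below is only an illustration: the temporary folder toy_dir and the three tiny files in it are made up for the example, and show_col_types = FALSE is the option the readr message above suggests for quieting the column specification.

```r
library(purrr)
library(readr)

# Hypothetical toy folder standing in for data/original/
toy_dir <- file.path(tempdir(), "toy_original")
dir.create(toy_dir, showWarnings = FALSE)

# Two files we want, plus one decoy that a loose pattern also matches
write_csv(data.frame(amount = c(10, 20)), file.path(toy_dir, "crapo_fec_17_26.csv"))
write_csv(data.frame(amount = 30), file.path(toy_dir, "risch_fec_17_26.csv"))
writeLines("not data", file.path(toy_dir, "notes_17_26xcsv.txt"))

# "." matches any character, so the loose pattern picks up the decoy too
loose <- list.files(toy_dir, pattern = "17_26.csv", full.names = TRUE)

# Escaping the dot and anchoring with $ keeps only the real CSV files
strict <- list.files(toy_dir, pattern = "_fec_17_26\\.csv$", full.names = TRUE)

# Same map() call as in the lesson, with the readr message silenced
toy_data <- map(strict,
                function(x) read_delim(x, delim = ",", show_col_types = FALSE))

length(loose)
length(strict)
```

With well-named files like ours the loose pattern is fine; the stricter one only matters once other files start to share the folder.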

You’ve probably noticed that there’s an even more compact way to do this using a pipe:

Code

contribution_data <- list.files(here::here("data/original/"),
                                pattern = "17_26.csv",
                                full.names = TRUE) %>%
  map(., function(x) read_delim(x, delim = ","))

Better still, we can replace read_delim with read_csv and eliminate the need to pass additional arguments.

Code

contribution_data <- list.files(here::here("data/original/"), 
                                 pattern = "17_26.csv", 
                                 full.names = TRUE) %>%  
  map(., read_csv)

This is helpful, but I would like to keep track of who the data belongs to, so I’m going to step back and write my own one-off (anonymous) function inside of map.

Code

contribution_data <- list.files(here::here("data/original/"), 
                                 pattern = "17_26.csv", 
                                 full.names = TRUE) %>%  
  map(~ read_csv(.x) %>%
    mutate(representative = str_extract(basename(.x), "^[^_]+(?=_fec_)"))
  )

I’ve added a mutate step to the pipeline to create a column that tracks the representative. The str_extract function is from stringr and relies on regular expressions to decide which characters to keep. This particular expression keeps everything from the start of the filename up to (but not including) “_fec_”. Regular expressions are often complicated and frustrating, but they can be extremely powerful. One of the places I rely on AI in my coding is to get these right!
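To see what each piece of that expression is doing, it can help to run it on a couple of file paths by hand. This is a small sketch using paths shaped like the ones in this lesson; "readme.txt" is a made-up name included only to show what happens when a file does not follow the naming convention.

```r
library(stringr)

# File paths shaped like the ones in this lesson
paths <- c("data/original/crapo_fec_17_26.csv",
           "data/original/simpson_fec_17_26.csv")

# basename() drops the directory part of each path
files <- basename(paths)

# ^          start of the string
# [^_]+      one or more characters that are not an underscore
# (?=_fec_)  lookahead: must be followed by "_fec_", which is not kept
reps <- str_extract(files, "^[^_]+(?=_fec_)")
reps

# A name that does not follow the convention gives NA rather than an error
no_match <- str_extract("readme.txt", "^[^_]+(?=_fec_)")
no_match
```

The NA result is a useful warning sign: if the representative column ever contains missing values, a file name probably broke the pattern.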

Now, let’s inspect the resulting object:

Code

class(contribution_data)

[1] "list"

This is the first time we’ve really encountered the list data class. Lists can hold multiple data types, so long as each piece of data sits in its own slot. So a list could have one slot that contains a vector, another slot that contains a matrix, and another that contains a dataframe. Lists are extremely flexible, but take a little getting used to.
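As a quick illustration of that flexibility, here is a toy list with exactly those three kinds of objects. The slot names are invented for the example.

```r
# A list with three slots, each holding a different kind of object
toy_list <- list(
  a_vector = c(1, 2, 3),
  a_matrix = matrix(1:4, nrow = 2),
  a_dataframe = data.frame(x = 1:2, y = c("a", "b"))
)

# Ask each slot what it contains
slot_classes <- lapply(toy_list, class)
slot_classes

# The list itself has one element per slot
length(toy_list)
```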

One of the tricky bits is actually getting to the data in each slot. We can do this by using the [[ notation.

Code

glimpse(contribution_data[[1]])

Rows: 8,124
Columns: 79
$ committee_id                          <chr> "C00330886", "C00330886", "C0033…
$ committee_name                        <chr> "MIKE CRAPO FOR US SENATE", "MIK…
$ report_year                           <dbl> 2018, 2018, 2018, 2018, 2018, 20…
$ report_type                           <chr> "YE", "Q2", "Q2", "Q2", "Q2", "Q…
$ image_number                          <dbl> 2.019042e+17, 2.019072e+17, 2.01…
$ filing_form                           <chr> "F3", "F3", "F3", "F3", "F3", "F…
$ link_id                               <dbl> 4.04152e+18, 1.09302e+18, 1.0930…
$ line_number                           <chr> "11AI", "12", "11AI", "12", "12"…
$ transaction_id                        <chr> "A60DF5DC6A9CF4EA1A7E", NA, NA, …
$ file_number                           <dbl> 1325146, 1339467, 1339467, 13394…
$ entity_type                           <chr> "IND", "IND", "IND", "IND", "IND…
$ entity_type_desc                      <chr> "INDIVIDUAL", "INDIVIDUAL", "IND…
$ unused_contbr_id                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ contributor_prefix                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ contributor_name                      <chr> "WILSON, KEVIN", "LUNDY, GARY L.…
$ recipient_committee_type              <chr> "S", "S", "S", "S", "S", "S", "S…
$ recipient_committee_org_type          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ recipient_committee_designation       <chr> "P", "P", "P", "P", "P", "P", "P…
$ contributor_first_name                <chr> "KEVIN", "GARY", "JAMES", "RICHA…
$ contributor_middle_name               <chr> NA, "L.", NA, "B.", "VALENTINE",…
$ contributor_last_name                 <chr> "WILSON", "LUNDY", "GROSFELD", "…
$ contributor_suffix                    <chr> NA, NA, NA, NA, "III", NA, NA, N…
$ contributor_street_1                  <chr> "1311 ALPS DR", "507 W CRESTVIEW…
$ contributor_street_2                  <chr> NA, NA, "STE 1600", NA, NA, NA, …
$ contributor_city                      <chr> "MC LEAN", "PITTSBURG", "SOUTHFI…
$ contributor_state                     <chr> "VA", "KS", "MI", "KS", "TN", "I…
$ contributor_zip                       <chr> "221021501", "667626286", "48076…
$ contributor_employer                  <chr> "FEDERAL RESERVE BOARD", "WATCO …
$ contributor_occupation                <chr> "ANALYST", "CHAIRMAN AND EXECUTI…
$ contributor_id                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ is_individual                         <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, FA…
$ receipt_type                          <chr> "15J", "15J", "15", "15J", "15J"…
$ receipt_type_desc                     <chr> "MEMO (FILER'S % OF CONTRIBUTION…
$ receipt_type_full                     <chr> "REATTRIBUTION TO SPOUSE", NA, N…
$ memo_code                             <chr> "X", "X", NA, "X", "X", NA, NA, …
$ memo_code_full                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ memo_text                             <chr> NA, NA, NA, NA, NA, "BANK INTERE…
$ contribution_receipt_date             <dttm> 2018-10-01, 2019-03-07, 2019-06…
$ contribution_receipt_amount           <dbl> -400.00, 100.00, 100.00, 100.00,…
$ contributor_aggregate_ytd             <dbl> 5400.00, 3940.00, 4700.00, 3440.…
$ candidate_id                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_name                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_first_name                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_last_name                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_middle_name                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_prefix                      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_suffix                      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_office                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_office_full                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_office_state                <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_office_state_full           <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_office_district             <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ conduit_committee_id                  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ conduit_committee_name                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ conduit_committee_street1             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ conduit_committee_street2             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ conduit_committee_city                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ conduit_committee_state               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ conduit_committee_zip                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ donor_committee_name                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ national_committee_nonfederal_account <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ election_type                         <chr> "P2022", "P2022", "P2022", "P202…
$ election_type_full                    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ fec_election_type_desc                <chr> "PRIMARY", "PRIMARY", "PRIMARY",…
$ fec_election_year                     <dbl> 2022, 2022, 2022, 2022, 2022, 20…
$ two_year_transaction_period           <dbl> 2018, 2018, 2018, 2018, 2018, 20…
$ amendment_indicator                   <chr> "N", "A", "A", "A", "A", "A", "A…
$ amendment_indicator_desc              <chr> "NO CHANGE", "ADD", "ADD", "ADD"…
$ schedule_type                         <chr> "SA", "SA", "SA", "SA", "SA", "S…
$ schedule_type_full                    <chr> "ITEMIZED RECEIPTS", "ITEMIZED R…
$ increased_limit                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ load_date                             <dttm> 2019-04-19 05:39:15, 2019-10-01…
$ sub_id                                <dbl> 4.04182e+18, 1.09302e+18, 1.0930…
$ original_sub_id                       <dbl> NA, 4.07242e+18, 4.07242e+18, 4.…
$ back_reference_transaction_id         <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ back_reference_schedule_name          <chr> NA, "SA12", NA, "SA12", "SA12", …
$ pdf_url                               <chr> "http://docquery.fec.gov/cgi-bin…
$ line_number_label                     <chr> "Contributions From Individuals/…
$ representative                        <chr> "crapo", "crapo", "crapo", "crap…

By using [[1]], I’m telling glimpse to run on the object that’s in the first slot of contribution_data. Once we’ve pulled the data out of the list, it behaves the same way it would if it had never been in a list.
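The difference between single and double brackets trips up a lot of people, so here is a small sketch with a made-up two-slot list. Single brackets return a smaller list; double brackets return the object stored in the slot.

```r
toy_list <- list(first = data.frame(x = 1:3),
                 second = data.frame(x = 4:5))

# Single brackets keep the list wrapper: the result is a list of length 1
class(toy_list[1])

# Double brackets pull out the object stored in the slot
class(toy_list[[1]])

# Named slots can also be reached by name
nrow(toy_list[["second"]])
nrow(toy_list$first)
```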

Choosing the Correct Columns

Last time, we used a pretty clunky way to narrow down all of the data.

Code

select_vars <- c("is_individual", "contributor_name",  "contribution_receipt_amount", "committee_name", "report_year")

simpson_subset <- simpson_receipts %>% 
  select(all_of(select_vars)) %>% 
  filter(is_individual == TRUE)

fulcher_subset <- fulcher_receipts %>% 
  select(all_of(select_vars)) %>% 
  filter(is_individual == TRUE)

crapo_subset <- crapo_receipts %>% 
  select(all_of(select_vars)) %>% 
  filter(is_individual == TRUE)

risch_subset <- risch_receipts %>% 
  select(all_of(select_vars)) %>% 
  filter(is_individual == TRUE)

Let’s clean this up by creating our own function:

Code

select_ind_donors <- function(df, vars, donor_type){
  df %>% 
    select(all_of({{ vars }})) %>% 
    filter(is_individual == donor_type)
}

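Our function, select_ind_donors, has a descriptive name and takes 3 arguments (df, vars, and donor_type). The one thing to notice is our use of embracing ({{ }}) to pass our variables into the function. Because vars is a character vector of column names, all_of() is what turns those names into a selection; embracing matters most when a function receives unquoted column names. Let’s test it on 1 slot of our list.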
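To make the distinction between character vectors and unquoted column names concrete, here is a hedged sketch with two hypothetical helper functions (keep_by_name and keep_by_column) and a made-up three-row tibble. Neither helper is part of the lesson’s code; they only illustrate the two ways of handing columns to select().

```r
library(dplyr)

toy <- tibble(is_individual = c(TRUE, FALSE, TRUE),
              contributor_name = c("A", "B", "C"),
              amount = c(10, 20, 30))

# Character vector of names: all_of() does the work
keep_by_name <- function(df, vars) {
  df %>% select(all_of(vars))
}

# Unquoted column names: embracing forwards them to select()
keep_by_column <- function(df, cols) {
  df %>% select({{ cols }})
}

by_name <- keep_by_name(toy, c("is_individual", "amount"))
by_column <- keep_by_column(toy, c(is_individual, amount))

# Both approaches keep the same columns
identical(names(by_name), names(by_column))
```

Storing the names in a character vector, as we do with select_vars, has the advantage that the same vector can be reused across several calls.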

Code

select_vars <- c("is_individual", "contributor_name",  "contribution_receipt_amount", "committee_name", "report_year",
                 "representative")

test_fun <- select_ind_donors(df = contribution_data[[1]],
                              vars = select_vars,
                              donor_type = TRUE)
glimpse(test_fun)

Rows: 3,979
Columns: 6
$ is_individual               <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ contributor_name            <chr> "WILSON, KEVIN", "LUNDY, GARY L.", "GROSFE…
$ contribution_receipt_amount <dbl> -400, 100, 100, 100, 140, 175, 175, 200, 2…
$ committee_name              <chr> "MIKE CRAPO FOR US SENATE", "MIKE CRAPO FO…
$ report_year                 <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
$ representative              <chr> "crapo", "crapo", "crapo", "crapo", "crapo…

Looks like that worked! Now let’s use iteration to do that to all of the slots in our list.

Code

select_vars <- c("is_individual", "contributor_name",  "contribution_receipt_amount", "committee_name", "report_year",
                 "representative")

ind_donors <- map(contribution_data,
                  \(x) select_ind_donors(df = x,
                                         vars = select_vars,
                                         donor_type = TRUE))

Here you can see the power of lists combined with map: each slot gets passed to our function so that we don’t have to extract each dataframe to run our function.
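So far we have written the function inside map three different ways: function(x), the ~ formula shorthand, and \(x). The sketch below, using a made-up list of two small dataframes, shows that they are interchangeable. Note that the \(x) form needs R 4.1 or later.

```r
library(purrr)

toy_list <- list(data.frame(x = 1:3), data.frame(x = 1:5))

# Three ways to write the same anonymous function
rows_long <- map(toy_list, function(df) nrow(df))
rows_formula <- map(toy_list, ~ nrow(.x))
rows_lambda <- map(toy_list, \(df) nrow(df))

# map() always returns a list; map_int() returns an integer vector instead
rows_vec <- map_int(toy_list, nrow)
rows_vec
```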

The only problem left is to get this into a single dataframe. We can do that easily enough using bind_rows:

Code

select_vars <- c("is_individual", "contributor_name",  "contribution_receipt_amount", "committee_name", "report_year",
                 "representative")

ind_donors <- map(contribution_data,
                  \(x) select_ind_donors(df = x,
                                         vars = select_vars,
                                         donor_type = TRUE)) %>%
  bind_rows()
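bind_rows() stacks the dataframes in a list on top of one another, matching columns by name. As an aside, when the list has names, the .id argument can record which slot each row came from. This is an alternative sketch to the mutate step we used when reading the files, not a replacement for it, and the toy list below is invented for the example.

```r
library(dplyr)

# A named list of small tibbles standing in for the per-member data
toy_list <- list(
  crapo = tibble(amount = c(100, 250)),
  risch = tibble(amount = 75)
)

# .id adds a column holding the name of the slot each row came from
combined <- bind_rows(toy_list, .id = "representative")
combined
```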

Final Thoughts

This is just the tip of the iceberg. We’ll continue practicing building functions and iterating as we move through the rest of the semester. For now, I just want you to get some practice building intuition for how this works.