Introducing your Data

Objectives

Practice getting data into R from different sources
Learn to identify common import errors
Generate summaries of imported data

A Note about Project Management

You’ll notice that every repository we use for assignments in this class has a set folder structure (with data, docs, etc.). This helps ensure that once anyone has cloned the repository, all of the paths to files, code, etc. will be the same regardless of who’s running the code. For this lesson, we’ll be working within the data folder. You’ll notice that within the data folder, there is a subfolder for original and one for processed. The original folder is reserved for unmodified data. This could be your initial spreadsheet of data, the version of a dataset that you downloaded for an analysis, or any other file that you will eventually modify for your analysis. If you make any changes (rename variables, filter observations, modify values), those changes should be saved to a new file in the processed folder.

Warning

For your analysis to be reproducible, any filtering, cleaning, or modification of that original data should be documented in your scripts or Quarto document and the outputs stored in the processed folder.
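As a minimal sketch of that workflow, the code below reads an original file, filters it in code, and writes the result to a processed folder. The project folder lives in a temporary directory, and the file and column names (field_data.csv, site, count) are made up for illustration; they are not files from the course repository.

```r
library(readr)

# Hypothetical project root in a temporary directory (a stand-in for your repository)
project_dir <- file.path(tempdir(), "example-project")
dir.create(file.path(project_dir, "data", "original"),
           recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project_dir, "data", "processed"),
           recursive = TRUE, showWarnings = FALSE)

# A made-up "original" file standing in for your field data
original_path <- file.path(project_dir, "data", "original", "field_data.csv")
write_csv(data.frame(site = c("A", "B", "C"), count = c(4, 0, 7)), original_path)

# Read the original, modify it in code (not by hand), and save to processed
field_data <- read_csv(original_path, show_col_types = FALSE)
field_data_nonzero <- field_data[field_data$count > 0, ]
processed_path <- file.path(project_dir, "data", "processed",
                            "field_data_nonzero.csv")
write_csv(field_data_nonzero, processed_path)
```

The original file is never overwritten, so anyone can rerun the script and get the same processed file.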

Let’s load some packages

Before you get too far into this, it’s a good idea to load all of your packages.

Code

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Code

library(tigris)

To enable caching of data, set `options(tigris_use_cache = TRUE)`
in your R script or .Rprofile.

Code

library(sf)

Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE

Common Sources of Data

There are, in general, three types of data that you will encounter that meet our original data criteria: data you’ve collected, data you’ve downloaded from somewhere else, or data you’ve accessed via a package.

It’s yours!

So far, this is the type of data that most students bring into class. It’s usually some sort of spreadsheet that contains all of that hard-earned field data. Although there are R packages for dealing with Microsoft Excel spreadsheets, we won’t focus on those for two reasons: 1) Excel is proprietary software and so may not be available to all users and 2) Excel makes a lot of formatting choices for you that may not actually be helpful in your analysis.

Instead, we’ll focus on a more general idea, the “delimited” text file. A delimited text file is flat (i.e., there’s only one “sheet”) and uses a consistent character (like a , or a tab) to denote where column breaks should occur. Delimited text files can be created and read with a variety of free software, making them more accessible to others. It’s also a fairly trivial exercise to save an Excel spreadsheet as a .csv.

There are a variety of functions in base R that will read delimited files (read.csv, read.table, and read.delim are just a few examples), but we’re going to use the readr package from the tidyverse because it will help you get used to some of the tidyverse conventions and because it automates more of the data import process.

We’ll talk more about the tidyverse next week, but for the time being it’s worth knowing that the general structure of tidyverse functions is to combine a verb with an object. So in order to read a delimited file, we might use the read_delim() function, where read is the verb and delim is the object.

For this example, I’ve downloaded a file from the Federal Election Commission (FEC) depicting the campaign contributions to one of our Congressmen in the 2025-2026 election cycle. It’s located in your data/original folder. You can read it into your environment using read_delim and specifying a , for the delim argument (if you have a tab-delimited .txt file you would use \t). Because we’ll want to look at this data later, I’m assigning it to an object called election_data_1 using the assignment operator <-. If you look at the help file for read_delim (by typing ?read_delim), you’ll see that there are a variety of other functions (like read_csv or read_tsv) that let you skip the delim argument. I’ll demonstrate that here, too.

Code

election_data_1 <- read_delim("data/original/fec_reciepts.csv", 
                            delim = ",")

Rows: 233 Columns: 78
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (40): committee_id, committee_name, report_type, filing_form, line_numb...
dbl   (9): report_year, image_number, link_id, file_number, contribution_rec...
lgl  (27): contributor_prefix, recipient_committee_org_type, is_individual, ...
dttm  (2): contribution_receipt_date, load_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

election_data_2 <- read_csv("data/original/fec_reciepts.csv")

Rows: 233 Columns: 78
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (40): committee_id, committee_name, report_type, filing_form, line_numb...
dbl   (9): report_year, image_number, link_id, file_number, contribution_rec...
lgl  (27): contributor_prefix, recipient_committee_org_type, is_individual, ...
dttm  (2): contribution_receipt_date, load_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
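To see what the delim argument is doing without needing the FEC file, here is a small sketch that uses made-up text. Wrapping a string in I() asks readr to treat it as literal data rather than a file path (this assumes readr 2.0 or later); the site and count columns are hypothetical.

```r
library(readr)

# I() tells readr the string is the data itself, not a path to a file
comma_text <- I("site,count\nA,4\nB,7\n")
tab_text <- I("site\tcount\nA\t4\nB\t7\n")

# The general function needs to be told which delimiter to use...
from_delim <- read_delim(comma_text, delim = ",", show_col_types = FALSE)
from_tab <- read_delim(tab_text, delim = "\t", show_col_types = FALSE)

# ...while the shortcuts have the delimiter built in
from_csv <- read_csv(comma_text, show_col_types = FALSE)
from_tsv <- read_tsv(tab_text, show_col_types = FALSE)

dim(from_delim)
dim(from_tsv)
```

All four objects should end up with the same two columns; only the way the delimiter is supplied changes.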

A Note About Parsing

You’ll notice that both read_ functions return a bit of output telling you the names and data types of each column in the dataset. This is one of the features of using readr::read_ - it attempts to guess what data type each column should be, using the number of rows set by the guess_max argument. You can see from the help file that the default value for guess_max is min(1000, n_max), meaning that it will use at most 1000 rows (or n_max rows, if that is smaller). This can be helpful for large datasets, but it can also introduce some challenges, and the objects returned by different versions of read_ are not always stored in exactly the same way. You can see this by running:

Code

identical(election_data_1, election_data_2)

[1] FALSE

Code

all.equal(election_data_1, election_data_2)

[1] TRUE

Despite the fact that the two objects were created from exactly the same file, identical returns FALSE while all.equal returns TRUE. This is an indication that while the data is exactly the same between both objects, there is something a little different about how R is storing the objects (identical is very strict). We don’t need to worry about that now, but I’m pointing it out as you may run into places where this causes errors that are difficult to interpret. For now, we’ll just be excited that the data is in R!
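If you want to build some intuition for how strict identical is, here is a small sketch using plain numbers. This is not necessarily the reason the two election objects differ; it only illustrates that identical compares how values are stored while all.equal compares the values themselves (within a small tolerance).

```r
# The same value stored two different ways: integer vs. double
as_integer <- 1L
as_double <- 1

identical(as_integer, as_double)          # strict: storage type must match too
isTRUE(all.equal(as_integer, as_double))  # lenient: compares the values

# Floating point arithmetic is another common source of disagreement
identical(0.1 + 0.2, 0.3)
isTRUE(all.equal(0.1 + 0.2, 0.3))
```

Wrapping all.equal in isTRUE is a common habit because all.equal returns a description of the differences (not FALSE) when the objects do not match.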

You download it

Occasionally, you’ll find data that is directly downloadable from a webpage (meaning the web address points directly to a .csv or .txt file). When that’s the case, you can still use the read_ functions to download the data and assign it to an object. Like this:

Code

election_data_web <- read_csv("https://raw.githubusercontent.com/BSU-Spatial-Data-In-R-Fall2025/inclass-04/refs/heads/main/data/original/fec_reciepts.csv")

Rows: 233 Columns: 78
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (40): committee_id, committee_name, report_type, filing_form, line_numb...
dbl   (9): report_year, image_number, link_id, file_number, contribution_rec...
lgl  (27): contributor_prefix, recipient_committee_org_type, is_individual, ...
dttm  (2): contribution_receipt_date, load_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note

When you call the webpage inside of read_ the data is not automatically saved. If you’ve assigned it to an object (like election_data_web) it will be stored there until you decide to save it (which we’ll do next week). If you haven’t assigned it to an object, the data will just be printed to the screen.

There are more complicated workflows for .zip files (using download.file) or Google Drive files (using the googledrive package) which we’ll introduce later in the course. Those approaches add a bit more syntax on the front-end to get the file into your data/original folder, but after that the read_ step is the same.
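The general shape of that download-then-read pattern can be sketched as follows. To keep the example runnable without an internet connection, the "remote" file here is a made-up CSV in a temporary folder accessed through a file:// address (written with macOS/Linux-style paths in mind); in practice source_url would be an https:// address and the destination would be your real data/original folder. All file names are hypothetical.

```r
library(readr)

# Stand-in for a remote file: a small CSV written to a temporary location
remote_stand_in <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:3, value = c(10, 20, 30)), remote_stand_in)
source_url <- paste0("file://", remote_stand_in)

# Download into a (temporary) data/original folder...
original_dir <- file.path(tempdir(), "data", "original")
dir.create(original_dir, recursive = TRUE, showWarnings = FALSE)
local_copy <- file.path(original_dir, "downloaded_example.csv")
download.file(source_url, destfile = local_copy, quiet = TRUE)

# ...and from here the read_ step is the same as for any local file
downloaded_data <- read_csv(local_copy, show_col_types = FALSE)
```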

The more common way of downloading data from the web is via Application Programming Interfaces (APIs). Although there are lots of APIs in the world, the typical application for getting data is a web service that expects a particular set of inputs and then returns (possibly for download) a set of outputs matching your query. For example, the US Census has an API that allows you to access all of the Decennial Census and American Community Survey data by providing the state, county, year, and dataset that you are interested in. There are a lot of R packages designed to make these API calls easier. For example, the tidycensus package allows easy downloading of Census data, the FedData package allows you to download a variety of federally created spatial datasets, and the elevatr package allows easy download of global elevation datasets. We’ll explore these more in the future, but for now, we’ll use a simple example with the tigris package. The tigris package is a means of accessing the US TIGER (Topologically Integrated Geographic Encoding and Referencing) files. The TIGER datasets contain US roads, state and county boundaries, and a variety of other data related to the US Census. Here’s a simple bit of code to download Idaho county boundaries.

Code

id_counties <- counties(state="ID", year = 2024, progress_bar = FALSE)

id_counties

Simple feature collection with 44 features and 18 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -117.243 ymin: 41.98818 xmax: -111.0435 ymax: 49.00085
Geodetic CRS:  NAD83
First 10 features:
    STATEFP COUNTYFP COUNTYNS GEOID        GEOIDFQ       NAME          NAMELSAD
63       16      051 00399751 16051 0500000US16051  Jefferson  Jefferson County
150      16      055 00395661 16055 0500000US16055   Kootenai   Kootenai County
190      16      045 00395442 16045 0500000US16045        Gem        Gem County
228      16      041 00395585 16041 0500000US16041   Franklin   Franklin County
234      16      065 00394803 16065 0500000US16065    Madison    Madison County
258      16      033 00399755 16033 0500000US16033      Clark      Clark County
306      16      053 00395662 16053 0500000US16053     Jerome     Jerome County
309      16      037 00399758 16037 0500000US16037     Custer     Custer County
457      16      019 00395407 16019 0500000US16019 Bonneville Bonneville County
574      16      049 00395699 16049 0500000US16049      Idaho      Idaho County
    LSAD CLASSFP MTFCC CSAFP CBSAFP METDIVFP FUNCSTAT       ALAND    AWATER
63    06      H1 G4020   292  26820     <NA>        A  2832619467  31166194
150   06      H1 G4020   518  17660     <NA>        A  3205836476 184444918
190   06      H1 G4020   147  14260     <NA>        A  1449806864  12481361
228   06      H1 G4020  <NA>  30860     <NA>        A  1717211138  12153472
234   06      H1 G4020   292  39940     <NA>        A  1215396274  10500950
258   06      H1 G4020  <NA>   <NA>     <NA>        A  4566524479   2476453
306   06      H1 G4020  <NA>  46300     <NA>        A  1547629732  12908802
309   06      H1 G4020  <NA>   <NA>     <NA>        A 12748378762  42527190
457   06      H1 G4020   292  26820     <NA>        A  4832814957  88912977
574   06      H1 G4020  <NA>   <NA>     <NA>        A 21956620663  67532942
       INTPTLAT     INTPTLON                       geometry
63  +43.7969649 -112.3185879 MULTIPOLYGON (((-111.8045 4...
150 +47.6759569 -116.6959192 MULTIPOLYGON (((-117.0423 4...
190 +44.0614727 -116.3987839 MULTIPOLYGON (((-116.7124 4...
228 +42.1736093 -111.8229653 MULTIPOLYGON (((-111.9336 4...
234 +43.7886140 -111.6569925 MULTIPOLYGON (((-111.9835 4...
258 +44.2902180 -112.3546128 MULTIPOLYGON (((-112.3085 4...
306 +42.6913953 -114.2620858 MULTIPOLYGON (((-113.932 42...
309 +44.2733510 -114.2522675 MULTIPOLYGON (((-115.305 44...
457 +43.3951708 -111.6218783 MULTIPOLYGON (((-112.5201 4...
574 +45.8496440 -115.4673371 MULTIPOLYGON (((-116.4809 4...

Here we are providing a state and a year, which the counties function converts into a request to the Census website. There are more complicated versions of this that we’ll explore down the road once you’re more comfortable with the spatial packages.

It comes with a package

One final option for obtaining data is that it “ships” with a package. That is, when you install the package, you get the data along with the functions. You’re not likely to use this much for your own analysis, but it can be critical when you’re trying to get help with a coding problem. Most help sites (e.g., Stack Overflow, Posit Community) require a minimal reproducible example. Minimal reproducible examples allow others to diagnose your coding problem without you having to share your dataset and without them needing to run all of the cleanup steps. You can type library(help = "datasets") to get a list of a variety of example datasets. We’ll load the iris dataset here just so you can see how it works should you need it.

Code

data(iris)
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
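If you do end up asking for help, one way to share a small piece of data is sketched below. It uses only base R: data(package = "datasets") lists what ships with the datasets package, and dput prints code that recreates an object so that someone else can paste it into their own session. The object names (available, iris_small) are just placeholders.

```r
# List the datasets that ship with the base "datasets" package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Take a small slice of a built-in dataset for your example
iris_small <- head(iris, 3)

# dput() prints code that recreates the object exactly
dput(iris_small)
```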

Checking yourself (or at least `R`)

Ok, so you’ve got your data into R. The first thing you need to do is to make sure that the import was successful. Before you do anything else in R, it’s worth familiarizing yourself with the metadata (data about the data) and building your intuition for how the data should look. Let’s take a look at the page where the data was exported.

Did you get it all?

The first thing we might want to check is whether we actually got all of the data. Based on a quick look at the page, it appears that there are 233 observations. We can check that our data matches that (and count the columns while we’re at it) by using the dim function (short for “dimensions”).

Code

dim(election_data_1)

[1] 233  78

This returns the number of rows and then columns. Based on this we can see that there are 233 observations (rows) and 78(!) columns. Where did all of these columns come from? It’s not immediately obvious. You’ll notice that if you click on an individual record there’s an option to “View Image”. If you do that, you’ll see the actual FEC form that is entered into the database. That form actually has 78 boxes, so it would appear that we’re good on that front.
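If you know the dimensions you expect, you can also have R do the comparison for you. Here is a small sketch using the built-in iris data (150 rows, 5 columns) rather than the election data; expected_rows and expected_cols are placeholder names for numbers you would take from the metadata or download page.

```r
# Expected dimensions, e.g. taken from the metadata or the download page
expected_rows <- 150
expected_cols <- 5

# nrow() and ncol() return the two pieces of dim() separately
nrow(iris)
ncol(iris)

# stopifnot() throws an error if the import does not match expectations
stopifnot(nrow(iris) == expected_rows, ncol(iris) == expected_cols)
```

Putting a check like this near the top of a script means a truncated or changed file gets caught before any analysis runs.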

Is it meaningful?

One of the things that readr::read_ does is to try to parse the columns in your data and assign each one to a particular data type. That doesn’t guarantee, however, that it got it right. We should inspect that. The first thing we might do is take a look at the column names (using the colnames function from base R).

Code

colnames(election_data_1)

 [1] "committee_id"                         
 [2] "committee_name"                       
 [3] "report_year"                          
 [4] "report_type"                          
 [5] "image_number"                         
 [6] "filing_form"                          
 [7] "link_id"                              
 [8] "line_number"                          
 [9] "transaction_id"                       
[10] "file_number"                          
[11] "entity_type"                          
[12] "entity_type_desc"                     
[13] "unused_contbr_id"                     
[14] "contributor_prefix"                   
[15] "contributor_name"                     
[16] "recipient_committee_type"             
[17] "recipient_committee_org_type"         
[18] "recipient_committee_designation"      
[19] "contributor_first_name"               
[20] "contributor_middle_name"              
[21] "contributor_last_name"                
[22] "contributor_suffix"                   
[23] "contributor_street_1"                 
[24] "contributor_street_2"                 
[25] "contributor_city"                     
[26] "contributor_state"                    
[27] "contributor_zip"                      
[28] "contributor_employer"                 
[29] "contributor_occupation"               
[30] "contributor_id"                       
[31] "is_individual"                        
[32] "receipt_type"                         
[33] "receipt_type_desc"                    
[34] "receipt_type_full"                    
[35] "memo_code"                            
[36] "memo_code_full"                       
[37] "memo_text"                            
[38] "contribution_receipt_date"            
[39] "contribution_receipt_amount"          
[40] "contributor_aggregate_ytd"            
[41] "candidate_id"                         
[42] "candidate_name"                       
[43] "candidate_first_name"                 
[44] "candidate_last_name"                  
[45] "candidate_middle_name"                
[46] "candidate_prefix"                     
[47] "candidate_suffix"                     
[48] "candidate_office"                     
[49] "candidate_office_full"                
[50] "candidate_office_state"               
[51] "candidate_office_state_full"          
[52] "candidate_office_district"            
[53] "conduit_committee_id"                 
[54] "conduit_committee_name"               
[55] "conduit_committee_street1"            
[56] "conduit_committee_street2"            
[57] "conduit_committee_city"               
[58] "conduit_committee_state"              
[59] "conduit_committee_zip"                
[60] "donor_committee_name"                 
[61] "national_committee_nonfederal_account"
[62] "election_type"                        
[63] "election_type_full"                   
[64] "fec_election_type_desc"               
[65] "fec_election_year"                    
[66] "two_year_transaction_period"          
[67] "amendment_indicator"                  
[68] "amendment_indicator_desc"             
[69] "schedule_type"                        
[70] "schedule_type_full"                   
[71] "increased_limit"                      
[72] "load_date"                            
[73] "sub_id"                               
[74] "original_sub_id"                      
[75] "back_reference_transaction_id"        
[76] "back_reference_schedule_name"         
[77] "pdf_url"                              
[78] "line_number_label"

Nothing seems obviously wrong here. The names all seem readable and broken into distinct categories, which suggests that the delimiter worked and we didn’t end up with oddball columns.
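For comparison, here is a sketch of what the column names can look like when the delimiter is wrong. The data is made-up text passed with I() (assuming readr 2.0 or later), and the column names are hypothetical.

```r
library(readr)

comma_text <- I("site,count,observer\nA,4,JS\nB,7,KL\n")

# Correct delimiter: three separate columns
right_delim <- read_delim(comma_text, delim = ",", show_col_types = FALSE)
colnames(right_delim)

# Wrong delimiter: no tabs are found, so everything lands in a single column
wrong_delim <- read_delim(comma_text, delim = "\t", show_col_types = FALSE)
colnames(wrong_delim)
ncol(wrong_delim)
```

A single column whose name contains what look like several names run together is a strong hint that the delim argument does not match the file.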

Does `R` recognize it?

Now to take a look at whether readr::read_ correctly guessed the type of data contained in each column. We can use dplyr::glimpse from the tidyverse or str (short for structure) from base R to get a quick look at the data.

Code

glimpse(election_data_1)

Rows: 233
Columns: 78
$ committee_id                          <chr> "C00331397", "C00331397", "C0033…
$ committee_name                        <chr> "SIMPSON FOR CONGRESS", "SIMPSON…
$ report_year                           <dbl> 2025, 2025, 2025, 2025, 2025, 20…
$ report_type                           <chr> "Q1", "Q1", "Q1", "Q1", "Q2", "Q…
$ image_number                          <dbl> 2.025041e+17, 2.025041e+17, 2.02…
$ filing_form                           <chr> "F3", "F3", "F3", "F3", "F3", "F…
$ link_id                               <dbl> 4.04082e+18, 4.04082e+18, 4.0408…
$ line_number                           <chr> "11AI", "11AI", "11AI", "11AI", …
$ transaction_id                        <chr> "AF23BB60932664D8EACA", "A7BD67B…
$ file_number                           <dbl> 1884461, 1884461, 1884461, 18844…
$ entity_type                           <chr> "IND", "ORG", "IND", "ORG", "IND…
$ entity_type_desc                      <chr> "INDIVIDUAL", "ORGANIZATION", "I…
$ unused_contbr_id                      <chr> "C00694323", NA, "C00694323", NA…
$ contributor_prefix                    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ contributor_name                      <chr> "BINGER, KEVIN", "WINRED", "SLAT…
$ recipient_committee_type              <chr> "H", "H", "H", "H", "H", "H", "H…
$ recipient_committee_org_type          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ recipient_committee_designation       <chr> "P", "P", "P", "P", "P", "P", "P…
$ contributor_first_name                <chr> "KEVIN", NA, "LINDSAY", NA, "MIT…
$ contributor_middle_name               <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ contributor_last_name                 <chr> "BINGER", NA, "SLATER", NA, "BUT…
$ contributor_suffix                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ contributor_street_1                  <chr> "12910 CREAMERY HILL DR", "PO BO…
$ contributor_street_2                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ contributor_city                      <chr> "GERMANTOWN", "ARLINGTON", "WASH…
$ contributor_state                     <chr> "MD", "VA", "DC", "VA", "MD", "V…
$ contributor_zip                       <chr> "208746338", "222191891", "20002…
$ contributor_employer                  <chr> "CASSIDY & ASSOCIATES", NA, "TRO…
$ contributor_occupation                <chr> "SENIOR VICE PRESIDENT", NA, "VI…
$ contributor_id                        <chr> "C00694323", NA, "C00694323", NA…
$ is_individual                         <lgl> TRUE, FALSE, TRUE, FALSE, TRUE, …
$ receipt_type                          <chr> "15E", NA, "15E", NA, "15E", NA,…
$ receipt_type_desc                     <chr> "EARMARKED CONTRIBUTION", NA, "E…
$ receipt_type_full                     <chr> "EARMARKED (NON-DIRECTED) THROUG…
$ memo_code                             <chr> NA, "X", NA, "X", NA, "X", NA, "…
$ memo_code_full                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ memo_text                             <chr> NA, "TOTAL EARMARKED THROUGH CON…
$ contribution_receipt_date             <dttm> 2025-02-05, 2025-02-05, 2025-02…
$ contribution_receipt_amount           <dbl> 250, 250, 250, 250, 250, 250, 25…
$ contributor_aggregate_ytd             <dbl> 250, 31900, 250, 31900, 250, 554…
$ candidate_id                          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_name                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_first_name                  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_last_name                   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_middle_name                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_prefix                      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_suffix                      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_office                      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_office_full                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_office_state                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_office_state_full           <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ candidate_office_district             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ conduit_committee_id                  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ conduit_committee_name                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ conduit_committee_street1             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ conduit_committee_street2             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ conduit_committee_city                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ conduit_committee_state               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ conduit_committee_zip                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ donor_committee_name                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, …
$ national_committee_nonfederal_account <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ election_type                         <chr> "P2026", "P2026", "P2026", "P202…
$ election_type_full                    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ fec_election_type_desc                <chr> "PRIMARY", "PRIMARY", "PRIMARY",…
$ fec_election_year                     <dbl> 2026, 2026, 2026, 2026, 2026, 20…
$ two_year_transaction_period           <dbl> 2026, 2026, 2026, 2026, 2026, 20…
$ amendment_indicator                   <chr> "A", "A", "A", "A", "A", "A", "A…
$ amendment_indicator_desc              <chr> "ADD", "ADD", "ADD", "ADD", "ADD…
$ schedule_type                         <chr> "SA", "SA", "SA", "SA", "SA", "S…
$ schedule_type_full                    <chr> "ITEMIZED RECEIPTS", "ITEMIZED R…
$ increased_limit                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ load_date                             <dttm> 2025-04-12 04:16:48, 2025-04-12…
$ sub_id                                <dbl> 4.04112e+18, 4.04112e+18, 4.0411…
$ original_sub_id                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, …
$ back_reference_transaction_id         <chr> NA, "AF23BB60932664D8EACA", NA, …
$ back_reference_schedule_name          <chr> NA, "SA11AI", NA, "SA11AI", NA, …
$ pdf_url                               <chr> "https://docquery.fec.gov/cgi-bi…
$ line_number_label                     <chr> "Contributions From Individuals/…

You’ll notice that each column name shows up after a $, followed by the data type enclosed in <>, followed by the first few observations from the dataset. The nice part about glimpse is the colored highlighting of NA values, which can help draw your attention to potential mistakes. For now, we’re going to focus on the data types. You’ll see <chr> for many of the columns, indicating that the data in those columns is of the character data type. You’ll also see <dbl>, indicating that the column is a numeric data type stored as a “double” (a double-precision floating point number), which determines how precisely R can track a numeric value. You’ll also notice the <lgl> or logical data type, referring to data that is either TRUE or FALSE, and <dttm> for dates and times.

One thing you might also notice is that for many columns the first set of entries that glimpse shows us are entirely NA. When you’re working with your own data, hopefully you’ll know whether or not NAs are appropriate, but because this is “found” data, it’s a good idea to check a few of the images to see whether these NAs are expected.

The other thing you’ll notice is that a variety of “ID” fields (e.g., sub_id, link_id, image_number, file_number) were parsed as a double data type. This could be a problem, as ID columns are often meant to denote individuals and so act as a label (i.e., character) rather than a number. If there are leading 0’s (e.g., 00134), R will drop those when converting the value to a numeric data type. Because we’re just looking at the data, we won’t worry about it now, but I wanted to draw your attention to it.
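Here is a small sketch of why numeric IDs can cause trouble and one way to avoid it. The ID values are made up for illustration (they are not taken from the FEC file), and the col_types approach shown is one option among several.

```r
library(readr)

# Converting an ID to a number silently drops leading zeros
as.numeric("00134")

# Very long IDs cannot always be stored exactly as doubles (made-up ID)
long_id <- "202504119753312345"
sprintf("%.0f", as.numeric(long_id)) == long_id

# Declaring the column type up front keeps the ID as a label
id_text <- I("sub_id,amount\n00134,250\n00207,500\n")
id_data <- read_csv(
  id_text,
  col_types = cols(sub_id = col_character(), amount = col_double())
)
id_data$sub_id
```

Supplying col_types also quiets the column specification message, since readr no longer has to guess.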

Exploring your new data

Okay, you’ve got the data into R and you’ve checked that things look correct. Now it’s time to get a sense for what the data actually has in it. We might want some information on the distribution of numeric values, the frequencies of categorical values, and maybe to look for values that fall outside of the expected range.
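Those three checks can be sketched with base R on a tiny made-up data frame. The contributions object and its values are hypothetical, and the rule that amounts should be positive is an assumption for the sake of the example.

```r
# Made-up contributions data for illustration
contributions <- data.frame(
  entity_type = c("IND", "ORG", "IND", "PAC", "IND"),
  amount = c(250, 1000, 500, 5000, -20)
)

# Distribution of a numeric column
summary(contributions$amount)

# Frequencies of a categorical column
table(contributions$entity_type)

# Values outside the expected range (assuming amounts should be positive)
contributions[contributions$amount <= 0, ]
```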

Basic stats

One quick (and ugly) way to get a sense for the range of values in your data (along with summary stats for numeric data) is to use the summary function.

Code

summary(election_data_1)

 committee_id       committee_name      report_year   report_type       
 Length:233         Length:233         Min.   :2025   Length:233        
 Class :character   Class :character   1st Qu.:2025   Class :character  
 Mode  :character   Mode  :character   Median :2025   Mode  :character  
                                       Mean   :2025                     
                                       3rd Qu.:2025                     
                                       Max.   :2025                     
  image_number       filing_form           link_id          line_number       
 Min.   :2.025e+17   Length:233         Min.   :4.041e+18   Length:233        
 1st Qu.:2.025e+17   Class :character   1st Qu.:4.041e+18   Class :character  
 Median :2.025e+17   Mode  :character   Median :4.041e+18   Mode  :character  
 Mean   :2.025e+17                      Mean   :4.053e+18                     
 3rd Qu.:2.025e+17                      3rd Qu.:4.071e+18                     
 Max.   :2.025e+17                      Max.   :4.071e+18                     
 transaction_id      file_number      entity_type        entity_type_desc  
 Length:233         Min.   :1884461   Length:233         Length:233        
 Class :character   1st Qu.:1884461   Class :character   Class :character  
 Mode  :character   Median :1884461   Mode  :character   Mode  :character  
                    Mean   :1891304                                        
                    3rd Qu.:1901069                                        
                    Max.   :1901069                                        
 unused_contbr_id   contributor_prefix contributor_name  
 Length:233         Mode:logical       Length:233        
 Class :character   NA's:233           Class :character  
 Mode  :character                      Mode  :character  
                                                         
                                                         
                                                         
 recipient_committee_type recipient_committee_org_type
 Length:233               Mode:logical                
 Class :character         NA's:233                    
 Mode  :character                                     
                                                      
                                                      
                                                      
 recipient_committee_designation contributor_first_name contributor_middle_name
 Length:233                      Length:233             Length:233             
 Class :character                Class :character       Class :character       
 Mode  :character                Mode  :character       Mode  :character       
                                                                               
                                                                               
                                                                               
 contributor_last_name contributor_suffix contributor_street_1
 Length:233            Length:233         Length:233          
 Class :character      Class :character   Class :character    
 Mode  :character      Mode  :character   Mode  :character    
                                                              
                                                              
                                                              
 contributor_street_2 contributor_city   contributor_state  contributor_zip   
 Length:233           Length:233         Length:233         Length:233        
 Class :character     Class :character   Class :character   Class :character  
 Mode  :character     Mode  :character   Mode  :character   Mode  :character  
                                                                              
                                                                              
                                                                              
 contributor_employer contributor_occupation contributor_id     is_individual  
 Length:233           Length:233             Length:233         Mode :logical  
 Class :character     Class :character       Class :character   FALSE:147      
 Mode  :character     Mode  :character       Mode  :character   TRUE :86       
                                                                               
                                                                               
                                                                               
 receipt_type       receipt_type_desc  receipt_type_full   memo_code        
 Length:233         Length:233         Length:233         Length:233        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 memo_code_full  memo_text         contribution_receipt_date    
 Mode:logical   Length:233         Min.   :2025-01-08 00:00:00  
 NA's:233       Class :character   1st Qu.:2025-03-10 00:00:00  
                Mode  :character   Median :2025-03-29 00:00:00  
                                   Mean   :2025-04-09 11:38:22  
                                   3rd Qu.:2025-05-23 00:00:00  
                                   Max.   :2025-06-30 00:00:00  
 contribution_receipt_amount contributor_aggregate_ytd candidate_id  
 Min.   : 250                Min.   :  250             Mode:logical  
 1st Qu.: 500                1st Qu.: 1000             NA's:233      
 Median :1000                Median : 2000                           
 Mean   :1479                Mean   :10836                           
 3rd Qu.:2000                3rd Qu.: 7500                           
 Max.   :5000                Max.   :55400                           
 candidate_name candidate_first_name candidate_last_name candidate_middle_name
 Mode:logical   Mode:logical         Mode:logical        Mode:logical         
 NA's:233       NA's:233             NA's:233            NA's:233             
                                                                              
                                                                              
                                                                              
                                                                              
 candidate_prefix candidate_suffix candidate_office candidate_office_full
 Mode:logical     Mode:logical     Mode:logical     Mode:logical         
 NA's:233         NA's:233         NA's:233         NA's:233             
                                                                         
                                                                         
                                                                         
                                                                         
 candidate_office_state candidate_office_state_full candidate_office_district
 Mode:logical           Mode:logical                Mode:logical             
 NA's:233               NA's:233                    NA's:233                 
                                                                             
                                                                             
                                                                             
                                                                             
 conduit_committee_id conduit_committee_name conduit_committee_street1
 Mode:logical         Mode:logical           Mode:logical             
 NA's:233             NA's:233               NA's:233                 
                                                                      
                                                                      
                                                                      
                                                                      
 conduit_committee_street2 conduit_committee_city conduit_committee_state
 Mode:logical              Mode:logical           Mode:logical           
 NA's:233                  NA's:233               NA's:233               
                                                                         
                                                                         
                                                                         
                                                                         
 conduit_committee_zip donor_committee_name
 Mode:logical          Length:233          
 NA's:233              Class :character    
                       Mode  :character    
                                           
                                           
                                           
 national_committee_nonfederal_account election_type      election_type_full
 Mode:logical                          Length:233         Mode:logical      
 NA's:233                              Class :character   NA's:233          
                                       Mode  :character                     
                                                                            
                                                                            
                                                                            
 fec_election_type_desc fec_election_year two_year_transaction_period
 Length:233             Min.   :2026      Min.   :2026               
 Class :character       1st Qu.:2026      1st Qu.:2026               
 Mode  :character       Median :2026      Median :2026               
                        Mean   :2026      Mean   :2026               
                        3rd Qu.:2026      3rd Qu.:2026               
                        Max.   :2026      Max.   :2026               
 amendment_indicator amendment_indicator_desc schedule_type     
 Length:233          Length:233               Length:233        
 Class :character    Class :character         Class :character  
 Mode  :character    Mode  :character         Mode  :character  
                                                                
                                                                
                                                                
 schedule_type_full increased_limit   load_date                  
 Length:233         Mode:logical    Min.   :2025-04-12 04:16:48  
 Class :character   NA's:233        1st Qu.:2025-04-12 04:16:48  
 Mode  :character                   Median :2025-04-12 04:16:48  
                                    Mean   :2025-05-21 17:29:56  
                                    3rd Qu.:2025-07-17 04:06:49  
                                    Max.   :2025-07-17 04:06:49  
     sub_id          original_sub_id back_reference_transaction_id
 Min.   :4.041e+18   Mode:logical    Length:233                   
 1st Qu.:4.041e+18   NA's:233        Class :character             
 Median :4.041e+18                   Mode  :character             
 Mean   :4.054e+18                                                
 3rd Qu.:4.072e+18                                                
 Max.   :4.072e+18                                                
 back_reference_schedule_name   pdf_url          line_number_label 
 Length:233                   Length:233         Length:233        
 Class :character             Class :character   Class :character  
 Mode  :character             Mode  :character   Mode  :character

We can look at the contribution_receipt_amount and the contributor_aggregate_ytd columns, as those are the only true numeric values. The Max contribution amount for any single observation is 5000, whereas the Max aggregate amount is 55400. Looking at the aggregate value a bit more, you’ll see that the Mean (10836) is actually larger than the 3rd quartile (7500), which suggests that there are a handful of very large contributors relative to the rest. If you are planning to analyze this data, you’ll have to think carefully about how to deal with those observations.
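One quick way to confirm that kind of skew is to compare the mean with the median and the 3rd quartile directly. The sketch below uses a small made-up vector (aggregate_ytd is hypothetical, not the real FEC values) so that it runs on its own; with the lesson data you would presumably pass election_data_1$contributor_aggregate_ytd instead. The 1.5 * IQR fence is just the common boxplot rule of thumb, not a method this lesson prescribes.

```r
# Hypothetical values standing in for contributor_aggregate_ytd
aggregate_ytd <- c(250, 500, 1000, 1000, 2000, 2000, 5000, 7500, 50000, 55400)

avg <- mean(aggregate_ytd)
med <- median(aggregate_ytd)
q3  <- unname(quantile(aggregate_ytd, 0.75))

# A mean above the 3rd quartile signals a few very large values
mean_above_q3 <- avg > q3

# Flag observations above the usual 1.5 * IQR boxplot fence
upper_fence <- q3 + 1.5 * IQR(aggregate_ytd)
large_contributors <- aggregate_ytd[aggregate_ytd > upper_fence]

print(c(mean = avg, median = med, q3 = q3))
print(large_contributors)
```

Whether you keep, cap, or separately model the flagged values depends on the question you are asking, so treat the flag as a prompt to look closer and not as a rule for dropping rows.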

Oddballs

The other thing that summary can help with is identifying places where the data are incorrect. For example, this file should only cover the current fiscal year, so it is not surprising that all of the report_year values are the same. You might also know that individual contributions are capped at 5000, and the Max of contribution_receipt_amount from the previous check is consistent with that. Finally, summary returns the number of NAs in each logical, numeric, and date column (character columns only report their Length, Class, and Mode).
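These checks can also be written out explicitly, which helps when a data frame has too many columns to scan by eye. The following is a minimal sketch on a made-up data frame (contributions and its values are hypothetical; only the column names mirror the FEC file), and the 5000 cap is taken from the text above.

```r
# Hypothetical toy data; column names mirror the FEC file
contributions <- data.frame(
  report_year = c(2025, 2025, 2025, 2025),
  contribution_receipt_amount = c(250, 5000, 1000, 2500),
  candidate_id = c(NA, NA, NA, NA),
  memo_text = c("note", NA, NA, "note")
)

# Number of missing values in each column (works for character columns too)
na_counts <- colSums(is.na(contributions))

# Columns that are entirely NA carry no information
empty_cols <- names(na_counts)[na_counts == nrow(contributions)]

# Does every row share a single reporting year?
single_year <- length(unique(contributions$report_year)) == 1

# Rows where a single contribution exceeds the 5000 cap
over_cap <- contributions[contributions$contribution_receipt_amount > 5000, ]

print(na_counts)
print(empty_cols)
```

Unlike summary, colSums(is.na(...)) counts missing values in character columns as well, so it is a useful complement when most of your columns are text.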

Data frequency

One last thing we might check before we decide what we want to do with this data is the frequency of different categories. We can use table to look at a few of those things. We might start with how many of these contributions are from individuals.

Code

table(election_data_1$is_individual)


FALSE  TRUE 
  147    86

Based on this we can see that contributions from groups outnumber those from individuals by roughly 1.7 to 1 (147 to 86). What other categorical variables might you look at?
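As one possible way to extend this check, table can also report missing values as their own category, and prop.table converts counts into shares. The sketch below runs on a made-up data frame (contributions and its values are hypothetical); with the lesson data you would swap in election_data_1 and whichever categorical columns interest you.

```r
# Hypothetical toy data; the real columns live in election_data_1
contributions <- data.frame(
  is_individual = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  contributor_state = c("ID", "ID", "WA", NA, "ID")
)

# Counts per state, keeping NA as its own category when present
state_counts <- table(contributions$contributor_state, useNA = "ifany")

# Share of contributions in each is_individual category
individual_share <- prop.table(table(contributions$is_individual))

# Two-way table: state by contributor type
state_by_type <- table(
  contributions$contributor_state,
  contributions$is_individual,
  useNA = "ifany"
)

print(state_counts)
print(individual_share)
print(state_by_type)
```

By default table silently drops NA values, so setting useNA is worth the extra argument whenever you are checking a fresh import.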

That’s all for now. We’ll learn more complicated ways to evaluate and modify data in the coming weeks, but this is a standard “gut-check” anytime you’re bringing data into R.
