
Is this the original revised data or the revised revised data?

Keeping track of the provenance of data can be a challenge, especially when drawing on published sources. Recording the origin, the date accessed, the transformations applied (e.g., converting from .xls to .csv, converting character strings such as “$1,250,321.21” to floats, or converting date strings to date objects), subsequent changes, who handled the data object, and where it can be found in a repository all enhance the analyst’s own ability to reproduce results.
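As a minimal sketch of the kind of transformation worth recording, the base R snippet below converts a currency string and a date string. The variable names, the date value and its format are assumptions made for illustration.

```r
# Hypothetical raw values as they might arrive from a spreadsheet export
raw_amount <- "$1,250,321.21"
raw_date   <- "2015-08-01"

# Strip the dollar sign and thousands separators, then convert to numeric
amount <- as.numeric(gsub("[$,]", "", raw_amount))

# Parse the ISO-formatted string into a Date object
accessed <- as.Date(raw_date, format = "%Y-%m-%d")
```

Each of these steps is a candidate for a line in the metadata record, since neither leaves a trace in the resulting object.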

Unfortunately, notes go missing, files get misfiled, and research is exposed to plenty of other hazards. Often, one wishes for R objects with built-in metadata to guard against them.

Using mostattributes() to do this (now preferred)

Update 2015-08-01: Scott Chamberlain at ropensci.org brought attr to my attention, which is the built-in way I was looking for originally. He also pointed me to EML, a much more elaborate approach suited for publication projects.

A minimal example

  • Create data frame and a separate metadata list
fips <- read.csv("https://tuva.s3-us-west-2.amazonaws.com/state_fips_postal.csv", header = FALSE)
fips
##                      V1 V2 V3
## 1               Alabama  1 AL
## 2                Alaska  2 AK
## 3               Arizona  4 AZ
## 4              Arkansas  5 AR
## 5            California  6 CA
## 6              Colorado  8 CO
## 7           Connecticut  9 CT
## 8              Delaware 10 DE
## 9  District of Columbia 11 DC
## 10              Florida 12 FL
## 11              Georgia 13 GA
## 12               Hawaii 15 HI
## 13                Idaho 16 ID
## 14             Illinois 17 IL
## 15              Indiana 18 IN
## 16                 Iowa 19 IA
## 17               Kansas 20 KS
## 18             Kentucky 21 KY
## 19            Louisiana 22 LA
## 20                Maine 23 ME
## 21             Maryland 24 MD
## 22        Massachusetts 25 MA
## 23             Michigan 26 MI
## 24            Minnesota 27 MN
## 25          Mississippi 28 MS
## 26             Missouri 29 MO
## 27              Montana 30 MT
## 28             Nebraska 31 NE
## 29               Nevada 32 NV
## 30        New Hampshire 33 NH
## 31           New Jersey 34 NJ
## 32           New Mexico 35 NM
## 33             New York 36 NY
## 34       North Carolina 37 NC
## 35         North Dakota 38 ND
## 36                 Ohio 39 OH
## 37             Oklahoma 40 OK
## 38               Oregon 41 OR
## 39         Pennsylvania 42 PA
## 40         Rhode Island 44 RI
## 41       South Carolina 45 SC
## 42         South Dakota 46 SD
## 43            Tennessee 47 TN
## 44                Texas 48 TX
## 45                 Utah 49 UT
## 46              Vermont 50 VT
## 47             Virginia 51 VA
## 48           Washington 53 WA
## 49        West Virginia 54 WV
## 50            Wisconsin 55 WI
## 51              Wyoming 56 WY
names(fips) = c("state", "fip", "id")
require(rjson) # easier to use JSON to write metadata
## Loading required package: rjson
meta <- fromJSON(file = "data/meta.json")
meta
## $Accessed
## [1] "2015-08-01"
## 
## $Analyst
## [1] "Richard Careaga"
## 
## $Contact
## [1] "technocrat@twitter"
## 
## $Preprocessing
## [1] "2015 U.S. police caused deaths through July, reported by The Guardian"
## 
## $Source
## [1] "http://www.theguardian.com/thecounted"

Associate the metadata with the data frame using attr

# invisibly with attr
x <- fips
attr(x, "meta") <- meta

By default, metadata is not displayed

x
##                   state fip id
## 1               Alabama   1 AL
## 2                Alaska   2 AK
## 3               Arizona   4 AZ
## 4              Arkansas   5 AR
## 5            California   6 CA
## 6              Colorado   8 CO
## 7           Connecticut   9 CT
## 8              Delaware  10 DE
## 9  District of Columbia  11 DC
## 10              Florida  12 FL
## 11              Georgia  13 GA
## 12               Hawaii  15 HI
## 13                Idaho  16 ID
## 14             Illinois  17 IL
## 15              Indiana  18 IN
## 16                 Iowa  19 IA
## 17               Kansas  20 KS
## 18             Kentucky  21 KY
## 19            Louisiana  22 LA
## 20                Maine  23 ME
## 21             Maryland  24 MD
## 22        Massachusetts  25 MA
## 23             Michigan  26 MI
## 24            Minnesota  27 MN
## 25          Mississippi  28 MS
## 26             Missouri  29 MO
## 27              Montana  30 MT
## 28             Nebraska  31 NE
## 29               Nevada  32 NV
## 30        New Hampshire  33 NH
## 31           New Jersey  34 NJ
## 32           New Mexico  35 NM
## 33             New York  36 NY
## 34       North Carolina  37 NC
## 35         North Dakota  38 ND
## 36                 Ohio  39 OH
## 37             Oklahoma  40 OK
## 38               Oregon  41 OR
## 39         Pennsylvania  42 PA
## 40         Rhode Island  44 RI
## 41       South Carolina  45 SC
## 42         South Dakota  46 SD
## 43            Tennessee  47 TN
## 44                Texas  48 TX
## 45                 Utah  49 UT
## 46              Vermont  50 VT
## 47             Virginia  51 VA
## 48           Washington  53 WA
## 49        West Virginia  54 WV
## 50            Wisconsin  55 WI
## 51              Wyoming  56 WY

Metadata has to be invoked by name

attr(x, "meta")
## $Accessed
## [1] "2015-08-01"
## 
## $Analyst
## [1] "Richard Careaga"
## 
## $Contact
## [1] "technocrat@twitter"
## 
## $Preprocessing
## [1] "2015 U.S. police caused deaths through July, reported by The Guardian"
## 
## $Source
## [1] "http://www.theguardian.com/thecounted"
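A minimal sketch of what the attribute assignment leaves behind, using a two-row stand-in for the data frame so that it runs without the download. The names toy and toy_meta are hypothetical.

```r
# A two-row stand-in for the fips data frame (assumption: avoids the download)
toy <- data.frame(state = c("Alabama", "Alaska"),
                  fip   = c(1L, 2L),
                  id    = c("AL", "AK"))

# A cut-down metadata list using two of the fields shown above
toy_meta <- list(Accessed = "2015-08-01",
                 Source   = "http://www.theguardian.com/thecounted")

attr(toy, "meta") <- toy_meta

# The object is still an ordinary data frame
is.data.frame(toy)

# The custom attribute sits alongside names, class and row.names
names(attributes(toy))

# Individual fields can be pulled out by name
attr(toy, "meta")$Source
```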

Associate the metadata with the data frame using mostattributes

x <- fips
mostattributes(x) <- list(meta = meta)

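The metadata now displays with the data, but the object no longer prints as a data frame: the assignment replaced its existing attributes, including the column names and class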

x
## [[1]]
##  [1] "Alabama"              "Alaska"               "Arizona"             
##  [4] "Arkansas"             "California"           "Colorado"            
##  [7] "Connecticut"          "Delaware"             "District of Columbia"
## [10] "Florida"              "Georgia"              "Hawaii"              
## [13] "Idaho"                "Illinois"             "Indiana"             
## [16] "Iowa"                 "Kansas"               "Kentucky"            
## [19] "Louisiana"            "Maine"                "Maryland"            
## [22] "Massachusetts"        "Michigan"             "Minnesota"           
## [25] "Mississippi"          "Missouri"             "Montana"             
## [28] "Nebraska"             "Nevada"               "New Hampshire"       
## [31] "New Jersey"           "New Mexico"           "New York"            
## [34] "North Carolina"       "North Dakota"         "Ohio"                
## [37] "Oklahoma"             "Oregon"               "Pennsylvania"        
## [40] "Rhode Island"         "South Carolina"       "South Dakota"        
## [43] "Tennessee"            "Texas"                "Utah"                
## [46] "Vermont"              "Virginia"             "Washington"          
## [49] "West Virginia"        "Wisconsin"            "Wyoming"             
## 
## [[2]]
##  [1]  1  2  4  5  6  8  9 10 11 12 13 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [26] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 44 45 46 47 48 49 50 51 53 54 55
## [51] 56
## 
## [[3]]
##  [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "DC" "FL" "GA" "HI" "ID" "IL" "IN"
## [16] "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH"
## [31] "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT"
## [46] "VT" "VA" "WA" "WV" "WI" "WY"
## 
## attr(,"meta")
## attr(,"meta")$Accessed
## [1] "2015-08-01"
## 
## attr(,"meta")$Analyst
## [1] "Richard Careaga"
## 
## attr(,"meta")$Contact
## [1] "technocrat@twitter"
## 
## attr(,"meta")$Preprocessing
## [1] "2015 U.S. police caused deaths through July, reported by The Guardian"
## 
## attr(,"meta")$Source
## [1] "http://www.theguardian.com/thecounted"
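The printout above is a plain list because the mostattributes assignment, like an attributes assignment, replaces whatever attributes the object already had. One possible way to keep the data frame intact, sketched here on a small stand-in with hypothetical names, is to append the metadata to the existing attribute list instead of assigning it alone.

```r
# Two-row stand-in for the fips data frame (assumption: avoids the download)
toy <- data.frame(state = c("Alabama", "Alaska"),
                  fip   = c(1L, 2L),
                  id    = c("AL", "AK"))
toy_meta <- list(Accessed = "2015-08-01")

# Assigning only the metadata discards names, class and row.names
bare <- toy
mostattributes(bare) <- list(meta = toy_meta)
is.data.frame(bare)

# Appending to the existing attributes keeps the data frame structure
kept <- toy
mostattributes(kept) <- c(attributes(kept), list(meta = toy_meta))
is.data.frame(kept)
names(kept)
attr(kept, "meta")$Accessed
```

With this variant the metadata no longer prints automatically, since the data frame print method takes over again; it has to be requested with attr as in the earlier example.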

My now deprecated approach

Metadata are surprisingly easy to attach through S4 classes, which was my original approach.

The Rise of the Objects

After encountering S4¹ classes in the sp² and related packages, I hoped to find an R package that attached metadata to data frames in the same way the geospatial tools attach features to a coordinates data frame. The search was unsuccessful, so the next step was to look into S4 to write something on my own. Several sources helped: Gatto, Genolini and the omnipresent Wickham.

A Simple Example Use

In U.S. state map data, polygons are sometimes keyed to state names, sometimes to postal codes and sometimes to a census FIPS code. A data frame containing all three, combined with the merge function, makes it easy to add whichever state key the related geospatial file uses as its merge field.
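A rough sketch of that use of merge, with a three-row stand-in for the lookup table and a hypothetical data frame keyed by postal code. The names lookup, counts and keyed are illustrative, and the counts are made up.

```r
# Stand-in for the state lookup table (three rows instead of 51)
lookup <- data.frame(state = c("Alabama", "Alaska", "Arizona"),
                     fip   = c(1L, 2L, 4L),
                     id    = c("AL", "AK", "AZ"))

# Hypothetical data keyed by postal code; the counts are made up
counts <- data.frame(id = c("AZ", "AL"), n = c(7L, 3L))

# Add the FIPS code and state name so the data can be joined to a
# geospatial file keyed on either of those fields
keyed <- merge(counts, lookup, by = "id")
keyed
```

By default merge keeps only rows whose key appears in both data frames, so Alaska drops out here; passing all.x = TRUE or all.y = TRUE changes that.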

fips <- read.csv("https://tuva.s3-us-west-2.amazonaws.com/state_fips_postal.csv", header = FALSE)

fips
##                      V1 V2 V3
## 1               Alabama  1 AL
## 2                Alaska  2 AK
## 3               Arizona  4 AZ
## 4              Arkansas  5 AR
## 5            California  6 CA
## 6              Colorado  8 CO
## 7           Connecticut  9 CT
## 8              Delaware 10 DE
## 9  District of Columbia 11 DC
## 10              Florida 12 FL
## 11              Georgia 13 GA
## 12               Hawaii 15 HI
## 13                Idaho 16 ID
## 14             Illinois 17 IL
## 15              Indiana 18 IN
## 16                 Iowa 19 IA
## 17               Kansas 20 KS
## 18             Kentucky 21 KY
## 19            Louisiana 22 LA
## 20                Maine 23 ME
## 21             Maryland 24 MD
## 22        Massachusetts 25 MA
## 23             Michigan 26 MI
## 24            Minnesota 27 MN
## 25          Mississippi 28 MS
## 26             Missouri 29 MO
## 27              Montana 30 MT
## 28             Nebraska 31 NE
## 29               Nevada 32 NV
## 30        New Hampshire 33 NH
## 31           New Jersey 34 NJ
## 32           New Mexico 35 NM
## 33             New York 36 NY
## 34       North Carolina 37 NC
## 35         North Dakota 38 ND
## 36                 Ohio 39 OH
## 37             Oklahoma 40 OK
## 38               Oregon 41 OR
## 39         Pennsylvania 42 PA
## 40         Rhode Island 44 RI
## 41       South Carolina 45 SC
## 42         South Dakota 46 SD
## 43            Tennessee 47 TN
## 44                Texas 48 TX
## 45                 Utah 49 UT
## 46              Vermont 50 VT
## 47             Virginia 51 VA
## 48           Washington 53 WA
## 49        West Virginia 54 WV
## 50            Wisconsin 55 WI
## 51              Wyoming 56 WY
names(fips) = c("state", "fip", "id")

Documenting something so simple seems like overkill until the inevitable question arises: is that even right? I used a simple JSON file to make³ a record.

require(rjson)
meta <- fromJSON(file = "data/meta.json")
meta
## $Accessed
## [1] "2015-08-01"
## 
## $Analyst
## [1] "Richard Careaga"
## 
## $Contact
## [1] "technocrat@twitter"
## 
## $Preprocessing
## [1] "2015 U.S. police caused deaths through July, reported by The Guardian"
## 
## $Source
## [1] "http://www.theguardian.com/thecounted"
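The contents of data/meta.json are not shown in this post. As an assumption about how such a file could be produced, the sketch below writes a few of the fields printed above to a temporary file with rjson and reads them back. It assumes rjson is installed and does nothing otherwise; meta_out, meta_in and path are hypothetical names.

```r
# Assumption: the rjson package is installed; skip quietly otherwise
if (requireNamespace("rjson", quietly = TRUE)) {
  # Field names taken from the printout above; the temporary file is a
  # stand-in for data/meta.json
  meta_out <- list(Accessed = "2015-08-01",
                   Analyst  = "Richard Careaga",
                   Source   = "http://www.theguardian.com/thecounted")
  path <- tempfile(fileext = ".json")

  # Serialize the list to a JSON string and write it to disk
  writeLines(rjson::toJSON(meta_out), path)

  # Read it back with the same call used in the post
  meta_in <- rjson::fromJSON(file = path)
  str(meta_in)
}
```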

The functions required for creating an S4 class are in {methods}, which ships with R and is attached by default, so there are no more dependencies.

# create the class
Mframe <- setClass("Mframe", slots = c(meta = "list", data = "data.frame"))
# instantiate it
mf <- Mframe(data = fips, meta = meta)
# show it
str(mf)
## Formal class 'Mframe' [package ".GlobalEnv"] with 2 slots
##   ..@ meta:List of 5
##   .. ..$ Accessed     : chr "2015-08-01"
##   .. ..$ Analyst      : chr "Richard Careaga"
##   .. ..$ Contact      : chr "technocrat@twitter"
##   .. ..$ Preprocessing: chr "2015 U.S. police caused deaths through July, reported by The Guardian"
##   .. ..$ Source       : chr "http://www.theguardian.com/thecounted"
##   ..@ data:'data.frame': 51 obs. of  3 variables:
##   .. ..$ state: chr [1:51] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##   .. ..$ fip  : int [1:51] 1 2 4 5 6 8 9 10 11 12 ...
##   .. ..$ id   : chr [1:51] "AL" "AK" "AZ" "AR" ...

Elaborating this approach to an entire dataset of related objects would be more challenging, but this simple application imposes minimal extra work beyond a slightly different syntax to access the data payload.

mf@data
##                   state fip id
## 1               Alabama   1 AL
## 2                Alaska   2 AK
## 3               Arizona   4 AZ
## 4              Arkansas   5 AR
## 5            California   6 CA
## 6              Colorado   8 CO
## 7           Connecticut   9 CT
## 8              Delaware  10 DE
## 9  District of Columbia  11 DC
## 10              Florida  12 FL
## 11              Georgia  13 GA
## 12               Hawaii  15 HI
## 13                Idaho  16 ID
## 14             Illinois  17 IL
## 15              Indiana  18 IN
## 16                 Iowa  19 IA
## 17               Kansas  20 KS
## 18             Kentucky  21 KY
## 19            Louisiana  22 LA
## 20                Maine  23 ME
## 21             Maryland  24 MD
## 22        Massachusetts  25 MA
## 23             Michigan  26 MI
## 24            Minnesota  27 MN
## 25          Mississippi  28 MS
## 26             Missouri  29 MO
## 27              Montana  30 MT
## 28             Nebraska  31 NE
## 29               Nevada  32 NV
## 30        New Hampshire  33 NH
## 31           New Jersey  34 NJ
## 32           New Mexico  35 NM
## 33             New York  36 NY
## 34       North Carolina  37 NC
## 35         North Dakota  38 ND
## 36                 Ohio  39 OH
## 37             Oklahoma  40 OK
## 38               Oregon  41 OR
## 39         Pennsylvania  42 PA
## 40         Rhode Island  44 RI
## 41       South Carolina  45 SC
## 42         South Dakota  46 SD
## 43            Tennessee  47 TN
## 44                Texas  48 TX
## 45                 Utah  49 UT
## 46              Vermont  50 VT
## 47             Virginia  51 VA
## 48           Washington  53 WA
## 49        West Virginia  54 WV
## 50            Wisconsin  55 WI
## 51              Wyoming  56 WY
# Note the @ slot accessor; mf$state would work only if a `$` method were defined for the class, which has not been done here
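One way the `$` access mentioned in the comment above could be set up is sketched below on a two-row stand-in. This is an illustration rather than part of the original class definition; the method simply forwards to the data slot.

```r
library(methods)  # attached by default in most sessions; explicit for scripts

# Same class definition as above, rebuilt on a two-row stand-in
Mframe <- setClass("Mframe", slots = c(meta = "list", data = "data.frame"))
mf <- Mframe(data = data.frame(state = c("Alabama", "Alaska"),
                               fip   = c(1L, 2L),
                               id    = c("AL", "AK")),
             meta = list(Accessed = "2015-08-01"))

# Forward `$` to the data slot so columns can be reached directly
setMethod("$", "Mframe", function(x, name) x@data[[name]])

mf$state
mf@meta$Accessed
```

The metadata stays behind the @ accessor in this sketch, which keeps a column named like a metadata field from shadowing it.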
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Pop!_OS 20.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rjson_0.2.20
## 
## loaded via a namespace (and not attached):
##  [1] compiler_4.0.0  magrittr_1.5    bookdown_0.20   tools_4.0.0    
##  [5] htmltools_0.5.0 yaml_2.2.1      stringi_1.4.6   rmarkdown_2.3  
##  [9] blogdown_0.20   knitr_1.29      stringr_1.4.0   digest_0.6.25  
## [13] xfun_0.16       rlang_0.4.7     evaluate_0.14

  1. See S3-style Objects and S4-class Objects in the R documentation with help(S4)

  2. Pebesma, E.J., R.S. Bivand, 2005. Classes and methods for spatial data in R. R News 5 (2), http://cran.r-project.org/doc/Rnews/. Roger S. Bivand, Edzer Pebesma, Virgilio Gomez-Rubio, 2013. Applied spatial data analysis with R, Second edition. Springer, NY. asdar

  3. Alex Couture-Beil (2014). rjson: JSON for R. R package version 0.2.15. rjson