Keeping track of the provenance of data can be a challenge, especially when drawing on published sources. A record of the origin, the date accessed, the transformations applied (e.g., converting from .xls to .csv, converting character strings such as “$1,250,321.21” to floats, or converting date strings to date objects), subsequent changes, who handled the data object, and where it can be found in a repository all enhance the analyst’s own ability to reproduce results.
Unfortunately, notes go missing, files get misfiled, and research is subject to all the other usual hazards. Often, one wishes for R objects with built-in metadata for that purpose.
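As a small illustration of the kind of transformation worth recording, the sketch below converts the currency string and a date string mentioned above and notes each step in a plain list. The variable names and the wording of the notes are hypothetical, chosen only for this example.

```r
# Convert a formatted currency string to a number by stripping "$" and ","
amount <- as.numeric(gsub("[$,]", "", "$1,250,321.21"))

# Convert an ISO-formatted date string to a Date object
accessed <- as.Date("2015-08-01")

# Record what was done so the steps can be reviewed later (hypothetical notes)
provenance <- list(
  Accessed = format(accessed),
  Transformations = c(
    "removed '$' and ',' from amounts, then as.numeric()",
    "parsed date strings with as.Date()"
  )
)
str(provenance)
```

A list like this is the sort of record that can later be attached to the data object itself.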
Using attr() or mostattributes() to do this (now preferred)
Update 2015-08-01: Scott Chamberlain at ropensci.org brought attr to my attention, which is the built-in way I was looking for originally. He also pointed me to EML, a much more elaborate approach suited for publication projects.
A minimal example
- Create a data frame and a separate metadata list
fips <- read.csv("https://tuva.s3-us-west-2.amazonaws.com/state_fips_postal.csv", header = FALSE)
fips
## V1 V2 V3
## 1 Alabama 1 AL
## 2 Alaska 2 AK
## 3 Arizona 4 AZ
## 4 Arkansas 5 AR
## 5 California 6 CA
## 6 Colorado 8 CO
## 7 Connecticut 9 CT
## 8 Delaware 10 DE
## 9 District of Columbia 11 DC
## 10 Florida 12 FL
## 11 Georgia 13 GA
## 12 Hawaii 15 HI
## 13 Idaho 16 ID
## 14 Illinois 17 IL
## 15 Indiana 18 IN
## 16 Iowa 19 IA
## 17 Kansas 20 KS
## 18 Kentucky 21 KY
## 19 Louisiana 22 LA
## 20 Maine 23 ME
## 21 Maryland 24 MD
## 22 Massachusetts 25 MA
## 23 Michigan 26 MI
## 24 Minnesota 27 MN
## 25 Mississippi 28 MS
## 26 Missouri 29 MO
## 27 Montana 30 MT
## 28 Nebraska 31 NE
## 29 Nevada 32 NV
## 30 New Hampshire 33 NH
## 31 New Jersey 34 NJ
## 32 New Mexico 35 NM
## 33 New York 36 NY
## 34 North Carolina 37 NC
## 35 North Dakota 38 ND
## 36 Ohio 39 OH
## 37 Oklahoma 40 OK
## 38 Oregon 41 OR
## 39 Pennsylvania 42 PA
## 40 Rhode Island 44 RI
## 41 South Carolina 45 SC
## 42 South Dakota 46 SD
## 43 Tennessee 47 TN
## 44 Texas 48 TX
## 45 Utah 49 UT
## 46 Vermont 50 VT
## 47 Virginia 51 VA
## 48 Washington 53 WA
## 49 West Virginia 54 WV
## 50 Wisconsin 55 WI
## 51 Wyoming 56 WY
names(fips) <- c("state", "fip", "id")
require(rjson) # easier to use JSON to write metadata
## Loading required package: rjson
meta <- fromJSON(file = "data/meta.json")
meta
## $Accessed
## [1] "2015-08-01"
##
## $Analyst
## [1] "Richard Careaga"
##
## $Contact
## [1] "technocrat@twitter"
##
## $Preprocessing
## [1] "2015 U.S. police caused deaths through July, reported by The Guardian"
##
## $Source
## [1] "http://www.theguardian.com/thecounted"
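The contents of data/meta.json are not shown here. Assuming it is a flat JSON object whose keys match the printed list, a file like it could be written and read back roughly as follows. The temporary path is a stand-in for data/meta.json, and the field values are copied from the output above.

```r
library(rjson)

# Metadata fields copied from the printed `meta` list
meta_out <- list(
  Accessed = "2015-08-01",
  Analyst = "Richard Careaga",
  Contact = "technocrat@twitter",
  Preprocessing = "2015 U.S. police caused deaths through July, reported by The Guardian",
  Source = "http://www.theguardian.com/thecounted"
)

# Write the list as JSON to a temporary file (stand-in for data/meta.json)
meta_path <- tempfile(fileext = ".json")
writeLines(toJSON(meta_out), meta_path)

# Read it back the same way the post does
meta_in <- fromJSON(file = meta_path)
names(meta_in)
```

Keeping the metadata in a small text file means it can be edited by hand and versioned alongside the data.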
Associate the metadata with the data frame using attr
# invisibly with attr
x <- fips
attr(x, "meta") <- meta
By default, metadata is not displayed
x
## state fip id
## 1 Alabama 1 AL
## 2 Alaska 2 AK
## 3 Arizona 4 AZ
## 4 Arkansas 5 AR
## 5 California 6 CA
## 6 Colorado 8 CO
## 7 Connecticut 9 CT
## 8 Delaware 10 DE
## 9 District of Columbia 11 DC
## 10 Florida 12 FL
## 11 Georgia 13 GA
## 12 Hawaii 15 HI
## 13 Idaho 16 ID
## 14 Illinois 17 IL
## 15 Indiana 18 IN
## 16 Iowa 19 IA
## 17 Kansas 20 KS
## 18 Kentucky 21 KY
## 19 Louisiana 22 LA
## 20 Maine 23 ME
## 21 Maryland 24 MD
## 22 Massachusetts 25 MA
## 23 Michigan 26 MI
## 24 Minnesota 27 MN
## 25 Mississippi 28 MS
## 26 Missouri 29 MO
## 27 Montana 30 MT
## 28 Nebraska 31 NE
## 29 Nevada 32 NV
## 30 New Hampshire 33 NH
## 31 New Jersey 34 NJ
## 32 New Mexico 35 NM
## 33 New York 36 NY
## 34 North Carolina 37 NC
## 35 North Dakota 38 ND
## 36 Ohio 39 OH
## 37 Oklahoma 40 OK
## 38 Oregon 41 OR
## 39 Pennsylvania 42 PA
## 40 Rhode Island 44 RI
## 41 South Carolina 45 SC
## 42 South Dakota 46 SD
## 43 Tennessee 47 TN
## 44 Texas 48 TX
## 45 Utah 49 UT
## 46 Vermont 50 VT
## 47 Virginia 51 VA
## 48 Washington 53 WA
## 49 West Virginia 54 WV
## 50 Wisconsin 55 WI
## 51 Wyoming 56 WY
Metadata has to be invoked by name
attr(x, "meta")
## $Accessed
## [1] "2015-08-01"
##
## $Analyst
## [1] "Richard Careaga"
##
## $Contact
## [1] "technocrat@twitter"
##
## $Preprocessing
## [1] "2015 U.S. police caused deaths through July, reported by The Guardian"
##
## $Source
## [1] "http://www.theguardian.com/thecounted"
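Because the metadata is an ordinary list stored as an attribute, individual fields can be pulled out with `$`, and the attribute travels with the object when it is saved with saveRDS(). The sketch below uses a two-row stand-in data frame and a shortened metadata list, both hypothetical; note that a plain-text export such as write.csv() would not carry the attribute.

```r
# Small stand-in for the fips data frame
x <- data.frame(state = c("Alabama", "Alaska"), fip = c(1L, 2L),
                stringsAsFactors = FALSE)
attr(x, "meta") <- list(Accessed = "2015-08-01",
                        Source = "http://www.theguardian.com/thecounted")

# Retrieve a single metadata field
attr(x, "meta")$Source

# Save and reload; the attribute is stored with the object
rds_path <- tempfile(fileext = ".rds")
saveRDS(x, rds_path)
y <- readRDS(rds_path)
identical(attr(y, "meta"), attr(x, "meta"))
```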
Associate the metadata with the data frame using mostattributes
x <- fips
mostattributes(x) <- list(meta = meta)
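The metadata now prints along with the data, but x is no longer a data frame: the assignment replaced the existing attributes (names, class and row names), so x prints as a plain unnamed list.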
x
## [[1]]
## [1] "Alabama" "Alaska" "Arizona"
## [4] "Arkansas" "California" "Colorado"
## [7] "Connecticut" "Delaware" "District of Columbia"
## [10] "Florida" "Georgia" "Hawaii"
## [13] "Idaho" "Illinois" "Indiana"
## [16] "Iowa" "Kansas" "Kentucky"
## [19] "Louisiana" "Maine" "Maryland"
## [22] "Massachusetts" "Michigan" "Minnesota"
## [25] "Mississippi" "Missouri" "Montana"
## [28] "Nebraska" "Nevada" "New Hampshire"
## [31] "New Jersey" "New Mexico" "New York"
## [34] "North Carolina" "North Dakota" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania"
## [40] "Rhode Island" "South Carolina" "South Dakota"
## [43] "Tennessee" "Texas" "Utah"
## [46] "Vermont" "Virginia" "Washington"
## [49] "West Virginia" "Wisconsin" "Wyoming"
##
## [[2]]
## [1] 1 2 4 5 6 8 9 10 11 12 13 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [26] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 44 45 46 47 48 49 50 51 53 54 55
## [51] 56
##
## [[3]]
## [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "DC" "FL" "GA" "HI" "ID" "IL" "IN"
## [16] "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH"
## [31] "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT"
## [46] "VT" "VA" "WA" "WV" "WI" "WY"
##
## attr(,"meta")
## attr(,"meta")$Accessed
## [1] "2015-08-01"
##
## attr(,"meta")$Analyst
## [1] "Richard Careaga"
##
## attr(,"meta")$Contact
## [1] "technocrat@twitter"
##
## attr(,"meta")$Preprocessing
## [1] "2015 U.S. police caused deaths through July, reported by The Guardian"
##
## attr(,"meta")$Source
## [1] "http://www.theguardian.com/thecounted"
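The printed result above is consistent with `mostattributes<-` replacing the whole attribute list rather than adding to it. A minimal sketch on a small stand-in data frame (the two-row `df` and its `meta` list are hypothetical) contrasts it with attr() and shows one way to keep the existing attributes by passing them along with the new one:

```r
df <- data.frame(state = c("Alabama", "Alaska"), fip = c(1L, 2L),
                 stringsAsFactors = FALSE)
meta <- list(Source = "hypothetical source")

# attr() adds one attribute and leaves the others alone
a <- df
attr(a, "meta") <- meta
is.data.frame(a)   # TRUE

# mostattributes<- replaces the attribute list, dropping names and class
b <- df
mostattributes(b) <- list(meta = meta)
is.data.frame(b)   # FALSE

# Passing the existing attributes as well keeps the data frame intact
kept <- df
mostattributes(kept) <- c(attributes(df), list(meta = meta))
is.data.frame(kept)
```

For simply attaching metadata to an existing data frame, attr() is the less surprising of the two.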
My now deprecated approach
Metadata are surprisingly easy to create through S4 classes, which was my original thought.
The Rise of the Objects
After encountering S4¹ classes in the sp² and related packages, I hoped to find an R package that provided a similar way of attaching metadata to data frames as the geospatial tools used to attach features to a coordinates data frame. The search was unsuccessful, so the next step was to look into S4 to write something on my own. Several sources helped: Gatto, Genolini and the omnipresent Wickham.
A Simple Example Use
In U.S. state map data, polygons are sometimes keyed to state names, sometimes to postal codes and sometimes to a census FIPS code. A data frame holding all three, combined with the merge function, makes it easy to add whichever key the related geospatial file uses as its merge field.
fips <- read.csv("https://tuva.s3-us-west-2.amazonaws.com/state_fips_postal.csv", header = FALSE)
fips
## V1 V2 V3
## 1 Alabama 1 AL
## 2 Alaska 2 AK
## 3 Arizona 4 AZ
## 4 Arkansas 5 AR
## 5 California 6 CA
## 6 Colorado 8 CO
## 7 Connecticut 9 CT
## 8 Delaware 10 DE
## 9 District of Columbia 11 DC
## 10 Florida 12 FL
## 11 Georgia 13 GA
## 12 Hawaii 15 HI
## 13 Idaho 16 ID
## 14 Illinois 17 IL
## 15 Indiana 18 IN
## 16 Iowa 19 IA
## 17 Kansas 20 KS
## 18 Kentucky 21 KY
## 19 Louisiana 22 LA
## 20 Maine 23 ME
## 21 Maryland 24 MD
## 22 Massachusetts 25 MA
## 23 Michigan 26 MI
## 24 Minnesota 27 MN
## 25 Mississippi 28 MS
## 26 Missouri 29 MO
## 27 Montana 30 MT
## 28 Nebraska 31 NE
## 29 Nevada 32 NV
## 30 New Hampshire 33 NH
## 31 New Jersey 34 NJ
## 32 New Mexico 35 NM
## 33 New York 36 NY
## 34 North Carolina 37 NC
## 35 North Dakota 38 ND
## 36 Ohio 39 OH
## 37 Oklahoma 40 OK
## 38 Oregon 41 OR
## 39 Pennsylvania 42 PA
## 40 Rhode Island 44 RI
## 41 South Carolina 45 SC
## 42 South Dakota 46 SD
## 43 Tennessee 47 TN
## 44 Texas 48 TX
## 45 Utah 49 UT
## 46 Vermont 50 VT
## 47 Virginia 51 VA
## 48 Washington 53 WA
## 49 West Virginia 54 WV
## 50 Wisconsin 55 WI
## 51 Wyoming 56 WY
names(fips) <- c("state", "fip", "id")
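As a sketch of the merge use described above, suppose some data arrive keyed by postal code while the map file expects FIPS codes. The three-row lookup and the `counts` data frame below are hypothetical stand-ins, not data from this post.

```r
# Three-row stand-in for the lookup table
fips_small <- data.frame(state = c("Alabama", "Alaska", "Arizona"),
                         fip = c(1L, 2L, 4L),
                         id = c("AL", "AK", "AZ"),
                         stringsAsFactors = FALSE)

# Hypothetical data keyed by postal code only
counts <- data.frame(id = c("AZ", "AL"), n = c(7L, 3L),
                     stringsAsFactors = FALSE)

# Add the state name and FIPS code by merging on the shared key
keyed <- merge(counts, fips_small, by = "id")
keyed
```

The result carries all three identifiers, so it can be joined to a geospatial file on whichever one that file uses.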
Documenting something so simple seems like overkill until the inevitable question arises: is that even right? I used a simple JSON file to make³ a record.
require(rjson)
meta <- fromJSON(file = "data/meta.json")
meta
## $Accessed
## [1] "2015-08-01"
##
## $Analyst
## [1] "Richard Careaga"
##
## $Contact
## [1] "technocrat@twitter"
##
## $Preprocessing
## [1] "2015 U.S. police caused deaths through July, reported by The Guardian"
##
## $Source
## [1] "http://www.theguardian.com/thecounted"
The functions required for creating an S4 class are in the methods package, which ships with R and is attached by default in an interactive session, so there are no additional dependencies.
# create the class
Mframe <- setClass("Mframe", slots = c(meta = "list", data = "data.frame"))
# instantiate it
mf <- Mframe(data = fips, meta = meta)
# show it
str(mf)
## Formal class 'Mframe' [package ".GlobalEnv"] with 2 slots
## ..@ meta:List of 5
## .. ..$ Accessed : chr "2015-08-01"
## .. ..$ Analyst : chr "Richard Careaga"
## .. ..$ Contact : chr "technocrat@twitter"
## .. ..$ Preprocessing: chr "2015 U.S. police caused deaths through July, reported by The Guardian"
## .. ..$ Source : chr "http://www.theguardian.com/thecounted"
## ..@ data:'data.frame': 51 obs. of 3 variables:
## .. ..$ state: chr [1:51] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## .. ..$ fip : int [1:51] 1 2 4 5 6 8 9 10 11 12 ...
## .. ..$ id : chr [1:51] "AL" "AK" "AZ" "AR" ...
Extending this approach to an entire dataset of related objects would be more challenging, but this simple application imposes little extra work beyond a slightly different syntax for accessing the data payload.
mf@data
## state fip id
## 1 Alabama 1 AL
## 2 Alaska 2 AK
## 3 Arizona 4 AZ
## 4 Arkansas 5 AR
## 5 California 6 CA
## 6 Colorado 8 CO
## 7 Connecticut 9 CT
## 8 Delaware 10 DE
## 9 District of Columbia 11 DC
## 10 Florida 12 FL
## 11 Georgia 13 GA
## 12 Hawaii 15 HI
## 13 Idaho 16 ID
## 14 Illinois 17 IL
## 15 Indiana 18 IN
## 16 Iowa 19 IA
## 17 Kansas 20 KS
## 18 Kentucky 21 KY
## 19 Louisiana 22 LA
## 20 Maine 23 ME
## 21 Maryland 24 MD
## 22 Massachusetts 25 MA
## 23 Michigan 26 MI
## 24 Minnesota 27 MN
## 25 Mississippi 28 MS
## 26 Missouri 29 MO
## 27 Montana 30 MT
## 28 Nebraska 31 NE
## 29 Nevada 32 NV
## 30 New Hampshire 33 NH
## 31 New Jersey 34 NJ
## 32 New Mexico 35 NM
## 33 New York 36 NY
## 34 North Carolina 37 NC
## 35 North Dakota 38 ND
## 36 Ohio 39 OH
## 37 Oklahoma 40 OK
## 38 Oregon 41 OR
## 39 Pennsylvania 42 PA
## 40 Rhode Island 44 RI
## 41 South Carolina 45 SC
## 42 South Dakota 46 SD
## 43 Tennessee 47 TN
## 44 Texas 48 TX
## 45 Utah 49 UT
## 46 Vermont 50 VT
## 47 Virginia 51 VA
## 48 Washington 53 WA
## 49 West Virginia 54 WV
## 50 Wisconsin 55 WI
## 51 Wyoming 56 WY
# Note the @ slot accessor; mf$state would work only after defining a `$` method, which this class definition does not do
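If `$` access to the columns is wanted, one option is to define a `$` method for the class that forwards to the data slot. This is a sketch of that idea, not part of the original class; the small data frame and metadata list are hypothetical.

```r
library(methods)

# Same class definition as above
Mframe <- setClass("Mframe", slots = c(meta = "list", data = "data.frame"))

# Forward `$` to the columns of the data slot
setMethod("$", "Mframe", function(x, name) x@data[[name]])

mf_small <- Mframe(
  data = data.frame(state = c("Alabama", "Alaska"), id = c("AL", "AK"),
                    stringsAsFactors = FALSE),
  meta = list(Source = "hypothetical source")
)

mf_small$state   # column access through the new method
mf_small@meta    # metadata still reached through the slot
```

A name that is not a column returns NULL, as it would for the underlying data frame.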
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Pop!_OS 20.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rjson_0.2.20
##
## loaded via a namespace (and not attached):
## [1] compiler_4.0.0 magrittr_1.5 bookdown_0.20 tools_4.0.0
## [5] htmltools_0.5.0 yaml_2.2.1 stringi_1.4.6 rmarkdown_2.3
## [9] blogdown_0.20 knitr_1.29 stringr_1.4.0 digest_0.6.25
## [13] xfun_0.16 rlang_0.4.7 evaluate_0.14
¹ See S3-style Objects and S4-class Objects in the R documentation with help(S4).
² Pebesma, E.J. and R.S. Bivand, 2005. Classes and methods for spatial data in R. R News 5 (2), http://cran.r-project.org/doc/Rnews/. Bivand, R.S., E. Pebesma and V. Gomez-Rubio, 2013. Applied Spatial Data Analysis with R, second edition. Springer, NY.
³ Alex Couture-Beil (2014). rjson: JSON for R. R package version 0.2.15.