Metadata for datasets

Where does this dataset come from?

Is this the original revised data or the revised revised data?

Keeping track of the provenance of data can be a challenge, especially when drawing on published sources. A record of the origin, the date accessed, the transformations applied (e.g., converting from .xls to .csv, converting character strings such as “$1,250,321.21” to floats, or converting date strings to date objects), subsequent changes, who handled the data object, and where it can be found in a repository all enhance the analyst’s own ability to reproduce results.
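As a rough illustration of the kind of transformations worth recording, here is a minimal base R sketch. The variable names are made up for this example, and the date string is simply assumed to be in ISO format.

```r
# Convert a formatted currency string to a number by stripping "$" and ","
raw_amount <- "$1,250,321.21"
amount <- as.numeric(gsub("[$,]", "", raw_amount))

# Convert a date string to a Date object (format assumed to be YYYY-MM-DD)
raw_date <- "2015-07-31"
accessed <- as.Date(raw_date, format = "%Y-%m-%d")

class(amount)
class(accessed)
```

Each of these steps changes the type of the data, which is exactly the sort of detail that is easy to forget unless it is written down next to the object.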

Unfortunately, notes go missing, files get misfiled, and research is exposed to every other hazard of that kind. Often, one wishes for R objects that carry their own metadata.

Using mostattributes() to attach metadata

Scott Chamberlain at ropensci.org brought attr() to my attention; it is the built-in mechanism I was originally looking for. He also pointed me to EML, a much more elaborate approach suited to publication projects.
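For orientation, a minimal sketch of attr() on a toy data frame. The two rows and the attribute name "meta" are chosen only for illustration; any attribute name that does not clash with R's own would do.

```r
# Toy data frame standing in for a real dataset (illustrative values only)
df <- data.frame(state = c("Alabama", "Alaska"), fip = c(1L, 2L))

# Attach a metadata list under the attribute name "meta"
attr(df, "meta") <- list(Accessed = "2015-07-31", Version = "1.0")

# List all attribute names, then read one metadata field back
names(attributes(df))
attr(df, "meta")$Accessed
```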

A minimal example

Create data frame and a separate metadata list

library(jsonlite) # JSON is a convenient format for writing metadata
# Read the state FIPS table; the file has no header row
fips <- read.csv("https://tuva.s3-us-west-2.amazonaws.com/state_fips_postal.csv", header = FALSE)
colnames(fips) <- c("state", "fip", "id")
# Read the metadata; jsonlite returns a JSON array of objects as a data frame
meta <- fromJSON("https://tuva.s3-us-west-2.amazonaws.com/2015-07-31-meta.json")

The JSON source file looks like this

[
    {
    "Accessed": "2015-07-31",
    "GitBlame": "Richard Careaga",
    "Contact": "technocrat@twitter",
    "Preprocessing": "FIPS Codes for the States and District of Columbia table captured manually and converted to cvs file",
    "Source": "https://www.census.gov/geo/reference/ansi_statetables.html",
    "Repository": "unassigned",
    "Version": "1.0"
    }
]
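The sketch below shows how a record of this shape round-trips through jsonlite without touching the network. It uses an abbreviated, made-up two-field record rather than the file above, and assumes jsonlite is installed.

```r
library(jsonlite)

# Abbreviated metadata record (only two of the fields, for illustration)
meta_df <- data.frame(Accessed = "2015-07-31", Version = "1.0",
                      stringsAsFactors = FALSE)

# A one-row data frame serializes as a JSON array holding one object
json_text <- toJSON(meta_df, pretty = TRUE)

# Reading it back yields a one-row data frame again
meta_back <- fromJSON(json_text)
str(meta_back)
```

This is why the metadata, once read with fromJSON, behaves like a one-row data frame rather than a bare list.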

Associate the metadata with the data frame using mostattributes

x <- fips
mostattributes(x) <- list(meta = meta)

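Now the metadata is displayed by default when the object is printed. Because the assignment replaces the object's existing attributes, x loses its data frame class and column names and prints as a plain list, followed by the meta attribute.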

x
## [[1]]
##  [1] "Alabama"              "Alaska"               "Arizona"             
##  [4] "Arkansas"             "California"           "Colorado"            
##  [7] "Connecticut"          "Delaware"             "District of Columbia"
## [10] "Florida"              "Georgia"              "Hawaii"              
## [13] "Idaho"                "Illinois"             "Indiana"             
## [16] "Iowa"                 "Kansas"               "Kentucky"            
## [19] "Louisiana"            "Maine"                "Maryland"            
## [22] "Massachusetts"        "Michigan"             "Minnesota"           
## [25] "Mississippi"          "Missouri"             "Montana"             
## [28] "Nebraska"             "Nevada"               "New Hampshire"       
## [31] "New Jersey"           "New Mexico"           "New York"            
## [34] "North Carolina"       "North Dakota"         "Ohio"                
## [37] "Oklahoma"             "Oregon"               "Pennsylvania"        
## [40] "Rhode Island"         "South Carolina"       "South Dakota"        
## [43] "Tennessee"            "Texas"                "Utah"                
## [46] "Vermont"              "Virginia"             "Washington"          
## [49] "West Virginia"        "Wisconsin"            "Wyoming"             
## 
## [[2]]
##  [1]  1  2  4  5  6  8  9 10 11 12 13 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [26] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 44 45 46 47 48 49 50 51 53 54 55
## [51] 56
## 
## [[3]]
##  [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "DC" "FL" "GA" "HI" "ID" "IL" "IN"
## [16] "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH"
## [31] "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT"
## [46] "VT" "VA" "WA" "WV" "WI" "WY"
## 
## attr(,"meta")
##     Accessed        GitBlame            Contact
## 1 2015-07-31 Richard Careaga technocrat@twitter
##                                                                                          Preprocessing
## 1 FIPS Codes for the States and District of Columbia table captured manually and converted to cvs file
##                                                       Source Repository Version
## 1 https://www.census.gov/geo/reference/ansi_statetables.html unassigned     1.0
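A small sketch of what happens to an object's other attributes, using a two-row toy data frame and an abbreviated metadata record (both made up for illustration, not the files above). It uses base R only and contrasts the mostattributes assignment with the attr assignment, which adds one attribute and leaves the rest alone.

```r
# Toy inputs (illustrative values only)
toy  <- data.frame(state = c("Alabama", "Alaska"), fip = c(1L, 2L))
meta <- data.frame(Accessed = "2015-07-31", Version = "1.0",
                   stringsAsFactors = FALSE)

# mostattributes<- replaces the full attribute set with the supplied list
x <- toy
mostattributes(x) <- list(meta = meta)
names(attributes(x))   # only the meta attribute is left
is.data.frame(x)       # the data frame class is gone

# attr<- adds the metadata and keeps class, names and row names
y <- toy
attr(y, "meta") <- meta
is.data.frame(y)
names(y)

# Individual fields can be read back from either object
attr(x, "meta")$Accessed
attr(y, "meta")$Version
```

Which form to prefer is a judgment call: the attr version keeps the object usable as a data frame, but the metadata is then not shown when the data frame is printed and has to be requested with attr(y, "meta").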