The compiler will always tell you about source code errors that prevent compiling. It can’t advise you if your code solves the problem that it was supposed to solve, even if you are confident in what that problem is. But have a thought for the user who posed the problem, and keep in mind two famous quotations.
Thus, programs must be executed for humans to read, only incidentially for machines to execute. – Harold Abramson, Gerald Jay Sussman, Structure and Interpretation of Computer Programs - 2nd Edition, The MIT Press, 1966
Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it? – Brian Kernighan and P.J. Plauger, The Elements of Programming Style - 2nd Edition, McGraw-Hill, 1978
I’ve been helping a medical researcher who was given the following perfectly syntactical R
code:
#1. Read in the necessary data.
setwd("~")
ReceivedOp <- list()
for (i in 1:14) { ## csv files - T/F for every group - based on whether or not they received an operation
ReceivedOp[[i]] <- read.csv(paste("./receivedOperationByYear/receivedOperation", (2002:2015)[i], ".csv", sep = ""))
}
rm(i)
filePathsDCODEs <- paste("./RDS AY ", 2002:2015, "/RDS_DCODE.csv", sep = "") ## The names of each file path
D_CODES <- list()
for (i in 1:14) { ## Diagnostic codes (IP or EP)
D_CODES[[i]] <- read.csv(filePathsDCODEs[i])
}
rm(i)
filePathsDEMOs <- paste("./RDS AY ", 2002:2015, "/RDS_DEMO.csv", sep = "") ## The names of each file path
DEMO <- list()
for (i in 1:14) {
DEMO[[i]] <- read.csv(filePathsDEMOs[i])
} ## Demographic data (i.e. ages)
rm(i)
DISCHARGE <- list()
for (i in 1:9) { ## Mortality information by INC_KEY
DISCHARGE[[i]] <- read.csv(paste("./RDS AY ", (2007:2015)[i], "/DISCHARGE", (2007:2015)[i], ".csv", sep = ""))
}
rm(i)
for (i in 6:14) { ## Change integer division to regular division
DEMO[[i]]$AGE[DEMO[[i]]$AGEU == "Days"] <- DEMO[[i]]$AGE[DEMO[[i]]$AGEU == "Days"] / 365
DEMO[[i]]$AGE[DEMO[[i]]$AGEU == "Months"] <- DEMO[[i]]$AGE[DEMO[[i]]$AGEU == "Months"] / 12
DEMO[[i]]$AGEU[DEMO[[i]]$AGEU == "Days" | DEMO[[i]]$AGEU == "Months"] <- "Years"
}
# 2. Collect data of interest
RuptureTypeList <- list()
InteractionTerms <- list()
DemoList <- list()
AgexRuptureTypexOperationTypeInteractions <- list()
AgexOperationTypeInteractions <- list()
AgexRuptureTypexMortalityInteractions <- list()
OrganizedDischarge <- list()
AgexRupturexOpxMort <- list()
pelvicFractures <- list()
for (i in 1:14) { ## Pre-allocate memory
RuptureType <- rep(NA, nrow(ReceivedOp[[i]]))
RuptureTypeList[[i]] <- RuptureType
InteractionTerms[[i]] <- RuptureType
DemoList[[i]] <- RuptureType
AgexRuptureTypexOperationTypeInteractions[[i]] <- RuptureType
AgexOperationTypeInteractions[[i]] <- RuptureType
AgexRuptureTypexMortalityInteractions[[i]] <- RuptureType
OrganizedDischarge[[i]] <- RuptureType
AgexRupturexOpxMort[[i]] <- RuptureType
pelvicFractures[[i]] <- matrix(rep(RuptureType, 16), ncol = 16)
}
rm(RuptureType, i)
for (i in 1:14) { ## The code here is going to determine whether there is an EP or IP rupture, by converting to numeric before checking, difference between 867 and 867.0 is.
for (j in 1:nrow(ReceivedOp[[i]])) { ## Going through by INC_KEY in the ReceivedOp
if (867.1 %in% suppressWarnings(as.numeric(as.character(D_CODES[[i]]$DCODE[D_CODES[[i]]$INC_KEY == ReceivedOp[[i]]$INC_KEY[j]])))) {
RuptureTypeList[[i]][j] <- 867.1
}
else if (867 %in% suppressWarnings(as.numeric(as.character(D_CODES[[i]]$DCODE[D_CODES[[i]]$INC_KEY == ReceivedOp[[i]]$INC_KEY[j]])))) {
RuptureTypeList[[i]][j] <- 867
}
}
}
rm(i, j)
for (i in 1:14) { ## Whether they had a bladder operation and whether their rupture was EP or IP.
InteractionTerms[[i]] <- interaction(ReceivedOp[[i]]$BLADDEROP, RuptureTypeList[[i]], drop = T) ## The interaction function is very useful.
}
rm(i)
for (i in 1:14) { ## Transfer each patient's age
for (j in 1:nrow(ReceivedOp[[i]])) {
if (!is.na(DEMO[[i]]$AGE[DEMO[[i]]$INC_KEY == ReceivedOp[[i]]$INC_KEY[j]][1] > 0) & DEMO[[i]]$AGE[DEMO[[i]]$INC_KEY == ReceivedOp[[i]]$INC_KEY[j]][1] > (1/12)) {
## Condition as such after examining the data outside the admissable range and finding all values were either 0 or negative (the latter representing missing data)
DemoList[[i]][j] <- DEMO[[i]]$AGE[DEMO[[i]]$INC_KEY == ReceivedOp[[i]]$INC_KEY[j]][1]
}
}
}
for (i in 1:14) { ## Convert ages into categories
DemoList[[i]] <- ifelse(17 >= DemoList[[i]], "Child", "Adult")
}
for (i in 1:14) { ## Contains information about age, EP vs IP, and whether they received an operation.
AgexRuptureTypexOperationTypeInteractions[[i]] <- interaction(ReceivedOp[[i]]$BLADDEROP, RuptureTypeList[[i]], DemoList[[i]], drop = T)
}
for (i in 1:14) { ## Whether they received an operation and their age
AgexOperationTypeInteractions[[i]] <- interaction(ReceivedOp[[i]]$BLADDEROP, DemoList[[i]], drop = T)
}
xFromNA <- function(item, x) {
if (is.na(item)) {
item <- x
}
item
}
for (i in 1:9) { ## Orders INC_KEYs for those who mortality data according to whether they received an operation or not, the outside loop goes by year.
for (j in 1:nrow(ReceivedOp[[i]])) { ## The inside loop goes by INC_KEY from the ReceivedOp list
if (sum(DISCHARGE[[i]]$DECEASED[DISCHARGE[[i]]$INC_KEY == ReceivedOp[[i+5]]$INC_KEY[j]], na.rm = T) > 0) {
OrganizedDischarge[[i+5]][j] <- TRUE
}
else if ((xFromNA(sum(DISCHARGE[[i]]$DECEASED[DISCHARGE[[i]]$INC_KEY == ReceivedOp[[i+5]]$INC_KEY[j]]), -1) == 0)) {
OrganizedDischarge[[i+5]][j] <- FALSE
}
}
}
for (i in 1:9) { ## What age group, whether they had an EP or IP rupture, and whether they died.
AgexRuptureTypexMortalityInteractions[[i]] <- interaction(DemoList[[i+5]], RuptureTypeList[[i+5]], OrganizedDischarge[[i+5]])
}
for (i in 1:9) {
AgexRupturexOpxMort[[i]] <- interaction(OrganizedDischarge[[i+5]], AgexRuptureTypexOperationTypeInteractions[[i+5]])
}
pelvicDCODES <- c(808.0, 808.1, 808.2, 808.3, 808.4, 808.41, 808.42, 808.43, 808.49, 808.5, 808.51, 808.52, 808.59, 808.8, 808.9)
for (i in 1:14) { ## Here we are getting whether it is an EP or IP rupture, by converting to numeric before checking, we remove the difference between 867 and 867.0.
for (j in 1:nrow(ReceivedOp[[i]])) { ## Going through by INC_KEY in the ReceivedOp
for (k in 1:16) {
if (pelvicDCODES[k] %in% suppressWarnings(as.numeric(as.character(D_CODES[[i]]$DCODE[D_CODES[[i]]$INC_KEY == ReceivedOp[[i]]$INC_KEY[j]])))) { ## Try to see what happens if you mess around with this condition in a separate file so you understand why it is the way it is.
pelvicFractures[[i]][j, k] <- TRUE
}
else {
pelvicFractures[[i]][j, k] <- FALSE
}
}
}
}
pelvisInteractions <- rep(list(rep(list(), 14)), 15)
pelvicDCODES <- as.character(pelvicDCODES)
for (j in 1:14) { ## Pre-allocate memory
RuptureType <- rep(NA, nrow(ReceivedOp[[j]]))
for (i in 1:15) {
pelvisInteractions[[i]][[j]] <- RuptureType
}
}
names(pelvisInteractions) <- pelvicDCODES
for (i in 1:15) {
for (j in 1:9) {
pelvisInteractions[[i]][[j+5]] <- interaction(AgexRuptureTypexOperationTypeInteractions[[j+5]], as.logical(pelvicFractures[[j+5]][ , i]))
}
}
pelvisInteractions <- lapply(pelvisInteractions, function(x) 'names<-'(x, 2002:2015))
pelvisInteractions <- rapply(pelvisInteractions, table, how = "replace")
# 3. Collect counts
EPvIP <- t(data.frame(lapply(RuptureTypeList, table)))[-(1+(1:13)*2), ] ## Calculate frequencies by year, same for all the rest
colnames(EPvIP) <- EPvIP[1, ]
EPvIP <- EPvIP[-1, ]
rownames(EPvIP) <- 2002:2015 ## Put into nice format for writing to file, same for all the rest
ReceivedOperationAndRuptureType <- t(data.frame(lapply(InteractionTerms, table)))[-(1+(1:13)*2), ]
colnames(ReceivedOperationAndRuptureType) <- ReceivedOperationAndRuptureType[1, ]
ReceivedOperationAndRuptureType <- ReceivedOperationAndRuptureType[-1, ]
rownames(ReceivedOperationAndRuptureType) <- 2002:2015
ReceivedOperationAndRuptureTypeAndAge <- t(data.frame(lapply(AgexRuptureTypexOperationTypeInteractions, table)))[-(1+(1:13)*2), ]
colnames(ReceivedOperationAndRuptureTypeAndAge) <- ReceivedOperationAndRuptureTypeAndAge[1, ]
ReceivedOperationAndRuptureTypeAndAge <- ReceivedOperationAndRuptureTypeAndAge[-1, ]
rownames(ReceivedOperationAndRuptureTypeAndAge) <- 2002:2015
ReceivedOperationAndAge <- t(data.frame(lapply(AgexOperationTypeInteractions, table)))[-(1+1:13*2), ]
colnames(ReceivedOperationAndAge) <- ReceivedOperationAndAge[1, ]
ReceivedOperationAndAge <- ReceivedOperationAndAge[-1, ]
rownames(ReceivedOperationAndAge) <- 2002:2015
AgeAndRuptureTypeAndMortality <- t(data.frame(lapply(AgexRuptureTypexMortalityInteractions[1:9], table)))[-(1+1:9*2), ]
colnames(AgeAndRuptureTypeAndMortality) <- AgeAndRuptureTypeAndMortality[1, ]
AgeAndRuptureTypeAndMortality <- AgeAndRuptureTypeAndMortality[-1, ]
rownames(AgeAndRuptureTypeAndMortality) <- 2007:2015
Mortality <- t(data.frame(lapply(OrganizedDischarge[6:14], table)))[-(1+1:9*2), ]
colnames(Mortality) <- Mortality[1, ]
Mortality <- Mortality[-1, ]
rownames(Mortality) <- 2007:2015
ReceivedAgexRupturexOpxMort <- t(data.frame(lapply(AgexRupturexOpxMort[1:9], table)))[-(1+1:9*2), ]
colnames(ReceivedAgexRupturexOpxMort) <- ReceivedAgexRupturexOpxMort[1, ]
ReceivedAgexRupturexOpxMort <- ReceivedAgexRupturexOpxMort[-1, ]
rownames(ReceivedAgexRupturexOpxMort) <- 2007:2015
pelvisInteractions <- lapply(pelvisInteractions, function(x) x[-(1:5)])
for (i in 1:length(pelvisInteractions)) {
for (j in 1:length(pelvisInteractions[[i]])) {
pelvisResults <- t(data.frame(pelvisInteractions[[i]]))[-(1+(1:13)*2), ]
colnames(pelvisResults) <- pelvisResults[1, ]
pelvisResults <- pelvisResults[-1, ]
rownames(pelvisResults) <- 2007:2015
}
write.csv(pelvisResults, paste0("pelvisdcode", names(pelvisInteractions)[i], "xEverything.csv"))
}
# 4. Write to files
write.csv(EPvIP, "resultsEPvIP.csv")
write.csv(ReceivedOperationAndRuptureType, "resultsOperationAndRupture.csv")
write.csv(ReceivedOperationAndRuptureTypeAndAge, "resultsOperationAndRuptureAndAge.csv")
write.csv(ReceivedOperationAndAge, "resultsOperationAndAge.csv")
write.csv(Mortality, "resultsMortality.csv")
write.csv(AgeAndRuptureTypeAndMortality, "resultsAgeRuptureTypeMortality.csv")
write.csv(ReceivedAgexRupturexOpxMort, "resultsOperationAndRuptureAndAgeAndMortality.csv")
His request? Help me understand what this code does, because I don’t know how it got from the source data to the output data.
He had two obstacles to overcome that made mystification inevitable. He had no coding experience in the imperative/procedural style of C, C++, Java, Python
and most other languages, and he was new to R
. Now, he was reading code in R
, a functional language, written primarily in an imperative style over 233 lines.
I thought that a more idiomatic version would clarify the data transformation. We talked of which csv files were of interest, what variables were important, what the output should look like and what the plans were for analyzing the data.
As a first cut, I returned 72 lines of more idiomatic R
in the tidyverse
style to look at a single year of the data.
#import libraries
library(tidyverse)
setwd("/Users/rc/projects/working directory/RDS AY 2015")
#set print display options
options(pillar.sigfig = 10) #control display of tibble
#Get patient demographics
patients <- as.tibble(read.csv("RDS_DEMO.csv", stringsAsFactors = FALSE))
patients <- patients %>% select(INC_KEY, AGE, GENDER)
patients <- patients %>% select(INC_KEY, AGE, GENDER) %>% mutate(ADULT = ifelse(AGE > 17, "TRUE", "FALSE"))
#eliminate duplicates
patients <- distinct(patients)
#Get discharge status
discharges <- as.tibble(read.csv("RDS_DISCHARGE.csv", stringsAsFactors = FALSE))
discharges <- discharges %>% select(INC_KEY,HOSPDISP) %>% filter(HOSPDISP != "Not Applicable BIU 1" & HOSPDISP != "Not Known/Not Recorded BIU 2") %>% mutate(Expired = ifelse(HOSPDISP == "Deceased/Expired", "TRUE", "FALSE")) %>% select(-HOSPDISP)
#Get intervention codes
p_codes <- as.tibble(read.csv("RDS_PCODE.csv", stringsAsFactors = FALSE))
p_codes <- p_codes %>% select(INC_KEY, PCODE)
pelvic <- p_codes %>% filter(PCODE >= 57.6 & PCODE <= 57.89 | PCODE == 57.93) %>% mutate(PCODE = as.logical(PCODE))
colnames(pelvic) <- c("INC_KEY","PELVIC") # 57.93 56.6:57.89 & 57.93
#Diagnostic codes
d_codes <- as.tibble(read.csv("RDS_DCODE.csv", stringsAsFactors = FALSE))
d_codes <- d_codes %>% select(INC_KEY, DCODE)
Rupture <- d_codes %>% filter(DCODE > 866.99 & DCODE < 867.2)
#Join patients, discharge, then patients, treatment, then T/F pelvic then DCODES, add year and reorder
patients <- inner_join(patients, discharges, by = "INC_KEY")
#Treatment codes
#ID patients with no pelvic procedure
pelvic_false <- setdiff(patients$INC_KEY, pelvic$INC_KEY)
#Add field for PELVIC t/f
patients <- patients %>% mutate(PELVIC = ifelse(INC_KEY %in% pelvic_false, "FALSE", "TRUE")) %>% mutate(PELVIC = as.logical(PELVIC))
#Add fields for 867 series
patients <- inner_join(patients, Rupture, by = "INC_KEY")
#Add year
patients <- patients %>% mutate(YEAR = 2015)
#Rename columns
colnames(patients) <- c("INC_KEY", "Age", "Sex", "Adult", "Expired", "Intervention", "Rupture", "Year")
#Change Expired to logical
patients <- patients %>% mutate(Expired = as.logical(Expired))
#Censor anomolies
patients <- patients %>% filter(Age >= 0)
#SAVE
write.csv(patients, "../combined/patients_2015.csv")
That was something he was able to understand, and we have been working together to refine, possible because we now share a mutually intelligible language.
I should emphasize that I have no criticism of the programmer, the style of programming or the process that left the user in the dark. The lesson for everyone in the process of converting facts to data to data sets to programming objects is that mutual intelligibility is key.
We have run into some challenges that we are working to correct, inconsistent column naming conventions, data coding and similar routine issues. The major problem was the underlying data structure in which the same unit of observation, a patient, had multiple observations in multiple files, classic untidy
data. Pulling all of the source data into a single data structure profilerated duplicate patient records that required a method to collapse into a single row.
That proved challenging for a surprising reason1, about which more when the tentative solution has been more thoroughly tested.
The
tibble
, as opposed to its constituentvector
columns is difficult to iterate over as a whole when varying numbers of rows interfere with the expectations of the standard tools to apply functions to vectors.↩︎