Introduction

Metadata

bllflow supports metadata from DDI and from the R packages Hmisc and sjlabelled.

DDI metadata

The Data Documentation Initiative (DDI) is an international standard for describing the data produced by surveys and other observational methods in the social, behavioral, economic, and health sciences. DDI holds metadata for variable labels, and also important data provenance information.
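To make the two kinds of metadata concrete, here is a minimal sketch that reads a tiny hand-written fragment in the style of a DDI Codebook file with the xml2 package. The fragment, its study title and its variables are made up for illustration, and it omits the XML namespace that real DDI files usually declare. This is not how ReadDDI() is implemented.

```r
library(xml2)

# Hypothetical DDI-style fragment: study-level provenance (stdyDscr)
# plus variable-level codebook entries (dataDscr)
ddi_fragment <- paste0(
  '<codeBook>',
  '<stdyDscr><citation><titlStmt><titl>Example study</titl></titlStmt></citation></stdyDscr>',
  '<dataDscr>',
  '<var name="sex"><labl>Sex</labl>',
  '<catgry><catValu>1</catValu><labl>Male</labl></catgry>',
  '<catgry><catValu>2</catValu><labl>Female</labl></catgry>',
  '</var>',
  '<var name="age"><labl>Age (years)</labl></var>',
  '</dataDscr>',
  '</codeBook>'
)

doc <- read_xml(ddi_fragment)

# Provenance: the study title
study_title <- xml_text(xml_find_first(doc, "//stdyDscr//titl"))

# Variable labels, named by variable
vars <- xml_find_all(doc, "//dataDscr/var")
var_labels <- setNames(
  xml_text(xml_find_first(vars, "./labl")),
  xml_attr(vars, "name")
)

# Value labels for one categorical variable
sex_values <- xml_text(xml_find_all(doc, "//var[@name='sex']/catgry/catValu"))
sex_labels <- xml_text(xml_find_all(doc, "//var[@name='sex']/catgry/labl"))
```
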

Get the description of the study data from DDI and add the description to the bllflow object.

Example 1: Read the DDI xml file for your data.

pbcDDI <- ReadDDI(file.path(getwd(), "../inst/extdata"), "pbcDDI.xml")

Your metadata is now stored in the pbcDDI object in two formats:

  1. Variable and format labels can be accessed in pbcDDI$variableMetaData. All DDI metadata is brought into the pbcDDI object. For example, variable labels can be accessed as follows:
str(pbcDDI$variableMetaData$varlab) # variable labels
##  NULL
str(pbcDDI$variableMetaData$vallab) # value labels for categorical variables
##  NULL

variableMetaData is generated using the DDIwR package.

  2. All metadata from the DDI document is stored in pbcDDI$ddiObject. The variable metadata in this object is what is used to fill in the variableDetails sheet of the BLLFlow object. Be careful when printing it: there can be a lot of information, held in many nested lists.
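One way to look at a large nested list without flooding the console is to inspect it one level at a time. The sketch below uses base R only. The ddi_like list is a small made-up stand-in, and its element names are illustrative rather than the actual layout of pbcDDI$ddiObject.

```r
# Hypothetical stand-in for a parsed DDI document: a list of nested lists
ddi_like <- list(
  stdyDscr = list(
    citation = list(
      titlStmt = list(titl = "Example study", IDNo = "EX-001")
    )
  ),
  dataDscr = list(
    sex = list(labl = "Sex", catgry = list(`1` = "Male", `2` = "Female")),
    age = list(labl = "Age (years)")
  )
)

# Show only the top level instead of printing the whole object
top_level <- names(ddi_like)
str(ddi_like, max.level = 1)

# Drill down one branch at a time
study_title <- ddi_like$stdyDscr$citation$titlStmt$titl

# Pull a single field out of every variable entry
var_labels <- vapply(ddi_like$dataDscr, function(v) v$labl, character(1))
```

The same pattern (names(), str() with max.level, then $ on one branch) should carry over to any deeply nested object.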

Get the name of the dataset.

cat("Dataset name: \n")
## Dataset name:
## [[1]]
## [1] "\nMayo Clinic Primary Biliary Cirrhosis Data\n"

Example 2: Use DDI to add variable labels and other information for your model variables.

The MSW variables and variableDetails CSV files describe the variables in your model. Pass the contents of the MSW files (variables and variableDetails) and the DDI object (pbcDDI) to the BLLFlow() function to add labels to the BLLFlow object.

# read the MSW files
variables <- read.csv(file.path(getwd(), '../inst/extdata/PBC-variables.csv'))
variableDetails <- read.csv(file.path(getwd(), '../inst/extdata/PBC-variableDetails.csv'))

# create a BLLFlow object and add labels.
pbcModel <- BLLFlow(pbc, variables, variableDetails, pbcDDI)
## Warning: Row 2 : valueLabelStart column has value " male " but DDI value is
## " Male ". Not overwriting
## Warning: Row 2 : from column has value " 1 " but DDI value is " 1:1 ". Not
## overwriting
## Warning: Row 1 : valueLabelStart column has value " female " but DDI value
## is " Female ". Not overwriting
## Warning: Row 1 : from column has value " 2 " but DDI value is " 2:2 ". Not
## overwriting
## Warning: Row 9 : valueLabelStart column has value " No edema " but DDI
## value is " edema despite diuretic therapy ". Not overwriting
## Warning: Row 9 : label column has value " Edema " but DDI value is " edema
## ". Not overwriting
## Warning: Row 9 : from column has value " 0 " but DDI value is " 1:1 ". Not
## overwriting

Metadata is added to variables and variableDetails. Where the DDI file contains the information, DDI metadata is added to startLabel, startType, catStartValue, catStartLabel, startLow and startHigh. If a label or other value was already in the MSW files and differs from the DDI value, the existing value is kept and a warning is issued, as in the output above.

cat("Variable labels in the original MSW\n")
## Variable labels in the original MSW
## [1] Age (years)      Sex              Bilirubin        Albumin         
## [5] Prothrombin time Edema           
## Levels: Age (years) Albumin Bilirubin Edema Prothrombin time Sex
cat("\nNo variable labels from DDI in variable details\n")
## 
## No variable labels from DDI in variable details
##  [1] Sex         Sex         Sex         Age (years) Age (years)
##  [6] Age (years) Age (years) Age (years) Edema       Edema      
## [11] Edema       Edema       Age group   Age group   Age group  
## [16] Age group   Age group  
## Levels: Age (years) Age group Edema Sex
cat("\nDDI variable labels added to variableDetailsWithDDI\n")
## 
## DDI variable labels added to variableDetailsWithDDI
##  [1] Sex         Sex         Sex         Age (years) Age group  
##  [6] Age (years) Age group   Age (years) Age group   Age (years)
## [11] Age group   Age (years) Age group   in years    Edema      
## [16] Edema       Edema       Edema       edema       edema      
## Levels: Age (years) Age group Edema Sex in years edema
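The "Not overwriting" warnings in Example 2 follow a fill-without-overwriting rule. The sketch below illustrates that rule in base R with made-up data. fill_from_ddi() and both data frames are hypothetical and are not part of bllflow, whose own matching of rows and columns is more involved.

```r
# Hypothetical rows from a variable details sheet (NA = not yet filled in)
details <- data.frame(
  variable = c("sex", "sex", "edema"),
  from     = c("2", "1", NA),
  label    = c("female", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical values for the same rows taken from a DDI file
ddi <- data.frame(
  from  = c("2", "1", "0"),
  label = c("Female", "Male", "No edema"),
  stringsAsFactors = FALSE
)

# Fill empty cells from DDI; keep existing cells and warn when they differ
fill_from_ddi <- function(details, ddi, columns) {
  for (col in columns) {
    for (row in seq_len(nrow(details))) {
      current  <- details[row, col]
      incoming <- ddi[row, col]
      if (is.na(current)) {
        details[row, col] <- incoming
      } else if (!is.na(incoming) && current != incoming) {
        warning(sprintf(
          "Row %d : %s column has value \" %s \" but DDI value is \" %s \". Not overwriting",
          row, col, current, incoming
        ), call. = FALSE)
      }
    }
  }
  details
}

# Row 1 keeps "female" (with a warning); the empty cells are filled from DDI
merged <- fill_from_ddi(details, ddi, c("from", "label"))
```

Keeping the existing value and warning, rather than silently replacing it, lets you decide case by case whether your sheet or the DDI file is correct.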

Example 3: Write MSW_DDI to MSW.csv

Use variableDetailsWithDDI to create an MSW variable details sheet, then export it as a CSV file with WriteDDIPopulatedMSW() to help further develop your study protocol.

WriteDDIPopulatedMSW() is a handy function at the beginning of your study. First, identify the variables you need in your study. Then create a ‘bare bones’ MSW variables CSV sheet, import it into R, and call BLLFlow() to add information from the DDI reference file such as labels, variable types and categories.

Alternatively, 1) directly add metadata to variables in a CSV file (see general utility functions in Example X); or, 2) export a CSV with metadata from an R variable list and a DDI file.

WriteDDIPopulatedMSW(pbcModel, "../inst/extdata/", "newMSWvariableDetails.csv")

WriteDDIPopulatedMSW() creates the directory and file if they don’t already exist. If newMSWvariableDetails.csv already exists, it is overwritten. newMSWvariableDetails.csv is created from variableDetailsWithDDI, so first create variableDetailsWithDDI with either BLLFlow() or GetDDIVariables().
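The create-if-missing and overwrite behaviour described above can be sketched in base R as follows. write_details_csv() is a hypothetical helper written for this illustration, not the implementation of WriteDDIPopulatedMSW(), and it writes to a temporary folder so nothing in your project is touched.

```r
# Hypothetical helper: create the folder if needed, then write the CSV
write_details_csv <- function(details, path, file_name) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  out_file <- file.path(path, file_name)
  # write.csv() replaces an existing file of the same name
  write.csv(details, out_file, row.names = FALSE)
  invisible(out_file)
}

details <- data.frame(
  variable = c("age", "sex"),
  label    = c("Age (years)", "Sex"),
  stringsAsFactors = FALSE
)
out_dir <- file.path(tempdir(), "msw-example")

# The first call creates the folder and file; the second overwrites the file
first   <- write_details_csv(details, out_dir, "newMSWvariableDetails.csv")
second  <- write_details_csv(details[1, ], out_dir, "newMSWvariableDetails.csv")
written <- read.csv(second, stringsAsFactors = FALSE)
```

Because the file is overwritten without a prompt, it may be safer to write to a new file name when you have edited the CSV by hand.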

Create a new file (newName.csv) directly from an existing variableDetails.csv.

Example 4: Update BLLFlow with new variables

UpdateMSW() replaces the model's MSW with new variables, new variableDetails, or both. Passing variableDetails also updates populatedVariableDetails.

pbcModel <- UpdateMSW(pbcModel, variables, variableDetails)
## Warning: Row 2 : valueLabelStart column has value " male " but DDI value is
## " Male ". Not overwriting
## Warning: Row 2 : from column has value " 1 " but DDI value is " 1:1 ". Not
## overwriting
## Warning: Row 1 : valueLabelStart column has value " female " but DDI value
## is " Female ". Not overwriting
## Warning: Row 1 : from column has value " 2 " but DDI value is " 2:2 ". Not
## overwriting
## Warning: Row 9 : valueLabelStart column has value " No edema " but DDI
## value is " edema despite diuretic therapy ". Not overwriting
## Warning: Row 9 : label column has value " Edema " but DDI value is " edema
## ". Not overwriting
## Warning: Row 9 : from column has value " 0 " but DDI value is " 1:1 ". Not
## overwriting

Example 5: More general DDI utility functions

As described previously, ReadDDI() returns an object with all DDI metadata from a DDI XML file.

There are two additional DDI utility functions that return the two main parts of the DDI file.

GetDDIDescription() returns a dataframe with DDI ‘header’ information from a DDI file. The header information includes the dataset ID, name, creation date and other important information to preserve the provenance of your study data.

GetDDIVariables() returns a dataframe with variable DDI metadata for a list of variables. DDI variable metadata is typical codebook information such as labels, types, valid and invalid categories, category labels, and descriptive statistics, such as the number of valid responses, the number of missing responses, and minimum, maximum and mean values.
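To show what such codebook information looks like as a dataframe, here is a base R sketch that computes the same kind of descriptive statistics from made-up data. The data, labels and column names are assumptions for illustration. GetDDIVariables() reads these values from the DDI file rather than computing them.

```r
# Hypothetical study data and variable labels
study <- data.frame(
  age  = c(34, 51, NA, 62, 47),
  bili = c(1.4, NA, NA, 0.8, 3.2)
)
labels <- c(age = "Age (years)", bili = "Bilirubin")

# One codebook row per variable
codebook <- do.call(rbind, lapply(names(study), function(v) {
  x <- study[[v]]
  data.frame(
    variable = v,
    label    = labels[[v]],
    type     = class(x)[1],
    nValid   = sum(!is.na(x)),
    nMissing = sum(is.na(x)),
    min      = min(x, na.rm = TRUE),
    max      = max(x, na.rm = TRUE),
    mean     = mean(x, na.rm = TRUE),
    stringsAsFactors = FALSE
  )
}))
```
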

The bllflow workflow creates an ongoing, updated codebook as you perform your study. The DDI metadata for the original data is the starting point for the updated codebook. bllflow then creates and modifies a new instance of the codebook as you perform analyses. In addition, a log is created that describes the data cleaning and data transformation steps.

# Same as calling BLLFlow(pbc, getDDI = c(ddiPath, ddiFile, "description")), but does not add the information to the bllFlow object.
GetDDIDescription(DDIPath = ddiPath, DDIFile = ddiFile)