Recode with Table

Recoding variables is a common task in research that is time consuming and error prone. RecWTable compliments sjmisc::rec() and other recoding methods with the following features:

  • Rules (syntax) for recoding are contained in a data.table (default <- variableDetails), thereby allowing rules for recoding multiple variables.
  • Each row in the data.table comprises one category in a newly transformed variable.
  • The syntax for recoding is the same as sjmisc::rec() – see below – and contained in variableDetails$from
  • Included in each row in the data.table is the variable label, value label and name of the new categorical variable. Defaults:
    • variable labels <- variableDetails$label
    • value (factor) labels <- variableDetails$valueLabel
    • variable units <- variableDetils$units
  • Recoding from a continuous or categorical variabe to new categorical variable.
  • Recoding from a categorical variable to either a new continuous or catecategorical variable (todo).
  • Recoding to a common target variable from different variables in different databases. See below ‘recoding to a common target variable’.
  • VariableDetails labels and other metadata can be imported from an accompanying metadata file (DDI supported).
  • A log of transformations can be printed to console.
  • attr for new variable set. recodeFrom to the original variable. (for discussion but I would like to include).
  • Additional features are available if variableDetails is a bllflow class. That is, variableDetailsis passed as an attribute when creating a bllflow object using bllflow()– see Speciying your model
    • ‘variables’ sheet is used to identify which variables are recoded. variableDetails provides the rules for how to recode.
    • the log of recoding is appended to the bllflow object. see….

Syntax for variable recoding

The syntax is the same syntax as sjmisc.

The recode-pattern, i.e. which new values should replace the old values, is defined using the rec variable in variableDetails data.frame. This argument has a specific “syntax”:

the pairs are obtained from the RecFrom and RecTo columns

  • recode pairs: Each recode pair is row. e.g. rec = "1=1", "2=4", "3=2", "4=3"

  • multiple values: Multiple old values that should be recoded into a new single value may be separated with comma, e.g. “1,2=1”, “3,4=2”`

  • value range: A value range is indicated by a colon, e.g. rec = "1:4=1", "5:8=2" (recodes all values from 1 to 4 into 1, and from 5 to 8 into 2)

  • value range for doubles: For double vectors (with fractional part), all values within the specified range are recoded; e.g. rec = "1:2.5=1", "2.6:3=2" recodes 1 to 2.5 into 1 and 2.6 to 3 into 2, "but 2.55 would not be recoded (since it’s not included in any of the specified ranges).

Different from sjmisc::rec(), there is the ability to define intervals uising interval. The default interval, [,) which corresponds to the common math notation where a closed interval is denoted with a closed bracket [ or ] and an open interval is denoted with an open bracket ( or ). A closed interval is an interval which includes all it limit points. For example, [0,1] means greater than or equal to 0 and less than or equal to 1. For example,from “1:2.5=1”` recodes to the default interval, where any value greater than or equal to 1 and less than 2.5 to the new value 1.

  • “min” and “max”: Minimum and maximum values are indicates by min (or lo) and max (or hi), e.g. from = "min:4=1", "5:max=2" (recodes all values from minimum values of x to 4 into 1, and from 5 to maximum values of x into 2) (for discussion….You can also use min or max to recode a value into the minimum or maximum value of a variable, e.g. rec = "min:4=1" "5:7=max" (recodes all values from minimum values of x to 4 into 1, and from 5 to 7 into the maximum value of x).

  • “else”: All other values, which have not been specified yet, are indicated by else, e.g. rec = "3=1", "1=2", "else=3" (recodes 3 into 1, 1 into 2 and all other values into 3)

  • “copy”: The "else"-token can be combined with "copy", indicating that all remaining, not yet recoded values should stay the same (are copied from the original value), e.g. rec = "3=1; 1=2; else=copy" (recodes 3 into 1, 1 into 2 and all other values like 2, 4 or 5 etc. will not be recoded, but copied.

  • NA’s: NA values are allowed both as old and new value, e.g. rec = "NA=1", "3:5=NA" (recodes all NA into 1, and all values from 3 to 5 into NA in the new variable)

[note from Doug: these descriptors for rev and direct value labels will be modified in our final documentation. Indicated here to identify how bllflow differs from sjmisc::rec().

rev is available in sjmisc, but not available in bllflow. * “rev”: "rev" is a special token that reverses the value order.

Direct value label is avaiable in sjmis. In bllflow, value labelling is performed using ‘valueLabel’ for the corresponding row in the variableDetails data.table. * direct value labelling: Value labels for new values can be assigned inside the recode pattern by writing the value label in square brackets after defining the new value in a recode pair, e.g. rec = "15:30=1 [young aged]; 31:55=2 [middle aged]; 56:max=3 [old aged]"

  • non-captured values: Non-matching values will be set to NA (default), unless captured by the "else"- or "copy"-token.


Notes: 1) If a startVariable is not present in the data, should we consider whether it is an intermediary variable, and then transform (recode) that variable? An intermediary variable is a variable that exists as a variable variableDetails (it is created by variables in data and then used in a transformation with other)

For initial version, give warnings and errors only. If missing intermediary variable: “Error: recoding {variable} requires {startVariable} variable. Recode available in {variableDetails}. Suggest first recoding {startVariable} variable, then try again.”

If missing startVariable (altogether): “Error: missing required starting variable(s): {startingVariable}”

  1. Check to make sure all possible values are recoded. What should we do if values cannot be recoded? See outOfRange = NA

  2. Return error if any required fields are missing, including: startVarible, type, etc.

  3. Log example: “{variable} created from: startVariables. Observations: {n} type: {continuous, factor, etc.} (if continuous:) min: {min}, max: {min}, NA: {n of missing} (if factor:) {n} factors, NA: {n of missing}”