Recode with Table
Recoding variables is a common research task that is time-consuming and error-prone. RecWTable complements `sjmisc::rec()` and other recoding methods with the following features:
- Recoding rules are defined in a table (`variableDetails`), thereby allowing rules for recoding multiple variables.
- `attr` for new variable set.
- `recodeFrom` to the original variable (for discussion, but I would like to include).

`variableDetails` is a bllflow class. That is, `variableDetails` is passed as an attribute when creating a bllflow object using `bllflow()` (see Specifying your model).
`variableDetails` provides the rules for how to recode. The syntax is the same as the sjmisc syntax.

The recode pattern, i.e. which new values should replace the old values, is defined using the `rec` variable in the `variableDetails` data.frame. This argument has a specific "syntax":
- the pairs are obtained from the `RecFrom` and `RecTo` columns
- recode pairs: Each recode pair is a row, e.g. `rec = "1=1", "2=4", "3=2", "4=3"`
- multiple values: Multiple old values that should be recoded into a single new value may be separated with a comma, e.g. `"1,2=1", "3,4=2"`
- value range: A value range is indicated by a colon, e.g. `rec = "1:4=1", "5:8=2"` (recodes all values from 1 to 4 into 1, and from 5 to 8 into 2)
- value range for doubles: For double vectors (with a fractional part), all values within the specified range are recoded; e.g. `rec = "1:2.5=1", "2.6:3=2"` recodes 1 to 2.5 into 1 and 2.6 to 3 into 2, but 2.55 would not be recoded (since it is not included in any of the specified ranges).
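The rules in the list above can be sketched in base R. This is only an illustration, not the bllflow implementation: the helper names `matches_pattern()` and `recode_pairs()` are hypothetical, the table is assumed to hold one recode pair per row in `RecFrom` and `RecTo`, and ranges are read as including both limits, as described in the list.

```r
# Hypothetical helper: TRUE for each element of x matched by one RecFrom pattern.
matches_pattern <- function(x, pattern) {
  if (grepl(":", pattern, fixed = TRUE)) {
    # value range "a:b", read here as including both limits (assumption)
    bounds <- as.numeric(strsplit(pattern, ":", fixed = TRUE)[[1]])
    !is.na(x) & x >= bounds[1] & x <= bounds[2]
  } else {
    # single value, or several old values separated by commas
    x %in% as.numeric(strsplit(pattern, ",", fixed = TRUE)[[1]])
  }
}

# Hypothetical helper: apply every row of the table; unmatched values stay NA.
recode_pairs <- function(x, details) {
  out <- rep(NA_real_, length(x))
  for (i in seq_len(nrow(details))) {
    out[matches_pattern(x, details$RecFrom[i])] <- details$RecTo[i]
  }
  out
}

# "1,2=1", "3,4=2"
grouped <- data.frame(RecFrom = c("1,2", "3,4"), RecTo = c(1, 2),
                      stringsAsFactors = FALSE)
recode_pairs(c(1, 2, 3, 4, 5), grouped)          # 1 1 2 2 NA

# "1:2.5=1", "2.6:3=2": 2.55 falls between the two ranges
doubles <- data.frame(RecFrom = c("1:2.5", "2.6:3"), RecTo = c(1, 2),
                      stringsAsFactors = FALSE)
recode_pairs(c(1, 2.5, 2.55, 2.6, 3), doubles)   # 1 1 NA 2 2
```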
Unlike `sjmisc::rec()`, intervals can be defined using `interval`. The default interval is `[,)`, which follows common math notation: a closed interval is denoted with a square bracket, `[` or `]`, and an open interval is denoted with a round bracket, `(` or `)`. A closed interval is an interval which includes all of its limit points. For example, `[0,1]` means greater than or equal to 0 and less than or equal to 1. With the default interval, `"1:2.5=1"` recodes any value greater than or equal to 1 and less than 2.5 to the new value 1.
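To make the bracket notation concrete, here is a small base R sketch of interval membership. The function name `in_interval()` and its arguments are hypothetical; only the `interval` notation and its default `[,)` come from the text above.

```r
# Hypothetical helper: TRUE where x lies between lower and upper.
# "[" / "]" include the limit, "(" / ")" exclude it.
in_interval <- function(x, lower, upper, interval = "[,)") {
  left_closed  <- startsWith(interval, "[")
  right_closed <- endsWith(interval, "]")
  above <- if (left_closed) x >= lower else x > lower
  below <- if (right_closed) x <= upper else x < upper
  !is.na(x) & above & below
}

x <- c(0.5, 1, 2, 2.5, 3)
in_interval(x, 1, 2.5)          # default "[,)": FALSE TRUE TRUE FALSE FALSE
in_interval(x, 1, 2.5, "[,]")   # closed:        FALSE TRUE TRUE TRUE FALSE
```

Under the default, the upper limit 2.5 is excluded, so it would need to be picked up by the next interval or another rule.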
- "min" and "max": Minimum and maximum values are indicated by `min` (or `lo`) and `max` (or `hi`), e.g. `from = "min:4=1", "5:max=2"` (recodes all values from the minimum value of x to 4 into 1, and from 5 to the maximum value of x into 2). For discussion: you can also use `min` or `max` to recode a value into the minimum or maximum value of a variable, e.g. `rec = "min:4=1", "5:7=max"` (recodes all values from the minimum value of x to 4 into 1, and from 5 to 7 into the maximum value of x).
- "else": All other values, which have not been specified yet, are indicated by `else`, e.g. `rec = "3=1", "1=2", "else=3"` (recodes 3 into 1, 1 into 2 and all other values into 3)
- "copy": The `"else"`-token can be combined with `"copy"`, indicating that all remaining, not yet recoded values should stay the same (are copied from the original value), e.g. `rec = "3=1; 1=2; else=copy"` (recodes 3 into 1, 1 into 2, and all other values like 2, 4 or 5 etc. will not be recoded, but copied).
- NA's: `NA` values are allowed both as old and new value, e.g. `rec = "NA=1", "3:5=NA"` (recodes all `NA` into 1, and all values from 3 to 5 into `NA` in the new variable)
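The special tokens in this list can be sketched in base R as follows. This is an illustration under stated assumptions, not the bllflow implementation: `resolve_token()` and `recode_tokens()` are hypothetical names, rules are passed as parallel `from` and `to` character vectors, ranges include both limits, the first matching rule wins, and `"else"` is assumed to claim only non-missing values.

```r
# Hypothetical helper: turn a token into a number, using the observed data
# for "min"/"lo" and "max"/"hi".
resolve_token <- function(token, x) {
  switch(token,
    min = , lo = min(x, na.rm = TRUE),
    max = , hi = max(x, na.rm = TRUE),
    "NA" = NA_real_,
    as.numeric(token)
  )
}

# Hypothetical helper: apply the rules in order; values no rule claims stay NA.
recode_tokens <- function(x, from, to) {
  out  <- rep(NA_real_, length(x))
  done <- rep(FALSE, length(x))   # elements already claimed by an earlier rule
  for (i in seq_along(from)) {
    if (from[i] == "else") {
      hit <- !done & !is.na(x)    # assumption: "else" skips missing values
    } else if (from[i] == "NA") {
      hit <- is.na(x) & !done
    } else {
      limits <- strsplit(from[i], ":", fixed = TRUE)[[1]]
      lower  <- resolve_token(limits[1], x)
      upper  <- resolve_token(limits[length(limits)], x)
      hit <- !is.na(x) & x >= lower & x <= upper & !done
    }
    out[hit] <- if (to[i] == "copy") x[hit] else resolve_token(to[i], x)
    done <- done | hit
  }
  out
}

# "3=1", "1=2", "else=copy"
recode_tokens(c(1, 2, 3, 4, 5, NA), c("3", "1", "else"), c("1", "2", "copy"))
# 2 2 1 4 5 NA

# "min:4=1", "5:max=2"
recode_tokens(c(1, 3, 4, 5, 9), c("min:4", "5:max"), c("1", "2"))
# 1 1 1 2 2

# "NA=1", "3:5=NA"
recode_tokens(c(NA, 2, 4), c("NA", "3:5"), c("1", "NA"))
# 1 NA NA
```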
[Note from Doug: these descriptors for `rev` and direct value labels will be modified in our final documentation. They are indicated here to identify how bllflow differs from `sjmisc::rec()`.]

`rev` is available in sjmisc, but not available in bllflow.

- "rev": `"rev"` is a special token that reverses the value order.

Direct value labelling is available in sjmisc. In bllflow, value labelling is performed using `valueLabel` for the corresponding row in the `variableDetails` data.table.

- direct value labelling: Value labels for new values can be assigned inside the recode pattern by writing the value label in square brackets after defining the new value in a recode pair, e.g. `rec = "15:30=1 [young aged]; 31:55=2 [middle aged]; 56:max=3 [old aged]"`
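As a rough illustration of labelling from the table rather than from the recode pattern, the sketch below attaches `valueLabel` entries to already recoded values with base R's `factor()`. The table layout (a `RecTo` column next to `valueLabel`) is an assumption made for this example; bllflow may apply the labels differently.

```r
# Assumed layout: one row per new value, with its label in valueLabel.
details <- data.frame(
  RecTo      = c(1, 2, 3),
  valueLabel = c("young aged", "middle aged", "old aged"),
  stringsAsFactors = FALSE
)

recoded  <- c(1, 3, 2, NA)   # values after recoding
labelled <- factor(recoded, levels = details$RecTo, labels = details$valueLabel)
as.character(labelled)       # "young aged" "old aged" "middle aged" NA
```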
Values that match no recode pair are set to `NA` (default), unless captured by the `"else"`- or `"copy"`-token.

---
Notes:

1) If a `startVariable` is not present in the data, should we consider whether it is an intermediary variable, and then transform (recode) that variable? An intermediary variable is a variable that exists as a variable in `variableDetails` (it is created from variables in `data` and then used in a transformation with other variables).
For the initial version, give warnings and errors only.

- If an intermediary variable is missing: "Error: recoding {variable} requires {startVariable} variable. Recode available in {variableDetails}. Suggest first recoding {startVariable} variable, then try again."
- If the startVariable is missing altogether: "Error: missing required starting variable(s): {startingVariable}"
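The two checks could look roughly like this in base R. The function name `check_start_variable()` is hypothetical, and the sketch assumes the table has a `variable` column listing every variable it can create; the message text follows the wording proposed above.

```r
# Hypothetical check, run before recoding `variable` from `startVariable`.
check_start_variable <- function(data, variableDetails, variable, startVariable) {
  if (startVariable %in% names(data)) {
    return(invisible(TRUE))
  }
  # Not in the data, but the table knows how to create it: intermediary variable.
  if (startVariable %in% variableDetails$variable) {
    stop(sprintf(
      "recoding %s requires %s variable. Recode available in variableDetails. Suggest first recoding %s variable, then try again.",
      variable, startVariable, startVariable
    ), call. = FALSE)
  }
  stop(sprintf("missing required starting variable(s): %s", startVariable),
       call. = FALSE)
}

# Illustrative data and table (made-up names).
survey  <- data.frame(height = c(1.7, 1.8), weight = c(70, 82))
details <- data.frame(variable = c("bmi", "bmi_cat"), stringsAsFactors = FALSE)

check_start_variable(survey, details, "bmi", "height")   # passes silently
# check_start_variable(survey, details, "bmi_cat", "bmi")
#   stops: bmi is an intermediary variable that has not been created yet
# check_start_variable(survey, details, "income_cat", "income")
#   stops: income is missing altogether
```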
- Check to make sure all possible values are recoded. What should we do if values cannot be recoded? See `outOfRange = NA`.
- Return an error if any required fields are missing, including: startVariable, type, etc.
- Log example: "{variable} created from: startVariables. Observations: {n} type: {continuous, factor, etc.} (if continuous:) min: {min}, max: {max}, NA: {n of missing} (if factor:) {n} factors, NA: {n of missing}"
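A log line in that shape could be assembled as sketched below. The function name `log_recode()` and its arguments are hypothetical, and the sketch treats any non-factor vector as continuous.

```r
# Hypothetical helper: build the log message for a newly created variable.
log_recode <- function(variable, startVariables, x) {
  header <- sprintf("%s created from: %s. Observations: %d",
                    variable, paste(startVariables, collapse = ", "), length(x))
  n_missing <- sum(is.na(x))
  if (is.factor(x)) {
    detail <- sprintf("type: factor %d factors, NA: %d", nlevels(x), n_missing)
  } else {
    detail <- sprintf("type: continuous min: %s, max: %s, NA: %d",
                      format(min(x, na.rm = TRUE)),
                      format(max(x, na.rm = TRUE)),
                      n_missing)
  }
  paste(header, detail)
}

log_recode("age_cat", "age", factor(c("young", "old", NA, "young")))
# "age_cat created from: age. Observations: 4 type: factor 2 factors, NA: 1"

log_recode("bmi", c("height", "weight"), c(18, 24.5, NA))
# "bmi created from: height, weight. Observations: 3 type: continuous min: 18, max: 24.5, NA: 1"
```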