Imagine you want to learn about the causes of disability in Canada.
You could structure your questionnaire in one of two ways. One is to ask every respondent about each possible cause in turn:
- Do you have an activity-limiting long-term condition resulting from injury?
- Do you have an activity-limiting long-term condition resulting from illness or disease?
- Do you have an activity-limiting long-term condition resulting from aging?
- Do you have an activity-limiting long-term condition that existed at birth or is genetic?
- Do you have an activity-limiting long-term condition resulting from an accident at work?
There is an obvious problem with this kind of questionnaire structure: it imposes a large respondent burden. You’re spending time asking people questions that are probably not relevant to their situation. They will get annoyed, and may slam down the phone.
An alternative is to use a “hierarchical” questionnaire structure, as is done in the Canadian Community Health Survey. First, ask a question to identify whether or not a person has an activity-limiting long-term condition. Next, ask only the people who have an activity-limiting long-term condition about the cause of that condition.
The hierarchical design makes life much easier for respondents. Yet it means that care needs to be taken when analyzing the survey results. For example, suppose the question of interest is, “What percentage of the population has long-term activity-limiting conditions that existed at birth or were genetic?” To answer that question, we need to combine the answers to the two questions: “Does a long-term physical condition or mental condition or health problem, reduce the amount or the kind of activity you can do?” and “What is the cause of that condition?”
People who have no long-term conditions at all clearly have no conditions that existed at birth or were genetic, so we can code them as zeros:
gen disabledfrombirth = .
replace disabledfrombirth = 0 if (RAC_1==3 & RAC_2A==3 & (RAC_2B1==3 | RAC_2B1==4) & (RAC_2B2==3 | RAC_2B2==4) & RAC_2C==3)
The long "if" condition is necessary because there are five questions asking about the existence of long-term conditions. If the answer to all five questions is “never” (3) or "not applicable" (4), we can be sure that the person was not disabled from birth. There is probably a quicker way of identifying the people who have no long-term conditions, but I haven't worked it out yet.
Then the next step is to separate out people who have long-term conditions resulting from injury, illness, ageing, etc.:
replace disabledfrombirth = 0 if (RACG5<=3 | RACG5==5 | RACG5==6)
Note that | stands for “or”.
And finally, we identify the people whose long-term condition is genetic or existed from birth:
replace disabledfrombirth = 1 if RACG5==4
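The three steps can be tried out on a small made-up dataset before running them on the real file. What follows is only a sketch: the seven rows are invented, not CCHS records, and the chains of “or” conditions are swapped for Stata's inlist() function, which tests whether a value appears in a list. It assumes, as above, that "not applicable" responses to RACG5 are stored as missing (.).

```stata
* Toy data: invented rows, NOT real CCHS responses
clear
input RAC_1 RAC_2A RAC_2B1 RAC_2B2 RAC_2C RACG5
3 3 4 4 3 .
3 3 3 3 3 .
1 3 4 4 3 4
2 3 3 3 3 1
3 1 4 4 3 3
3 3 4 4 2 6
7 3 4 4 3 .
end

gen disabledfrombirth = .

* Step 1: no long-term condition reported on any of the five questions
replace disabledfrombirth = 0 if RAC_1==3 & RAC_2A==3 & inlist(RAC_2B1,3,4) & inlist(RAC_2B2,3,4) & RAC_2C==3

* Step 2: a long-term condition with some other cause
replace disabledfrombirth = 0 if inlist(RACG5,1,2,3,5,6)

* Step 3: a long-term condition that is genetic or existed at birth
replace disabledfrombirth = 1 if RACG5==4

list, clean
tab disabledfrombirth, missing
```

On these invented rows, the first two people are coded 0 in step 1, the third is coded 1, the next three are coded 0 in step 2, and the last one (a "don't know" on RAC_1 with no recorded cause) stays missing.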
Does it make a difference?
You bet. Simply looking at those who have a long-term condition
tab RACG5 [aweight = WTS_M]
tells us that about 10 percent of these conditions have existed from birth or are genetic. (The “aweight” in square brackets is there to adjust for the fact that some people are more likely to be sampled by the CCHS than others. It’s important when calculating means: without weights, the percentage of conditions that existed from birth or are genetic drops to 9.44 percent. Weighting is not so important when doing regression analysis.)
Cause of health problem, 2007-8 CCHS, weighted results

Cause                 |      Freq.    Percent
----------------------+----------------------
INJURY                |   9,649.55      20.87
DISEASE OR ILLNESS    |  13,940.10      30.15
AGEING                |   9,037.27      19.55
EXISTED AT BIRTH      |   4,825.44      10.44
WORK CONDITIONS       |   4,462.93       9.65
OTHER                 |   4,318.71       9.34
----------------------+----------------------
Total                 |     46,234        100
But if we look at the population as a whole:
tab disabledfrombirth [aweight = WTS_M]
A smaller percentage – 3.11 percent – experience long-term conditions that existed from birth.
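The gap between the two percentages comes entirely from the denominator, which a toy example makes easy to see. The sketch below uses five invented people with invented weights (not CCHS records) and computes the weighted share of "existed at birth" conditions twice: once among people with a recorded cause, and once among everyone.

```stata
* Toy data: invented rows and weights, NOT real CCHS records
clear
input RACG5 disabledfrombirth WTS_M
. 0 300
. 0 400
4 1 100
1 0 150
2 0 50
end

* Share among people who report a long-term condition (RACG5 not missing)
summarize disabledfrombirth if !missing(RACG5) [aweight = WTS_M]
display "Among those with a condition: " %5.1f 100*r(mean) " percent"

* Share in the whole population, including the recoded zeros
summarize disabledfrombirth [aweight = WTS_M]
display "In the whole population:      " %5.1f 100*r(mean) " percent"
```

With these made-up numbers the first share is 100/300, or 33.3 percent, and the second is 100/1000, or 10.0 percent. The numerator is the same in both cases; only the group being divided by changes.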
In class last week we had a debate about recoding data. Most of the students were somewhat uncomfortable with the idea of taking a “missing” value and recoding it to a “zero.”
Yet with a hierarchical survey design, a missing response is not the same as missing information. The information we are looking for may be gathered in another question. In this case, the right thing to do is to use as much information as possible by recoding data as necessary.
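One way to keep that distinction visible in the data, offered as a suggestion rather than as what was done above, is to use Stata's extended missing values (.a through .z). In the sketch below the variable name cause and the codes 96 ("not applicable") and 99 ("not stated") are assumptions made for illustration; the actual codes have to come from the codebook.

```stata
* Toy data: invented values; 96 and 99 are ASSUMED codes, check the codebook
clear
input cause
4
1
96
99
96
end

* Keep "not applicable" (skipped by design) apart from "not stated"
replace cause = .a if cause == 96
replace cause = .b if cause == 99

gen frombirth = .
* Skipped by design: the information is held in the earlier question
replace frombirth = 0 if cause == .a
* Asked, with a valid cause other than "existed at birth"
replace frombirth = 0 if !missing(cause) & cause != 4
replace frombirth = 1 if cause == 4

tab frombirth, missing
```

Here the two skipped respondents become zeros, while the one who was asked but gave no usable answer stays missing, which is the difference between a missing response and missing information.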
Frances: interesting. Is this similar to Item Response Theory?
I know that IRT is used to compress what would normally be hundreds of responses into a couple dozen. Same issue: massively reduces the response burden on the subject.
Posted by: Shangwen | October 11, 2011 at 10:08 PM
Shangwen, my first impression is no, but I really don't know what kind of thinking goes into the design of huge surveys like the Canadian Community Health Survey. With over a hundred thousand respondents and data collection every two or three years it is a truly impressive dataset - I think people are just starting to realize what an amazing resource it is. Of course, according to the terms of the Data Liberation Initiative, I can only use it for teaching and research (this post counts as teaching) - so even if it would take me 2 minutes to answer some question that you as a health professional are interested in, I shouldn't do so.
Posted by: Frances Woolley | October 11, 2011 at 10:23 PM
That language is not pretty.
Magic numbers all over the place. Inconsistent ones at that.
If RAC_2B1 and RAC_2B2 do not have valid answers greater than 4 (which I think I can infer from the other answers) then you could use >2 instead of the OR.
Upon reflection that might be a useful idiom. Assuming all questions that have never/not applicable use numbers greater than MAX_VALID then coding for the combination with >MAX_VALID (whatever that is) would be a consistent way of testing for that (I expect) common test.
Pace Emerson, consistency in these matters is important. It reduces errors.
Posted by: Jim Rootham | October 12, 2011 at 01:31 AM
Jim - I'm not sure what you mean by "Magic numbers all over the place. Inconsistent ones at that."
The quick way to eliminate that long if statement would be to convert the RACG5 "not applicable" responses from "missing" to "valid". Then everyone with a valid 'not applicable' could be coded as not having a long-term condition.
The reason that I wrote that if statement the way I did is that RACG5 was asked to people who had invalid responses to the earlier questions. The codebook says that RACG5 was asked to:
Respondents who answered RAC_1 = (1, 2, 7 or 8) or RAC_2A = (1, 2, 7 or 8) or RAC_2B1 = (1, 2, 7 or 8) or RAC_2B2 = (1, 2, 7 or 8) or RAC_2C = (1, 2, 7 or 8)
Can you think of a quicker way to recode that particular statement?
Posted by: Frances Woolley | October 12, 2011 at 06:40 AM
In computer programming (which is what you are up to here) a magic number is a number used in code that has no obvious meaning (one description was anything but 0, 1, and sometimes 2). In this case just looking at the code fragment cannot tell you what is intended; you must look at the code book at the same time.
As far as inconsistency goes, for example, sometimes 4 is one of the valid answers for a question, sometimes it means "not applicable".
Does the language have set membership expressions? I think it should, it would clarify things.
Outside of that, the only shortening I can see is what I suggested above, and care should be taken to write statements that are comprehensible after the fact. A long statement that matches the codebook description is probably better than a short one that does not.
Posted by: Jim Rootham | October 12, 2011 at 08:00 PM
Using "magic numbers" when dealing with survey data is a very dangerous thing for a researcher. The datasets have codebooks, so an analyst should be handling the valid responses explicitly, rather than trying to infer shortcuts. Set membership functions like the "in" operator in SAS and SQL definitely make it easier to handle situations like this where there's more than one in-scope value for a given variable.
Posted by: gordon | October 12, 2011 at 11:52 PM