Frances: interesting. Is this similar to Item Response Theory?

I know that IRT is used to compress what would normally be hundreds of responses into a couple dozen. Same issue: massively reduces the response burden on the subject.

Shangwen, my first impression is no, but I really don't know what kind of thinking goes into the design of huge surveys like the Canadian Community Health Survey. With over a hundred thousand respondents and data collection every two or three years it is a truly impressive dataset - I think people are just starting to realize what an amazing resource it is. Of course, according to the terms of the Data Liberation Initiative, I can only use it for teaching and research (this post counts as teaching) - so even if it would take me 2 minutes to answer some question that you as a health professional are interested in, I shouldn't do so.

That language is not pretty.

Magic numbers all over the place. Inconsistent ones at that.

If RAC2B1 and RAC2B2 do not have valid answers greater than 4 (which I think I can infer from the other answers) then you could use >2 instead of the OR.

Upon reflection, that might be a useful idiom. Assuming every question with a never/not applicable option codes it with a number greater than MAX_VALID, testing for >MAX_VALID (whatever that is) would be a consistent way of writing what I expect is a common check.
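To make the idiom concrete, here is a minimal sketch in Python rather than in the language used in the post. The threshold values are hypothetical placeholders, not taken from the codebook, and the whole thing only works if every reserved code (never, not applicable, don't know, refused) really does sit above the highest valid answer for that question.

```python
# Hypothetical thresholds: the highest valid answer code for each question.
# These numbers are placeholders; the real ones must come from the codebook.
MAX_VALID = {
    "RAC_2B1": 3,
    "RAC_2B2": 3,
}

def is_reserved(question, code):
    """Return True if the code lies above the highest valid answer.

    Assumes all reserved codes for the question are greater than
    MAX_VALID[question]; that assumption has to be checked per question.
    """
    return code > MAX_VALID[question]

# One comparison replaces a chain of "== 6 or == 7 or == 8" style tests.
print(is_reserved("RAC_2B1", 7))  # a reserved code under the assumed threshold
print(is_reserved("RAC_2B1", 2))  # an ordinary answer
```

Keeping the threshold per question matters because the same number can be a valid answer on one question and a reserved code on another.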

Pace Emerson, consistency in these matters is important. It reduces errors.

Jim - I'm not sure what you mean by "Magic numbers all over the place. Inconsistent ones at that."

The quick way to eliminate that long if statement would be to convert the RACG5 "not applicable" responses from "missing" to "valid". Then everyone with a valid 'not applicable' could be coded as not having a long-term condition.

The reason that I wrote that if statement the way I did is that RACG5 was asked of people who had invalid responses to the earlier questions. The codebook says that RACG5 was asked of:

Respondents who answered RAC_1 = (1, 2, 7 or 8) or RAC_2A = (1, 2, 7 or 8) or RAC_2B1 = (1, 2, 7 or 8) or RAC_2B2 = (1, 2, 7 or 8) or RAC_2C = (1, 2, 7 or 8)

Can you think of a quicker way to recode that particular statement?

In computer programming (which is what you are up to here) a magic number is a number used in code that has no obvious meaning (one description was anything but 0, 1, and sometimes 2). In this case, just looking at the code fragment cannot tell you what is intended; you must look at the codebook at the same time.

As far as inconsistency goes, for example, sometimes 4 is one of the valid answers for a question, sometimes it means "not applicable".

Does the language have set membership expressions? I think it should, it would clarify things.
Outside of that, the only shortening I can see is what I suggested above, and care should be taken to write statements that are comprehensible after the fact. A long statement that matches the codebook description is probably better than a short one that does not.
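For what a set membership version might look like, here is a rough sketch in Python (not the language used in the post). The variable names and the codes 1, 2, 7 and 8 come straight from the codebook line quoted above; the function name, the constant names, and the idea of holding one respondent as a dict are my own assumptions for illustration.

```python
# Codes that put a respondent in the RACG5 universe, per the codebook quote.
RACG5_TRIGGER_CODES = frozenset({1, 2, 7, 8})

# The questions the codebook lists in that universe statement.
RACG5_FILTER_QUESTIONS = ("RAC_1", "RAC_2A", "RAC_2B1", "RAC_2B2", "RAC_2C")

def was_asked_racg5(response):
    """Return True if any filter question has one of the trigger codes.

    `response` is assumed to be a dict mapping variable names to integer
    codes for a single respondent (a hypothetical layout).
    """
    return any(
        response.get(question) in RACG5_TRIGGER_CODES
        for question in RACG5_FILTER_QUESTIONS
    )

# Example: only RAC_2A carries a trigger code, which is enough.
respondent = {"RAC_1": 3, "RAC_2A": 7, "RAC_2B1": 3, "RAC_2B2": 3, "RAC_2C": 3}
print(was_asked_racg5(respondent))
```

It is not really shorter than the long if statement, but it reads almost word for word like the codebook description, which is the point.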

Using "magic numbers" when dealing with survey data is a very dangerous thing for a researcher. The datasets have codebooks, so an analyst should be handling the valid responses explicitly, rather than trying to infer shortcuts. Set membership functions like the "in" operator in SAS and SQL definitely make it easier to handle situations like this where there's more than one in-scope value for a given variable.
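As an illustration of handling responses explicitly, here is a small Python sketch. The code-to-meaning table is a made-up example for a single question, not copied from any codebook, and the names `Kind`, `EXAMPLE_CODEBOOK`, and `classify` are hypothetical. The idea is that every code is looked up by name, and a code the table does not list raises an error instead of being silently lumped in with something else.

```python
from enum import Enum

class Kind(Enum):
    VALID = "valid"
    NOT_APPLICABLE = "not applicable"
    MISSING = "missing"

# Hypothetical codebook entry for one question. Real tables differ by
# question and must be transcribed from the dataset's codebook.
EXAMPLE_CODEBOOK = {
    1: Kind.VALID,
    2: Kind.VALID,
    3: Kind.VALID,
    6: Kind.NOT_APPLICABLE,
    7: Kind.MISSING,
    8: Kind.MISSING,
    9: Kind.MISSING,
}

def classify(code, codebook=EXAMPLE_CODEBOOK):
    """Look up what a response code means; fail loudly on unknown codes."""
    try:
        return codebook[code]
    except KeyError:
        raise ValueError(f"code {code!r} is not in the codebook") from None

print(classify(2))  # Kind.VALID
print(classify(6))  # Kind.NOT_APPLICABLE
```

Failing on an unlisted code is a deliberate choice here: a shortcut such as a greater-than test would quietly accept it.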
