When playing around with some data the other day, I noticed something odd.
I was trying to figure out where my respondents lived, so I typed "tab PROV" and was surprised to see that about four percent lived in Newfoundland. That's the number in the distribution of respondents column in the table below.
Distribution of respondents by province: unweighted and weighted, January 2009 Labour Force Survey. Calculated by author.
Province | Distribution of respondents (%) | Weighted distribution (%) | Relative over/under sampling |
Newfoundland | 3.89 | 1.58 | 2.46 |
Prince Edward Island | 2.55 | 0.43 | 5.93 |
Nova Scotia | 5.24 | 2.84 | 1.85 |
New Brunswick | 5.12 | 2.29 | 2.24 |
Québec | 17.82 | 23.61 | 0.75 |
Ontario | 29.75 | 39.04 | 0.76 |
Manitoba | 6.85 | 3.38 | 2.03 |
Saskatchewan | 6.88 | 2.86 | 2.41 |
Alberta | 10.13 | 10.43 | 0.97 |
British Columbia | 11.76 | 13.55 | 0.87 |
It's not something a good economist would ever notice, because a good economist would always remember to add [fweight=FWEIGHT] to the tab command. The fweight option does exactly what it sounds like it does: it weights some respondents more than others. Respondents from Quebec get a high weight because there are relatively few of them in the survey compared with their share of the population; respondents from Prince Edward Island get a low weight because there are lots and lots of them in the survey. "tab PROV [fweight=FWEIGHT]" produces the distribution shown in the second column of the table, which appears fairly representative of the Canadian population.
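The third column of the table above appears to be simply the first column divided by the second: a value above 1 means a province is over-sampled relative to its weighted share, and a value below 1 means it is under-sampled. Here is a minimal Python sketch of that arithmetic, using the percentages from the table; the function name `oversampling_ratio` is my own label, not anything from the LFS files.

```python
# Shares (%) copied from the table: (unweighted respondents, weighted)
shares = {
    "Newfoundland": (3.89, 1.58),
    "Prince Edward Island": (2.55, 0.43),
    "Nova Scotia": (5.24, 2.84),
    "New Brunswick": (5.12, 2.29),
    "Québec": (17.82, 23.61),
    "Ontario": (29.75, 39.04),
    "Manitoba": (6.85, 3.38),
    "Saskatchewan": (6.88, 2.86),
    "Alberta": (10.13, 10.43),
    "British Columbia": (11.76, 13.55),
}


def oversampling_ratio(unweighted, weighted):
    """Share of respondents divided by weighted (population) share."""
    return unweighted / weighted


# Round to two decimals, as in the table's third column.
ratios = {
    prov: round(oversampling_ratio(u, w), 2) for prov, (u, w) in shares.items()
}

for prov, ratio in ratios.items():
    print(f"{prov:22s} {ratio:.2f}")
```

Read this way, a PEI respondent is roughly six times as likely to turn up in the file as population share alone would suggest, while Ontario and Quebec respondents turn up only about three-quarters as often.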
My first thought was "wow, that's pretty extreme non-response bias." But in fact the Labour Force Survey hounds people so intensely that it has a 90 percent response rate. ("Interviewers are instructed to make all reasonable attempts to obtain LFS interviews with members of eligible households.")
So what's going on? According to the LFS documentation:
The sample is allocated to provinces and strata within provinces in the way that best meets the need for reliable estimates at various geographic levels....The following guidelines were used in sample allocation: Canada and provinces: estimates of unemployment should not have a coefficient of variation (standard error relative to the estimate) greater than 2 percent for Canada, and 4 to 7 percent for the provinces.
Prince Edward Island accounts for about 0.5 percent of the Canadian population. If only 0.5 percent of Labour Force Survey respondents came from PEI, there would be so few respondents it would be hard to estimate, with any degree of accuracy, PEI's unemployment rate. And since important policy parameters such as the duration of Employment Insurance benefits depend upon regional unemployment rates, policy makers need to know. Fair enough.
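To get a feel for the numbers, here is a rough Python sketch of how the coefficient of variation of an estimated unemployment rate shrinks as the PEI sample grows. Everything beyond the two PEI shares from the table is an assumption made for illustration: the total sample of 100,000 labour force participants, the 8 percent unemployment rate, and the use of the simple-random-sampling formula, which ignores the stratified, clustered design the LFS actually uses.

```python
import math

TOTAL_SAMPLE = 100_000     # hypothetical number of labour force participants
UNEMPLOYMENT_RATE = 0.08   # hypothetical rate, for illustration only


def coefficient_of_variation(p, n):
    """CV of an estimated proportion p from a simple random sample of size n.

    Standard error is sqrt(p * (1 - p) / n); dividing by p gives the CV.
    """
    return math.sqrt(p * (1 - p) / n) / p


# PEI's share of the sample under two allocations (percentages from the table).
n_proportional = round(TOTAL_SAMPLE * 0.43 / 100)  # allocation by population share
n_actual_share = round(TOTAL_SAMPLE * 2.55 / 100)  # share actually observed

cv_proportional = coefficient_of_variation(UNEMPLOYMENT_RATE, n_proportional)
cv_actual_share = coefficient_of_variation(UNEMPLOYMENT_RATE, n_actual_share)

print(f"n = {n_proportional}: CV = {cv_proportional:.1%}")
print(f"n = {n_actual_share}: CV = {cv_actual_share:.1%}")
```

Because the CV falls with the square root of the sample size, multiplying the PEI sample by about six cuts the CV by a factor of roughly two and a half, whatever the assumed totals.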
I have used the Labour Force Survey as an example here because I have the files on hand. The survey I am actually working with is the Canadian Financial Capability Survey, but its distribution of respondents is similar to that of the LFS.
I am using the CFCS for a study on gender and savings. I want to ask questions such as: How much are single parents saving for retirement? How is the high price of housing in our major cities changing Canadians' wealth portfolios? Are new immigrants accessing Registered Education Savings Plans? Do households where women's earnings account for a larger share of the household income save more? Do households where men make the financial decisions invest in riskier assets?
I have nothing against Prince Edward Island. But from a wealth point of view, its inhabitants are totally boring. The distribution of wealth is highly skewed, and the distribution of financial and business assets even more so, with a relatively small number of people controlling a relatively large fraction of Canada's financial and business assets. To understand patterns of business and financial asset holdings one needs to have a reasonably sized sample of rich people, and the extremes of wealth, like the extremes of poverty, are found in Canada's big cities.
Stephen Gordon has been sounding the alarm about the profound impacts of demographic change for years. I do not believe that individual savings are a panacea that will make demographic challenges go away. But I do believe that, in order to understand what lies ahead, we need to have a pretty good idea of how much young and middle-aged Canadians are saving now.
Unfortunately, our obsession with getting accurate provincial-level estimates compromises our ability to produce accurate national-level estimates, or accurate estimates for important or vulnerable population sub-groups.
There are not many topics that fit into the categories "Canada-Politics" and "Econometrics", but this is one of them.
Can you not do the weighting thing for the CFCS?
The other option for PEI is to do what I've seen journalists in particular do with increasing frequency these days: do up rankings of things like Canada's best premiers and have only nine in the list.
And what's with Newfoundland in the title if PEI is the star of this rant? Afraid we wouldn't give you enough hits?
Posted by: Jim Sentance | January 27, 2011 at 12:53 PM
It's a snow day here by the way.
Posted by: Jim Sentance | January 27, 2011 at 12:54 PM
Jim: you can't pick on PEI; it's not fair!
Wish I could remember stats. "..estimates of unemployment should not have a coefficient of variation (standard error relative to the estimate) greater than 2 percent for Canada, and 4 to 7 percent for the provinces."
I don't get the "4 to 7 percent". Would that mean 4% for Ontario and 7% for PEI? Would that fit with the 0.76 vs 5.93?
Posted by: Nick Rowe | January 27, 2011 at 02:20 PM
Frances:
I would have thought pweight rather than fweight. These are sampling weights (reciprocals of sampling probabilities), not frequency weights. On the other hand, it doesn't make any difference if you're not computing standard errors.
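To make that concrete, here is a minimal Python sketch (not Stata) with made-up data. The two functions are my own simplified stand-ins: one treats each weight as a count of duplicated observations, the other uses a basic linearized variance for a single-stratum design with weights as inverse selection probabilities. They are meant to illustrate the idea, not to reproduce Stata's output exactly.

```python
import math


def weighted_mean(x, w):
    """Weighted mean; the same under either interpretation of the weights."""
    return sum(wi * xi for xi, wi in zip(x, w)) / sum(w)


def se_frequency(x, w):
    """SE of the mean if each weight counts identical duplicated observations."""
    m = weighted_mean(x, w)
    n_expanded = sum(w)  # the sample is treated as if it had sum(w) rows
    var = sum(wi * (xi - m) ** 2 for xi, wi in zip(x, w)) / (n_expanded - 1)
    return math.sqrt(var / n_expanded)


def se_sampling(x, w):
    """Linearized SE of the mean, treating weights as sampling weights."""
    m = weighted_mean(x, w)
    n = len(x)  # only the rows actually observed count as information
    total = sum((wi * (xi - m)) ** 2 for xi, wi in zip(x, w))
    return math.sqrt(n / (n - 1) * total) / sum(w)


# Made-up data: four respondents standing in for 100 people.
x = [1.0, 2.0, 3.0, 4.0]
w = [10, 10, 40, 40]

print("mean:", weighted_mean(x, w))
print("SE with frequency weights:", se_frequency(x, w))
print("SE with sampling weights: ", se_sampling(x, w))
```

The point estimate is identical either way, but the frequency-weight standard error is far too small here, because it pretends there are 100 independent observations rather than four.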
Posted by: thomas | January 27, 2011 at 02:51 PM
Jim - yes, I can weight the CFCS.
But the problem is that I have fewer single parents in big CMAs, fewer recent immigrants, fewer really rich people and fewer people living in cities with totally outrageous housing prices than I would have if CFCS had just picked a random sample of people across the country. I care about the maritimes and the prairies, but I don't care about them *more* than the rest of the country.
I thought the Newfoundland title was catchy at the time but, you're right, I could probably have thought of a better one.
And excellent snow here today, too.
Nick, don't ask me to translate!
Thomas, the second column is calculated with the frequency weights provided by Statistics Canada in the Labour Force Survey public user file. I used the variable called "FWEIGHT" and I'm hoping that it is in fact an fweight! The third column is one that I calculated myself based on the other two. I guess those are probability weights.
Posted by: Frances Woolley | January 27, 2011 at 05:56 PM
Oh I see. The CFCS isn't big enough to give you the sample size you want of a small slice of Canadian society, and you think it's because they wasted resources trying to make sure they had reasonably reliable overall samples of all the provinces. Welcome to the federation that is Canada!
Posted by: Jim Sentance | January 27, 2011 at 07:36 PM
Jim: "Welcome to the federation that is Canada!"
The truth is I'd never realized that just about every survey in Canada over sampled the smaller provinces.
Perhaps it's a good decision, perhaps it's a bad one.
But it's something that I have never ever heard discussed: "what compromises are we making here? what are we gaining? what are we losing?" It's a conversation worth having - though I wish I was having it over a beer with you, Jim, instead of electronically.
For the LFS, given the structure of employment insurance, there's a good case to be made for the present design.
For other surveys, it might make more sense to have a representative sample of the whole country. Lots of people, for one reason or another, want to look at a particular slice of society (same sex couples, inter-racial marriages, multigenerational families, whatever), and if it happens to be a slice that's concentrated in the large provinces, the present way of designing surveys means it's harder than it needs to be to get a close look at that particular group.
Posted by: Frances Woolley | January 27, 2011 at 09:10 PM
The real shame is that there might be resource constraints that mean you can't do both.
Agree about the drinking of course.
Posted by: Jim Sentance | January 27, 2011 at 09:25 PM
"The truth is I'd never realized that just about every survey in Canada over sampled the smaller provinces."
That strikes me as peculiarly naive. Of course strata with smaller populations will be relatively "over" sampled compared to those with larger populations. That's what's necessary to obtain sufficiently precise population estimates; standard errors depend primarily on sample size, not population size. Regardless, those of us in smaller provinces aren't somehow less deserving of accurate data from the *federal* statistics agency.
Posted by: Josh | January 28, 2011 at 07:08 PM
Frances,
I would have thought pweight too, but like Thomas said it doesn't matter if you don't compute standard errors.
According to Stata's help file,
"fweights, or frequency weights, are weights that indicate the number of duplicated observations."
"pweights, or sampling weights, are weights that denote the inverse of the probability that the observation is included because of the sampling design."
According to the LFS page:
Estimation
The final step in the processing of LFS data is the assignment of a weight to each individual record. This process involves several steps. Each record has an initial weight that corresponds to the inverse of the probability of selection. Adjustments are made to this weight to account for non-response that cannot be handled through imputation. In the final weighting step all of the record weights are adjusted so that the aggregate totals will match with independently derived population estimates for various age-sex groups by province and major sub-provincial areas. One feature of the LFS weighting process is that all individuals within a dwelling are assigned the same weight.
In January 2000, the LFS introduced a new estimation method called Regression Composite Estimation. This new method was used to re-base all historical LFS data. It is described in the research paper "Improvements to the Labour Force Survey (LFS)", Catalogue no. 71F0031X. Additional improvements are introduced over time; they are described in different issues of the same publication.
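As I read it, the three steps in that passage could be sketched roughly as below. This is only a toy Python illustration with made-up groups, probabilities, response rate and population totals; the real LFS procedure (including Regression Composite Estimation) is far more elaborate, and none of the names here come from Statistics Canada.

```python
from collections import defaultdict

# Hypothetical respondents: (age-sex group, probability of selection)
respondents = [
    ("F15-24", 0.01),
    ("F15-24", 0.01),
    ("M15-24", 0.02),
    ("M15-24", 0.02),
    ("M15-24", 0.02),
]
RESPONSE_RATE = 0.9  # hypothetical, applied uniformly for simplicity
POPULATION_TOTALS = {"F15-24": 250.0, "M15-24": 180.0}  # hypothetical


def final_weights(records, response_rate, totals):
    """Return (group, weight) pairs after the three adjustment steps."""
    # Step 1: initial weight is the inverse of the selection probability.
    base = [(group, 1.0 / prob) for group, prob in records]
    # Step 2: inflate weights to compensate for non-response.
    adjusted = [(group, weight / response_rate) for group, weight in base]
    # Step 3: rescale within each group so weights sum to the known total.
    group_sum = defaultdict(float)
    for group, weight in adjusted:
        group_sum[group] += weight
    return [
        (group, weight * totals[group] / group_sum[group])
        for group, weight in adjusted
    ]


weights = final_weights(respondents, RESPONSE_RATE, POPULATION_TOTALS)

# Check that each group's weights now add up to its population total.
check = defaultdict(float)
for group, weight in weights:
    check[group] += weight
print(dict(check))
```

Weights built this way start life as inverse selection probabilities and then get rescaled, which is why they look more like sampling weights than counts of duplicated observations.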
I don't remember ever using fweight.
Posted by: Simon C. | February 17, 2011 at 01:53 PM
Simon - the LFS file that I used has a variable that's labelled FWEIGHT. I would guess it's an FWEIGHT. To give a pweight the variable label FWEIGHT would be simply cruel, because people like me are bound to get confused. Statistics Canada wouldn't do that, would they?
It might be that you and I are using different LFS files - I'm using the public use one. From what you've quoted, it sounds as if pweights are used internally by Statistics Canada when deriving the fweights that are released in the public use file.
The passages you quote are an excellent illustration of why it is *vital* to have a mandatory census - the final adjustments for non-response etc would be impossible without census information.
Posted by: Frances Woolley | February 17, 2011 at 03:37 PM
Hi Frances,
I'm sorry, I meant to say that I never used the "fweight" function, not the "FWEIGHT" variable. I've never used the LFS myself.
I gave a quick look at the documentation, and the file named rebased-record-layout.xls says:
FWEIGHT : Final individual or family weight. (Integer)
I'm afraid FWEIGHT could stand for Final Weight or Family Weight.
I'll try to clear that up and comment here later.
Posted by: Simon C. | February 17, 2011 at 07:39 PM