Over the past year or so, cities across Canada have been creating open data portals (see, for example, Vancouver, Calgary, Ottawa, Montreal, and Halifax). But Toronto's is special - it has data on cat and dog licences.
The data reveal the most popular dog breeds in Toronto in 2011:
A spreadsheet with the full list of licensed dogs by primary breed is here.
The City of Toronto also provides data on the number of licensed dogs (and cats, if anyone is interested) in each of Toronto's 95 Forward Sortation Areas (FSAs). An FSA is the first three characters of a postal code, e.g. M3S.
Statistics Canada provides detailed profiles of each FSA in Canada through its Census (and now National Household Survey) profiles. How hard could it be to merge the two data sets, and figure out the determinants of dog ownership?
Much harder, it turns out, than I expected. The .csv file I downloaded from the Statistics Canada web site turned out to be useless, because the data came in just three columns, instead of a nice matrix. The alternative download option, Beyond 20/20 (B2020), requires custom software that only runs on Windows. That software does produce a file that's usable, but far from clean. For example, missing values are coded as "-" rather than ".", which causes major headaches when importing the data into Stata. Variables aren't properly labelled; for example, the same label, "unemployment rate", is used for the unemployment rate of males, of females, of males aged 15 to 24, and so on. I created a 2006 FSA Census Profile file for Stata that's downloadable here (.dta). It's usable, but not pretty. A slightly less processed spreadsheet file is here (.csv).
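For anyone facing the same problem, the three-column layout and the "-" missing-value code can both be handled on import. Here is a minimal sketch in Python/pandas, with entirely made-up column names and values standing in for the StatCan export:

```python
import io
import pandas as pd

# Hypothetical sample of the long-format export: one row per
# (FSA, characteristic) pair, with "-" standing in for missing values.
raw = io.StringIO(
    "fsa,characteristic,value\n"
    "M3S,median_income,45000\n"
    "M3S,unemployment_rate,6.1\n"
    "M5V,median_income,-\n"
    "M5V,unemployment_rate,5.2\n"
)

# Treat "-" as missing on import, then pivot the three columns
# into a proper matrix: one row per FSA, one column per variable.
long_df = pd.read_csv(raw, na_values=["-"])
wide_df = long_df.pivot(index="fsa", columns="characteristic", values="value")
print(wide_df)
```

The `na_values` option does the "-" to missing conversion in one step, and the pivoted frame can then be exported to Stata.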
Using the 2006 Census data and the 2011 dog registration data, I generated a rough estimate of the number of licensed dogs per capita in each Toronto FSA, and of the distribution of dog ownership. I was surprised that the dog-to-people ratios were so low (and because the population data is out of date, those ratios err on the high side).
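The merge-and-divide step is straightforward once both files share an FSA identifier. A toy sketch, with invented dog counts and populations for two FSAs:

```python
import pandas as pd

# Hypothetical mini versions of the two data sets: licensed dogs per
# FSA (2011) and census population per FSA (2006). Numbers are made up.
dogs = pd.DataFrame({"fsa": ["M3S", "M5V"], "licensed_dogs": [1200, 450]})
census = pd.DataFrame({"fsa": ["M3S", "M5V"], "population": [30000, 25000]})

# Merge on FSA and compute a rough dogs-per-capita ratio. Because the
# population figures predate the dog counts, the ratios err high.
merged = dogs.merge(census, on="fsa", how="inner")
merged["dogs_per_capita"] = merged["licensed_dogs"] / merged["population"]
print(merged)
```

An inner merge also quietly documents which FSAs fail to match across the two sources, which is worth checking before running any regressions.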
I spent a little bit of time playing around with the data to discover some correlates of dog ownership (see the table below).
Given that attitudes towards dogs vary from culture to culture, I included "proportion immigrants" in the FSA as an explanatory variable - it was negative and significant.
Some high rises ban dogs; in any event it's harder to let a dog outside to do its business in a multi-story building. The proportion of households living in apartments five stories or higher is also significantly negatively associated with dog ownership.
I would have expected to have found some sort of relationship between dog ownership and employment, but didn't. I suspect this may be because employment has opposing effects on dog ownership: the income and stability associated with employment would tend to encourage dog ownership, but the demands of work discourage it. I also couldn't find a clear link between family composition and dog ownership, but this may simply be because I wasn't using quite the right measure of family composition.
Median income in an FSA is positively correlated with the number of licensed dogs per capita in that FSA. Whether that reflects an association between dog ownership and income, or between dog licensing and income, cannot be assessed with the data available.
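For readers who want to replicate the flavour of these results, here is a toy version of the kind of cross-FSA regression described above, on synthetic data. The coefficients and variables are invented purely for illustration; the actual estimates are in the table and the .do file:

```python
import numpy as np

# Synthetic FSA-level data: dogs per capita as a function of the
# proportion of immigrants and the proportion in high-rise apartments.
# All numbers below are made up for illustration only.
rng = np.random.default_rng(0)
n = 50
prop_immigrant = rng.uniform(0.1, 0.7, n)
prop_highrise = rng.uniform(0.0, 0.5, n)
noise = rng.normal(0, 0.002, n)
dogs_per_capita = 0.05 - 0.03 * prop_immigrant - 0.04 * prop_highrise + noise

# OLS via least squares: X columns are [constant, immigrants, high-rise].
X = np.column_stack([np.ones(n), prop_immigrant, prop_highrise])
beta, *_ = np.linalg.lstsq(X, dogs_per_capita, rcond=None)
print(beta)  # intercept followed by the two slope estimates
```

With only 96-odd FSAs, the real regression is similarly small; the fragility of coefficients in samples this size is one reason to read the table cautiously.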
(The merged file with the combined dog/census data is here (.dta), and the .do file I used to generate the regression results is here (doggy do file).)
It's wonderful that Canada's cities are engaging in these open data initiatives. But there's a frustratingly large gap between what the data is and what the data could be. For example, the City of Toronto dog and cat licensing data is, apparently, refreshed annually. I don't want data to be refreshed, I want it to be archived, so I can look at trends and changes over time! The various cities differ, too, in the amount and types of data posted on-line. This is a particular issue in the Lower Mainland, as the City of Vancouver is only one of a number of municipalities in the Greater Vancouver area.
There is an even larger gap between what could be done with the data and what I, personally, am able to achieve. The Everyday Analytics blog, for example, did quite a pretty analysis of the Toronto cats and dogs data. What would be really neat would be to map the FSAs, load in data on the location of Toronto parks, calculate the distance between an FSA and the nearest off-leash (or large, dog-friendly) park, and then use distance from the nearest park to predict dog ownership rates in an FSA.
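The distance-to-park piece, at least, doesn't need mapping software: given a centroid for each FSA and coordinates for each park, a great-circle (haversine) calculation suffices. A sketch with made-up Toronto-ish coordinates:

```python
import math

# Haversine great-circle distance in km between two lat/long points.
def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical FSA centroid and off-leash park locations (illustrative).
fsa_centroid = (43.70, -79.40)
parks = [(43.65, -79.38), (43.78, -79.42), (43.71, -79.55)]

# Distance from the FSA centroid to its nearest park.
nearest_km = min(haversine_km(*fsa_centroid, *p) for p in parks)
print(round(nearest_km, 2))
```

The nearest-park distance could then be merged back into the FSA file as one more regressor.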
Unfortunately I have no idea how to go about doing this. The Everyday Analytics blog uses Tableau mapping software. Is this the best? What are the alternatives? So I'll end this blog post with a bleg - is it worth investing the time and effort that it would take to learn how to use mapping software? What's the best software to use? Any thoughts would be greatly appreciated.
Worth noting that if Toronto is anything like the DC metro area a substantial share of pet owners, even a majority, neglect to license their pets. Whether that biases these results is unclear.
Posted by: Squarely Rooted | May 20, 2013 at 07:39 AM
I guess I would be curious if controlling for the percentage of dogs that were "large" in each FSA might have an impact on dogs per capita. Large dogs eat more and would be more costly to support.
Posted by: Livio Di Matteo | May 20, 2013 at 08:23 AM
Squarely rooted - I'm sure lack of licensing biases all of this. People who don't license their dogs will differ in income or risk tolerance from those who do. Non-licensing people, therefore, probably buy different dog breeds, and that would affect the breed list. The relationship between income and dog ownership is also probably generated by non-licensing. Policing of dog licensing may be more enthusiastic in some areas than in others - e.g. in my area, by-law officials tend to swoop down on dog owners when one of the residents complains about bad dog behaviour. No or little enforcement = no or little need to license.
Livio, unfortunately the data doesn't break down the breeds at the FSA level. Even if confidentiality prevents a breakdown of breeds at the FSA level, it would still be useful, as you suggest, to have small/medium/large at the FSA level, because the predictors of each are probably different.
Posted by: Frances Woolley | May 20, 2013 at 09:15 AM
"I don't want data to be refreshed, I want it to be archived, so I can look at trends and changes over time!"
Right on!!!!
Posted by: Dave Giles | May 20, 2013 at 11:41 AM
Dave - yup.
One issue, I think, is that a lot of this data comes from spreadsheets generated by administrative personnel whose training is in making tables look pretty, as opposed to making tables usable for statistical analysis. Hence the use of characters such as - for missing data, the use of heading and subheadings instead of unique variable names, and the insertion of random empty rows (I remember once having a long discussion with an administrative person at Carleton on this, trying to convince her that random empty rows are evil - but to no avail. I'm sure she's still producing exactly the same table.)
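(The "random empty rows" problem is at least easy to fix programmatically. A minimal sketch in Python/pandas, with an invented two-column export standing in for the real spreadsheet:

```python
import io
import pandas as pd

# Hypothetical administrative export with a decorative blank row
# between sections - the "random empty rows" problem described above.
raw = io.StringIO("fsa,unemployment_rate\nM3S,6.1\n,\nM5V,5.2\n")

df = pd.read_csv(raw)
# Drop rows where every field is missing (the blank separator rows).
clean = df.dropna(how="all").reset_index(drop=True)
print(clean)
```

Headings-as-labels and merged cells are harder to undo automatically; blank rows, at least, surrender without a fight.)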
I don't know if there's any way of breaking down these silos and communicating to admin folks the needs of data users.
B.t.w., do you have any knowledge of those various mapping programs? Are there any that are compatible with Stata? What about this GeoGratis program that NRC has put out?
Posted by: Frances Woolley | May 20, 2013 at 12:26 PM
Frances: Maybe you've seen this already, but this blog seems to be tracking municipal open data projects in Canada:
http://opendataexpert.com/
The prevalence of this kind of information, and some of the quality issues you raise, are typical of the cusp we are on right now in terms of large-scale analytics with unconventional sources (that is, not specifically collected by academics or statistical agencies). There is a ton of information out there that is not otherwise available, or even retrievable through academic initiatives, but its usefulness is a big unknown. A lot of the "big data" talk out there has a hype aspect to it--a common theme is that the sheer size of these databases (online purchases, public transit use, etc.) overcomes any potential bias or other methodological concerns, or at least softens them. However, as the recent debate about the Oregon Health Study is reminding us, large numbers are a necessary but not sufficient condition for usability.
Posted by: Shangwen | May 20, 2013 at 03:34 PM
I'm reminded of a presentation by a fellow who measured, amongst other things, the state of the California economy by the number of ladders found by the side of the highway (the intuition being that the number of yahoos who are too dumb to properly tie a ladder to the roof of their truck and who fancy themselves to be contractors is directly proportional to the state of the housing market. Also, if times were tough, they might be more inclined to come back and look for it). I can't remember much else about the presentation (it was done over the course of a dinner where many very fine bottles of exquisite wine were served), but I remember thinking that that was seriously cool.
Posted by: Bob Smith | May 21, 2013 at 03:12 PM
Frances, those people should be using Crystal Reports or an equivalent to turn properly formatted data into reports with the white space and headings/subheadings desired (to be human readable).
Posted by: Andrew F | May 21, 2013 at 03:30 PM
Andrew F - one hopes.
Bob, I like that.
Posted by: Frances Woolley | May 21, 2013 at 03:35 PM
wow what happened to my comment?
I did like the article!
Next time I will bark!
Posted by: nottrampis | May 23, 2013 at 01:33 AM