Last year one of my students* was trying to explain why immigrants struggle in the job market. His regressions weren't working, so he switched things round a bit.
Using 2006 Census data, he found that people in Newfoundland are 30 percentage points less likely to be immigrants than people in Ontario. People with PhDs are 21 percentage points more likely to be immigrants than people with a high school education. And people who are divorced are 9 percentage points less likely to be immigrants than people who are married.
He was crushed when I said, "You can't use any of this in your paper. It doesn't make sense to have immigrant status as your dependent variable."
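For readers wondering what a statement like "30 percentage points less likely to be immigrants" means mechanically, here is a minimal sketch. It assumes a linear probability model: a regression of a 0/1 immigrant indicator on a single 0/1 group dummy. The data are invented for illustration (they are not Census figures), and the function name `ols_on_dummy` is hypothetical. With one dummy regressor, the slope is simply the difference between the two groups' immigrant shares.

```python
def ols_on_dummy(y, d):
    """OLS of a 0/1 outcome y on an intercept and a 0/1 dummy d.

    Returns (intercept, slope) using the textbook one-regressor formulas.
    """
    n = len(y)
    mean_y = sum(y) / n
    mean_d = sum(d) / n
    cov = sum((di - mean_d) * (yi - mean_y) for di, yi in zip(d, y))
    var = sum((di - mean_d) ** 2 for di in d)
    slope = cov / var
    intercept = mean_y - slope * mean_d
    return intercept, slope


# Invented toy data, NOT Census data:
# baseline group: 4 immigrants out of 10; comparison group: 1 out of 10.
baseline = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
comparison = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

is_immigrant = baseline + comparison   # dependent variable
in_comparison = [0] * 10 + [1] * 10    # group dummy

intercept, slope = ols_on_dummy(is_immigrant, in_comparison)
print(round(intercept, 2), round(slope, 2))  # 0.4 -0.3
```

The intercept is the immigrant share in the baseline group, and the slope says the comparison group is 30 percentage points less likely to be immigrants. The arithmetic is fine; the problem is what, if anything, it tells us about behaviour.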
Data miners take data on consumer purchases and look for a pattern, any pattern. Sales of strawberry pop-tarts increase during hurricanes. When men buy diapers on Thursdays and Saturdays, they also tend to buy beer. People who shop for baseball bats also shop for....
Data mining is useful for businesses to the extent that it helps them predict what their customers will buy and when.
Likewise, if one's objective is to predict which of 10 people on a bus are most likely to be immigrants, then it is useful to know the correlates of immigrant status. But it's not economics - at least, not my kind of economics.
(Micro) economics is the study of choice: people's choices; firms' choices. Economists want to know how people respond to incentives; how they take advantage of changes in constraints.
The immigration choice is about who decides to immigrate, and why. To understand that choice, one needs data both on people who chose to immigrate and on otherwise similar people who did not. Here's a picture:
The appropriate population to sample for a study of the immigration choice is everyone in green: all the people who could potentially have immigrated from the source country to the destination country, both those who immigrated and those who did not. Only by comparing those who chose to immigrate with those who did not can we learn about the immigration decision. The Canadian census data that my student was using only had information on people in the destination country - the blue area and the small green circle - so it could not inform an understanding of the immigration decision.
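A small numerical sketch may help show why destination-only data cannot speak to the decision. Everything here is assumed for illustration: two education levels, made-up population counts, and a migration rate that is deliberately the same for both levels, so that education has no effect at all on the choice to immigrate.

```python
# Hypothetical source-country population, by education level.
source = {"low": 1000, "high": 1000}

# Assumed decision rule: education plays NO role in the choice to migrate.
migration_rate = {"low": 0.10, "high": 0.10}

# Hypothetical native-born population of the destination country.
natives = {"low": 900, "high": 100}

migrants = {e: round(source[e] * migration_rate[e]) for e in source}

# Comparing movers with stayers in the source country (the right population):
decision_gap = (migrants["high"] / source["high"]
                - migrants["low"] / source["low"])

# Comparing immigrants with natives in the destination country
# (all that a destination-country census can see):
dest_share = {e: migrants[e] / (migrants[e] + natives[e]) for e in source}
census_gap = dest_share["high"] - dest_share["low"]

print(decision_gap)          # 0.0
print(round(census_gap, 2))  # 0.4
```

In this made-up example the destination data show highly educated people as 40 percentage points more likely to be immigrants, even though education has no effect on who chooses to move. The gap reflects how educated the native-born population happens to be, not anyone's immigration decision.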
I'm not saying one should never run a reverse regression with "immigrant" as a dependent variable. Suppose, for example, one were testing for the existence of discrimination in the job market. If one found that those who were unemployed were more likely to be immigrants, all else being equal, that might be evidence of discrimination against immigrants. But they're called "reverse regressions" for a reason: they're the reverse of what economists generally do in regression analysis.
Data mining - throwing everything you've got at a regression and seeing what comes out significant - doesn't provide a way of generalizing beyond a particular time and place. Without a theory about the underlying processes generating the data, there is no way of knowing whether the results would hold anywhere else.
Take, for example, the (purported) amazon.de screenshot above. People who buy baseball bats also buy balaclavas and pepper spray. Does this mean that baseball bat manufacturers would be well advised to diversify into the balaclava business? Probably not. Some theorizing, some "critical thinking", can explain why.
There are two uses for baseball bats: playing baseball, and hitting things. In Germany it seems that the latter use dominates, but that will not be true in many places where baseball bats are sold. Once the underlying process generating the results is understood, it may be possible to predict where, that is, in what markets, baseball bats and balaclavas are likely to be purchased together.
Or it may not be. Sometimes economic models lead to definite predictions; sometimes they don't. For example, economists can predict that stock market bubbles will form and then burst, but they cannot predict when crashes will occur.
What turns a regression into economics is the insight it yields into human behaviour. Without a model that connects econometric output to decision-making, we're just a bunch of idiot savants predicting pop-tart sales.
* Details have been changed to preserve student anonymity. I calculated the results reported here myself using the 2006 Census PUMF.