People make elementary errors when they run a regression for the first time. They inadvertently drop large numbers of observations by including a variable, such as spouse's hours of work, which is missing for over half their sample. They include every single observation in their data set, even when it makes no sense to do so. For example, individuals who are below the legal driving age might be included in a regression that is trying to predict who talks on a cell phone while driving. People create specification bias by failing to control for variables that are almost certain to matter in their analysis, like the presence of children or marital status.
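The first of these errors is easy to catch by counting observations before and after a variable is added. Here is a minimal sketch in Python, using invented data: the variable names, the sample size of 1,000, and the 55% missing rate are all assumptions made up for illustration. Regression software typically drops incomplete rows without much fanfare, so the check is worth doing by hand.

```python
import numpy as np

# Hypothetical data: 1,000 respondents, all values invented for illustration.
n = 1000
rng = np.random.default_rng(0)
own_hours = rng.normal(40, 8, n)
spouse_hours = rng.normal(35, 10, n)

# Assumption: spouse's hours are missing for 55% of the sample
# (for example, respondents who have no spouse).
missing = (np.arange(n) % 20) < 11
spouse_hours[missing] = np.nan


def complete_cases(*columns):
    """Boolean mask of rows with no missing value in any of the columns."""
    return ~np.isnan(np.column_stack(columns)).any(axis=1)


# Rows that would survive in a regression with and without spouse's hours.
n_without = int(complete_cases(own_hours).sum())
n_with = int(complete_cases(own_hours, spouse_hours).sum())

print(f"Observations without spouse's hours in the model: {n_without}")
print(f"Observations with spouse's hours in the model:    {n_with}")
print(f"Share of sample lost: {1 - n_with / n_without:.0%}")
```

If the second count is less than half the first, the regression is no longer about the sample you thought you had.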
But it is rare for someone to come to my office hours and ask, "Have I chosen my sample appropriately?" Instead, year after year, students are obsessed with learning how to use probit or logit models, as if their computer would explode, or the god of econometrics would smite them down, if they were to try to explain a 0-1 dependent variable by running an ordinary least squares regression.
I try to explain: "Look, it doesn't matter. It usually doesn't make much difference to your results. It's hard to come up with an intuitive interpretation of what logit and probit coefficients mean, and it's a hassle to calculate the marginal effects. You can run logit or probit if you want, but run a linear probability model as well, so I can tell whether or not anything weird is going on with the regression."
But they just don't believe me.
I am happy to concede to Dave Giles that, all else being equal, it is better to use probit than ordinary least squares, and that Stata's margins command is not that difficult for an undergraduate to use.
But all else is not equal. Using probit will not save a regression that pools men and women into one sample when estimating the impact of having young children on the probability of being employed, and fails to include a gender*children interaction term. (The problem here is that children are associated with a higher probability of being employed for men, and a lower probability of being employed for women. In a sample that includes both men and women, these two effects offset each other, so the pooled coefficient on children can end up close to zero.)
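A small simulation makes the cancellation concrete. This is a sketch under stated assumptions, not an estimate from real data: the employment probabilities (0.75 without children, 0.85 for fathers, 0.65 for mothers), the sample size, and the helper `ols` are all invented for illustration, and the model is a linear probability model fitted by least squares.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 40_000

# Hypothetical sample: gender and presence of children assigned at random.
female = rng.integers(0, 2, n)
kids = rng.integers(0, 2, n)

# Assumed employment probabilities (illustrative only):
# no children 0.75; fathers 0.85; mothers 0.65.
p_employed = 0.75 + 0.10 * kids * (1 - female) - 0.10 * kids * female
employed = (rng.random(n) < p_employed).astype(float)


def ols(y, *regressors):
    """Least-squares coefficients, with the intercept first."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Pooled model: employed on kids only.
pooled = ols(employed, kids)

# Interacted model: employed on kids, female, and kids*female.
interacted = ols(employed, kids, female, kids * female)

print(f"Pooled coefficient on kids:       {pooled[1]: .3f}")
print(f"Effect of kids for men:           {interacted[1]: .3f}")
print(f"Interaction (women minus men):    {interacted[3]: .3f}")
print(f"Effect of kids for women:         {interacted[1] + interacted[3]: .3f}")
```

With these assumed numbers the pooled coefficient sits near zero while the interacted model recovers effects of opposite sign for men and women. Switching the estimator to probit would not change that pattern, because the problem lies in the specification.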
Once students know how to appropriately define a sample, deal with missing values, spot an obviously endogenous regressor, and figure out which explanatory variables to include in their model, then it might be worth having a conversation about the relative merits of probit and linear probability models. Until then, I'm telling my students to use the regress command and, if it makes them feel better, stick "robust" at the end of it.
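For anyone who wants to see how close the two approaches can be, here is a hedged sketch in Python rather than Stata. It fits a linear probability model with heteroskedasticity-robust (HC1) standard errors, which is my understanding of what adding "robust" to regress does, and compares the slope to the average marginal effect from a logit fitted by Newton's method. Logit is used instead of probit only to keep the code short; the data, the true coefficients (0.3 and 0.8), and all variable names are assumptions invented for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# Hypothetical data: one continuous regressor and a 0-1 outcome.
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_p = 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * x)))
y = (rng.random(n) < true_p).astype(float)
k = X.shape[1]

# Linear probability model: OLS coefficients.
XtX_inv = np.linalg.inv(X.T @ X)
b_lpm = XtX_inv @ X.T @ y

# HC1 robust variance: sandwich estimator scaled by n / (n - k).
resid = y - X @ b_lpm
meat = X.T @ (X * (resid ** 2)[:, None])
V_hc1 = n / (n - k) * XtX_inv @ meat @ XtX_inv
se_lpm = np.sqrt(np.diag(V_hc1))

# Logit by Newton's method, starting from zero coefficients.
b_logit = np.zeros(k)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(X @ b_logit)))
    gradient = X.T @ (y - p)
    hessian = X.T @ (X * (p * (1 - p))[:, None])
    step = np.linalg.solve(hessian, gradient)
    b_logit += step
    if np.max(np.abs(step)) < 1e-10:
        break

# Average marginal effect of x: mean of p(1 - p) times the logit slope.
p = 1.0 / (1.0 + np.exp(-(X @ b_logit)))
ame_logit = np.mean(p * (1 - p)) * b_logit[1]

print(f"LPM slope:                     {b_lpm[1]:.4f} (robust SE {se_lpm[1]:.4f})")
print(f"Logit average marginal effect: {ame_logit:.4f}")
```

In this well-behaved example the two numbers should be very close. That will not hold in every data set, particularly when predicted probabilities crowd near 0 or 1, which is a reason to run both models and compare.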
They don't listen.
It all comes down to the way that they have been taught econometrics. Most - not all - econometrics classes emphasize statistical theory. Students might run regressions, but often these are canned, ready-made examples, with the parameters of the analysis clearly defined, or straightforward replication exercises.
Econometrics is taught that way for a simple, practical reason: it's easy. When every student downloads his own data, works on his own unique problem, and specifies a novel and original model, each student will need a lot of individual help and attention. The marking cannot be delegated to a TA, because each research question, and each data set, is different, so it is impossible to write down a simple answer key. But spending hours upon hours reading students' first struggling steps at regression analysis is a huge amount of work. It's so much easier to mark a final exam consisting of calculations, short answer questions, and replication of theorems.
No one in my honours seminar this year is taking that easy route - and it's tough going.
Econometrics is a journey. Logit and probit are just one step on the path towards enlightenment. Once one arrives at probit, and can calculate marginal effects with ease, another challenge awaits - bootstrapped standard errors, perhaps, or correction for sample selection bias. The ultimate goal - identification of causal relationships - may never be achieved, but we journey down the path nonetheless.
Students need to discover that all econometricians reach a point where they say "there's only so much I can do, I'm going to stick with this regression, even though I know it's not perfect".
My favourite advice for would-be researchers comes from Alice's Adventures in Wonderland:
Begin at the beginning...and go on till you come to the end: then stop.
The beginning of a good piece of applied econometrics is the formulation of a theory: a hypothesis about what matters and why. The next step is the identification of the sample - who the model applies to. Then comes the model specification - figuring out some way of establishing a relationship between the explanatory variables and the thing that is being explained. Only then do considerations like the choice between probit and linear probability model come into play.
Begin at the beginning. Go on till the final paper is due. Then stop.