
One obvious answer is to use the likelihood. For day t, this is P(t) if R(t)=1 and 1-P(t) if R(t)=0. Then take the product across all the days.
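A minimal sketch of that likelihood in Python (the data are made up for illustration):

```python
def likelihood(P, R):
    """Probability of the observed rain record R (0/1 per day), assuming
    the forecaster's stated probabilities P are correct and days are
    independent: P(t) when it rains, 1 - P(t) when it doesn't."""
    L = 1.0
    for p, r in zip(P, R):
        L *= p if r == 1 else 1.0 - p
    return L

# Two days: forecast 90% and it rains, forecast 70% and it doesn't.
print(round(likelihood([0.9, 0.7], [1, 0]), 6))  # 0.9 * 0.3 = 0.27
```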

What I would do is make an accuracy score for each prediction, for each day: if it rains the score is the forecaster's probability it will rain, if it doesn't rain the score is 1 - probability. Then you can either sum across all days to get the cumulative performance and compare with the other forecaster, or you can work out the average score as well as the standard deviation of each forecaster and compare that way, the standard deviation allowing you to test how significantly they deviate from each other. Furthermore, it allows you to compare against random chance, by seeing whether the score is above 0.5.
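A sketch of this scoring scheme, with hypothetical four-day records:

```python
import statistics

def daily_scores(P, R):
    """Per-day accuracy score: the forecaster's probability of rain if it
    rained, 1 minus that probability if it didn't."""
    return [p if r == 1 else 1.0 - p for p, r in zip(P, R)]

P = [0.8, 0.6, 0.3, 0.9]   # hypothetical forecasts
R = [1,   0,   0,   1]     # hypothetical outcomes
s = daily_scores(P, R)
print(sum(s))               # cumulative performance
print(statistics.mean(s))   # average score; compare against the 0.5 chance baseline
print(statistics.stdev(s))  # spread, for judging whether two forecasters differ significantly
```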

Daniel and Britonomist: thanks. Are your answers the same, except Britonomist is adding up all the things that Daniel is multiplying together?

Or, wouldn't sum squared [R(t)-P(t)] work the same, and be more Least Squaredish?
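That sum-of-squares score (the data here are hypothetical) can be sketched as:

```python
def brier_sum(P, R):
    """Sum of squared errors (R(t) - P(t))^2 -- the 'Least Squaredish'
    score. Note lower is better here, unlike the likelihood-style scores."""
    return sum((r - p) ** 2 for p, r in zip(P, R))

print(round(brier_sum([0.9, 0.7], [1, 0]), 6))  # (1-0.9)^2 + (0-0.7)^2 = 0.5
```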

Well, if you take the nth-root of Daniel's likelihood you have the geometric mean, which could be compared in a similar fashion to using my arithmetic mean.

Maybe I'm stupid, but...

...how do you assess the accuracy of P(t)? It either rains or it doesn't, so unless P(t) is a perfect model, it will almost always be wrong, even in cases when P(t)=0.95 and it rains (because the realized outcome is R(t)=1.0).

Is the assumption that the 0.95 forecast was "5% incorrect" on the day it rained? If so, then I'm pretty sure the P(t) model will always appear more inaccurate than R(t), because P(t) is inaccurate even when it's accurate.

I can't demonstrate this, but I'm putting it out there in hopes that the person who solves this problem confirms my suspicion.

Or, what about a forecaster who is really perfect, but who underestimates his own ability. So whenever he says "P=0.6" it always rains, and whenever he says "P=0.4" it never rains? How can we distinguish between someone like that, and another forecaster who is always perfectly confident but who sometimes gets it wrong?

Ryan: you are asking the same "stupid" question as me. But it has an answer. I know it *must* have an answer!

I'll give it a try: to determine which forecaster to prefer, the ratio p(data|A)/p(data|B) could be a good indicator (reason below), where p(data|A) is the probability of observing the data given the predictions of forecaster A. If the ratio is much larger than 1 then A is better; if the ratio is much smaller than 1 then B is. p(data|A) can be calculated as p(data|A) = PRODUCT_t [ (P(t) == 1.0) * R_A(t) + (P(t) == 0.0) * (1 - R_A(t)) ]. Similarly p(data|B). The ratio tells us which to prefer.

I'll take the argument for why this could be a way to do it from the chapter "Model Selection" in "Data Analysis: A Bayesian Tutorial" by D. S. Sivia: We have observed some data and the forecasts of two forecasters, A and B. We want to determine which forecaster to prefer. If we denote by p(A|data) the "probability/belief that A is right" given the data, then we can calculate p(A|data)/p(B|data). If this is much larger than one we prefer A; if it's much smaller than one we prefer B. Using Bayes' theorem this can be written as
p(A|data)/p(B|data) = (p(data|A)/p(data|B)) * (p(A)/p(B)). If we have no a priori preference for A or B, then p(A)/p(B) = 1 and we are left with p(data|A)/p(data|B).

I'm curious if this is helpful (and simple enough :-)). It also doesn't help much to determine if one forecaster is just guessing. To do that, I would expect that we have to know more (Example: how often we expect it to rain. In the Sahara someone will do really well by guessing "no rain"). So the approach above just helps to differentiate between the two weather forecasters.

Mathias beat me to it. You revise in favour of the forecaster who gave the higher probability for the event that actually occurred, according to Bayes' rule.

Remember that reality is deterministic (with the exception of quantum physics). If you have enough information, you should be able to produce a greater (or lesser) probability than someone with less information, so my method penalizes both an inability to give informative predictions (low deviation from a 0.5 probability) and being wrong. I suppose this doesn't account for confidence explicitly; low confidence will be penalized, which just illustrates that low confidence makes you less useful (unless you're better than the more confident forecaster, in which case the other forecaster's overconfidence is making him worse instead).


My first thought was to go with the maximum likelihood estimator too.

But this may not work.

For example, say the long run average of rainy days is 10%. A moderately naive person may fix P(t) = 0.1 for each day. A more rational Bayesian updater might begin with a prior of 0.1, and then update it with each day of rain or no rain. Ideally we should reward this person.

Now consider a third person, who forecasts P(t) = 0.2 every day, and wins out simply because in that year rainy days were indeed 20% of the total, so he maximized likelihood (additive, multiplicative, std deviation etc. don't matter: it can be shown that a constant ex ante prediction that matches the realized ex post frequency maximizes all such likelihoods).

The answer to our question depends on whether we should indeed be using the past long-run average of rain as the prior, or whether a de novo prior is justified. In other words, is the guy who predicted 0.2 every day Warren Buffett, or an idiot savant? (Is Warren Buffett an idiot savant?)

The test of forecast efficiency suffers from the same problem as all tests of market efficiency - they are joint tests of the market model and the forecast. Do we know the *true* model of rain? If yes, MLE gives us the right answer. If no, MLE might fail to tell us who is the better forecaster. What then matters is replicability and longevity. Can the forecaster teach another forecaster to outperform the naive long-run average setter or the Bayesian updater? Does the forecaster's method perform in out-of-sample tests?
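The claim that a constant forecast equal to the realized frequency maximizes the likelihood among all constant forecasts can be checked numerically; this sketch uses an invented year in which exactly 20% of days were rainy:

```python
import math

def log_likelihood_const(p, R):
    """Log-likelihood of the rain record R under a constant forecast p."""
    return sum(math.log(p) if r else math.log(1 - p) for r in R)

# Hypothetical year: 73 rainy days out of 365, i.e. a 20% frequency.
R = [1] * 73 + [0] * 292
candidates = [0.10, 0.15, 0.20, 0.25, 0.30]
best = max(candidates, key=lambda p: log_likelihood_const(p, R))
print(best)  # 0.2 -- the realized frequency wins among these constants
```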

Oh, and I'm pretty sure that two identical forecasters with the same method and information, but with one being less confident than the other, will still have the same average accuracy.


Matias: " p(data|A) can be calculated as p(data|A) = PRODUCT_t [ (P(t) == 1.0) * R_A(t) + (P(t) == 0.0) * (1 - R_A(t)) ]."

Is there a typo there? Did you switch P and R? Or am I even more confused than I think I am?

Suppose we estimated the regression R(t) = a + b.P(t) + e(t) for each of them. Would a high R^2 but a low estimate of b (much less than one) tell us the forecaster is better than he thinks he is?
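A sketch of that calibration regression as plain least squares (the forecaster's record here is invented: he says 90% on days that verify only 70% of the time, and 10% on days that verify 30% of the time):

```python
import numpy as np

def calibration_line(P, R):
    """OLS fit of R(t) = a + b*P(t). A slope b well below 1 suggests the
    stated probabilities swing more than the outcomes warrant."""
    X = np.column_stack([np.ones(len(P)), P])
    (a, b), *_ = np.linalg.lstsq(X, np.asarray(R, float), rcond=None)
    return a, b

# Hypothetical overconfident forecaster, 20 days:
P = [0.9] * 10 + [0.1] * 10
R = [1] * 7 + [0] * 3 + [1] * 3 + [0] * 7
a, b = calibration_line(P, R)
print(round(float(a), 3), round(float(b), 3))  # 0.25 0.5
```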

Since the dependent is binary there, I'm not sure R^2 is valid; you should use probit instead.

"what about a forecaster who is really perfect, but who underestimates his own ability."

What about a forecaster whose method is perfect but can't interpret his own results? He only ever gives probabilities of one or zero, but *every time* he gives P = 0.0 for rain, it does rain, and when he predicts P = 1.0 it never rains. Is he the best forecaster or the worst possible one?

Nick, define "better." There are two dimensions in which you could judge forecast accuracy. The first is making correct predictions in a binary rain/no-rain sense (throw out the 50-50 calls, or tell the forecasters to never use precisely 50%), then see which forecaster gets it right more often.

The second dimension is over/underconfidence. Suppose a forecaster always calls it 80-20 (or that ratio on average), sometimes for rain, sometimes against. In that case you test the accuracy of the claimed confidence by comparing the 80-20 ratio to the percentage of correct calls. If the forecaster is neither under- nor overconfident, you would expect him or her to be right 80% of the time. If they are right 90% of the time they are underconfident, and if they are right 70% of the time they are overconfident.

It's very possible that one forecaster will be "better" in the sense of getting it right more often, and the other will be closer to the proper degree of confidence. In that case you'd obviously want to take some sort of linear combination of the two forecasts to get the best estimate of the probability of rain.
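The confidence test just described can be sketched like this (the ten-day record is hypothetical):

```python
def confidence_check(P, R):
    """Average claimed confidence in the called side vs. the fraction of
    calls that verified (50-50 calls are thrown out, as suggested above)."""
    calls = [(max(p, 1 - p), (p > 0.5) == (r == 1))
             for p, r in zip(P, R) if p != 0.5]
    claimed = sum(c for c, _ in calls) / len(calls)
    hit_rate = sum(1 for _, ok in calls if ok) / len(calls)
    return claimed, hit_rate

# Hypothetical: always claims 80% confidence, right 9 days out of 10.
P = [0.8] * 9 + [0.2]
R = [1] * 10
claimed, hit = confidence_check(P, R)
print(round(claimed, 3), round(hit, 3))  # 0.8 0.9 -> underconfident
```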

Scott: that was sort of my thinking too. There's information-content, bias, over/underconfidence, and overall accuracy.

Britonomist. Ah yes, probit, because the errors will have a very ugly distribution, or something.

William: yep. That's an example of a forecaster with very high information content but negative self-confidence!

Nick, do you care how much it rains? That is, is a person who forecasts sunny weather just before Superstorm Sandy blows in a worse forecaster than the person who forecasts dry but overcast when in fact it was overcast with a millimetre of rain?

Nate Silver himself gives a perfectly intelligible summary of the difference between bias and accuracy, and what accuracy means in this context:

Bias, in a statistical sense, means missing consistently in one direction — for example, overrating the Republican’s performance across a number of different examples, or the Democrat’s. It is to be distinguished from the term accuracy, which refers to how close you come to the outcome in either direction. If our forecasts miss high on Mr. Obama’s vote share by 10 percentage points in Nevada, but miss low on it by 10 percentage points in Iowa, our forecasts won’t have been very accurate, but they also won’t have been biased since the misses were in opposite directions (they’ll just have been bad).

Your posited forecaster who "is really perfect, but who underestimates his own ability" is a contradiction; all that matters is the forecasts (probabilities) that the forecaster gives. The problem, as you have set it up, is like this: every day (or poll) is treated as a binary random variable (biased coin), where the probability of "heads" itself is drawn from an unknown distribution. To be accurate means to correctly guess this unobservable bias on each trial, not to guess the outcome of each trial. That is not even being attempted.

Recall the meaning of a 40% forecast: "On 40% of days with these conditions it will rain." Or perhaps, "On 40% of days with this forecast it will rain." Suppose all the forecasts were multiples of 10% (or round as needed). For each of the eleven possible forecasts, compare the empirical frequency of rain. The upshot is that you are usually comparing a number in (0,1) with another such number, instead of comparing it to a Bernoulli trial (zero or one).

Now you have a vector of accuracy across the forecast spectrum. This gives some insight on the shape of the error/bias - perhaps the forecast quality is high in the low range and biased upward in the high range.

If you need a final utility score come up with a metric on the vector.
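A sketch of that accuracy vector, with an invented seven-day record in which the 30% forecasts verify reasonably and the 80% forecasts run hot:

```python
from collections import defaultdict

def reliability_vector(P, R):
    """For each issued forecast value (rounded to tenths), the empirical
    frequency of rain on the days that forecast was given."""
    buckets = defaultdict(list)
    for p, r in zip(P, R):
        buckets[round(p, 1)].append(r)
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}

P = [0.3, 0.3, 0.3, 0.8, 0.8, 0.8, 0.8]
R = [0,   0,   1,   1,   1,   0,   0]
rv = reliability_vector(P, R)
print(rv)  # empirical frequency per forecast bucket
```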

OK, so pace ritwik, likelihood is in fact the orthodox measure to employ here. The question of bias is distinct from that of accuracy, as noted earlier, and in the case of polls there is no earlier sequence of data to draw on to estimate an average. In the case of weather, the same holds, because weather is not stationary.

There is another way to quantify the issue worth mentioning. We are observing a sequence of random variables X_i, with each variable drawn from a different binary distribution with unobservable Pr(rain) = mu_i; our premise is that mu_i is itself a random variable, or the problem would collapse to a standard convergence in law.

You are surely familiar with the standard version of the central limit theorem, in which the normalized sum of iid random variables converges to a normal distribution. There are also extended versions of the theorem which do not require the independent variables to be identically distributed, provided they meet some other technical conditions. One of these is Lindeberg's version (whose condition is satisfied here), which says that:

( 1 / s_n ) * sum( X_i - mu_i ), where s_n^2 = sum( mu_i * (1 - mu_i) )

converges toward the standard normal distribution. For each day (or poll), X_i is 1 if it rained, and mu_i is the forecaster's given probability of rain. The degree to which a given forecaster's sequence of forecasts converges to the standard normal distribution in this transformation is a measure of the goodness of those forecasts.
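A sketch of that normalized statistic, assuming the Lindeberg scaling s_n^2 = sum of the Bernoulli variances mu_i*(1 - mu_i) (my reading of the comment; the record below is invented):

```python
import math

def normalized_sum(R, mu):
    """z = sum(R_i - mu_i) / s_n, with s_n^2 = sum(mu_i * (1 - mu_i)).
    If the stated probabilities mu_i are correct, z is approximately
    standard normal over a long record; |z| far above ~2 signals trouble."""
    s_n = math.sqrt(sum(m * (1 - m) for m in mu))
    return sum(r - m for r, m in zip(R, mu)) / s_n

# Hypothetical record that exactly matches its forecasts on average:
z = normalized_sum([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(z)  # 0.0
```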

Thomas: but that doesn't really work. Suppose I know nothing, except that it rains 30% of the time. So every day I say "30%", and I am exactly right, because the frequency is in fact 30% when I say it's 30%. I am unbiased, but otherwise a useless forecaster.

Phil: "Your posited forecaster who 'is really perfect, but who underestimates his own ability' is a contradiction; ..."

OK. Suppose one forecaster had a perfect model (it's a crystal ball), but he doesn't know it's perfect and doesn't trust it. So he takes an average of the model's forecast and the population mean probability. When the model says 0% he says 10% and when the model says 100% he says 90%. A second forecaster has a flawed model that he trusts perfectly, and so always says 0% or 100%. Both forecasters are imperfect, but they are imperfect in different ways. And it is much easier to fix the first forecaster's imperfections. And a useful metric would let us know if a forecaster is making mistakes that can easily be fixed.

Frances: good point. We need a loss function. I think I was (implicitly) assuming a simple loss function, where the amount of rain doesn't matter, and the cost of carrying an umbrella conditional on no rain is the same as the cost of not carrying an umbrella if it does rain?

Phil (quoting nate Silver): "If our forecasts miss high on Mr. Obama’s vote share by 10 percentage points in Nevada, but miss low on it by 10 percentage points in Iowa, our forecasts won’t have been very accurate,..."

But that is a point estimate of a continuous variable. I'm talking about a probabilistic estimate of a binary variable. "What is the probability of rain/O wins?"

Hmm. In this case, given the data, the best thing I could do is evaluate those forecasters using information theory. If, based on actual observations, we know the "objective" probability distribution of binary outcomes (Rain/NotRain), we can calculate the Shannon entropy and therefore the information value of the respective forecasts. I would say that the winner is the forecaster who transmitted the most information.

PPS: anyway, contrary to what some people may think, this is not how I would evaluate forecasters. It would be better to evaluate their forecasting models instead of the actual probability forecasts. If you had access to the models, you would be able to evaluate not only each model's raw predictive (or, to be more precise, explanatory) power, but also its stability. The best way to think about this is to imagine that observations are points in a space whose dimensions are the variables you use for prediction. The actual model can be represented by a plane that you constructed using these points in space.

Now you can have a fantastic model with 95% predictive power and it could still be useless. Why? Imagine that your observations are tightly packed roughly along a straight line along the x dimension. So you construct a plane, and that is your model. Now imagine that just one observation shifts a little. Suddenly your plane changes slope, and the further you get from your observations, the greater the difference in prediction. You find that slight changes in observations lead to wildly different predictions. Such a model is useless; it is just a random match for past data. There are mathematical tools that allow you to calculate the stability of your model. There are more complicated things involved: your model can be generally stable but show large local instabilities in some range of observations, etc.

Nick:

Not sure you can compare those two probabilities ("O wins" vs a weather forecast), as they most likely mean different things in their differing contexts.

The major and uncomfortable problem with the notion of "probability" is that it admits multiple, sometimes incompatible, interpretations (about half a dozen), leading to confusion when trying to understand what exactly the interlocutor means by "probability". Perhaps the p-word has to be defined before each use.

While climatologists use something close to the standard frequentist interpretation that they modestly call "climatological probability", the "O wins" probability can be most charitably interpreted as subjective probability of de Finetti's kind ("subjective degree of belief"), the political scientist quantitative models notwithstanding.

Re. "climatological probability":
"
Because each of these categories occurs 1/3 of the time (10 times) during 1981-2010, for any particular calendar 7-day period, the probability of any category being selected at random from the 1981-2010 set of 30 observations is one in three (1/3), or 33.33%.
"

http://www.cpc.ncep.noaa.gov/products/predictions/erf_info.php?outlook=814&var=p

"I'm talking about a probabilistic estimate of a binary variable."

Suppose that there are only two types of election, those which Republicans are 70% likely to win and those that Democrats are 70% to win, and these types occur with equal frequencies. Then a forecast that assigns every election a 50-50 chance is unbiased but useless.

'if based on actual observations we know the "objective" probability distribution of binary outcomes'

If we knew this, the problem would be much easier.

That's a useful canonical example that illustrates why we'd like to see the whole vector. Moreover, it's the extreme version of the theoretically valid strategy of consistently averaging in the information-less prior of 30% to every forecast.

In this way, every forecast-year is a linear combination of the information-less 30% constant and a non-trivial component. Perhaps there's a way to parse each forecast-year into components.

Well the weather forecasters at least admit that their predictions are not perfect, and then go on to quantify the uncertainty of their prediction. So what they predict is not actually whether it will rain or not on any particular day, but a probability distribution. Thus they should be judged on how close their predicted probability distributions are to the actual distributions.

So if the forecast is a 60% probability of precipitation, we expect rain on 60% of the days when this forecast is given; if instead we consistently get 50% or 70%, then this is evidence that the forecast is biased one way or the other. If the forecaster is doing a good job we would expect a fairly narrow distribution, presumably roughly "normal" around the predicted frequency. Whether it rains or not on any particular day, taken by itself, is beside the point, it seems to me. If the actual distribution of precipitation over time coincides closely with the predicted distribution then the forecaster is doing a good job. If it doesn't, then he isn't.

"Suppose one forecaster had a perfect model (it's a crystal ball) ... it is much easier to fix the first forecaster's imperfections."

You are merely assuming the second sentence. There is no reason why it should be easier to increase "confidence", interpreted as a metaphor, than to improve the "crystal ball", interpreted as a metaphor, it is just that your choice of metaphors has made this seem plausible. The complete model is the combination of the crystal ball plus confidence (plus star-gazing plus sunspots plus whatever else you want to add) and the complete model is what we must assess.

A bit surprised that no one has mentioned that the most common way to assess forecasts would be to take the forecast with the smallest mean square prediction error E[(R(t)-P(t))^2]. Lowest MSPE is what you want to pick if you have a quadratic loss function.

Of course, if you are particularly sensitive to certain contingencies, such as the thresholds Frances suggests, you have a different loss function. If you, say, are made worse off by rain when no rain is predicted, but not vice versa (because the cost of carrying an umbrella is small), then you would calculate what fraction of the 365 days forecaster 1 made an error of that sort, as opposed to forecaster 2, and pick whichever performed better.
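A sketch of that asymmetric-loss count (the threshold for "called no rain" and the four-day records are my assumptions):

```python
def costly_misses(P, R, threshold=0.5):
    """Count days on which it rained while the forecaster had effectively
    called no rain -- the expensive error under this loss function."""
    return sum(1 for p, r in zip(P, R) if r == 1 and p < threshold)

# Hypothetical four-day records for two forecasters:
P1 = [0.6, 0.4, 0.2, 0.7]
P2 = [0.9, 0.3, 0.1, 0.3]
R  = [1,   1,   0,   1]
print(costly_misses(P1, R), costly_misses(P2, R))  # 1 2
```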

You would need some measure of what a person who is not a weather forecaster could reasonably guess. In that way you determine what value a weather forecaster adds. One method would be to gather a large enough time data series to determine what the probability of rain is on any particular day from past experience. This would serve as a baseline against which to measure both deviations in observed weather and deviations in predicted weather.

P(T) is prediction of a weather forecaster for day T
R(T) is weather on day T
S(T) is probability of rain on day T based upon all previous R(T), or S(T) = (sum of R(t) as t goes from 1 to T-1) / ( T-1 )

Suppose our non-weather forecaster bases his guess entirely on previous weather
P(T) = S(T) = ( sum of [ R(t) ] as t goes from 1 to T-1 ) / ( T - 1 )

The deviation of his guess at time T from the actual value would be the absolute deviation (dividing by R(T), which is often zero, would be undefined):
| S(T) - R(T) |

The average deviation A would be:
A = ( sum of [ | S(t) - R(t) | ] as t goes from 1 to T ) / T

And so you can expect a non-weather forecaster to have an average deviation of A. For our weatherman:
P(T) = F(T) * S(T)

F(T) represents the factor that the weatherman applies to the data set to take into account his own experience and senses. The average deviation (AF) for our weatherman would be:
AF = ( sum of [ | F(t) * S(t) - R(t) | ] as t goes from 1 to T ) / T

To calculate the value (V) of the weatherman, you must consider his accuracy above and beyond what a non-weatherman could guess.
V = ( A - AF ) / A

If the weatherman has exactly the same average deviation that the non-weatherman does, then his value is:
V = ( A - A ) / A = 0

If the weatherman calls the weather perfectly then his value is:
V = (A - 0) / A = 1

To compare two weathermen, you would need to compute the value of each.
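This value-added scheme can be sketched as follows. I use the absolute deviation |P - R| (dividing by R(t) would fail on dry days), and the first day's base rate defaults to 0.5 for lack of history, which is my assumption rather than the comment's:

```python
def running_base_rate(R):
    """S(T): frequency of rain over days 1..T-1; day 1 defaults to 0.5."""
    S, rainy = [], 0
    for t, r in enumerate(R):
        S.append(rainy / t if t > 0 else 0.5)
        rainy += r
    return S

def avg_deviation(P, R):
    # Mean absolute deviation between forecasts and outcomes.
    return sum(abs(p - r) for p, r in zip(P, R)) / len(R)

def value_added(P, R):
    """V = (A - AF) / A: 0 means no better than the naive base-rate
    guesser, 1 means a perfect forecaster."""
    A = avg_deviation(running_base_rate(R), R)
    AF = avg_deviation(P, R)
    return (A - AF) / A

R = [0, 1, 0, 0, 1, 0]
v = value_added([float(r) for r in R], R)  # a perfect forecaster
print(v)  # 1.0
```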

Nick, the metric you propose in your 9:03 post is called the "Brier Score" and is in fact commonly used to evaluate forecasting skill. The Wikipedia article on the subject describes a couple of ways of decomposing it into terms that can be related to intuitive concepts like "reliability".

http://en.wikipedia.org/wiki/Brier_score

OK. I think I'm getting the intuition of Matias' answer (similar to Daniel's answer in the first comment). You work out the likelihood of observing the data, conditional on the forecast being true. So if on day 1 A says 90%, and it does rain, that's 90%. If on day 2 A says 70%, and it doesn't rain, that's 30%. Putting the two days together that's a probability of 0.9 x 0.3 = 0.27. Etc for all 365 days. Then use Bayes' theorem to get the probability of the forecast, conditional on the data, as the likelihood x the prior of the forecast/the prior of the data. When we do this for both forecasters, and take the ratio, and assume we have equal prior confidence in both forecasters, all the priors cancel out.

That probably wasn't very clear.
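Perhaps a sketch is clearer than words; the comparison forecaster B (50% both days) is invented for illustration:

```python
def likelihood(P, R):
    """p(data | forecaster): product over days of the probability the
    forecaster assigned to what actually happened."""
    L = 1.0
    for p, r in zip(P, R):
        L *= p if r == 1 else 1.0 - p
    return L

def posterior_odds(P_A, P_B, R):
    """With equal priors on A and B, the posterior odds reduce to the
    likelihood ratio p(data|A) / p(data|B); all the priors cancel."""
    return likelihood(P_A, R) / likelihood(P_B, R)

# Nick's two-day example for A (90% then rain, 70% then no rain),
# against a hypothetical B who says 50% both days:
lr = posterior_odds([0.9, 0.7], [0.5, 0.5], [1, 0])
print(round(lr, 3))  # 0.27 / 0.25 = 1.08
```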

Nick,

From Mathias above:

"I'm curious if this is helpful (and simple enough :-)). It also doesn't help much to determine if one forecaster is just guessing. To do that, I would expect that we have to know more (Example: how often we expect it to rain. In the Sahara someone will do really well by guessing "no rain"). So the approach above just helps to differentiate between the two weather forecasters."

To determine the value of each weatherman, you would also need a reasonable set of predictions by a non-weatherman. As Mathias mentions, having a weatherman in the Sahara desert forecast no rain does not tell you much about his or her value. The Brier score will tell you how accurate a weatherman is relative to the actual weather, but will tell you nothing about how much more accurate a weatherman is compared to, say, sticking your head out the window or venturing a guess based upon the time of year.

Frank: we could always construct a fake weatherman C, who just makes the same average forecast every day, and repeat the likelihood ratio test.

Neat stuff this Bayesian thing. I think I'm getting the gist of it.

As for my other question, about how we could tell if a forecaster was over- or under-confident in his predictions, we could also construct a fake weatherman A', by taking an S-shaped function of A's forecasts, to push those forecasts either closer to 0%-100%, or closer to 50%, and then see if A' does better than the original A.

Arin: that was what I was wondering about in my 9.03 am comment, which rpl says is called a Brier Score.

I didn't realise there would be several quite different answers to my question.
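One possible S-shaped function for building that fake weatherman A' is to raise the odds to a power (the particular functional form is my choice for illustration, not Nick's; it assumes forecasts strictly between 0 and 1):

```python
def recalibrate(p, k):
    """S-shaped map on (0, 1): raise the odds p/(1-p) to the power k.
    k > 1 pushes forecasts toward 0%/100%; k < 1 pulls them toward 50%."""
    odds = (p / (1 - p)) ** k
    return odds / (1 + odds)

print(recalibrate(0.6, 2.0))  # sharpened: above 0.6
print(recalibrate(0.6, 0.5))  # flattened: between 0.5 and 0.6
print(recalibrate(0.5, 9.0))  # 0.5 is a fixed point for any k
```

Sweeping k and re-scoring A' against the data would then reveal whether the original A was over- or underconfident.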

I have to surrender to "Brier Score"; I apologize for my ignorance.

The 3-component decomposition mentioned in the Wikipedia article is particularly instructive (the Brier score can be decomposed into "Uncertainty" - the unconditioned entropy of the event, "Reliability" - how well the forecast probabilities match the true probabilities, and "Resolution" - how much the forecast reduces the entropy of the event.)
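A sketch of that decomposition (Murphy's, as described in the Wikipedia article), grouping days by the distinct forecast values issued; the eight-day record is invented:

```python
from collections import defaultdict

def brier_decomposition(P, R):
    """Murphy decomposition: Brier = Reliability - Resolution + Uncertainty."""
    n = len(R)
    base = sum(R) / n                     # climatological frequency
    groups = defaultdict(list)
    for p, r in zip(P, R):
        groups[p].append(r)
    rel = sum(len(v) * (p - sum(v) / len(v)) ** 2
              for p, v in groups.items()) / n
    res = sum(len(v) * (sum(v) / len(v) - base) ** 2
              for v in groups.values()) / n
    unc = base * (1 - base)
    return rel, res, unc

P = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2, 0.2]
R = [1,   1,   1,   0,   0,   0,   0,   1]
rel, res, unc = brier_decomposition(P, R)
brier = sum((p - r) ** 2 for p, r in zip(P, R)) / len(R)
print(round(rel, 4), round(res, 4), unc)       # 0.0025 0.0625 0.25
print(abs(brier - (rel - res + unc)) < 1e-9)   # True: the identity holds
```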

Suppose you had two weathermen #1 and #2. Weatherman #1 gets it right 95% of the time and weatherman #2 gets it right 90% of the time.

In the Sahara, suppose a non-weatherman has an 80% chance of guessing the weather just from living there. The relative value of weatherman #1 to weatherman #2 is:

(95% - 80%)/(90% - 80%) = 1.5

On the Florida coast, suppose a non-weatherman has only a 50% chance of guessing the weather from living there. The relative value of weatherman #1 to weatherman #2 shrinks to:

(95% - 50%)/(90% - 50%) = 1.13

And of course there are other measures of value. The average deviation of a weatherman's predictions tells us nothing about the volatility of those predictions. Which kind of weatherman would be more valuable?

1. One that on average gets things pretty close, but swings wildly from overestimating the chances of rain to underestimating them
2. One that on average overestimates the chances of rain, but also consistently overestimates the chances of rain

"Having a weatherman in the Sahara forecast no rain does not tell you much about his/her value."

Well, not until the unusual day when it does rain. Then you learn everything about the forecaster's value (sort of -- real weather forecasting has a dimension of time phasing as well). Metrics based on Bayesian methods have the same problem.

"The Brier score will tell you how accurate a weatherman is in relation to what the actual weather is, but will tell you nothing about how much more accurate a weatherman is compared to say sticking your head out the window or venturing a guess based upon the time of year."

Typically, short-range weather forecasts are compared to the persistence forecast (i.e., that the weather won't change from what it is now). Long-range forecasts might use the climatological average. For other applications you usually have two forecasting methods that you want to compare to one another to see which is more skillful, so each serves, after a fashion, as a benchmark for the other.

Also note that if you interpret the ratio of the Bayesian-derived scores as an odds ratio, then you're implicitly assuming that the events being forecast are independent, which may or may not be a good assumption. (For short-range weather forecasts, it probably isn't.)

Hmmm. Maybe Kelly's Theorem works, as well. :)

Give each forecaster an initial bankroll of \$1, and let her make an even bet of (2*P - 1)*B of rain or no rain for each day, where B is their current bankroll and P is their probability of rain or no rain, whichever they predict to be more likely. (They are not betting against each other, but against the Banker in the Sky.) Who ends up with the bigger bankroll?

The correct but timid prognosticator who is always right about which option is more likely, but underestimates, so that she predicts a 60% chance of rain or no rain, will always bet 20% of her bankroll and will always win, ending up with a bankroll of \$1.2^365. The overconfident prognosticator who always predicts a 100% chance of rain or no rain will always bet 100% of her bankroll and will, unless perfect, go bust.
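A sketch of this even-money betting scheme, reproducing the two cases above:

```python
import math

def kelly_bankroll(P, R):
    """Even-money Kelly-style betting: stake (2q - 1) of the bankroll on
    whichever side the forecaster calls more likely (confidence q)."""
    B = 1.0
    for p, r in zip(P, R):
        q = max(p, 1 - p)             # confidence in the called side
        stake = (2 * q - 1) * B
        won = (r == 1) == (p >= 0.5)  # did the called side happen?
        B += stake if won else -stake
    return B

# Timid-but-always-right: 60% on the correct side, 365 days.
timid = kelly_bankroll([0.6] * 365, [1] * 365)
print(math.isclose(timid, 1.2 ** 365, rel_tol=1e-9))  # True
# Overconfident: one wrong 100% call wipes out the bankroll.
print(kelly_bankroll([1.0], [0]))                     # 0.0
```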

----

Here is another betting scheme, which is kinder to the all or nothing predictor. Let the two predictors bet against each other, giving odds such that their subjective expectation for each bet is \$1. See who wins overall.

Suppose that the timid predictor above (T) predicts a 60% chance of rain, while the confident predictor (C) predicts rain. Then T bets \$1 that it will **not** rain while C bets \$2.50 that it will rain. C will win the \$1. OTOH, when their predictions differ, T will bet \$1 and C will bet \$1⅔. T will win the \$1⅔.

Under this scheme C will win more money if she is correct more than 62.5% of the time, even though, from a Bayes/Kelly standpoint, she is the worse predictor because sometimes she makes a wrong prediction with 100% certainty. ;)

The best forecaster would be the best gambler, so gauge the forecasters by their expected return if they bet their total worth every day.

Example:
Suppose a forecaster gives a 70% chance of rain, and has a net worth of \$1.00. They would bet \$1.00 on "rain" at 7 to 3, and would bet \$1.00 (on margin) on "no rain" at 3 to 7. If it rained, they'd end up with a net position of \$2.33 (7/3 dollars), and if it shined, they'd end up with \$0.43 (3/7 dollars).

Weather in San Francisco is rainy about 20% of the time. A forecaster who always gives a 20% chance of rain will have an expected return of 0%. A forecaster who gives a 1% chance of rain will be "right" 80% of the time but will have a rate of return of about -95%. A forecaster who changed odds every day could have a very positive expected return.

The logarithmic rate of return of these bets is an average of logits (http://en.wikipedia.org/wiki/Logit). An average of logits almost certainly has some information theoretic interpretation, but I'd need to sit down with pen and paper to work out exactly what.

I'm siding with it being a higher-dimensional problem: you need to specify a loss or utility for a prediction of p with actual outcomes 0 or 1.

In general, if you don't want to specify a loss function you could use the one implied by the likelihood, or you could use the log predicted probability, which has some theoretically pretty properties (it's related to entropy). In the case of independent binary outcomes these are the same.

The log probability is the only proper scoring rule that depends only on the probability assigned to the outcome that actually occurred, but "proper scoring rule" is less desirable than the name makes it sound. A proper scoring rule is one where it's always to your advantage to quote your actual predicted probability of an event. It's pretty clear that in weather forecasting this isn't true. Nate Silver writes about weather forecasting in his new book and points out that the US National Weather Service quotes its actual probabilities (which are pretty well calibrated), but the Weather Channel quotes higher probabilities of rain at the low end, because it is actually worse to be wrong by predicting no rain than to be wrong by predicting rain.

The Brier score is another useful default, but like all defaults it sweeps under the carpet the question of what scoring rule you actually want.

Nick, your question seems to assume that weather is a pre-determined closed system, subject to Laplacean determinism and Laplacean predictive mastery.

If it is not such a system, then re-running history in repeated 'trials' would produce different results each time from the same set of 'given' initial conditions knowable to the two weather forecasters.

Re-run history and *different* forecasters can come out the 'better' forecaster.

Are you familiar with the concept of sensitivity to initial conditions, or with nonlinear dynamic systems -- or systems involving fluid dynamics with turbulence, which are mathematically intractable?

RPL,

"Give each forecaster an initial bankroll of \$1, and let her make an even bet of (2*P - 1)*B of rain or no rain for each day, where B is their current bankroll and P is their probability of rain or no rain, whichever they predict to be more likely. Who ends up with the bigger bankroll?"

That doesn't sound quite right. The Kelly bet is:

p = probability of rain based on gamblers intuition
b = payoff on wager (for instance 4:1, this represents the "house" odds)
f = fraction of bankroll to bet
r = result of wager (1 = win, 0 = loss) - Obviously, this is also the weather results (1 = rain, 0 = no rain)
w = winnings from wagering

You are assuming that b is always equal to 1, that the "house odds" for rain are always even. If that is the case, then you get the simplified form:
f = 2p - 1
w = 1 + f * ( r * ( 1 + b ) - 1 )

In general:
f = ( p + p/b - 1/b )
w = 1 + f * ( r * ( 1 + b ) - 1 ) = 1 + ( p + p/b - 1/b ) * ( r * ( 1 + b ) - 1 )

The total payout over a timeframe T would be:
Product [ 1 + ( p + p/b - 1/b ) * ( r * ( 1 + b ) - 1 ) ] as t goes from 1 to T. Note that p, b, and r are all functions of t.

Compare it to this:
a = b / (1 + b) : This converts the house odds b (for instance 4:1 odds of rain) into a percentage a (for instance 80% chance of rain)

"House" deviation (H)= Sum of [ | a - r | / T ] as t goes from 1 to T
Gambler deviation (G) = Sum of [ | p - r | / T ] as t goes from 1 to T

Value (V) of the gambler would be:
V = ( H - G ) / H

It seems to me that the payout method for evaluating a weatherman / gambler is only as good as the accuracy of the house odds. If the actual weather significantly differs from the weather predicted by the house, then that would skew the results of the betting strategy. A few successful big bets at long odds could overwhelm a lot of losing bets at short odds.

Instead I think it is better to look at a ratio of the gamblers deviations from actual weather in comparison with the house deviations.
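As a sanity check on the Kelly arithmetic above, here is a minimal Python sketch (function names are mine). It assumes b is the net odds, so "4:1" means b = 4, and that a winning bet returns the stake plus f*b while a losing bet forfeits the fraction f:

```python
def kelly_fraction(p, b):
    """Kelly bet fraction for win probability p at net odds b (b = 4 means 4:1)."""
    return p + p / b - 1 / b  # equivalently p - (1 - p) / b

def bankroll_after(bets, start=1.0):
    """bets: list of (p, b, r) tuples, with r = 1 if the bet wins and 0 if it loses."""
    w = start
    for p, b, r in bets:
        f = kelly_fraction(p, b)
        w *= (1 + f * b) if r else (1 - f)  # win: stake back plus f*b; lose: forfeit f
    return w

# Even odds (b = 1), 80% confidence, and it rains: f = 0.6, bankroll goes to about 1.6.
print(bankroll_after([(0.8, 1, 1)]))
```

At even odds this reduces to f = 2p - 1, which matches the bankroll example worked through later in the thread.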

Frank Restly: "You are assuming that b is always equal to 1, that the "house odds" for rain is always 50%."

I am postulating that. No learning from the wagering occurs. The Great Banker in the Sky does not care about winning or losing.

Frank Restly: "Instead I think it is better to look at a ratio of the gamblers deviations from actual weather in comparison with the house deviations."

I am not assessing the prognostication vs. actual weather, but one prognosticator vs. another. That makes a big difference. The comparison is indirect, by having both play against the house. The payoff for each is the product of their probabilities divided by (0.5)^365. In the ratio of their payoffs, the denominators drop out. This is equivalent to the Bayes comparison.

Note: You can permute the wagers. For instance, suppose that the prognosticator predicts rain with a probability of 80% on one day and no rain with a probability of 60% on another day, and it rains both days.

Order 1:
Day 1: Prediction of rain of 80%. Bet of \$0.60, which wins. New bankroll: \$1.60.
Day 2: Prediction of no rain of 60%. Bet of \$0.32, which loses. New bankroll: \$1.28.

Order 2:

Day 1: Prediction of no rain of 60%. Bet of \$0.20, which loses. New bankroll: \$0.80.
Day 2: Prediction of rain of 80%. Bet of \$0.48, which wins. New bankroll: \$1.28.

All same same. :)
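The order-invariance claim is easy to check in a few lines of Python (a sketch with my own illustrative names): each even-odds bet multiplies the bankroll by (1 + f) or (1 - f), and multiplication commutes.

```python
def run_bets(bets, bankroll=1.0):
    """bets: list of (p, win) where p is the forecast probability of the side bet on
    (the more likely side, so p >= 0.5) and win says whether that side happened.
    Each even-odds bet stakes the fraction f = 2p - 1 of the current bankroll."""
    for p, win in bets:
        f = 2 * p - 1
        bankroll *= (1 + f) if win else (1 - f)
    return bankroll

order1 = [(0.8, True), (0.6, False)]   # rain call wins, then no-rain call loses
order2 = [(0.6, False), (0.8, True)]   # the same two bets in the opposite order
print(run_bets(order1), run_bets(order2))  # both about 1.28
```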

Robert Cooper: "Suppose the forecaster gives a 70% chance of rain, and has a net worth of \$1.00. They would bet \$1.00 on "rain" at 7 to 3, and would bet \$1.00 (on margin) for "no rain" at 3 to 7. If it rained, they'd end up with a net position of \$2.33 (7/3 dollars) and if it shined, they'd end up with \$0.43 (3/7 dollars)."

Isn't that backwards? The first bet is \$1 vs. \$0.43 in favor of rain. The second bet is \$1 vs \$2.33 against rain. If it rains, the gambler wins the first bet but loses the second. Result: \$0.43. If it does not rain, the gambler loses the first bet but wins the second. Result: \$2.33. If the probability of the forecast is correct, the expectation of both bets together is \$1.

If I think that the chance of rain is 70%, how do I find someone else to give me odds of 7:3 on rain?

The problem is that this isn't actually covered in any econometrics or probability classes. You have to go to a meteorology department to figure out how to do this without making stupid mistakes.
A good example of why good skill scores are needed is this model, which was a real model back in the day and actually used by a real weather forecaster: it predicts that on every given day there will not be a tornado in a given town. The forecaster claimed that he would be 98% right, but what you care about is the day when there in fact is a tornado. He got no false positives, but way too many (100%) false negatives. His score was *biased.* This is very different from what econometricians call bias.

The Heidke skill score is a better measure.

Here's a really simple example for yes-no answers:

a= Forecast=Yes, Reality=Yes.
b= Forecast=Yes, Reality=No.

c = Forecast=No, Reality=Yes
d = Forecast=No, Reality=No

We come up with the Heidke skill score as follows:
We compare how well our model does against a coin flip, and how well a perfect model does against a coin flip.
Then, to make the numbers nice, we take the ratio of those two results:

HSS = (number correct - expected number correct with a coin flip)/(perfect model's number - number correct with a coin flip)

This simplifies to:

HSS = 2(ad - bc)/[(a+c)(c+d) + (a+b)(b+d)]
An HSS of 1 means a perfect forecaster, 0 means the forecaster has no skill, and a negative value says that flipping a coin is actually better than the forecaster.
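A small Python sketch of the 2x2 version (using the standard simplified formula, with a = hits, b = false alarms, c = misses, d = correct negatives):

```python
def heidke_skill_score(a, b, c, d):
    """Heidke skill score from a 2x2 contingency table:
    a = forecast yes / observed yes, b = forecast yes / observed no,
    c = forecast no / observed yes, d = forecast no / observed no."""
    return 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))

print(heidke_skill_score(40, 0, 0, 60))    # 1.0: perfect forecaster
print(heidke_skill_score(25, 25, 25, 25))  # 0.0: no better than a coin flip
```

The tornado forecaster above (all forecasts "no") gets a = 0 and b = 0, so his HSS is 0 no matter how often he is "right."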

There are many other types of Skill Scores. They differ based on how they treat rare events and non-events and systemic vs random errors. You can extend skill scores from a 2x2 table to a larger table for more complex forecasts. This won't do for probabilistic forecasts.

For probabilistic forecasts instead of weighing false positives vs false negatives you are weighing sharpness vs reliability.
Here are some skill scores for probabilistic forecasts.

The Ignorance Skill Score:
Let: f be the predicted probability of an event occurring, lying on the open interval (0,1). (The ignorance skill score assumes that we are never 100% sure about anything.) Also, the ignorance skill score has units of "bits." Yes, it's the same thing as we talk about when we speak of "bits" in a computer. It traces its foundations to information theory.

And Let:
Ignorance_t(f_t) = -log_base_2(f_t) when the event happens at time period t, and
Ignorance_t(f_t) = -log_base_2(1 - f_t) when the event does not happen at time period t.
T = the number of time periods t.

The expected ignorance is computed the normal way:

Ignorance(f)=Sum_over_all_t( I_t(f_t)) / T

Standard errors for our estimate of ignorance are also computed the normal way

Back to your original question, we can then compare the ignorance of the two forecasters by seeing which one is more ignorant.

Explanation:
That was not intuitive, so next we will try to come up with an intuitive way to explain it.

Let's define a function that is "a measure of the information content associated with the outcome of a random variable."

Since it's a measure of information, it should have the following properties.

1) The self-information of A_i depends only on the probability p_i of A_i happening.

2) It's a strictly decreasing function of p_i: the higher the probability of A_i, the less information we gain when it happens.

3) It's a continuous function of p_i. We don't want finite changes in information from infinitesimal changes in probability.

4) If an event A is the intersection of two independent events B and C, then the information we gain when we find out that A has happened should equal the information we gain from learning that B happened plus the information we gain from learning that C happened.
Said another way: if P_1 = P_2 * P_3 then I(P_1) = I(P_2) + I(P_3).

Luckily, only one class of functions fulfills these criteria!
I(event x) = k*log(p(x)), where k can be any negative number, so we pick k to give us units of bits:
I(event x) = (1/ln(2)) * ln(1/p(x)) = -log_base_2(p(x)), where p(x) is the probability of that event happening.

Now let's define a sort of measure of our surprise. This is the information we gain from seeing the results of our predictions. If the event happened, the knowledge we gained from our probability forecast is -log_base_2(f_t). However, if we picked incorrectly, we gained evidence for the alternate event, so if we believed incorrectly we gain -log_base_2(1 - f_t) knowledge.

Let's work this out for some events.
We think there's a 10% chance of James winning an election in 2010. James loses, so we gain -log_b2(0.9) bits of info. We gained very little information, because a 10% chance that James wins is close to certainty that he loses.

We think there's a 90% chance of Bill winning his election. Bill wins, so we gain -log_b2(0.9). Again we gain very little information, because 90% is close to certainty.

Bill gets caught cheating on his wife with a goat. We think there's a 1% chance of Bill winning his next election. He manages to win. We are very surprised! We gain a lot of information this time. We gain -log_b2(0.01).

James turns out to have done a great job in office. We think there is a 90% chance that he gets reelected. But we are surprised; he loses. We gain -log_b2(0.1) bits of information.

Then our total information gained is:
10.27 bits of information.

Our expected ignorance as a forecaster is about:
10.27/4 =2.57 bits per forecast.
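Those four numbers can be reproduced in a few lines of Python (a sketch, with my own function name):

```python
import math

def ignorance(f, happened):
    """Bits of surprise from a forecast that assigned probability f to an event."""
    return -math.log2(f if happened else 1 - f)

# The four election forecasts above: (forecast probability, did the event happen?)
forecasts = [(0.10, False), (0.90, True), (0.01, True), (0.90, False)]
total = sum(ignorance(f, h) for f, h in forecasts)
print(round(total, 2))                    # 10.27 bits in total
print(round(total / len(forecasts), 2))   # 2.57 bits per forecast
```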

I wrote this way early in the morning, before going to bed, so it may have typos.

Another way to think of Ignorance skill score is an estimate of the difference in surprise as measured in binary bits, between you and an omniscient forecaster.

Ok, a grammar, typo, and example corrected version of my explanation is here:

http://entmod.blogspot.com/2012/11/skill-scores-re-nick-rowe.html

Min,

"Note: You can permute the wagers. For instance, suppose that the prognosticator predicts rain with a probability of 80% on one day and no rain with a probability of 60% on another day, and it rains both days

Day 1: Prediction of rain of 80%. Bet of \$0.60, which wins. New bankroll: \$1.60.
Day 2: Prediction of no rain of 60%. Bet of \$0.32, which loses. New bankroll: \$1.28."

What I was referring to is the effect that one long-odds bet has on the bankroll. Suppose the house odds for rain on a particular day are 1 million : 1 against (about a 0.0001% chance of rain) but it happens to rain on that day.

Player: Prediction of rain 100%. Bet of (1 + 1/1000000 - 1/1000000)=\$1.00 which wins. New bankroll = \$1,000,000

What effect does one long-odds bet have on the bankroll in a finite number of gambles? For instance, suppose that two gamblers are given 1000 guesses, but on one of those days the odds of rain are far longer than one in the number of guesses given (1 million to 1, or 1 in a billion). In a betting strategy, that would place a premium on guessing that day correctly.

"The Great Banker in the Sky does not care about winning or losing."
But the payout is a function of what the prevailing house odds are. The Great Banker in the Sky may not care about winning or losing, but the payout is determined by the house odds set by that Banker.

"This is equivalent to the Bayes comparison."
I don't think so. It has to do with the effect that one "lucky guess" can have on the net result. In the Bayes calculation, if the actual weather deviates from the house odds by 99.9999% on a single day, then the net effect of one long shot bet for that day on the results of 1000 bets is 99.9999% / 1000 bets = 0.099999%. Meaning it will shift the result of the Bayes calculation about 0.1%. However, the effect of the long shot bet for that day on the gamblers 1000 bets can be much more significant.

The gambler idea is a bad skill score, because it makes later winnings dependent on earlier winnings. This weighs earlier predictions higher than later ones.

DocMerlin,

"The gambler idea is a bad skill score, because it makes later winnings dependent on earlier winnings. This weighs earlier predictions higher than later ones."

The betting strategy being discussed sets a bet amount as a percentage of total holdings, not a fixed amount. And it really doesn't matter when a win occurs because the percentage gain on holdings will carry through.

For instance, three bet results

Bet one: Win 15% of holdings
Bet two: Lose 5% of holdings
Bet three: Lose 5% of holdings

It really doesn't matter in what order these bets occur. The net result is the same: (1 + .15)(1 - .05)(1 - .05) = 1.0379

My issue had more to do with when the odds of winning on a particular day are significantly greater than or less than the total number of gamble chances that are given. That places a premium on winning those days, since the payout on those days can be significantly higher than the rest.

Frank Restly: You convinced me that expected ignorance is the way to go. The perplexity of the forecaster's ignorance, which you can interpret as a gamble, might be more intuitive for people who don't like to think in bits.

Suppose our forecaster must, by law, give a probability for the weather tomorrow and accept bets with payouts of 1 / (probability he assigns to the event). For example, if he gives a 75% forecast of rain, he must offer anyone bets that pay 4/3 if it rains and 4/1 if it doesn't rain.

If you could always perfectly predict the future and always bet correctly against the forecaster, how much money could you make? How much can you expect to multiply your net worth by per forecast, on average?

Gambler's interpretation: Multiply the forecaster's payouts for correct bets for the year, and take the 365th root (geometric average).
Information-theoretic interpretation: We expect b bits of ignorance per forecast, so you can expect to multiply your net worth by 2^b per forecast.
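The equivalence of the two interpretations is easy to verify numerically (a Python sketch; the forecasts and outcomes here are made up for illustration):

```python
import math

# (probability the forecaster assigned to rain, did it actually rain?)
forecasts = [(0.75, True), (0.30, False), (0.60, True), (0.90, False)]

# Payout per dollar for a bettor who always backs the true outcome:
# 1 / (probability the forecaster assigned to that outcome).
payouts = [1 / (f if rained else 1 - f) for f, rained in forecasts]
geo_mean = math.prod(payouts) ** (1 / len(payouts))

# Average ignorance of the forecaster, in bits.
avg_ign = sum(-math.log2(f if rained else 1 - f)
              for f, rained in forecasts) / len(forecasts)

print(geo_mean, 2 ** avg_ign)  # the two numbers agree
```

The identity holds because the geometric mean of 1/p_t is 2 raised to the arithmetic mean of -log2(p_t), which is exactly the expected ignorance.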

Frank Restly: "What I was referring to is the effect that one long odds bet has on the bankroll. Suppose the house odds for rain on a particular day are a 1 million : 1 against (0.0001%. chance of rain) but it happens to rain on that day."

First, the house odds do not have to exist. (I think that they do, but that's another question.) As I said, that has to do with how well a forecaster does with regard to a probabilistic reality. What I was doing with the Kelly scheme, in a slightly roundabout way, was comparing two forecasters against each other, in a way that does not require probabilistic reality. Reality can be deterministic, or non-deterministic without probabilities, or non-deterministic with non-numeric probabilities, and there are surely other possibilities for reality! :)

Second, the Kelly comparison does not take into account the variability of results. That is indeed a concern. :)

Noam Chomsky on calculating statistical probabilities between correlated weather events vs understanding complex weather systems:

http://tillerstillers.blogspot.com/2012/11/noam-chomsky-on-ai-bayesianism-and-big.html

Min,

I understand what you are saying, but without a probabilistic reality to compare the gamblers against, they could both be flat out guessers and one of them just happened to guess right more often.

That is why I think the Bayes calculation makes more sense, even when comparing two gamblers, the house odds are not thrown out.

Value of Gambler #1 = (House Deviation from Actual - Gambler #1 Deviation from Actual) / House Deviation from Actual
Value of Gambler #2 = (House Deviation from Actual - Gambler #2 Deviation from Actual) / House Deviation from Actual

The relative value of Gambler #1 to Gambler #2 is:
(House Deviation from Actual - Gambler #1 Deviation from Actual) / (House Deviation from Actual - Gambler #2 Deviation from Actual)

The relative value of one gambler to another becomes smaller as the house deviation increases. In a truly random event that is non-deterministic without probabilities, the relative value of one gambler to another is always 1 because the house deviation from actual will be infinite. Meaning in a random event, one gambler's guess is as good as another's. In a random event, even if one gambler gets more guesses right, the relative value of that gambler to another is still 1.

@ Frank Restly

The "house odds" do not appear in the Bayesian comparison. What you have is P(data | predictions of A)/P(data | predictions of B). That's it. You do not have P(data | predictions of the House).

I thought I'd point you to this common measure of accuracy and calibration. http://en.wikipedia.org/wiki/Brier_score

There seems to be a conceptual confusion in here between predictions for repeated events and predictions for single events.

A probabilistic prediction for a single event (like rain/no rain today) is worthless/meaningless. For a single event, you just want a yes/no answer or a number (maybe generated from a probabilistic model, but you need a definite answer). For sets of events, a probabilistic prediction could be useful.

The Brier score that some have mentioned is about how probabilistic predictions fare over a defined set of similar events.

Has anyone written a clear and comprehensive paper on the limits of applying the pure logic of probabilities & statistical math to nonlinear dynamic phenomena with open-ended evolutionary unfolding and essential sensitivity to initial conditions?

For example, what does statistical 'science' have to tell us about the unfolding of a Mandelbrot set? Anything?

http://en.wikipedia.org/wiki/Mandelbrot_set

Mark Stone wrote a good piece in the 1990s explaining how this sort of phenomenon defeats 'prediction' as imagined by Laplacean determinism.

Is there a literature on this general topic?

Greg: Yes. Ergodic theory.

And I think you're guilty of misrepresenting Laplace's view. People tend to forget that he concluded:

"All these efforts in the search for truth tend to lead back continually to the vast intelligence which we have just mentioned, but from which it will always remain infinitely removed"

Oh, and optimal filtering is the practical machinery.

@Phil H
"The Brier score that some have mentioned is about how probabilistic predictions fare over a defined set of similar events."

The ignorance score is more sensitive for events with very high or very low probability than the Brier score is. Plus it has really nice units in terms of information theory.

Phil:

You wrote:
"A probabilistic prediction for a single event (like rain/no rain today) is worthless/meaningless"

Not at all. It depends on what you mean by "probability". In the frequentist interpretation, your statement would be correct, but that's just one of many interpretations. A Bayesian, for example, would be quite comfortable assigning a probability to a single event.

Now, weather forecasting is closer to the frequentist's use of probabilities as mentioned above.

Do Austrian economists buy life insurance? Keynes was president of a life insurance company, he knew risk, probability and uncertainty very well. Without aggregation and risk modelling, life insurance would not work. In fact it has worked for centuries.

@Determinant?
WTF is this coming from?
Of course Austrians believe in financial risk modeling, they just believe that it has no place in economic theory. They believe that sort of modeling isn't rich enough to handle the complexities of human behavior.

From the Theory of Signal Detection (SDT), there is a procedure for plotting a Receiver Operating Characteristic (ROC) curve for an observer based on the confidence of the decision/prediction, self-rated:

If you interpret forecast probability as a measure of confidence, the application is straightforward. SDT characterizes performance with separate measures of sensitivity and bias. Also, the sensitivity measures have true zero points at the chance level of performance.

Of course, this is not an 'econometric' methodology; rather, it has been applied in fields from radio communications to human sensory psychology to medical diagnosis - any case of decision-making under uncertainty.

Good luck.

Nick:

Galbraith, John W. and Simon van Norden (2012) "Assessing gross domestic product and inflation
probability forecasts derived from Bank of England fan charts" J. R. Statist. Soc. A (2012)
175, Part 3, pp. 713–727

Think in terms of the regression of R(t) on a constant and P(t). The forecast will be unbiased if the constant is zero and the estimated coefficient on P(t) is 1. Statisticians say such a forecast is "well-calibrated". More precisely, that means E(R|P) = P. However, not all such forecasts are created equal. We'd also want to consider the R^2 from our regression. A higher R^2 implies that our P(t) has more explanatory power for R(t). Statisticians say that such a forecast has higher "resolution".

For forecast comparisons, statisticians typically like to ensure that both forecasts are well-calibrated and then compare their resolution. However, you might instead want to just compare MSFE (or, if you have a different loss function in mind, just compare expected loss.) You could also do tests in the spirit of forecast encompassing. For example, suppose you have P1 and P2 and you'd like to compare them. Well, just regress R on a constant, P1 and P2. Do both have significant coefficients? If not, then you can say that the one with the insignificant coefficient adds nothing significant to the other forecast.
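That calibration regression can be sketched in a few lines of Python with simulated data (the uniform forecast distribution, sample size, and seed are arbitrary assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Simulated well-calibrated forecasts: R is Bernoulli with probability P.
P = rng.uniform(0.05, 0.95, n)
R = (rng.uniform(0, 1, n) < P).astype(float)

# Regress R on a constant and P; calibration implies intercept 0 and slope 1.
X = np.column_stack([np.ones(n), P])
coef, *_ = np.linalg.lstsq(X, R, rcond=None)
intercept, slope = coef
print(intercept, slope)  # close to 0 and 1
```

A badly calibrated forecaster (say, one who quotes 0.6 whenever the truth is 1) would show up here as a slope well above 1.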

Ken Schulz: Actually, Oscar Jorda (among others) has been using ROC analysis in economic applications. His recent paper in American Economic Review is a good example.

Simon van Norden,
Thanks for the pointer. I should have said ROC analysis is not specific to any one discipline. I'm glad to see it's finding application in economics.
A little while back I was reading a discussion of entrepreneurship in the US - apparently new-business starts are down, but the success rate is up. I thought, well, yes, if entrepreneurs and investors have any ability to discriminate good from poor opportunities, that's exactly what should be expected; false alarms should drop off faster than hits as the criteria tighten. I didn't find anything relevant in a quick Google Scholar search; jumped too quickly to the conclusion that ROC wasn't being used much in Econ.
I am an engineering psychologist, this is purely an avocational interest.

The comments to this entry are closed.
