Predictions with varying probabilities are a much more difficult problem. In fact, the procedure had to be developed specifically for this situation by Dr Jones and his brother, who is a PhD statistician in the medical field.
First the probability for each prediction has to be determined. Then the prediction has to be judged a hit or a miss. Once this has been done, each prediction is given a numerical score.
The score for a hit is -ln(p) and the score for a miss is ln(q), so a hit scores positive and a miss scores negative. The expected score is -p*ln(p) + q*ln(q) and the variance is p*q*(ln(p*q))^2. All these numbers are accumulated in running totals. For clarification, p is the probability of success, q = 1 - p is the probability of failure, ln is the natural log and * indicates multiplication.
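As a check on the formulas above, here is how the expected score and the variance fall out if each prediction is treated as a two-outcome random variable S (the symbol S is my own label, not part of the original procedure). It uses the standard result that a variable taking value a with probability p and value b with probability q = 1 - p has variance p*q*(a - b)^2.

```latex
\begin{align*}
E[S] &= p\,(-\ln p) + q\,(\ln q) = -p\ln p + q\ln q \\
\operatorname{Var}(S) &= p\,q\,\bigl((-\ln p) - (\ln q)\bigr)^2
                       = p\,q\,(\ln p + \ln q)^2
                       = p\,q\,\bigl(\ln(p\,q)\bigr)^2
\end{align*}
```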
When all the predictions have been processed, the standard deviation is the square root of the sum of the variances, and the final score is the quantity (total score minus the total expected score) divided by the standard deviation.
The result is the number of standard deviations the whole group lies from its expected score. This is compared to a table of normal values to get the significance level.
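For anyone who wants to try the procedure, here is a minimal Python sketch of the steps described above. The function names and the list-of-(p, hit) input format are my own choices for illustration. The table lookup is replaced by the normal CDF computed with math.erf, and I am assuming a one-sided comparison, which the description above does not spell out.

```python
import math

def score_predictions(predictions):
    """Return the final score (in standard deviations) for a group.

    predictions: iterable of (p, hit) pairs, where p is the probability
    of success (strictly between 0 and 1) and hit is True or False.
    """
    total_score = 0.0
    total_expected = 0.0
    total_variance = 0.0
    for p, hit in predictions:
        if not 0.0 < p < 1.0:
            raise ValueError("p must be strictly between 0 and 1")
        q = 1.0 - p
        # A hit scores -ln(p); a miss scores ln(q).
        total_score += -math.log(p) if hit else math.log(q)
        # Running totals of expected score and variance.
        total_expected += -p * math.log(p) + q * math.log(q)
        total_variance += p * q * math.log(p * q) ** 2
    std_dev = math.sqrt(total_variance)
    return (total_score - total_expected) / std_dev

def one_sided_significance(z):
    """Normal CDF at z, standing in for the printed table (assumed one-sided)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

if __name__ == "__main__":
    # Hypothetical data: four predictions at p = 0.5, all hits.
    z = score_predictions([(0.5, True)] * 4)
    print(z, one_sided_significance(z))
```

With those four made-up coin-flip predictions all coming up hits, the final score works out to 2 standard deviations, which is a one-sided level of roughly 97.7%.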
At this point things get arbitrary. The acceptable level of significance is a matter of opinion. In the social sciences 95% is considered good enough. For prediction evaluation I require 99%. Randi requires 99.9% but then he's risking $1,000,000.00 so he wants to be REALLY sure.
Roger