Friday, June 27, 2008

Bayesian Thinking

Since this winter, I haven't had a lot of time to work on ChessVortex. But today I've revisited, for the umpteen-thousandnth time, the wikipedia page for Bayesian Inference. Since I will probably use Bayesian Inference for some ChessVortex studies I may contrive (and since it is a cornerstone of the Glicko rating system), I decided it might make a good topic for a blog entry.

Now, every time I work with Bayes's theorem, I have to re-read the entire introduction of the wikipedia article to help me remember exactly what Bayes's theorem means. Today, I invent a mnemonic that I hope will stick. First, the theorem:

This, of course, is the mathematical formula. Let's forget the formula for now and instead think about what it describes in terms of natural language, avoiding numbers, math, and equations at all costs (for a while, at least). For this exercise, I will use an example.

First, let's say we decided to take a hike through the desert. We know water is scarce (our source for Hydration out in the desert). We'll give a name to this knowledge: "prior knowledge". In other words, our prior knowledge is that, at any arbitrary time while walking through the desert, we know that we have a low chance (or Probability) to observe a source of Hydration. To put it yet another way, we can say that our Probability to observe Hydration in the desert is low.

Second, we know that Elephants, like sources of Hydration, can also be found in the desert, but we don't expect to see them very often. That is, the Probability to observe Elephants in the desert is low. This is also prior knowledge.

Third, lets imagine some other prior knowledge we have, namely that, if we observe a source of Hydration in the desert, we know that an Elephant must not be far away. We can re-phrase this by saying that the Probability to observe a source of Hydration increases given the observation of an Elephant.

Now, we can use all of this prior knowledge while in the desert and on the quest for some water. Namely, Bayes's Theorem says that we know to look hard for a source of Hydration when we see an Elephant. Why? Because (1) we know Hydration is scarce, (2) we know Elephants are even more scarce, but (3) we know that Hydration means Elephants. Since Hydration means Elephants, then conversely we can conclude that Elephants mean Hydration. That's Bayes theorem.

For completeness, let's turn it into the formula above:
  • Probability of Hydration: P(H)
  • Probability of Elephants: P(E)
  • Probability of Elephants given Hydration: P(E|H)
  • Probability of Hydration given Elephants: P(H|E)
Here is Bayes's theorem again:

One way to think about equations is to examine what will happen if you change values of the terms in them. For example, if the number of Elephants in the desert decreases, the the probability of seeing Elephants, P(E), decreases. If P(E) decreases, P(H|E) must increase if all other terms stay the same. In other words, as the probability of seeing Elephants decreases, the probability of finding Hydration given the observation of an Elephant, P(H|E), must increase. Such analysis can give equations like Bayes's Theorem an intuitive feeling.

Now, if we know the numbers for P(H), P(E), and P(E|H), then we can plug those in and get P(H|E). For example, let's say we expect only a 1% (P(H)=0.01) chance of seeing Hydration and a 0.5% (P(E)=0.005) chance of seeing Elephants at any given time in the desert. Let's also assume that, 40% (P(E|H)=0.40) of the sources of Hydration in a desert also have an Elephant near, then the probability that a source of Hydration is near when we observe an Elephant is

P(H|E) = P(E|H)*P(H)/P(E) = 0.40*0.01/0.005 = 0.8 = 80%

So, when we see an Elephant out in the desert, our expectation that Hydration is nearby jumps to a whopping 80%. This makes sense, doesn't it? Its like saying that we expect to see a restaurant near a hotel or fire in proximity to smoke. Bayes theorem puts numbers to all of these types of intuitive guesses we make every day.

Advanced Bayes: The Negative Test

Now we are going to rely a tiny bit more on math, so this is the Advanced section.

To make Bayes's theorem more useful, we need to know how to use it when we get a negative test. First, let's define our Elephant test as "looking around for an Elephant", which we do every time we are thirsty. Let's imagine that we get thirsty and then look around for an Elephant but we don't see one, resulting in a negative test. How can we combine the results of this test with our prior knowledge to estimate our chances for finding Hydration nearby?

Before we do this, we need to be able to define probabilities in terms of their complements. For example, the complement to one's seeing a source of Hydration is one's not seeing a source of Hydration. In terms of probabilities, the complement probability is one minus the probability. So, the Probability to not observe Hydration in the desert (denoted P(H')) is one minus the probability to observe Hydration in the desert:

P(H') = 1 - P(H) = 1 - 0.01 = 0.99 = 99%

If we see sources of Hydration in 1% of the places we stand in the desert, then we expect to not see sources of Hydration in the other 99% places we stand. The complement probability is just that simple.

But what is the complement of a conditional probability, like P(E|H)? Remember that P(E|H) is the Probability of seeing an Elephant given that we have observed a source of Hydration. The complement would be the Probability of not seeing an Elephant given that we have observed a source of Hydration. This probability is simply:

P(E'|H) = 1 - P(E|H) = 1 - 0.4 = 0.6 = 60%

This says that since 40% of the sources of Hydration have Elephants around them, 60% won't--it's pretty straightforward.

What about the probability of being near a source of Hydration given that we don't observe an Elephant (P(H|E'))? This is a bit more complicated, but mathematically we get:

P(H|E') = {P(H) - P(H)*P(E|H)}/{1 - P(E)}

To help understand why this equation takes the form that it does, notice that the left side, P(H|E'), is the value we want to calculate and the right side is put into terms we already know: P(H), P(E|H), and P(E). To put it succinctly:

This equation is the simplest equation that puts the value we want to calculate into terms we already know.

This idea explains almost everything about the choices we make for the terms we put equations into.

That equation is not terribly difficult to derive, but I'll press on to show how to use it and give a quick derivation at the end just for the curious (and for my own future reference).

Now, lets pretend that we are in the desert and get thirsty, look around for an Elephant, and don't see one. What is our new Probability for observing Hydration nearby?

P(H|E') = {0.01 - 0.01*0.4}/{1 - 0.005} = 0.006 = 0.6%

So the our observation that we can't spot an Elephant drops our expectation of seeing a source of Hydration from a 1% prior probability to an 0.6% posterior probability. In other words, the effect on our estimation of the probability for the presence of Hydration is fairly minimal, mostly because we didn't really expect to see an Elephant in the desert anyway given how rare they are in general (more rare than even water).

In my next posting, I'll go one step further and show how to use Bayes's theorem without our having any direct knowledge of how Probable Elephants are found in the desert (no direct knowledge of P(E)). This turns out to be a little more useful in the real world, as I hope to demonstrate soon.

Here's that derivation I was talking about for the curious (notice I begin with a statement of Bayes's theorem with E' substituted for E):

P(H|E') = P(E'|H)*P(H) / P(E')
P(H|E') = {1 - P(E|H)} * P(H) / P(E')
P(H|E') = {P(H) - P(H)*P(E|H)} / {1 - P(E)}

Though not used in the above derivation, of special note are the following identities which allows for an alternative derivation:

P(x ∩ y) = P(x|y)*P(y) = P(y|x)*P(x)
x ∩ y') = P(x) - P(x ∩ y)

Here's the alternative derivation (beginning with the first identity above):

P(H|E') = P(H
∩ E') / P(E')
P(H|E') = {P(H) - P(H
∩ E)} / P(E')
P(H|E') = {P(H) - P(H)*P(E|H)} / P(E')
P(H|E') = {P(H) - P(H)*P(E|H)} / {1 - P(E)}

Friday, November 16, 2007

The Sweet Spot

I need to reevaluate my sanity for starting a blog entry at 3:00 am. But I have been promising a blog on my data mining of CTS tactician stats. I've done a lot of calculations, but tonight I'll focus on my most provocative findings. In the following graph, rating is plotted against accuracy for two groups of tacticians: those who have done between 1000 and 10,000 problems (blue circles) and the those who have done over 10,000 problems (red circles).

The idea behind having two groups differing in number of problems attempted is that the red group is representative of tacticians who have done a lot of problems and the blue group is representative of tacticians who have done only a few. So, the way I interpret this plot is that solving at just above 80% accuracy gives one the greatest score improvement over time. Interestingly, this interpretation is consistent with observations made by wormwood on the Message Board several months ago. His observations led me to experiment with accuracy, though I have taken it somewhat to an extreme--working for 98% lately, but achieving just over 96% in actuality.

I should note a couple of caveats about this graph. First, the two red points on the extremes (97.5% bin and 52.5% bin) should be more-or-less ignored because too few tacticians fall into these bins to give reliable statistics. Second, the error bars do not necessarily represent uncertainty about the actual rating of the group. These "errors" arise from the natural score distribution of the groups and their magnitudes are related to the robustness of the statistics and not related to their uncertainty. I have yet to perform proper error analysis on these numbers, but my feeling is that the numbers are robust.

One conclusion from the above graph may be that solving problems at greater than 90% accuracy results in little progress, given the information in the 92.5% bin. Its difficult at this point for me to argue with that conclusion. And anyone who has taken a gander at the message board lately will know I am eating my words right now.

But, despite the obvious conclusions one might make from the above graph, my belief is still quite the contrary. I believe that tacticians solving at greater than 90% accuracy are getting much more out of their training than tacticians solving at less than 90% accuracy. Difficult (and perhaps embarrassing) for me, however, is that this benefit is not revealed by analysis of rating alone as I have done thus far. So I am still short of proof on my theory.

Sunday, November 11, 2007

The Glicko System

By now, it would probably be of little surprise to anyone who has been keeping up with this blog that a key point of discussion has been the interplay of accuracy and Glicko rating on the CTS. I introduced this topic a few postings ago. Since then, I have spent some time carefully studying Professor Glickman's original publication on his rating system, and, though I have yet to complete my analysis of his paper, I am coming to the conclusion that the Glicko system is ill-suited to rate tacticians by the time they spend to solve problems. Several reasons exist, but the most fundamental is that the Glicko system rates players based on three possible outcomes: win, loss, and draw. The CTS attempts to approximate these outcomes by using a time component to create fractional wins and losses, just as the Glicko system estimates a draw as half a win and half a loss. At first glance, this time-based approximation seems reasonable. However, at times > 30 sec, the CTS method does not distinguish between correctly solving a problem and failing it outright. But, as anyone who has attempted greater than 70% accuracy on CTS can attest, correctly solving a problem after 30 seconds and failing it outright are two fundamentally different phenomena. To accurately rate players solving problems, a rating system should therefore treat these two phenomena differently.

Bear in mind, however, that the Glicko system is suitable for what it was designed, namely to rate players who compete in head-to-head competitions, like chess tournaments.

Strangely enough, though, the Glicko system seems also to satisfactorily rate the problems. (I will discuss why in the near future.) A cursory inspection of the rating distributions of the problems and the tacticians reveals that the problems have a nearly Gaussian (normal) distribution across all rating categories while the rating distribution of the tacticians shows significant skew towards the lower ratings. Thus, the shape of the rating distribution peaks suggests that the CTS method rates problems accurately but not tacticians. While I have not confirmed the source of this skew yet, my theory right now is that the excess of tacticians below 1500 arises from a natural tendency in tacticians to work with accuracy in mind, lowering their ratings in general and skewing the distribution.

Saturday, November 10, 2007

On Perfection

Prelude: Focus on the Tactics

I blogged Chess Vortex for over a week pretty intensely and completely ran myself down, getting far too little sleep for my own good. I did learn a lot from the experience and gleaned excellent comments from fellow tacticians. However, I spent about five days traveling and experienced a mental crash of sorts, manifesting itself most noticeably on 11/6/07 as a 42p/5f day (89.36%). It has taken me a few days to get back on track. I have focused on sleep primarily and on re-training my thought processes for careful problem solving. Tonight I reaped the rewards--after over 100 sessions training for accuracy and almost 15,000 problems, I finally had a perfect 100 problem session:

100% @ 1394 ± 104 ; 1371

Of course I hemorrhaged a few points at the end of the session, but the last 50 problems of any session are always more difficult than the first 50, mostly because of fatigue. Moreover, tonight I was dropping points at the expense of perfection because I felt like I could get that 100p/0f.

So what's the moral of this story? Focus on solving chess problems first! Killing one's self blogging does not get chess problems solved correctly, so my new resolution is to blog only when it will not tire me for chess problem solving.

More Fun with Binomials

Because of my perfect session, I'm going to have a little fun using VassarStats to get a rough idea of what tonight's performance means in terms of my accuracy as a 1394 ± 104 Tactician:

If I am a...Probability of 100 Correct
99.5% Tactician0.606
99% Tactician0.366
98% Tactician0.133
96% Tactician0.017

So this table means that there is over a 98% chance that I am at least a 96% Tactician for problems rated 1394 ± 104. I guess I can live with that.


Given that my focus on sleep has helped my tactical vision, it should come as no surprise that I saw a lot deeper into my problems today than I usually do. This was an intensely rewarding experience and, in my opinion, is reason enough to focus exclusively on accuracy. A lot of my point loss tonight actually came from my looking at alternative lines. Mostly, I saw the main line immediately, which is not entirely remarkable given that my problem set tonight was rated less than 1400 on average. However, most of the pleasure was in the variations--and tonight the problem of the day has its beauty buried in the notes:

Chess Tactics Server Problem of the Day

Black to Move

Here's the solution and why I like it (start selecting text following the colon): 2...Bxg2. Now 3.Kxg2? leads to a problem-like mating net that can be difficult to see: 3...Qh2+ 4. Kf3 Qh3++.

Wednesday, October 31, 2007

Tactics Training as Probabilities

Tonight, I want to discuss a probabilistic perspective on the problem solving session. For this analysis, I'm going to model a tactical session as a binomial sampling phenomenon.

First, what is a binomial sampling? This question is best answered using the real-world example of a series of coin tosses. Were we to toss a coin several times and record the result at each toss, we would have a binomial sampling of heads and tails--its that simple. Tactical sessions follow the same model because each problem attempted by the tactician has one of two outcomes: pass or fail. Fortunately, the mathematics behind the probabilities of a binomial distribution are well established, so I won't review them here. But I will calculate probabilities using the binomial probabilities calculator available at VassarStats.

The following table shows the Probability that a Tactician with given a accuracy will solve at least a given number of problems Correct in a given number of Tries.
Tactician With
98%10098 (98%)0.67669
98%10099 (99%)0.40327
99%10098 (98%)0.92063
99%10099 (99%)0.73576
98%200196 (98%)0.62884
98%200198 (99%)0.23515
99%200196 (98%)0.94825
99%200198 (99%)0.67668
98%1000990 (99%)0.01023
99%1000990 (99%)0.58304

Lets first consider the likelihood for a 98% tactician to "accidentally" get 99 problems out of 100 on some arbitrary set of 100 problems he might encounter in a normal training session: 40% (0.40327). Now, consider the likelihood that this 98% tactician will get 198 right out of 200: 23.5%. So the 98% tactician will look like a 99% tactician on over 40% of his stretches of 100 problems and more than 23% of his stretches of 200 problems!

Lets compare a 198/200 performance to our expectations for a true 99% tactician: 67.7%. So a true 99% tactician will look like a 99% tactician on more than 2/3 of his stretches of 200 problems.

So what do all of these numbers mean? In short, these probabilities say that 200 problems are insufficient to confidently distinguish a 99% tactician from a 98% tactician. Only at very high numbers of problems can these two tacticians be distinguished with good certainty. In the example above, a 98% tactician has only about a 1% chance to get at least 990 of 1000 problems correct while a 99% tactician has significantly better than an even chance (58.3%).

So, what is the bottom line of this analysis when thinking of our own accuracy statistics? Well, one must do a lot of problems of at least a given accuracy before he can be certain that he is actually a tactician of that accuracy!

With this in mind, I present my performance today at the CTS. I wish I would have warmed up--three misses before number 30.

97% @ 1395 ± 90 ; 1383 final

Tuesday, October 30, 2007

The CTS Training Parameters

Yesterday, I finished discussing the topic of Accuracy Versus Strength by alluding to training parameters. Today I will begin to identify these parameters and hopefully unambiguously define them. To introduce the importance of this topic, I borrow from waaek's comment to yesterday's post:
...people use CTS in different ways and for their own purposes. I personally use CTS as [a] source of tactical training puzzles to train my analysis ability. My goal is to improve the accuracy, and over time, speed, of my analysis. Accuracy at this point is by far the #1 concern for me.
The notion of "using" CTS covers not only one's goals but also one's approach to training. But how are we to know the efficacy of our training methods? I propose that the first step is to identify quantitatively the components of those training methods and to give them concrete definitions.

A tactician's rating, or more accurately his Glicko rating, is a bottom-line measurement of his strength as a tactician. Several assumptions are used by the Glicko rating system but the most critical (and perhaps most fallacious) assumption is that a tactician will use rating and rating alone as his sole indicator of progress. As discussed yesterday and as evidenced by comments like waaek's, this assumption is categorically wrong most of the time. The truth of the matter is that different tacticians are comfortable with different training methods and these methods directly impact their rating.

Accuracy Rate
Most CTS tacticians are familiar with their accuracy as an aggregate statistic that summarizes their total passes and total fails for all problems solved at CTS. However, because training habits change over time, using the aggregate accuracy of a tactician to measure his tactical strength can be misleading. Instead, I propose the concept of accuracy rate, which measures a tactician's accuracy over a certain number of problems. For example, during my last 500 problems I have missed ("failed") nine problems and solved correctly ("passed") 491, giving me an accuracy 98.2% for these 500 problems. So an accuracy rate must combine both the accuracy and the number of problems solved at that accuracy. Here is the formula I propose to calculate accuracy rate (see below for a mathematical explanation):
-N * log(1 - A)
Here, A is the accuracy of a tactician over number of problems N.

Let's see how this definition behaves when comparing some hypothetical problem runs:

Number SolvedAccuracyAccuracy Rate

So, were one to interpret accuracy rate (rather loosely) as solving power, the accuracy rate suggests that solving 200 problems at 99% accuracy takes much more solving power than solving 100 problems of equivalent difficulty at 98% accuracy.

I am open to alternative proposals for a metric to quantify accuracy rate. Please submit comments if you have any suggestions.

I have several more training parameters to define, but its getting late, so I'll finish with my performance today followed by a mathematical explanation of accuracy rate for the curious.

99% @ 1410 ± 89 ; 1383 final

An Explanation of Accuracy Rate
To combine accuracy and problems solved to create an accuracy rate, I propose borrowing from a common technique used to multiply probabilities, which makes use of this property of logarithms:
log(p1 * p2) = log(p1) + log(p2)
The idea is that, the higher a tacticians accuracy, the less likely that his answers arise from chance. So the combination I propose is
-N * log(1-A)
where A is the accuracy of a tactician over number of problems N. Subtracting A from 1 comes from the fact that a higher accuracy means a lower likelihood of guessing. Taking the negative of the product is a simple way of making the score a positive number because the log of a fraction is negative. Multiplying N and A comes from the following property of logarithms:
log(XY) = Y * log(X)

Monday, October 29, 2007

Introduction to Chess Vortex

In the next few days, I'm going to back up and discuss some of the issues that led me to create the Chess Vortex Blog and corresponding Chess Vortex Project. I will then discuss in general terms how the Chess Vortex Project might approach addressing these issues in a scientific way. And finally, I will discuss the motivation behind making Chess Vortex a community project, what it means for Chess Vortex to be a community project, and how I will attempt to act as an agent of the community to create the "products" unique to Chess Vortex.

My hope is that these Chess Vortex "products", as I am calling them, will be three-fold in nature. First, I imagine a new set of representational tools to help evaluate graphically, or via other media, human interaction with tactical problems. Second, I imagine a set of mathematical tools to analyze behavior and the cognitive process of solving tactical problems. And third, I imagine that Chess Vortex will yield--as its most valuable product--new knowledge that leads to a deeper understanding of these cognitive processes.

So first lets begin with the fundamental issues. Today the issue will be...

Accuracy Versus Rating
Probably the most controversial topic on the CTS Message Board is the nature of tactical strength (aka problem-solving strength). This controversy arises from the observation that one's accuracy (number of problems correct per number attempted) can be sacrificed for Glicko rating and vice-versa. As a result, the Glicko rating assigned by the CTS has a duplicitous nature and, like all things duplicitous, can only be trusted if taken in proper context. The context for the Glicko rating, therefore, is not merely one's Rating Deviation (RD), which is explicitly included as a parameter in the system, but also one's accuracy. The inherent difficulty in assessing the accuracy, however, is that it is not formally part of the Glicko rating system and thus its weighting in one's rating is not easily determined. One goal of the Chess Vortex Project is to determine this weighting using robust mathematical analysis. The idea would be to create a scale that incorporates Glicko rating, RD, and accuracy to dependably compare the tactical strength of any two tacticians. Moreover, and decidedly more importantly, the hope is to be able to compare the improvement of tacticians who train at differing accuracy rates (time-averaged accuracy) to determine the optimal training parameters for developing tactical strength.

I am not finished with this issue by any stretch of the imagination, but I promised myself some sleep tonight (for a change), and so now I present my CTS performance today--somewhat for purposes of vanity, but mostly to show off the Chess Vortex community's first (albeit a work in progress) product--the Session Graph:

27p-1f-72p 99% @ 1405 ± 96 ; 1393 final

And now, before I forget, I happily present the

Chess Tactics Server Problem of the Day

Black to Move

Here's the solution and why I like it (start selecting text following the colon): 1...Nh3+. Now if 2.gxh3, then its mate in three: 2...Qh4+ 3.Ke2 Rg2+ 4.Rf2 Qxf2++.