Following this tutorial, I implemented temporal-difference learning for Connect Four, because the rules of Connect Four are easy to implement. But my neural net is not training properly. The original paper by Tesauro can be found here.
Here's my setup and the problem:
The input is a vector of length 43: the status of the 6×7 board plus the next player to move. The output layer has 3 neurons representing the chance that "the first player will win", "this game will be a draw", and "the second player will win", respectively. After randomly initializing the weights, the net plays a game against itself and then learns from that game.
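In case it helps, here is a minimal sketch of how I set this up (numpy, a single hidden layer; the hidden size, the sigmoid activations, and the encoding details are my own choices and may differ from the tutorial):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def encode(board, next_player):
        # board: 6x7 array with 0 = empty, 1 = first player, -1 = second player
        # next_player: 1 or -1, appended as the 43rd input
        return np.append(board.flatten(), next_player).astype(float)

    class ValueNet:
        def __init__(self, n_in=43, n_hidden=40, n_out=3, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
            self.W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))

        def forward(self, x):
            # cache the hidden activations for the backward pass
            self.h = sigmoid(self.W1 @ x)
            # three sigmoid outputs: [P(first wins), P(draw), P(second wins)]
            return sigmoid(self.W2 @ self.h)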
Following the original learning algorithm, during the game we only reduce the difference between the predictions at consecutive plies, and only at the very end do we reduce the difference between the final prediction and the true outcome of the game. That last step is what points the weights in the right direction. However, after about 100 games I found that the output values look like [0.99, 0.05, 0.99]; in other words, to minimize the difference between consecutive plies, the net simply fixes its output and ignores the final outcome. I thought this was caused by insufficient training, but the output stays stuck at [1, 0, 1] even after thousands of games.
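For reference, this is roughly the update I run after each self-play game, building on the ValueNet sketch above (plain TD(0) with squared error; the learning rate, the one-hot outcome encoding, and the absence of eligibility traces are my assumptions, not necessarily what the tutorial or Tesauro prescribe):

    def train_on_game(net, positions, outcome, lr=0.05):
        """positions: encoded inputs (length-43 vectors) for one self-play game.
        outcome: one-hot target, e.g. [1, 0, 0] if the first player won."""
        outcome = np.asarray(outcome, dtype=float)
        for t in range(len(positions)):
            x = positions[t]
            if t + 1 < len(positions):
                # intermediate step: target is the next prediction V(s_{t+1})
                target = net.forward(positions[t + 1])
            else:
                # last step: target is the true result of the game
                target = outcome
            # re-run the forward pass on s_t so the cached hidden layer matches x
            y = net.forward(x)
            # backpropagate the squared error between prediction and target
            delta_out = (y - target) * y * (1.0 - y)
            delta_hid = (net.W2.T @ delta_out) * net.h * (1.0 - net.h)
            net.W2 -= lr * np.outer(delta_out, net.h)
            net.W1 -= lr * np.outer(delta_hid, x)

One design choice here: I treat the next prediction as a fixed target (no gradient flows through it), which I believe matches the TD formulation in the tutorial, but I may be wrong about that.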
I can think of several possible reasons:
- I am doing it in a totally wrong way
- I chose the wrong game to play (Connect Four)
- I need to tune parameters
- The tutorial is misleading
- I need to increase the penalty of the last step
Or, as Jordan B. Pollack & Alan D. Blair put it in the paper Why did TD-Gammon Work?:
> It (Tesauro's TD-Gammon) has not led to similar impressive breakthroughs in temporal difference learning for other applications or even other games.
Does anybody have experience with this topic? Thanks.