CartPole is a simple problem in the classical control theory where you try to balance a pole by moving it horizontally. Reinforcement learning method like Deep Q Network introduced by DeepMind is a great way to solve task like this.
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every time step that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
The environment provides 4 observed states: horizontal location, horizontal speed, angle, and angular speed. The agent should takes in these 4 variables and output either left or right.
Deep Q Network
def q_network(input_state): h1 = tf.layers.dense(input_state, 10, tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=0.1)) h2 = tf.layers.dense(h1, 10, tf.nn.relu, kernel_initializer=tf.truncated_normal_initializer(stddev=0.1)) output = tf.layers.dense(h2, 2, kernel_initializer=tf.truncated_normal_initializer(stddev=0.1)) return output with tf.variable_scope('state_input'): state_in = tf.placeholder(tf.float32, [None, 4]) with tf.variable_scope('eval_weights'): q_value = q_network(state_in) with tf.variable_scope('state_next_input'): state_next_in = tf.placeholder(tf.float32, [None, 4]) with tf.variable_scope('target_weights'): q_target = q_network(state_next_in) weights = tf.get_collection(tf.GraphKeys.VARIABLES, 'eval_weights') target_weights = tf.get_collection(tf.GraphKeys.VARIABLES, 'target_weights') update_target_weight = [tf.assign(wt, w) for wt,w in zip(target_weights, weights)] with tf.variable_scope('q_target_input'): q_target_in = tf.placeholder(tf.float32, [None, 2]) with tf.variable_scope('loss'): loss = tf.nn.l2_loss(q_target_in - q_value) with tf.variable_scope('train'): train = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss) tf_steps = tf.placeholder(tf.int32) loss_smr = tf.summary.scalar('l2_loss', loss) step_smr = tf.summary.scalar('steps', tf_steps)
Here we have two copy of Q network with weights named eval_weights and target_weights. The eval network is used to calculate action for every observation. The target network is trained with gradient descent when we receive a new observation. The Q network is predicting the reward for each action for the current observation. Loss function is simply the square difference of observed reward and the predicted reward. Training eval network directly is very unstable, transferring weight from the target network to the eval network periodically solves the stability problem (see paper from deepmind for detail).
We see that the network reached the goal of 200 steps per episode pretty quickly.
CartPole on OpenAI: https://gym.openai.com/envs/CartPole-v0
Deep Q Network: https://deepmind.com/research/dqn/