DQN Agent

class dqn_agent.DqnAgent(state_space, action_space, gamma, lr, verbose, checkpoint_location, model_location, persist_progress_option, mode, epsilon)

DQN agent with a production policy and a random benchmark policy
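
For illustration, a minimal construction sketch, assuming a CartPole-v0 environment; whether state_space and action_space expect the sizes of the observation and action spaces is an assumption, and the remaining values mirror the defaults in the Configuration section:

    import gym
    import config
    from dqn_agent import DqnAgent

    env = gym.make(config.DEFAULT_ENV_NAME)

    # Whether state_space/action_space take sizes or gym space objects is an
    # assumption; the other arguments mirror the documented defaults.
    agent = DqnAgent(
        state_space=env.observation_space.shape[0],
        action_space=env.action_space.n,
        gamma=config.DEFAULT_GAMMA,
        lr=config.DEFAULT_LEARNING_RATE,
        verbose=config.DEFAULT_VERBOSITY_OPTION,
        checkpoint_location=config.DEFAULT_CHECKPOINT_LOCATION,
        model_location=config.DEFAULT_MODEL_LOCATION,
        persist_progress_option='all',
        mode=config.DEFAULT_MODE,
        epsilon=config.DEFAULT_EPSILON,
    )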

collect_policy(state)

The policy used for collecting data points; it can contain some randomness to encourage exploration.

Returns

action
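
The exact exploration scheme is not spelled out here; a plausible epsilon-greedy sketch in terms of the documented policy and random_policy methods and the epsilon constructor argument (an illustration, not the actual implementation):

    import random

    def collect_policy_sketch(agent, state, epsilon):
        # Hypothetical illustration: with probability epsilon take a random
        # action, otherwise fall back to the greedy production policy.
        if random.random() < epsilon:
            return agent.random_policy(state)
        return agent.policy(state)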

load_checkpoint()

Loads a training checkpoint into the underlying model

Returns

None

load_model()

Loads a previously saved model

Returns

None

policy(state)

Outputs an action based on the model

Parameters

state – current state

Returns

action

random_policy(state)

Outputs a random action

Parameters

state – current state

Returns

action

save_checkpoint()

Saves a training checkpoint

Returns

None

save_model()

Saves the model to the file system

Returns

None

train(state_batch, next_state_batch, action_batch, reward_batch, done_batch, batch_size)

Trains the model on a batch

Parameters
  • state_batch – batch of states

  • next_state_batch – batch of next states

  • action_batch – batch of actions

  • reward_batch – batch of rewards

  • done_batch – batch of done status

  • batch_size – the size of the batch

Returns

loss history

update_target_network()

Updates the target Q network with the parameters from the currently trained Q network.

Returns

None
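
A sketch of how train and update_target_network might be combined into one training step, assuming the replay buffer from the next section and that sample_batch returns the five batches in the order shown (both are assumptions):

    import config

    def training_step(agent, buffer, step, batch_size=config.DEFAULT_BATCH_SIZE):
        # The unpacking order of sample_batch and the use of `step` for the
        # target-network update schedule are assumptions, not documented behaviour.
        if not buffer.can_sample_batch(batch_size):
            return None
        (state_batch, next_state_batch, action_batch,
         reward_batch, done_batch) = buffer.sample_batch(batch_size)
        loss = agent.train(state_batch, next_state_batch, action_batch,
                           reward_batch, done_batch, batch_size)
        if step % config.DEFAULT_TARGET_NETWORK_UPDATE_FREQUENCY == 0:
            agent.update_target_network()
        return loss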

Replay Buffer

class replay_buffer.DqnReplayBuffer(max_size)

DQN replay buffer that keeps track of gameplay records

can_sample_batch(batch_size)

Returns whether a batch of the given size can be sampled

Parameters

batch_size – the size of the batch to be sampled

Returns

(bool) whether a batch can be sampled

get_volume()

Gets the current number of records

Returns

(int) the number of records

record(state, reward, next_state, action, done)

Puts a gameplay transition into the records

Parameters
  • state – current game state

  • reward – reward after taking action

  • next_state – state after taking action

  • action – action taken

  • done – if the episode is finished

Returns

None

sample_batch(batch_size)

Samples a batch from the records

Parameters

batch_size – the size of the batch to be sampled

Returns

sample batch
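
A usage sketch, assuming the classic Gym step API that returns (observation, reward, done, info):

    import gym
    import config
    from replay_buffer import DqnReplayBuffer

    env = gym.make(config.DEFAULT_ENV_NAME)
    buffer = DqnReplayBuffer(max_size=config.DEFAULT_MAX_REPLAY_HISTORY)

    # Record a single random transition.
    state = env.reset()
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    buffer.record(state=state, reward=reward, next_state=next_state,
                  action=action, done=done)

    print(buffer.get_volume())                                 # 1
    print(buffer.can_sample_batch(config.DEFAULT_BATCH_SIZE))  # False until enough records exist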

Visualizer

Training progress visualizer

class visualizer.DummyTrainingVisualizer

Used when no logging is required

get_ui_feedback()

A dummy implementation that does nothing

Returns

None

log_loss(loss)

A dummy logger that does nothing

Parameters

loss – a list of loss history

Returns

None

log_reward(reward)

A dummy logger that does nothing

Parameters

reward – a list of reward history

Returns

None

class visualizer.StreamlitTrainingVisualizer

Used when running with Streamlit

get_ui_feedback()

Gets the user-defined config from the UI

Returns

config

log_loss(loss)

Adds a loss history to the chart

Parameters

loss – a list of loss history

Returns

None

log_reward(reward)

Adds a reward history to the chart

Parameters

reward – a list of reward history

Returns

None

class visualizer.TrainingVisualizer

Base training visualizer

abstract get_ui_feedback()

Gets the configuration from the UI

Returns

None

abstract log_loss(loss)

Logs a loss history to the desired visualization

Parameters

loss – a list of loss history

Returns

None

abstract log_reward(reward)

Logs a reward history to the desired visualization

Parameters

reward – a list of reward history

Returns

None

visualizer.get_training_visualizer(visualizer_type)

A factory wrapper to generate training progress visualizers.

Parameters

visualizer_type – (str) the type of the visualizer to create

Returns

TrainingVisualizer
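
A usage sketch; 'none' is the documented default visualizer type, and mapping it to the dummy visualizer is an assumption based on the class descriptions above:

    from visualizer import get_training_visualizer

    viz = get_training_visualizer(visualizer_type='none')  # assumed to return the dummy visualizer
    viz.log_loss([0.42, 0.37])     # no-op for the dummy visualizer
    viz.log_reward([12.0, 15.0])   # no-op for the dummy visualizer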

CLI Entrypoint

entrypoint.main()

The CLI entrypoint to the APIs

Returns

None

Configuration

Config

config.DEFAULT_BATCH_SIZE = 128

The default batch size the model should be trained on

config.DEFAULT_CHECKPOINT_LOCATION = './checkpoints'

The default location to store the training checkpoints

config.DEFAULT_ENV_NAME = 'CartPole-v0'

The OpenAI environment name to be used

config.DEFAULT_EPSILON = 0.05

The default value for epsilon

config.DEFAULT_EVAL_EPS = 10

The default number of episodes the model should be evaluated with

config.DEFAULT_GAMMA = 0.95

The default discount rate for Q-learning

config.DEFAULT_LEARNING_RATE = 0.001

The default learning rate

config.DEFAULT_MAX_REPLAY_HISTORY = 1000000

The default max length of the replay buffer

config.DEFAULT_MIN_STEPS = 10

The minimum number of steps the evaluation should run per episode so that the tester can better visualize how the agent is doing.

config.DEFAULT_MODE = 'train'

The default mode the program should run in

config.DEFAULT_MODEL_LOCATION = './model'

The default location to store the best performing models

config.DEFAULT_NUM_ITERATIONS = 50000

The default number of iterations to train the model for

config.DEFAULT_PAUSE_TIME = 0

The default pause time before execution starts, to allow time for screen recording. It is only available in testing mode since pausing is pointless while training.

config.DEFAULT_RENDER_OPTION = 'none'

The default value for rendering option

config.DEFAULT_TARGET_NETWORK_UPDATE_FREQUENCY = 120

How often the target Q network should get parameter updates from the training Q network.

config.DEFAULT_VERBOSITY_OPTION = 'progress'

The default verbosity option

config.DEFAULT_VISUALIZER_TYPE = 'none'

The default visualizer type

config.MODE_OPTIONS = ['train', 'test']

The supported modes

config.RENDER_OPTIONS = ['none', 'collect']

The available render options:

  • none: don’t render anything

  • collect: render the game play while collecting data

config.VERBOSITY_OPTIONS = ['progress', 'loss', 'policy', 'init']

The available verbosity options:

  • progress: show the training progress

  • loss: show the logging information from loss calculation

  • policy: show the logging information from policy generation

  • init: show the logging information from initialization
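
A small sketch of how a caller might validate user-supplied options against these lists before starting a run; the validation itself is an illustration, not part of the library:

    import config

    def validate_options(mode, render_option, verbose):
        # Illustrative checks only; the library's own validation may differ.
        assert mode in config.MODE_OPTIONS, f'unsupported mode: {mode}'
        assert render_option in config.RENDER_OPTIONS, f'unsupported render option: {render_option}'
        assert verbose in config.VERBOSITY_OPTIONS, f'unsupported verbosity: {verbose}'

    validate_options(config.DEFAULT_MODE, config.DEFAULT_RENDER_OPTION,
                     config.DEFAULT_VERBOSITY_OPTION)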

Train

train.train_model(num_iterations=50000, batch_size=128, max_replay_history=1000000, gamma=0.95, eval_eps=10, learning_rate=0.001, target_network_update_frequency=120, checkpoint_location='./checkpoints', model_location='./model', verbose='progress', visualizer_type='none', render_option='none', persist_progress_option='all', epsilon=0.05)

Trains a DQN agent by playing episodes of the CartPole game

Parameters
  • epsilon – the probability that a random action is chosen

  • target_network_update_frequency – how frequently the target Q network gets updated

  • num_iterations – the number of episodes the agent will play

  • batch_size – the training batch size

  • max_replay_history – the limit of the replay buffer length

  • gamma – discount rate

  • eval_eps – the number of episodes per evaluation

  • learning_rate – the learning rate for backpropagation

  • checkpoint_location – the location to save the training checkpoints

  • model_location – the location to save the pre-trained models

  • verbose – the verbosity level, which can be progress, loss, policy, or init

  • visualizer_type – the type of visualization to be used

  • render_option – whether the gameplay should be rendered

  • persist_progress_option – whether the training progress should be saved

Returns

(maximum average reward, baseline average reward)
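
A minimal call sketch; every keyword has a documented default, so a bare train_model() call also works. The overridden values below are arbitrary, chosen only to keep a quick run short:

    from train import train_model

    max_avg_reward, baseline_avg_reward = train_model(
        num_iterations=1000,   # much shorter than the 50000-iteration default
        batch_size=64,
        visualizer_type='none',
        render_option='none',
    )
    print(max_avg_reward, baseline_avg_reward)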

Utilities

Utility functions for collecting episodes and evaluating policies

utils.collect_episode(env, policy, buffer, render_option)

Collects steps from a single episode and records them in the replay buffer

Parameters
  • env – OpenAI gym environment

  • policy – DQN agent policy

  • buffer – reinforcement learning replay buffer

  • render_option – (bool) whether the gameplay should be rendered

Returns

None
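
A usage sketch with a stand-in random policy; in practice the policy argument would typically be a DqnAgent's collect_policy. The stand-in policy and the boolean render flag follow the parameter descriptions above:

    import gym
    import config
    from replay_buffer import DqnReplayBuffer
    from utils import collect_episode

    env = gym.make(config.DEFAULT_ENV_NAME)
    buffer = DqnReplayBuffer(max_size=config.DEFAULT_MAX_REPLAY_HISTORY)

    # Stand-in policy for illustration; agent.collect_policy would be used in practice.
    def baseline_policy(state):
        return env.action_space.sample()

    collect_episode(env=env, policy=baseline_policy, buffer=buffer, render_option=False)
    print(buffer.get_volume())  # number of transitions recorded from the episode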

utils.collect_steps(env, policy, buffer, render_option, current_state, n_steps)

Collects steps from the game environment with the specified policy. It is currently unused in favor of the collect_episode API.

Parameters
  • n_steps – the number of steps to collect

  • current_state – the current state of the environment

  • env – OpenAI gym environment

  • policy – DQN agent policy

  • buffer – reinforcement learning replay buffer

  • render_option – (bool) whether the gameplay should be rendered

Returns

None

utils.compute_avg_reward(env, policy, num_episodes)

Computes the average reward across num_episodes episodes under the given policy

Parameters
  • env – OpenAI gym environment

  • policy – DQN agent policy

  • num_episodes – the number of episodes to average over

Returns

(int) average reward
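
A usage sketch that measures a random baseline; a trained agent's policy method could be evaluated the same way:

    import gym
    import config
    from utils import compute_avg_reward

    env = gym.make(config.DEFAULT_ENV_NAME)

    # Baseline: the average reward of a purely random stand-in policy.
    def baseline_policy(state):
        return env.action_space.sample()

    baseline = compute_avg_reward(env, baseline_policy, num_episodes=config.DEFAULT_EVAL_EPS)
    print(baseline)
    # For a trained agent: compute_avg_reward(env, agent.policy, config.DEFAULT_EVAL_EPS)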

utils.play_episode(env, policy, render_option, min_steps)

Play an episode with the given policy.

Parameters
  • min_steps – the minimum number of steps the game should be played for

  • env – the OpenAI gym environment

  • policy – the policy that should be used to generate actions

  • render_option – how the gameplay should be rendered

Returns

episode reward

utils.play_episodes(env, policy, render_option, num_eps, pause_time, min_steps)

Play episodes with the given policy

Parameters
  • min_steps – the minimum number of steps the game should be played for

  • pause_time – the time to pause before playing, to allow for screen recording

  • env – the OpenAI gym environment

  • policy – the policy that should be used to generate actions

  • render_option – how the gameplay should be rendered

  • num_eps – how many episodes should be played

Returns

average episode reward
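
A usage sketch mirroring the documented defaults; the stand-in policy is for illustration only, and a trained agent's policy method would normally be passed instead:

    import gym
    import config
    from utils import play_episodes

    env = gym.make(config.DEFAULT_ENV_NAME)

    # Stand-in policy for illustration purposes.
    def baseline_policy(state):
        return env.action_space.sample()

    avg_reward = play_episodes(
        env=env,
        policy=baseline_policy,
        render_option=config.DEFAULT_RENDER_OPTION,
        num_eps=config.DEFAULT_EVAL_EPS,
        pause_time=config.DEFAULT_PAUSE_TIME,
        min_steps=config.DEFAULT_MIN_STEPS,
    )
    print(avg_reward)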

Tests

Tests for model training

class train_test.TestTrain(methodName='runTest')

Test suite for model training

test_sanity_check()

Tests if the model training finishes without crashing

Returns

None

test_training_effectiveness()

Tests whether model training can achieve better performance than a random policy

Returns

None