DQN Agent

class dqn_agent.DqnAgent(state_space, action_space, gamma, lr, verbose, checkpoint_location, model_location, persist_progress_option, mode, epsilon)

DQN agent with a production policy and a random benchmark policy
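
For illustration, a minimal construction sketch, assuming a CartPole-v0 environment; whether state_space and action_space expect the sizes of the observation and action spaces is an assumption, and the remaining values mirror the defaults in the Configuration section:

    import gym
    import config
    from dqn_agent import DqnAgent

    env = gym.make(config.DEFAULT_ENV_NAME)

    # Whether state_space/action_space take sizes or gym space objects is an
    # assumption; the other arguments mirror the documented defaults.
    agent = DqnAgent(
        state_space=env.observation_space.shape[0],
        action_space=env.action_space.n,
        gamma=config.DEFAULT_GAMMA,
        lr=config.DEFAULT_LEARNING_RATE,
        verbose=config.DEFAULT_VERBOSITY_OPTION,
        checkpoint_location=config.DEFAULT_CHECKPOINT_LOCATION,
        model_location=config.DEFAULT_MODEL_LOCATION,
        persist_progress_option='all',
        mode=config.DEFAULT_MODE,
        epsilon=config.DEFAULT_EPSILON,
    )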

collect_policy(state)

The policy used for collecting data points; it can contain some randomness to encourage exploration.

Returns

action
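
The exact exploration scheme is not spelled out here; a plausible epsilon-greedy sketch in terms of the documented policy and random_policy methods and the epsilon constructor argument (an illustration, not the actual implementation):

    import random

    def collect_policy_sketch(agent, state, epsilon):
        # Hypothetical illustration: with probability epsilon take a random
        # action, otherwise fall back to the greedy production policy.
        if random.random() < epsilon:
            return agent.random_policy(state)
        return agent.policy(state)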

load_checkpoint()

Loads a training checkpoint into the underlying model

Returns

None

load_model()

Loads a previously saved model

Returns

None

policy(state)

Outputs an action based on the model

Parameters

state – current state

Returns

action

random_policy(state)

Outputs a random action

Parameters

state – current state

Returns

action

save_checkpoint()

Saves a training checkpoint

Returns

None

save_model()

Saves the model to the file system

Returns

None

train(state_batch, next_state_batch, action_batch, reward_batch, done_batch, batch_size)

Trains the model on a batch

Parameters
  • state_batch – batch of states

  • next_state_batch – batch of next states

  • action_batch – batch of actions

  • reward_batch – batch of rewards

  • done_batch – batch of done status

  • batch_size – the size of the batch

Returns

loss history

update_target_network()

Updates the target Q network with the parameters from the currently trained Q network.

Returns

None
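
A sketch of how train and update_target_network might be combined into one training step, assuming the replay buffer from the next section and that sample_batch returns the five batches in the order shown (both are assumptions):

    import config

    def training_step(agent, buffer, step, batch_size=config.DEFAULT_BATCH_SIZE):
        # The unpacking order of sample_batch and the use of `step` for the
        # target-network update schedule are assumptions, not documented behaviour.
        if not buffer.can_sample_batch(batch_size):
            return None
        (state_batch, next_state_batch, action_batch,
         reward_batch, done_batch) = buffer.sample_batch(batch_size)
        loss = agent.train(state_batch, next_state_batch, action_batch,
                           reward_batch, done_batch, batch_size)
        if step % config.DEFAULT_TARGET_NETWORK_UPDATE_FREQUENCY == 0:
            agent.update_target_network()
        return loss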

Replay Buffer

class replay_buffer.DqnReplayBuffer(max_size)

DQN replay buffer that keeps track of gameplay records

can_sample_batch(batch_size)

Returns whether a batch of the given size can be sampled

Parameters

batch_size – the size of the batch to be sampled

Returns

(bool) whether a batch can be sampled

get_volume()

Gets the current number of records

Returns

(int) the number of records

record(state, reward, next_state, action, done)

Puts a gameplay transition into the records

Parameters
  • state – current game state

  • reward – reward after taking action

  • next_state – state after taking action

  • action – action taken

  • done – if the episode is finished

Returns

None

sample_batch(batch_size)

Samples a batch from the records

Parameters

batch_size – the size of the batch to be sampled

Returns

sample batch
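
A usage sketch, assuming the classic Gym step API that returns (observation, reward, done, info):

    import gym
    import config
    from replay_buffer import DqnReplayBuffer

    env = gym.make(config.DEFAULT_ENV_NAME)
    buffer = DqnReplayBuffer(max_size=config.DEFAULT_MAX_REPLAY_HISTORY)

    # Record a single random transition.
    state = env.reset()
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    buffer.record(state=state, reward=reward, next_state=next_state,
                  action=action, done=done)

    print(buffer.get_volume())                                 # 1
    print(buffer.can_sample_batch(config.DEFAULT_BATCH_SIZE))  # False until enough records exist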

Visualizer

Training progress visualizer

class visualizer.DummyTrainingVisualizer

Used when no logging is required

get_ui_feedback()

A dummy implementation that does nothing

Returns

None

log_loss(loss)

A dummy logger that does nothing

Parameters

loss – a list of loss history

Returns

None

log_reward(reward)

A dummy logger that does nothing

Parameters

reward – a list of reward history

Returns

None

class visualizer.StreamlitTrainingVisualizer

Used when running with Streamlit

get_ui_feedback()

Gets the user-defined config from the UI

Returns

config

log_loss(loss)

Adds a loss history to the chart

Parameters

loss – a list of loss history

Returns

None

log_reward(reward)

Adds a reward history to the chart

Parameters

reward – a list of reward history

Returns

None

class visualizer.TrainingVisualizer

Base training visualizer

abstract get_ui_feedback()

Gets the configuration from the UI

Returns

None

abstract log_loss(loss)

Logs a loss history to the desired visualization

Parameters

loss – a list of loss history

Returns

None

abstract log_reward(reward)

Logs a reward history to the desired visualization

Parameters

reward – a list of reward history

Returns

None

visualizer.get_training_visualizer(visualizer_type)

A factory wrapper to generate training progress visualizers.

Parameters

visualizer_type – (str) the type of the visualizer to create

Returns

TrainingVisualizer
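
A usage sketch; 'none' is the documented default visualizer type, and mapping it to the dummy visualizer is an assumption based on the class descriptions above:

    from visualizer import get_training_visualizer

    viz = get_training_visualizer(visualizer_type='none')  # assumed to return the dummy visualizer
    viz.log_loss([0.42, 0.37])     # no-op for the dummy visualizer
    viz.log_reward([12.0, 15.0])   # no-op for the dummy visualizer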

CLI Entrypoint

entrypoint.main()

The CLI entrypoint to the APIs

Returns

None

Configuration

Config

config.DEFAULT_BATCH_SIZE = 128

The default batch size the model should be trained on

config.DEFAULT_CHECKPOINT_LOCATION = './checkpoints'

The default location to store the training checkpoints

config.DEFAULT_ENV_NAME = 'CartPole-v0'

The OpenAI environment name to be used

config.DEFAULT_EPSILON = 0.05

The default value for epsilon

config.DEFAULT_EVAL_EPS = 10

The default number of episodes the model should be evaluated with

config.DEFAULT_GAMMA = 0.95

The default discount rate for Q-learning

config.DEFAULT_LEARNING_RATE = 0.001

The default learning rate

config.DEFAULT_MAX_REPLAY_HISTORY = 1000000

The default max length of the replay buffer

config.DEFAULT_MIN_STEPS = 10

The minimum number of steps the evaluation should run per episode so that the tester can better visualize how the agent is doing.

config.DEFAULT_MODE = 'train'

The default mode the program should run in

config.DEFAULT_MODEL_LOCATION = './model'

The default location to store the best performing models

config.DEFAULT_NUM_ITERATIONS = 50000

The default number of iterations to train the model for

config.DEFAULT_PAUSE_TIME = 0

The default pause time before execution starts, to allow time for screen recording. It is only available in testing mode since pausing is pointless while training.

config.DEFAULT_RENDER_OPTION = 'none'

The default value for rendering option

config.DEFAULT_TARGET_NETWORK_UPDATE_FREQUENCY = 120

How often the target Q network should get parameter updates from the training Q network.

config.DEFAULT_VERBOSITY_OPTION = 'progress'

The default verbosity option

config.DEFAULT_VISUALIZER_TYPE = 'none'

The default visualizer type

config.MODE_OPTIONS = ['train', 'test']

The supported modes

config.RENDER_OPTIONS = ['none', 'collect']

The available render options:

  • none: don’t render anything

  • collect: render the game play while collecting data

config.VERBOSITY_OPTIONS = ['progress', 'loss', 'policy', 'init']

The available verbosity options:

  • progress: show the training progress

  • loss: show the logging information from loss calculation

  • policy: show the logging information from policy generation

  • init: show the logging information from initialization
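
A small sketch of how a caller might validate user-supplied options against these lists before starting a run; the validation itself is an illustration, not part of the library:

    import config

    def validate_options(mode, render_option, verbose):
        # Illustrative checks only; the library's own validation may differ.
        assert mode in config.MODE_OPTIONS, f'unsupported mode: {mode}'
        assert render_option in config.RENDER_OPTIONS, f'unsupported render option: {render_option}'
        assert verbose in config.VERBOSITY_OPTIONS, f'unsupported verbosity: {verbose}'

    validate_options(config.DEFAULT_MODE, config.DEFAULT_RENDER_OPTION,
                     config.DEFAULT_VERBOSITY_OPTION)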

Train

train.train_model(num_iterations=50000, batch_size=128, max_replay_history=1000000, gamma=0.95, eval_eps=10, learning_rate=0.001, target_network_update_frequency=120, checkpoint_location='./checkpoints', model_location='./model', verbose='progress', visualizer_type='none', render_option='none', persist_progress_option='all', epsilon=0.05)

Trains a DQN agent by playing episodes of the CartPole game

Parameters
  • epsilon – the probability that a random action is chosen

  • target_network_update_frequency – how frequently the target Q network gets updated

  • num_iterations – the number of episodes the agent will play

  • batch_size – the training batch size

  • max_replay_history – the limit of the replay buffer length

  • gamma – discount rate

  • eval_eps – the number of episodes per evaluation

  • learning_rate – the learning rate for backpropagation

  • checkpoint_location – the location to save the training checkpoints

  • model_location – the location to save the pre-trained models

  • verbose – the verbosity level, which can be progress, loss, policy, or init

  • visualizer_type – the type of visualization to be used

  • render_option – whether the gameplay should be rendered

  • persist_progress_option – whether the training progress should be saved

Returns

(maximum average reward, baseline average reward)
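
A minimal call sketch; every keyword has a documented default, so a bare train_model() call also works. The overridden values below are arbitrary, chosen only to keep a quick run short:

    from train import train_model

    max_avg_reward, baseline_avg_reward = train_model(
        num_iterations=1000,   # much shorter than the 50000-iteration default
        batch_size=64,
        visualizer_type='none',
        render_option='none',
    )
    print(max_avg_reward, baseline_avg_reward)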

Utilities

Utility functions for collecting episodes and evaluating policies

utils.collect_episode(env, policy, buffer, render_option)

Collects steps from a single episode and records them in the replay buffer

Parameters
  • env – OpenAI gym environment

  • policy – DQN agent policy

  • buffer – reinforcement learning replay buffer

  • render_option – (bool) whether the gameplay should be rendered

Returns

None
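
A usage sketch with a stand-in random policy; in practice the policy argument would typically be a DqnAgent's collect_policy. The stand-in policy and the boolean render flag follow the parameter descriptions above:

    import gym
    import config
    from replay_buffer import DqnReplayBuffer
    from utils import collect_episode

    env = gym.make(config.DEFAULT_ENV_NAME)
    buffer = DqnReplayBuffer(max_size=config.DEFAULT_MAX_REPLAY_HISTORY)

    # Stand-in policy for illustration; agent.collect_policy would be used in practice.
    def baseline_policy(state):
        return env.action_space.sample()

    collect_episode(env=env, policy=baseline_policy, buffer=buffer, render_option=False)
    print(buffer.get_volume())  # number of transitions recorded from the episode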

utils.collect_steps(env, policy, buffer, render_option, current_state, n_steps)

Collects steps from the game environment with the specified policy. It is currently unused in favor of the collect_episode API.

Parameters
  • n_steps – the number of steps to collect

  • current_state – the current state of the environment

  • env – OpenAI gym environment

  • policy – DQN agent policy

  • buffer – reinforcement learning replay buffer

  • render_option – (bool) whether the gameplay should be rendered

Returns

None

utils.compute_avg_reward(env, policy, num_episodes)

Computes the average reward across num_episodes episodes under the given policy

Parameters
  • env – OpenAI gym environment

  • policy – DQN agent policy

  • num_episodes – the number of episodes to average over

Returns

(int) average reward
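
A usage sketch that measures a random baseline; a trained agent's policy method could be evaluated the same way:

    import gym
    import config
    from utils import compute_avg_reward

    env = gym.make(config.DEFAULT_ENV_NAME)

    # Baseline: the average reward of a purely random stand-in policy.
    def baseline_policy(state):
        return env.action_space.sample()

    baseline = compute_avg_reward(env, baseline_policy, num_episodes=config.DEFAULT_EVAL_EPS)
    print(baseline)
    # For a trained agent: compute_avg_reward(env, agent.policy, config.DEFAULT_EVAL_EPS)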

utils.play_episode(env, policy, render_option, min_steps)

Play an episode with the given policy.

Parameters
  • min_steps – the minimum number of steps the game should be played for

  • env – the OpenAI gym environment

  • policy – the policy that should be used to generate actions

  • render_option – how the gameplay should be rendered

Returns

episode reward

utils.play_episodes(env, policy, render_option, num_eps, pause_time, min_steps)

Play episodes with the given policy

Parameters
  • min_steps – the minimum number of steps the game should be played for

  • pause_time – the time to pause before playing, to allow for screen recording

  • env – the OpenAI gym environment

  • policy – the policy that should be used to generate actions

  • render_option – how the gameplay should be rendered

  • num_eps – how many episodes should be played

Returns

average episode reward
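
A usage sketch mirroring the documented defaults; the stand-in policy is for illustration only, and a trained agent's policy method would normally be passed instead:

    import gym
    import config
    from utils import play_episodes

    env = gym.make(config.DEFAULT_ENV_NAME)

    # Stand-in policy for illustration purposes.
    def baseline_policy(state):
        return env.action_space.sample()

    avg_reward = play_episodes(
        env=env,
        policy=baseline_policy,
        render_option=config.DEFAULT_RENDER_OPTION,
        num_eps=config.DEFAULT_EVAL_EPS,
        pause_time=config.DEFAULT_PAUSE_TIME,
        min_steps=config.DEFAULT_MIN_STEPS,
    )
    print(avg_reward)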

Tests

Tests for model training

class train_test.TestTrain(methodName='runTest')

Test suite for model training

test_sanity_check()

Tests if the model training finishes without crashing

Returns

None

test_training_effectiveness()

Tests whether model training can achieve better performance than a random policy

Returns

None