function [Qseq, Rseq] = qlearning(config)
% Template for Q-learning function
% Inputs: assumed here to be given as a configuration structure "config", which
% should contain the following fields:
% gamma = discount factor
% alpha = learning rate
% epsilon = exploration probability
% epsilondecay = exploration probability decay rate (if you implement this feature)
% T = number of trials
% K = maximum number of steps in every trial
% visualize = a boolean value telling the function whether visualization should be enabled
% Outputs: Qseq, Rseq
% Qseq could be an array of size (T, dimx1, dimx2, dimu), where dimxi is the number of
% discrete states of the problem along dimension i and dimu the number of discrete actions
% Rseq could be a vector of length T containing the return obtained in each trial
% initialization:
% create a model, start up the visualization if enabled, etc.
% (see the examples and your solutions to earlier labs for ideas)
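% A minimal sketch of the initialization (an example, not the required
% implementation): the grid sizes dimx1, dimx2, dimu below are placeholder
% values; replace them with your problem's discretization, and add your own
% model-creation and visualization-startup code here.
dimx1 = 11; dimx2 = 11; dimu = 3;           % placeholder discretization sizes
Q = zeros(dimx1, dimx2, dimu);              % Q-function, initialized to zero
Qseq = zeros(config.T, dimx1, dimx2, dimu); % sequence of Q-function snapshots
Rseq = zeros(config.T, 1);                  % return obtained in each trial
epsilon = config.epsilon;                   % current exploration probability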
% run T trials, choosing actions with epsilon-greedy and updating the Q function at each step
% after each step, update the visualization if enabled
% Hints:
% - reuse the code you implemented to simulate a trajectory in lab 1
% - to implement an instruction "with probability epsilon", use:
%   if rand <= epsilon, instruction; end
%   (since rand gives a uniformly distributed number in the open interval (0, 1))
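% A sketch of one epsilon-greedy step with the Q-learning update, assuming the
% current discrete state is indexed by (i1, i2), the next state by (j1, j2),
% r is the observed reward, and Q, epsilon, dimu already exist (all of these
% names are placeholders for your own variables):
if rand <= epsilon
    u = randi(dimu);                        % explore: pick a uniformly random action
else
    [~, u] = max(Q(i1, i2, :));             % exploit: pick a greedy action
end
% ... simulate one step of the model with action u, observe (j1, j2) and r ...
Q(i1, i2, u) = Q(i1, i2, u) + config.alpha * ...
    (r + config.gamma * max(Q(j1, j2, :)) - Q(i1, i2, u));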
% at the end of the trial:
% store the updated Q-function in Qseq
% store the obtained return in Rseq
% decay the exploration probability epsilon
% if visualization is enabled, show the new Q-function and the corresponding greedy policy on the visualization
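% A sketch of the end-of-trial bookkeeping, assuming t is the trial index, R is
% the return accumulated during trial t, and Q and epsilon are the current
% Q-function and exploration probability (placeholder names):
Qseq(t, :, :, :) = Q;                       % snapshot of the updated Q-function
Rseq(t) = R;                                % return of trial t
epsilon = epsilon * config.epsilondecay;    % multiplicative exploration decay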