Implements an alternate version of the minimax Q-learning algorithm.
  A = MINIMAXQ2(A, STATE, ACTION, REWARD, PARAMS)
Implements minimax Q-learning in a variant that learns faster than Littman's.
Discards the auxiliary table of state values used by Littman, and instead
computes the value of the next state at every step by solving an LP problem
similar to the one solved for computing the policy at the current step. Note
that this makes each iteration slower to run, but fewer learning iterations
are needed for the same performance. However, the theoretical convergence
guarantees are lost in this version.

Required fields on the agent learning parameters:
  alpha    - the learning rate
  gamma    - the discount factor
  lambda   - the eligibility trace decay rate
  epsilon  - the exploration probability
Required fields on the extra parameters:
  newtrial - only in episodic environments; whether a new trial is beginning

References:
  [1] Littman, M. L. (2001). Friend-or-foe Q-learning in general-sum games.
      In Proceedings of the Eighteenth International Conference on Machine
      Learning (ICML-01), pages 322-328, Williams College, Williamstown,
      Massachusetts, USA.

See also agent_learn, minimaxq2_init
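
The following minimal sketch illustrates the LP step described above. It is
not the toolbox implementation; it assumes the Q-values of a single state are
available as an nA-by-nO matrix over own and opponent actions, and that the
Optimization Toolbox function linprog is available. The name minimax_value
and all variable names are illustrative only.

  function [v, pol] = minimax_value(Q)
  % MINIMAX_VALUE Solve max over pi of min over o of sum_a pi(a)*Q(a,o) as an LP.
  %   Q   - nA-by-nO matrix of Q-values for one state
  %   v   - minimax value of that state
  %   pol - maximizing mixed policy over the agent's own actions
  [nA, nO] = size(Q);
  f     = [zeros(nA, 1); -1];        % variables x = [pi; v]; maximize v
  Aineq = [-Q', ones(nO, 1)];        % for every o: v - pi' * Q(:, o) <= 0
  bineq = zeros(nO, 1);
  Aeq   = [ones(1, nA), 0];          % probabilities sum to one
  beq   = 1;
  lb    = [zeros(nA, 1); -Inf];      % pi >= 0, v unbounded below
  x     = linprog(f, Aineq, bineq, Aeq, beq, lb, []);
  pol   = x(1:nA);
  v     = x(nA + 1);

Under these assumptions, the same routine can supply both the mixed policy
used for action selection in the current state and the value of the next
state in the update Q(s,a,o) <- Q(s,a,o) + alpha*(r + gamma*v(s') - Q(s,a,o)),
which is what makes the separate state-value table unnecessary.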