Implements the Win-or-Learn-Fast Policy Hill Climbing algorithm
  A = WOLFQ(A, STATE, ACTION, REWARD, PARAMS)
  Implements the Win-or-Learn-Fast Policy Hill Climbing (WoLF-PHC) algorithm
  [1]. While the originally presented algorithm assumes that the whole
  monolithic state of the world is used by the agents, this version also
  supports learning on the basis of a single agent's own state (flag
  "usefullstate" on the learning parameters). When this flag is disabled,
  each agent uses only its own state.

  Fields of the agent learning parameters:
    gamma        - the discount factor
    alpha        - the learning rate
    alphadecay   - the decay ratio for the learning rate
    deltaw       - the initial deltaw (for the simple delta decay schedule)
    deltadecay   - the decay ratio for the policy learning rate
    deltaratio   - how much larger the learning rate is when losing than
                   when winning; deltal = deltaratio * deltaw
    lambda       - eligibility trace decay rate
    usefullstate - whether to use the full state in learning

  Any decay ratio may be either a single value, in which case a simple decay
  schedule based on plain multiplication is assumed, or a vector of two
  values [a b], in which case Pk = 1 / (a + k/b), where Pk is the value of
  the parameter at the k-th iteration.

  Supports discrete states and actions, with one action variable per agent.

  References:
  [1] Bowling, M. and Veloso, M. (2002). Multiagent learning using a
      variable learning rate. Artificial Intelligence, 136(2):215-250.

  See also agent_control, wolfq_init
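
  Example (an illustrative sketch only, not part of this function's
  interface; all variable names below are assumptions made for the
  illustration):

    k      = 10;                    % current iteration
    alpha0 = 0.5;                   % initial learning rate
    decay  = [1 100];               % decay ratio given in the [a b] form
    if numel(decay) == 1
        alpha = alpha0 * decay^k;              % plain multiplicative decay
    else
        alpha = 1 / (decay(1) + k/decay(2));   % Pk = 1 / (a + k/b)
    end

    % WoLF step-size choice from [1]: the agent is "winning" when its
    % current policy earns at least as much, under its own Q-values, as
    % its running average policy; it then takes the small step deltaw,
    % otherwise the larger step deltal.
    Q          = [0.2 0.8 0.1 0.4];        % Q(s,:) for the current state
    policy     = [0.25 0.25 0.25 0.25];    % current policy for this state
    avgpolicy  = [0.40 0.20 0.20 0.20];    % running average policy
    deltaw     = 0.01;
    deltaratio = 2;
    deltal     = deltaratio * deltaw;      % learn faster when losing
    if policy * Q' >= avgpolicy * Q'
        delta = deltaw;                    % winning: small, cautious step
    else
        delta = deltal;                    % losing: larger step
    end
    % In the full algorithm [1], the policy at the current state is then
    % moved toward the greedy action (argmax of Q) by at most delta.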