Computing Reviews
A reinforcement learning algorithm based on policy iteration for average reward: empirical results with yield management and convergence analysis
Gosavi A. Machine Learning 55(1): 5-29, 2004. Type: Article
Date Reviewed: Jul 28 2005

This paper presents an algorithm for solving average-reward Markov and semi-Markov decision problems. The algorithm is asynchronous and model free, and can be combined with a nearest-neighbor approach to handle large state spaces. In addition, a convergence analysis is carried out by means of an ordinary differential equation method.

Classical policy iteration for solving the problem is presented first. The algorithm proposed here, by contrast, requires no uniformization; just as important, it requires no transition probabilities, which are not always available. In reinforcement learning, the optimal policy is determined via a reinforcement mechanism: the decision maker can be viewed as an agent that selects an action in each state, visited either in real time or through a simulator, and the feedback obtained from every selected action is used to update the agent's knowledge base.
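To make that feedback loop concrete, here is a minimal, generic sketch of such an agent in Python. It is not the paper's algorithm, just the agent-environment interaction it builds on; the `env` object with its `reset`/`step` interface and the `select_action` and `update` callbacks are hypothetical stand-ins.

```python
def agent_loop(env, select_action, update, steps=100_000):
    """Generic model-free reinforcement loop: the agent acts in the
    current state, observes a reward and a next state, and uses that
    feedback to update its knowledge base (for example, Q-factors).
    The environment may be a real system or a simulator."""
    state = env.reset()
    for _ in range(steps):
        action = select_action(state)              # e.g., epsilon-greedy
        reward, next_state = env.step(state, action)
        update(state, action, reward, next_state)  # asynchronous update
        state = next_state
```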

The algorithm provided is quite close to classical policy iteration: begin by selecting a policy arbitrarily, and estimate (using simulation) the average reward associated with it. Then perform policy evaluation, followed by policy improvement, in which the new policy is selected on the basis of Q-factors. Before a new evaluation phase can begin, the average reward associated with the new policy must be obtained (again via simulation). The old vector of Q-factors is then renamed the vector of P-factors (P for policy), since the P-factors carry the information about the policy to be evaluated in the next phase, while fresh Q-factors are estimated during that phase.
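The phase structure just described can be sketched in a few lines of Python. This is a hedged reconstruction from the review's description, restricted to the unit-time (MDP) case; the one-step simulator `sim(s, a)`, which returns a reward and a next state, and all parameter values are assumptions, not the paper's implementation.

```python
import random

def qp_learning(sim, states, actions, phases=30, eval_steps=5000,
                alpha=0.1, explore=0.1):
    """Policy-iteration-flavored RL sketch: alternate between
    simulating the current policy to estimate its average reward and
    learning fresh Q-factors for it; the old Q-factors become the
    P-factors that define the policy under evaluation."""
    P = {(s, a): 0.0 for s in states for a in actions}  # P-factors
    for _ in range(phases):
        # Average reward of the policy greedy in P, via simulation.
        rho = estimate_average_reward(sim, P, actions, states)
        # Policy evaluation: learn fresh Q-factors for that policy.
        Q = {(s, a): 0.0 for s in states for a in actions}
        s = random.choice(states)
        for _ in range(eval_steps):
            if random.random() < explore:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: P[(s, b)])
            r, s2 = sim(s, a)
            # The next-state value follows the policy being evaluated
            # (greedy in the P-factors), not the greedy Q action.
            a2 = max(actions, key=lambda b: P[(s2, b)])
            Q[(s, a)] += alpha * (r - rho + Q[(s2, a2)] - Q[(s, a)])
            s = s2
        # Policy improvement: the Q-factors become the new P-factors.
        P = Q
    return {s: max(actions, key=lambda b: P[(s, b)]) for s in states}

def estimate_average_reward(sim, P, actions, states, steps=5000):
    """Run the greedy-in-P policy and average the observed rewards."""
    s, total = random.choice(states), 0.0
    for _ in range(steps):
        a = max(actions, key=lambda b: P[(s, b)])
        r, s = sim(s, a)
        total += r
    return total / steps
```

Note how each pass freezes the old Q-factors as P-factors, mirroring the renaming described above, so that evaluation and improvement never overwrite each other's estimates.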

A case study of an airline's seat-selling (and overbooking) policy is presented as an application. The seat allocation problem has its roots in the airlines' practice of selling seats in the same cabin of a flight at different prices. Customers differ, and airlines exploit these differences: a passenger who wants fewer stopovers, or who arrives late in the booking process, is made to pay a higher fare. Classifying passengers by their needs and circumstances in this way yields distinct fare classes. The algorithm is shown to produce good results compared with the heuristic used in the industry.
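For illustration, a toy single-leg booking simulator of the kind such a study might use is sketched below. All numbers (fares, the overbooking pad, the denied-boarding cost, the show-up probability) are invented, and `protect_20` is a stand-in for an industry-style protection-level heuristic, not the one benchmarked in the paper.

```python
import random

# Toy single-leg booking simulator: the state is the number of seats
# sold per fare class, and the action accepts or denies each request.
CAPACITY = 100
OVERBOOK_LIMIT = 110           # assumed overbooking pad
FARES = {"full": 400.0, "discount": 150.0}
DENIED_BOARDING_COST = 600.0   # assumed penalty per bumped passenger
SHOW_UP_PROB = 0.9             # assumed chance a ticket holder shows up

def simulate_flight(policy, horizon=300):
    """Roll out one booking horizon under policy(sold, fare_class),
    which returns True to accept a request; return net revenue."""
    sold = {"full": 0, "discount": 0}
    revenue = 0.0
    for _ in range(horizon):
        fare_class = "full" if random.random() < 0.3 else "discount"
        if sum(sold.values()) < OVERBOOK_LIMIT and policy(sold, fare_class):
            sold[fare_class] += 1
            revenue += FARES[fare_class]
    # At departure, passengers bumped beyond capacity incur a penalty.
    shows = sum(1 for c in sold for _ in range(sold[c])
                if random.random() < SHOW_UP_PROB)
    return revenue - DENIED_BOARDING_COST * max(0, shows - CAPACITY)

def protect_20(sold, fare_class):
    """Stand-in heuristic: always accept full fare, and keep the last
    20 bookable seats free of discount sales."""
    if fare_class == "full":
        return True
    return sum(sold.values()) < OVERBOOK_LIMIT - 20

print(simulate_flight(protect_20))  # revenue of one simulated flight
```

A policy learned with the algorithm sketched earlier would simply be plugged in where `protect_20` appears.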

Reviewer: Joaquin Ordieres. Review #: CR131592
Concept Learning (I.2.6)
Knowledge Acquisition (I.2.6)
Other reviews under "Concept Learning":
Parallelism and programming in classifier systems. Forrest S. (ed), Morgan Kaufmann Publishers Inc., San Francisco, CA, 1991. Type: Book (9780273088257). Date: Sep 1 1991
Learning structures of visual patterns from single instances. Suganuma Y. Artificial Intelligence 50(1): 1-36, 1991. Type: Article. Date: Apr 1 1992
Learning simple concepts under simple distributions. Li M. (ed), Vitányi P. SIAM Journal on Computing 20(5): 911-935, 1991. Type: Article. Date: Jul 1 1993
