Learning Robot Operators From Examples
Continuous Conceptual Set Covering: Learning Robot Operators From Examples

Carl Myers Kadie
Knowledge-Based Systems Group, Department of Computer Science & Beckman Institute, University of Illinois, 405 N. Mathews Ave., Urbana, IL 61801
kadie@cs.uiuc.edu

Abstract

Continuous Conceptual Set Covering (CCSC) is an algorithm that uses engineering knowledge to learn operator effects from training examples. The program produces an operator hypothesis that, even in noisy and nondeterministic domains, can make good quantitative predictions. An empirical evaluation in the tray-tilting domain shows that CCSC learns faster than an alternative case-based approach. The best results, however, come from integrating CCSC and the case-based approach.

1. INTRODUCTION

An important open problem in Machine Learning is learning the effects of robot operators from examples. Previous research has provided a partial solution. Case-based approaches, for example, have been effective in some domains [Moore, 1990]. A limitation of the case-based approach is that it is very sensitive to the number of attributes used to describe a case. Machine Learning offers a number of generalization techniques that are relatively insensitive to the number of attributes [Quinlan, 1986; Kadie, 1990]. Most of these techniques, however, work only on discrete concept-learning problems. The CCSC algorithm extends these generalization methods to work on continuous problems. The key extensions involve the application of background knowledge and the automatic determination of an error threshold.

2. PHYSICAL-WORLD OPERATOR-EFFECT LEARNING

Conceptually, the effects of an operator are a function from the state of a system and a set of parameters to a new system state. For example, the tray-tilting domain system is made up of a Puma robot holding a square 11" × 11" tray (figure 1). The tray contains a single round puck. The state of the world is represented by the x and y coordinates of the puck.
The values of the coordinates are continuous and range from about 30.0 to about 240.0.

Figure 1. Experimental Set Up

Initially, the robot knows how to physically execute the tilt_operator. It does not, however, know the effects of the operator. When the tray-tilt operator is executed, the robot tips the tray down 30° from the horizontal in the direction of tilt. The new position of the puck is hard to predict because of uncertainty in the initial conditions (the initial position of the puck and the tilting angle are continuous values subject to measurement error). In addition, the puck's movement can be complex; it can slide, bounce, and even roll (along an edge of the tray).

The performance task in the tray-tilting domain is to take the initial position of the puck, <x_0, y_0>, and the tilt angle tilt, and predict the puck's final position, <x_1, y_1>. The quality of this prediction is measured as the Euclidean distance between the actual final position and the predicted final position.

The input to the learning task is 1) a set of training examples of the form {<<state_i, parms_i>, state_{i+1}>}, where state_{i+1} is the result of applying the operator to state_i with parameters parms_i, and 2) background knowledge, K. The output of the learning task is an (operator) hypothesis, H. The operator hypothesis should be an executable function, for example, a Lisp lambda expression (for an example, look ahead to figure 4).

Background knowledge here is in the form of engineering experts. Each expert is a program, E_j, that makes (sometimes bad) predictions about the puck's movement. Referring to figure 2, the experts used in the tray-tilting experiments are:
• stay-still (position a).
• go-and-stick (position b).
• go-and-slam (position c).
• slider: The puck will slide from position b toward position c. The distance it slides will be a linear function of the distance from position a to position b and cos(φ), where φ is the angle of incidence.

Figure 2. Geometry of Tray Tilting for an Initial Puck Position a and Tilt Angle tilt - If the puck traveled in a straight line it would contact the wall at point b with an angle of incidence of φ. If the puck then slid along the wall, it would reach point c.

The experts may be fixed or they may be the output of other learning programs. For example, slider is the output of a program that takes the training examples as input and then uses multiple-linear regression to find the best linear relation.

Because this is a physical-world domain, no expert's predictions will likely ever be perfect. But because some of the experts are created dynamically, all examples should be well predicted by at least one expert. The heart of the learning task is thus selecting the right expert for a given example.

The operator hypothesis that results from learning can be represented as a decision list of the form:

    If d_1 then apply E_{k_1}
    else if d_2 then apply E_{k_2}
    ...
    else apply E_{k_m}

where the d_j's are decision rules and the E_k's are the experts. Section 3 details CCSC, an algorithm for doing expert selection. But first, here is a review of related work.

Related Work - CCSC differs from most Machine Learning research in that it creates hypotheses that predict continuous values. Continuous value prediction is seldom perfect and never completely wrong. Instead, it is correct to a lesser or greater degree.

CCSC is similar to work in quantitative discovery, especially the ABACUS system [Falkenhainer and Michalski, 1986]. Unlike most quantitative discovery systems, CCSC works with whatever experts (or expert constructors) it is given. CCSC can, for example, work with equations, look-up tables, and Lisp programs. It could even work with ABACUS's equation constructor. CCSC also differs from most quantitative discovery systems in that it automatically adapts to error.

Learning in a tray-tilting domain is described in [Christiansen, et al, 1990; Mason, et al, 1989]. In their version of the problem the domain is discrete. They divide the tray into nine subsquares (like a tic-tac-toe board) and the tilt heading into 24 angles. The goal of their work is to build a robot that learns from experimentation how to improve its planning ability. The inductive learning component of the system is case-based. The output of the learner is a Markov model. Given one of the initial subsquares, an angle, and a final subsquare, the Markov model tells the probability of that move.

The case-based, or exemplar, approach to operator learning is also explored in [Moore, 1990]. Moore's system efficiently uses the same nearest-neighbor metric used in this paper. Like all case-based approaches, however, this approach is sensitive to the number of attributes and has difficulty accepting background knowledge.

The Grasper system demonstrates explanation-based learning in a robot domain [Bennett, 1990]. Grasper is given an approximate domain theory. In contrast with CCSC's more empirical approach, Grasper uses explanation-based methods to help it tune scalar parameters such as the initial width of a robot's grasper. Grasper requires much more background knowledge than CCSC (it must be given an approximate domain theory), but fewer training examples.

3. LEARNING ALGORITHMS

Three algorithms were created and tested. This section describes each one in turn.

3.1. CASE-BASED LEARNING

The simplest algorithm for operator learning is a case-based, nearest-neighbor approach. In the experiments, case nearness was measured by scaling all values to the interval [0,1] and then measuring Euclidean distance. In a second set of experiments, the same similarity metric was applied to a set of constructed attributes. Symmetry allowed each case to represent eight cases. Given enough examples, case-based learning will converge to a hypothesis with minimal error.

The disadvantages of case-based learning are twofold. First, the record of past cases is bulky and nearly incomprehensible to humans.
Second, the case-based approach is very sensitive to the number of attributes used to describe the state and the parameters. In other words, as the dimensionality of the input space increases, performance decreases.

3.2. GREEDY SELECTION OF EXPERTS

If engineering experts (in the form of programs) are available, then a simple greedy algorithm can be used to learn. For each training example, <<state_i, parms_i>, state_{i+1}>, find the expert, E_{j_i}, that best covers the example and record the index j_i of that expert. The result is a new set of training examples {<<state_i, parms_i>, j_i>}. These new training examples can be given to a multiple-concept learning system. The result will be a decision function D that, when given a prediction problem <state_new, parms_new>, predicts the index, j_new, of the best expert for that problem. Applying E_{j_new} to <state_new, parms_new> produces state_{new+1}, the predicted result of applying the operator to <state_new, parms_new>.

In the experiments of section 4, the decision function D took the form of a decision list in which each decision rule was produced by a version of the ID3 program [Quinlan, 1986].

This greedy learning method can produce concise and comprehensible hypotheses. Moreover, because ID3 is good on problems with high dimensionality, the hypothesis should be less sensitive to the input-space dimensionality. The problem with this approach is that the decision function D might be more complex than necessary; that is, the best expert on a particular training example may not be the best expert for similar examples. The result of using only the locally best experts may be a more complex, less accurate decision function.

3.3. CCSC: CONCEPTUAL SELECTION OF EXPERTS

In general, a learning program should be willing to trade fit with the training examples for greater hypothesis simplicity. The Conceptual Set Covering (CSC) algorithm [Kadie, 1990] can make this trade-off, but only in discrete, deterministic, errorless domains.
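For comparison with the conceptual selection developed in this section, the greedy labeling step of section 3.2 can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: Python is used only for brevity, the names `prediction_error` and `greedy_labels` are hypothetical, the two stand-in experts are invented for the demonstration, and the multiple-concept learner (ID3 in the experiments) that would consume the resulting labels is omitted.

```python
import math
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]            # puck position <x, y>
Example = Tuple[State, float, State]   # <<state_i, parms_i>, state_{i+1}>, parms_i = tilt angle
Expert = Callable[[State, float], State]


def prediction_error(predicted: State, actual: State) -> float:
    """Euclidean distance between predicted and actual final positions."""
    return math.dist(predicted, actual)


def greedy_labels(examples: Sequence[Example], experts: Sequence[Expert]) -> List[int]:
    """For each example, return the index j_i of the expert with the smallest error."""
    labels = []
    for state, tilt, next_state in examples:
        errors = [prediction_error(expert(state, tilt), next_state) for expert in experts]
        labels.append(min(range(len(experts)), key=errors.__getitem__))
    return labels


# Hypothetical stand-in experts (not the paper's experts).
def stay_still(state: State, tilt: float) -> State:
    return state


def go_to_origin(state: State, tilt: float) -> State:
    return (0.0, 0.0)


examples = [
    ((50.0, 60.0), 90.0, (50.0, 61.0)),   # puck barely moved
    ((50.0, 60.0), 180.0, (1.0, 0.0)),    # puck ended near the origin
]
labels = greedy_labels(examples, [stay_still, go_to_origin])
```

Each label depends only on the errors for its own example, which is why the relabeled training set can demand a more complex decision function than necessary.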
For each example, CSC chooses one expert from the set of experts that cover that example. It tries to make this choice so that the syntactic complexity of the final decision function is minimized. The result is a decision list D that, when given a new problem <state_new, parms_new>, predicts which expert will cover that problem. Applying the expert to the problem produces a predicted result.

In continuous, nondeterministic domains, experts are unlikely to exactly cover an example, so the notion of coverage must be relaxed. The simplest approach is to define cover in terms of an error cutoff. Specifically, an expert E is said to cover a training example <<state_i, parms_i>, state_{i+1}> if

    | E(state_i, parms_i) - state_{i+1} | < cutoff

Because an acceptable error cutoff for one domain will not necessarily be acceptable in another, CCSC determines the error cutoff automatically. The idea of the cutoff is to separate acceptable error from unacceptable error. Toward this goal, CCSC makes two assumptions. First, it assumes that the error of the best expert on a particular example is usually acceptable. Call the error of the best expert on example i best_i. The distribution of all the best_i's can be plotted as in figure 3. Second, it assumes that the error of the other experts on a particular example is usually unacceptable. Call the errors of the other experts on example i {other_i}. The distribution of all the other_i's can also be plotted. CCSC sets the cutoff to the value that is at the pth percentile of the best distribution and at the (100%-p)th percentile of the other distribution. If, on a particular example, no expert meets the cutoff, the best expert is accepted.

Figure 3.

CCSC generally produces hypotheses that are more concise and comprehensible than those produced by the greedy method. Figure 4 shows an operator hypothesis produced by CCSC. CCSC is, however, only as good as its experts.
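The automatic cutoff selection and the relaxed notion of cover described above can be sketched as follows. The paper does not spell out how the value satisfying both percentile conditions is found, so this sketch assumes one plausible reading: being at the pth percentile of the best errors and the (100%-p)th percentile of the other errors means the two cumulative fractions sum to one, and the smallest observed error at which that sum is reached is returned. The function names `choose_cutoff` and `cover_sets` and the toy error table are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence, Set


def choose_cutoff(errors: Sequence[Sequence[float]]) -> float:
    """errors[i][j] is the prediction error of expert j on example i.

    Returns the smallest observed error c at which
    (fraction of best errors <= c) + (fraction of other errors <= c) >= 1.
    This is an assumed empirical reading of the percentile rule.
    """
    best = sorted(min(row) for row in errors)
    # Every error except one copy of the per-example best counts as "other".
    other = sorted(err for row in errors for err in sorted(row)[1:])
    if not other:
        raise ValueError("need at least one example and two experts")
    for c in sorted(best + other):
        frac_best = bisect_right(best, c) / len(best)
        frac_other = bisect_right(other, c) / len(other)
        if frac_best + frac_other >= 1.0:
            return c
    raise AssertionError("unreachable: both fractions equal 1 at the largest error")


def cover_sets(errors: Sequence[Sequence[float]], cutoff: float) -> List[Set[int]]:
    """For each example, the indices of experts whose error is below the cutoff.

    If no expert meets the cutoff, the best expert is accepted, as in the text.
    """
    covers = []
    for row in errors:
        covering = {j for j, err in enumerate(row) if err < cutoff}
        if not covering:
            covering = {min(range(len(row)), key=row.__getitem__)}
        covers.append(covering)
    return covers


# Toy error table: three examples, three experts.
errors = [
    [1.0, 20.0, 30.0],
    [2.0, 3.0, 40.0],
    [25.0, 4.0, 50.0],
]
cutoff = choose_cutoff(errors)        # 4.0 for this toy data
covers = cover_sets(errors, cutoff)   # [{0}, {0, 1}, {1}]
```

The second example is covered by two experts, so a set-covering step is free to pick whichever one yields the simpler decision list. The third example falls back to its best expert because no error is strictly below the cutoff.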
The error of its hypotheses converges to the minimum error of the experts, not to the overall minimal error. The solution is to integrate CCSC with case-based learning by using a case-based learner as one of CCSC's experts. The next section evaluates these algorithms in practice.

    (LAMBDA (X Y TILT DIST1 INCID DIST2 N1 N2)
      (COND
        ((OR (AND (>= DIST1 176.13756) (< DIST1 213.34853) (< INCID 13.5))
             (AND (< DIST1 213.34853) (< DIST2 91.29100) (>= INCID 13.5) (< INCID 45.00000))
             (AND (>= INCID 45.0000) (< DIST1 37.96659) (< N2 25))
             (AND (< DIST1 213.34853) (>= INCID 45.00000) (>= N2 25)))
         (GO-AND-STICK X Y TILT DIST1 INCID DIST2 N1 N2))
        ((OR (AND (< DIST1 7.66117) (< INCID 36.5))
             (AND (>= DIST1 7.66117) (< INCID 63.5)))
         (GO-AND-SLAM X Y TILT DIST1 INCID DIST2 N1 N2))
        (T (SLIDER X Y TILT DIST1 INCID DIST2 N1 N2))))

Figure 4. An Operator Hypothesis Produced by CCSC - This hypothesis has a mean error of less than 7.0.

4. EVALUATION

Three series of experiments were used to test the algorithms.

4.1. RAW DATA EXPERIMENTS

In the raw data experiments, three operator learners were tested: the case-based learner, the simple CCSC learner (CCSC with the simple experts of section 2), and the combined learner (CCSC using the simple experts as well as the case-based learner as an expert). To help measure each learner's sensitivity to the dimensionality of the input space, between zero and three additional input attributes were added. The value of each of these attributes was chosen randomly according to the uniform distribution over the range 0 to 100. When the total number of attributes was three, all the learners did about the same. But as the number of attributes was increased to six, the CCSC learners needed fewer examples to produce hypotheses that made better predictions (figure 5).

Figure 5. Error Curves for Learning on the Raw Data

4.2. TRANSFORMED DATA EXPERIMENTS

The second series of experiments tested the same algorithms on transformed data.
The transformation allowed all the learners to exploit the symmetry of the tray problem and made it easier to measure the learners' convergent behavior. Referring back to figure 2, the attributes of the transformed data are 1) the distance from a to b, 2) the incident angle φ, and 3) the distance from b to c.

Figure 6. Error Curves for Learning on the Transformed Data (The x- and y-axes differ.)

When the number of attributes was three and the number of examples was small, all the learners achieved about the same accuracy. As predicted, as the number of examples increased, case-based learning and CCSC with case-based learning did best (figure 6a). As the number of attributes was increased to six, the CCSC learners showed much faster learning (figure 6b).

4.3. GREEDY EXPERT SELECTION VERSUS CONCEPTUAL EXPERT SELECTION

Section 3.3 contained a prediction that CCSC's conceptual expert selection method would produce hypotheses that were more accurate than those produced by greedy expert selection. This prediction was tested with repeated runs of CCSC and greedy expert selection on the raw tray data with no extra attributes. CCSC often performed significantly better than the greedy algorithm. On average, CCSC's error rate was 10% lower than the greedy method's error rate.

5. CONCLUSION

This paper described the problem of learning the effects of operators from examples. The operators may have parameters (for example, a parameter that specifies the direction of the tilt). They may also be noisy and nondeterministic. The input to the learning system also includes a set of experts (in the form of programs), some of which may be created automatically. The learning program tries to learn which expert to apply to any particular problem.

Several learning algorithms were considered. The best algorithm was a hybrid in which CCSC used the case-based algorithm as one of its experts.
This algorithm learned significantly more quickly than the case-based learner and, unlike the first version of CCSC, converged toward the minimal error.

Work in progress addresses three limitations of CCSC's current implementation. First, the operator hypotheses produced by CCSC should do more than make a prediction; they should also estimate the error of the prediction. Second, CCSC should be evaluated on more problems, including synthetic problems generated from mathematical models. Work on such models has begun. Third, some experts should be constructed from primitives such as translate_point. The beginnings of such a system for discrete, errorless domains are described in [Kadie, 1988].

Despite these limitations, CCSC offers immediate benefits to those who wish to learn operator effects. It shows how background knowledge can be used to improve this type of inductive learning. It is especially useful when the dimensionality of the input space is high.

Acknowledgments

Support was provided by the Fannie and John Hertz Foundation and ONR grant N00014-88-K124. Thanks to Alan Christiansen of the Carnegie Mellon University School of Computer Science for providing the tilting-tray data.

References

[Bennett, 1990] Scott W. Bennett. Reducing real-world failures of approximate explanation-based rules. In Proceedings of the Seventh International Conference on Machine Learning, pages 226-234, Morgan Kaufmann Publishers, June 1990.

[Christiansen, et al, 1990] Alan D. Christiansen, Matthew T. Mason, and Tom M. Mitchell. Learning reliable manipulation strategies without initial physical models. In IEEE International Conference on Robotics and Automation, Cincinnati, May 1990.

[Falkenhainer and Michalski, 1986] Brian Falkenhainer and Ryszard S. Michalski. Integrating qualitative and quantitative discovery: the Abacus system. Machine Learning, 1(4), 1986.

[Kadie, 1988] Carl M. Kadie. Diffy-S: learning robot operator schemata from examples. In Proceedings of the Fifth International Conference on Machine Learning, pages 430-436, Morgan Kaufmann Publishers, June 1988.

[Kadie, 1990] Carl M. Kadie. Conceptual set covering: improving fit-and-split algorithms. In Proceedings of the Seventh International Conference on Machine Learning, pages 40-48, Morgan Kaufmann Publishers, June 1990.

[Mason, et al, 1989] M. T. Mason, A. D. Christiansen, and T. M. Mitchell. Experiments in robot learning. In Proceedings of the Sixth International Workshop on Machine Learning, Ithaca, NY, June 1989.

[Moore, 1990] Andrew W. Moore. Acquisition of dynamic control knowledge for a robot manipulator. In Proceedings of the Seventh International Conference on Machine Learning, pages 244-252, Morgan Kaufmann Publishers, June 1990.

[Quinlan, 1986] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1), 1986.