Convert body of chapter to SGML. Was embedded text from original doc.

2024-12-15 08:20:16 +08:00 · 1998-04-04 16:32:01 +00:00 · c452ddcac6
commit c452ddcac6
parent e01d442174
1 changed files with 257 additions and 127 deletions
--- a/doc/src/sgml/geqo.sgml
+++ b/doc/src/sgml/geqo.sgml
@@ -3,78 +3,103 @@
 <Author>
 <FirstName>Martin</FirstName>
 <SurName>Utesch</SurName>
+<Affiliation>
+<Orgname>
+University of Mining and Technology
+</Orgname>
+<Orgdiv>
+Institute of Automatic Control
+</Orgdiv>
+<Address>
+<City>
+Freiberg
+</City>
+<Country>
+Germany
+</Country>
+</Address>
+</Affiliation>
 </Author>
+<Date>1997-10-02</Date>
 </DocInfo>

 <Title>Genetic Query Optimization in Database Systems</Title>

 <Para>
-<ProgramListing>
-<ULink url="utesch@aut.tu-freiberg.de">Martin Utesch</ULink>
+<Note>
+<Title>Author</Title>
+<Para>
+Written by <ULink url="mailto:utesch@aut.tu-freiberg.de">Martin Utesch</ULink>
+for the Institute of Automatic Control at the University of Mining and Technology in Freiberg, Germany.
+</Para>
+</Note>

-          Institute of Automatic Control
-        University of Mining and Technology
-                 Freiberg, Germany
-
-                    02/10/1997
-
-
-1.) Query Handling as a Complex Optimization Problem
-====================================================
+<Sect1>
+<Title>Query Handling as a Complex Optimization Problem</Title>

+<Para>
   Among all relational operators the most difficult one to process and
-optimize is the JOIN. The number of alternative plans to answer a query
-grows exponentially with the number of JOINs included in it. Further
-optimization effort is caused by the support of a variety of *JOIN
-methods* (e.g., nested loop, index scan, merge join in Postgres) to
-process individual JOINs and a diversity of *indices* (e.g., r-tree,
-b-tree, hash in Postgres) as access paths for relations.
+optimize is the <FirstTerm>join</FirstTerm>. The number of alternative plans to answer a query
+grows exponentially with the number of <Command>join</Command>s included in it. Further
+optimization effort is caused by the support of a variety of <FirstTerm>join methods</FirstTerm>
+ (e.g., nested loop, index scan, merge join in <ProductName>Postgres</ProductName>) to
+process individual <Command>join</Command>s and a diversity of <FirstTerm>indices</FirstTerm> (e.g., r-tree,
+b-tree, hash in <ProductName>Postgres</ProductName>) as access paths for relations.

-   The current Postgres optimizer implementation performs a *near-
-exhaustive search* over the space of alternative strategies. This query
+<Para>
+   The current <ProductName>Postgres</ProductName> optimizer implementation performs a <FirstTerm>near-
+exhaustive search</FirstTerm> over the space of alternative strategies. This query
 optimization technique is inadequate to support database application
 domains that involve the need for extensive queries, such as artificial
 intelligence.

+<Para>
   The Institute of Automatic Control at the University of Mining and
 Technology, in Freiberg, Germany, encountered the described problems as its
-folks wanted to take the Postgres DBMS as the backend for a decision
+folks wanted to take the <ProductName>Postgres</ProductName> DBMS as the backend for a decision
 support knowledge based system for the maintenance of an electrical
-power grid. The DBMS needed to handle large JOIN queries for the
+power grid. The DBMS needed to handle large <Command>join</Command> queries for the
 inference machine of the knowledge based system.

+<Para>
   Performance difficulties in exploring the space of possible query
 plans created the demand for a new optimization technique.

-   In the following we propose the implementation of a *Genetic
-Algorithm* as an option for the database query optimization problem.
+<Para>
+   In the following we propose the implementation of a <FirstTerm>Genetic Algorithm</FirstTerm>
+ as an option for the database query optimization problem.


-2.) Genetic Algorithms (GA)
-===========================
+<Sect1>
+<Title>Genetic Algorithms (<Acronym>GA</Acronym>)</Title>

-   The GA is a heuristic optimization method which operates through 
+<Para>
+   The <Acronym>GA</Acronym> is a heuristic optimization method which operates through 
 determined, randomized search. The set of possible solutions for the
-optimization problem is considered as a *population* of *individuals*.
+optimization problem is considered as a <FirstTerm>population</FirstTerm> of <FirstTerm>individuals</FirstTerm>.
 The degree of adaption of an individual to its environment is specified
-by its *fitness*.
+by its <FirstTerm>fitness</FirstTerm>.

+<Para>
   The coordinates of an individual in the search space are represented
-by *chromosomes*, in essence a set of character strings. A *gene* is a
+by <FirstTerm>chromosomes</FirstTerm>, in essence a set of character strings. A <FirstTerm>gene</FirstTerm> is a
 subsection of a chromosome which encodes the value of a single parameter
-being optimized. Typical encodings for a gene could be *binary* or
-*integer*.
+being optimized. Typical encodings for a gene could be <FirstTerm>binary</FirstTerm> or
+<FirstTerm>integer</FirstTerm>.

-   Through simulation of the evolutionary operations *recombination*,
-*mutation*, and *selection* new generations of search points are found
+<Para>
+   Through simulation of the evolutionary operations <FirstTerm>recombination</FirstTerm>,
+<FirstTerm>mutation</FirstTerm>, and <FirstTerm>selection</FirstTerm> new generations of search points are found
 that show a higher average fitness than their ancestors.

-   According to the "comp.ai.genetic" FAQ it cannot be stressed too
-strongly that a GA is not a pure random search for a solution to a
-problem. A GA uses stochastic processes, but the result is distinctly
+<Para>
+   According to the "comp.ai.genetic" <Acronym>FAQ</Acronym> it cannot be stressed too
+strongly that a <Acronym>GA</Acronym> is not a pure random search for a solution to a
+problem. A <Acronym>GA</Acronym> uses stochastic processes, but the result is distinctly
 non-random (better than random). 

-Structured Diagram of a GA:
+<ProgramListing>
+Structured Diagram of a <Acronym>GA</Acronym>:
 ---------------------------

 P(t)    generation of ancestors at a time t
@@ -101,128 +126,233 @@ P''(t)  generation of descendants at a time t
 |   +-------------------------------------+
 |   | t := t + 1                          |
 +===+=====================================+
+</ProgramListing>
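The loop in the diagram above can be sketched in a few lines. This is a minimal illustration, not the GEQO source: it assumes a steady-state variant in which each step replaces only the least fit individual, picks parents uniformly at random (Genitor itself uses a rank-biased choice), and minimises a cost. The names `steady_state_ga`, `random_bits`, `zeros` and `uniform_crossover` are hypothetical and exist only for this example.

```python
import random


def steady_state_ga(fitness, random_individual, recombine,
                    pool_size=20, generations=200, seed=0):
    """Minimise `fitness`; each step replaces at most the worst individual."""
    rng = random.Random(seed)
    # P(0): random initial population, kept sorted from best to worst.
    pool = sorted((random_individual(rng) for _ in range(pool_size)),
                  key=fitness)
    for _ in range(generations):
        # Selection (simplified assumption: uniform choice of two parents).
        mom, dad = rng.sample(pool, 2)
        # Recombination produces one descendant; no mutation is applied.
        child = recombine(mom, dad, rng)
        # Steady state: the child only displaces the least fit individual.
        if fitness(child) < fitness(pool[-1]):
            pool[-1] = child
            pool.sort(key=fitness)
    return pool[0]


# Toy problem (an assumption for the demo): find a bit string with few zeros.
def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(16)]


def zeros(individual):
    return individual.count(0)


def uniform_crossover(mom, dad, rng):
    return [rng.choice(pair) for pair in zip(mom, dad)]


best = steady_state_ga(zeros, random_bits, uniform_crossover)
print(best, zeros(best))
```

Because a child enters the pool only when it beats the current worst member, the best individual never gets worse from one step to the next, which is the property that makes the steady-state scheme converge quickly.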

+<Sect1>
+<Title>Genetic Query Optimization (<Acronym>GEQO</Acronym>) in Postgres</Title>

-3.) Genetic Query Optimization (GEQO) in PostgreSQL
-===================================================
-
-   The GEQO module is intended for the solution of the query
-optimization problem similar to a traveling salesman problem (TSP).
+<Para>
+   The <Acronym>GEQO</Acronym> module is intended for the solution of the query
+optimization problem similar to a traveling salesman problem (<Acronym>TSP</Acronym>).
 Possible query plans are encoded as integer strings. Each string
-represents the JOIN order from one relation of the query to the next.
-E. g., the query tree  /\
-                      /\ 2
-                     /\ 3
-                    4  1  is encoded by the integer string '4-1-3-2',
+represents the <Command>join</Command> order from one relation of the query to the next.
+E.g., the query tree
+<ProgramListing>
+       /\
+      /\ 2
+     /\ 3
+    4  1
+</ProgramListing>
+is encoded by the integer string '4-1-3-2',
 which means, first join relation '4' and '1', then '3', and
-then '2', where 1, 2, 3, 4 are relids in PostgreSQL.
+then '2', where 1, 2, 3, 4 are relids in <ProductName>Postgres</ProductName>.
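The encoding described above can be illustrated with a small decoder. This is only a sketch of the idea, assuming the string is read left to right into a left-deep tree; `decode_join_order` is a hypothetical helper, not a routine from the GEQO module.

```python
def decode_join_order(tour):
    """Turn a tour such as '4-1-3-2' into a nested left-deep join tree."""
    relids = [int(token) for token in tour.split("-")]
    if len(relids) < 2 or len(set(relids)) != len(relids):
        raise ValueError("tour must name at least two distinct relations")
    # Join the first two relations, then add one relation per step.
    tree = (relids[0], relids[1])
    for relid in relids[2:]:
        tree = (tree, relid)
    return tree


print(decode_join_order("4-1-3-2"))  # (((4, 1), 3), 2)
```

Each nested pair is one join, so the innermost pair `(4, 1)` is processed first, matching the tree drawn above.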

-   Parts of the GEQO module are adapted from D. Whitley's Genitor
+<Para>
+   Parts of the <Acronym>GEQO</Acronym> module are adapted from D. Whitley's Genitor
 algorithm.

-   Specific characteristics of the GEQO implementation in PostgreSQL
+<Para>
+   Specific characteristics of the <Acronym>GEQO</Acronym> implementation in <ProductName>Postgres</ProductName>
 are:

-o  usage of a *steady state* GA (replacement of the least fit
+<ItemizedList Mark="bullet" Spacing="compact">
+<ListItem>
+<Para>
+Usage of a <FirstTerm>steady state</FirstTerm> <Acronym>GA</Acronym> (replacement of the least fit
   individuals in a population, not whole-generational replacement)
   allows fast convergence towards improved query plans. This is
   essential for query handling with reasonable time;
+</Para>
+</ListItem>

-o  usage of *edge recombination crossover* which is especially suited
-   to keep edge losses low for the solution of the TSP by means of a GA;
+<ListItem>
+<Para>
+Usage of <FirstTerm>edge recombination crossover</FirstTerm> which is especially suited
+   to keep edge losses low for the solution of the <Acronym>TSP</Acronym> by means of a <Acronym>GA</Acronym>;
+</Para>
+</ListItem>

-o  mutation as genetic operator is deprecated so that no repair
-   mechanisms are needed to generate legal TSP tours.
+<ListItem>
+<Para>
+Mutation as genetic operator is deprecated so that no repair
+   mechanisms are needed to generate legal <Acronym>TSP</Acronym> tours.
+</Para>
+</ListItem>
+</ItemizedList>
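To show why edge recombination needs no repair step, here is a minimal sketch of the operator as it is commonly described for the TSP. It is not taken from the GEQO source; the function name, the tie-breaking rule and the treatment of tours as cyclic are assumptions made for this example.

```python
import random


def edge_recombination(mom, dad, rng):
    """Build a child tour that reuses parental edges wherever possible."""
    # Edge table: for each relation, its neighbours in either parent tour
    # (tours are treated as cyclic, as in the TSP).
    neighbours = {gene: set() for gene in mom}
    for tour in (mom, dad):
        for i, gene in enumerate(tour):
            neighbours[gene].add(tour[i - 1])
            neighbours[gene].add(tour[(i + 1) % len(tour)])

    current = rng.choice([mom[0], dad[0]])
    child = [current]
    while len(child) < len(mom):
        # A gene that is already used can no longer be anyone's neighbour.
        for adjacent in neighbours.values():
            adjacent.discard(current)
        candidates = neighbours.pop(current)
        if candidates:
            # Prefer the neighbour with the fewest remaining neighbours.
            fewest = min(len(neighbours[c]) for c in candidates)
            current = rng.choice(
                sorted(c for c in candidates if len(neighbours[c]) == fewest))
        else:
            # Dead end ("edge failure"): continue with any unused gene.
            current = rng.choice(sorted(neighbours))
        child.append(current)
    return child


mom = [1, 2, 3, 4, 5]
dad = [3, 1, 5, 2, 4]
child = edge_recombination(mom, dad, random.Random(1))
print(child)
```

Every gene is taken from the table exactly once, so the child is always a valid permutation of the parents' relations. That is the sense in which no repair mechanism is needed.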

-   The GEQO module gives the following benefits to the PostgreSQL DBMS
-compared to the Postgres query optimizer implementation:
+<Para>
+   The <Acronym>GEQO</Acronym> module gives the following benefits to the <ProductName>Postgres</ProductName> DBMS
+compared to the <ProductName>Postgres</ProductName> query optimizer implementation:

-o  handling of large JOIN queries through non-exhaustive search;
+<ItemizedList Mark="bullet" Spacing="compact">
+<ListItem>
+<Para>
+Handling of large <Command>join</Command> queries through non-exhaustive search;
+</Para>
+</ListItem>

-o  improved cost size approximation of query plans since no longer
-   plan merging is needed (the GEQO module evaluates the cost for a
+<ListItem>
+<Para>
+Improved cost size approximation of query plans since plan merging is no
+   longer needed (the <Acronym>GEQO</Acronym> module evaluates the cost for a
   query plan as an individual).
+</Para>
+</ListItem>
+</ItemizedList>

+</Sect1>

-References
-==========
+<Sect1>
+<Title>Future Implementation Tasks for <ProductName>Postgres</ProductName> <Acronym>GEQO</Acronym></Title>

-J. Heitk"otter, D. Beasley:
---------------------------
-   "The Hitch-Hicker's Guide to Evolutionary Computation",
-   FAQ in 'comp.ai.genetic',
-   'ftp://ftp.Germany.EU.net/pub/research/softcomp/EC/Welcome.html'
+<Sect2>
+<Title>Basic Improvements</Title>

-Z. Fong:
--------
-   "The Design and Implementation of the Postgres Query Optimizer",
-   file 'planner/Report.ps' in the 'postgres-papers' distribution
+<Sect3>
+<Title>Improve freeing of memory when query is already processed</Title>

-R. Elmasri, S. Navathe:
-----------------------
-   "Fundamentals of Database Systems",
-   The Benjamin/Cummings Pub., Inc.
+<Para>
+With large <Command>join</Command> queries the computing time spent for the genetic query
+optimization seems to be a mere <Emphasis>fraction</Emphasis> of the time
+ <ProductName>Postgres</ProductName>
+needs for freeing memory via routine <Function>MemoryContextFree</Function>,
+file <FileName>backend/utils/mmgr/mcxt.c</FileName>.
+Debugging showed that it gets stuck in a loop of routine
+<Function>OrderedElemPop</Function>, file <FileName>backend/utils/mmgr/oset.c</FileName>.
+The same problems arise with long queries when using the normal
+<ProductName>Postgres</ProductName> query optimization algorithm.

+<Sect3>
+<Title>Improve genetic algorithm parameter settings</Title>

-=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
-*         Things left to done for the PostgreSQL                    *
-=           Genetic Query Optimization (GEQO)                       =
-*              module implementation                                *
-=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
-* Martin Utesch		      * Institute of Automatic Control      *
-=                             = University of Mining and Technology =
-* utesch@aut.tu-freiberg.de   * Freiberg, Germany                   *
-=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
-
-
-1.) Basic Improvements
-===============================================================
-
-a) improve freeing of memory when query is already processed:
-------------------------------------------------------------
-with large JOIN queries the computing time spent for the genetic query
-optimization seems to be a mere *fraction* of the time Postgres
-needs for freeing memory via routine 'MemoryContextFree',
-file 'backend/utils/mmgr/mcxt.c';
-debugging showed that it get stucked in a loop of routine
-'OrderedElemPop', file 'backend/utils/mmgr/oset.c';
-the same problems arise with long queries when using the normal
-Postgres query optimization algorithm;
-
-b) improve genetic algorithm parameter settings:
------------------------------------------------
-file 'backend/optimizer/geqo/geqo_params.c', routines
-'gimme_pool_size' and 'gimme_number_generations';
+<Para>
+In file <FileName>backend/optimizer/geqo/geqo_params.c</FileName>, routines
+<Function>gimme_pool_size</Function> and <Function>gimme_number_generations</Function>,
 we have to find a compromise for the parameter settings
 to satisfy two competing demands:
-1.  optimality of the query plan
-2.  computing time
+<ItemizedList Spacing="compact">
+<ListItem>
+<Para>
+Optimality of the query plan
+</Para>
+</ListItem>
+<ListItem>
+<Para>
+Computing time
+</Para>
+</ListItem>
+</ItemizedList>

-c) find better solution for integer overflow:
---------------------------------------------
-file 'backend/optimizer/geqo/geqo_eval.c', routine
-'geqo_joinrel_size';
-the present hack for MAXINT overflow is to set the Postgres integer
-value of 'rel->size' to its logarithm;
-modifications of 'struct Rel' in 'backend/nodes/relation.h' will
-surely have severe impacts on the whole PostgreSQL implementation.
+<Sect3>
+<Title>Find better solution for integer overflow</Title>

-d) find solution for exhausted memory:
--------------------------------------
-that may occur with more than 10 relations involved in a query,
-file 'backend/optimizer/geqo/geqo_eval.c', routine
-'gimme_tree' which is recursively called;
-maybe I forgot something to be freed correctly, but I dunno what;
-of course the 'rel' data structure of the JOIN keeps growing and
-growing the more relations are packed into it;
-suggestions are welcome :-(
+<Para>
+In file <FileName>backend/optimizer/geqo/geqo_eval.c</FileName>, routine
+<Function>geqo_joinrel_size</Function>,
+the present hack for MAXINT overflow is to set the <ProductName>Postgres</ProductName> integer
+value of <StructField>rel->size</StructField> to its logarithm.
+Modifications of <StructName>Rel</StructName> in <FileName>backend/nodes/relation.h</FileName> will
+surely have severe impacts on the whole <ProductName>Postgres</ProductName> implementation.
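The overflow problem and the logarithm workaround can be made concrete with a small calculation. This is an illustrative sketch only: it assumes a 32-bit signed limit and uses a plain cross product as the size estimate, which is not necessarily what `geqo_joinrel_size` computes. The helper `log_cross_product_size` is hypothetical.

```python
import math

INT_MAX = 2**31 - 1  # assumed 32-bit signed integer limit


def log_cross_product_size(sizes):
    """Natural log of the product of `sizes`, computed as a sum of logs."""
    return sum(math.log(size) for size in sizes)


sizes = [100_000] * 4               # four relations of 100,000 rows each
product = math.prod(sizes)          # 10**20, far beyond INT_MAX
log_size = log_cross_product_size(sizes)

print(product > INT_MAX)            # True
print(round(log_size))              # 46
```

The logarithm fits comfortably in an integer field, but every consumer of the field must then know that it holds a logarithm rather than a row count, which is why the text calls it a hack.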
+
+<Sect3>
+<Title>Find solution for exhausted memory</Title>
+
+<Para>
+Memory exhaustion may occur with more than 10 relations involved in a query.
+In file <FileName>backend/optimizer/geqo/geqo_eval.c</FileName>, routine
+<Function>gimme_tree</Function> is recursively called.
+Maybe I forgot something to be freed correctly, but I dunno what.
+Of course the <StructName>rel</StructName> data structure of the <Command>join</Command> keeps growing and
+growing the more relations are packed into it.
+Suggestions are welcome :-(


-2.) Further Improvements
-===============================================================
-Enable bushy query tree processing within PostgreSQL;
+<Sect2>
+<Title>Further Improvements</Title>
+
+<Para>
+Enable bushy query tree processing within <ProductName>Postgres</ProductName>;
 that may improve the quality of query plans.

-</ProgramListing>
+<BIBLIOGRAPHY>
+<TITLE>
+References
+</TITLE>
+<PARA>Reference information for <Acronym>GEQO</Acronym> algorithms.
+</PARA>
+<BIBLIOENTRY>
+
+<BOOKBIBLIO>
+<TITLE>
+The Hitch-Hiker's Guide to Evolutionary Computation
+</TITLE>
+<AUTHORGROUP>
+<AUTHOR>
+<FIRSTNAME>J&ouml;rg</FIRSTNAME>
+<SURNAME>Heitk&ouml;tter</SURNAME>
+</AUTHOR>
+<AUTHOR>
+<FIRSTNAME>David</FIRSTNAME>
+<SURNAME>Beasley</SURNAME>
+</AUTHOR>
+</AUTHORGROUP>
+<PUBLISHER>
+<PUBLISHERNAME>
+Internet resource
+</PUBLISHERNAME>
+</PUBLISHER>
+<ABSTRACT>
+<Para>
+FAQ in <ULink url="news://comp.ai.genetic">comp.ai.genetic</ULink>
+is available at <ULink url="ftp://ftp.Germany.EU.net/pub/research/softcomp/EC/Welcome.html">Encore</ULink>.
 </Para>
+</ABSTRACT>
+</BOOKBIBLIO>
+
+<BOOKBIBLIO>
+<TITLE>
+The Design and Implementation of the Postgres Query Optimizer
+</TITLE>
+<AUTHORGROUP>
+<AUTHOR>
+<FIRSTNAME>Z.</FIRSTNAME>
+<SURNAME>Fong</SURNAME>
+</AUTHOR>
+</AUTHORGROUP>
+<PUBLISHER>
+<PUBLISHERNAME>
+University of California, Berkeley Computer Science Department
+</PUBLISHERNAME>
+</PUBLISHER>
+<ABSTRACT>
+<Para>
+File <FileName>planner/Report.ps</FileName> in the 'postgres-papers' distribution.
+</Para>
+</ABSTRACT>
+</BOOKBIBLIO>
+
+<BOOKBIBLIO>
+<TITLE>
+Fundamentals of Database Systems
+</TITLE>
+<AUTHORGROUP>
+<AUTHOR>
+<FIRSTNAME>R.</FIRSTNAME>
+<SURNAME>Elmasri</SURNAME>
+</AUTHOR>
+<AUTHOR>
+<FIRSTNAME>S.</FIRSTNAME>
+<SURNAME>Navathe</SURNAME>
+</AUTHOR>
+</AUTHORGROUP>
+<PUBLISHER>
+<PUBLISHERNAME>
+The Benjamin/Cummings Pub., Inc.
+</PUBLISHERNAME>
+</PUBLISHER>
+</BOOKBIBLIO>
+
+</BIBLIOENTRY>
+</BIBLIOGRAPHY>
+
 </Chapter>