XML-handling functions for PostgreSQL
=====================================

Development of this module was sponsored by Torchbox Ltd. (www.torchbox.com)
It has the same BSD licence as PostgreSQL.

This version of the XML functions provides both XPath querying and
XSLT functionality. There is also a new table function which allows
the straightforward return of multiple XML results. Note that the
current code takes no particular care over character sets; this is
something that should be fixed at some point!

Installation
------------

The current build process will only work if the files are in
contrib/xml in a PostgreSQL 7.3 or 7.4 source tree which has been
configured and built. (If you alter the subdir value in the Makefile,
you can place them in a different directory in the PostgreSQL tree.)

Before you begin, check the Makefile; then run 'make' and
'make install'.

This code requires libxml to be installed beforehand.

Description of functions
------------------------

The first set of functions provides straightforward XML parsing and
XPath queries:

xml_valid(document) RETURNS bool

This parses the document text in its parameter and returns true if the
document is well-formed XML.

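As a quick illustration (a sketch only; the literal documents are made
up for this example), the first call below passes a well-formed
document and the second passes one with a mismatched closing tag, so
they should return true and false respectively:

```sql
-- Well-formed document: expected to return true
SELECT xml_valid('<doc><title>Test</title></doc>');

-- Mismatched closing tag, so not well-formed: expected to return false
SELECT xml_valid('<doc><title>Test</doc>');
```
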
xpath_string(document,query) RETURNS text
xpath_number(document,query) RETURNS float4
xpath_bool(document,query) RETURNS bool

These functions evaluate the XPath query on the supplied document, and
cast the result to the specified type.

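The calls below sketch how the three functions might be used. The
document literal and its element names are invented for illustration.
Under the usual XPath casting rules, the first query should yield the
title text, the second the page count as a number, and the third a
boolean.

```sql
-- Hypothetical document, used for illustration only
SELECT xpath_string(
    '<article><title>XML in PostgreSQL</title><pages>12</pages></article>',
    '/article/title');

SELECT xpath_number(
    '<article><title>XML in PostgreSQL</title><pages>12</pages></article>',
    '/article/pages');

-- Any XPath expression can be used, not only a path
SELECT xpath_bool(
    '<article><title>XML in PostgreSQL</title><pages>12</pages></article>',
    'count(/article/pages) > 0');
```
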
xpath_nodeset(document,query,toptag,itemtag) RETURNS text

This evaluates query on document and wraps the result in XML tags. If
the result is multivalued, the output will look like:

<toptag>
<itemtag>Value 1 which could be an XML fragment</itemtag>
<itemtag>Value 2....</itemtag>
</toptag>

If either toptag or itemtag is an empty string, the relevant tag is omitted.
There are also wrapper functions for this operation:

xpath_nodeset(document,query) RETURNS text omits both tags.
xpath_nodeset(document,query,itemtag) RETURNS text omits toptag.

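A minimal sketch of the three forms follows. The document and the tag
names 'results' and 'result' are arbitrary choices for this example;
the output should follow the layout described above.

```sql
-- Four-argument form: wrap the whole result in <results> and each
-- matching node in <result>
SELECT xpath_nodeset('<doc><item>a</item><item>b</item></doc>',
                     '/doc/item', 'results', 'result');

-- Three-argument form: itemtag only, no enclosing toptag
SELECT xpath_nodeset('<doc><item>a</item><item>b</item></doc>',
                     '/doc/item', 'result');

-- Two-argument form: no wrapping tags at all
SELECT xpath_nodeset('<doc><item>a</item><item>b</item></doc>',
                     '/doc/item');
```
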
xpath_list(document,query,separator) RETURNS text

This function returns multiple values separated by the specified
separator, e.g. Value 1,Value 2,Value 3 if separator=','.

xpath_list(document,query) RETURNS text

This is a wrapper for the above function that uses ',' as the separator.

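For example (a sketch using an invented document), the first call
below asks for a semicolon-separated list and the second relies on the
default comma, so by the description above the values a, b and c
should come back joined by ';' and ',' respectively:

```sql
-- Explicit separator
SELECT xpath_list('<doc><item>a</item><item>b</item><item>c</item></doc>',
                  '/doc/item', ';');

-- Two-argument wrapper, which uses ',' as the separator
SELECT xpath_list('<doc><item>a</item><item>b</item><item>c</item></doc>',
                  '/doc/item');
```
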
xpath_table
-----------

This is a table function which evaluates a set of XPath queries on
each of a set of documents and returns the results as a table. The
primary key field from the original document table is returned as the
first column of the result so that the result set from xpath_table can
be readily used in joins.

The function itself takes 5 arguments, all text.

xpath_table(key,document,relation,xpaths,criteria)

key - the name of the "key" field. This is just a field to be used as
the first column of the output table, i.e. it identifies the record from
which each output row came.

document - the name of the field containing the XML document

relation - the name of the table or view containing the documents

xpaths - multiple XPath expressions separated by |

criteria - the contents of the WHERE clause. This must always be
specified, so use "true" or "1=1" here if you want to process all the
rows in the relation.

NB These parameters (except the XPath strings) are just substituted
into a plain SQL SELECT statement, so you have some flexibility - the
statement is

SELECT <key>,<document> FROM <relation> WHERE <criteria>

so those parameters can be *anything* valid in those particular
locations. The result from this SELECT needs to return exactly two
columns (which it will unless you try to list multiple fields for key
or document). Beware that this simplistic approach requires that you
validate any user-supplied values to avoid SQL injection attacks.

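To make the substitution concrete, here is a sketch of a call next to
the statement it would cause to be run, following the template above.
The table articles and its columns article_id and article_xml are
hypothetical names chosen for illustration.

```sql
-- A call such as this one ...
SELECT * FROM
xpath_table('article_id', 'article_xml', 'articles',
            '/article/title',
            'true')
AS t(article_id integer, title text);

-- ... corresponds to this internal query for fetching the documents.
-- The first four words of each argument are pasted in as-is, which is
-- why text derived from user input must be validated beforehand.
SELECT article_id,article_xml FROM articles WHERE true;
```
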
Using the function

The function has to be used in a FROM expression. This gives the following
form:

SELECT * FROM
xpath_table('article_id',
            'article_xml',
            'articles',
            '/article/author|/article/pages|/article/title',
            'date_entered > ''2003-01-01'' ')
AS t(article_id integer, author text, page_count integer, title text);

The AS clause defines the names and types of the columns in the
virtual table. If there are more XPath queries than result columns,
the extra queries will be ignored. If there are more result columns
than XPath queries, the extra columns will be NULL.

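The two mismatch cases can be sketched as follows, reusing the
hypothetical articles table. Remember that the first column in the AS
clause is always the key, so only the columns after it are matched
against the XPath queries.

```sql
-- Three XPath queries but only two columns after the key:
-- the third query (/article/pages) is ignored.
SELECT * FROM
xpath_table('article_id', 'article_xml', 'articles',
            '/article/author|/article/title|/article/pages',
            'true')
AS t(article_id integer, author text, title text);

-- One XPath query but two columns after the key:
-- the extra column (title) is NULL in every row.
SELECT * FROM
xpath_table('article_id', 'article_xml', 'articles',
            '/article/author',
            'true')
AS t(article_id integer, author text, title text);
```
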
Note that I've said in this example that pages is an integer. The
function deals internally with string representations, so when you say
you want an integer in the output, it will take the string
representation of the XPath result and use PostgreSQL input functions
to transform it into an integer (or whatever type the AS clause
requests). An error will result if it can't do this - for example if
the result is empty - so you may wish to just stick to 'text' as the
column type if you think your data has any problems.

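One possible way to follow that advice is sketched below: declare the
doubtful column as text and convert it afterwards, only for rows whose
value looks like a number. This assumes the regular-expression
operator ~ and an explicit text-to-integer cast are available in your
PostgreSQL version; the table and column names are again hypothetical.

```sql
-- page_count is returned as text, so empty or malformed values
-- do not raise an error inside xpath_table
SELECT article_id, CAST(page_count AS integer) AS page_count
FROM xpath_table('article_id', 'article_xml', 'articles',
                 '/article/pages',
                 'true')
     AS t(article_id integer, page_count text)
WHERE page_count ~ '^[0-9]+$';   -- keep only purely numeric strings
```
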
The SELECT statement doesn't need to use * alone - it can reference the
columns by name or join them to other tables. The function produces a
virtual table with which you can perform any operation you wish (e.g.
aggregation, joining, sorting etc). So, as a more complicated example,
we could also have:

SELECT t.title, p.fullname, p.email
FROM xpath_table('article_id','article_xml','articles',
                 '/article/title|/article/author/@id',
                 'xpath_string(article_xml,''/article/@date'') > ''2003-03-20'' ')
     AS t(article_id integer, title text, author_id integer),
     tblPeopleInfo AS p
WHERE t.author_id = p.person_id;

Of course, you could wrap all of this in a view for convenience.

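A view along those lines might look like the sketch below. The view
name article_authors is invented for this example, and the criteria
argument is simply 'true' so that every article is included.

```sql
CREATE VIEW article_authors AS
SELECT t.article_id, t.title, p.fullname, p.email
FROM xpath_table('article_id','article_xml','articles',
                 '/article/title|/article/author/@id',
                 'true')
     AS t(article_id integer, title text, author_id integer),
     tblPeopleInfo AS p
WHERE t.author_id = p.person_id;

-- The view can then be queried like any other relation
SELECT title, fullname FROM article_authors WHERE article_id = 42;
```

Note that the criteria string is fixed inside the view definition, so
any filtering applied to the view happens after xpath_table has
processed all the rows that the criteria selected.
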
XSLT functions
--------------

The following functions are available if libxslt is installed (this is
not currently detected automatically, so you will have to amend the
Makefile):

xslt_process(document,stylesheet,paramlist) RETURNS text

This function applies the XSL stylesheet to the document and returns
the transformed result. The paramlist is a list of parameter
assignments to be used in the transformation, specified in the form
'a=1,b=2'. Note that this is also proof-of-concept code and the
parameter parsing is very simple-minded (e.g. parameter values cannot
contain commas!)

Also note that if either the document or stylesheet values do not
begin with a < then they will be treated as URLs and libxslt will
fetch them. It thus follows that you can use xslt_process as a means
to fetch the contents of URLs - you should be aware of the security
implications of this.

There is also a two-parameter version of xslt_process which does not
pass any parameters to the transformation.

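The following is a rough sketch of both forms. The document, the
stylesheet and the parameter name n are all invented for illustration.
Both literals begin with a < so they are treated as inline text rather
than URLs, and the parameter value is a plain number, in keeping with
the 'a=1,b=2' form described above.

```sql
-- Three-argument form: pass n=3 to the stylesheet
SELECT xslt_process(
'<doc><name>World</name></doc>',
'<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:param name="n" select="1"/>
  <xsl:template match="/">
    <xsl:value-of select="/doc/name"/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="$n"/>
  </xsl:template>
</xsl:stylesheet>',
'n=3');

-- Two-argument form: no parameters are passed to the transformation
SELECT xslt_process(
'<doc><name>World</name></doc>',
'<xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="/doc/name"/>
  </xsl:template>
</xsl:stylesheet>');
```
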
If you have any comments or suggestions, please do contact me at
jgray@azuli.co.uk. Unfortunately, this isn't my main job, so I can't
guarantee a rapid response to your query!