ANN: xml-object.r , and...a question about REBOL's built-in parse-xml

[1/7] from: gavin::mckenzie::sympatico::ca at: 4-Oct-2001 22:45

Folks,

First, there's a new rev of xml-object.r (v 1.0.4) available at: http://www3.sympatico.ca/gavin.mckenzie/

This release fixes an error with whitespace processing. And, I changed my switch statements based on the recent switch/type? thread.

I've noticed some limitations in xml-object. If you have an element with an attribute and a subelement with the same name, bad things happen. This should really be considered poor form in XML. However, it is legitimately possible to encounter an attribute and subelement of the same local-name within different namespaces. I could also improve my mixed-content processing somewhat...anyway, more work to do.

Now...on to a question. I've been building some REBOL-Server-Pages where I query against a MS-SQL-Server and get back an XML result set. So, I've been using the built-in REBOL parse-xml rather than my parse-xml+ script, and I noticed something that became very frustrating: when parse-xml encounters an XML declaration (<?xml version ...?>) it calls a function...

    check-version: func [version][print ["XML Version:" version]]

...which has the nasty side effect of printing out "XML Version:" followed by the version number. This message, of course, messes up my carefully crafted HTML page that is produced from my REBOL server page.

Anyway, my quick fix was to use my parse-xml+ function, but given that the XML I was processing was tightly controlled and straightforward I was actually expecting to stick to the built-in function. Oh well, I suppose I could hack the built-in xml-language object that parse-xml uses and stub out the check-version function, but that seems beside the point.

Has anyone else used parse-xml and considered this a real problem? I do hope that some future rev of REBOL will drop the check-version print.

Gavin.

[2/7] from: chris:langreiter at: 5-Oct-2001 9:09

Re: ANN: xml-object.r , and...a question about REBOL's built-in parse-xm

> Has anyone else used parse-xml and considered this a real problem? I do
> hope that some future rev of REBOL will drop the check-version print.

I hope so too. Otherwise you'll continue to find the line

    xml-language/check-version: func [v][return]

in every single REBOL script dealing with XML in a parse-xmly way.

May I raise the question what the point of this print-out is or was?! Or is it just a not-so-subtle display of RT's disregard of XML as data exchange format (which I don't share, though I prefer native REBOL exchange as well)?

BTW, Gavin, your xml-object script is a godsend. RT should include it in future REBOL releases.

-- Chris
__
Vanilla NOW: http://www.langreiter.com/space/vanilla-download

[3/7] from: joel::neely::fedex::com at: 5-Oct-2001 2:00

Hi, Gavin, Gavin F. McKenzie wrote:

> I've noticed some limitations in xml-object. If you have
> element with an attribute and a subelement with the same
> name, bad things happen. This should really be considered
> poor form in XML.

Sorry, but I must emphatically disagree. This is equivalent to saying that recursive function calls are bad form. XML markup shows semantic structure, and it is entirely legitimate that such structure be recursive in nature.

One of the first serious applications I wrote in REBOL (in fact it was one of the main reasons I began using REBOL) was an XML-based web site generator which combines content from individual HTML files with an XML document that represents the structure of the site. It generates per-page "navigation bars" from the knowledge of where each page fits into the overall site, and generates the final pages by inserting content and navigation into templates. (Sorry for the long-winded background, but it is the reason for the example below.)

A simplified version of the site file has content such as this (with ellipses standing for details beside the current point):

    <site docroot="/opt/netscape/suitespot/docs/devgroup/"
          source="/export/home/sitedev/devgroup/" ... >
      <page title="Home" file="index.html" ... >
        <page title="Our Mission" file="mission.html" .../>
        <page title="Our People" file="people.html" .../>
        <page title="Visit Us" file="map.html" .../>
      </page>
      <page title="Projects" file="proj.html" ... >
        <page title="Widgets" file="pr.3094.html" .../>
        <page title="Frobs" file="pr.3128.html" .../>
        <page title="Cruft" file="pr.3312.html" ... >
          <page title="Biggie" file="pr.3467.html" ... >
            <page title="ROI" file="roi.3467.html" .../>
            <page title="Budget" file="bud.3467.html" .../>
          </page>
        </page>
        ...
    </site>

It is entirely reasonable to have some pages with sub-pages and others without. Pages are represented with PAGE elements whose location (nested within other PAGEs or not) in the XML document shows where they fit into the site structure.

Since none of the information about a page (attributes of the PAGE element) is dependent on where the page is in the site, the site can be re-structured simply by moving one or more PAGE elements to a new place in the tree and re-running the generator (usually a 15- to 30-second effort).

Although the "recursion" is indirect, standard HTML allows the nesting of tables and framesets. XHTML (essentially writing HTML with XML notation conventions) should allow these as well.

> I could also improve my mixed-content processing somewhat...anyway, more
> work to do.

<<quoted lines omitted: 7>>

> This message, of course, messes up my carefully crafted HTML
> page that is produced from my REBOL server page.

Disabling that one function is easy. I've made other modifications to xml-parser for other purposes as well. Here's some sample XML ...

    >> foo: {
    {    <?xml version="2.5" ?>
    {    <motor productID="375-2385">
    {        <assembly productID="238-2356">
    {            <assembly productID="795-5837"/>
    {            <assembly productID="123-4567"/>
    {        </assembly>
    {        <assembly productID="987-6543">
    {    </motor>
    {    }
    == {
        <?xml version="2.5" ?>
        <motor productID="375-2385">
            <assembly productID="238-2356">
                <assembly productID="795-5837"...

... which shows your problem when parsed.

    >> parse-xml foo
    XML Version: 2.5
    == [document none [["motor" ["productID" "375-2385"] ["^/ " ["assembly" ["productID" "238-2356"] ["^/ " ["assembly" ["pro...

So, let's disable the offending function ...

    >> xml-language: make xml-language [
    [    check-version: func [version][]
    [    ]

... and parse again.

    >> parse-xml foo
    == [document none [["motor" ["productID" "375-2385"] ["^/ " ["assembly" ["productID" "238-2356"] ["^/ " ["assembly" ["pro...

HTH!

-jn-

--
The end of all our exploring will be to arrive where we started and know the place for the first time. -- T.S. Eliot
joel-dot-neely-FIX-PUNCTUATION-at-fedex-dot-com

[4/7] from: gavin:mckenzie:sympatico:ca at: 5-Oct-2001 8:28

Re: ANN: xml-object.r , and...a question about REBOL's built-in parse-x

Hi Joel,

I think we've misunderstood each other. I should have included an example with my comments as clarification; sorry.

Nested structures with repeating names are absolutely ok. What I was referring to is the following:

    <foo bar="something">
      <bar>something</bar>
    </foo>

In that example foo has both a child element named 'bar' and an attribute named 'bar'. While it is perfectly legal to do this, it is considered (by many) to be poor form because it makes the representation of the XML in objects and exposure into scripting engines (such as Active Scripting or some other script engine) problematic.

The xml-object.r script doesn't handle the above case very well.

What you were referring to is doing something like:

    <foo>
      <bar>something</bar>
      <bar>something</bar>
      </bar>
    </foo>

And of course, this is ok -- in fact it is extremely useful. The ability to represent repeating nested or 'recursive' structures is a very important capability, as you rightly point out. And, I'm happy to say, xml-object.r handles it ok too.

Gavin.
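As a rough illustration of why the attribute/child clash is awkward for an object mapping, here is a minimal sketch. It assumes a hypothetical naive mapping that turns every attribute and every child element into a word of one object; it is not taken from xml-object.r, whose actual mapping may well differ. The values are made up for the illustration.

```rebol
REBOL []

; Hypothetical naive mapping: one object word per attribute and per child.
; This is NOT xml-object.r's code; it only illustrates the name clash.
foo: make object! [
    bar: "from the attribute"      ; would come from <foo bar="...">
    bar: "from the child element"  ; would come from <bar>...</bar>; same word again
]

probe foo/bar   ; only the later assignment is still reachable
```

Because both sources want the word bar, one of them has to be renamed, nested, or dropped, which is the representation problem described above.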

[5/7] from: deryk::iitowns::com at: 5-Oct-2001 22:39

On Friday 05 October 2001 08:28, you wrote:

> What you were referring to is doing something like:
> <foo>

<<quoted lines omitted: 6>>

> capability, as you rightly point out. And, I'm happy to say, xml-object.r
> handles it ok too.

    [[deryk--trek] deryk]$ xmllint lint
    lint:5: error: Opening and ending tag mismatch: foo and bar
    </bar>
    ^
    lint:6: error: Extra content at the end of the document
    </foo>
    ^

_almost_ ;)

[6/7] from: gavin:mckenzie:sympatico:ca at: 5-Oct-2001 12:19

Yeah...the dangers of hand-typing XML and not completely paying attention.

    <foo>
      <bar>
        <bar>something</bar>
      </bar>
    </foo>

Better.

Gavin.

[7/7] from: joel::neely::fedex::com at: 5-Oct-2001 15:28

Re: ANN: xml-object.r , and...a question about REBOL's built-in parse-xml

Hi, Gavin, Thanks for the clarification! Gavin F. McKenzie wrote:

> What I was referring to is the following:
> <foo bar="something">

<<quoted lines omitted: 6>>

> engines (such as Active Scripting or some other script engine)
> problematic.

With my understanding fixed, I admit I disagree less ;-), but I still don't see this as much of an issue of concern.

It seems to me that the concept of attributes is that of name/value pairs that are "parts of" an entity, whereas the concept of a child entity is a subordination issue -- a different relationship. Every time I've played with XML and XML parsing, those have been represented distinctly (e.g. in Perl, a hash for the name/value pairs in attributes and an array for contents; in REBOL a block of name/value pairs for attributes and a separate block of contents), so I wouldn't expect any real implementation issues.

Just my $0.02...

-jn-

--
This sentence contradicts itself -- no actually it doesn't. -- Doug Hofstadter
joel<dot>neely<at>fedex<dot>com
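The separation of attributes and contents can be sketched with the built-in parse-xml. This is a minimal example that assumes the block layout visible in the console output earlier in the thread (element name, then an attribute block, then a contents block); the words root, attrs and contents are names chosen here for illustration.

```rebol
REBOL []

; No XML declaration in this input, so check-version is not involved.
doc: parse-xml {<foo bar="something"><bar>something</bar></foo>}

root:     first third doc   ; the <foo> element: [name attributes contents]
attrs:    second root       ; name/value pairs of foo's attributes
contents: third root        ; child elements (and any text) inside foo

print ["attribute bar:" select attrs "bar"]
print ["first child element:" first first contents]
```

Since the attribute named bar and the child element named bar end up in different blocks, the clash only appears once both are flattened into a single namespace such as an object.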

Notes

Quoted lines have been omitted from some messages.