World: r3wp
[XML] xml related conversations
Christophe 7-Nov-2005 [300] | More recent and up-to-date (and used by the French community) is RUn : http://rebol-unit.sourceforge.net/ |
Geomol 7-Nov-2005 [301] | But that'll add to the size. I like RebXML to take up minimal space. |
Christophe 7-Nov-2005 [302] | > Some more ideas: I think the idea behind rebxml is great - build some common format representing xml in REBOL blocks. Some more ideas/wishes: > nodes like [elem "chapter" attribs [name "value" id "0815"] [ elem "sect" attribs [ id "5x12"] [ ....]] Our first solution (actually the one we're now using in production) was similar to that. But it adds a lot of overhead to the data, and the data addressing is far from intuitive: aaa/elem/bbb/elem/ccc/attribs/name instead of aaa/bbb/ccc/name, for instance. In our experience it was not the most suitable solution. |
Geomol 7-Nov-2005 [303x2] | I agree. I think, if comments are to be handled in RebXML, they should be represented as strings. Then the hurdle to distinguish them from data strings has to be solved. |
It would be trivial to parse a RebXML block and add the node names (elem, attribs and comment), if that format is desired, but RebXML itself should carry as little overhead as possible. | |
Christophe 7-Nov-2005 [305] | Geomol: why do you need to handle comments ? Aren't they there to facilitate the _reading_ of the XML code ? You'd not need them if you want to manipulate the data, right? |
Geomol 7-Nov-2005 [306] | Right, but Carsten asked for comments, so: output: rebxml2xml xml2rebxml <XML file> will make output the same as the original XML input. |
Christophe 7-Nov-2005 [307x2] | BTW, we called our project (not having found a better name): EasyXML. Just for the record :-) |
Ok, Geomol, I missed the point | |
Volker 7-Nov-2005 [309] | how about using some extra char? elem! attrib? aaa!/bbb!/ccc?/name ? |
Christophe 7-Nov-2005 [310] | In this case, perhaps you could consider the comments as a special case of an empty tag, marking it with a leading "--" for example. It would not create a lot of overhead, I think |
Geomol 7-Nov-2005 [311] | I need to sleep on it. :-) |
CarstenK 8-Nov-2005 [312] | Christophe: Thanks for the rebol-unit link. How different is EasyXML from rebXML? Another question: how close to XML 1.0 should the REBOL implementation be? If it should be close, the block format needs a document block with doctype information and children (elements, text, comments, processing instructions and attributes) and of course namespaces. How about DTD support and external entities like this: <?xml version="1.0"?> <!DOCTYPE root [ <!ENTITY test SYSTEM "external.xml"> ]> <root> &test; </root> They don't need to be preserved but should be resolved. Geomol: I fully agree with you about having a small format, but I think it would be nice if it supported the basic XML nodes. These are only my wishes of course ..., maybe we don't need extra words for elems and attributes, only for comments or PIs as special types of element children? |
Geomol 8-Nov-2005 [313] | Carsten, I've uploaded new versions of the RebXML scripts to: http://home.tiscali.dk/john.niclasen/rebxml/ Comments are now handled as strings; they are simply preserved without modification, and in rebxml2xml I then check for "<!--" at the start of the string to distinguish them from other string data. Sending xml-data first through xml2rebxml and then through rebxml2xml should only change white-space within tags. Try the new versions and let me know if it works. |
Christophe 8-Nov-2005 [314x2] | Carsten: "how different is EasyXML from rebXML?" I don't know :-) Most of our REBOL development is driven by the needs of my job. Right now I need an easy way to access the parsed data. XPath is an easy way. So we are creating a structure which facilitates access to nested data. And it's fun :-) It could be that John creates something similar, and that we like it and adopt it. Who knows ? |
Has anybody thought about the right data structure to use with a SAX implementation ? I was thinking of hash! and its performance for level 1 data retrieval. Perhaps an appropriate data structure could be a binary array labeling each element with a concatenation of the access path. Like this: <aaa attaaa="aaa1"><bbb>contentbbb</bbb></aaa> becomes make hash! [aaa id2 aaa-attaaa "aaa1" aaa-bbb "contentbbb"] based on a mapping table make hash! [id1 aaa id2 bbb] or something similar... just a rough thought ! | |
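Editor's sketch of the path-keyed table Christophe describes, written in Python because its standard library ships an XML parser. The function name `flatten` and the dash-joined key format are assumptions for illustration, and the id mapping table from the message is left out. Repeated sibling tags would overwrite each other in this simple form.

```python
import xml.etree.ElementTree as ET

def flatten(elem, prefix=""):
    """Return a flat dict keyed by the dash-joined access path (hypothetical format)."""
    path = f"{prefix}-{elem.tag}" if prefix else elem.tag
    table = {}
    # Attributes are stored under path-attributename.
    for name, value in elem.attrib.items():
        table[f"{path}-{name}"] = value
    # Non-blank text content is stored under the element's own path.
    text = (elem.text or "").strip()
    if text:
        table[path] = text
    # Children extend the path by one level.
    for child in elem:
        table.update(flatten(child, path))
    return table

doc = ET.fromstring('<aaa attaaa="aaa1"><bbb>contentbbb</bbb></aaa>')
table = flatten(doc)
print(table)
```

A single-level lookup such as `table["aaa-bbb"]` then replaces walking the tree, which is the retrieval pattern the message has in mind for hash!.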
BrianH 8-Nov-2005 [316] | SAX apis don't work like that. They generate a series of events, not a series of data. |
Christophe 8-Nov-2005 [317] | I thought SAX was about finding the most suitable data structure - not a tree representation, which is DOM. I don't know if the event handling part is mandatory (BTW, for whom ?). Isn't it all about accessing XML data the best way a PL can ? |
BrianH 8-Nov-2005 [318x2] | For SAX, the event handling is the data model, the whole thing that makes it efficient. |
The only difference is whether it is push (callbacks) or pull (state machine, I think). | |
Christophe 8-Nov-2005 [320] | Ok, I'm not a SAX specialist :-/ for my understanding, could you give an example of how <aaa attaaa="aaa1"><bbb>contentbbb</bbb></aaa> should be SAX-handled ? |
BrianH 8-Nov-2005 [321] | If you say "I want to do a SAX-style XML parser", you mean event handling. Other data models have their own apis to copy, or don't so you have to come up with something new :) |
Christophe 8-Nov-2005 [322] | So we can call it RebSAX approach :-)) ? |
BrianH 8-Nov-2005 [323x5] | As for that data, let's assume a normal, fine-grained model. I'll just list the events:
    tag "aaa"
    attribute "attaaa" "aaa1"
    end tag
    tag "bbb"
    end tag
    contentbbb
    tag "/bbb"
    end tag
    tag "/aaa"
    end tag
If you use a more coarse-grained model, you could have an event for a whole tag, its attributes, namespaces and such, rather than separate events for each. This might be more appropriate for a more powerful language like REBOL. Fine-grained events are really more appropriate for languages with poor data structure support, like C or rebcode. |
Balancing the detail of the events against the function-call overhead of the language may be appropriate. One advantage to SAX-like apis is that you can register handlers for certain events and ignore others you aren't interested in, making your code even more efficient. | |
Those tag "/bbb" end tag events might be better named closetag "bbb" since close tags aren't supposed to have attributes anyway. | |
The important thing is to make sure that the events or data structures are a good map of the semantic model of XML. They have standards abut that too. | |
(abut = about) | |
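Editor's sketch of the coarse-grained, push-style variant described above, using Python's standard `xml.sax` module. The callback names `startElement`, `characters` and `endElement` belong to that module; the class name `EventLogger` and the "open"/"text"/"close" labels are assumptions made for illustration only.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Record each parser callback as a tuple (hypothetical event labels)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Coarse-grained: one event carries the tag and all of its attributes.
        self.events.append(("open", name, dict(attrs.items())))

    def characters(self, content):
        # A parser may deliver text in several chunks; blank chunks are skipped.
        if content.strip():
            self.events.append(("text", content))

    def endElement(self, name):
        # Close tags carry only the name, as suggested for "closetag".
        self.events.append(("close", name))

handler = EventLogger()
xml.sax.parseString(b'<aaa attaaa="aaa1"><bbb>contentbbb</bbb></aaa>', handler)
for event in handler.events:
    print(event)
```

A handler that only cares about some events simply leaves the other callbacks at their do-nothing defaults, which is the "register handlers for certain events and ignore others" advantage mentioned above.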
CarstenK 8-Nov-2005 [328] | John: I've downloaded the scripts and will check them. |
Christophe 8-Nov-2005 [329] | Did you have a look at the source of 'parse-xml ? Is this what is meant to be event-driven ? |
BrianH 8-Nov-2005 [330] | No, parse-xml generates a (broken, incomplete) DOM tree. Gavin McKenzie's xml-parse is more like a SAX parser. |
Christophe 8-Nov-2005 [331] | Hmm... I will dig a little more into the theory, I think. I had learned another approach to that. Thanks anyway for showing the way ! |
CarstenK 9-Nov-2005 [332x3] | I've also had a look inside xml-parse, it seems to be really like SAX - ready to use. But nobody is maintaining it, I think. As far as I understand, somebody could create a Handler to get the desired block structure (for instance a Handler for RebXML or any other model). I have to learn about this in REBOL. A question: how can I measure memory for a block or an object tree in REBOL? |
RebXML: I did some testing with rebxml; the documents I used can be found here: http://www.simplix.de/rebol/resources/xml/xmltests.zip There is also a simple script that reads the XML docs in and writes them back. Some problems I found:
- empty attributes, I have fixed this in the zip
- entities in content: all should be escaped, because they can be found there, otherwise a " gets &quot;
- comments after last element missed
- comments before first element - missing line feed
- missing PIs in output
Another question: encoding - it seems that all output files will be written in iso-8859-1 ? | |
I have no idea about comparison of XML documents (input and output of rebxml for instance ) to ensure correctness, but it seems to be difficult. | |
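Editor's sketch of one way to compare an input document with its round-tripped output: canonicalize both and compare the resulting strings. It assumes Python 3.8 or later, where the standard library provides `xml.etree.ElementTree.canonicalize` (C14N 2.0). The helper name `same_xml` is hypothetical, and treating surrounding whitespace in text as insignificant is an assumption that may not suit every document.

```python
from xml.etree.ElementTree import canonicalize

def same_xml(a: str, b: str) -> bool:
    """Compare two XML strings after canonicalization (hypothetical helper)."""
    # Canonical form sorts attributes, normalizes whitespace inside tags and
    # expands empty elements; strip_text also trims whitespace around text.
    return canonicalize(a, strip_text=True) == canonicalize(b, strip_text=True)

original = '<root  b="2" a="1"><item/>\n</root>'
rewritten = '<root a="1" b="2"><item></item></root>'
print(same_xml(original, rewritten))
```

Comments are ignored by default in this comparison; passing `with_comments=True` to `canonicalize` would include them, which matters when checking that comments survive the round trip.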
Geomol 9-Nov-2005 [335x2] | About memory for a block or object: if you mean in bytes internally in REBOL, I don't know. But you could save the block or object to a file and see a size that way. You can of course see the length of a series with: length? |
About encoding in RebXML, rebxml2xml lets you produce utf-8 by specifying the /utf-8 refinement: rebxml2xml/utf-8 <some rebxml data> | |
CarstenK 9-Nov-2005 [337] | With length? I need some recursion, otherwise I get only the first level of the block if it is nested? How to serialize an object tree in REBOL - is there some function available? |
Geomol 9-Nov-2005 [338x2] | Carsten, a recursive function to count length of blocks with nested blocks: total-length?: func [b [block!] /local n] [ n: 0 foreach e b [if block? e [n: n + total-length? e] n: n + 1] n ] |
total-length? will count elements, and another block is also an element. | |
CarstenK 9-Nov-2005 [340] | John: Thank you, I'll play with it. I found this python tool - maybe some interesting ideas there: http://uche.ogbuji.net/uche.ogbuji.net/tech/4suite/amara/quickref He uses objects but I like the idea for accessing xml - replacing the dots with slashes, it looks to me like REBOL: doc/a/nodeName doc/a/b/1 ... doc/xml |
Chris 10-Nov-2005 [341x3] | Catching up a little. Be interesting to summarise this thread as there are many different ideas expressed. rebxml looks interesting for loading, saving and likely extracting xml, but still perhaps difficult to manipulate. |
note: this group isn't showing on the web site, is this due to [web public] instead of [web-public] ? | |
I've also noticed a tendency to kick the DOM (no doubt for good reason) -- though worth noting that it is a complete api to xml and it is a standard api, I wouldn't underestimate the value of the latter, particularly when it comes to Rebol advocacy... | |
Geomol 11-Nov-2005 [344] | RebXML is meant for conversion between the RebXML format and other formats (incl. XML). I use the RebXML format with NicomDoc, which makes it a lot easier to handle document formats. Let's say you've got an XML file and want to convert it to a format easily read by some application. You first use xml2rebxml to get the XML file into RebXML format. Then you make a converter from RebXML to the final format by renaming the rebxml2xml script and changing it to produce the output that is wanted. rebxml2xml holds the structure of the RebXML format, so it's easier to start with that script. Search for "output" in rebxml2xml. Maybe I should make a converter from RebXML to some format very easily manipulated directly within REBOL, like the python tool Carsten found. |
Chris 11-Nov-2005 [345x2] | But this is the issue here with Rebol and XML, there are solutions that suit one XML operation or another. Aiming for loosely implementing DOM gives us loading, extraction, modification, and saving without affecting the integrity of the data structure. Examples: changing the title of an HTML page, adding an entry to an RSS file, etc. |
Using DOM methods, you can do this albeit clumsily, but completely. All through a set of standard functions, with no need to manipulate the structure directly. | |
Pekr 11-Nov-2005 [347] | hmm, couldn't we just somehow mix the approaches, so as to have some streamed dom? :-) I don't like the idea of having to load a 10MB XML interchange file into memory .... |
Chris 11-Nov-2005 [348] | Any less than you'd want a 10mb Rebol interchange file? What % of cases would this be an issue? |
Volker 11-Nov-2005 [349] | xml is used to store word-files, rebol not? :) |