r3wp [groups: 83 posts: 189283]

World: r3wp

[XML] xml related conversations

BrianH
8-Nov-2005
[316]
SAX apis don't work like that. They generate a series of events, 
not a series of data.
Christophe
8-Nov-2005
[317]
I thought SAX was about finding the most suitable data structure 
- not a tree representation, which is DOM.

I don't know if the event handling part is mandatory (BTW, to whom?).
Isn't it all about accessing XML data the best way a PL can?
BrianH
8-Nov-2005
[318x2]
For SAX, the event handling is the data model, the whole thing that 
makes it efficient.
The only difference is whether it is push (callbacks) or pull (state 
machine, I think).
Christophe
8-Nov-2005
[320]
Ok, I'm not a SAX specialist :-/
for my understanding, could you give an example of how 
<aaa attaaa="aaa1"><bbb>contentbbb</bbb></aaa>
should be SAX-handled ?
BrianH
8-Nov-2005
[321]
If you say "I want to do a SAX-style XML parser", you mean event 
handling. Other data models have their own apis to copy, or don't 
so you have to come up with something new :)
Christophe
8-Nov-2005
[322]
So we can call it RebSAX approach :-)) ?
BrianH
8-Nov-2005
[323x5]
As for that data, let's assume a normal, fine-grained model. I'll 
just list the events:

tag "aaa"
attribute "attaaa" "aaa1"
end tag
tag "bbb"
end tag
contentbbb
tag "/bbb"
end tag
tag "/aaa"
end tag


If you use a more coarse-grained model, you could have an event for 
a whole tag, its attributes, namespaces and such, rather than separate 
events for each. This might be more appropriate for a more powerful 
language like REBOL. Fine-grained events are really more appropriate 
for languages with poor data structure support, like C or rebcode.
Balancing the detail of the events against the function-call overhead 
of the language may be appropriate. One advantage to SAX-like apis 
is that you can register handlers for certain events and ignore others 
you aren't interested in, making your code even more efficient.
Those
    tag "/bbb"
    end tag
events might be better named
    closetag "bbb"
since close tags aren't supposed to have attributes anyway.
The important thing is to make sure that the events or data structures 
are a good map of the semantic model of XML. They have standards 
about that too.
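A minimal sketch of the push model described above, in REBOL 2. It assumes each event is stored as its own block so it can be dispatched by name; the event names follow the list above (with closetag), while `events`, `handlers` and `seen` are hypothetical names and not part of xml-parse or any other existing script. Only two handlers are registered, so the other events are ignored.

```rebol
REBOL [Title: "SAX-style event dispatch sketch"]

; Event stream for <aaa attaaa="aaa1"><bbb>contentbbb</bbb></aaa>.
; One block per event is an assumption made here for easy dispatch;
; a real parser would call the handlers directly while scanning.
events: [
    [tag "aaa"]
    [attribute "attaaa" "aaa1"]
    [tag "bbb"]
    [content "contentbbb"]
    [closetag "bbb"]
    [closetag "aaa"]
]

; Register handlers only for the events of interest.
seen: copy []
handlers: reduce [
    'tag     func [args [block!]] [append seen join "<" first args]
    'content func [args [block!]] [append seen first args]
]

; Dispatch: events without a registered handler are simply skipped.
foreach evt events [
    handler: select handlers first evt
    if :handler [handler next evt]
]

probe seen
```

Here the attribute and closetag events cost nothing beyond the lookup, which is the efficiency point made above.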
CarstenK
8-Nov-2005
[328]
John: I've downloaded the scripts and will check them.
Christophe
8-Nov-2005
[329]
Did you have a look at the source of 'parse-xml ? Is this what is 
meant to be event-driven ?
BrianH
8-Nov-2005
[330]
No, parse-xml generates a (broken, incomplete) DOM tree. Gavin McKenzie's 
xml-parse is more like a SAX parser.
Christophe
8-Nov-2005
[331]
Hmm... I will dig a little more into the theory, I think. I had 
learned another approach to that.
Thanks anyway for showing the way !
CarstenK
9-Nov-2005
[332x3]
I've also had a look inside xml-parse, it seems to be really like 
SAX - ready to use. But nobody is maintaining it, I think. As far 
as I understand, somebody could create a Handler to get the desired 
block structure (for instance a Handler for RebXML or any other model). 
I have to learn about this in REBOL.
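A rough sketch of what such a handler could look like, assuming three callbacks (`on-open`, `on-text`, `on-close`) that a SAX-style parser would call. These names and the resulting block layout are hypothetical; they are neither the xml-parse handler interface nor the RebXML format, and attributes are left out to keep it short.

```rebol
REBOL [Title: "Block-building handler sketch"]

root: copy []        ; the finished tree ends up here
current: root        ; block currently being filled
stack: copy []       ; parents of the current block

on-open: func [name [string!] /local node] [
    node: reduce [name]
    append/only current node    ; attach the new element to its parent
    append/only stack current   ; remember the parent for on-close
    current: node
]

on-text: func [data [string!]] [append current data]

on-close: does [
    current: last stack         ; step back up to the parent
    remove back tail stack
]

; Events for <aaa><bbb>contentbbb</bbb></aaa>, called by hand here
on-open "aaa"
on-open "bbb"
on-text "contentbbb"
on-close
on-close

probe root    ; [["aaa" ["bbb" "contentbbb"]]]
```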


A question: how can I measure memory for a block or an object tree 
in REBOL?
RebXML: I did some testing with rebxml, the documents I used can 
be found here:
 http://www.simplix.de/rebol/resources/xml/xmltests.zip

There is also a simple script that reads the XML docs in and writes 
them back.

Some problems I found:
- empty attributes, I have fixed this in the zip

- entities in content: all should be escaped, because they can be 
found there, otherwise a &quot; gets &amp;quot;
- comments after last element missed
- comments before first element - missing line feed
- missing PIs in output


Another question: encoding - it seems that all output files will 
be written in iso-8859-1 ?
I have no idea about comparison of XML documents (input and output 
of rebxml for instance ) to ensure correctness, but it seems to be 
difficult.
Geomol
9-Nov-2005
[335x2]
About memory for a block or object: if you mean in bytes internally 
in REBOL, I don't know. But you could save the block or object to 
a file and see its size that way. You can of course see the length 
of a series with: length?
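A small sketch of both ideas, under stated assumptions: the molded text is used as a stand-in for the saved file (save writes roughly this text), and the difference between two `stats` readings is taken as a rough in-memory figure. The `stats` difference is only approximate, since garbage collection and allocation pools affect it; the sample data is made up.

```rebol
REBOL [Title: "Rough size estimates"]

data: [aaa [attaaa "aaa1"] [bbb "contentbbb"]]

; Serialized size: roughly what would end up in a saved file.
print ["Block, characters when molded:" length? mold data]

; An object tree can be molded the same way.
obj: make object! [name: "aaa" children: [bbb "contentbbb"]]
print ["Object, characters when molded:" length? mold obj]

; Rough in-memory figure: stats before and after an allocation.
before: stats
big: make block! 10000
loop 10000 [append big copy "x"]
print ["Approximate bytes allocated:" stats - before]
```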
About encoding in RebXML: rebxml2xml lets you produce utf-8 by specifying 
the /utf-8 refinement:
rebxml2xml/utf-8 <some rebxml data>
CarstenK
9-Nov-2005
[337]
With length? I need some recursion, otherwise I get only the first 
level of the block if it is nested? How to serialize an object tree 
in REBOL - is there some function available?
Geomol
9-Nov-2005
[338x2]
Carsten, a recursive function to count length of blocks with nested 
blocks:

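total-length?: func [b [block!] /local n] [
	n: 0
	; a nested block counts as one element plus everything inside it
	foreach e b [if block? e [n: n + total-length? e] n: n + 1]
	n	; return the count explicitly, so an empty block gives 0
]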
total-length? will count elements, and another block is also an element.
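If the nested blocks themselves should not be counted, a variant could count only the non-block values. This is a sketch; `leaf-count?` is a hypothetical name, not part of RebXML.

```rebol
REBOL [Title: "Leaf count sketch"]

; Counts only values that are not blocks; nested blocks are
; descended into but do not add to the count themselves.
leaf-count?: func [b [block!] /local n] [
    n: 0
    foreach e b [
        either block? e [n: n + leaf-count? e] [n: n + 1]
    ]
    n
]

print leaf-count? [a [b c] []]    ; 3
```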
CarstenK
9-Nov-2005
[340]
John: Thank you, I'll play with it.

I found this Python tool - maybe some interesting ideas there:
http://uche.ogbuji.net/uche.ogbuji.net/tech/4suite/amara/quickref

He uses objects but I like the idea for accessing xml - replacing 
the dots with slashes, it looks to me like REBOL:
doc/a/nodeName
doc/a/b/1
...
doc/xml
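For comparison, plain REBOL path access already works this way on nested blocks. The sketch below uses a made-up block layout (word followed by value), which is an assumption and not what the Python tool or RebXML produces.

```rebol
REBOL [Title: "Path access on nested blocks"]

; Hypothetical layout: each word is followed by its value.
doc: [
    a [
        nodeName "a"
        b ["first" "second"]
    ]
    xml {<a><b>first</b><b>second</b></a>}
]

print doc/a/nodeName    ; a
print doc/a/b/1         ; first
print doc/xml
```

A path step with a word selects the value after the first matching word, so this only works cleanly when a name occurs once per level.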
Chris
10-Nov-2005
[341x3]
Catching up a little.  Be interesting to summarise this thread as 
there are many different ideas expressed.  rebxml looks interesting 
for loading, saving and likely extracting xml, but still perhaps 
difficult to manipulate.
note: this group isn't showing on the web site, is this due to [web 
public] instead of [web-public] ?
I've also noticed a tendency to kick the DOM (no doubt for good reason) 
-- though worth noting that it is a complete api to xml and it is 
a standard api, I wouldn't underestimate the value of the latter, 
particularly when it comes to Rebol advocacy...
Geomol
11-Nov-2005
[344]
RebXML is meant for conversion to/from the RebXML format and other 
formats (incl. XML). I use the RebXML format with NicomDoc, which 
makes it a lot easier to handle document formats. Let's say you've 
got an XML file and want to convert it to a format easily read by 
some application: you first use xml2rebxml to get the XML file 
into RebXML format. Then make a converter from RebXML to the final 
format by renaming the rebxml2xml script and changing it to produce 
the output that is wanted. rebxml2xml holds the structure of the RebXML 
format, so it's easier to start with that script. Search for "output" 
in rebxml2xml.

Maybe I should make a converter from RebXML to some format very easily 
manipulated directly within REBOL, like the Python tool Carsten 
found.
Chris
11-Nov-2005
[345x2]
But this is the issue here with Rebol and XML, there are solutions 
that suit one XML operation or another.  Aiming for loosely implementing 
DOM gives us loading, extraction, modification, and saving without 
affecting the integrity of the data structure.  Examples: changing 
the title of an HTML page, adding an entry to an RSS file, etc.
Using DOM methods, you can do this albeit clumsily, but completely. 
 All through a set of standard functions, with no need to manipulate 
the structure directly.
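A minimal sketch of the "change the title of an HTML page" example, assuming the document is held as nested blocks of word and value pairs. `get-node` and `set-text` are hypothetical helper names; they only imitate the spirit of DOM access functions and do not implement the W3C DOM API.

```rebol
REBOL [Title: "DOM-like accessor sketch"]

doc: [html [head [title "Old title"] body [p "Hello"]]]

; Walk down the tree, one element name per step.
get-node: func [tree [block!] path [block!] /local node] [
    node: tree
    foreach name path [node: select node name]
    node
]

; Replace the value stored under the last name in the path.
set-text: func [tree [block!] path [block!] value [string!] /local parent] [
    parent: get-node tree copy/part path back tail path
    change next find parent last path value
    tree
]

set-text doc [html head title] "New title"
print get-node doc [html head title]    ; New title
```

The caller never touches the block structure itself, which is the point of going through a fixed set of functions.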
Pekr
11-Nov-2005
[347]
hmm, couldn't we just somehow mix the approaches, so as to have some 
streamed DOM? :-) I don't like the idea of having to load a 10MB XML 
interchange file into memory ....
Chris
11-Nov-2005
[348]
Any less than you'd want a 10mb Rebol interchange file?  What % of 
cases would this be an issue?
Volker
11-Nov-2005
[349]
xml is used to store word-files, rebol not? :)
CarstenK
12-Nov-2005
[350x2]
At the moment I am playing a little bit with xml-parse.r. It has a lot 
of things done, some are still open (like <!ENTITY ...> parsing), 
and it is like SAX. I am trying to implement some handlers to learn REBOL, 
but it's still in progress. A benefit of xml-parse is that there 
would be only one parser and some kind of standard API, and the handler 
could then generate rebxml or some other desired format.
DOM: in Java APIs there were always problems with DOM - a big amount 
of memory, not optimized for the language - so there was a need for 
optimized tools like JDOM, XOM or DOM4J. They all prefer SAX for 
parsing and have their own internal model - of course the API is 
specific to each of these tools and not a standard like DOM.
Volker
12-Nov-2005
[352]
I guess in rebol we have fewer problems than java, as rebol is dynamic 
and java has to emulate that? So it can't map its own classes because 
the format is not known at compile-time? While we can. And then xml 
in memory should be in the order of rebol-blocks?
Maxim
13-Nov-2005
[353]
out of the blue, can anyone point me to the (or one) official XML 
spec ? (if there are many, it should be the one most used on windows 
and in things like PHP)

thanks!
Chris
14-Nov-2005
[354]
http://www.w3.org/TR/REC-xml/
Maxim
14-Nov-2005
[355x2]
thanks Chris !
will be reading top to bottom ...  not that this is any fun...  ;-)
Christophe
27-Nov-2005
[357]
Has anybody already tried a SAX implementation?
Will
8-Jan-2006
[358]
http://tech.motion-twin.com/xmllight.html
Maxim
22-Mar-2006
[359x4]
xml is such bloat.. I am parsing xml these days and for two characters 
of data, I often have 100+ characters of nested stupidity.
an empirical test (dependent on the xml structure and tag names, obviously, 
but this IS a real world xml file)
693 kb in xml form   ==>  90 kb  in nested rebol blocks
I left the tabs at 2 spaces in the rebol output, so that the comparison 
is fair.
Anton
23-Mar-2006
[363]
no need to convince us :-)
[unknown: 9]
23-Mar-2006
[364]
Agreed.  So, write a Rebol block ML that does everything as well 
as XML, and we will support it.
Thør
4-Apr-2006
[365]
manual resync...