Parse, recursion, and local variables?
[1/23] from: petr::krenzelok::trz::cz at: 17-Mar-2007 12:15
Hi,
not being much skilled in parsing, I tried to write a little parser for my
own rsp-tag system (I know there are a few robust systems out there, but I
want to learn by implementing my own). The basic idea is to use a kind
of comment tag, which still allows a web designer to display the html
content. The most attractive system for me was Gabriele's Temple, but it
is not finished or supported, so I want my own, much simpler one :-)
My tags look like:
<!--[section-x]-->
subsection html code
<!--/[section-x]-->
I want to detect particular sections and invoke particular modules/code
for them, passing them the section content. Basically it works, but I also
wanted my very primitive parser to do some recursion. And I think I
tracked down why it does not work - inside nested sections, when the
recursive rule is applied, rsp-tag-name is not kept local to the
particular recursion level. Can I make it local by putting the parse into
a function body, defining a word (rsp-tag-name) as a local variable? Or is
the issue more complex, and are my rules simply built insufficiently?
Also - since I can't use to [a | b | c], as we don't have it ;-), I have
to skip by one char. That makes stripping out html subsections a bit
difficult (where to put the correct markers), as my rsp-html and html rules
are simply "1 skip". But maybe that could be solved by defining what the
proper html charset is?
Sorry if my questions are rather primitive for our parse gurus here :-)
-pekr-
---------------------
REBOL []
template: {
this is the beginning of html:-)
<b><!--[mark-x]-->Hello x!<!--/[mark-x]--></b>
<b><!--[mark-y]-->Hello y!<!--/[mark-y]--></b>
<b><!--[mark-z]-->Hello z!<!--/[mark-z]--></b>
<b><!--[mark-w]-->Hello w!
another html code
<!--[mark-u]-->
subsection html code
<!--/[mark-u]-->
finishing mark-w html code
<!--/[mark-w]--></b>
this is the end :-)
}
out: copy ""
;--- uncomment the following to get the incorrect recursion behavior
;rsp-begin: ["<!--[" copy rsp-tag-name to "]-->" "]-->"]
;rsp-end: ["<!--/[" rsp-tag-name "]-->"]
;--- comment out the following if enabling the ones above ...
rsp-begin: ["<!--[" s: some [e: "]-->" (print copy/part s e) break | skip]]
rsp-end: ["<!--/[" s: some [e: "]-->" (print copy/part s e) break | skip]]
;just to distinguish for eventual debugging ...
html: [copy char skip (append out char)]
rsp-html: [copy char skip (append out char)]
rsp-section: [rsp-begin any [rsp-end break | rsp-section | rsp-html]
]
rsp-rules: [any [rsp-section | html] end]
parse/all template rsp-rules
probe out
halt
[2/23] from: lmecir:mbox:vol:cz at: 17-Mar-2007 17:54
Hi Pekr,
...
> My tags look like:
> <!--[section-x]-->
<<quoted lines omitted: 9>>
> (rsp-tag-nested) as a local variable? Or the issue is more complex and
> my rules simply build insufficiently?
I tried your code below, uncommenting the marked lines, and couldn't see
what you meant by "incorrect behaviour". Can you tell me what you expected?
> ---------------------
> REBOL []
<<quoted lines omitted: 28>>
> probe out
> halt
-L
[3/23] from: petr:krenzelok:trz:cz at: 17-Mar-2007 18:40
OK Ladislav, here we go. I bet I am doing something incorrectly. Simply
put, once the rsp-section recursion is applied for the mark-u section and
returns back one level up, rsp-tag-name remains set to mark-u,
whereas what I would like to achieve is the parser keeping that variable
local to each iteration, so once it returns from the mark-u section
back to finish the parent mark-w section, it would be set to mark-w again.
I tried to enclose the example in a function and define rsp-tag-name as a
local variable, as you can see, but with no luck ... but maybe my approach
is conceptually incorrect anyway :-)
Thanks,
Petr
-----------------------
REBOL []
template: {
this is the beginning of html:-)
<b><!--[mark-x]-->Hello x!<!--/[mark-x]--></b>
<b><!--[mark-y]-->Hello y!<!--/[mark-y]--></b>
<b><!--[mark-z]-->Hello z!<!--/[mark-z]--></b>
<b><!--[mark-w]-->Hello w!
another html code
<!--[mark-u]-->
subsection html code
<!--/[mark-u]-->
finishing mark-w html code
<!--/[mark-w]--></b>
this is the end :-)
}
;parse-test-recursion: func [/local rsp-tag-name][
out: copy ""
rsp-begin: ["<!--[" copy rsp-tag-name to "]-->" "]-->" (print [rsp-tag-name "start"])]
rsp-end: ["<!--/[" rsp-tag-name "]-->" (print [rsp-tag-name "end"])]
;rsp-begin: ["<!--[" s: some [e: "]-->" (print copy/part s e) break | skip]]
;rsp-end: ["<!--/[" s: some [e: "]-->" (print copy/part s e) break | skip]]
;just to distinguish for eventual debugging ...
html: [copy char skip (append out char)]
rsp-html: [copy char skip (append out char)]
rsp-section: [rsp-begin any [rsp-end break | rsp-section (print ["finished subsection" rsp-tag-name]) | rsp-html]]
rsp-rules: [any [rsp-section | html] end]
parse/all template rsp-rules
;]
;parse-test-recursion
probe out
halt
[4/23] from: lmecir:mbox:vol:cz at: 17-Mar-2007 19:12
Petr Krenzelok napsal(a):
> OK Ladislav, here we go. I bet I am doing something incorrectly. Simply
> put, once rsp-section recursion is applied for mark-u section, once it
<<quoted lines omitted: 7>>
> Thanks,
> Petr
you are not too far away, I guess. Instead of using one PARSE, you could
define your own PARSE-TAG function etc.
But, I take this as an opportunity to promote my way ;-). If you take a
look at http://www.fm.vslib.cz/~ladislav/rebol/parseen.r and try:
rsp-section: [
do-block [
use [rsp-tag-name] [
rsp-begin: [...]
rsp-end: [...]
[rsp-begin any [rsp-end (print ["finished section" rsp-tag-name]) break | rsp-section
| rsp-html]]
]
]
]
you may find out, that it works
-L
[5/23] from: volker:nitsch:gm:ail at: 17-Mar-2007 19:22
My typical way to do that is my own stack:
stack: copy []
parse rule [
(insert/only stack reduce [local1 local2])
rule ;recursion
(set [local1 local2] first stack remove stack)
]
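The idiom above can be fleshed out into a small runnable sketch, applied to the comment tags from the first message. This is a hypothetical illustration (rule and variable names are not from the thread), assuming REBOL/Core 2.x semantics; the push/pop parens are the point: push the tag on entering a section, pop it when the section closes, so the parent level sees its own tag again.

```rebol
REBOL []
stack: copy []
rsp-tag: none
rsp-begin: ["<!--[" copy rsp-tag to "]-->" "]-->"
            (insert stack rsp-tag)]                       ; push this level's tag
rsp-end:   ["<!--/[" rsp-tag "]-->"
            (remove stack                                 ; pop it again ...
             if not empty? stack [rsp-tag: first stack])] ; ... restoring the parent's
section: [rsp-begin any [rsp-end break | section | skip]]
parse/all "<!--[w]-->w text<!--[u]-->u text<!--/[u]-->more w<!--/[w]-->"
    [any [section | skip] end]
; after matching mark-u's end tag, rsp-tag is "w" again,
; so the parent's closing <!--/[w]--> still matches
```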
On 3/17/07, Petr Krenzelok <petr.krenzelok-trz.cz> wrote:
> Hi,
> not being much skilled in parsing, I tried to do a little parser for my
<<quoted lines omitted: 58>>
--
-Volker
Any problem in computer science can be solved with another layer of
indirection. But that usually will create another problem.
-- David Wheeler
[6/23] from: lmecir:mbox:vol:cz at: 17-Mar-2007 19:33
Ladislav Mecir napsal(a):
> Petr Krenzelok napsal(a):
>> OK Ladislav, here we go. I bet I am doing something incorrectly. Simply
<<quoted lines omitted: 15>>
> But, I take this as an opportunity to promote my way ;-). If you take a
> look at http://www.fm.vslib.cz/~ladislav/rebol/parseen.r and try:
sorry, correction, you would most probably need copy/deep [...], i.e.
rsp-section: [
do-block [
use [rsp-tag-name] copy/deep [
rsp-begin: [...]
rsp-end: [...]
[rsp-begin any [rsp-end (print ["finished section" rsp-tag-name]) break | rsp-section | rsp-html]]
]
]
]
-L
[7/23] from: petr:krenzelok:trz:cz at: 17-Mar-2007 19:34
Volker Nitsch wrote:
> My typicall way to do that is my own stack.
> stack: copy[]
<<quoted lines omitted: 3>>
> (set[local local2] first stack remove stack)
> ]
Guys, got to go have a few beers tonight with my friends, but - that
kinda sucks ;-) I can guarantee you that most novices, like me, will
swear like mad when developing and touching recursion for the first time.
I think that REBOL's approach of the implicitly global nature of words,
even those inside functions, is the main hell for novices :-)
One quick question - how is it with function recursion? Can I have a
function value local to one level of a recursive call? And then a shared
one? I know I can build my own stack, but ... :-) And if it is possible
with functions, the parser should explicitly behave like that, even if it
breaks REBOL rules :-) Will try with a recursive function or a stack later ...
-pekr-
[8/23] from: moliad:gmai:l at: 17-Mar-2007 19:09
hi Pekr,
I have not completely followed the code part, as it's complex, but the
recursion issue comes from the very nature of parse... parse rules are not
stacked function calls. They are branches of execution with automatic
series-pointer rollback on error. So as you are traversing a series, you
really only jump and come back ... no stack push... as you have no
variables to push in the parse. If REBOL did an explicit copy of the
parse rules (thus localizing each rule at each instance), I can tell you
that the memory consumption and speed drop would not only be dramatic, but
would render the parser unusable for any large dataset handling.
The current parser is a stream analyser... where you decide what to do
next... the fact that the stream has a graph, tree, or recursive data
organisation is not parse's fault.
With the current implementation, we are able to parse 700MB files using
15000 lines of parse rules (cause I know someone who has such a setup) and
it screams!... if we added any kind of copy... any real usage would just
crawl, crash REBOL, and need GBs of RAM. As we speak I have a 1300-line
parse rule which handles several KB of string in 0.02 seconds (or less).
So, this being said, I know parse is a bitch to use at first... hell, I
gave up trying each year for the last 7 years... but for some reason I
gave it another try (again) in the last months... and well, I finally
"GOT" it. It's a strange process, but suddenly it becomes so clear that
everything seems obvious. The only thing I can say (from my own
experience) ... don't give up... really do go to the end of your
implementation and eventually you might GET it too. ;-)
The other thing I can say about parse is that it's usually MUCH easier
(and faster too) to parse the original string data and construct your data
set AS a loadable string. In this way, you just breeze through the data
linearly (VERY FAST, no stack issues) and append all nesting to the
loadable string as you go.
simple generic html loading example:
hope this helps
-MAx
;---------------------------------------------------
rebol []
;-
;- RULES
html: context [
output: ""
;protect the global space
data: none
attr: none
val: none
attrstr: none
alphabet: "abcdefghijklmnopqrstuvwxyz"
not-quotes: complement charset {"}
alpha: union charset alphabet charset "1234567890abcdefghijklmnopqrstuvwxyz_ABCDEFGHIJKLMNOPQRSTUVWXYZ"
nalpha: complement alpha
path: not-quotes ;union alpha charset "%-+:&=./\"
space: charset [#" " #"^/" #"^-"]
spaces: [any space]
attribute: [
spaces
copy attr some alpha {=}
spaces [ copy val some alpha | {"} copy val any not-quotes {"}]
(append output rejoin [attr " {" val "}"])
]
in-tag: [
"<"
copy data some alpha
( append output join data "[^/" )
spaces any attribute spaces
">"
]
out-tag: [ ["</" thru ">"] ( append output "^/]") ]
content: [copy data to "<" (if all [data not empty? trim copy data] [append output rejoin ["{" data "}"]])]
;href: [ spaces {href="} copy attrs any path {"}]
;link-tag: ["<A" spaces some [href (append parsed-links attrs) | attribute ] ">"] ;[[copy ref-url href (print ref-url)] | attribute ]]
rule: [some [content [ out-tag | in-tag ] ] ]
]
parse/all {<html> <body> <h3 >tada</h3><p><FONT color="#000000" > there you go :-)</FONT></p> </body> </html>} html/rule
probe load html/output
html-blk: load html/output
; XPATH anyone ;-)
probe html-blk/html/body/p/font/color
ask "..."
On 3/17/07, Petr Krenzelok <petr.krenzelok-trz.cz> wrote:
[9/23] from: lmecir:mbox:vol:cz at: 18-Mar-2007 17:37
Petr Krenzelok napsal(a):
> Volker Nitsch wrote:
>> My typicall way to do that is my own stack.
<<quoted lines omitted: 18>>
> breaks rebol rules :-) Will try with recursive function or stack later ...
> -pekr-
here are the results obtained:
mark-x start
mark-x end
mark-y start
mark-y end
mark-z start
mark-z end
mark-w start
mark-u start
mark-u end
finished subsection mark-w
mark-w end
...and here is the code:
include http://www.fm.vslib.cz/~ladislav/rebol/parseen.r
template: {
this is the beginning of html:-)
<b><!--[mark-x]-->Hello x!<!--/[mark-x]--></b>
<b><!--[mark-y]-->Hello y!<!--/[mark-y]--></b>
<b><!--[mark-z]-->Hello z!<!--/[mark-z]--></b>
<b><!--[mark-w]-->Hello w!
another html code
<!--[mark-u]-->
subsection html code
<!--/[mark-u]-->
finishing mark-w html code
<!--/[mark-w]--></b>
this is the end :-)
}
out: copy ""
;just to distinguish for eventual debugging ...
html: [copy char skip (append out char)]
rsp-html: [copy char skip (append out char)]
rsp-section: [
use [rsp-begin rsp-end rsp-tag-name] copy/deep [
rsp-begin: ["<!--[" copy rsp-tag-name to "]-->" "]-->" (print [rsp-tag-name "start"])]
rsp-end: ["<!--/[" rsp-tag-name "]-->" (print [rsp-tag-name "end"])]
[rsp-begin any [rsp-end break | do rsp-section (print ["finished subsection" rsp-tag-name]) | rsp-html]]
]
]
parseen/all template [any [do rsp-section | html] end]
probe out
-L
[10/23] from: volker:nitsch:g:mail at: 18-Mar-2007 22:38
Am Samstag, den 17.03.2007, 19:34 +0100 schrieb Petr Krenzelok:
> Volker Nitsch wrote:
> > My typicall way to do that is my own stack.
<<quoted lines omitted: 16>>
> with functions, parser should explicitly behave like that, even if it
> breaks rebol rules :-) Will try with recursive function or stack later ...
I agree completely. That, the missing thru [a | b], and the clumsy
charsets are the biggest showstoppers for beginners IMHO. There are
workarounds, but they need a lot of explaining/insight.
[11/23] from: moliad:gm:ail at: 18-Mar-2007 17:33
thru [a | b] is not only a problem in learning... it's a valid extension,
because it can simplify many complex rules by not having to explicitly
implement all possible variations as rules.
But I am almost sure that it would lead to regexp-like slowdowns in some
rules :-/
why are charsets clumsy?
-MAx
On 3/18/07, Volker <volker.nitsch-gmail.com> wrote:
[12/23] from: petr:krenzelok:trz:cz at: 19-Mar-2007 10:02
> I agree completely. That, the missing thru[a | b] and the clumsy
> charsets are the biggest showstoppers for beginners IMHO. there are
> workarounds, but they need a lot explaining/insights.
>
I know that to/thru [a | b | c] MIGHT slow down the parser. No one says it
has to do 3x find, evaluate the index of each match, and return the one at
the lowest index. It can internally behave just like any [a | b | c | skip].
The thing is, we said that REBOL is about being easy. And the above
addition would make parse easier to understand/use for novices. I am not
sure it would teach them to incline to bad habits. As for me, it is about
being able to use parse, or not using parse at all! Or we need new docs,
explaining more properly how parse works internally, why you can't easily
make your variables local even if you wish to, etc.
My opinion is that when we find ourselves using workarounds or
shortcuts nearly all the time, we need to rethink the concept once
again, extend it, or introduce some mezzanine shortcut. I hope 'parse is
on the radar for R3, and that we get some helpers in there ...
Petr
[13/23] from: volker:nitsch:gm:ail at: 19-Mar-2007 12:46
Am Sonntag, den 18.03.2007, 17:33 -0500 schrieb Maxim Olivier-Adlhoch:
> thru [a | b] is not only a problem in learning... its a valid extension
> cause it can simply many complex rules, by not having to explicitely
<<quoted lines omitted: 3>>
> why are charsets clumsy?
> -MAx
any [ thru [ a | b ] ]
is similar to
any [ a | b | skip ]
and
thru [ a | b ]
to
some [ a break | b break | skip ]
So it's there, if you know how. As with stacks and parse-recursion ;)
Slowdown, yes :)
Charsets are clumsy because they are defined somewhere else and not in the
rule. And they look ugly, this #"c".
digits: charset [ #"0" - #"9" ]
rule: [ some [ digits ] ]
That's natural for BNF academics. But
rule: [ some [ 0 - 9 ] ]
would be much nicer.
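The equivalences above can be tried directly. Here is a small, hypothetical demonstration of the some [a break | b break | skip] idiom emulating thru ["cat" | "dog"] (note that, unlike a real thru, this sketch also succeeds when neither alternative occurs, because the trailing skip consumes the whole input):

```rebol
REBOL []
found: none
thru-cat-or-dog: [some ["cat" (found: "cat") break
                      | "dog" (found: "dog") break
                      | skip]]
parse/all "xxxdogyyy" [thru-cat-or-dog to end]
print found  ; prints: dog
```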
[14/23] from: petr:krenzelok:trz:cz at: 19-Mar-2007 13:14
Volker Nitsch napsal(a):
> My typicall way to do that is my own stack.
> stack: copy[]
<<quoted lines omitted: 3>>
> (set[local local2] first stack remove stack)
> ]
Volker,
thanks for the idea, with a little bit of thinking I might get to the
final result ...
-------------------
REBOL []
template: {
this is the beginning of html:-)
<b><!--[mark-x]-->Hello x!<!--/[mark-x]--></b>
<b><!--[mark-y]-->Hello y!<!--/[mark-y]--></b>
<b><!--[mark-z]-->Hello z!<!--/[mark-z]--></b>
<b><!--[mark-w]-->Hello w!
another html code
<!--[mark-u]-->
subsection html code
<!--[mark-q]-->
subsubsection html code
<!--/[mark-q]-->
<!--/[mark-u]-->
finishing mark-w html code
<!--/[mark-w]--></b>
this is the end :-)
}
stack: context [
add: func [values][insert/only stack reduce values]
remove: func [values][system/words/remove stack if not empty? stack [set values first stack]]
stack: copy []
probe: does [system/words/probe stack]
]
out: copy ""
rsp-begin: ["<!--[" copy rsp-tag-name to "]-->" "]-->" (print [rsp-tag-name "start"]) (stack/add [rsp-tag-name])]
rsp-end: ["<!--/[" rsp-tag-name "]-->" (print [rsp-tag-name "end"]) (stack/remove [rsp-tag-name])]
;just to distinguish for eventual debugging ...
html: [copy char skip (append out char)]
rsp-html: [copy char skip (append out char)]
rsp-section: [rsp-begin any [rsp-end break | rsp-section (print ["back at section" rsp-tag-name]) | rsp-html]]
rsp-rules: [any [rsp-section | html] end]
parse/all template rsp-rules
probe out
halt
[15/23] from: petr:krenzelok:trz:cz at: 19-Mar-2007 13:29
> charsets are clumsy because they are defined somewhere and not in the
> rule. And they are look ugly, this #"c".
<<quoted lines omitted: 3>>
> rule: [ some [ 0 - 9 ] ]
> would be much nicer.
Hopefully also rule: [some [0 .. 9]] as a new range datatype in R3? :-)
Petr
[16/23] from: anton:wilddsl:au at: 19-Mar-2007 23:34
How about this possible notation:
#"0-9"
which should create a charset just as:
charset [#"0" - #"9"]
Anton.
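Until a literal like that exists, a tiny helper can approximate the proposed notation today. range-charset below is a hypothetical sketch, not an existing function; it assumes a three-character "lo-hi" spec string:

```rebol
REBOL []
; range-charset "0-9" builds the same bitset as charset [#"0" - #"9"]
range-charset: func [spec [string!]][
    charset reduce [spec/1 '- spec/3]
]
digit: range-charset "0-9"
print parse/all "2007" [some digit]  ; prints: true
```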
[17/23] from: christian:ensel:gmx at: 19-Mar-2007 22:22
> I know that to/thru [a | b | c] MIGHT slow down parser. Noone says, it
> has to do 3x find, evaluate index of match and return at lowest index.
> It can internally behave just like any [a | b | c | skip]. The thing is,
> that we said, that REBOL is about being easy. And above addition would
> make parse easier to understand/use for novices.
I think it would be really nice for all-but-guru-level parse-rule
authors to have the TO [RULE-1 | RULE-2 | ... | RULE-N] and THRU
[RULE-1 | RULE-2 | ... | RULE-N] rules working as abbreviations the way
Petr describes them - matching rules in the order they are given instead
of matching at the lowest index. Just point out in prominent places that
the rules are greedy and that the first matching rule gets applied.
It's just more readable than Volker's ANY [RULE-1 | RULE-2 | ... |
RULE-N | SKIP] and SOME [RULE-1 BREAK | RULE-2 BREAK | ... | RULE-N BREAK
| SKIP] idioms. Which, by the way, I've just printed, framed and hung
up over my bed so that hopefully I'll never ever forget them ...
Anyway, dreaming of Petr's TO and THRU, what immediately springs to mind
then are TO/FIRST [RULE-1 | RULE-2 | ... | RULE-N] and THRU/FIRST
[RULE-1 | RULE-2 | ... | RULE-N] working in the reg-ex way. Looks to
me like a natural extension of the parse dialect without breaking
existing PARSE rules.
-- Christian
[18/23] from: petr:krenzelok:trz:cz at: 19-Mar-2007 22:54
> Anyway, dreaming of Petr's TO and THRU, what immediatly springs to mind
> then are TO/FIRST [RULE-1 | RULE-2 | ... | RULE-N] and THRU/FIRST
> [RULE-1 | RULE-2 | ... | RULE-N] working in the reg-ex way. Looks to
> me like a natural extension of the parse dialect without breaking
> existing PARSE rules.
>
Christian, actually what I had in mind was "the index" kind of thing,
applying the FIRST occurrence of rule-1 | rule-2 | etc. IIRC, when I first
proposed the addition of the above, I called the new word inside the parse
dialect 'FIRST. That enhancement proposal was on-line at Robert's or
Nenad's site, I don't remember. Or look here:
http://www.colellachiara.com/soft/Misc/parse-rep.html
http://www.fm.vslib.cz/~ladislav/rebol/rep.html
Petr
[19/23] from: moliad:gma:il at: 19-Mar-2007 17:08
btw,
what people do not immediately see is that to and thru do not match rules,
they only "match" charsets or strings; they really are like find and
find/tail.
Actually, to and thru really only skip content, they don't match it, which
is a big difference from the rest of parse, which must match it. Which is
why parsing is so hard: one must identify all the patterns which can match,
often inside out. I too was tempted into using to and thru when I
started... but I quickly understood that I could not go very far with them,
and in fact that is probably one of the reasons I initially didn't get to
use parse.
Allowing rules here would be extremely powerful, but also very taxing and
possibly even impossible, as it would mean actually trying all the rules at
every byte. The first index at which any of the rules matches completely
would be returned... the compound effect of having many of these rules
could lead to impossible parses or exponentially slow rules, which is
usually not the case with even very large parse dialects.
This is probably very close to internal regexp use in fact, but also why
regexp is so slow and, as such, becomes almost unusable on any long string,
or when a few conditional rule depths are built into any serious regexp
string... I've used regexp and was dumbfounded by how quickly it slows
down... even on current computers.
Not trying to burst the bubble, just trying to explain why some of the
things are like they are. Parse is meant to be screaming fast, for many
reasons... a lot of REBOL is built using it (View, LNS, etc).
We could argue that adding those things adds to the options, true, but
they might also become the de facto use, since they are easier to adopt,
yet in the long run they might give a bad view of parse, which would
become "so slow". And few of us would learn and use the "real" parsing.
Funny I'm sooo opinionated when a few months ago I was still clueless, ;-)
-MAx
On 3/19/07, Christian Ensel <christian.ensel-gmx.de> wrote:
[20/23] from: christian:ensel:gmx at: 20-Mar-2007 0:14
> Christian, actually what I had in mind was "the index" kind of thing
> applying the FIRST occurrence of rule1 | rule 2 | etc. IIRC, when I first
> proposed addition of above, I called new word inside parse dialect
> 'FIRST.
Petr, yes, I imagined FIRST too, but then had difficulties coming up
with a second (no pun intended) word to go for the difference between TO
and THRU. Hence the refinement suggestion.
But, Max is of course right in suggesting to leave TO and THRU untouched
for the sake of parsing speed alone.
> we could argue that adding those things add to the options, true, but they
> might also become the de facto use, since they are easier to adopt, yet in
> the long run, might give a bad view of parse, which becomes "so slow". And
> few of use would learn and use the "real" parsing.
I don't think I can agree with the reasoning on why to shy away from
including some means to use PARSE the reg-ex way, though, Max. There
might be reasons not to do it, be it limited resources at RT or
other. But not to feature them just to prevent inappropriate use looks a
bit overcautious to me. (I don't see hammers getting too much bad press
because of those people trying to drive screws into walls with them.
It's just not the hammer's fault; and most people know this.)
At least I'm confident that technically there's a way to include
something as suggested without negative impact on existing PARSE
usage. Be it a new keyword MATCH [RULE-1 | RULE-2 | ...] with /TO
xor /THRU and an optional /FIRST refinement, or whatever else. I've seen
people asking for something like this for so many years now ...
- Christian
[21/23] from: petr:krenzelok:trz:cz at: 20-Mar-2007 7:21
Christian Ensel wrote:
>> Christian, actually what I had in mind was "the index" kind of thing
>> applying FIRST occurance of rule1 | rule 2 | etc. IIRC, when I first
<<quoted lines omitted: 4>>
> with a second (no pun intended) word to go for the difference between TO
> and THRU. Hence the refinement suggestion.
Christian, the problem is that 'parse can't handle paths, or am I
wrong? So the refinement way is unlikely. But maybe I am wrong? Could
anyone elaborate on whether our keywords could eventually use refinements?
Petr
[22/23] from: petr:krenzelok:trz:cz at: 20-Mar-2007 7:41
Maxim Olivier-Adlhoch wrote:
> btw,
> what people do not immediately see is that to and thru do not match rules,
<<quoted lines omitted: 6>>
> started... but I quickly understood that I could not go very far with parse,
> and in fact, probably one of the reasons I eventually didn't get to use it.
Max, what do you mean by "only skip content"? I am not a C guru, but how
do you think 'find actually works? IMO it HAS TO check each char
of unindexed content to find out if the string you search for is
contained in the searched string, no? And if so, it actually works in
the "match" way anyway?
The obstacle with to/thru is exactly the lack of the lowest-index match
being returned first. Because if you search [[to "some string" | to "other
string"]], you usually want to stop at the lowest index, whereas such a
rule would apply to whatever "some string" occurrence comes first.
Now who says that to/thru [a | b | c] should work the 3x-'find way
internally? My understanding is that you expect it would
do index? find a, index? find b, index? find c, find the minimum of the
indexes, and return that rule as the match. So yes, there it is a slowdown,
because if the string is long and you have many options to search
for, it can get slow.
But why should it necessarily work that way? Internally it could work the
[a | b | c | skip] way, no? Or am I missing something?
PS: I would like some official places (Carl, Gabriele, Ladislav) to tell
us what is planned for parse. In fact, I started this thread to raise
the community voice, because I think this topic was raised in 2001
already? However, I am not sure the ML is not a kind of dead channel to RT;
I do not remember when Carl was last here, so I am not sure if RT tracks it.
But at least this topic is covered here, so users can search it on
rebol.org, which is always good ...
Petr
[23/23] from: chris-ross:gill at: 20-Mar-2007 8:13
Petr Krenzelok wrote:
> Christian, the problem is, that 'parse can't handle paths, or am I
> wrong? So - the refinement way is unlikely. But maybe I am wrong? Could
> anyone elaborate, if our keywords could use refinements eventually?
Parse can now handle paths. Currently refinements are literal (that is
-- parse [/local] [/local] -- holds), so I don't see any reason they
couldn't technically be used in the dialect. Whether they should or
not is another matter...
Also, if the feature was implemented, I'd be in favour of 'first-of --
eg. [first-of [rule-1 | rule-2]] -- seems a little more descriptive
than 'first or 'match.
- Chris
Notes
- Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted