[REBOL] Re: A little parse help

From: brian:hawley at: 28-Aug-2001 1:36

A little late, but...

At 08:21 PM 8/24/01 +0200, Stefan Falk wrote:
>Hi,
>thanks, this worked!
>
>just two questions,
>what's the last :mark1 there for?

At every step of the parse process, there are two implicit parameters: the series that you are processing and the current position within that series. In parse rules you can assign the series (at its current position) to a word (x) by putting the set-word (x:) in the rules at a given point. You can also reset the implicit parse series (and position) to the value assigned to a word by putting the get-word (:x) at a given point in the rules.

If you are changing the series you are working on while you are parsing it, you need to make sure that parse is able to keep track of its implicit position setting. This is not a problem if you are changing the series at or after the implicit position, like this:

    [to "foo" x: (remove/part x 3)]

In this case, the implicit position at the point where x is set is before the part of the series that is being changed, so parse is not going to get confused.

However, if the implicit parse position is after the part of the series that is being changed, like this:

    [to "<foo" x: thru ">" y: (remove/part x y)]

then parse is going to get confused about its implicit position, especially if the length of the series is any different as a result of the change. To deal with this you have to reset the parse position after such changes, like this:

    [to "<foo" x: thru ">" y: (z: remove/part x y) :z]

Does that make sense?
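
To see the position reset at work, here is a minimal sketch that strips every <foo ...> tag from a string. It assumes REBOL 2 (hence parse/all); the sample text and the word strip-foo are made up for illustration and are not from the original thread.

```rebol
; Hypothetical sample data, not from the original thread.
text: copy "keep <foo a=1> this <foo b=2> too"

strip-foo: [
    any [
        ; x marks the start of the tag, y the position just past ">"
        to "<foo" x: thru ">" y:
        ; remove/part returns the series at the removal point (z);
        ; :z then resets parse's implicit position to that point
        (z: remove/part x y)
        :z
    ]
    to end
]

parse/all text strip-foo
print text   ; keep  this  too
```

Without the trailing :z, parse would carry on from the old position y, which no longer lines up with the shortened series.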

>and how do I change it to parse until <br> or a space " "?
>
>("<br>" | " ") or (to "<br>" | to " ") doesn't seem to work..

This is a common problem. The general workaround to the problem of scanning until the first of a set of alternate values (a follow set) is to refactor the problem into one of scanning through the values that aren't in the set that you are scanning for. That may sound confusing.

In your case, it would be easier if you chose to stop at any html tag, not just <br>. Then you could just scan to the "<" character and use code like this:

    url-chars: complement charset {< "^(tab)^(newline)}
    rule: [to "http://" x: some url-chars y: (do something)]

Note that url-chars is the complement of the set of chars in that string, or all chars _not_ in the follow set.

If you can't distinguish your follow set by looking at one character at a time (say you only want to go to <br> tags but skip other tags) then you have two solutions. You may be able to extend the previous charset solution with more charsets that exclude each of the rest of the letters in the values of the follow set - awkward, but it can be fast for simple follow sets. Or, you can refactor your subrules using tail recursion, like this:

    non-tag-char: complement charset "<"
    url-chars: [
        some non-tag-char [end | "<br>" | "<" url-chars]
    ] ; Note the tail recursive reference in the last part
    rule: [to "http://" x: url-chars y: (do something)]

Here's a better example, printing out the first paragraph in html, including nested paragraphs, assuming proper closure:

    non-lt: complement charset "<"
    p-rule: [
        "<p" [">" | " " thru ">"] ; Consume tag
        p-rule-cont ; Continue
    ]
    p-rule-cont: [
        ; Consume non-tag characters
        any non-lt
        [
            "</p>" ; Close tag
            | p-rule p-rule-cont ; Nested paragraph, continue
            | "<" p-rule-cont ; Something else, continue
        ]
    ]
    rule: [to "<p" copy tmp p-rule (print tmp) to end]

There are a few factors to note in this example:

- You need to make sure that you have a fix-point, a point at which the recursion will stop, in this case the end tag.
- You need to make sure that every recursive rule will at least consume something before recursing, or it won't stop until the stack overflows.
- Parse doesn't backtrack through parens (embedded code). This means that you should put off the embedded code until the point that you can be sure that you have recognized the correct alternate - in this case, after the rule.
- Parse does a better job of minimizing recursion overhead than the regular REBOL interpreter does, so this recursion isn't as likely to overflow the stack.

I hope this all helps

Brian Hawley
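
The single-character follow-set approach described above can be sketched as a small URL extractor. This is only an illustration, assuming REBOL 2: the (do something) placeholder is filled in with a hypothetical action that collects each match into a block named urls, and the input string is invented for the example.

```rebol
; Every character that may appear inside a URL: anything NOT in the follow set
url-chars: complement charset {< "^(tab)^(newline)}

urls: copy []   ; hypothetical collector, not part of the original rule

rule: [
    any [
        ; x marks the start of the URL, y the first follow-set character
        to "http://" x: some url-chars y:
        (append urls copy/part x y)
    ]
    to end
]

; Invented sample input
parse/all {See http://www.rebol.com<br>and http://www.rebol.org now} rule
probe urls   ; ["http://www.rebol.com" "http://www.rebol.org"]
```

The first URL ends at the "<" of <br> and the second at a space, so both kinds of terminator are handled by the same charset.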