Mailing List Archive: Re: A little parse help

[REBOL] Re: A little parse help

From: brian:hawley at: 28-Aug-2001 1:36


A little late, but...

At 08:21 PM 8/24/01 +0200, Stefan Falk wrote:
>Hi,
>thanks, this worked!
>
>just two questions,
>what's the last :mark1 there for?

At every step of the parse process, there are two implicit
parameters: the series that you are processing and the
current position within that series. In parse rules you
can assign the series (at its current position) to a word
(x) by putting the set-word (x:) in the rules at a given
point. You can also reset the implicit parse series (and
position) to the value assigned to a word by putting the
get-word (:x) at a given point in the rules.
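
As a rough illustration of the two markers (a minimal
sketch, assuming REBOL 2 and parse/all so that whitespace is
treated as ordinary input; the words text and again are
made up for this example):

```rebol
REBOL []

text: "abc foo def"

; x: records the series at the current parse position.
parse/all text [to "foo" x: to end]
print x            ; foo def
print index? x     ; 5

; :x moves the parse position back to where x points,
; so the same input can be matched a second time.
parse/all text [to "foo" x: thru "foo" :x copy again to end]
print again        ; foo def
```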

If you are changing the series you are working on while
you are parsing it, you need to make sure that parse is
able to keep track of its implicit position setting. This
is not a problem if you are changing the series at or
after the implicit position, like this:

   [to "foo" x: (remove/part x 3)]

In this case, the implicit position at the point where x
is set is at the start of the part of the series that is
being changed, so parse is not going to get confused.
However, if the
implicit parse position is after the part of the series
that is being changed, like this:

   [to "<foo" x: thru ">" y: (remove/part x y)]

then parse is going to get confused about its implicit
position, especially if the length of the series is any
different as a result of the change. To deal with this
you have to reset the parse position after such changes,
like this:

   [to "<foo" x: thru ">" y: (z: remove/part x y) :z]

Does that make sense?
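
Here is one way the reset might look in a complete loop (a
sketch only, assuming REBOL 2; the sample string and the
word html are invented for the illustration):

```rebol
REBOL []

html: copy {one <foo a="1"> two <foo> three}

parse/all html [
    any [
        to "<foo" x: thru ">" y:
        ; remove/part returns the series at the spot where
        ; the tag used to be; keep that in z.
        (z: remove/part x y)
        ; Reset the parse position to z before continuing.
        :z
    ]
    to end
]
print html   ; one  two  three
```

Without the :z, parse would carry on from the old y offset,
which now points further along in the shortened series.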

>and how do I change it to parse until <br> or a space " "?
>
>("<br>" | " ") or (to "<br>" | to " ") doesn't seem to work..

This is a common problem. The general workaround to the
problem of scanning until the first of a set of alternate
values (a follow set) is to refactor the problem into one
of scanning through the values that aren't in the set that
you are scanning for. That may sound confusing.

In your case, it would be easier if you chose to stop at
any html tag, not just <br>. Then you could treat the "<"
character as a terminator and use code like this:

   url-chars: complement charset {< "^(tab)^(line)}
   rule: [to "http://" x: some url-chars y: (do something)]

Note that url-chars is the complement of the set of chars
in that string, or all chars _not_ in the follow set.
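
A small runnable version of that idea (a sketch, assuming
REBOL 2; the sample text is invented, the tab and newline
are written with the short escapes ^- and ^/, and the
"do something" step is replaced by a plain print):

```rebol
REBOL []

; Everything except "<", space, double quote, tab, newline.
url-chars: complement charset {< "^-^/}

text: {see http://www.rebol.com/docs.html<br>for details}

; x marks the start of the url, y the first follow-set char.
parse/all text [to "http://" x: some url-chars y: to end]
print copy/part x y   ; http://www.rebol.com/docs.html
```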

If you can't distinguish your follow set by looking at one
character at a time (say you only want to go to <br> tags
but skip other tags) then you have two solutions. You may
be able to extend the previous charset solution with more
charsets that exclude each of the rest of the letters in
the values of the follow set - awkward, but it can be fast
for simple follow sets. Or, you can refactor your subrules
using tail recursion, like this:

   non-tag-char: complement charset "<"
   url-chars: [
     some non-tag-char [end | "<br>" | "<" url-chars]
   ] ; Note the tail recursive reference in the last part
   rule: [to "http://" x: url-chars y: (do something)]
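
Note that in the rule as written, y: ends up after the
"<br>", because url-chars consumes the tag. If the end mark
should sit before the tag, one possible variation (a sketch
under that assumption, REBOL 2, invented sample text) is to
set y inside url-chars, just before each terminator:

```rebol
REBOL []

non-tag-char: complement charset "<"
url-chars: [
    some non-tag-char
    ; y: is set before the terminator is tested, so the
    ; last value it gets is the one from the alternative
    ; that finally matches.
    [y: end | y: "<br>" | "<" url-chars]
]

text: {go to http://example.com/a<b>c<br>next}
parse/all text [to "http://" x: url-chars to end]
print copy/part x y   ; http://example.com/a<b>c
```

The "<b>" is skipped by the recursive alternative, and only
"<br>" (or the end of the input) stops the scan.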

Here's a better example, printing out the first paragraph in
html, including nested paragraphs, assuming proper closure:

   non-lt: complement charset "<"
   p-rule: [
     "<p" [">" | " " thru ">"] ; Consume tag
     p-rule-cont ; Continue
   ]
   p-rule-cont: [
     ; Consume non-tag characters
     any non-lt [
       "</p>" ; Close tag
       | p-rule p-rule-cont ; Nested paragraph, continue
       | "<" p-rule-cont ; Something else, continue
     ]
   ]
   rule: [to "<p" copy tmp p-rule (print tmp) to end]

There are a few factors to note in this example:
- You need to make sure that you have a fix-point, a point
   at which the recursion will stop, in this case the end tag.
- You need to make sure that every recursive rule will at
   least consume something before recursing, or it won't
   stop until the stack overflows.
- Parse doesn't undo parens (embedded code) when it
   backtracks; code that has already run stays run. This
   means that you should put off the embedded code until
   the point that you can be sure that you have recognized
   the correct alternate - in this case, after the rule.
- Parse does a better job of minimizing recursion overhead
   than the regular REBOL interpreter does, so this recursion
   isn't as likely to overflow the stack.
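
The point about parens can be seen in a tiny test (a sketch,
assuming REBOL 2; the word log and the input "abd" are made
up for the illustration):

```rebol
REBOL []

log: copy []

parse/all "abd" [
    ; This paren runs as soon as "ab" matches, even though
    ; the alternative then fails on "c".
    "ab" (append log "early") "c"
    ; Here the paren comes last, so it only runs once the
    ; whole alternative has matched.
    | "ab" "d" (append log "late")
]
probe log   ; ["early" "late"]
```

The "early" entry stays in the log even though parse backed
out of the first alternative, which is why the embedded code
is best placed after the part that decides the match.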

I hope this all helps.

Brian Hawley