r3wp [groups: 83 posts: 189283]

World: r3wp

[Parse] Discussion of PARSE dialect

Yes, Brett has built a lot of very cool stuff. Haven't seen him around 
for a while though.
So, this is my first attempt to do regular expressions in REBOL. 
Type on your console:
do http://bolek.techno.cz/reb/regex.r

Some things are missing and it can sometimes run in an endless loop 
when it shouldn't, so please be benevolent :)

But at least the email regex can be translated and parsed successfully.
it's quite small
nice:)... just you should print regex not regexp in the example (or 
rename the function)
and... it would be good to have just a function which returns the 
translated Rebol parse block
regex rules look quite complicated!
Graham yes, they are :)
Oldes: regex vs. regexp on google is 12mil vs 8mil, so that's the 
reason, but I can rename it (it was called regexp yesterday ;)
And yes, function returning just parse rules will be done, this is 
just a work in progress
I don't mind what you call the function... it's just that the printed 
one is different from the one used, which is confusing, as the first 
thing I did was copy and paste what you printed, and I got an error 
that the regexp function does not exist
and anyway... 12 or 8 million Google results is not a big difference 
if your page is not listed within the first 20 pages :)
oh I see now what you mean... that print is from the test function, I'll 
change it. And that Google score - I used it just to compare which 
name would be better, I in no way expect to be there with my 
page :) It's just what I do when I'm not sure how some word is spelled 
(svatba vs. svadba and so on ;) - I put both terms in Google and 
the one with the better score is probably the right one ;)
you can use... http://www.googlefight.com/ or make a Rebol version... 
it's quite easy
if you were talking about bitsets.... it reminded me that it would 
be good to have some common rules in Rebol3 available for parsing...
I had googlefight in Krabot, if you remember :)
like characters and digits?
I mean.... I have to write.... spaces: charset " ^/^-^M" and so on 
in so many places in code
and whitespaces, yes
it could be working like in stylize in VID
but maybe it's not so important...
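A minimal sketch of what such shared rules could look like, assuming REBOL 2. The name common-rules and the particular sets in it are hypothetical, chosen only to illustrate the idea; nothing like this is confirmed to ship with REBOL.

```rebol
REBOL [Title: "Shared parse charsets (illustrative sketch)"]

; Hypothetical shared rule set: the name COMMON-RULES is an assumption
common-rules: context [
    digit:    charset [#"0" - #"9"]
    letter:   charset [#"a" - #"z" #"A" - #"Z"]
    spaces:   charset " ^/^-^M"    ; space, newline, tab, carriage return
    alphanum: union letter digit
]

; COMPOSE drops the bitset values straight into the rule block,
; so the rule does not depend on path lookup inside PARSE
number-rule: compose [
    any (common-rules/spaces)
    some (common-rules/digit)
    any (common-rules/spaces)
]

print parse/all "  123^/" number-rule    ; expected: true
```

Defining the sets once in a context keeps repeated charset definitions out of every script that needs them.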
in the file i posted is a function REGSET that converts a small bit 
of regex to a bitset, its syntax seems to be easier than charset's 
syntax (charset [#"a" - #"z" #"0" - #"9"] vs regset "a-z0-9")
Very nice Boleslav! What regex engine/syntax are you going for compatibility 
with (if any)?

Charset syntax is probably that way because it's a dialect, and Carl 
wanted a string as input to be easy, without escapes and such; just 
my guess.
Graham, the best book I know of on regexs is Jeff Friedl's 'Mastering 
Regular Expressions'. He has an email validating regex (i.e. it just 
matches the RFC822 spec for an email address) which is almost 5K 
Thanks Gregg. I started with some examples from http://regular-expressions.info, 
just to see if I can do it, so after fixing bugs and adding some 
features that are still missing, I'll see what's next, if anything. There are 
already some incompatibilities - caret ^ cannot be used in rebol 
strings the way it's used in regex, so i'm using tilde ~ instead.
There are several different regex dialects. Are you following one 
of those, or making another?
(Running commentary as I read your code)
You should wrap your code in a context.
You should separate the regex compilation phase from its application 
phase, and just write a wrapper that calls both in order. The compilation 
phase is often more complex than just applying the results, so if 
you are using the regex repeatedly you should just compile it once.
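The compile-once, apply-many split could be sketched like this, assuming REBOL 2. This toy compiler handles only literal characters and the "." wildcard; the function names compile-regex, match-regex and regex-match? are hypothetical and are not taken from regex.r.

```rebol
REBOL [Title: "Compile once, apply many times (toy sketch)"]

; "." in a regex: any character except newline
not-newline: complement charset "^/"

; Compilation phase: turn the regex string into a PARSE rule block.
; Only literals and "." are supported here (an assumption of this sketch).
compile-regex: func [regex [string!] /local rules] [
    rules: copy []
    foreach ch regex [
        append rules either ch = #"." [not-newline] [ch]
    ]
    rules
]

; Application phase: run already compiled rules against a string
match-regex: func [rules [block!] text [string!]] [
    parse/all text rules
]

; Convenience wrapper that calls both phases in order
regex-match?: func [regex [string!] text [string!]] [
    match-regex compile-regex regex text
]

; Compile once, then reuse the rules for every input
rules: compile-regex "c.t"
foreach s ["cat" "cot" "cart"] [print [s match-regex rules s]]
```

With the rules kept in a variable, the cost of translating the regex is paid once no matter how many strings are tested.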
i like regset, in theory. You seem to be applying block parse syntax 
to strings - it could be simpler.
I wanted to do some basic regex set, some common denominator, partly 
as an exercise in parse and partly to stop people looking at rebol 
saying "it has no regex". so now it has ;)
So I'm not sure what dialect of regex is the best one.

Block parse syntax to strings - you mean that char! ? Yes, it's probably 
not doing anything, the code needs some cleanup.
I plan to wrap it in context, yes.
The char! 1 is useless. I am looking at that code now (reworking 
Oldes, I thought about just a translator from regex to parse rules 
and I'm not sure it will be easy, I'm using my 'tail-parse that matches 
rules in reversed order, which is better for regex syntax. Maybe there's 
some other way.
Reverse order is better? I think you aren't backtracking right in 
your generated rules.
this is the problem with [some "a" "a"]. This is equivalent of "a*a" 
in regex which is perfectly valid, but problematic in parse. This 
is simple example, but it can get quite complicated so I'm not sure 
I can handle all cases. The reversed order seemed simpler. But you 
will probably prove me wrong :)
Are you supporting grouping? Haven't gotten to that point in the 
code yet.
Do regex dialects support decreasing character ranges in their sets?
grouping - just the bitsets created with regset, groups like [cat|dog] 
not yet supported
Which regex dialect does grouping with [ and ], I thought they used 
( and ).
Are you supporting . wildcard characters?
oh yes you're right...just an old rebol habit, using square brackets 
everywhere :)
wildcards - well dot is any character except newline
Right now it should support regex according to http://www.regular-expressions.info/quickstart.html
more or less
BTW, "a*a" is directly equivalent to [any "a" "a"], not some.
It still won't work directly.
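A small sketch of the problem, assuming REBOL 2: ANY consumes every "a" and never gives one back, so the trailing "a" has nothing left to match. One possible rewrite (an illustration only, not necessarily what tail-parse does) is a recursive rule, hypothetically named star-a-then-a here, whose alternative gives parse a place to fall back to.

```rebol
REBOL [Title: "a*a in parse (illustrative sketch)"]

; Direct translation: ANY takes all three characters,
; then the final "a" fails at the end of the input
print parse/all "aaa" [any "a" "a"]      ; expected: false

; Recursive rewrite of a*a: either an "a" followed by more of the
; same rule, or the final "a". When the first alternative fails,
; parse retries from the same position with the second one.
star-a-then-a: ["a" star-a-then-a | "a"]

print parse/all "aaa" star-a-then-a      ; expected: true
print parse/all "a" star-a-then-a        ; expected: true
```

The same shape, x*y as rule: [x rule | y], should carry over to other patterns, though generating it automatically for every regex is a separate question.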
Rebolek, I am rewriting the regset function. Do you want to support 
decreasing ranges or do you want the characters to be added individually 
like they are with charset? I can do either.
; Version with support for decreasing ranges
regset: func [expression /local out negate? b e x] [
    negate?: false
    out: make bitset! []
    parse/all expression [
        opt ["~" (negate?: true)]
        some [
            "-" (insert out #"-") |
            b: skip "-" e: skip (
                b: first b  e: first e
                ; x starts at the lower bound; loop runs once per character
                loop 1 + (
                    either b > e [b - x: e] [e - x: b]
                ) [
                    insert out x
                    x: 1 + x
                ]
            ) |
            x: skip (insert out first x)
        ]
    ]
    if negate? [out: complement out]
    out  ; return the finished bitset
]

; Version without support for decreasing ranges
regset: func [expression /local out negate? b e x] [
    negate?: false
    out: make bitset! []
    parse/all expression [
        opt ["~" (negate?: true)]
        some [
            "-" (insert out #"-") |
            b: skip "-" e: skip (
                b: first b  e: first e
                either b > e [
                    ; decreasing range: add the three characters individually
                    insert insert insert out b #"-" e
                ] [
                    loop 1 + e - b [
                        insert out b
                        b: 1 + b
                    ]
                ]
            ) |
            x: skip (insert out first x)
        ]
    ]
    if negate? [out: complement out]
    out  ; return the finished bitset
]
Most of the changes were made to make it faster and to use less memory:
- It is faster for parse to match a one-character string than a character.
- Insert is faster than union, and makes no temporaries.
- If you are capturing a single character, I think [a: skip (a: first 
a)] is faster than [copy a skip (a: first a)].
- Path access is slower than the equivalent native, so [first a] 
instead of [a/1].
- The fastest loop is loop, even with the math to calculate the number 
of times.
I may end up with the wrong data type to do the [x: 1 + x] rather 
than [x: x + 1], same with b in the second version.
Aside from the one-time bind, repeat may be faster than loop with 
a self-incremented index.
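Speed claims like these are easy to check on your own machine. A rough timing helper, assuming REBOL 2; the name time-it and the iteration count are hypothetical, and the numbers will vary between versions and machines, so no results are shown.

```rebol
REBOL [Title: "Rough timing helper (illustrative sketch)"]

; Run BODY n times and return the elapsed time as a time! value
time-it: func [n [integer!] body [block!] /local start] [
    start: now/precise
    loop n body
    difference now/precise start
]

a: "x"

; Native versus path access to the first character
print ["first a:" time-it 1000000 [first a]]
print ["a/1:    " time-it 1000000 [a/1]]

; LOOP with a self-incremented index versus REPEAT
print ["loop:   " time-it 100 [i: 0 loop 10000 [i: i + 1]]]
print ["repeat: " time-it 100 [repeat i 10000 []]]
```

Single runs are noisy, so it is worth repeating each measurement a few times before drawing conclusions.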