r3wp [groups: 83 posts: 189283]
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

World: r3wp

[Parse] Discussion of PARSE dialect

BrianH
26-May-2007
[1851x12]
Rebolek, I am rewriting the regset function. Do you want to support 
decreasing ranges or do you want the characters to be added individually 
lick they are with charset? I can do either.
; Version with support for decreasing ranges
regset: func [expression /local out negate? b e x] [
    negate?: false
    out: make bitset! []
    parse/all expression [
        opt ["~" (negate?: true)]
        some [
            "-" (insert out #"-") |
            b: skip "-" e: skip (
                b: first b  e: first e
                loop 1 + (
                    either b > e [b - x: e] [e - x: b]
                ) [
                    insert out x
                    x: 1 + x
                ]
            ) |
            x: skip (insert out first x)
        ]
    ]
    if negate? [out: complement out]
    out
]

; Version without support for decreasing ranges
regset: func [expression /local out negate? b e x] [
    negate?: false
    out: make bitset! []
    parse/all expression [
        opt ["~" (negate?: true)]
        some [
            "-" (insert out #"-") |
            b: skip "-" e: skip (
                b: first b  e: first e
                either b > e [
                    insert insert insert out b #"-" e
                ] [
                    loop 1 + e - b [
                        insert out b
                        b: 1 + b
                    ]
                ]
            ) |
            x: skip (insert out first x)
        ]
    ]
    if negate? [out: complement out]
    out
]
Most of the changes were made to make it faster and to use less memory 
overhead.

- It is faster for parse to match a one-character string than a character 
value.
- Insert is faster than union, and makes no temporaries.

- If you are capturing a single character, I think [a: skip (a: first 
a)] is faster than [copy a skip (a: first a)].

- Path access is slower than the equivalent native, so [first a] 
instead of [a/1].

- The fastest loop is loop, even with the math to calculate the number 
of times.
I may end up with the wrong data type to do the [x: 1 + x] rather 
than [x: x + 1], same with b in the second version.
Aside from the one-time bind, repeat may be faster than loop with 
a self-incremented index.
I haven't tried to support escaped characters or subrange checks 
yet. The subrange check may be faster to support by copy/part on 
the source, or faster yet by inlining the code into the greater regex 
engine. I'm not sure how the helper variables would be affected by 
the tasking support of R3, but they sure aren't reentrant on R2.
The standard backtracking of parse only happens upon alternation. 
To support the *, + and ? behavior of regexes, you either have to 
roll your own backtracking or have the compiler convert to using 
first and follow sets. In contrast to how your links describe the 
behavior of regex engines, it might be easier to make parse support 
lazy rather than greedy behavior of its iterators - they are greedy 
by default, but so much so that the first and follow sets shouldn't 
overlap.
It might be a good idea to run a peephole optimizer on the patterns 
before compiling them, to convert ones like "a*a" to "aa*".
It looks like repeat doesn't bind its argument, so it is definitely 
faster in this case.
; Version with support for decreasing ranges
regset: func [expression /local out negate? b e x] [
    negate?: false
    out: make bitset! []
    parse/all expression [
        opt ["~" (negate?: true)]
        some [
            "-" (insert out #"-") |
            b: skip "-" e: skip (
                b: first b  e: first e
                either b > e [
                    insert out e
                    repeat x b - e [insert out e + x]
                ] [
                    insert out b
                    repeat x e - b [insert out b + x]
                ]
            ) |
            x: skip (insert out first x)
        ]
    ]
    if negate? [out: complement out]
    out
]

; Version without support for decreasing ranges
regset: func [expression /local out negate? b e x] [
    negate?: false
    out: make bitset! []
    parse/all expression [
        opt ["~" (negate?: true)]
        some [
            "-" (insert out #"-") |
            b: skip "-" e: skip (
                b: first b  e: first e
                either b > e [
                    insert insert insert out b #"-" e
                ] [
                    insert out b
                    repeat x e - b [insert out b + x]
                ]
            ) |
            x: skip (insert out first x)
        ]
    ]
    if negate? [out: complement out]
    out
]
; Slightly faster
bitset-to-string: func [b [bitset!] /local s x] [
    s: copy either find b 0 ["^(00)"] [""]
    repeat x 255 [
        if find b x [append s to-char x]
    ]
    s
]
I'll look more at this tomorrow.
Rebolek
27-May-2007
[1863]
Hi Brian, thanks for support, I was out for a sleep :)
Anton
27-May-2007
[1864]
BrianH, what ? Repeat does bind its argument, doesn't it ?
>> repeat n 4 []
>> n
** Script Error: n has no value
** Near: n
BrianH
27-May-2007
[1865]
Yeah, so it does. I wonder why the docs don't say (will be local) 
like it does for foreach. It still ends up faster than loop when 
you have to keep track of an index or a counter.
Dockimbel
27-May-2007
[1866]
Brian, you've stated that  "It is faster for parse to match a one-character 
string than a character value." It seems to me that the opposite 
statement is true. (matching a char! is faster than matching a on-character 
string!)
BrianH
27-May-2007
[1867x2]
It seems to me that the opposite _should_ be true, but parse converts 
the character to a string before matching it - no conversion is performed 
for string values. It's just one of those weird things.
This is one of those things that I actually profiled to determine 
for sure. Strange, but true.
Ladislav
28-May-2007
[1869]
my measurements show:

>> time-block [parse "a" ["a"]] 0.05
== 3.83615493774414E-7
>> time-block [parse "a" [#"a"]] 0.05
== 3.61204147338867E-7

, i.e. the opposite
BrianH
28-May-2007
[1870]
Which version? Nevermind, my timing differences may just be a multitasking 
artifact.
Ladislav
28-May-2007
[1871]
>> rebol/version
== 1.3.2.3.1
BrianH
28-May-2007
[1872x2]
Strange. When I tested before the difference turned out to be about 
the same as the overhead of string conversion. Quite the cooincidence. 
I guess I should have repeated my timings more :(
Too small a sample for a busy computer.
Ladislav
28-May-2007
[1874x2]
use can use time-block to make sure (it's on my site)
you can use...
Rebolek
28-May-2007
[1876]
nice to see char is faster than string, so I'm not going to rewrite 
this ;) Anyway, new version with BrianH's version of regex and bitset-to-string 
and with endless loop in tail-parse fixed is posted on http://bolek.techno.cz/reb/regex.r
Anton
28-May-2007
[1877]
That's funny, I thought we had determined that string was faster 
than char.
BrianH
28-May-2007
[1878x2]
It's probably close enough to ignore either way.
Rebolek, I gather you made the parse go in reverse to handle rules 
like "a+a" better. How does your reverse code handle "aa+", or "aa+a" 
- same problem?
Dockimbel
28-May-2007
[1880]
Here's another benchmark:

>> data: head insert/dup make string! 10'000'000 #"a" 10'000'000

>> t0: now/time/precise loop 10 [parse data [some "a"]] now/time/precise 
- t0
== 0:00:06.078

>> t0: now/time/precise loop 10 [parse data [some #"a"]] now/time/precise 
- t0
== 0:00:04.296


Running this test several times shows that char! matching is, in 
average, 30 % faster than string! matching.
BrianH
28-May-2007
[1881x2]
Well there you go. That's different numbers than last time, but more 
dramatic. It's just a #, easy fix :)
Thanks for answering that question though.
Sunanda
28-May-2007
[1883]
Just tried your benchmark on my machine.....I get similar % faster 
for char! matching.
Though the elapse times vary depending on the version of REBOL.
Dockimbel
28-May-2007
[1884]
Didn't want to sound "dramatic", but just wanted to provide a more 
accurate measure. Sure whatever datatype is used (char! or string!) 
in regex.r, that won't  change much the overall speed. ;-)
BrianH
28-May-2007
[1885x2]
I'm more concerned about the first and follow sets - doing that wrong 
could mean a slowdown of orders of magnitude.
Actually, big O slowdowns.
Rebolek
28-May-2007
[1887]
BrianH: "How does your reverse code handle "aa+", or "aa+a"" - damn... 
:)

Anyway, today I was thinking about different way how to do it, had 
no time to code it yet.
It's slow, I know...
BrianH
28-May-2007
[1888x2]
Are you familiar with the theories behind parser generators? That 
is what you are doing. I studied the theories in college - hence 
the talks about first and follow sets.
I've been thinking about ways to do this too - you got me going. 
I was also thinking of a way to cache charsets for later reuse.
Maxim
28-May-2007
[1890x3]
didn't you guys try using a charset?  I would guess its an order 
of magnitude faster still... no?
well, checking with one letter didn't give any changes (bitset and 
char seem the same) but just change that with an OR of two letter 
and the char goes to 4 times slower and the bitset stays exactly 
the same... so do use bitsets as often as you can.
>> t0: now/time/precise loop 10 [parse data [some #"a"]] now/time/precise 
- t0
== 0:00:06.389


>> t0: now/time/precise loop 10 [parse data [some [#"b" | #"a"]]] 
now/time/precise - t0
== 0:00:25.496

>> b: charset "ab"

>> t0: now/time/precise loop 10 [parse data [some b]] now/time/precise 
- t0
== 0:00:06.379
Rebolek
29-May-2007
[1893]
I'm not sure how can I test for end in case like this:
r: ["a" "b" "c" end] parse "abc" [some [r/1 (r: next r)]]
anybody?
Oldes
29-May-2007
[1894]
r: ["a" "b" "c" break] parse "abc" [some [r/1 (r: next r)]]
Rebolek
29-May-2007
[1895]
great, thanks
Oldes
29-May-2007
[1896]
Here are C sources for Regex engine used in MySQL5.0.9 http://leithal.cool-tools.co.uk/sourcedoc/mysql509/html/dir_000079.html
Rebolek
29-May-2007
[1897x4]
So I left tail-parse becuase of the "aa+" problem mentioned by BrianH. 
I wrote 'lazy-parse that works different and it seems to handle both 
cases "a+a" and "aa+" well.
>> lazy-parse "aaa" [[#"a"][some #"a"][#"a"]]
== true
>> lazy-parse "aaa" [[#"a"][some #"a"]]
== true
>> lazy-parse "aaa" [[some #"a"][#"a"]]
lazy-parse: func [string rules /local parsed?][
;does not support ANY yet (wrong results)
	parsed?: false
	rules: copy rules
	parse string [
		some [
			b: rules/1 e:
			(
				parsed?: true
				rules: next rules
				if empty? e [
					either empty? rules [
						parsed?: true
						rules: copy [break]
					] [
						e: next b
						if empty? e [parsed?: false]
					]
				]
			)
			:e
		]
	]
	parsed?
]
Warning: this does not suport ANY rule yet.