World: r3wp
[Parse] Discussion of PARSE dialect
older newer | first last |
BrianH 26-May-2007 [1840x2] | Are you supporting grouping? Haven't gotten to that point in the code yet. |
Do regex dialects support decreasing character ranges in their sets? | |
Rebolek 26-May-2007 [1842] | grouping - just the bitsets created with regset, groups like [cat|dog] not yet supported |
BrianH 26-May-2007 [1843x2] | Which regex dialect does grouping with [ and ], I thought they used ( and ). |
Are you supporting . wildcard characters? | |
Rebolek 26-May-2007 [1845x4] | oh yes you're right...just an old rebol habit, using square brackets everywhere :) |
wildcards - well dot is any character except newline | |
Right now it should support regex according tohttp://www.regular-expressions.info/quickstart.html | |
more or less | |
BrianH 26-May-2007 [1849x14] | BTW, "a*a" is directly equivalent to [any "a" "a"], not some. |
It still won't work directly. | |
Rebolek, I am rewriting the regset function. Do you want to support decreasing ranges or do you want the characters to be added individually lick they are with charset? I can do either. | |
; Version with support for decreasing ranges regset: func [expression /local out negate? b e x] [ negate?: false out: make bitset! [] parse/all expression [ opt ["~" (negate?: true)] some [ "-" (insert out #"-") | b: skip "-" e: skip ( b: first b e: first e loop 1 + ( either b > e [b - x: e] [e - x: b] ) [ insert out x x: 1 + x ] ) | x: skip (insert out first x) ] ] if negate? [out: complement out] out ] ; Version without support for decreasing ranges regset: func [expression /local out negate? b e x] [ negate?: false out: make bitset! [] parse/all expression [ opt ["~" (negate?: true)] some [ "-" (insert out #"-") | b: skip "-" e: skip ( b: first b e: first e either b > e [ insert insert insert out b #"-" e ] [ loop 1 + e - b [ insert out b b: 1 + b ] ] ) | x: skip (insert out first x) ] ] if negate? [out: complement out] out ] | |
Most of the changes were made to make it faster and to use less memory overhead. - It is faster for parse to match a one-character string than a character value. - Insert is faster than union, and makes no temporaries. - If you are capturing a single character, I think [a: skip (a: first a)] is faster than [copy a skip (a: first a)]. - Path access is slower than the equivalent native, so [first a] instead of [a/1]. - The fastest loop is loop, even with the math to calculate the number of times. | |
I may end up with the wrong data type to do the [x: 1 + x] rather than [x: x + 1], same with b in the second version. | |
Aside from the one-time bind, repeat may be faster than loop with a self-incremented index. | |
I haven't tried to support escaped characters or subrange checks yet. The subrange check may be faster to support by copy/part on the source, or faster yet by inlining the code into the greater regex engine. I'm not sure how the helper variables would be affected by the tasking support of R3, but they sure aren't reentrant on R2. | |
The standard backtracking of parse only happens upon alternation. To support the *, + and ? behavior of regexes, you either have to roll your own backtracking or have the compiler convert to using first and follow sets. In contrast to how your links describe the behavior of regex engines, it might be easier to make parse support lazy rather than greedy behavior of its iterators - they are greedy by default, but so much so that the first and follow sets shouldn't overlap. | |
It might be a good idea to run a peephole optimizer on the patterns before compiling them, to convert ones like "a*a" to "aa*". | |
It looks like repeat doesn't bind its argument, so it is definitely faster in this case. | |
; Version with support for decreasing ranges regset: func [expression /local out negate? b e x] [ negate?: false out: make bitset! [] parse/all expression [ opt ["~" (negate?: true)] some [ "-" (insert out #"-") | b: skip "-" e: skip ( b: first b e: first e either b > e [ insert out e repeat x b - e [insert out e + x] ] [ insert out b repeat x e - b [insert out b + x] ] ) | x: skip (insert out first x) ] ] if negate? [out: complement out] out ] ; Version without support for decreasing ranges regset: func [expression /local out negate? b e x] [ negate?: false out: make bitset! [] parse/all expression [ opt ["~" (negate?: true)] some [ "-" (insert out #"-") | b: skip "-" e: skip ( b: first b e: first e either b > e [ insert insert insert out b #"-" e ] [ insert out b repeat x e - b [insert out b + x] ] ) | x: skip (insert out first x) ] ] if negate? [out: complement out] out ] | |
; Slightly faster bitset-to-string: func [b [bitset!] /local s x] [ s: copy either find b 0 ["^(00)"] [""] repeat x 255 [ if find b x [append s to-char x] ] s ] | |
I'll look more at this tomorrow. | |
Rebolek 27-May-2007 [1863] | Hi Brian, thanks for support, I was out for a sleep :) |
Anton 27-May-2007 [1864] | BrianH, what ? Repeat does bind its argument, doesn't it ? >> repeat n 4 [] >> n ** Script Error: n has no value ** Near: n |
BrianH 27-May-2007 [1865] | Yeah, so it does. I wonder why the docs don't say (will be local) like it does for foreach. It still ends up faster than loop when you have to keep track of an index or a counter. |
Dockimbel 27-May-2007 [1866] | Brian, you've stated that "It is faster for parse to match a one-character string than a character value." It seems to me that the opposite statement is true. (matching a char! is faster than matching a on-character string!) |
BrianH 27-May-2007 [1867x2] | It seems to me that the opposite _should_ be true, but parse converts the character to a string before matching it - no conversion is performed for string values. It's just one of those weird things. |
This is one of those things that I actually profiled to determine for sure. Strange, but true. | |
Ladislav 28-May-2007 [1869] | my measurements show: >> time-block [parse "a" ["a"]] 0.05 == 3.83615493774414E-7 >> time-block [parse "a" [#"a"]] 0.05 == 3.61204147338867E-7 , i.e. the opposite |
BrianH 28-May-2007 [1870] | Which version? Nevermind, my timing differences may just be a multitasking artifact. |
Ladislav 28-May-2007 [1871] | >> rebol/version == 1.3.2.3.1 |
BrianH 28-May-2007 [1872x2] | Strange. When I tested before the difference turned out to be about the same as the overhead of string conversion. Quite the cooincidence. I guess I should have repeated my timings more :( |
Too small a sample for a busy computer. | |
Ladislav 28-May-2007 [1874x2] | use can use time-block to make sure (it's on my site) |
you can use... | |
Rebolek 28-May-2007 [1876] | nice to see char is faster than string, so I'm not going to rewrite this ;) Anyway, new version with BrianH's version of regex and bitset-to-string and with endless loop in tail-parse fixed is posted on http://bolek.techno.cz/reb/regex.r |
Anton 28-May-2007 [1877] | That's funny, I thought we had determined that string was faster than char. |
BrianH 28-May-2007 [1878x2] | It's probably close enough to ignore either way. |
Rebolek, I gather you made the parse go in reverse to handle rules like "a+a" better. How does your reverse code handle "aa+", or "aa+a" - same problem? | |
Dockimbel 28-May-2007 [1880] | Here's another benchmark: >> data: head insert/dup make string! 10'000'000 #"a" 10'000'000 >> t0: now/time/precise loop 10 [parse data [some "a"]] now/time/precise - t0 == 0:00:06.078 >> t0: now/time/precise loop 10 [parse data [some #"a"]] now/time/precise - t0 == 0:00:04.296 Running this test several times shows that char! matching is, in average, 30 % faster than string! matching. |
BrianH 28-May-2007 [1881x2] | Well there you go. That's different numbers than last time, but more dramatic. It's just a #, easy fix :) |
Thanks for answering that question though. | |
Sunanda 28-May-2007 [1883] | Just tried your benchmark on my machine.....I get similar % faster for char! matching. Though the elapse times vary depending on the version of REBOL. |
Dockimbel 28-May-2007 [1884] | Didn't want to sound "dramatic", but just wanted to provide a more accurate measure. Sure whatever datatype is used (char! or string!) in regex.r, that won't change much the overall speed. ;-) |
BrianH 28-May-2007 [1885x2] | I'm more concerned about the first and follow sets - doing that wrong could mean a slowdown of orders of magnitude. |
Actually, big O slowdowns. | |
Rebolek 28-May-2007 [1887] | BrianH: "How does your reverse code handle "aa+", or "aa+a"" - damn... :) Anyway, today I was thinking about different way how to do it, had no time to code it yet. It's slow, I know... |
BrianH 28-May-2007 [1888x2] | Are you familiar with the theories behind parser generators? That is what you are doing. I studied the theories in college - hence the talks about first and follow sets. |
I've been thinking about ways to do this too - you got me going. I was also thinking of a way to cache charsets for later reuse. | |
older newer | first last |