World: r3wp

[Parse] Discussion of PARSE dialect

Rebolek 26-May-2007 [1848]	more or less
BrianH 26-May-2007 [1849x14]	BTW, "a*a" is directly equivalent to [any "a" "a"], not some.
	It still won't work directly.
	Rebolek, I am rewriting the regset function. Do you want to support decreasing ranges, or do you want the characters to be added individually like they are with charset? I can do either.
	; Version with support for decreasing ranges
	regset: func [expression /local out negate? b e x] [
	    negate?: false
	    out: make bitset! []
	    parse/all expression [
	        opt ["~" (negate?: true)]
	        some [
	            "-" (insert out #"-") |
	            b: skip "-" e: skip (
	                b: first b  e: first e
	                loop 1 + (
	                    either b > e [b - x: e] [e - x: b]
	                ) [
	                    insert out x
	                    x: 1 + x
	                ]
	            ) |
	            x: skip (insert out first x)
	        ]
	    ]
	    if negate? [out: complement out]
	    out
	]
	; Version without support for decreasing ranges
	regset: func [expression /local out negate? b e x] [
	    negate?: false
	    out: make bitset! []
	    parse/all expression [
	        opt ["~" (negate?: true)]
	        some [
	            "-" (insert out #"-") |
	            b: skip "-" e: skip (
	                b: first b  e: first e
	                either b > e [
	                    insert insert insert out b #"-" e
	                ] [
	                    loop 1 + e - b [
	                        insert out b
	                        b: 1 + b
	                    ]
	                ]
	            ) |
	            x: skip (insert out first x)
	        ]
	    ]
	    if negate? [out: complement out]
	    out
	]
	Most of the changes were made to make it faster and to use less memory overhead.
	- It is faster for parse to match a one-character string than a character value.
	- Insert is faster than union, and makes no temporaries.
	- If you are capturing a single character, I think [a: skip (a: first a)] is faster than [copy a skip (a: first a)].
	- Path access is slower than the equivalent native, so [first a] instead of [a/1].
	- The fastest loop is loop, even with the math to calculate the number of times.
	I may end up with the wrong data type to do the [x: 1 + x] rather than [x: x + 1], same with b in the second version.
	Aside from the one-time bind, repeat may be faster than loop with a self-incremented index.
	I haven't tried to support escaped characters or subrange checks yet. The subrange check may be faster to support by copy/part on the source, or faster yet by inlining the code into the greater regex engine. I'm not sure how the helper variables would be affected by the tasking support of R3, but they sure aren't reentrant on R2.
	The standard backtracking of parse only happens upon alternation. To support the *, + and ? behavior of regexes, you either have to roll your own backtracking or have the compiler convert to using first and follow sets. In contrast to how your links describe the behavior of regex engines, it might be easier to make parse support lazy rather than greedy behavior of its iterators - they are greedy by default, but so much so that the first and follow sets shouldn't overlap.
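	A minimal sketch of the backtracking point above, assuming REBOL 2 string parsing. The rule name a-star-a is made up for illustration and is not part of regex.r; it shows one way to roll your own backtracking through alternation and recursion.

```rebol
REBOL []

; Greedy iterator: `some "a"` consumes every "a" and never gives one back,
; so the trailing "a" has nothing left to match and the rule fails.
print parse "aaa" [some "a" "a"]    ; prints false

; Hand-rolled backtracking (hypothetical rule name).
; When the recursive branch fails, parse returns to the position where
; the alternation started and tries the second alternative, which leaves
; one "a" for the end of the pattern.
a-star-a: ["a" a-star-a | "a"]
print parse "aaa" a-star-a          ; prints true
```

	The same shape, [X rule | Y], would cover a general "X*Y" pattern, at the cost of one level of recursion per repetition.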
	It might be a good idea to run a peephole optimizer on the patterns before compiling them, to convert ones like "a*a" to "aa*".
	It looks like repeat doesn't bind its argument, so it is definitely faster in this case.
	; Version with support for decreasing ranges
	regset: func [expression /local out negate? b e x] [
	    negate?: false
	    out: make bitset! []
	    parse/all expression [
	        opt ["~" (negate?: true)]
	        some [
	            "-" (insert out #"-") |
	            b: skip "-" e: skip (
	                b: first b  e: first e
	                either b > e [
	                    insert out e
	                    repeat x b - e [insert out e + x]
	                ] [
	                    insert out b
	                    repeat x e - b [insert out b + x]
	                ]
	            ) |
	            x: skip (insert out first x)
	        ]
	    ]
	    if negate? [out: complement out]
	    out
	]
	; Version without support for decreasing ranges
	regset: func [expression /local out negate? b e x] [
	    negate?: false
	    out: make bitset! []
	    parse/all expression [
	        opt ["~" (negate?: true)]
	        some [
	            "-" (insert out #"-") |
	            b: skip "-" e: skip (
	                b: first b  e: first e
	                either b > e [
	                    insert insert insert out b #"-" e
	                ] [
	                    insert out b
	                    repeat x e - b [insert out b + x]
	                ]
	            ) |
	            x: skip (insert out first x)
	        ]
	    ]
	    if negate? [out: complement out]
	    out
	]
	; Slightly faster
	bitset-to-string: func [b [bitset!] /local s x] [
	    s: copy either find b 0 ["^(00)"] [""]
	    repeat x 255 [
	        if find b x [append s to-char x]
	    ]
	    s
	]
	I'll look more at this tomorrow.
Rebolek 27-May-2007 [1863]	Hi Brian, thanks for support, I was out for a sleep :)
Anton 27-May-2007 [1864]	BrianH, what? Repeat does bind its argument, doesn't it?
	>> repeat n 4 []
	>> n
	** Script Error: n has no value
	** Near: n
BrianH 27-May-2007 [1865]	Yeah, so it does. I wonder why the docs don't say (will be local) like it does for foreach. It still ends up faster than loop when you have to keep track of an index or a counter.
Dockimbel 27-May-2007 [1866]	Brian, you've stated that "It is faster for parse to match a one-character string than a character value." It seems to me that the opposite statement is true (matching a char! is faster than matching a one-character string!).
BrianH 27-May-2007 [1867x2]	It seems to me that the opposite _should_ be true, but parse converts the character to a string before matching it - no conversion is performed for string values. It's just one of those weird things.
	This is one of those things that I actually profiled to determine for sure. Strange, but true.
Ladislav 28-May-2007 [1869]	my measurements show:
	>> time-block [parse "a" ["a"]] 0.05
	== 3.83615493774414E-7
	>> time-block [parse "a" [#"a"]] 0.05
	== 3.61204147338867E-7
	i.e. the opposite
BrianH 28-May-2007 [1870]	Which version? Nevermind, my timing differences may just be a multitasking artifact.
Ladislav 28-May-2007 [1871]	>> rebol/version == 1.3.2.3.1
BrianH 28-May-2007 [1872x2]	Strange. When I tested before, the difference turned out to be about the same as the overhead of string conversion. Quite the coincidence. I guess I should have repeated my timings more :(
	Too small a sample for a busy computer.
Ladislav 28-May-2007 [1874x2]	use can use time-block to make sure (it's on my site)
	you can use...
Rebolek 28-May-2007 [1876]	nice to see char is faster than string, so I'm not going to rewrite this ;) Anyway, new version with BrianH's version of regex and bitset-to-string and with endless loop in tail-parse fixed is posted on http://bolek.techno.cz/reb/regex.r
Anton 28-May-2007 [1877]	That's funny, I thought we had determined that string was faster than char.
BrianH 28-May-2007 [1878x2]	It's probably close enough to ignore either way.
	Rebolek, I gather you made the parse go in reverse to handle rules like "a+a" better. How does your reverse code handle "aa+", or "aa+a" - same problem?
Dockimbel 28-May-2007 [1880]	Here's another benchmark:
	>> data: head insert/dup make string! 10'000'000 #"a" 10'000'000
	>> t0: now/time/precise loop 10 [parse data [some "a"]] now/time/precise - t0
	== 0:00:06.078
	>> t0: now/time/precise loop 10 [parse data [some #"a"]] now/time/precise - t0
	== 0:00:04.296
	Running this test several times shows that char! matching is, on average, 30% faster than string! matching.
BrianH 28-May-2007 [1881x2]	Well there you go. That's different numbers than last time, but more dramatic. It's just a #, easy fix :)
	Thanks for answering that question though.
Sunanda 28-May-2007 [1883]	Just tried your benchmark on my machine... I get a similar % speedup for char! matching, though the elapsed times vary depending on the version of REBOL.
Dockimbel 28-May-2007 [1884]	Didn't want to sound "dramatic", but just wanted to provide a more accurate measure. Sure, whichever datatype is used (char! or string!) in regex.r, that won't change the overall speed much. ;-)
BrianH 28-May-2007 [1885x2]	I'm more concerned about the first and follow sets - doing that wrong could mean a slowdown of orders of magnitude.
	Actually, big O slowdowns.
Rebolek 28-May-2007 [1887]	BrianH: "How does your reverse code handle "aa+", or "aa+a"" - damn... :) Anyway, today I was thinking about a different way to do it, but had no time to code it yet. It's slow, I know...
BrianH 28-May-2007 [1888x2]	Are you familiar with the theories behind parser generators? That is what you are doing. I studied the theories in college - hence the talks about first and follow sets.
	I've been thinking about ways to do this too - you got me going. I was also thinking of a way to cache charsets for later reuse.
Maxim 28-May-2007 [1890x3]	didn't you guys try using a charset? I would guess it's an order of magnitude faster still... no?
	well, checking with one letter didn't give any change (bitset and char seem the same), but just replace that with an OR of two letters and the char version gets 4 times slower while the bitset stays exactly the same... so do use bitsets as often as you can.
	>> t0: now/time/precise loop 10 [parse data [some #"a"]] now/time/precise - t0
	== 0:00:06.389
	>> t0: now/time/precise loop 10 [parse data [some [#"b" | #"a"]]] now/time/precise - t0
	== 0:00:25.496
	>> b: charset "ab"
	>> t0: now/time/precise loop 10 [parse data [some b]] now/time/precise - t0
	== 0:00:06.379
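	For readers who want to try the comparison above on small input, here is a minimal sketch, assuming REBOL 2. The names sample and ab are illustrative only. It checks just that the two rule forms accept the same input; the timings quoted above are not reproduced here.

```rebol
REBOL []

sample: "abbaab"

; Alternation of single characters: each alternative is tried in turn
; at every position of the input.
print parse sample [some [#"b" | #"a"]]   ; prints true

; Bitset built with charset: one membership test per input character,
; regardless of how many characters the set contains.
ab: charset "ab"
print parse sample [some ab]              ; prints true
```

	Because the bitset test does not depend on the number of members, adding more characters to the set should not add alternatives for parse to walk through.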
Rebolek 29-May-2007 [1893]	I'm not sure how I can test for end in a case like this: r: ["a" "b" "c" end] parse "abc" [some [r/1 (r: next r)]] Anybody?
Oldes 29-May-2007 [1894]	r: ["a" "b" "c" break] parse "abc" [some [r/1 (r: next r)]]
Rebolek 29-May-2007 [1895]	great, thanks
Oldes 29-May-2007 [1896]	Here are C sources for Regex engine used in MySQL5.0.9 http://leithal.cool-tools.co.uk/sourcedoc/mysql509/html/dir_000079.html
Rebolek 29-May-2007 [1897]	So I dropped tail-parse because of the "aa+" problem mentioned by BrianH. I wrote 'lazy-parse, which works differently and seems to handle both cases, "a+a" and "aa+", well.