World: r3wp

Join the discussions in the REBOL3 world...

[Parse] Discussion of PARSE dialect

older newer	first last
Rebolek 26-May-2007 [1820x2]	I had googlefight in Krabot, if you remeber :)
Rebolek 26-May-2007 [1820x2]	like characters and digits?
Oldes 26-May-2007 [1822]	I mean.... I have to write.... spaces: charset " ^/^-^M" and so on on so many places in code
Rebolek 26-May-2007 [1823]	and whitespaces, yes
Oldes 26-May-2007 [1824x2]	it could be working like in stylize in VID
Oldes 26-May-2007 [1824x2]	but maybe it's not so important...
Rebolek 26-May-2007 [1826]	in the file i posted is a function REGSET that converts small bit of regex to bitset, it's syntax seems to be easier than charset's syntax (charset [#"a" - #"z" #"0" - #"9"] vs regset "a-z0-9")
Gregg 26-May-2007 [1827x2]	Very nice Boleslav! What regex engine/syntax are you going for compatibility with (if any)? Charset syntax is probably that way because it's a dialect, and Carl wanted a string as input to be easy, without escapes and such; just my guess.
Gregg 26-May-2007 [1827x2]	Graham, the best book I know of on regexs is Jeff Friedl's 'Mastering Regular Expressions'. He has an email validating regex (i.e. it just matches the RFC822 spec for an email address) which is almost 5K IIRC.
Rebolek 26-May-2007 [1829]	Thanks Gregg. I started with some examples from http://regular-expressions.info, just to see, if I can do it, so after fixing bugs and adding some feature still missing, I'll see what next, if anything. There are already some incompatibilities - carret ^ cannot be used in rebol strings the way it's used in regex, so i'm using tilde ~ instead.
BrianH 26-May-2007 [1830x5]	There are several different regex dialects. Are you following one of those, or making another?
	(Running commentary as I read your code)
	You should wrap your code in a context.
	You should seperate the regex compilation phase from its application phase, and just write a wrapper that calls both in order. The compilation phase is often more complex than just applying the results, so if you are using the regex repeatedly you should just compile it once.
	i like regset, in theory. You seem to be applying block parse syntax to strings - it could be simpler.
Rebolek 26-May-2007 [1835]	I wanted to do some basic regex set, some common denominator, partly as an exrcesise in parse and partly to stop people looking at rebol saying "it has no regex". so now it has ;) So I'm not sure what dialect of regex is the best one. Block parse syntax to strings - you mean that char! ? Yes, it's probably not doing anything, the code needs some cleanup. I plan to wrap it in context, yes.
BrianH 26-May-2007 [1836]	The char! 1 is useless. I am looking at that code now (reworking it).
Rebolek 26-May-2007 [1837]	Oldes, I though about just a translator from regex to parse rules and I'm not sure it will be easy, I'm using my 'tail-parse that matches rules in reversed order that is better for regex syntax. Maybe there's some other way.
BrianH 26-May-2007 [1838]	Reverse order is better? I think you aren't backtracking right in your generated rules.
Rebolek 26-May-2007 [1839]	this is the problem with [some "a" "a"]. This is equivalent of "a*a" in regex which is perfectly valid, but problematic in parse. This is simple example, but it can get quite complicated so I'm not sure I can handle all cases. The reversed order seemed simpler. But you will probably prove me wrong :)
BrianH 26-May-2007 [1840x2]	Are you supporting grouping? Haven't gotten to that point in the code yet.
BrianH 26-May-2007 [1840x2]	Do regex dialects support decreasing character ranges in their sets?
Rebolek 26-May-2007 [1842]	grouping - just the bitsets created with regset, groups like [cat\|dog] not yet supported
BrianH 26-May-2007 [1843x2]	Which regex dialect does grouping with [ and ], I thought they used ( and ).
BrianH 26-May-2007 [1843x2]	Are you supporting . wildcard characters?
Rebolek 26-May-2007 [1845x4]	oh yes you're right...just an old rebol habit, using square brackets everywhere :)
	wildcards - well dot is any character except newline
	Right now it should support regex according tohttp://www.regular-expressions.info/quickstart.html
	more or less
BrianH 26-May-2007 [1849x14]	BTW, "a*a" is directly equivalent to [any "a" "a"], not some.
	It still won't work directly.
	Rebolek, I am rewriting the regset function. Do you want to support decreasing ranges or do you want the characters to be added individually lick they are with charset? I can do either.
	; Version with support for decreasing ranges regset: func [expression /local out negate? b e x] [ negate?: false out: make bitset! [] parse/all expression [ opt ["~" (negate?: true)] some [ "-" (insert out #"-") \| b: skip "-" e: skip ( b: first b e: first e loop 1 + ( either b > e [b - x: e] [e - x: b] ) [ insert out x x: 1 + x ] ) \| x: skip (insert out first x) ] ] if negate? [out: complement out] out ] ; Version without support for decreasing ranges regset: func [expression /local out negate? b e x] [ negate?: false out: make bitset! [] parse/all expression [ opt ["~" (negate?: true)] some [ "-" (insert out #"-") \| b: skip "-" e: skip ( b: first b e: first e either b > e [ insert insert insert out b #"-" e ] [ loop 1 + e - b [ insert out b b: 1 + b ] ] ) \| x: skip (insert out first x) ] ] if negate? [out: complement out] out ]
	Most of the changes were made to make it faster and to use less memory overhead. - It is faster for parse to match a one-character string than a character value. - Insert is faster than union, and makes no temporaries. - If you are capturing a single character, I think [a: skip (a: first a)] is faster than [copy a skip (a: first a)]. - Path access is slower than the equivalent native, so [first a] instead of [a/1]. - The fastest loop is loop, even with the math to calculate the number of times.
	I may end up with the wrong data type to do the [x: 1 + x] rather than [x: x + 1], same with b in the second version.
	Aside from the one-time bind, repeat may be faster than loop with a self-incremented index.
	I haven't tried to support escaped characters or subrange checks yet. The subrange check may be faster to support by copy/part on the source, or faster yet by inlining the code into the greater regex engine. I'm not sure how the helper variables would be affected by the tasking support of R3, but they sure aren't reentrant on R2.
	The standard backtracking of parse only happens upon alternation. To support the *, + and ? behavior of regexes, you either have to roll your own backtracking or have the compiler convert to using first and follow sets. In contrast to how your links describe the behavior of regex engines, it might be easier to make parse support lazy rather than greedy behavior of its iterators - they are greedy by default, but so much so that the first and follow sets shouldn't overlap.
	It might be a good idea to run a peephole optimizer on the patterns before compiling them, to convert ones like "aa" to "aa".
	It looks like repeat doesn't bind its argument, so it is definitely faster in this case.
	; Version with support for decreasing ranges regset: func [expression /local out negate? b e x] [ negate?: false out: make bitset! [] parse/all expression [ opt ["~" (negate?: true)] some [ "-" (insert out #"-") \| b: skip "-" e: skip ( b: first b e: first e either b > e [ insert out e repeat x b - e [insert out e + x] ] [ insert out b repeat x e - b [insert out b + x] ] ) \| x: skip (insert out first x) ] ] if negate? [out: complement out] out ] ; Version without support for decreasing ranges regset: func [expression /local out negate? b e x] [ negate?: false out: make bitset! [] parse/all expression [ opt ["~" (negate?: true)] some [ "-" (insert out #"-") \| b: skip "-" e: skip ( b: first b e: first e either b > e [ insert insert insert out b #"-" e ] [ insert out b repeat x e - b [insert out b + x] ] ) \| x: skip (insert out first x) ] ] if negate? [out: complement out] out ]
	; Slightly faster bitset-to-string: func [b [bitset!] /local s x] [ s: copy either find b 0 ["^(00)"] [""] repeat x 255 [ if find b x [append s to-char x] ] s ]
	I'll look more at this tomorrow.
Rebolek 27-May-2007 [1863]	Hi Brian, thanks for support, I was out for a sleep :)
Anton 27-May-2007 [1864]	BrianH, what ? Repeat does bind its argument, doesn't it ? >> repeat n 4 [] >> n Script Error: n has no value Near: n
BrianH 27-May-2007 [1865]	Yeah, so it does. I wonder why the docs don't say (will be local) like it does for foreach. It still ends up faster than loop when you have to keep track of an index or a counter.
Dockimbel 27-May-2007 [1866]	Brian, you've stated that "It is faster for parse to match a one-character string than a character value." It seems to me that the opposite statement is true. (matching a char! is faster than matching a on-character string!)
BrianH 27-May-2007 [1867x2]	It seems to me that the opposite _should_ be true, but parse converts the character to a string before matching it - no conversion is performed for string values. It's just one of those weird things.
BrianH 27-May-2007 [1867x2]	This is one of those things that I actually profiled to determine for sure. Strange, but true.
Ladislav 28-May-2007 [1869]	my measurements show: >> time-block [parse "a" ["a"]] 0.05 == 3.83615493774414E-7 >> time-block [parse "a" [#"a"]] 0.05 == 3.61204147338867E-7 , i.e. the opposite
older newer	first last