r3wp [groups: 83 posts: 189283]
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

World: r3wp

[Parse] Discussion of PARSE dialect

Rebolek
28-May-2007
[1876]
nice to see char is faster than string, so I'm not going to rewrite 
this ;) Anyway, new version with BrianH's version of regex and bitset-to-string 
and with endless loop in tail-parse fixed is posted on http://bolek.techno.cz/reb/regex.r
Anton
28-May-2007
[1877]
That's funny, I thought we had determined that string was faster 
than char.
BrianH
28-May-2007
[1878x2]
It's probably close enough to ignore either way.
Rebolek, I gather you made the parse go in reverse to handle rules 
like "a+a" better. How does your reverse code handle "aa+", or "aa+a" 
- same problem?
Dockimbel
28-May-2007
[1880]
Here's another benchmark:

>> data: head insert/dup make string! 10'000'000 #"a" 10'000'000

>> t0: now/time/precise loop 10 [parse data [some "a"]] now/time/precise 
- t0
== 0:00:06.078

>> t0: now/time/precise loop 10 [parse data [some #"a"]] now/time/precise 
- t0
== 0:00:04.296


Running this test several times shows that char! matching is, in 
average, 30 % faster than string! matching.
BrianH
28-May-2007
[1881x2]
Well there you go. That's different numbers than last time, but more 
dramatic. It's just a #, easy fix :)
Thanks for answering that question though.
Sunanda
28-May-2007
[1883]
Just tried your benchmark on my machine.....I get similar % faster 
for char! matching.
Though the elapse times vary depending on the version of REBOL.
Dockimbel
28-May-2007
[1884]
Didn't want to sound "dramatic", but just wanted to provide a more 
accurate measure. Sure whatever datatype is used (char! or string!) 
in regex.r, that won't  change much the overall speed. ;-)
BrianH
28-May-2007
[1885x2]
I'm more concerned about the first and follow sets - doing that wrong 
could mean a slowdown of orders of magnitude.
Actually, big O slowdowns.
Rebolek
28-May-2007
[1887]
BrianH: "How does your reverse code handle "aa+", or "aa+a"" - damn... 
:)

Anyway, today I was thinking about different way how to do it, had 
no time to code it yet.
It's slow, I know...
BrianH
28-May-2007
[1888x2]
Are you familiar with the theories behind parser generators? That 
is what you are doing. I studied the theories in college - hence 
the talks about first and follow sets.
I've been thinking about ways to do this too - you got me going. 
I was also thinking of a way to cache charsets for later reuse.
Maxim
28-May-2007
[1890x3]
didn't you guys try using a charset?  I would guess its an order 
of magnitude faster still... no?
well, checking with one letter didn't give any changes (bitset and 
char seem the same) but just change that with an OR of two letter 
and the char goes to 4 times slower and the bitset stays exactly 
the same... so do use bitsets as often as you can.
>> t0: now/time/precise loop 10 [parse data [some #"a"]] now/time/precise 
- t0
== 0:00:06.389


>> t0: now/time/precise loop 10 [parse data [some [#"b" | #"a"]]] 
now/time/precise - t0
== 0:00:25.496

>> b: charset "ab"

>> t0: now/time/precise loop 10 [parse data [some b]] now/time/precise 
- t0
== 0:00:06.379
Rebolek
29-May-2007
[1893]
I'm not sure how can I test for end in case like this:
r: ["a" "b" "c" end] parse "abc" [some [r/1 (r: next r)]]
anybody?
Oldes
29-May-2007
[1894]
r: ["a" "b" "c" break] parse "abc" [some [r/1 (r: next r)]]
Rebolek
29-May-2007
[1895]
great, thanks
Oldes
29-May-2007
[1896]
Here are C sources for Regex engine used in MySQL5.0.9 http://leithal.cool-tools.co.uk/sourcedoc/mysql509/html/dir_000079.html
Rebolek
29-May-2007
[1897x5]
So I left tail-parse becuase of the "aa+" problem mentioned by BrianH. 
I wrote 'lazy-parse that works different and it seems to handle both 
cases "a+a" and "aa+" well.
>> lazy-parse "aaa" [[#"a"][some #"a"][#"a"]]
== true
>> lazy-parse "aaa" [[#"a"][some #"a"]]
== true
>> lazy-parse "aaa" [[some #"a"][#"a"]]
lazy-parse: func [string rules /local parsed?][
;does not support ANY yet (wrong results)
	parsed?: false
	rules: copy rules
	parse string [
		some [
			b: rules/1 e:
			(
				parsed?: true
				rules: next rules
				if empty? e [
					either empty? rules [
						parsed?: true
						rules: copy [break]
					] [
						e: next b
						if empty? e [parsed?: false]
					]
				]
			)
			:e
		]
	]
	parsed?
]
Warning: this does not suport ANY rule yet.
Grrr, I should make more test before posting...forget last four posts
Rebolek
31-May-2007
[1902]
'regset is now on rebol.org, rest of parser is still in works.
btiffin
31-May-2007
[1903]
Rebolek...You posted the magic prime 797'th Script.  Well Done!  
:)
Rebolek
31-May-2007
[1904]
Hm thanks, but I was aiming for 800th one... ;)
btiffin
31-May-2007
[1905]
Yep...I've been waiting...  :)
Rebolek
7-Jun-2007
[1906]
just a quick idea:

FORALL is implemented as mezzanine function. It calls FORSKIP which 
is mezzanine also. As you can see, it's probably not the fastest 
method. So here's the idea. Cannnot be FORALL rewritten to make it 
faster and is it possible to use PARSE to do this?
So I tried and came up with this simple function:

parall: func [
	'word body
	/local data
][
	data: get :word
	parse data compose/deep [

  some [(to set-word! word) any-type! (to paren! [do bind body :word])]
	]
]

(parall is just a shortcut for parse version of forall).


this is very simple function written in five minutes and not very 
well checked, it needs some additional work (eg. it does not return 
same value as 'forall etc).

So let's do some test (using Ladislav's %timblk.r):
>> n: copy [] repeat i 100 [append n i]

== [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 4...
>> time-block [forall n [b: n/1 + 1]] 0.05
== 3.623046875E-4
>> time-block [parall n [b: n/1 + 1]] 0.05
== 3.814697265625E-6
>> 3.62e-4 / 3.81e-6
== 95.0131233595801

95x faster? whooo....

and what about bigger block?

>> n: copy [] repeat i 10000 [append n i]

== [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 4...
>> time-block [forall n [b: n/1 + 1]] 0.05
== 3.540625E-2
>> time-block [parall n [b: n/1 + 1]] 0.05
== 3.7994384765625E-6
>> 3.54e-2 / 3.8e-6
== 9315.78947368421

9000x ? omg...

comments?
Gabriele
7-Jun-2007
[1907]
the devil is in the details... you need to handle break :)
Rebolek
7-Jun-2007
[1908]
oh yes, break...didn't use it for years, so almost completely forget 
about it :)
BrianH
7-Jun-2007
[1909x4]
Don't forget continue.
First tip: You don't have to bind the body - it just uses the existing 
binding of the word.
Netx tip: What native does forskip decompose down to? Try using that 
first.
; Try against this, the forskip code with the skip part taken out
forall: func [
    "Evaluates a block for every value in a series."
    [catch throw]

    'word [word!] {Word set to each position in series and changed as 
    a result}
    body [block!] "Block to evaluate each time"
    /local orig result
][
    if not any [
        series? get word
        port? get word

    ] [throw make error! {forall expected word argument to refer to a 
    series or port!}]
    orig: get word
    while [any [not tail? get word (set word orig false)]] [
        set/any 'result do body
        set word next get word
        get/any 'result
    ]
]
Rebolek
7-Jun-2007
[1913x2]
BrianH: I know it can be done using while loop in forskip, it was 
more an experiment on using parse as foreach/forall kind of function.
to see how fast it can be
BrianH
7-Jun-2007
[1915x3]
Well, if you can figure out how to make it handle break and continue, 
let us know.
Or for that matter, ports.
If you want to test the speed of parse, replace the any-type! with 
a skip - the forall you are comparing it to doesn't do that test.
Chris
7-Jun-2007
[1918]
What is 'continue?
BrianH
7-Jun-2007
[1919]
Sort of like the opposite of break. It breaks one iteration of a 
loop and continues at the top of the next iteration. Many languages 
have it, as does REBOL 3.
Gabriele
8-Jun-2007
[1920]
notice that, forall was initially done that way, and converted to 
use forskip for simplicity (ie. avoiding having the same code twice 
in the source)
Rebolek
8-Jun-2007
[1921]
I tried to enclose parse in loop 1 [] and it seems to handle break. 
I guess you'll probably prove me wrong, Brian :)

parall: func [
	'word body
	/loc data
][
	loop 1 [
		data: get :word
		parse data compose/deep [
			some [(to set-word! word) skip (to paren! [do body])]
		]
	]
]

re: continue - this is not r3 ;)


>> n: [1 2 3 4 5] parall n [if n/1 = 4 [break/return "break"] if 
n/1 > 4 [print "bad"]]
== "break"
Gabriele
8-Jun-2007
[1922x4]
good :) now... if Ladislav was here he'd probably find at least ten 
wrong things in that function. ;)
you need to add [throw] for example, and return the last return value 
of the body
but... as long as we're not replacing forall... this could be a very 
fast alternative to use in cases when you don't need all the features...
and actually... i'd call parse directly in that case ;)