• Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

World: r4wp

[Rebol School] REBOL School

BrianH
8-Aug-2012
[752x7]
Here's a version of Doc's with all datatypes handled, even unset. 
It even avoids accidentally converting lit-words to words and lit-paths 
to paths:

rle: func [s [series!] /local out pos1 pos2 cont][
	out: make block! 1
	parse/all s [
		any [
			pos1: skip some [
				pos2: skip (cont: if any [
					not-equal? unset? :pos1/1 unset? :pos2/1
					strict-not-equal? :pos1/1 :pos2/1
				] [[end skip]]) cont |
				pos2: (repend out [offset? :pos1 :pos2 :pos1/1]) break
			]
		]
	]
	out
]
There was no point in using parse/case since the comparisons were 
being done by strict-not-equal?, not parse.
Don't know how the speed compares to Endo's original though.
Whoops, it doesn't handle runs of unset values. More adjustment needed.
This version handles unsets too:

rle: func [s [series!] /local out emit pos1 pos2 cont][
	out: make block! 1
	emit: [(repend out [offset? :pos1 :pos2 :pos1/1])]
	parse/all s [
		any [
			pos1: unset! some [pos2: unset! | pos2: emit break] |
			pos1: skip some [
				pos2: unset! :pos2 emit break |
				pos2: skip (
					cont: if strict-not-equal? :pos1/1 :pos2/1 [[end skip]]
				) cont |
				pos2: emit break
			]
		]
	]
	out
]
Slight improvement:

rle: func [s [series!] /local out emit pos1 pos2 cont][
	out: make block! 1
	emit: [(repend out [offset? :pos1 :pos2 :pos1/1])]
	parse/all s [any [
		pos1: unset! any unset! pos2: emit |
		pos1: skip some [
			pos2: unset! :pos2 emit break |
			pos2: skip (
				cont: if strict-not-equal? :pos1/1 :pos2/1 [[end skip]]
			) cont |
			pos2: emit break
		]
	]]
	out
]
For old versions of R2, you might want to change IF STRICT-NOT-EQUAL? 
to UNLESS STRICT-EQUAL?. There used to be a bug in STRICT-NOT-EQUAL? 
in R2, but I forget how far back it was fixed, one of the last couple 
versions.
Endo
9-Aug-2012
[759x2]
The last version crash on string! values:
rle ""
rle "xxx"
both crashes the View 2.7.8.3.1
Here is the benchmark results (execution time for 1.000.000 calls)


>> benchmark [rle [a a a b b c A A a a]] ;BrianH (the old one, not 
the crashing ones)
== 0:00:29.765

>> benchmark [rle [a a a b b c A A a a]] ;endo
== 0:00:32.953
DocKimbel
9-Aug-2012
[761]
Endo: you should rather bench on one long series rather than 1M times 
on a small one in order to avoid function calls overhead and get 
a more fair comparison. When I try with a 1M size string with random 
a,b,c chars, my parse solution is twice faster than the mezz one 
(Brian's one is crashing so can't test it). I was expecting a greater 
difference though.
Steeve
9-Aug-2012
[762]
... test
Sunanda
9-Aug-2012
[763]
Talking of test, I am trying to write a simple function that checks 
if a data item matches a rebol datatype, so for example:

    print is-it-a? "number?" "12.5"
    == true
    print is-it-a? "number?" "xxx"
    == false
    print is-it-a? "number?" "?"
    == false


Except my function goes bad on that third example -- it prints the 
console help text.


Any thoughts on how to check incoming values without executing them 
as code? Thanks

    is-it-a?: func [
       data-type [string!]
       value [string!]
    ][
      data-type: first load/all data-type  
      error? try [
          value: first load/all value 
          return do reduce [data-type value]
      ]
      false
    ]
Steeve
9-Aug-2012
[764]
is-a: func [f v][
	not not all [
		f: get/any load f
		any-function? :f
		f load v
	]
]
Sunanda
9-Aug-2012
[765]
Thanks Steeve, that's much more robust than my code :)

Just for info .....


....It needs some error trapping to handle un-loadable values, eg:
    >> is-a "number?" "33.e"
    ** Syntax Error: Invalid decimal -- 33.e


....And (like my code) it's not so good with 'true and 'false owing 
to the way REBOL works:
    >> is-a "logic?" "true"
    == false

But it'll do the job!
BrianH
9-Aug-2012
[766x4]
Weird, I found the bug in R2 parse that causes the crash:
>> parse "" [integer!]
== false
>> parse "" [unset!]
== true  ; should be false
Two rules it is then. This doesn't crash, and is optimized for strings 
while we're at it. It's probably slower when doing blocks than Doc's, 
but it handles all datatypes:

rle: func [s [series!] /local out emit pos1 pos2 cont][
	out: make block! 1
	emit: [(repend out [offset? :pos1 :pos2 first :pos1])]
	parse/case/all :s pick [[
		any [pos1: skip (cont: first :pos1) any cont pos2: emit]
	] [
		any [
			pos1: unset! any unset! pos2: emit |
			pos1: skip some [
				end pos2: emit break |
				pos2: unset! :pos2 emit break |
				pos2: skip (

     cont: unless strict-equal? first :pos1 first :pos2 [[end skip]]
				) cont |
				pos2: emit break
			]
		]
	]] any-string? :s
	out
]


It also works around the strict-not-equal? bug in pre-2.7.7 R2, and 
using FIRST instead of path access is another speedup in R2 (path 
access is faster in R3).
Change the repend to chained inserts and it gets noticably faster, 
due to less mezzanine overhead:

rle: func [s [series!] /local out emit pos1 pos2 cont][
	out: make block! 2

 emit: [(out: insert/only insert out offset? :pos1 :pos2 first :pos1)]
	parse/case/all :s pick [[
		any [pos1: skip (cont: first :pos1) any cont pos2: emit]
	] [
		any [
			pos1: unset! any unset! pos2: emit |
			pos1: skip some [
				end pos2: emit break |
				pos2: unset! :pos2 emit break |
				pos2: skip (

     cont: unless strict-equal? first :pos1 first :pos2 [[end skip]]
				) cont |
				pos2: emit break
			]
		]
	]] any-string? :s
	head out
]
You can get rid of this line too for a slight speedup:
				end pos2: emit break |

It's a leftover when I was trying to work around the bug in PARSE.
Maxim
9-Aug-2012
[770]
providing an optional output buffer  (which the user can pre-allocate 
to some ideal size)  would make a VERY big difference on large inputs.


usually, when it goes into the hundreds of thousands, repetitive 
series re-allocation on growing mutable series,  will kill any kind 
of optimisation you can dream of. 


rle: func [s [series!]   /into out  [block!]  /local out emit pos1 
pos2 cont][
	out: any [ out     make block! 2 ]
 ...
]


this is especially effective on repetitive calls to the above function 
and using clear on the given buffer so that it auto-grows to an optimal 
size and is fast on later calls.


just today, I was doing some encryption benchmarking and when I hit 
strings larger than 1MB it was taking several minutes... thats until 
I realized that it was my dataset generator (a looped string insert) 
which was taking 98% of the cpu time.  !
BrianH
9-Aug-2012
[771x3]
If you follow the /into option standard you can do chained calls 
to RLE too:


rle: func [s [series!] /into out [any-block!] /local emit pos1 pos2 
cont][
	unless into [out: make block! 2]

 emit: [(out: insert/only insert :out offset? :pos1 :pos2 first :pos1)]
	parse/case/all :s pick [[
		any [pos1: skip (cont: first :pos1) any cont pos2: emit]
	] [
		any [
			pos1: unset! any unset! pos2: emit |
			pos1: skip some [
				pos2: unset! :pos2 emit break |
				pos2: skip (

     cont: unless strict-equal? first :pos1 first :pos2 [[end skip]]
				) cont |
				pos2: emit break
			]
		]
	]] any-string? :s
	either into [:out] [head :out]
]
>> head rle/into "ddeeee" rle/into "aaabbc" make block! 10
== [3 #"a" 2 #"b" 1 #"c" 2 #"d" 4 #"e"]
Functions with /into work like INSERT when chained together. This 
works with other block types too, though OFFSET? will be slow with 
the list! type.
Maxim
9-Aug-2012
[774]
now that's a pretty nice RLE encoder.     :-)
BrianH
9-Aug-2012
[775x3]
R3 version, same /into option:


rle: funct ["Run length encode" s [series!] /into output [any-block!]] 
[
	unless into [output: make block! 2] x: none

 r: either any-block? :s [qr: copy [quote 1] [(qr/2: :x) any qr]] 
 [[any x]]
	parse/case :s [any [pos1: set x skip r pos2: (

  output: reduce/into [subtract index? :pos2 index? :pos1 :x] :output
	)]]
	either into [:output] [head :output]
]
Sorry, same unset problems, have to use POKE:


rle: funct ["Run length encode" s [series!] /into output [any-block!]] 
[
	unless into [output: make block! 2] x: none

 r: either any-block? :s [qr: copy [quote 1] [(poke qr 2 :x) any qr]] 
 [[any x]]
	parse/case :s [any [pos1: set x skip r pos2: (

  output: reduce/into [subtract index? :pos2 index? :pos1 :x] :output
	)]]
	either into [:output] [head :output]
]
>> mold/all rle reduce [() () 'a () 1]
== "[2 #[unset!] 1 a 1 #[unset!] 1 1]"
Endo
10-Aug-2012
[778x3]
It is what I like about this community :)

I knew that when I write a RLE function, BrianH will come up a much 
better version. Doc and others joined as well and now we have a very 
good function. Just like the CSV tools. Thanks.
Ehm.. what about the decoder? how do I decode unset! values?
I was using somthing like:

decode-rle: func [b /local r] [r: copy [] foreach [x y] b [loop x 
[append r y]]]
decode-rle: func [b /local r i] [

 i: 0  r: make block! foreach [x y] b [i: i + x] ;better for big blocks?
 foreach [x y] b [loop x [append r y]]
]
BrianH
10-Aug-2012
[781x3]
In mezzanine style:

decode-rle: func [
	"Decode a run length encoded block"
	rle [any-block!] "Block of [integer value]"

 /into "Insert into a buffer instead (returns position after insert)"
	output [series!] "The output buffer (modified)"
	/local x
] [
	unless into [
		x: 0  foreach [i v] :rle [x: x + :i]  output: make block! x
	]

 foreach [i v] :rle [output: insert/only/dup :output get/any 'v :i]
	either into [:output] [head :output]
]


Instead of testing for strict format compliance of the input block, 
it uses get-words to keep people from sneaking in functions and then 
passes the length value to + and INSERT/dup, counting on the type 
tests of those functions to do the screening for us.
You're right, having the make block! take the foreach expression 
as a parameter is safe; I forgot that make block! can take none as 
a parameter.
That should work in R3 as well. Though FORSKIP might be faster than 
FOREACH in R3, the simplicity of the code might be worth it.
Maxim
10-Aug-2012
[784]
I like the detail of using  :i   to prevent function hacking.   I 
should use it more often.
BrianH
10-Aug-2012
[785x5]
Trick I picked up when securing the mezzanines. It's slightly faster 
to evaluate too since it does less work.
The reason I use :output there isn't to prevent function hacking, 
it's to prevent converting lit-path! values to the path! type.
Sometimes you want to allow someone to pass in functions and then 
let them evaluate, as long as you have a good semantic model for 
what is supposed to happen and are careful about how you call them. 
The ARRAY, EXTRACT and REPLACE functions in R3 and R2 2.7.7+ are 
a good example of this.
Slightly more optimal version for R3, taking advantage of how get-words 
and get-paths mean GET/any, and how FORSKIP is faster than FOREACH:
decode-rle: func [
	"Decode a run length encoded block"
	rle [any-block!] "Block of [integer value]"

 /into "Insert into a buffer instead (returns position after insert)"
	output [series!] "The output buffer (modified)"
	/local x
] [
	unless into [
		x: 0  output: make block! forskip rle 2 [x: x + :rle/1]
	]
	forskip rle 2 [output: insert/only/dup :output :rle/2 :rle/1]
	either into [:output] [head :output]
]
Maxim
10-Aug-2012
[790]
these would be nice funcs to add to mezz in R3 and R2-forwards
BrianH
10-Aug-2012
[791x2]
Darn, just found a bug in ARRAY for R2 and R3. Litwords are converted 
to words and litpaths are converted to paths.
This is so obscure that I doubt it has affected any existing code 
though.
Arnold
10-Aug-2012
[793x2]
Discovered that d: 1.1.1 then d/1/2: 0 and d:/1/3: 0 and then d/1/2: 
d/1/3: 1 results in d == 1.1.0 ??

This keeps me just inches away from releasing my script before my 
holiday.
Only thing to add is 1 small function to reduce the moves when the 
king is under attack. I discovered some weird VID behaviour too where 
alert boxes have strange formats they inherited from earlier defines 
fields.
Steeve
10-Aug-2012
[795]
An alternative for R3 (strings and blocks)
rle: func [s /local p e o][
	o: copy []
	parse/case s [
		any [

   p: skip any [e: if (p/1 == e/1) skip] (repend o [offset? p e p/1])
		]
	]
	o
]
BrianH
11-Aug-2012
[796x6]
Steeve, that's basically the same as my R2 RLE's block rule, but 
with the IF workaround replaced with IF. It has a few gotchas:
- Executes function values in block data
- Doesn't handle unset! or error! values

- Converts lit-paths to paths and lit-words to words before comparison 
and again before putting in the output.
- Lots of intermediate block creation overhead

- Considers bindings of words when comparing them, not just case-sensitive 
spelling


The first 3 can be handled by using :p/1 and :e/1 instead of p/1 
and e/1, and the fourth by using REDUCE/into instead of REPEND. The 
last one can't be handled by any built-in function or operator in 
R3 (see http://issue.cc/r3/1834for details) but you could do a combination 
of functions and operators to get case-sensitive comparison without 
considering bindings. PARSE/case's QUOTE operation is the fastest 
method for doing that at the moment.


Nice job on neatly bypassing the relaxed bounds checking of R3 blocks. 
Though the if (p/1 == e/1) would succeed if p/1 is none and e is 
at the end of the block, the skip would still fail. That trick saves 
one e: operation.
R3's == operator handles unset and error values better than R2's 
though, which is why the explicit unset! testing in the rule can 
be removed.
Having a strict line of progression for the R3 equalities turned 
out to be a bad idea, since the binding check seems to be tripping 
up case checks.
Unfortunately, R3 development was put on hold before that could be 
fixed.
If you *want* to consider binding when doing your comparisons, perhaps 
for more lossless in-memory compression, then Steeve's IF == method 
is the way to go. If you want true lossless compression then you 
could even use =? to make sure that only runs of exact references 
compress.
The advantages of == or =? comparison over PARSE QUOTE would be lost 
if you serialize the data and save it to a file or send it over a 
network. REBOL syntax doesn't keep track of those distinctions.