r3wp [groups: 83 posts: 189283]
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

World: r3wp

[Core] Discuss core issues

Terry
17-May-2010
[16691x2]
>> a

== [23 43 55 28 345 99 320 48 22 23 95 884 1000000 999999 999998 
999997 999996 999995 999994 999993 999992 999991 999990 999989 999...
>>

(a is a block with 1, 000,012  integers) 


ind: func[series value x][


	st: now/time/precise
		
	while [x > 0][
		
		result: make series 0
		series: head series

  parse series[any [series: 1 1 value (append result index? series) 
  | skip]] 
		x: x - 1
	]
	et: now/time/precise
	fin: et - st
	print fin
	result 
]

feach: func[series value x][

	result: make series 0 
	st: now/time/precise
	
	while [x > 0][

  foreach[s p v] series [if s = value [append result reduce [s p v]]]
		x: x - 1
	]
	et: now/time/precise
	fin: et - st
	print fin
	result 
]

>> ind a 23 10
0:00:01.249
== [1 10 999990]
>>

>> feach a 23 10
0:00:01.01

== [23 43 55 23 95 884 23 43 55 23 95 884 23 43 55 23 95 884 23 43 
55 23 95 884 23 43 55 23 95 884 23 43 55 23 95 884 23 43 55 23 9...
>>

10 iterations each.. 


foreach is the winner speed wise.. as a bonus, If i use foreach, 
I don't need the index?
Still this is too slow.. it's fine for 1M data, but 10M and it grinds 
hard.

.. there must be a faster way to find integers in a block (or hash) 
using SELECT, FIND or INDEX?
Maxim
17-May-2010
[16693x7]
yes... you do a find, within a while block.   that was the fastest 
loop I did for matching exact SAME? strings.  I'd bet its the best 
here too.
or did you test that already?
the problem is that with hash! find will return both keys and data 
though.
also, until is a bit faster than while IIRC.
might want to refer to this big loop analysis I did last year...


http://www.pointillistic.com/open-REBOL/moa/scream/rebol-iterator-comparison.txt
you'd be surprised that the fastest search method is remove-each 
(if you account for inserting/appending as part of the loop).   but 
you'll need to copy the initial block each time, which might take 
its own time.
in your tests, this causes your loops to slow down a lot:

result: make series 0

you should:

result: make series length? series. 


because when appending, you will be re-allocating the result over 
and over... and the GC is the loop killer in every single big dataset 
tests I've done.  not because its bad, per se, but because you are 
forcing it to operate.
Terry
17-May-2010
[16700]
Thanks Maxim, I'll check it out.
Maxim
17-May-2010
[16701x2]
I'm building a little example which I think will outperform your 
examples... I'm curious.
I'm just about to test it.
Terry
17-May-2010
[16703]
moving the result: make series 0 out of the while loop had a 10 milli 
improvement over 10 iterations.
Maxim
17-May-2010
[16704]
that might account for the rarity of results.  if your data has a 
lot of occurences, then that number will increase, since your result 
block will grow much more.
Gregg
17-May-2010
[16705]
You may need to move beyond brute force Terry, and set up a data 
structure to optimize the searching.
Terry
17-May-2010
[16706x6]
There must be a way.

An index is a symbol that represents some value. What I need is to 
add some metadata to that index, and make that searchable.
the goal is a blazing key/value store that's as fast as pulling by 
index
:)
I thought i had it with find/any .. but it doesn't work on blocks
I haven't tried it, but my guess is storing or converting values 
to string to use find/any on will be slower than foreach. as in foreach 
[key value] n [...
and i don't think you can search keys in hash or map! without using 
foreach?
I suppose the while loop with counter, is adding extra burden in 
my examples as well so real world would be faster
Maxim
17-May-2010
[16712]
nope... since that is 10 ops within hundreds of millions.
Terry
17-May-2010
[16713x2]
The other thing i considered was using pairs! to as a pair of symbols, 
but can't search those either without foreach
ie: [ 23x54 "value 1" 984x2093 "value 2"]
Maxim, the foreach results against strings in your doc is an interesting 
result.
Maxim
17-May-2010
[16715]
right now, for sparse data, I've got an algorithm that traverses 
5 times faster than your foreach example.  but if its really dense, 
then it will be slower.
Terry
17-May-2010
[16716]
I'm thinking 10,000,000 values min, and preferably to max memory
Maxim
17-May-2010
[16717x3]
10 x 10 million items, with a single value within the block...  

0:00:00.234  using a 1.5GHz laptop.
vs: 
0:00:01.594 for your foreach example.
plus record-size is variable, and is supplied as a parameter to the 
function.
Terry
17-May-2010
[16720]
not bad.. with index? i was getting around 2.5million/sec against 
100,000
Maxim
17-May-2010
[16721]
find-fast: func [
	series
	value
	record-length
	i "iterations"
	/local result st s
][
	
	st: now/precise
	while [i > 0][
		result: make series 0 ;length? series
		s: head series
		until [
			not if s: find  s value [
				either 0 = mod -1 + (index? s) record-length [
					insert tail result copy/part series record-length
					not tail? s: next s
				][
					s: next s
				]
			]
		]
		i: i - 1
	]
	
	print difference now/precise st
	print length? result
	
	result
]
Terry
17-May-2010
[16722x2]
nice name
..
we can call the winner find-fastest
Maxim
17-May-2010
[16724x2]
this can probably be optimised further...  but note that the algorithm 
scales with density of your block
density = ratio of matches vs size of block
Terry
17-May-2010
[16726x4]
can it be modified for searching keys?
and especially pattern matching of some sort?
I gues find/part would work if the data were string? no?
On your doc, regarding foreach you mentioned.. 

blocks get faster as you grab more at a time, while strings slow 
down, go figure!
 
but the stats look the opposite?
Maxim
17-May-2010
[16730x2]
if you look at the speeds,  cycling through two items at a time should 
give roughly twice the ops/second.


when you look at the results... blocks end up being faster than this, 
and strings are less than this.
btw, in my find-fast above... when search matchs aprox 1/10 times, 
it ends up being twice as slow as the foreach... 

so speed will depend on the dataset, as always.
Andreas
17-May-2010
[16732]
maxim, your find-fast returns the values, not the indices (not sure 
if that was intended)
Maxim
17-May-2010
[16733]
so does terry's
Andreas
17-May-2010
[16734]
his "feach" yes, but not his "ind" adaptation
Maxim
17-May-2010
[16735x2]
I'm comparing to his feach, which was faster... but if I just returned 
the index, it would be faster still (no copy/part).
I'm trying out something else, which might be even faster...
Andreas
17-May-2010
[16737x2]
indices?-maxim-1: func [series value /local result][
    result: make block! length? series
    until [
        not if series: find series value [
            append result index? series
            series: next series
        ]
    ]
    result
]
that's your above find, stripped of the record-length stuff, and 
returning indices instead of the actual values
Maxim
17-May-2010
[16739]
btw... I discovered  //  instead   of   mod,   which is twice as 
fast..... turns out  MOD is a mezz... never noticed that.
Andreas
17-May-2010
[16740]
prin "Initialising ... "
random/seed now/precise
dim1: 10'000'000
dim2: dim1 / 2
needle: random dim2
haystack: make block! dim1
loop dim1 [append haystack random dim2]
print "done"