World: r3wp

[Core] Discuss core issues

TomBon
18-May-2009
[13761]
what is the fastest way to count the occurrence of each number
within a block containing 5000+ numbers?

e.g. [1 3 6 2 4 9 33 6 67 2 12 23 34 12 2 4 56 2 ...]

calculating the max and min with a simple loop and counter is
very time consuming for thousands of these blocks.
inserting into mysql and doing a 'group by' also.
any ideas for a faster solution?
Henrik
18-May-2009
[13762x2]
what if you sort the block first?
then FIND on the first occurrence of 1, of 2, of 3, of 4, etc. and 
store the indexes.
TomBon
18-May-2009
[13764x3]
yes henrik, nice idea. I was just thinking of something similar with
copy/part, but storing the index is much faster.
the numbers for the FIND can be supplied from a copy of the block
with UNIQUE
yep, will give it a try, thx henrik...
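A minimal sketch of the sort + UNIQUE + FIND idea discussed above (a
hypothetical helper, not code posted in the thread): on a sorted copy,
the count of each value is the distance between the first occurrences
of consecutive unique values.

count-by-index: func [data [block!] /local sorted uniq result prev-val prev-idx idx][
	sorted: sort copy data
	uniq: unique sorted		; already in ascending order, since sorted is
	result: copy []
	prev-val: first uniq
	prev-idx: 1
	foreach value next uniq [
		idx: index? find sorted value
		repend result [prev-val idx - prev-idx]
		prev-val: value
		prev-idx: idx
	]
	; the last unique value runs to the tail of the sorted block
	repend result [prev-val 1 + (length? sorted) - prev-idx]
	result
]

probe count-by-index [1 3 6 2 4 9 33 6 67 2 12 23 34 12 2 4 56 2]
== [1 1 2 4 3 1 4 2 6 2 9 1 12 2 23 1 33 1 34 1 56 1 67 1]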
Sunanda
18-May-2009
[13767]
I usually use Joel's tally function -- it saves me the time needed 
to write a faster function:
http://www.rebol.org/ml-display-message.r?m=rmlKDDS
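For reference, a tally-style counter along the same lines (only an
illustration of the general shape, not Joel's actual code from that
link) would return [value count] pairs:

tally-sketch: func [data [block!] /local result pos][
	result: copy []
	foreach value data [
		; /skip 2 treats result as [value count] records, so counts
		; are never mistaken for values during the search
		either pos: find/skip result value 2 [
			pos/2: pos/2 + 1
		][
			repend result [value 1]
		]
	]
	result
]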
Izkata
18-May-2009
[13768x2]
I think the fastest would be a variation of radix sort, if you know 
the max/min..

Setup for example:
>> Data: array/initial 5000 does [random 200]

== [36 119 148 112 87 59 176 21 95 138 183 28 1 119 14 74 39 78 141 
99 56 71 47 174 92 81 157 182 122 123 98 73 35 170 150 11 183 8...

>> MaxVal: fold Data 0 :max ;Just use a simple loop, I defined fold 
long ago for simplicity
== 200

What I mean:
>> Vals: array/initial MaxVal 0

== [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
>> foreach Val Data [Vals/:Val: Vals/:Val + 1]
== 27

>> repeat I MaxVal [if 0 <> Vals/:I [print rejoin [I {, } Vals/:I 
{ times}]]]
1, 24 times
2, 20 times
..........
199, 20 times
200, 16 times
Tally looks nice, though, and not as memory-intensive
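The FOLD helper used above isn't a built-in; a plausible definition
matching that call (series, start value, function), though not
necessarily Izkata's own, is:

fold: func [series [series!] acc fn [any-function!]][
	foreach item series [acc: fn acc item]	; apply fn to the running value and each item
	acc
]

>> fold [3 7 2] 0 :max
== 7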
Dockimbel
18-May-2009
[13770x2]
Here's one, maybe not fully optimized, but it should be fast enough:

count-occurences: func [data [block!] /local keys count idx v][
	keys: make hash! unique sort data	; hash! makes the FIND lookups fast
	count: array/initial length? keys 0
	parse data [
		some [
			set v skip (
				idx: index? find keys v
				poke count idx 1 + count/:idx
			)
		]
	]
	reduce [keys count]	; returns the unique values and their counts
]
>> data: array/initial 5000 does [random 200]

== [36 119 148 112 87 59 176 21 95 138 183 28 1 119 14 74 39 78 141 
99 56 71 47 174 92 81 157 182 122
123 98 73 35 170 150 11 183 8...
>> t: now/time/precise
== 22:01:04.156
>> res: count-occurences data

== [make hash! [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
21 22 23 24 25 26 27 28 29 30 31 32
 33 34 35 36 37 38 39 40 41 4...
>> now/time/precise - t
== 0:00:00.031
>> res/1

== make hash! [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
21 22 23 24 25 26 27 28 29 30 31 32
33 34 35 36 37 38 39 40 41 42...
>> res/2

== [24 20 19 28 30 18 45 28 19 24 25 23 36 13 17 29 28 26 17 28 23 
24 28 22 33 31 25 27 25 28 29 31 14
 24 20 21 30 20 29 26 19 26 2...
Steeve
18-May-2009
[13772x2]
Hmm... i think the tally func could be sped up with TomBon's trick
(using UNIQUE)
like this:

fast-tally: func [b [block!] /local c t u i v] [
	c: sort copy b
	u: unique c
	t: make block! length? u
	i: 1
	v: u/1
	foreach val next u [
		c: find c val
		t: insert t as-pair v negate i - i: index? c
		v: val
	]
	c: u: none
	head t
]

probe fast-tally [1 4 5 6 3 4  6 2 7 4 3 4]
== [1x1 2x1 3x2 4x4 5x1 6x2]
Maxim
18-May-2009
[13774]
steeve... why not use block parsing ?  ;-D
Steeve
18-May-2009
[13775]
i'll kill you ;-)
Maxim
18-May-2009
[13776]
LOL
Steeve
18-May-2009
[13777]
hum, there is a bug, the last value (7) is not processed, but u got
the idea...
Maxim
18-May-2009
[13778]
steeve you've got to stop giving examples that don't work!   ;-P
Steeve
18-May-2009
[13779]
but they work... in my mind
Izkata
18-May-2009
[13780]
arraycount: func [Data /local MaxVal Vals][
    MaxVal: Data/1
    foreach Val Data [if Val > MaxVal [MaxVal: Val]]
    Vals: array/initial MaxVal 0
    foreach Val Data [Vals/:Val: Vals/:Val + 1]
    Vals
]

So, so far:
>> Data: array/initial 5000 does [random 200]

>> time [tally copy Data] 50
== 0:00:00.561255
>> time [fast-tally copy Data] 50
== 0:00:00.34082
>> time [count-occurences copy Data] 50 
== 0:00:00.610159
>> time [arraycount copy Data] 50      
== 0:00:00.250419
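The TIME helper used for these measurements isn't standard REBOL; a
rough equivalent (a guess at its definition, not Izkata's actual code)
would be:

time: func [code [block!] count [integer!] /local start][
	start: now/time/precise
	loop count [do code]		; run the block COUNT times
	now/time/precise - start	; elapsed time as a time! value
]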
Steeve
18-May-2009
[13781]
Don't know if it's faster, but it's bug free and consumes less
memory.

fast-tally: func [b [block!] /local c u i] [
	c: sort copy b
	u: unique c
	i: 1
	forall u [
		c: any [find c u/2 tail c]
		u/1: as-pair u/1 negate i - i: index? c
	]
	c: none
	head u
]
Izkata
18-May-2009
[13782]
Looks mostly the same in speed:
>> time [fast-tally1 copy Data] 50
== 0:00:00.379225
>> time [fast-tally2 copy Data] 50
== 0:00:00.392215
Steeve
18-May-2009
[13783x3]
hum, i forgot, forall is slow within R2...
Should be a little better...

fast-tally: func [b [block!] /local c u i] [
	c: sort copy b
	u: unique c
	i: 1
	until [
		c: any [find c u/2 tail c]
		u/1: as-pair u/1 negate i - i: index? c
		tail? u: next u
	]
	c: none
	head u
]
izkata, in R2 maximum-of is a native function.
with that, your solution should definitely be the fastest
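What Steeve is suggesting, roughly: MAXIMUM-OF returns the series
positioned at the largest value, so FIRST extracts it. An illustrative
variant (not necessarily the exact code Izkata timed below):

arraycount2: func [Data [block!] /local MaxVal Vals][
	MaxVal: first maximum-of Data	; native max scan instead of a foreach loop
	Vals: array/initial MaxVal 0
	foreach Val Data [Vals/:Val: Vals/:Val + 1]
	Vals
]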
Izkata
18-May-2009
[13786x2]
Sorting seems to be the big bottleneck:


arraycount with sorted data (using maximum-of, same results as sorted):
>> time [arraycount copy SDat] 50
== 0:00:00.172956
fast-tally with unsorted data:
>> time [fast-tally copy Data] 50
== 0:00:00.357852
fast-tally with the sort step removed, with pre-sorted data:
>> time [fast-tally copy SDat] 50 
== 0:00:00.103735
So which is best depends on what Tom is using it for, I guess
TomBon
19-May-2009
[13788]
this is great, very fast code. thanks a lot for this.
the code is used for generating dynamic parsing rules
for financial market data. soon I will release free
software based on this data.
Steeve
19-May-2009
[13789x2]
Seems that Tally is faster with R3.
2 reasons:
SORT is faster
MAXIMUM-OF is no longer native

(btw some of us argued that it was a bad idea to make maximum-of
no longer native in R3, but we were preaching in the desert, so now
we can see the drawback)
(Tally faster than radix, I meant)
BrianH
19-May-2009
[13791]
MAXIMUM-OF was rarely used, and is now mostly native rather than 
fully native. The majority of its processing is the loop, which is 
now native. MAXIMUM-OF and MINIMUM-OF are so rarely used that they 
might not make the mezzanine cut; they might be moved to R3/Plus.
Izkata
19-May-2009
[13792x2]
I would definitely prefer a native fold-type function over maximum-of 
and minimum-of
which I think was mentioned under the name accumulate?
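With a fold-style helper like the one sketched earlier, MAXIMUM-OF's
value can be written in one line (the real ACCUMULATE's argument order
may well differ; this is only illustrative):

fold next Data first Data :max	; seed with the first element, fold MAX over the rest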
BrianH
19-May-2009
[13794]
We tried, but that has definitely been declared an R3/Plus function,
not native. Making ACCUMULATE native wouldn't help much - the mezzanine
version is really fast already. You won't be able to get functional
language speed even from a native because that requires compiler
support, and REBOL doesn't have a compiler at all.
Izkata
19-May-2009
[13795]
Okay, good to know.  Glad it's in, at least
BrianH
19-May-2009
[13796x2]
R3/Plus has a low barrier to entry, because it will be modular. People 
can just not include the modules they don't need.


Conversely, there will be a high standard for inclusion as mezzanine
or built-in native code, because you won't as easily be able to remove
it if you don't need it. Many R2 mezzanines, and even some natives,
won't make the cut.
The plugin architecture will make it less necessary to include native 
code you don't need. R3/Plus lets us say yes without bloating the 
core. R3 is going to be much tighter than R2.
Graham
19-May-2009
[13798]
Such as?  I'd hate to have a cgi script running and find that functions
I need aren't included.
BrianH
19-May-2009
[13799]
Use the needs header to load the modules you need and you can be
sure they'll be included. As for which functions will be mezzanine,
that is still in question for many functions. If you want to make
sure that your favorites are there, participate in the discussion.
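A sketch of what that looks like in a script header (the exact Needs
syntax in R3 was still settling at the time, and the module name here
is hypothetical):

REBOL [
	Title: "My CGI script"
	Needs: [%my-module.r]	; modules this script depends on
]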
Graham
19-May-2009
[13800x2]
page: read/custom [
	scheme: 'http host: "twitter.com" target: "direct_messages/new.xml"
	user: "my-twitter-id" pass: "mypassword"
] [POST "text=This was also sent from a Rebol&user=synapse_emr"]


This sends a private tweet to a user ... not clear from the API docs 
what the call is to just tweet ... anyone know?
ooops ... wrong group.
Maxim
19-May-2009
[13802]
this is easy to figure out using firebug  ;-)


redirect the html page to a server where you have cheyenne running and
save out the whole http request; you will have the url and post data
:-)
Graham
19-May-2009
[13803x2]
just reinstall wireshark
But there is a  twitter restful api ... just can't find it there. 
  Must be blind.
Janko
19-May-2009
[13805]
too bad accumulate is not in the core, I totally ditched python for 
any further work because guido v.r. removed the basic (quite poor) 
functional constructs it had
Steeve
19-May-2009
[13806]
If that's so, most existing mezzanines should be moved into external
modules.
Because i use mezzanines only to test some ideas.

But if I have to do the "real work", then i use only natives. Don't
ask why, it's obvious.
Graham
19-May-2009
[13807]
Don't you end up writing your own mezzanines then?
Janko
19-May-2009
[13808]
to me one of the great beauties of rebol is that everything is an
expression, like "either", and thereby composable ..  and that you don't
need explicit return statements.. two of the big features that I think
aid a more functional (not imperative) design of programs...
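A small example of that composability (hypothetical function, just to
illustrate the point): EITHER returns the value of whichever block it
evaluates, and a function's last expression is its return value.

sign-of: func [n [number!]][
	either n < 0 ["negative"]["non-negative"]	; no RETURN needed
]

>> sign-of -3
== "negative"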
Steeve
19-May-2009
[13809x2]
but they have to be strictly adapted for their purpose, so
i remove code related to needless use cases.
speed is a priority, especially with Rebol, which can be really slow
if you don't take care