r3wp [groups: 83 posts: 189283]
World: r3wp

[Profiling] Rebol code optimisation and algorithm comparisons.

Terry
20-May-2010
[270]
so to determine what Sylvester is.. 

s: i1/("Sylvester")
>> [4]
p: i2/("isa")
>> [1 4]

do an intersect to get 4
then the value = index (4) + 2
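
A minimal sketch of that lookup with illustrative data (here i1 maps subjects, and i2 predicates, to the start positions of their triples in a flat block):

	triples: ["Tweety" "isa" "bird" "Sylvester" "isa" "cat"]
	i1: make hash! ["Tweety" [1] "Sylvester" [4]]   ; subject -> triple start positions
	i2: make hash! ["isa" [1 4]]                    ; predicate -> triple start positions
	hits: intersect i1/("Sylvester") i2/("isa")     ; == [4]
	print pick triples 2 + first hits               ; value slot is start + 2 -> "cat"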
Ladislav
20-May-2010
[271]
Why not just use the index of the first part of the triple?

 - yes, it can be done that way, but if you need to remove triples, 
 you find out that it may be practical
Terry
20-May-2010
[272]
how does the 4th make it more practical?
Ladislav
20-May-2010
[273]
how...

 - this way, you will have a triple ID that does not change even if 
 other triples are removed. That is not true for your index.
Terry
20-May-2010
[274]
If you remove a triple, you'd need to reindex everything, no?
Ladislav
20-May-2010
[275]
actually, that would be costly
Terry
20-May-2010
[276x2]
I'd tend to null it out instead, and reuse it later
I have thought of a fourth ID for a triple before.. but it represents 
the triple itself... rarely needed it though... there's always a 
way around
Ladislav
20-May-2010
[278]
Yes, sure, as I said, it is practical mainly in case you want to 
have a simple and fast triple delete operation
Terry
20-May-2010
[279x2]
yeah.. that could remove some ambiguity
er.. wait, there's no ambiguity.. http://en.wikipedia.org/wiki/Single_Source_of_Truth
Ladislav
20-May-2010
[281]
...but you are right, that there is always a way around
Terry
20-May-2010
[282x2]
In my old AtomDB i made in Rebol some time ago, I had an index for 
each value.. ie: the number "42" was stored only once, and used the 
index with the inference.. 

values: [ "42" "November" "Baseball"..]


You could have a million birthdays recorded in "November", with a 
single element

billbdaymonth: 2
frankbdaymonth: 2
frankage: 1
the answer to life the universe and everything: 1
It makes more sense for larger values.. nice when you can use a single 
symbol to represent a 1GB binary

Anyway, thanks everyone for the help.
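
A rough sketch of the value-interning idea described above (names and data are illustrative): each distinct value is stored exactly once and everything else refers to it by position.

	values: ["42" "November" "Baseball"]       ; every value stored exactly once
	intern: func [v /local pos] [
		either pos: find values v [index? pos] [
			append values v
			length? values                     ; position of the newly added value
		]
	]
	bill-bday-month: intern "November"         ; == 2
	frank-age: intern "42"                     ; == 1
	print pick values frank-age                ; "42"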
Ladislav
20-May-2010
[284]
Just BTW, I doubt that INTERSECT is the fastest way to intersect 
two hashes; might need some profiling too.
Maxim
20-May-2010
[285]
terry, any benchmark results on a dense search with a few million 
records?

also: how long is generating the index on that data

and how much RAM is your triple index consuming?
Steeve
20-May-2010
[286]
Ladislav, no prob with a map!
Terry
20-May-2010
[287x4]
Max, building the indexing functions now
indexing shouldn't take too long.. one foreach pass against the data
probably a faster way, but I'll let the gurus beat their chests over 
it :)
Izkata.. here's another option.. not sure how efficient it is.. 

orig: ["one" "two" "three" "four" "five" "six"]
blk: []

foreach [a b c] orig [
	d: array/initial [1] reduce [a b c]
	append blk d
]
Izkata
20-May-2010
[291x2]
Well, if you're doing it that way, "append/only blk reduce [a b c]" 
should be faster - no call to 'array

May as well also have "blk: make block! (length? orig) / 3" - don't 
remember how much of a speedup that was in your previous tests, though
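
Putting those two suggestions together, a small sketch of the chunking (data is illustrative):

	orig: ["one" "two" "three" "four" "five" "six"]
	blk: make block! (length? orig) / 3              ; pre-size: one slot per triple
	foreach [a b c] orig [append/only blk reduce [a b c]]
	; blk == [["one" "two" "three"] ["four" "five" "six"]]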
Terry
20-May-2010
[293x11]
**Izkata beats his chest, challenging all comers ** 
:)
Max: Indexing 244,327 triples took 4:20 on a rather snappy Windows 
box with 6 GB RAM
DB is consuming 114 MB RAM
(Rebol 2)
1 GB RAM = 2,143,219 triples and their indexes (approx)
Within that data (real world) i have 3,733 cable triples ("dfuflfi33s" 
"isa" "cable") 
("dfuflfi33s" is a unique 10-byte ID that i stick on everything)


Using the code posted earlier, i can pull a final block of indexes 
for each of those cables in .012 seconds
crawling with foreach to get the same result is .09  or about 9 times 
slower
pulling a single unique email address (out of 429) took  0.013

so the time seems somewhat fixed around 0.012 seconds regardless of 
the density
There's a fundamental flaw with this.. let's say i find 3000 customers, 
but now i want to pull their email property

foreach customer customers [ (pseudo) intersect the customer with 
i1 and the email i2 ]
once is fine.. but 3000 loops is really slow

probably need to store the properties as objects

data: [ "bob" [ email "bob@example.com" lname "Jones"] ...]

then continue to index email, lname etc
OK, take all the noise above and throw it away.. i have the perfect 
solution (how's that for a tease?)
It ties in with my old AtomDB developments, but uses some of the 
methods we went over.. AtomDB used foreach to crawl.. fine for smaller 
datasets but too slow on large data
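
A sketch of that nested-properties layout (field names and values are illustrative): once a subject has been found, its other properties come back with a couple of SELECTs instead of another intersect per property.

	data: [
		"bob"   [email "bob@example.com" lname "Jones"]
		"frank" [email "frank@example.com" lname "Smith"]
	]
	props: select data "bob"
	print select props 'email          ; bob@example.com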
Maxim
20-May-2010
[304]
foreach is always slower than find/part, so whatever algorithm you 
use, replace it with that in its loop and it will be even better.
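
As an illustration of letting a native do the scanning instead of an interpreted loop (shown here with find/skip, which only matches at fixed-size record boundaries; the exact refinement to use is an assumption, not Maxim's prescription):

	triples: ["tweety" "isa" "bird" "sylvester" "isa" "cat"]

	; interpreted loop: visits every record
	foreach [s p o] triples [if s = "sylvester" [print o]]

	; native scan: jumps straight to the matching record
	pos: find/skip triples "sylvester" 3
	if pos [print pick pos 3]          ; "cat"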
Terry
20-May-2010
[305]
It's becoming apparent that these large datasets need to be reduced 
when it comes to FINDing strings



imagine: ["one" "two" "three" "four" "five"... "one hundred million"] 
 

should be

o: ["one" "one hundred million"]
t: ["two" "three"]
f: ["four" "five"]

or even

fo: ["four" "fox" "food"]
fi: ["five" "fifty" "fiver niner"]

pull out the first two chars and there's your search block
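
A sketch of that two-character bucketing (bucket layout and data are illustrative):

	buckets: make hash! []
	foreach w ["four" "fox" "food" "five" "fifty" "one"] [
		key: copy/part w 2
		either blk: select buckets key [append blk w] [
			append buckets reduce [key reduce [w]]
		]
	]
	; select buckets "fo" == ["four" "fox" "food"]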
Graham
21-May-2010
[306]
You could make a complete hash of it.
Terry
21-May-2010
[307]
http://sociopathways.files.wordpress.com/2009/12/hash-browns.jpg
Terry
22-May-2010
[308x3]
Steeve, back to your method with some changes.. 
Some early benchmarks on this new method
3,000,000 objects


subjects: [ [age 42 fname "Steeve"] [fname "Tweety" age 75] [url "http://rebol.com"] ... ]

3,000,000 key / value pairs that index the subjects block

i1: [ "Steeve" 1 "Tweety" 2 "Rebol" 3 ... ] 

then.. 

	theindex: select i1 "Rebol"
	out: pick subjects theindex


I can do 1,000,000 of these in 1.3 seconds, or about 769,230/s (older 
box with 2 GB RAM)

then finish the other two triples index (preds vals)

i2: [age [ 1 2] fname [1 2]  url [ 3]]

i3 should have its values converted to md5s (as suggested) 

i3: [ #{12CE1E73A3EABF1623970A0C53B9CA1F} [3] ]
The problem with the other method was the overall cost of looking 
up properties .. if all you needed was the single property, (ie all 
things that are age = 75) then that was fine, but usually you want 
more info. So if you found 5000 things that match the query, you'd 
need to run the whole process 5000 times to get all their email addresses. 


It just makes more practical sense to keep things as things, along 
with all of their properties.
(I prefer the term "things" over "objects" as triples don't usually 
contain methods, just properties)
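
For the i3 value index, a sketch of reducing large values to fixed-size keys (assuming checksum/method with 'md5 is available in the Rebol 2 build being used; data is illustrative):

	subjects: [[age 42 fname "Steeve"] [fname "Tweety" age 75] [url "http://rebol.com"]]
	value: "http://rebol.com"
	key: checksum/method value 'md5            ; 16-byte binary!, however large the value
	i3: reduce [key [3]]                       ; md5 of value -> subject positions
	probe pick subjects first select i3 key    ; == [url "http://rebol.com"]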
Terry
26-May-2010
[311]
Any thoughts on speeding up intersect when comparing large blocks?
Ladislav
26-May-2010
[312x3]
You should not use Intersect on blocks
(if you want speed)
my suggestion is to try it on hashes
Ashley
26-May-2010
[315]
intersect is a native so the only way you can get faster performance 
is if you know something about the blocks' contents. For example, 
if both blocks contain sorted integers *and* one block is smaller 
than the other *and* the bigger block contains the first value of 
the smaller, then the following code *may* be faster than intersect:

	a: [2 3 4]
	b: [1 2 3 4 5]
	x: last b                        ; largest value in b (both blocks sorted)
	b: find b first a                ; start b where a's first value occurs
	r: copy []
	foreach i a [
		all [i > x break]            ; i is beyond the end of b - stop early
		all [find b i append r i]    ; i also occurs in b - keep it
	]
	b: head b                        ; restore b to its head
Ladislav
27-May-2010
[316x2]
When using Intersect on blocks, you should take into account that 
their contents are hashed first. Therefore, by using hashes instead 
of blocks, the hashing operation itself may be spared.
but I did not do any speed comparisons, so it is up to you, Terry
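
A small comparison sketch along those lines (sizes and data are illustrative; no result is implied):

	b1: copy []  b2: copy []
	repeat i 200000 [append b1 i]
	repeat i 200000 [append b2 i + 100000]     ; half of the values overlap
	h1: make hash! b1
	h2: make hash! b2

	t: now/precise
	intersect b1 b2
	print ["block!" difference now/precise t]

	t: now/precise
	intersect h1 h2
	print ["hash!" difference now/precise t]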
Terry
27-May-2010
[318x2]
i just tried some comparisons and the time is the same
however, I think I've finally created a rocket db