r3wp [groups: 83 posts: 189283]
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

World: r3wp

[Core] Discuss core issues

Sunanda
20-May-2007
[8102]
oops -- anton typed faster than me :-)
Terry
20-May-2007
[8103]
ahh ok..    ie: [a4b   4°¢  ÑAg]   etc ?
Sunanda
20-May-2007
[8104]
That would be a block of binary....each binar would take a slot
Anton is suggesting this sort of approach:
x: make binary! 25000
>> loop 18 [insert x to-char random 255]
== #{5E218289FC8B65B86A1C597C232F9E8C79}
Just one binary string to hold the data.
Terry
20-May-2007
[8105x2]
is there a function to convert an integer to 3 bytes?
ahh.. nice
Anton
20-May-2007
[8107x4]
No, you must make the functions to convert  3-byte-int <--> integer!
>> bin: #{000005000006000007}

>> repeat index 3 [bin-int: copy/part skip bin (index - 1 * 3) 3 
print [index mold bin-int to-integer bin-int]]
1 #{000005} 5
2 #{000006} 6
3 #{000007} 7
binary! is indexed per byte, so if we use 3 bytes per value, we must 
multiply the index by 3 to find the correct value.
We must do this in our functions which should allow retrieving, changing 
and inserting the value.
Terry
20-May-2007
[8111]
would there be a performance trade off as opposed to just a block 
of integers? (finding/ appending etc.)
Anton
20-May-2007
[8112x3]
Note that converting integer <-> binary is a little tricky. But we 
can use a struct! (re-use a shared struct for optimal performance).
Yes, there is lost performance. Obviously, you must go via your access 
functions each time you store and retrieve.
(This is addressed by Rebol3's vector! datatype, which stores different 
values this way.)
Terry
20-May-2007
[8115x2]
well, performance is the primary concern
i thought crawling a block of binary (hash?) would be faster than 
a block of integers
Anton
20-May-2007
[8117]
How many numbers are there likely to be ? What do they represent 
?
Terry
20-May-2007
[8118]
Here's what Im trying to do.. 

I have a dictionary...   dict: ["one" "two" "three" ... ]    with 
 a couple of million values

I want to build an index block .. dex: [ 2 3 2 1]  that represents 
the index? of each word in my dictionary
Anton
20-May-2007
[8119]
Is this to get around rebol's current word limit ?
Terry
20-May-2007
[8120x5]
Only partly
but there's a compression value as well..
a value in the dictionary could be a page of rebol code (as a string) 
 .. and it could be represented with a single bit  (or in this case.. 
3 bytes)
the values never change.. they may be deleted... and their index 
re-valued.. or the dictionary may be appended..
so then my code becomes a block of 3 byte 'symbols' that I use to 
'pick' the actual values out of the dictionary with.
Anton
20-May-2007
[8125x2]
Is the aim to create a fixed-width number representation for the 
index ?
(I mean, a fixed-width of 3 characters.)
Terry
20-May-2007
[8127x3]
i would rather have it not fixed, and grow as needed
2 bytes would be plenty to start.. but would quickly grow to 3.. 
and very slowly (if ever.. but possible) grow to 4 bytes..
so 3 bytes is fine
Anton
20-May-2007
[8130]
When the dictionary grows more than the current index limit (3 bytes 
-> 4 bytes), then the entire index block would need to be reprocessed 
to add the extra byte to each index value.
Terry
20-May-2007
[8131x5]
so.. for compaction.. the words most commonly used should be 1 byte.. 
least common.. 3 bytes
that's fine.. the index can be rebuilt rather easily
dict: ["Rebol"  "Carl"  "isa"]    are common..  should  be  represented 
as [1  2  3 ]  .. and dict:["This is a rare string.. probably only 
used once"]   shoud be  [ ZZZ ]
Even though that last dict value is long, it only takes 3 bytes to 
take advantage of it.. see where Im going?
Im going to store the entire dict in memory as a hash!, and my code 
as a separate block of 3 byte "symbols"
Anton
20-May-2007
[8136]
Your 2 million entry index, with each index value of 3 bytes, should 
theoretically take 5.7 MB. Do you really need to reduce that with 
compaction ?
Terry
20-May-2007
[8137]
not really...again.. performance is the main thing
Anton
20-May-2007
[8138]
(They aren't symbols by the way, they're 3-byte integers.)
Terry
20-May-2007
[8139]
an integer is a symbol
Anton
20-May-2007
[8140]
Compaction makes everything more complex. You are writing more and 
more code to implement that.
Terry
20-May-2007
[8141]
the sound you utter when you say "symbol" is a symbol
Anton
20-May-2007
[8142]
a symbol is not (only) an integer
Terry
20-May-2007
[8143x7]
now we're talking semantics
my main concern is crawling the block of integers... looking for 
 needles in haystacks.. whatever is the FASTEST method for this is 
best.. compaction is a by-product
what's faster, looking for  "¥%R" or   23472937 ?
or 16581375 rather
The dictionary will have 2 million strings... the index would be 
much smaller
index being the DB using triples as schema
Here's a question.. when a block of integers is converted to a HASH!.. 
what actually happens to it?
Anton
20-May-2007
[8150x2]
Hashes, lists and blocks can contain any type of value. The values 
are not affected by the container type.
A hash just has a different way of linking the items it contains 
together, so it performs faster in some operations at the expense 
of others.