World: r3wp

Join the discussions in the REBOL3 world...

[Core] Discuss core issues

older newer	first last
Anton 20-May-2007 [8108x3]	>> bin: #{000005000006000007} >> repeat index 3 [bin-int: copy/part skip bin (index - 1 * 3) 3 print [index mold bin-int to-integer bin-int]] 1 #{000005} 5 2 #{000006} 6 3 #{000007} 7
	binary! is indexed per byte, so if we use 3 bytes per value, we must multiply the index by 3 to find the correct value.
	We must do this in our functions which should allow retrieving, changing and inserting the value.
Terry 20-May-2007 [8111]	would there be a performance trade off as opposed to just a block of integers? (finding/ appending etc.)
Anton 20-May-2007 [8112x3]	Note that converting integer <-> binary is a little tricky. But we can use a struct! (re-use a shared struct for optimal performance).
	Yes, there is lost performance. Obviously, you must go via your access functions each time you store and retrieve.
	(This is addressed by Rebol3's vector! datatype, which stores different values this way.)
Terry 20-May-2007 [8115x2]	well, performance is the primary concern
Terry 20-May-2007 [8115x2]	i thought crawling a block of binary (hash?) would be faster than a block of integers
Anton 20-May-2007 [8117]	How many numbers are there likely to be ? What do they represent ?
Terry 20-May-2007 [8118]	Here's what Im trying to do.. I have a dictionary... dict: ["one" "two" "three" ... ] with a couple of million values I want to build an index block .. dex: [ 2 3 2 1] that represents the index? of each word in my dictionary
Anton 20-May-2007 [8119]	Is this to get around rebol's current word limit ?
Terry 20-May-2007 [8120x5]	Only partly
	but there's a compression value as well..
	a value in the dictionary could be a page of rebol code (as a string) .. and it could be represented with a single bit (or in this case.. 3 bytes)
	the values never change.. they may be deleted... and their index re-valued.. or the dictionary may be appended..
	so then my code becomes a block of 3 byte 'symbols' that I use to 'pick' the actual values out of the dictionary with.
Anton 20-May-2007 [8125x2]	Is the aim to create a fixed-width number representation for the index ?
Anton 20-May-2007 [8125x2]	(I mean, a fixed-width of 3 characters.)
Terry 20-May-2007 [8127x3]	i would rather have it not fixed, and grow as needed
	2 bytes would be plenty to start.. but would quickly grow to 3.. and very slowly (if ever.. but possible) grow to 4 bytes..
	so 3 bytes is fine
Anton 20-May-2007 [8130]	When the dictionary grows more than the current index limit (3 bytes -> 4 bytes), then the entire index block would need to be reprocessed to add the extra byte to each index value.
Terry 20-May-2007 [8131x5]	so.. for compaction.. the words most commonly used should be 1 byte.. least common.. 3 bytes
	that's fine.. the index can be rebuilt rather easily
	dict: ["Rebol" "Carl" "isa"] are common.. should be represented as [1 2 3 ] .. and dict:["This is a rare string.. probably only used once"] shoud be [ ZZZ ]
	Even though that last dict value is long, it only takes 3 bytes to take advantage of it.. see where Im going?
	Im going to store the entire dict in memory as a hash!, and my code as a separate block of 3 byte "symbols"
Anton 20-May-2007 [8136]	Your 2 million entry index, with each index value of 3 bytes, should theoretically take 5.7 MB. Do you really need to reduce that with compaction ?
Terry 20-May-2007 [8137]	not really...again.. performance is the main thing
Anton 20-May-2007 [8138]	(They aren't symbols by the way, they're 3-byte integers.)
Terry 20-May-2007 [8139]	an integer is a symbol
Anton 20-May-2007 [8140]	Compaction makes everything more complex. You are writing more and more code to implement that.
Terry 20-May-2007 [8141]	the sound you utter when you say "symbol" is a symbol
Anton 20-May-2007 [8142]	a symbol is not (only) an integer
Terry 20-May-2007 [8143x7]	now we're talking semantics
	my main concern is crawling the block of integers... looking for needles in haystacks.. whatever is the FASTEST method for this is best.. compaction is a by-product
	what's faster, looking for "�%R" or 23472937 ?
	or 16581375 rather
	The dictionary will have 2 million strings... the index would be much smaller
	index being the DB using triples as schema
	Here's a question.. when a block of integers is converted to a HASH!.. what actually happens to it?
Anton 20-May-2007 [8150x2]	Hashes, lists and blocks can contain any type of value. The values are not affected by the container type.
Anton 20-May-2007 [8150x2]	A hash just has a different way of linking the items it contains together, so it performs faster in some operations at the expense of others.
Oldes 20-May-2007 [8152x2]	If you just need to save large arrays of integers, you can use format used in AS3: The AS3 Integer can be encoded into between 1 and 5 bytes. * if the integer is between 0�00 and 0x7F then only one byte (representing the integer) * if between 0�80 and 0x3FFF then 2 bytes : o (i & 0x7F) \| 0�80 o (i � 7) * if between 0�4000 and 0x1FFFFF then 3 bytes : o (i & 0x7F) \| 0�80 o (i � 7) \| 0�80 o (i � 14) * if between 0�200000 and 0xFFFFFFF then 4 bytes : o (i & 0x7F) \| 0�80 o (i � 7) \| 0�80 o (i � 14) \| 0�80 o (i � 21) * if more or equal than 0�10000000 : o (i & 0x7F) \| 0�80 o (i � 7) \| 0�80 o (i � 14) \| 0�80 o (i � 21) \| 0�80 o (i � 28)
Oldes 20-May-2007 [8152x2]	but if jyou need to search such an array, it's not the best for your... especially if there is no rebcode yet:(
Anton 20-May-2007 [8154x2]	That's interesting, using a bit in each byte to "escape" and open up the next, more significant byte. So it's a variable-length encoding scheme. (I guess, kind of like UTF8)
Anton 20-May-2007 [8154x2]	But yes, I think this adds complexity and would slow down the access routines.
Terry 20-May-2007 [8156]	yeah... I think the conclusion is ... don't worry about the number of bytes (and thus mem) when using plain integers with my index block, as it's much smaller than the dictionary anyways.. and unless Im shown otherwise, the crawling of it (find/ foreach, append) should be about as fast as any other method, right?
Anton 20-May-2007 [8157]	It should be faster and simpler to just use integers. When you want to cut down the size of your 2 million integers (=30MB), you can then look at implementing 3-byte integers packed in a binary.
older newer	first last