r3wp [groups: 83 posts: 189283]
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

World: r3wp

[Core] Discuss core issues

Terry
20-May-2007
[8133x3]
dict: ["Rebol"  "Carl"  "isa"]    are common..  should  be  represented 
as [1  2  3 ]  .. and dict:["This is a rare string.. probably only 
used once"]   shoud be  [ ZZZ ]
Even though that last dict value is long, it only takes 3 bytes to 
take advantage of it.. see where Im going?
Im going to store the entire dict in memory as a hash!, and my code 
as a separate block of 3 byte "symbols"
Anton
20-May-2007
[8136]
Your 2 million entry index, with each index value of 3 bytes, should 
theoretically take 5.7 MB. Do you really need to reduce that with 
compaction ?
Terry
20-May-2007
[8137]
not really...again.. performance is the main thing
Anton
20-May-2007
[8138]
(They aren't symbols by the way, they're 3-byte integers.)
Terry
20-May-2007
[8139]
an integer is a symbol
Anton
20-May-2007
[8140]
Compaction makes everything more complex. You are writing more and 
more code to implement that.
Terry
20-May-2007
[8141]
the sound you utter when you say "symbol" is a symbol
Anton
20-May-2007
[8142]
a symbol is not (only) an integer
Terry
20-May-2007
[8143x7]
now we're talking semantics
my main concern is crawling the block of integers... looking for 
 needles in haystacks.. whatever is the FASTEST method for this is 
best.. compaction is a by-product
what's faster, looking for  "¥%R" or   23472937 ?
or 16581375 rather
The dictionary will have 2 million strings... the index would be 
much smaller
index being the DB using triples as schema
Here's a question.. when a block of integers is converted to a HASH!.. 
what actually happens to it?
Anton
20-May-2007
[8150x2]
Hashes, lists and blocks can contain any type of value. The values 
are not affected by the container type.
A hash just has a different way of linking the items it contains 
together, so it performs faster in some operations at the expense 
of others.
Oldes
20-May-2007
[8152x2]
If you just need to save large arrays of integers, you can use format 
used in AS3:
 The AS3 Integer can be encoded into between 1 and 5 bytes.

    *

      if the integer is between 0×00 and 0x7F then only one byte (representing 
      the integer)
    *
      if between 0×80 and 0x3FFF then 2 bytes :
          o
            (i & 0x7F) | 0×80
          o
            (i » 7)
    *
      if between 0×4000 and 0x1FFFFF then 3 bytes :
          o
            (i & 0x7F) | 0×80
          o
            (i » 7) | 0×80
          o
            (i » 14)
    *
      if between 0×200000 and 0xFFFFFFF then 4 bytes :
          o
            (i & 0x7F) | 0×80
          o
            (i » 7) | 0×80
          o
            (i » 14) | 0×80
          o
            (i » 21)
    *
      if more or equal than 0×10000000 :
          o
            (i & 0x7F) | 0×80
          o
            (i » 7) | 0×80
          o
            (i » 14) | 0×80
          o
            (i » 21) | 0×80
          o
            (i » 28)
but if jyou need to search such an array, it's not the best for your... 
especially if there is no rebcode yet:(
Anton
20-May-2007
[8154x2]
That's interesting, using a bit in each byte to "escape" and open 
up the next, more significant byte. So it's a variable-length encoding 
scheme. (I guess, kind of like UTF8)
But yes, I think this adds complexity and would slow down the access 
routines.
Terry
20-May-2007
[8156]
yeah... I think the conclusion is ... don't worry about the number 
of bytes (and thus mem) when using plain integers with my index block, 
as it's much smaller than the dictionary anyways.. and unless Im 
shown otherwise, the crawling of it (find/ foreach, append) should 
be about as fast as any other method, right?
Anton
20-May-2007
[8157]
It should be faster and simpler to just use integers. When you want 
to cut down the size of your 2 million integers (=30MB), you can 
then look at implementing 3-byte integers packed in a binary.
Terry
20-May-2007
[8158]
exactly
Anton
20-May-2007
[8159x2]
Here's a couple of conversion routines.
struct: make struct! [int [int]] none

; convert integer -> 3-byte binary

integer-to-3-byte-binary: func [integer [integer!]][
	struct/int: integer
	copy/part third struct 3
]

; convert 3-byte binary -> integer

binary-3-byte-to-integer: func [int3 [binary!]][
	struct/int: 0 ; just make sure all bytes are zero
	change third struct int3
	struct/int
]
	
; test

my-bin: integer-to-3-byte-binary 2000000
;== #{80841E}
my-int: binary-3-byte-to-integer my-bin
;== 2000000
Terry
20-May-2007
[8161x3]
ok.. that's cool.. thanks
Even though each integer uses 16 bytes, there's some compaction by 
using the smallest integers with the most commonly used dictionary 
strings
I'll add these functions to the dictionary.. 

dict: [{integer-to-3-byte-binary: func [integer [integer!]][
	struct/int: integer
	copy/part third struct 3
]}]


now whenever i want to use that function.. i can represent it with 
a single integer.. ie:  do pick dict 1
Oldes
20-May-2007
[8164x2]
Terry, I would not make own binary storage as there should be new 
datatype dictionary! in R3 which is exactly what you need... if I 
understand it well
I cannot see hash! in this list http://www.rebol.net/r3blogs/0076.html
so it will be probably replaced
Terry
20-May-2007
[8166]
and that could be reduced further.. 

dict: [

{fetch: func [index][do pick dict index]}

{integer-to-3-byte-binary: func [integer [integer!]][
	struct/int: integer
	copy/part third struct 3
]}

]

do pick dict 1
fetch 2
my-bin: integer-to-3-byte-binary 2000000
Oldes
20-May-2007
[8167]
The new vector! datatype will be usefull as well to store large block 
of same data type
Terry
20-May-2007
[8168]
Im happy with R2 ;)
Oldes
20-May-2007
[8169]
yes... you can make it in R2 using index: [1 2 3]  and for R3 you 
just replace:  index: make vector! [integer! 24 [1 2 3]]
Terry
20-May-2007
[8170x3]
Now as for compressing the dictionary.. it seems that smaller strings 
grow with compression?
>> a: compress "a"
== #{789C4B04000062006201000000}
>> a: compress "aaaa"
== #{789C4B4C4C4C040003CE018504000000}
>> a: compress "aaaaa"
== #{789C4B4C04020005B401E605000000}
Oldes
20-May-2007
[8173x2]
compress is uzing zlib compression... just insetad of checksum at 
the tail (last 4 bytes) is used size of the source string
the size is used to make a result buffer for decompress:
>> decompress rejoin [#{789C4B4C} #{04020005B401E6} #{00000000}]
** Script Error: Not enough memory
>> decompress rejoin [#{789C4B4C} #{04020005B401E6} #{FF000000}]
== "aaaaa"
Dockimbel
20-May-2007
[8175]
Terry, your 'integer-to-3-byte-binary conversion function is dependent 
on platform endianness (third struct! gives you an endian-dependent 
result), watch out for that.
Anton
20-May-2007
[8176]
You can blame me for that, I wrote it quickly.
btiffin
21-May-2007
[8177]
Terminology question;  I know I could probably RTFM, but sometimes

Ask A Friendly Human is more fun.  What is the correct terminology 
for the global

REBOL context.  I'm describing (or trying to at least) the parse 
dialect "copy" versus

the REBOL "copy".  Is there a one word term for the "no context" 
context?  Or is
the REBOL global namespace
 good enough (and not too confusing to new rebols)?
Anton
22-May-2007
[8178]
Yeah, we all call it the global context.
btiffin
22-May-2007
[8179]
Thanks Anton.
Jerry
22-May-2007
[8180]
Is there a function which can flat a block. For example,
>> FLAT [ 1 2 [ 3 4 ] 5 6 ]
== [ 1 2 3 4 5 6]
Graham
22-May-2007
[8181]
probably not what you want ...

>> to-block form  [ 1 2 [ 3 4 ] 5 6 ]
== [1 2 3 4 5 6]
Chris
22-May-2007
[8182]
flatten: func [block [any-block!]][

    parse block [

         any [block: any-block! (change/part block first block 1) :block | 
         skip]

    ]
 
   head block
]