Entropy [was Re: Compression]
[1/11] from: joel:neely:fedex at: 18-Apr-2001 2:58
Hi, all!
In case anyone is interested, the following is a quick-and-dirty script
that performs the character-level entropy calculation for a specific
data source supplied as an argument (file or string).
The result tells the minimum number of bits per character required to
transmit without loss any message from a source having the same
distribution as the supplied source.
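(For reference: the quantity computed is the Shannon entropy
H = - sum[c] p(c) * log2 p(c), where p(c) is the relative frequency of
character c in the source; the script below accumulates exactly this sum.)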
Run against its own source (without the /verbose refinement) it reports:
>> char-ent/calc-ent %char-ent.r
( 2180 ) characters
Source entropy is 4.78716832865254 bits per character.
Maximum compression gives 40.1603958918433 percent saving.
If we take the source code for this program as "typical" for REBOL,
then any character-level compression scheme that squishes REBOL
scripts down to less than about 59.84% of their original size
(4.78717 / 8 bits = 0.5984) can only do so by losing information.
-jn-
8<------------------------------------------------------------
#!/usr/local/bin/rebol
REBOL [
title: "char-ent"
file: "char-ent.r"
author: "Joel Neely"
description: {
Calculates the character entropy of a data source,
which must be a string or a file.
}
]
;
; everything is wrapped in an object to protect the
; global namespace
;
char-ent: make object! [
;
; character frequencies are accumulated here
; (8-bit characters are supported)
;
char-freq: array/initial 256 0
;
; the final result is calculated and retained here
;
entropy: 0.0
;
; the work is done in this function
;
calc-ent: function [
data [string! file!]
/verbose
][
ichar ; index for current char (off by 1!)
iqty ; frequency for current character
tqty ; total character count
; (used instead of length? data in case
; we want to add logic in the input loop
; to consider only some characters, e.g. alpha)
][
;
; initialize accumulators, including char-freq, so that
; repeated calls don't inherit counts from earlier runs
;
char-freq: array/initial 256 0
entropy: 0.0
tqty: 0
;
; grab content if argument is file
;
if file? data [data: read data]
;
; accumulate character frequencies from data
;
; wrap loop body in an 'if to consider only
; a subset of the characters from the source
;
foreach char data [
;
; offset char value by one since REBOL can't
; understand zero-based indexing ... sigh ;-)
;
ichar: 1 + to-integer char
poke char-freq ichar (1 + pick char-freq ichar)
tqty: tqty + 1
]
;
; print detail report (if /verbose refinement is set)
; and accumulate final result
;
print ["^/(" tqty ") characters^/"]
for ichar 1 256 1 [
;
; ignore unused character values
;
if 0 < iqty: pick char-freq ichar [
use [p q] [
p: iqty / tqty ; char probability
q: p * log-2 p ; (negative) bits
if verbose [
print [
mold to-char ichar - 1 tab
iqty tab
p tab q
]
]
;
; accumulate weighted (positive) bits per character
;
entropy: entropy - q
]
]
]
;
; result is character entropy FOR THIS SOURCE ONLY
;
print [
"Source entropy is" entropy "bits per character." newline
"Maximum compression gives"
100.0 * (1.0 - (entropy / 8.0))
"percent saving." newline
]
]
]
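A quick sanity check (a hypothetical session, not part of the original
post): a source that alternates evenly between two characters carries
exactly one bit of entropy per character, so the script should report
1.0 bits and an 87.5 percent saving:

do %char-ent.r
char-ent/calc-ent "abababab"
; expected: Source entropy is 1.0 bits per character.
;           Maximum compression gives 87.5 percent saving.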
[2/11] from: holger:rebol at: 18-Apr-2001 7:15
On Wed, Apr 18, 2001 at 02:58:32AM -0500, Joel Neely wrote:
> Hi, all!
>
> The result tells the minimum number of bits per character required to
> transmit without loss any message from a source having the same
> distribution as the supplied source.
Yes, but entropy is always calculated relative to some established
model, shared by compressor and decompressor. It looks like the model
you use is character-based, global and context-free, i.e. it does not
take character combinations or other contexts into account, and does
not attempt to calculate entropies of subsections.
This means those 4.8 bits per character of entropy are a lower bound
only for character-oriented compression. Compressors which look at
character combinations or which reset compression histories and thus
distribution statistics at certain times might get better results.
Basically the entropy in your model describes a bound on the compression
ratio of static or adaptive order-0 Huffman encoding. It does not say
anything about other types of compression.
These days good text compressors (PPM, DMC and even BWT) achieve
around 75% compression (using models with around 2 bits of entropy per
character) on typical English-language reference texts (e.g. the
Canterbury Corpus, or novels from Project Gutenberg).
The interesting question for text compression is what the minimum
entropy of English language text is relative to an optimum, algorithmic
model, i.e. a model that does not depend on the receiver having access
to large dictionaries, portions of text etc., but only a reasonably-sized
decompression algorithm. The current assumption is that a lower bound
is around 1 bit per character, leading to an upper bound on English
language text compression of around 87.5 %.
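(With 8-bit characters, 1 bit per character of residual entropy
corresponds to a saving of 1 - 1/8 = 87.5%.)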
--
Holger Kruse
[holger--rebol--com]
[3/11] from: doug:vos:eds at: 18-Apr-2001 11:02
What about including better image compression in REBOL/View?
I'm surprised nobody jumped on my comment yesterday about
image compression from http://www.lizardtech.com/index.html
that is 10 times better than JPEG (in many cases) ...
This technology would be a great add-on for REBOL/View.
[4/11] from: kracik:mbox:dkm:cz at: 18-Apr-2001 17:39
Hi,
I used the LizardTech MrSID viewer to view orthophotos from
http://tahoe.usgs.gov/tahoe/DOQ.html#download
It's amazing how fast it is, and you can zoom in from full landscape
to a single tree. But it's LizardTech's proprietary technology, so RT
cannot simply use it, and PNG, GIF and JPEG decompression in REBOL
is IMHO sufficient for most people.
It may be possible to interface LizardTech's libraries from View/Pro
or /Command.
--
Michal Kracik
[5/11] from: holger:rebol at: 18-Apr-2001 8:56
On Wed, Apr 18, 2001 at 11:02:12AM -0400, Vos, Doug wrote:
> What about including better image compression in REBOL/View?
>
> I'm surprised nobody jumped on my comment yesterday about
> image compression from http://www.lizardtech.com/index.html
> that is 10 times better than JPEG (in many cases) ...
It is a size vs. functionality tradeoff. Eventually we would
like to provide all of these things for REBOL, but first we need
a run-time component model so the main binary does not become
too large.
--
Holger Kruse
[holger--rebol--com]
[6/11] from: petr:krenzelok:trz:cz at: 18-Apr-2001 18:48
----- Original Message -----
From: "Holger Kruse" <[holger--rebol--com]>
To: <[rebol-list--rebol--com]>
Sent: Wednesday, April 18, 2001 5:56 PM
Subject: [REBOL] Re: Entropy [was Re: Compression]
> On Wed, Apr 18, 2001 at 11:02:12AM -0400, Vos, Doug wrote:
> > What about including better image compression in REBOL/View?
<<quoted lines omitted: 6>>
> a run-time component model so the main binary does not become
> too large.
Ha! :-)) Well, I know it was planned, but I also remember Jeff stating
something about the technical difficulty of implementing it across all
the platforms (or was it an API limiting the functionality needed?)
Maybe now that the main REBOL products are finally out, you or Carl
could outline what is planned for the near future? E.g. modules - 3.0,
components - 3.5, sound - 2.6 ;-) etc :-)
-pekr-
[7/11] from: joel:neely:fedex at: 18-Apr-2001 15:52
Hi, Holger,
Holger Kruse wrote:
> Yes, but entropy is always calculated relative to some established
> model, shared by compressor and decompressor. It looks like the model
> you use is character-based, global and context-free, i.e. it does not
> take character combinations or other contexts into account, and does
> not attempt to calculate entropies of subsections.
>
Exactly right! (I guess I compressed too much! ;-) That's why I
followed up by saying "... any character-level compression scheme ..."
---------------
I wasn't trying to claim that context-free compression was optimal,
but to give an illustration to support another point, discussed
below. Sorry if I failed to be sufficiently clear.
> This means those 4.8 bits per character of entropy are a lower bound
> only for character-oriented compression. Compressors which look at
<<quoted lines omitted: 3>>
> compression ratio of static, adaptive Huffman encoding. It does not
> say anything about other types of compression.
Again, I agree. However, the point I was trying to make survives,
even when we take all your well-made points into account.
To illustrate: my REBOL coding style is relatively consistent (at
least in my opinion ;-):
Almost every line break is followed either by another line break
or a run of tabs. I could precede a character-level (e.g. Huffman)
compression scheme with a separate pass through the data to replace
all occurrences of ["^/" some "^-"] with "^/" and a single decimal
digit that indicates how many tabs followed the linebreak (including
0, of course). On the back end, the character-level decompression
would be followed by replacing ["^/" digit] with "^/" and the
indicated number of tabs.
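A minimal sketch of that transform (my reconstruction, not code from
the original post), assuming runs of at most nine tabs so the count
fits in a single decimal digit:

tab-encode: func [
    "Replace each newline + run of tabs with newline + digit."
    s [string!] /local out n
][
    out: make string! length? s
    while [not tail? s] [
        either newline = first s [
            s: next s
            n: 0
            while [all [not tail? s  tab = first s]] [
                n: n + 1
                s: next s
            ]
            append out newline
            append out to-char (n + to-integer #"0")
        ][
            append out first s
            s: next s
        ]
    ]
    out
]

tab-decode: func [
    "Inverse: newline + digit becomes newline + that many tabs."
    s [string!] /local out n
][
    out: make string! length? s
    while [not tail? s] [
        either newline = first s [
            s: next s
            ; assumes well-formed input: every newline in the
            ; encoded stream is followed by exactly one digit
            n: (to-integer first s) - to-integer #"0"
            s: next s
            append out newline
            insert/dup tail out tab n
        ][
            append out first s
            s: next s
        ]
    ]
    out
]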
Clearly this would increase compression for my REBOL code, but in
most other cases (such as Carl's recommendation to use 4 spaces for
each level of indentation instead of a tab) it would actually hurt
the compression rate.
The key point is that an optimization hack that improves compression
for specific situations usually loses for others. English, to use
your example, is highly redundant, and we can achieve the impressive
rates you cite only by taking advantage of that specific redundancy.
If we try to apply those same tricks to German, REBOL source code,
or MP3 files, they will simply fail to help (at best) and can
significantly degrade the compression (at worst).
-jn-
[8/11] from: holger:rebol at: 18-Apr-2001 14:43
On Wed, Apr 18, 2001 at 03:52:34PM -0500, Joel Neely wrote:
> The key point is that an optimization hack that improves compression
> for a specific situations usually loses for others. English, to use
<<quoted lines omitted: 3>>
> or MP3 files, they will simply fail to help (at best) and can
> significantly degrade the compression (at worst).
Not really. PPM, e.g., is essentially Huffman with adaptive
context-sensitive probabilities, i.e. the prediction of the
next input symbol is not just based on the overall frequency but
also on some "state" determined by previously seen characters.
The main drawback of PPM is that dynamically maintaining probabilities
of multi-character sequences has a much higher overhead (memory and CPU)
than the simple counting needed for Huffman. That's why PPM is not usually
used in compression tools for the consumer market.
PPM (Prediction by Partial Matching) is a generic model though that can be
applied to a very large class of input files, so I would not call it
an optimization hack. It would be a "hack" if the way the state and
probabilities are maintained had some hidden model-specific
assumptions, but that is not the case, at least not for PPM, DMC and BWT.
The only assumption made is that *some* local context exists, and that
is true for German, REBOL source code and a lot of other types of files.
--
Holger Kruse
[holger--rebol--com]
[9/11] from: joel:neely:fedex at: 18-Apr-2001 17:22
Hi, Holger,
So as not to commit post-mortem equine assault, I'll be brief (for me
;-)
Holger Kruse wrote:
> On Wed, Apr 18, 2001 at 03:52:34PM -0500, Joel Neely wrote:
> > The key point is that an optimization hack that improves compression
<<quoted lines omitted: 3>>
> next input symbol is not just based on the overall frequency but
> also on some "state" determined by previously seen characters.
That's fine, and there are lots of adaptive algorithms which "learn"
how to handle each source stream... (but see below)
> The main drawback of PPM is that dynamically maintaining probabilities
> of multi-character sequences has a much higher overhead (memory and CPU)
> than the simple couting needed for Huffman. That's why PPM is not usually
> used in compression tools for the consumer market.
>
My experience with adaptive algorithms is that there's another drawback,
if compression ratio is the primary measure of success. The knowledge
of the source model has to exist somewhere. Special-case algorithms
put that knowledge in the algorithm, which minimizes the compressed data
for those messages that fit their model, but makes them lose for many
messages that don't fit. Adaptive algorithms start off knowing nothing
about each specific message, but learn as they go. This means that the
knowledge about the model (which accumulates
during the processing of the message) is essentially embedded in the
compressed data stream itself. Therefore the compressed data stream is
larger than it would be if that model were extracted (e.g. into the
algorithm or into out-of-band data), but the ability to find whatever
patterns are present protects the adaptive algorithm from spectacular
failure on pathological input.
> PPM (Partial Pattern Matching) is a generic model though that can be
> applied to a very large class of input files, so I would not call it
<<quoted lines omitted: 3>>
> The only assumption made is that *some* local context exists, and that
> is true for German, REBOL source code and a lot of other types of files.
If we agree (and I think we do) that PPM is not a special-case
optimization hack, then I really don't understand how it serves
as a counter-example to my original statement, which WAS about
special-case tweaks. I still believe that statement holds up:
The more one optimizes an algorithm for specific cases,
the worse it becomes at dealing with other cases.
All I'm saying (and I think we're in agreement) is that there are no
free lunches. Greater flexibility (adaptive algorithms) comes at the
price of less compression for specific classes of messages. Greater
specialization for specific classes of messages produces better
compression ratios for that class but worse performance otherwise.
Classic computing science tradeoff.
Just for hack value, I modified my entropy calculator to figure
entropy for a 1-character context model, instead of the single-character
frequency model in the code I posted earlier. A more sophisticated model
clearly wins, as shown in the statistics below. Entropy is in bits
per character; savings are percentages.
Program  Context   Source File 1 Statistics   Source File 2 Statistics
Version  width     Length  Entropy  Savings   Length  Entropy  Savings
-------  -------   ------  -------  -------   ------  -------  -------
   1     0 chars     2180  4.78717   40.160     3258  4.73303   40.837
   2     1 char      2180  2.94814   63.148     3258  2.82492   64.688
Again illustrating that the more sophisticated the model, the better
the compression, but the algorithm has to work harder and the increase
in model representation has to go somewhere (in the algorithm or
embedded in the data stream).
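For the curious, here is a minimal sketch of how such a 1-character
context entropy can be computed (my reconstruction, not Joel's actual
version 2 code; the name ctx-ent is made up): keep one frequency table
per preceding character, then weight each context's entropy by how
often that context occurs.

ctx-ent: func [
    "1-character-context (first-order) entropy, bits per character."
    data [string! file!]
    /local freq row tqty prev i ctotal p h
][
    if file? data [data: read data]
    ;
    ; one 256-entry frequency row per preceding character
    ;
    freq: copy []
    loop 256 [append/only freq array/initial 256 0]
    tqty: length? data
    prev: 1                    ; arbitrary context for the first char
    foreach char data [
        i: 1 + to-integer char
        row: pick freq prev
        poke row i 1 + pick row i
        prev: i
    ]
    ;
    ; weight each context's entropy by the context's probability
    ;
    h: 0.0
    foreach row freq [
        ctotal: 0
        foreach n row [ctotal: ctotal + n]
        if ctotal > 0 [
            foreach n row [
                if n > 0 [
                    p: n / ctotal
                    h: h - ((ctotal / tqty) * (p * log-2 p))
                ]
            ]
        ]
    ]
    h
]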
Thanks again for some very thought-stimulating discussion! It never
fails to make me think when I get into a conversation with you guys!
-jn-
[10/11] from: robert:muench:robertmuench at: 19-Apr-2001 17:02
> -----Original Message-----
> From: [rebol-bounce--rebol--com] [mailto:[rebol-bounce--rebol--com]]On Behalf Of
<<quoted lines omitted: 5>>
> a run-time component model so the main binary does not become
> too large.
Hi, it's good to hear that the REBOL team continues to consider size an
important issue these days. Keep up the good work, and remember: all
really good software ever made fits on one 1.44 MB disk ;-))) Robert
[11/11] from: emptyhead:home:nl at: 19-Apr-2001 20:26
Obviously this is not true ... all great software fits on one 880 KB
disk... ;)
D. Oosterveld
Notes
- Quoted lines have been omitted from some messages, marked as
  <<quoted lines omitted: N>> in the text.