Entropy [was Re: Compression]
[1/11] from: joel:neely:fedex at: 18-Apr-2001 2:58
Hi, all!
In case anyone is interested, the following is a quick-and-dirty script
that performs the character-level entropy calculation for a specific
data source supplied as an argument (file or string).
The result tells the minimum number of bits per character required to
transmit without loss any message from a source having the same
distribution as the supplied source.
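(For reference: the quantity computed is the Shannon entropy
H = - sum[c] p(c) * log2 p(c), where p(c) is the relative frequency of
character c in the source; the script below accumulates exactly this sum.)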
Run against its own source (without the /verbose refinement) it reports:
>> char-ent/calc-ent %char-ent.r
( 2180 ) characters
Source entropy is 4.78716832865254 bits per character.
Maximum compression gives 40.1603958918433 percent saving.
If we take the source code for this program as "typical" for REBOL,
then any character-level compression scheme that squishes REBOL
scripts down to less than about 59.84% of their original size
(4.78717 / 8 bits = 0.5984) can only do so by losing information.
-jn-
8<------------------------------------------------------------
#!/usr/local/bin/rebol
REBOL [
title: "char-ent"
file: "char-ent.r"
author: "Joel Neely"
description: {
Calculates the character entropy of a data source,
which must be a string or a file.
}
]
;
; everything is wrapped in an object to protect the
; global namespace
;
char-ent: make object! [
;
; character frequencies are accumulated here
; (8-bit characters are supported)
;
char-freq: array/initial 256 0
;
; the final result is calculated and retained here
;
entropy: 0.0
;
; the work is done in this function
;
calc-ent: function [
data [string! file!]
/verbose
][
ichar ; index for current char (off by 1!)
iqty ; frequency for current character
tqty ; total character count
; (used instead of length? data in case
; we want to add logic in the input loop
; to consider only some characters, e.g. alpha)
][
;
; initialize accumulators, including char-freq, so that
; repeated calls don't inherit counts from earlier runs
;
char-freq: array/initial 256 0
entropy: 0.0
tqty: 0
;
; grab content if argument is file
;
if file? data [data: read data]
;
; accumulate character frequencies from data
;
; wrap loop body in an 'if to consider only
; a subset of the characters from the source
;
foreach char data [
;
; offset char value by one since REBOL can't
; understand zero-based indexing ... sigh ;-)
;
ichar: 1 + to-integer char
poke char-freq ichar (1 + pick char-freq ichar)
tqty: tqty + 1
]
;
; print detail report (if /verbose refinement is set)
; and accumulate final result
;
print ["^/(" tqty ") characters^/"]
for ichar 1 256 1 [
;
; ignore unused character values
;
if 0 < iqty: pick char-freq ichar [
use [p q] [
p: iqty / tqty ; char probability
q: p * log-2 p ; (negative) bits
if verbose [
print [
mold to-char ichar - 1 tab
iqty tab
p tab q
]
]
;
; accumulate weighted (positive) bits per character
;
entropy: entropy - q
]
]
]
;
; result is character entropy FOR THIS SOURCE ONLY
;
print [
"Source entropy is" entropy "bits per character." newline
"Maximum compression gives"
100.0 * (1.0 - (entropy / 8.0))
"percent saving." newline
]
]
]
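A quick sanity check (a hypothetical session, not part of the original
post): a source that alternates evenly between two characters carries
exactly one bit of entropy per character, so the script should report
1.0 bits and an 87.5 percent saving:

do %char-ent.r
char-ent/calc-ent "abababab"
; expected: Source entropy is 1.0 bits per character.
;           Maximum compression gives 87.5 percent saving.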
[2/11] from: holger:rebol at: 18-Apr-2001 7:15
On Wed, Apr 18, 2001 at 02:58:32AM -0500, Joel Neely wrote:
> Hi, all!
>
> The result tells the minimum number of bits per character required to
> transmit without loss any message from a source having the same
> distribution as the supplied source.
Yes, but entropy is always calculated relative to some established
model, shared by compressor and decompressor. It looks like the model
you use is character-based, global and context-free, i.e. it does not
take character combinations or other contexts into account, and does
not attempt to calculate entropies of subsections.
This means those 4.8 bits per character of entropy are a lower bound
only for character-oriented compression. Compressors which look at
character combinations or which reset compression histories and thus
distribution statistics at certain times might get better results.
Basically the entropy in your model describes a bound on the compression
ratio of static or adaptive order-0 Huffman encoding. It does not say
anything about other types of compression.
These days good text compressors (PPM, DMC and even BWT) achieve
around 75% compression (using models with around 2 bits of entropy per
character) on typical English-language reference texts (e.g. the
Canterbury Corpus, or novels from Project Gutenberg).
The interesting question for text compression is what the minimum
entropy of English language text is relative to an optimum, algorithmic
model, i.e. a model that does not depend on the receiver having access
to large dictionaries, portions of text etc., but only a reasonably-sized
decompression algorithm. The current assumption is that a lower bound
is around 1 bit per character, leading to an upper bound on English
language text compression of around 87.5 %.
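(With 8-bit characters, 1 bit per character of residual entropy
corresponds to a saving of 1 - 1/8 = 87.5%.)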
--
Holger Kruse
[holger--rebol--com]
[3/11] from: doug:vos:eds at: 18-Apr-2001 11:02
What about including better image compression in REBOL/View?
I'm surprised nobody jumped on my comment yesterday about
image compression from http://www.lizardtech.com/index.html
that is 10 times better than JPEG (in many cases) ...
This technology would be a great add-on for REBOL/View.
[4/11] from: kracik:mbox:dkm:cz at: 18-Apr-2001 17:39
Hi,
I used the LizardTech MrSID viewer to view orthophotos from
http://tahoe.usgs.gov/tahoe/DOQ.html#download
It's amazing how fast it is, and you can zoom in from full landscape
to a single tree. But it's LizardTech's proprietary technology, so RT
cannot simply use it, and PNG, GIF and JPEG decompression in REBOL
is IMHO sufficient for most people.
It may be possible to interface LizardTech's libraries from View/Pro
or /Command.
--
Michal Kracik
[5/11] from: holger:rebol at: 18-Apr-2001 8:56
On Wed, Apr 18, 2001 at 11:02:12AM -0400, Vos, Doug wrote:
> What about including better image compression in REBOL/View?
>
> I'm surprised nobody jumped on my comment yesterday about
> image compression from http://www.lizardtech.com/index.html
> that is 10 times better than JPEG (in many cases) ...
It is a size vs. functionality tradeoff. Eventually we would
like to provide all of these things for REBOL, but first we need
a run-time component model so the main binary does not become
too large.
--
Holger Kruse
[holger--rebol--com]
[6/11] from: petr:krenzelok:trz:cz at: 18-Apr-2001 18:48
----- Original Message -----
From: "Holger Kruse" <[holger--rebol--com]>
To: <[rebol-list--rebol--com]>
Sent: Wednesday, April 18, 2001 5:56 PM
Subject: [REBOL] Re: Entropy [was Re: Compression]
> On Wed, Apr 18, 2001 at 11:02:12AM -0400, Vos, Doug wrote:
> > What about including better image compression in REBOL/View?
<<quoted lines omitted: 6>>
> a run-time component model so the main binary does not become
> too large.
Ha! :-)) Well, I know it was planned, but I also remember Jeff stating
something about the technical difficulty of implementing it across all
the platforms (or was it an API limiting the functionality needed?)
Maybe now that the main REBOL products are finally out, you or Carl
could outline what is planned for the near future? E.g. modules - 3.0,
components - 3.5, sound - 2.6 ;-) etc :-)
-pekr-
[7/11] from: joel:neely:fedex at: 18-Apr-2001 15:52
Hi, Holger,
Holger Kruse wrote:
> Yes, but entropy is always calculated relative to some established
> model, shared by compressor and decompressor. It looks like the model
> you use is character-based, global and context-free, i.e. it does not
> take character combinations or other contexts into account, and does
> not attempt to calculate entropies of subsections.
>
Exactly right! (I guess I compressed too much! ;-) That's why I
followed up by saying "... any character-level compression scheme ..."
---------------
I wasn't trying to claim that context-free compression was optimal,
but to give an illustration to support another point, discussed
below. Sorry if I failed to be sufficiently clear.
> This means those 4.8 bits per character of entropy are a lower bound
> only for character-oriented compression. Compressors which look at
<<quoted lines omitted: 3>>
> compression ratio of static, adaptive Huffman encoding. It does not
> say anything about other types of compression.
Again, I agree. However, the point I was trying to make survives,
even when we take all your well-made points into account.
To illustrate: my REBOL coding style is relatively consistent (at
least in my opinion ;-):
Almost every line break is followed either by another line break
or a run of tabs. I could precede a character-level (e.g. Huffman)
compression scheme with a separate pass through the data to replace
all occurrences of ["^/" some "^-"] with "^/" and a single decimal
digit that indicates how many tabs followed the linebreak (including
0, of course). On the back end, the character-level decompression
would be followed by replacing ["^/" digit] with "^/" and the
indicated number of tabs.
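A minimal sketch of that transform (my reconstruction, not code from
the original post), assuming runs of at most nine tabs so the count
fits in a single decimal digit:

tab-encode: func [
    "Replace each newline + run of tabs with newline + digit."
    s [string!] /local out n
][
    out: make string! length? s
    while [not tail? s] [
        either newline = first s [
            s: next s
            n: 0
            while [all [not tail? s  tab = first s]] [
                n: n + 1
                s: next s
            ]
            append out newline
            append out to-char (n + to-integer #"0")
        ][
            append out first s
            s: next s
        ]
    ]
    out
]

tab-decode: func [
    "Inverse: newline + digit becomes newline + that many tabs."
    s [string!] /local out n
][
    out: make string! length? s
    while [not tail? s] [
        either newline = first s [
            s: next s
            ; assumes well-formed input: every newline in the
            ; encoded stream is followed by exactly one digit
            n: (to-integer first s) - to-integer #"0"
            s: next s
            append out newline
            insert/dup tail out tab n
        ][
            append out first s
            s: next s
        ]
    ]
    out
]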
Clearly this would increase compression for my REBOL code, but in
most other cases (such as Carl's recommendation to use 4 spaces for
each level of indentation instead of a tab) it would actually hurt
the compression rate.
The key point is that an optimization hack that improves compression
for specific situations usually loses for others. English, to use
your example, is highly redundant, and we can achieve the impressive
rates you cite only by taking advantage of that specific redundancy.
If we try to apply those same tricks to German, REBOL source code,
or MP3 files, they will simply fail to help (at best) and can
significantly degrade the compression (at worst).
-jn-
[8/11] from: holger:rebol at: 18-Apr-2001 14:43
On Wed, Apr 18, 2001 at 03:52:34PM -0500, Joel Neely wrote:
> The key point is that an optimization hack that improves compression
> for a specific situations usually loses for others. English, to use
<<quoted lines omitted: 3>>
> or MP3 files, they will simply fail to help (at best) and can
> significantly degrade the compression (at worst).
Not really. PPM, e.g., is essentially Huffman with adaptive
context-sensitive probabilities, i.e. the prediction of the
next input symbol is not just based on the overall frequency but
also on some "state" determined by previously seen characters.
The main drawback of PPM is that dynamically maintaining probabilities
of multi-character sequences has a much higher overhead (memory and CPU)
than the simple counting needed for Huffman. That's why PPM is not usually
used in compression tools for the consumer market.
PPM (Prediction by Partial Matching) is a generic model though that can be
applied to a very large class of input files, so I would not call it
an optimization hack. It would be a "hack" if the way the state and
probabilities are maintained had some hidden model-specific
assumptions, but that is not the case, at least not for PPM, DMC and BWT.
The only assumption made is that *some* local context exists, and that
is true for German, REBOL source code and a lot of other types of files.
--
Holger Kruse
[holger--rebol--com]
[9/11] from: joel:neely:fedex at: 18-Apr-2001 17:22
Hi, Holger,
So as not to commit post-mortem equine assault, I'll be brief (for me
;-)
Holger Kruse wrote:
> On Wed, Apr 18, 2001 at 03:52:34PM -0500, Joel Neely wrote:
> > The key point is that an optimization hack that improves compression
<<quoted lines omitted: 3>>
> next input symbol is not just based on the overall frequency but
> also on some "state" determined by previously seen characters.
That's fine, and there are lots of adaptive algorithms which "learn"
how to handle each source stream... (but see below)
> The main drawback of PPM is that dynamically maintaining probabilities
> of multi-character sequences has a much higher overhead (memory and CPU)
> than the simple couting needed for Huffman. That's why PPM is not usually
> used in compression tools for the consumer market.
>
My experience with adaptive algorithms is that there's another drawback,
if compression ratio is the primary measure of success. The knowledge
of the source model has to exist somewhere. Special-case algorithms
put that knowledge in the algorithm, which minimizes the compressed data
for those messages that fit their model, but makes them lose for many
messages that don't fit. Adaptive algorithms start off knowing nothing
about each specific message, but learn as they go. This means that the
knowledge about the model (which accumulates
during the processing of the message) is essentially embedded in the
compressed data stream itself. Therefore the compressed data stream is
larger than it would be if that model were extracted (e.g. into the
algorithm or into out-of-band data), but the ability to find whatever
patterns are present protects the adaptive algorithm from spectacular
failure on pathological input.
> PPM (Partial Pattern Matching) is a generic model though that can be
> applied to a very large class of input files, so I would not call it
<<quoted lines omitted: 3>>
> The only assumption made is that *some* local context exists, and that
> is true for German, REBOL source code and a lot of other types of files.
If we agree (and I think we do) that PPM is not a special-case
optimization hack, then I really don't understand how it serves
as a counter-example to my original statement, which WAS about
special-case tweaks. I still believe that statement holds up:
The more one optimizes an algorithm for specific cases,
the worse it becomes at dealing with other cases.
All I'm saying (and I think we're in agreement) is that there are no
free lunches. Greater flexibility (adaptive algorithms) comes at the
price of less compression for specific classes of messages. Greater
specialization for specific classes of messages produces better
compression ratios for that class but worse performance otherwise.
Classic computing science tradeoff.
Just for hack value, I modified my entropy calculator to figure
entropy for a 1-character context model, instead of the single-character
frequency model in the code I posted earlier. A more sophisticated model
clearly wins, as shown in the statistics below. Entropy is in bits
per character; savings are percentages.
Program  Context   Source File 1 Statistics   Source File 2 Statistics
Version  width     Length  Entropy  Savings   Length  Entropy  Savings
-------  -------   ------  -------  -------   ------  -------  -------
   1     0 chars     2180  4.78717   40.160     3258  4.73303   40.837
   2     1 char      2180  2.94814   63.148     3258  2.82492   64.688
Again illustrating that the more sophisticated the model, the better
the compression, but the algorithm has to work harder and the increase
in model representation has to go somewhere (in the algorithm or
embedded in the data stream).
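For the curious, here is a minimal sketch of how such a 1-character
context entropy can be computed (my reconstruction, not Joel's actual
version 2 code; the name ctx-ent is made up): keep one frequency table
per preceding character, then weight each context's entropy by how
often that context occurs.

ctx-ent: func [
    "1-character-context (first-order) entropy, bits per character."
    data [string! file!]
    /local freq row tqty prev i ctotal p h
][
    if file? data [data: read data]
    ;
    ; one 256-entry frequency row per preceding character
    ;
    freq: copy []
    loop 256 [append/only freq array/initial 256 0]
    tqty: length? data
    prev: 1                    ; arbitrary context for the first char
    foreach char data [
        i: 1 + to-integer char
        row: pick freq prev
        poke row i 1 + pick row i
        prev: i
    ]
    ;
    ; weight each context's entropy by the context's probability
    ;
    h: 0.0
    foreach row freq [
        ctotal: 0
        foreach n row [ctotal: ctotal + n]
        if ctotal > 0 [
            foreach n row [
                if n > 0 [
                    p: n / ctotal
                    h: h - ((ctotal / tqty) * (p * log-2 p))
                ]
            ]
        ]
    ]
    h
]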
Thanks again for some very thought-stimulating discussion! It never
fails to make me think when I get into a conversation with you guys!
-jn-
[10/11] from: robert:muench:robertmuench at: 19-Apr-2001 17:02
> -----Original Message-----
> From: [rebol-bounce--rebol--com] [mailto:[rebol-bounce--rebol--com]]On Behalf Of
<<quoted lines omitted: 5>>
> a run-time component model so the main binary does not become
> too large.
Hi, it's good to hear that the REBOL team continues to consider size an
important issue these days. Keep up the good work, and remember: all
really good software ever made fits on one 1.44 MB disk ;-))) Robert
[11/11] from: emptyhead:home:nl at: 19-Apr-2001 20:26
Obviously this is not true ... all great software fits on one 880 KB
disk... ;)
D. Oosterveld
Notes
- Quoted lines have been omitted from some messages, marked as
  <<quoted lines omitted: N>> in the text.