Sort by first part of line
[1/62] from: louisaturk:coxinet at: 3-Sep-2002 15:26
Hi fellow rebols,
I still don't quite understand sort. How do I sort the following lines by
the numbers only (not by the letters and not by the length)? Note that
there is a space before each number in the test data below.
Thanks,
Louis
454 en tw
395 en th
313 kai o
175 oi de
314 eij thn
174 eij ton
124 kai ouk
123 kai thn
219 ek tou
160 kai en
142 kai to
126 tw qew
166 kai h
096 ei de
094 ou mh
091 ei mh
120 thj ghj
120 en toij
112 estin o
108 en taij
118 o kurioj
096 proj ton
088 kai touj
082 kai idou
115 kai ta
111 o uioj
111 de kai
103 ek twn
114 eipen autoij
104 tou anqrwpou
071 legei autoij
063 twn ioudaiwn
105 proj auton
092 oi maqhtai
082 legei autw
078 eipen autw
103 ek thj
090 ina mh
086 autw o
084 ou gar
101 apo tou
[2/62] from: chris:langreiter at: 3-Sep-2002 23:22
> 454 en tw
> 395 en th
> 313 kai o
> ...
Assuming you have that data in a variable creatively named data:
result: copy ""
foreach [n a b] sort/skip parse data none 3 [
repend result [" " n " " a " " b newline]
]
i.e. transform the string into a data structure more appropriate for
the sorting endeavours you have in mind.
-- Chris
[3/62] from: greggirwin:mindspring at: 3-Sep-2002 16:27
Hi Sunanda,
My reply to Louis should come in before this, so you'll see that I took a
different route. In any case, I wanted to point something out for
clarification.
<< If Rebol's sort were "stable" -- i.e. it kept equal keys in their
original sequence, then all you'd need to do is to write your own
sort compare routine to compare partial keys. >>
REBOL's sort *is* stable, AFAIK, per Holger. If you provide your own
comparison function then you need to return -1, 0, or 1 (rather than
true/false) and it should still be stable. I haven't done any tests to prove
it, so someone please correct me if I'm wrong.
--Gregg
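For readers following this thread outside REBOL: the behaviour Gregg describes, a three-way comparator returning -1, 0, or 1 while equal keys keep their input order, can be sketched in Python, whose list sort is guaranteed stable. The sample lines are borrowed from Louis's data; `compare_items` is an illustrative name, not part of any library.

```python
from functools import cmp_to_key

# Three-way comparator: -1 if a < b, 0 if equal, 1 if a > b,
# comparing only the leading number of each line.
def compare_items(a, b):
    aa, bb = int(a.split()[0]), int(b.split()[0])
    return (aa > bb) - (aa < bb)

lines = [" 120 thj ghj", " 096 ei de", " 120 en toij", " 094 ou mh"]
result = sorted(lines, key=cmp_to_key(compare_items))

# The two equal keys (120) keep their original relative order: a stable sort.
print(result)
# [' 094 ou mh', ' 096 ei de', ' 120 thj ghj', ' 120 en toij']
```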
[4/62] from: greggirwin:mindspring at: 3-Sep-2002 16:18
Hi Louis,
<< I still don't quite understand sort. How do I sort the following lines
by
the numbers only (not by the letters and not by the length). Note that
there is a space before each number in the test data below. >>
How about this? (assuming items are space delimited strings)
compare-items: func [a b /local aa bb] [
aa: to integer! first parse a none
bb: to integer! first parse b none
either aa = bb [0][either aa < bb [-1][1]]
]
sort/compare data :compare-items
Not terribly efficient, as it's doing a lot of work on each comparison, but
it's a start.
--Gregg
[5/62] from: rotenca:telvia:it at: 4-Sep-2002 1:18
Hi Sunanda and Louis,
for sort you should look at
http://www.rebol.com/docs/core25.html#sect4.1.3.
and
http://www.rebol.com/docs/core25.html#sect4.1.4.
perhaps this helps.
---
Ciao
Romano
[6/62] from: g:santilli:tiscalinet:it at: 4-Sep-2002 16:07
Hi Sunanda,
On Tuesday, September 3, 2002, 11:50:31 PM, you wrote:
Sac> But this doesn't preserve the input sequence -- it looks to me like
Sac> Rebol's sort is ***not*** stable.
It actually is, but if you use /COMPARE you need to return -1, 0
and 1 for it to be stable. (AFAIK) (-1 means a < b, 0 means a = b,
1 means a > b)
I think Louis just needs to LOAD his data, and then sort it with
SORT/SKIP. He can then reproduce the string from the LOADed data.
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
[7/62] from: louisaturk:coxinet at: 4-Sep-2002 12:28
To everyone helping me with this, I'm having to meet a deadline on another
project. I'll get back to this sort problem ASAP. I very much appreciate
all the help.
Louis
[8/62] from: rotenca:telvia:it at: 5-Sep-2002 12:28
Hi Gabriele
> I think Louis just needs to LOAD his data, and then sort it with
> SORT/SKIP. He can then reproduce the string from the LOADed data.
If you do not care about adding thousands of useless words to the global
context, and you are sure every string reduces to valid REBOL datatypes,
that is a solution :-)
I think he can parse the string, building a sortable block, and then rejoin
the data into a string.
> Regards,
> Gabriele.
Ciao
Romano
[9/62] from: louisaturk:coxinet at: 5-Sep-2002 10:46
Hi Sunanda,
Your sort worked perfectly. Thanks also for the explanation.
You might be interested to know that
on my 450 MHz Pentium II running W2K, the total time to sort
29,688 lines using your code was:
3:50:10
I want to thank Sunanda and everyone else that helped with this. I still
have questions, but I'm committed to other things for the rest of this
week. Next week I'll try to ask them.
One question I'll ask now however: What exactly does hash do, and could
hash be used to speed up sort?
Thanks again,
Louis
At 05:50 PM 9/3/2002 -0400, you wrote:
[10/62] from: louisaturk:coxinet at: 5-Sep-2002 12:12
Hi Romano,
It seems that I am having to sort large amounts of data more and more
often, and I want to learn how to do it faster.
At 12:28 PM 9/5/2002 +0200, you wrote:
>Hi Gabriele
> > I think Louis just needs to LOAD his data, and then sort it with
<<quoted lines omitted: 4>>
>I think that he can parse the string building a sortable block and then rejoin
>the data in a string.
Would you please give an example?
Louis
[11/62] from: greggirwin:mindspring at: 5-Sep-2002 13:00
Hi Louis,
<< It seems that I am having to sort large amounts of data more and more
often. I am wanting to learn how to do it faster. >>
One of the biggest things is to use a data structure that doesn't require
processing for every comparison. I.e. pre-parse the data into fields. REBOL,
in my experience, is plenty fast but don't make it do work redundantly if
you can avoid it.
--Gregg
[12/62] from: anton:lexicon at: 6-Sep-2002 5:08
Here's what was meant, I think.
string: {
454 en tw
395 en th
313 kai o
175 oi de
314 eij thn
174 eij ton
124 kai ouk
123 kai thn
219 ek tou
160 kai en
}
blk: parse string none ; split at whitespace
sort/skip blk 3
foreach [one two three] blk [print [one two three]]
Anton.
[13/62] from: greggirwin:mindspring at: 5-Sep-2002 12:52
Hi Louis,
<< What exactly does hash do... >>
Maybe Joel will chime in with a really good definition, but here's a quick
quote and my (probably poor) layman's explanation.
per Knuth, regarding searching algorithms "A third possibility is to avoid
all this rummaging around by doing some arithmetical calculation on K,
computing a function f(K) that is the location of K and the associated data
in the table."
Basically, and someone please correct me if I mis-state things, rather than
doing comparisons of items in a linear or tree structure, you generate a
location from the item/key and that's where the item lives. To find an
item, you just go to the location returned by your "hash function" for that
item.
Since you'll rarely get a totally unique set of results, the hashing process
also has to handle "collisions", where two keys generate the same result
when hashed.
Insertions are slower because more work is involved, but retrievals are very
fast.
<<..., and could hash be used to speed up sort? >>
I wish I could remember a thread from a while back that did some comparisons.
Normally, you don't think of a hash table as a sorted structure but REBOL is
so very cool, that you can just change the datatype from block! to list! to
hash! and try things out. Do some tests and see what happens with your
actual data. Lists are good if you're doing lots of insertions and
deletions, and hashes are good if you're doing a lot of lookups. Testing
will tell, though, how SORT interacts with the datatypes, your particular
data, and how you use it. All sorting algorithms are not created equal. :)
--Gregg
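Gregg's description, a hash function that computes an item's location directly, with collisions chained inside a "bucket", can be sketched as a toy hash table in Python. `ToyHash` and its methods are invented names for illustration only, not any real library API:

```python
# A toy bucketed hash table: the hash function computes a bucket index
# directly from the key, and keys that collide into the same bucket are
# chained together in a list.
class ToyHash:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._bucket(key)]
        for pair in bucket:
            if pair[0] == key:        # key already present: replace value
                pair[1] = value
                return
        bucket.append([key, value])   # collision handling by chaining

    def get(self, key):
        for k, v in self.buckets[self._bucket(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = ToyHash()
t.put("kai", 1)
t.put("en", 2)
print(t.get("kai"))  # 1
```

Lookup goes straight to one bucket and scans only its (short) chain, which is why retrievals stay fast even as the table grows.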
[14/62] from: joel:neely:fedex at: 5-Sep-2002 17:18
Hi, Gregg and all...
Having a furiously busy week... not much time to keep up with the
list, I'm afraid! :-(
Gregg Irwin wrote:
> ... rather than doing comparisons of items in a linear or tree
> structure, you generate a location from the item/key and that's
<<quoted lines omitted: 3>>
> hashing process also has to handle "collisions", where two keys
> generate the same result when hashed.
Good summary! I might only add that one can think of a hashed
structure as a collection of "buckets", where the keying function
tells which bucket to look in, and should uniformly distribute the
items across all buckets (usually in a pseudo-random fashion).
There's a whole art devoted to constructing "perfect hash functions"
for specific sets of values (e.g. the key words in a programming
language), but perfection normally breaks when more key values are
added to the set.
> <<..., and could hash be used to speed up sort? >>
>
No. Because hashing essentially randomizes the keys across the
collection of buckets, the keys that end up in the same bucket
normally don't have a nice relationship vis-a-vis some obvious
sort order.
-jn-
[15/62] from: louisaturk:coxinet at: 5-Sep-2002 18:01
Hi Anton,
At 05:08 AM 9/6/2002 +1000, you wrote:
>Here's what was meant, I think.
>string: {
<<quoted lines omitted: 13>>
>foreach [one two three] blk [print [one two three]]
>Anton.
Oh, I see. But there is a problem in that some of the lines are as follows:
003 apo twn presbuterwn kai arcierewn kai grammatewn
002 umwn merimnwn dunatai prosqeinai epi thn hlikian
002 umin anasthsei kurioj o qeoj umwn ek twn adelfwn
with an unknown number of spaces. Sorry, this is my fault for not
noticing that all the sample data I originally gave contained a number and
two words, and therefore was not representative of all the data.
Louis
[16/62] from: al:bri:xtra at: 6-Sep-2002 16:43
Here's my quick version:
Rebol []
Data: {003 apo twn presbuterwn kai arcierewn kai grammatewn
002 umwn merimnwn dunatai prosqeinai epi thn hlikian
002 umin anasthsei kurioj o qeoj umwn ek twn adelfwn
}
write %Sort.txt Data
Data: sort/skip map read/lines %Sort.txt function [Line [string!]] [Loaded]
[
Loaded: load/next trim Line
reduce [
Loaded/1
trim Loaded/2
]
] 2
probe Data
halt
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[17/62] from: anton:lexicon at: 6-Sep-2002 16:07
*Right* then, try this:
; data must start with newline to help the parse rule
string: {
454 en tw
395 en th
313 kai o
175 oi de asdf
314 eij thn
174 eij ton
124 kai ouk dffd fder
123 kai thn
219 ek tou
160 kai en tty ttq qzfcv
}
blk: make block! 1000 ; or 2 * number of lines in the string
; that's instead of [copy []], a speed increase for insert
whitespace: charset " ^-" ; spaces and tabs
parse/all string [
some [
"^/" ; each line starts with a newline
some whitespace
copy number to " "
(insert tail blk to integer! number)
" "
copy string to "^/" ; the rest of the line
(insert tail blk string)
]
]
sort/skip blk 2
foreach [one two] blk [print [one two]]
Anton.
[18/62] from: carl:cybercraft at: 6-Sep-2002 20:27
On 06-Sep-02, [SunandaDH--aol--com] wrote:
> Louis:
>> Your sort worked perfectly. Thanks also for the explanation.
<<quoted lines omitted: 8>>
> I'm glad to hear it worked, even if it took three hours.
> In my experience, you can speed things up with more memory.
Or, perhaps, by using less?
I've come late into this thread, but I take it Louis has large text
files which he wants to sort by the first three characters in the
line, which are of the form "001", "002", "001" etc?
Why not, instead of loading the whole files in, just build an index to
the lines in the file and sort that? For instance...
write %test.txt {001 a b c
002 cc dd ee ff
001 g h i
003 jj kk ll mm
002 n m o p}
This then allows us to treat the file on disk as a series...
>> file: open/lines %test.txt
>> print file/1
001 a b c
>> print file/3
001 g h i
>> print file/4
003 jj kk ll mm
>> close file
So, to create an unsorted index based on the first three letters of
each line...
file-index: []
file: open/lines %test.txt
forall file [
append file-index copy/part file/1 3
append file-index index? file
]
close file
After running the above, file-index contains...
>> file-index
== ["001" 1 "002" 2 "001" 3 "003" 4 "002" 5]
That we can now sort...
>> sort/skip file-index 2
== ["001" 1 "001" 3 "002" 2 "002" 5 "003" 4]
and use to print out the file with its lines sorted...
file: open/lines %test.txt
foreach [code line] file-index [
print file/:line
]
close file
Which, when run, returns...
001 a b c
001 g h i
002 cc dd ee ff
002 n m o p
003 jj kk ll mm
How long REBOL would take to sort 29,688 of those pairs of values I
don't know (maybe the strings would be better converted to integers
before sorting?), but it should cut down a lot on the memory use.
This all assumes I've understood the problem right, of course. (:
--
Carl Read
[19/62] from: carl:cybercraft at: 6-Sep-2002 23:23
On 06-Sep-02, [SunandaDH--aol--com] wrote:
> But when it comes to working out what is actually faster none of us
> has much of an internal model of how Rebol goes about doing things.
> All we can do is speculate and experiment.
I've just had a play with some randomly generated data, and I suspect
with my method the loading of the data might take a good amount of
time compared to the actual sorting, and in that case using a list
instead of a block would help to speed the loading up. Lists are
slightly different to blocks though, so you should read up on them
before just assuming they're like a block. For instance, you might
load a block like this...
>> blk: []
== []
>> for n 1 9 1 [append blk n]
== [1 2 3 4 5 6 7 8 9]
whereas with a list, insert could be used as it moves the index with
each insert...
>> lst: to-list []
== make list! []
>> for n 1 9 1 [insert lst n]
== make list! []
>> head lst
== make list! [1 2 3 4 5 6 7 8 9]
and that would be faster than using append on a list.
--
Carl Read
[20/62] from: rotenca:telvia:it at: 6-Sep-2002 14:48
Hi,
> >I think that he can parse the string building a sortable block and then
rejoin
> >the data in a string.
>
> Would you please give an example?
My idea is like Anton's. Andrew came up with an interesting idea, but how
fast is it?
If the whole line had a fixed length, we could use another, faster method;
here I presume lines are of different lengths.
Here I assume that the number strings are right-aligned and of fixed length,
and that the separators are spaces (not tabs), to be a little faster.
REBOL []
probe r: 10000 ;10000 lines
string: make string! 2000000
string: head insert/dup tail string
{ 454 en tw
12395 en th
313 kai o
175 oi de asdf
313 eij thn
174 eij ton
124 kai ouk dffd fder
4123 kai thn
219 ek tou
160 kai en tty ttq qzfcv
} r / 10
blk: make block! 2 * r ; 2 * number of lines in the string
s2: has [num rest] [
clear blk
parse/all string [
some [
copy num [any " " to " "]
copy rest thru newline
(insert insert tail blk num rest)
]
]
to string! sort/skip blk 2
]
probe s2
---
Ciao
Romano
[21/62] from: gscottjones:mchsi at: 6-Sep-2002 10:22
Hi, everyone,
My two cents worth:
I recreated a data set similar to Louis's (mine being total nonsense, of
course :-) with the following algorithm:
loop 29688 [
write/append/lines %//windows/desktop/data1.txt rejoin [
" "
random/only {1234567890}
random/only {1234567890}
random/only {1234567890}
" "
random {eawk acef 32233}
]
]
It's a quick hack, so stop laughing! A sample of the data looks like:
...
977 3 2e33aewacf2k
638 e ak33we2c3a 2f
355 ak cefa233we3 2
216 e ec3w3aa2f2k3
422 3fc2eeaa33w k2
462 wa3 ae3e2fk32c
238 wea 3ack2 23f3e
616 2aa3 f3k3w2e ec
658 w 32 ckfa33a2ee
537 3e 3fw2e acak23
592 33 c3ef22we aka
858 e3w3 cea2k2fa 3
453 efca kw223ae33
...
and the resulting file size was 637kb.
As a rough comparison, to see if I was on track, I adapted Sunanda's code
(assuming I got the final version correctly):
;;;;;;;;;;;;;;;;;;;;;;;;
t: now/time
sort/compare sorted-data: copy data func [a b /local a-key b-key ] [
a-key: join second parse/all a " " ["-" index? find data a]
b-key: join second parse/all b " " ["-" index? find data b]
;print [a-key " " b-key]
return a-key < b-key
]
print now/time - t
;;;;;;;;;;;;;;;;;;;;;;;;
I didn't do a dedicated run (I needed to do a few other things), but I
interrupted the run after 4 hours on my 500mhz Celeron 128MB Win98.
Next, I drew a few ideas from the contributions, but I was determined to
let it run in memory. Here was the resulting code:
;;;;;;;;;;;;;;;;;;;;;;;;
rebol []
full-time: now/time
data: read/lines %//windows/desktop/data1.txt
data-blk: copy []
foreach datum data [
append data-blk second parse/all datum " "
append data-blk datum
]
sort-time: now/time
sort/skip data-blk 2
print now/time - sort-time
foreach [key value] data-blk [
write/lines/append %//windows/desktop/sorted-data.txt value
]
print now/time - full-time
print "done"
ask ""
;;;;;;;;;;;;;;;;;;;;;;;;
The idea was to parse through the data only once, constructing a new
key-and-value pairing data block. Then I did a simple sort/skip by 2.
Time for the sort (alone) was 1 to 2 seconds on my machine, and total
runtime about 33 seconds. Sample of the resulting data:
...
000 ae332ae ckw 3f2
000 ea a32c323w fek
000 3ac33 wee2 2afk
000 3 22kee33w acaf
000 2a2e3cak f e33w
000 ak 3e2fe3wa2c3
000 ae2 2efw3ca3 3k
000 ee2ca3w 3ka3f2
000 2 afewe333cak2
000 ce2ew3a3a3 k2f
000 kcaa3f23 w e32e
000 3e2kcwf33 2eaa
000 e3e 2waf cak323
000 332e ak2eca3fw
000 3fek2wa3c a23 e
000 32ck e ea23faw3
000 akf33 eace3w2 2
000 332ew2 ka3eca f
...
I am assuming that this is considered a stable sort based on the first
alphanumeric "word" only.
Comments welcome.
Respectfully,
--Scott Jones
[22/62] from: greggirwin:mindspring at: 6-Sep-2002 11:36
Hi Scott,
Good deal! I used your data generator and then did this:
data: read/lines %data-1.txt
compare-items: func [a b /local aa bb] [
aa: to integer! first parse a none
bb: to integer! first parse b none
either aa = bb [0][either aa < bb [-1][1]]
]
t: now/time/precise
sort/compare data :compare-items
print now/time/precise - t
halt
which gave me these results for 3 runs (on a P900 w/384 Meg of RAM):
1. 0:00:20.099
2. 0:00:20.269
3. 0:00:20.099
Changing the comparisons to use Sunanda's approach (assuming equality is
least likely): either aa < bb [-1][either aa > bb [1][0]]
1. 0:00:19.508
2. 0:00:19.528
3. 0:00:19.518
Making the data a hash! didn't speed it up, and trying to make it a list!
didn't work. Even a plain SORT on the list didn't work, and inserted 'end
for some items. I haven't investigated.
The data looks stable at a glance. Not verified though.
>> data
== [" 000 c3wef22aeka 33" " 000 aw333afk c2e 2e" " 000 ake3c2we3 3a2f" "
000 wa2k2c ee3 3fa3" " 000 wa fc32a2k3ee 3" " 000 3k3e32...
--Gregg
[23/62] from: greggirwin:mindspring at: 6-Sep-2002 11:53
Scott, et al
OK, I didn't think much before, so here's another whack that preps the data
before sorting. Prep time is included in timings.
t: now/time/precise
data: make block! 2 * length? raw-data: read/lines %data-1.txt
foreach item raw-data [
append data reduce [to integer! first parse item none item]
]
compare-items: func [a b /local aa bb] [
either a < b [-1][either a > b [1][0]]
]
sort/skip/compare data 2 :compare-items
data: extract next data 2
print now/time/precise - t
halt
And the results:
1. 0:00:02.964
2. 0:00:02.974
3. 0:00:02.974
--Gregg
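Gregg's prep-then-sort approach is the classic decorate-sort-undecorate pattern: compute each line's key once, sort on the keys, then discard them, instead of re-parsing on every comparison. A minimal Python sketch, with sample lines borrowed from the data:

```python
lines = [" 454 en tw", " 175 oi de", " 313 kai o", " 175 eij ton"]

# Decorate: compute each line's integer key exactly once.
decorated = [(int(line.split()[0]), line) for line in lines]

# Sort on the key only; Python's stable sort keeps equal keys
# (the two 175 lines) in their original order.
decorated.sort(key=lambda pair: pair[0])

# Undecorate: drop the keys again.
result = [line for _, line in decorated]
print(result)
```

The win is the same one Gregg measured: the per-line parsing cost is paid n times instead of O(n log n) times inside the comparator.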
[24/62] from: louisaturk:coxinet at: 6-Sep-2002 16:04
Hi Everybody,
You guys are great! I've been following this thread with amazement. I am
so interested I can hardly keep my mind on other urgent projects which
unfortunately I cannot put aside right now.
Anyway, here is the data to be sorted if perchance it might be useful for
timing purposes (691KB):
http://www.pusatberita.com/test.txt
Louis
PS And many thanks to all of you for helping me like you always do.
[25/62] from: al:bri:xtra at: 7-Sep-2002 10:49
With Louis' data and this script:
Rebol []
Then: now/time
Data: sort/skip map read/lines %test.txt
function [Line [string!]] [Loaded] [
trim Line
if not empty? Line [
Loaded: load/next Line
reduce [
Loaded/1
Line
]
]
] 2
write/lines %Sorted.txt map Data
func [N [integer!] Line [string!]] [
Line
]
print now/time - Then
halt
I got a time of two seconds (0:00:02) on my machine with Rebol/View. It's a
AMD Athlon XP 1600+, 1.40GHz with 512MB of RAM.
The first line of %Sorted.txt is:
002 umwn merimnwn dunatai prosqeinai epi thn hlikian
and the last line is:
521 tou qeou
I suspect that my machine's hardware makes most of the difference in time.
Louis, if you want, I can email the sorted text back to you. Just let me
know.
I hope that helps!
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[26/62] from: joel:neely:fedex at: 6-Sep-2002 18:29
Hi, Louis, and everybody,
I've been covered up, so I may have missed something, but here's my
suggestion for a different approach (forgive me if somebody has
mentioned this already or if I've misunderstood the problem!)
Louis A. Turk
wrote:
> Anyway, here is the data to be sorted if perchance it might be
> useful for timing purposes (691KB):
>
> http://www.pusatberita.com/test.txt
>
After glancing at your data, I decided to try a version that just
eliminates the sorting entirely. Let me restate the problem in
that fashion. We have:
- a file of text lines, each of which contains:
- a single space
- a three digit number (with zero padding on the left)
- a single space
- some words
and we desire
- the lines rearranged by the leading number, but otherwise in
the same order as the original data.
Since there are relatively few three-digit integers, I thought I'd
try setting up a collection of blocks, so that block 1 will hold
all lines beginning with 001, block 2 will hold all lines beginning
with 002, etc. Here's the code, including timers...
8<--------
REBOL []
buffer: []
t0: now/time/precise
foreach item read/lines %pusatberita.text [
nr: to-integer copy/part next item 3
while [
nr > length? buffer
][
insert/only tail buffer copy []
]
append buffer/:nr item
]
t1: now/time/precise
print to-decimal t1 - t0
8<--------
When I run the above on my desk at work, using a downloaded copy
of your data, it takes about one second. To iterate through the
data lines in the desired order, we can do something like
foreach group buffer [
foreach line group [
print line ;; or whatever you wish to do with it
]
]
Someone who's already been benchmarking might try comparing this
with what's already been proposed, using the data from your web
site.
-jn-
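Joel's no-sort approach is a distribution (bucket) sort: because the keys are small three-digit integers, each line can be dropped straight into its bucket in a single pass, and reading the buckets in key order yields the lines sorted, with equal keys still in their original sequence. A minimal Python sketch with made-up lines:

```python
lines = [" 002 umin ...", " 001 apo ...", " 002 umwn ...", " 001 kai ..."]

# One bucket per possible three-digit key (000-999).
buckets = [[] for _ in range(1000)]
for line in lines:
    nr = int(line[1:4])         # the three digits after the leading space
    buckets[nr].append(line)    # equal keys keep their input order

# Concatenating the buckets in key order gives a stable result
# in a single O(n) pass, with no comparison sort at all.
result = [line for bucket in buckets for line in bucket]
print(result)
```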
[27/62] from: carl:cybercraft at: 7-Sep-2002 11:16
On 07-Sep-02, Gregg Irwin wrote:
> Hi Scott,
> Good deal! I used your data generator and then did this:
<<quoted lines omitted: 21>>
> list! didn't work. Even just a plain SORT on the list didn't work,
> and inserted 'end for some items. Haven't investigated.
You don't have to look far...
>> x: to-list [1 2 3]
== make list! [1 2 3]
>> sort x
== make list! [1 2 end]
>> head x
== make list! [1 2 end]
!
That's on the non-beta View. Do the beta REBOLs give the same
results?
Anyway, using Scott's data-file, here's my leaving-file-on-the-disk
approach. I expect it's slower than loading the whole file into
memory, but might be better for super-huge files...
REBOL []
data1: %data1.txt ; Set paths to where you want.
data2: %data2.txt
; Scott's data creation code...
if not exists? data1 [
loop 29688 [
write/append/lines data1 rejoin [
" "
random/only {1234567890}
random/only {1234567890}
random/only {1234567890}
" "
random {eawk acef 32233}
]
]
]
if exists? data2 [delete data2] ; *** Deletes previous runs!
; Sorting...
t: now/time/precise
file-index: copy []
file: open/lines data1
forall file [
append file-index reduce [copy/part file/1 4 index? file]
]
close file
sort/skip file-index 2
file: open/lines data1
foreach [code line] file-index [
write/append/lines data2 file/:line
]
close file
print now/time/precise - t
--
Carl Read
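Carl's leave-the-file-on-disk idea amounts to sorting a small (key, line-number) index rather than the full lines. A Python sketch of the same two-pass scheme, using a temporary file and hypothetical sample data; a truly disk-bound version would seek per line instead of calling `readlines` in the second pass:

```python
import os
import tempfile

data = " 002 cc dd\n 001 a b\n 003 jj kk\n 001 g h\n"
src = tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt")
src.write(data)
src.close()

# Pass 1: build a small index of (4-char key, line number) pairs --
# much lighter than holding every full line decorated with its key.
with open(src.name) as f:
    index = [(line[:4], n) for n, line in enumerate(f)]
index.sort()  # sorts on the key; ties break on line number, so it's stable

# Pass 2: emit the lines in index order.
with open(src.name) as f:
    all_lines = f.readlines()
result = [all_lines[n].rstrip("\n") for _, n in index]
print(result)
os.unlink(src.name)
```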
[28/62] from: reffy:ulrich at: 6-Sep-2002 18:58
I apologize ..
Didn't mean to send that to the whole list !
Dick
[29/62] from: reffy:ulrich at: 6-Sep-2002 18:53
-- Attached file included as plaintext by Listar --
That sounds like a fast machine you have Andrew.
I would like for you to run a test if you would.
SOURCE CODE of what I am sending:
/* QSORT.C: This program reads the name of a file
* to be sorted, from the command line. The lines
* are read into memory then sorted and then written
* to stdout.
*
* Note: The sort is of the entire ASCII line and does
* not parse the line in any way.
*
* Sample:
* qsort sortme.txt > sorted.txt
*/
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
#define MAX_LINE_SIZE 1024
int compare( const void *arg1, const void *arg2 );
char line[MAX_LINE_SIZE];
char ** lines;
int lcnt = 0;
FILE * stream = NULL;
time_t start, finish;
double elapsed_time;
void CountTheLines(char *);
void ReadTheLines(char *);
void SortTheLines(void);
void OutputTheLines(void);
int main( int argc, char **argv )
{
if (argc != 2)
{
printf("Usage: %s nameoffiletosort\n", argv[0]);
exit(1);
}
time( &start );
CountTheLines(argv[1]);
ReadTheLines(argv[1]);
SortTheLines();
OutputTheLines();
time( &finish );
elapsed_time = difftime( finish, start );
printf( "\nSorted %d lines in %6.0f seconds.\n", lcnt, elapsed_time );
return 0;
}
void CountTheLines(char *fn)
{
// Count the lines in the file
stream = fopen(fn, "r");
if(stream == NULL)
{
printf( "Unable to open input file\n" );
exit(1);
}
while (fgets(line, MAX_LINE_SIZE, stream) != NULL)
lcnt++;
fclose(stream);
// Allocate a place to put the lines
lines = (char **)malloc(lcnt*sizeof(char *));
}
void ReadTheLines(char *fn)
{
// Read lines into memory
lcnt = 0;
stream = fopen(fn, "r");
if(stream == NULL)
{
printf( "Unable to open input file\n" );
exit(1);
}
while (fgets(line, MAX_LINE_SIZE, stream) != NULL)
lines[lcnt++] = strdup(line);
fclose(stream);
}
void SortTheLines()
{
// Sort the lines
qsort( (void *)lines, (size_t)lcnt, sizeof( char * ), compare );
}
void OutputTheLines()
{
int i;
// Output the sorted lines
for( i = 0; i < lcnt; ++i )
printf( "%s", lines[i] );
}
int compare( const void *arg1, const void *arg2 )
{
/* Compare all of both strings: */
return _stricmp( * ( char** ) arg1, * ( char** ) arg2 );
}
Download NeoPlanet at http://www.neoplanet.com
-- Binary/unsupported file stripped by Listar --
-- Type: Application/Octet-Stream
[30/62] from: carl:cybercraft at: 7-Sep-2002 12:53
On 07-Sep-02, Louis A. Turk wrote:
> Hi Everybody,
> You guys are great! I've been following this thread with amazement.
<<quoted lines omitted: 3>>
> useful for timing purposes (691KB):
> http://www.pusatberita.com/test.txt
On this very slow Amiga, my leave-file-on-the-disk method took 28
minutes with your data Louis. However, the vast majority of that
time was taken up with the final reading-lines-from-disk and
writing-lines-back-to-disk loop. Index creation took 75 seconds and
sorting 31 seconds. I assume the saving of the data could be sped up
by reading in blocks of lines (say 1000) before writing them out,
which would stop the switching from read to write for each line.
Others with more experience of REBOL's random-access of files
probably know a more sensible method for doing this than I used...
And a question for those others: Does...
file: open/lines file-name
load the whole file into memory anyway? As I didn't notice any
disk-reading while the index was being built. If so, it sort of
makes the whole point of attempting to work with the file on disk a
bit of a waste of effort.
PS. And it's quite fun and kind of hypnotic to scroll fast through
Louis's data after it's been sorted. Especially after you get past
the first few numbered lines. (:
--
Carl Read
[31/62] from: carl:cybercraft at: 7-Sep-2002 13:01
On 07-Sep-02, Joel Neely wrote:
> Since there are relatively few three-digit integers, I thought I'd
> try setting up a collection of blocks, so that block 1 will hold
> all lines beginning with 001, block 2 will hold all lines beginning
> with 002, etc.
Very, very nice Joel. In fact annoyingly so. (;
--
Carl Read
[32/62] from: carl:cybercraft at: 7-Sep-2002 18:42
On 07-Sep-02, [SunandaDH--aol--com] wrote:
> Joel:
>> Someone who's already been benchmarking might try comparing this
<<quoted lines omitted: 6>>
> 2.25 -- Joel's bridge sort
> You are today's lucky benchmark winner!!!
[snip]
> I'm kinda interested in how these perform on machines with far less
> memory than Joel's or mine.
Well, it was time for a cuppa anyway... (: So, on a slowww Amiga I
got...
33:33.21 -- Sunanda's parse in the Sort
02:26.75 -- Gregg's parse before the sort
01:17.20 -- Joel's bridge sort
And, out of interest, I converted my index-sort to work with the file
in memory, (since you kindly left out disk-access in the timing:),
and got a result of 01:48.26. Joel's method still wins though. (:
I've 32megs, but didn't have any memory problems.
Here's my code, with an updated verify tacked on the end. Note how I
saved the sorted block. Simpler than looping through it, and much
quicker, at least on this machine.
;; Carl -- index sort
;; -------------------
unset 'sorted-data
recycle
report-item/start "index sort"
file-index: array 2 * length? data
ptr: 1
foreach item data [
poke file-index ptr copy/part item 4
poke file-index ptr + 1 item
ptr: ptr + 2
]
sorted-data: extract next sort/skip file-index 2 2
report-item/end "index sort"
write/lines %sorted-data-index.txt sorted-data
unset 'sorted-data
recycle
;; Verify sort results
;; ===================
report-item/start "verifying results are the same"
parsed-file: read %sorted-data-parse.txt
pre-parsed-file: read %sorted-data-pre-parsed.txt
bridge-file: read %sorted-data-bridge.txt
index-file: read %sorted-data-index.txt
either all [parsed-file = pre-parsed-file
parsed-file = bridge-file
parsed-file = index-file
]
[print "got same results"]
[print "bad code in there somewhere"]
report-item/end "verifying results are the same"
print "done"
--
Carl Read
[33/62] from: carl:cybercraft at: 7-Sep-2002 22:14
On 07-Sep-02, [SunandaDH--aol--com] wrote:
> Your code makes fewer assumptions than Joel's -- yours is sorting on
> the first four characters of a line. That should be in the format
> "b999" (b=space 9=digit) but your code will work if it isn't
I made the assumption that it'd either be space + 3 digits or 3 digits
+ space - but I should've included a trim just to be sure in case
there was the occasional leading space missing. The important thing
though is that Louis has a good few routines to tweak now. He could
change the format of the code to anything he wished and easily modify
these to suit.
As to the degradation - I wouldn't know what causes that. I've
detected garbage collection with previous speed tests, but never a
general slowdown over time.
--
Carl Read
[34/62] from: g:santilli:tiscalinet:it at: 7-Sep-2002 12:41
Hi Carl,
On Saturday, September 7, 2002, 2:53:19 AM, you wrote:
CR> And a question for those others: Does...
CR> file: open/lines file-name
CR> load the whole file into memory anyway?
It does, if you don't use /DIRECT. (Actually, in theory it should
only load a portion of it, based on the amount of free memory
available; i.e. if it fits into memory, it loads it entirely.
However, it always loaded the whole file into memory for me.)
You can check it by using:
>> file: open/lines/read %AUTOEXEC.BAT
>> file/state/inBuffer
== ["SET BLASTER=A220 I7 D1 H5 P330 T6" "SET CTSYN=C:\WINDOWS" "C:\PROGRA~1\CREATIVE\SBLIVE\DOSDRV\SBEINIT.COM"
{mode con codepage...
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
[35/62] from: al:bri:xtra at: 7-Sep-2002 23:21
This script is a bit shorter:
Rebol []
Then: now/time
write/lines %Sorted.txt extract/index sort/skip map read/lines %test.txt
func [Line [string!]] [
trim Line
if not empty? Line [
reduce [
first load/next Line
Line
]
]
] 2 2 2
print now/time - Then
halt
Still no difference in time though, about 2 seconds.
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[36/62] from: al:bri:xtra at: 7-Sep-2002 23:25
With Joel's script, slightly modified:
buffer: []
t0: now/time/precise
foreach item read/lines %test.txt [
nr: to-integer copy/part next item 3
while [
nr > length? buffer
][
insert/only tail buffer copy []
]
append buffer/:nr item
]
t1: now/time/precise
print to-decimal t1 - t0
halt
Unfortunately, I get:
** Script Error: Invalid argument:
** Where: to-integer
** Near: to integer! :value
>>
I think that's because of the couple of empty lines at the end.
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[37/62] from: al:bri:xtra at: 7-Sep-2002 23:29
With Carl's script:
data1: %test.txt
data2: %Sorted.txt
t: now/time/precise
file-index: copy []
file: open/lines data1
forall file [
append file-index reduce [copy/part file/1 4 index? file]
]
close file
sort/skip file-index 2
file: open/lines data1
foreach [code line] file-index [
write/append/lines data2 file/:line
]
close file
print now/time/precise - t
halt
I get this:
0:00:10.125
proving Carl's right:
> I expect it's slower than loading the whole file into memory, but might be
> better for super-huge files...
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[38/62] from: al:bri:xtra at: 7-Sep-2002 23:35
Unfortunately, the binary has been stripped out by the list software.
I suspect the C program will be slower than my Rebol script, because the C
program reads the file twice, where my Rebol script only reads the file
once. Disk I/O is usually the slowest part of any algorithm.
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[39/62] from: reffy:ulrich at: 7-Sep-2002 7:28
Just for your information,
I quadrupled that file (to be sorted) until there were 118,751 lines.
The little C program I sent sorted it in less than 1 second.
Dick
[40/62] from: reffy:ulrich at: 7-Sep-2002 7:31
Just for your information,
Sorted 237,504 lines in 2 seconds.
Dick
[41/62] from: carl:cybercraft at: 8-Sep-2002 1:02
On 07-Sep-02, Gabriele Santilli wrote:
> Hi Carl,
> On Saturday, September 7, 2002, 2:53:19 AM, you wrote:
<<quoted lines omitted: 5>>
> available; i.e. if it fits into memory, it loads it entirely.
> However, it always loaded the whole file into memory for me.)
Ah, thanks Gabriele. I just should've read the refinements' list a
bit better.
> You can check it by using:
>>> file: open/lines/read %AUTOEXEC.BAT
Oh no I can't. (;
--
Carl Read
[42/62] from: rotenca:telvia:it at: 7-Sep-2002 14:57
Hi, Sunanda
> It shows some interesting deterioration for long-running Rebol consoles:
because memory use increases more and more.
> But look at the way my sort deteriorates!! It is possible that my code gets
> slugged by an internal garbage-collection, and if we ran it enough times any
> of the benchmarks could be hit by that.
The best thing would be to run each test's code on its own.
At the end you will find my two tests. I need raw data, so I must re-read the
test file.
They work well under the new beta (1.2.5); in my tests the beta is twice as
fast on this sort of code.
Here are my tests, where you can see the result of system/stats and the
allocated memory increase:
Rebol 1.2.5.3.1 - Celeron 333 RAM 128 MB Windows 98 first edition
reading data : 0:00:00.11 memory allocated: 6465424
Scott : 0:00:02.74 memory allocated: 10714448
Carl : 0:00:02.58 memory allocated: 12752096
Romano : 0:00:01.54 memory allocated: 15060400
Romano-int : 0:00:01.7 memory allocated: 17671880
Joel : 0:00:01.49 memory allocated: 22394888
got same results
done
In other tests I noticed that Joel's code uses 3 MB more memory than the others.
-----------code-----------
;; Romano -- parse sort
;; -------------------
data: read %../public/www.pusatberita.com/test.txt
report-item/start "Romano"
sorted-data: []
parse/all data [
some [
copy num [any " " to " "]
copy rest thru newline
(insert insert tail sorted-data num rest)
]
]
sorted-data: sort/skip sorted-data 2
report-item/end "Romano"
write %sorted-data-romano.txt sorted-data
unset 'sorted-data
recycle
;; Romano -- parse sort int
;; -------------------
data: read %../public/www.pusatberita.com/test.txt
report-item/start "Romano-int"
blk: []
parse/all data [
some [
h: any " " copy num integer!
:h copy rest thru newline
(insert insert tail blk to integer! num rest)
]
]
blk: sort/skip blk 2
report-item/end "Romano-int"
sorted-data: copy ""
foreach [x x2] blk [insert tail sorted-data x2]
write %sorted-data-romano-int.txt sorted-data
unset 'sorted-data
recycle
; for the next tests:
data: read/lines %../public/www.pusatberita.com/test.txt
-----end of code ----
---
Ciao
Romano
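[Editor's note] Romano's approach above is a decorate-sort-undecorate pattern: parse each line into a [key line] pair, sort the pairs with sort/skip, then keep only the lines. A hedged Python sketch of the same idea (the sample lines are invented):

```python
# Decorate-sort-undecorate: the idea behind sorting [key line] pairs
# with REBOL's sort/skip.
lines = [" 313 kai o", " 096 ei de", " 175 oi de"]

# Decorate: attach the numeric key to each line.
pairs = [(int(line.split()[0]), line) for line in lines]

# Sort on the key only; Python's sort is stable, so equal keys keep
# their original relative order (as sort/skip does in REBOL).
pairs.sort(key=lambda p: p[0])

# Undecorate: keep just the lines.
sorted_lines = [line for _, line in pairs]
```

The decoration step is the only place the line format matters, so changing the key (e.g. to sort descending) touches one expression.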
[43/62] from: gscottjones:mchsi at: 7-Sep-2002 10:09
Hi, all,
Interesting thread. Internet access was down for 16 hours, so I missed out
on some of the fun. :-(
I do want to clarify one point in my email of yesterday. I never wished to
imply that Sunanda's initial code was "slow". I was unsure why Louis' run
took so long. I was trying to recreate the condition so that I could
compare, but I gave up after 4 hours because I assumed *I miscoded*
Sunanda's example, figuring that I had sent the code off into the weeds.
Gregg's response clearly indicated that I *had* done something wrong. Ten
thousand apologies to Sunanda. I should have been more careful in my
writing.
--Scott Jones
[44/62] from: joel:neely:fedex at: 7-Sep-2002 10:05
Hi, Romano,
Romano Paolo Tenca wrote:
> Here are my tests where you can see the result of system/stats
> and the allocated memory increase:
<<quoted lines omitted: 9>>
> In other tests I noticed that Joel's code uses 3 MB more memory
> than the others.
No surprise there, as the "bucket-grouping" approach creates a
second structure containing the same strings, rather than
rearranging the strings within a single block. That's a fairly
common space-for-time tradeoff. This reminds me of the sign in
a programming office...
Computer software developed here:
Good, Fast, Cheap
(pick any two!)
-jn-
--
; Joel Neely joeldotneelyatfedexdotcom
REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip
do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] {
| e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]
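[Editor's note] The bucket-grouping approach Joel refers to can be sketched in Python (the 3-digit key width and the sample lines are assumptions carried over from the thread):

```python
# Bucket grouping: one bucket per possible 3-digit prefix, no comparison
# sort needed. The buckets form a second structure holding the same
# strings -- the space-for-time tradeoff mentioned above.
lines = [" 313 kai o", " 175 oi de", " 313 kai touj", " 096 ei de"]

buckets = [[] for _ in range(1000)]   # keys are 000..999
for line in lines:
    nr = int(line[1:4])               # skip the leading space, take 3 digits
    buckets[nr].append(line)          # appending preserves original tie order

# Read the buckets out in key order to get the grouped result.
grouped = [line for bucket in buckets for line in bucket]
```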
[45/62] from: joel:neely:fedex at: 7-Sep-2002 10:00
Hi, Sunanda,
[SunandaDH--aol--com] wrote:
> Though I'm not sure I'd recommend your method to Louis unless
> he's absolutely certain that that first field is:
<<quoted lines omitted: 8>>
> only worth it if the performance gain is utterly unliveable
> without.
Completely valid concerns; I thought the data bore a strong
resemblance to the desired output described in another thread
about parsing words out of a long chunk of text, so I assumed
that Louis was actually creating the original "unsorted" data
himself. Clearly the production use of an approach like the
one I posted would require that the interface and assumptions
be precisely documented, so that any change upstream could be
checked against the requirements of the downstream code.
-jn-
--
; Joel Neely joeldotneelyatfedexdotcom
REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip
do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] {
| e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]
[46/62] from: g:santilli:tiscalinet:it at: 7-Sep-2002 21:15
Hi Carl,
On Saturday, September 7, 2002, 3:02:53 PM, you wrote:
CR> Oh no I can't. (;
You know, I didn't mean you needed to use THAT file to check it.
;-)
But well, %/S/Startup-Sequence is longer so it will be a better
test case. ;-)
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
[47/62] from: al:bri:xtra at: 8-Sep-2002 9:39
My latest version is:
Rebol []
Then: now/time/precise
write/lines %Sorted.txt sort read/lines %test.txt
print now/time/precise - Then
halt
This takes: 0:00:00.321 (it does vary a bit, +/- a small number.)
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
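[Editor's note] Sorting the raw lines works here only because the numeric prefixes are zero-padded to a fixed width, so lexicographic order on the whole line agrees with numeric order on the key; the tradeoff is that ties are now broken by the rest of the line instead of by original position. A small Python illustration (sample lines invented):

```python
# Zero-padded fixed-width keys: lexicographic order of whole lines
# matches numeric order of the keys.
lines = ["096 proj ton", "454 en tw", "103 ek twn", "096 ei de"]

by_line = sorted(lines)                           # sort whole lines
by_key = sorted(lines, key=lambda l: int(l[:3]))  # stable sort on key only

# The key sequences agree...
assert [l[:3] for l in by_line] == [l[:3] for l in by_key]
# ...but ties are ordered differently: by_line orders "096 ei de" before
# "096 proj ton", while the stable by_key keeps the original tie order.
```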
[48/62] from: al:bri:xtra at: 8-Sep-2002 10:26
My latest version is:
Rebol []
Then: now/time/precise
write/lines %Sorted.txt sort read/lines %Big.txt
print now/time/precise - Then
halt
With a test file that's 20 times as large (13.4MB), I get a time of
0:00:14.25. With Dick's C program, it's 6 seconds.
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[49/62] from: rotenca:telvia:it at: 8-Sep-2002 2:24
Hi Andrew,
> With a test file that's 20 times as large (13.4MB), I get a time of
> 0:00:14.25. With Dick's C program, it's 6 seconds.
good, but that way you sort on the whole line.
Your previous code used a map function which was not included.
---
Ciao
Romano
[50/62] from: rotenca:telvia:it at: 8-Sep-2002 2:15
Hi Sunanda,
> For this and other reasons, I have some doubts about Rebol's ability to run
> 24x7 while pounding loads of data. Any really big designs out there should
> look closely at this issue.
I am not sure. For example, at least under 1.2.5.3.1, my two tests, after some
oscillations, do not change memory allocation (and without any recycle
command).
Your test is rock-stable in memory allocation without recycle. I did not
try other tests.
Often (I think always) continuous memory increases are only the result of a bug
in user code (blocks and strings are the first to check).
Finally, in my tests I did not see any variation in the timing of your test,
and so far we are not sure that memory allocation continues to grow when we
repeat all the tests.
> Thanks for your timings. I get (on my machine) a run time of about 6.5
> seconds. 25% faster than Gregg's but still some way to go to beat Scott and
> Joel. It's that parse again. You only do one compared to Gregg's 750,000 (or
> so). I guess it's not the number of invocations: it's the total data that
> has to be parsed that counts.
I suspect (I'm sure :-) you are using a non-beta Rebol release. I have similar
results under 1.2.3.1.3.
Automatic Rebol memory allocation up to 1.2.3 is buggy; to limit problems,
blocks need to be preallocated with make block!.
This is another round of tests under 1.2.5.3.1, this time also with your code,
which I forgot last time:
1.2.5.3.1 Celeron 333 RAM 128 MB Windows 98 first edition
reading data : 0:00:00.22 memory allocated: 6465424
Sunanda : 0:00:45.92 memory allocated: 9435920
Scott : 0:00:02.8 memory allocated: 13418112
Carl : 0:00:02.09 memory allocated: 13586848
Romano : 0:00:01.37 memory allocated: 15895152
Romano-int : 0:00:01.43 memory allocated: 18477920
Joel : 0:00:01.54 memory allocated: 23200928
got same results
done
And this is a test under 1.2.3.1 with my code optimized for this version (code
follows):
1.2.1.3.1 Celeron 333 RAM 128 MB Windows 98 first edition
reading data : 0:00:00.22 memory allocated: 6071976
Sunanda : 0:00:59.1 memory allocated: 8868200
Scott : 0:00:11.26 memory allocated: 12222608
Carl : 0:00:03.84 memory allocated: 12563472
Romano : 0:00:01.37 memory allocated: 14783712
Romano-int : 0:00:01.38 memory allocated: 17348192
Joel : 0:00:03.85 memory allocated: 18710144
got same results
done
Here I'm faster than Joel. But Joel's code can also be optimized.
-----------code-----------
;; Romano -- parse sort for 1.2.1.3.1
;; -------------------
data: read %../public/www.pusatberita.com/test.txt
report-item/start "Romano"
sorted-data: make block! 80000 ;changed for 1.2.1.3.1
parse/all data [
some [
copy num [any " " to " "]
copy rest thru newline
(insert insert tail sorted-data num rest)
]
]
sort/skip sorted-data 2
report-item/end "Romano"
write %sorted-data-romano.txt sorted-data
unset 'sorted-data
recycle
;; Romano -- parse sort int for 1.2.1.3.1
;; -------------------
data: read %../public/www.pusatberita.com/test.txt
report-item/start "Romano-int"
blk: make block! 80000 ;changed for 1.2.1.3.1
parse/all data [
some [
h: any " " copy num integer!
:h copy rest thru newline
(insert insert tail blk to integer! num rest)
]
]
sort/skip blk 2
report-item/end "Romano-int"
sorted-data: copy ""
foreach [x x2] blk [insert tail sorted-data x2]
write %sorted-data-romano-int.txt sorted-data
unset 'sorted-data
recycle
; for the next tests:
data: read/lines %../public/www.pusatberita.com/test.txt
-----end of code ----
---
Ciao
Romano
[51/62] from: rotenca:telvia:it at: 8-Sep-2002 2:45
Hi Sunanda,
> 1. Parse is slow -- take it out of the critical path (as Gregg and Romano
do)
> or eliminate it (as Scott and Joel do).
> 2. It's not the number of parse's (Greg has one per line; Romano has one for
> the whole dataset) so much as the volume of data to be parsed
I do not agree. The greatest problems are in memory allocation. You can see my
other message to verify the results of code optimized for memory allocation
under 1.2.1.3.1.
I think parse is one of the fastest constructs in Rebol.
> It's been fun!
I agree.
> Sunanda.
---
Ciao
Romano
[52/62] from: al:bri:xtra at: 8-Sep-2002 13:59
Romano wrote:
> good, but that way you sort on the whole line.
Well, it was so much faster that way... :)
> Your previous code used a map funtion which was not included.
Here it is:
Rebol [
Name: 'Map
Title: "Map"
File: %"Map.r"
Author: "Andrew Martin"
eMail: [Al--Bri--xtra--co--nz]
Web: http://valley.150m.com
Date: 26/August/2002
Version: 1.1.0
Purpose: {Maps or applies the function to all elements of the series.}
Category: [util 1]
Acknowledgements: [
"Joel Neely"
"Ladislav"
]
Example: [
Map func [n [number!]] [n * n] [1 2 3]
;== [1 4 9]
Map [1 2 3] func [n [number!]] [n * n]
;== [1 4 9]
Map [1 2 3 4 5 6] func [a] [print [a]]
;1
;2
;3
;4
;5
;6
;== []
Map [1 2 3 4 5 6] func [a b] [print [a b]]
;1 2
;3 4
;5 6
;== []
Map [1 2 3 4 5 6] func [a b c] [print [a b c]]
;1 2 3
;4 5 6
;== []
]
Requires: %Arguments.r
]
Map: function [
{Maps or applies the function to all elements of the series.}
Arg1 [any-function! series!]
Arg2 [any-function! series!]
/Only "Inserts the result of the function as a series."
][
Result Results Function Series
][
any [
all [
any-function? :Arg1 series? :Arg2
(Function: :Arg1 Series: :Arg2)
]
all [
any-function? :Arg2 series? :Arg1
(Function: :Arg2 Series: :Arg1)
]
throw make error! reduce [
'script 'cannot-use rejoin [
{"} mold 'Map " " mold type? :Arg1 {"}
]
rejoin [
{"} mold type? :Arg2 {"}
]
]
]
Results: make Series length? Series
do compose/deep [
foreach [(Arguments :Function)] Series [
if all [
not unset? set/any 'Result Function (Arguments :Function)
not none? Result
] [
either only [
insert/only tail Results :Result
][
insert tail Results :Result
]
]
]
]
Results
]
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[53/62] from: joel:neely:fedex at: 7-Sep-2002 22:06
Hi, Romano, et alia,
Romano Paolo Tenca wrote:
> > 1. Parse is slow -- take it out of the critical path (as Gregg
> > and Romano do) or eliminate it (as Scott and Joel do).
> > 2. It's not the number of parse's (Greg has one per line; Romano
> > has one for the whole dataset) so much as the volume of data to
> > be parsed
>
> I do not agree. The greatest problems are in memory allocation.
>
...
> I think parse is one of the fastest constructs in Rebol.
>
Certainly memory management (especially when constructing blocks
in interpreted expressions as I did) becomes a very significant
component of the total run time as the size/number of blocks
continues to increase.
As for the rest, I must disagree with *both* you and Sunanda.
(Some trick, eh? ;-) I don't attach meaning to absolute phrases
such as "PARSE is slow" or "PARSE is fast", but instead think in
terms such as "PARSE is slower/faster than _____ for doing ____".
For example, if I replace the phrase
to-integer copy/part next item 3
in my original grouping algorithm with
to-integer first parse item none
two things happen:
1) The issues of leading whitespace and exact number of digits
in the numeric-looking prefix now go away.
2) The processing slows down by about 20% (takes about 20% longer
to run).
So, *in*this*case*, PARSE is slower than explicit copying for doing
the job of pulling out a fixed-length/position prefix. Generality
almost always has a price, and performance optimization almost
always increases brittleness.
I'd prefer to rephrase Sunanda's points as:
1) Move as much processing as possible out of inner loops.
2) Process as little data as possible to get the work done.
regardless of how the processing is invoked. In the second point,
doing a whitespace-driven parse of the entire line (or, worse, of
the entire file) is overkill, since we don't need the rest of the
line parsed/broken down for the task as specified.
Reminding myself to take a clue from the first point, we can use
the implicit looping of PARSE, and the fact that the prefixes are
only three digits long to come up with something like this:
8<--------
t0: now/time/precise
buffer: copy/deep array/initial 999 [[]]
digit: charset "0123456789"
parse/all read f [
some [
some " "
copy nr some digit
copy data [thru "^/" | to end] (
append pick buffer to-integer nr data
)
]
]
t1: now/time/precise
print to-decimal t1 - t0
8<--------
which, by my tests, actually runs a bit (~7%) faster than my
first version (which extended the buffer within the inner loop
whenever needed). Feeding that same idea back into the first
version produces something like this:
8<--------
t0: now/time/precise
buffer: copy/deep array/initial 999 [[]]
foreach item read/lines f [
nr: to-integer copy/part next item 3
append buffer/:nr item
]
t1: now/time/precise
print to-decimal t1 - t0
8<--------
which runs about 9% faster than the original.
-jn-
--
; Joel Neely joeldotneelyatfedexdotcom
REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip
do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] {
| e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]
[54/62] from: carl:cybercraft at: 8-Sep-2002 17:09
On 08-Sep-02, Andrew Martin wrote:
> Romano wrote:
>> good, but that way you sort on the whole line.
> Well, it was so much faster that way... :)
But they were in a very pretty order Andrew, which that would've
stuffed up. (;
From near the end of the sorted list...
032 eij ierosoluma
032 eij ton oikon
032 basileian tou
032 oi farisaioi
032 oi arciereij
032 tou swmatoj
032 en tw ierw
032 tw ierw
032 de autw
033 kai apokriqeij
033 eij ierousalhm
033 touj maqhtaj
033 twn agiwn
033 th kardia
033 pasi toij
033 tw ihsou
033 en toutw
033 auton en
033 ex umwn
033 ean tij
033 oti ou
034 apokriqeij de
034 tw pneumati
034 pantaj touj
034 kaqwj kai
034 ton laon
034 met emou
034 oti egw
034 dia thn
034 kai ai
034 oun o
See? Ascending length per number.
--
Carl Read
[55/62] from: al:bri:xtra at: 8-Sep-2002 18:50
Carl Read wrote:
> But they were in a very pretty order Andrew, which that would've stuffed
up. (;
> See? Ascending length per number.
I wonder if that's important to Louis? :)
Has anyone worked out what language the words are?
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[56/62] from: joel:neely:fedex at: 8-Sep-2002 6:47
Hi, all,
Andrew Martin wrote:
> Carl Read wrote:
> > See? Ascending length per number.
>
Would you believe "descending"? !!!
> I wonder if that's important to Louis? :)
>
> Has anyone worked out what language the words are?
>
It's Greek to me... ;-)
-jn-
(seriously)
--
; Joel Neely joeldotneelyatfedexdotcom
REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip
do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] {
| e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]
[57/62] from: carl:cybercraft at: 9-Sep-2002 0:53
On 08-Sep-02, Joel Neely wrote:
> Hi, all,
> Andrew Martin wrote:
<<quoted lines omitted: 3>>
>>
> Would you believe "descending"? !!!
One ascends from the bottom up Joel - and as you go up each group of
numbers, the lengths of the strings get longer... (;
--
Carl Read
[58/62] from: joel:neely:fedex at: 8-Sep-2002 7:35
Note to self: be more complete!
I should have pointed out that the modified algorithm below splits
the group number from the remainder of the line; therefore, some
post-processing might be needed to re-attach the three-digit prefix
to the data (unless, of course, the subsequent processing could use
the bucket index for that purpose). So the time comparison may not
be completely fair.
-jn-
Joel Neely wrote:
> Reminding myself to take a clue from the first point, we can use
> the implicit looping of PARSE, and the fact that the prefixes are
<<quoted lines omitted: 18>>
> first version (which extended the buffer within the inner loop
> whenever needed)...
--
; Joel Neely joeldotneelyatfedexdotcom
REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip
do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] {
| e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]
[59/62] from: joel:neely:fedex at: 8-Sep-2002 7:29
Hi, Sunanda,
Very nice summary!
May I pick one small nit and reminisce for a moment?
[SunandaDH--aol--com] wrote:
> ... thread has developed into two parallel tracks, with
> a third on the way.
>
> TRACK 1
> Has several of us looking at increasingly faster ways to
> pre-process the lines to get stable sort keys.
>
I should have been more clear about the goal of my submission.
I meant to imply (but I didn't state well) that we need to be
careful about the force of habit, as it is VERY powerful. We
all jumped to the conclusion, based on Louis's first question,
that we needed to sort the data.
We've even found a way to call the solution I proposed a kind
of sort.
Actually, I began thinking about the fact that most of the
order that Louis requested was already present (i.e., the lines
with identical prefixes were already in the desired order,
relative to each other), which led me simply to group the data
by the one field that had actionable differences, the prefix.
Ergo, a solution that actually doesn't involve sorting at all.
Now for the reminiscence.
I didn't remember it (consciously... ;-) until writing this
particular email, but there's a really lovely book titled
_Programming_Pearls_ by Jon Bentley of AT&T Bell Labs (the
copy in front of me is the first edition by Addison Wesley,
(c) 1986, but I believe there's a newer version out). The
book is a collection of eponymous columns Bentley wrote for
CACM.
The very first column, titled "Cracking the Oyster", opens
as follows:
The programmer's question was simple: "How do I sort
on disk?" Before I tell you how I made my first
mistake, let me give you a chance to do better than
I did. What would you have said?
DON'T SCROLL DOWN
UNTIL
YOU
HAVE
THOUGHT
ABOUT
YOUR
ANSWER
TO
BENTLEY'S
QUESTION
,
PLEASE
!
He then continues:
My mistake was to answer his question. I gave him a
thumbnail sketch on how to sort on disk. My suggestion
that he dig into Knuth's classic _Sorting_and_Searching_
met with less than enthusiasm -- he was more concerned
about solving the problem than with furthering his
education. I then told him about the disk sorting
program in Chapter 4 of Kernighan and Plauger's
_Software_Tools_. Their program consists of about two
hundred lines of Ratfor code in twelve procedures;
translating that into several hundred lines of FORTRAN
and testing the code would have taken about a week.
(Nothing like dating yourself, right? ;-)
Bentley goes on to describe how he (fortunately) kept asking
questions until the *REAL* nature of the problem became clear,
as summarized below:
- The system was used for re-organizing political districts;
the data were the precinct numbers that made up a district.
- Each value was an integer (a precinct number).
- It was illegal for a value to appear more than once.
- There would be up to 27,000 values (in the range 1..27,000).
- The system only had 1K-2K sixteen-bit words of free memory
at that point. (I told you I was dating myself! ;-)
Given this restatement of the problem, *now* how would you
advise the programmer? (other than offering him a ride in your
time machine, of course! ;-)
-jn-
--
; Joel Neely joeldotneelyatfedexdotcom
REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip
do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] {
| e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]
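[Editor's note] For the record, the answer Bentley eventually arrives at in that column is a bit-vector sort: since the values are distinct integers in a known range, one bit per possible value suffices, and reading the set bits back in index order emits them sorted. A hedged Python sketch (the sample precinct numbers are invented):

```python
# Bit-vector sort: 27,000 possible values need only 27,000 bits,
# roughly 1.7K sixteen-bit words -- it fits the tiny memory budget.
N = 27000
bits = bytearray((N + 7) // 8)        # the bit vector, all zeros initially

def mark(v):
    """Set the bit for precinct number v (1-based, distinct by assumption)."""
    bits[(v - 1) // 8] |= 1 << ((v - 1) % 8)

for v in [204, 7, 26900, 3]:          # each value appears at most once
    mark(v)

# Scan the bits in index order: the set ones come out sorted.
sorted_values = [v for v in range(1, N + 1)
                 if bits[(v - 1) // 8] & (1 << ((v - 1) % 8))]
```

The distinctness constraint is what makes this work: one bit per value is enough only because no value needs a count.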
[60/62] from: reffy:ulrich at: 8-Sep-2002 8:53
I am thinking Rebol for the most part is very well put together internally.
Interpreters are notoriously slow for some things and fast for others.
In general, the avoidance of dynamic memory allocation (i.e. using C's malloc and such)
is a big winner.
I took the quick hack and, with a half million lines, a half million mallocs
occur and a million lines are read. Yep, I actually count them first. :-)
Yes, I could rewrite the little qsort.c such that it didn't traverse the lines twice,
once to count.
I could use 2 mallocs and one I/O effort to read the file into memory in one big chunk,
and yes there would be a good improvement.
But we all know that this sort requires little intelligence to make it happen. So the
gain is really not there, unless it is to be done with multiple files (merge and sort)
and done very often, in which case writing a second version of qsort.c might be
a big winner for anyone who might want to call it from Rebol.
As is often the case, the clever solution is the big winner, like realizing that
a parse is not needed in this particular case.
Dick
[61/62] from: reffy:ulrich at: 8-Sep-2002 9:14
The zero-left-filled number width is fixed, is it not? i.e. 3 digits.
[62/62] from: joel:neely:fedex at: 8-Sep-2002 19:34
YMMV, but I typically traverse indexed data structures in order of
increasing indices; as the index goes up, the length of the data
string goes down.
-jn-
Carl Read wrote:
> On 08-Sep-02, Joel Neely wrote:
> > Hi, all,
<<quoted lines omitted: 12>>
--
; Joel Neely joeldotneelyatfedexdotcom
REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip
do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] {
| e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]
Notes
- Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted