Sort by first part of line
[1/62] from: louisaturk:coxinet at: 3-Sep-2002 15:26
Hi fellow rebols,
I still don't quite understand sort. How do I sort the following lines by
the numbers only (not by the letters and not by the length)? Note that
there is a space before each number in the test data below.
Thanks,
Louis
454 en tw
395 en th
313 kai o
175 oi de
314 eij thn
174 eij ton
124 kai ouk
123 kai thn
219 ek tou
160 kai en
142 kai to
126 tw qew
166 kai h
096 ei de
094 ou mh
091 ei mh
120 thj ghj
120 en toij
112 estin o
108 en taij
118 o kurioj
096 proj ton
088 kai touj
082 kai idou
115 kai ta
111 o uioj
111 de kai
103 ek twn
114 eipen autoij
104 tou anqrwpou
071 legei autoij
063 twn ioudaiwn
105 proj auton
092 oi maqhtai
082 legei autw
078 eipen autw
103 ek thj
090 ina mh
086 autw o
084 ou gar
101 apo tou
[2/62] from: chris:langreiter at: 3-Sep-2002 23:22
> 454 en tw
> 395 en th
> 313 kai o
> ...
Assuming you have that data in a variable creatively named data:
result: copy ""
foreach [n a b] sort/skip parse data none 3 [
repend result [" " n " " a " " b newline]
]
i.e. transform the string into a data structure more appropriate for
the sorting endeavours you have in mind.
-- Chris
[3/62] from: greggirwin:mindspring at: 3-Sep-2002 16:27
Hi Sunanda,
My reply to Louis should come in before this, so you'll see that I took a
different route. In any case, I wanted to point something out for
clarification.
<< If Rebol's sort were "stable" -- i.e. it kept equal keys in their
original sequence, then all you'd need to do is to write your own
sort compare routine to compare partial keys. >>
REBOL's sort *is* stable, AFAIK, per Holger. If you provide your own
comparison function then you need to return -1, 0, or 1 (rather than
true/false) and it should still be stable. I haven't done any tests to prove
it, so someone please correct me if I'm wrong.
--Gregg
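For readers following this thread outside REBOL: the behaviour Gregg describes, a three-way comparator returning -1, 0, or 1 while equal keys keep their input order, can be sketched in Python, whose list sort is guaranteed stable. The sample lines are borrowed from Louis's data; `compare_items` is an illustrative name, not part of any library.

```python
from functools import cmp_to_key

# Three-way comparator: -1 if a < b, 0 if equal, 1 if a > b,
# comparing only the leading number of each line.
def compare_items(a, b):
    aa, bb = int(a.split()[0]), int(b.split()[0])
    return (aa > bb) - (aa < bb)

lines = [" 120 thj ghj", " 096 ei de", " 120 en toij", " 094 ou mh"]
result = sorted(lines, key=cmp_to_key(compare_items))

# The two equal keys (120) keep their original relative order: a stable sort.
print(result)
# [' 094 ou mh', ' 096 ei de', ' 120 thj ghj', ' 120 en toij']
```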
[4/62] from: greggirwin:mindspring at: 3-Sep-2002 16:18
Hi Louis,
<< I still don't quite understand sort. How do I sort the following lines
by
the numbers only (not by the letters and not by the length). Note that
there is a space before each number in the test data below. >>
How about this? (assuming items are space delimited strings)
compare-items: func [a b /local aa bb] [
aa: to integer! first parse a none
bb: to integer! first parse b none
either aa = bb [0][either aa < bb [-1][1]]
]
sort/compare data :compare-items
Not terribly efficient, as it's doing a lot of work on each comparison, but
it's a start.
--Gregg
[5/62] from: rotenca:telvia:it at: 4-Sep-2002 1:18
Hi Sunanda and Louis,
for sort you should look at
http://www.rebol.com/docs/core25.html#sect4.1.3.
and
http://www.rebol.com/docs/core25.html#sect4.1.4.
perhaps this helps.
---
Ciao
Romano
[6/62] from: g:santilli:tiscalinet:it at: 4-Sep-2002 16:07
Hi Sunanda,
On Tuesday, September 3, 2002, 11:50:31 PM, you wrote:
Sac> But this doesn't preserve the input sequence -- it looks to me like
Sac> Rebol's sort is ***not*** stable.
It actually is, but if you use /COMPARE you need to return -1, 0
and 1 for it to be stable. (AFAIK) (-1 means a < b, 0 means a = b,
1 means a > b)
I think Louis just needs to LOAD his data, and then sort it with
SORT/SKIP. He can then reproduce the string from the LOADed data.
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
[7/62] from: louisaturk:coxinet at: 4-Sep-2002 12:28
To everyone helping me with this, I'm having to meet a deadline on another
project. I'll get back to this sort problem ASAP. I very much appreciate
all the help.
Louis
[8/62] from: rotenca:telvia:it at: 5-Sep-2002 12:28
Hi Gabriele
> I think Louis just needs to LOAD his data, and then sort it with
> SORT/SKIP. He can then reproduce the string from the LOADed data.
If you do not care about adding thousands of useless words to the global
context, and you are sure every string reduces to valid REBOL datatypes,
that is a solution :-)
I think he can parse the string, building a sortable block, and then rejoin
the data into a string.
> Regards,
> Gabriele.
Ciao
Romano
[9/62] from: louisaturk:coxinet at: 5-Sep-2002 10:46
Hi Sunanda,
Your sort worked perfectly. Thanks also for the explanation.
You might be interested to know that
on my 450 MHz Pentium II running W2K, the total time to sort
29,688 lines using your code was:
3:50:10
I want to thank Sunanda and everyone else that helped with this. I still
have questions, but I'm committed to other things for the rest of this
week. Next week I'll try to ask them.
One question I'll ask now however: What exactly does hash do, and could
hash be used to speed up sort?
Thanks again,
Louis
At 05:50 PM 9/3/2002 -0400, you wrote:
[10/62] from: louisaturk:coxinet at: 5-Sep-2002 12:12
Hi Romano,
It seems that I am having to sort large amounts of data more and more
often, and I want to learn how to do it faster.
At 12:28 PM 9/5/2002 +0200, you wrote:
>Hi Gabriele
> > I think Louis just needs to LOAD his data, and then sort it with
<<quoted lines omitted: 4>>
>I think that he can parse the string building a sortable block and then rejoin
>the data in a string.
Would you please give an example?
Louis
[11/62] from: greggirwin:mindspring at: 5-Sep-2002 13:00
Hi Louis,
<< It seems that I am having to sort large amounts of data more and more
often. I am wanting to learn how to do it faster. >>
One of the biggest things is to use a data structure that doesn't require
processing for every comparison. I.e. pre-parse the data into fields. REBOL,
in my experience, is plenty fast but don't make it do work redundantly if
you can avoid it.
--Gregg
[12/62] from: anton:lexicon at: 6-Sep-2002 5:08
Here's what was meant, I think.
string: {
454 en tw
395 en th
313 kai o
175 oi de
314 eij thn
174 eij ton
124 kai ouk
123 kai thn
219 ek tou
160 kai en
}
blk: parse string none ; split at whitespace
sort/skip blk 3
foreach [one two three] blk [print [one two three]]
Anton.
[13/62] from: greggirwin:mindspring at: 5-Sep-2002 12:52
Hi Louis,
<< What exactly does hash do... >>
Maybe Joel will chime in with a really good definition, but here's a quick
quote and my (probably poor) layman's explanation.
per Knuth, regarding searching algorithms "A third possibility is to avoid
all this rummaging around by doing some arithmetical calculation on K,
computing a function f(K) that is the location of K and the associated data
in the table."
Basically, and someone please correct me if I mis-state things, rather than
doing comparisons of items in a linear or tree structure, you generate a
location from the item/key and that's where the item lives. To find an
item, you just go to the location returned by your "hash function" for that
item.
Since you'll rarely get a totally unique set of results, the hashing process
also has to handle "collisions", where two keys generate the same result
when hashed.
Insertions are slower because more work is involved, but retrievals are very
fast.
<<..., and could hash be used to speed up sort? >>
I wish I could remember a thread from a while back that did some comparisons.
Normally, you don't think of a hash table as a sorted structure but REBOL is
so very cool, that you can just change the datatype from block! to list! to
hash! and try things out. Do some tests and see what happens with your
actual data. Lists are good if you're doing lots of insertions and
deletions, and hashes are good if you're doing a lot of lookups. Testing
will tell, though, how SORT interacts with the datatypes, your particular
data, and how you use it. All sorting algorithms are not created equal. :)
--Gregg
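Gregg's description, a hash function that computes an item's location directly, with collisions chained inside a "bucket", can be sketched as a toy hash table in Python. `ToyHash` and its methods are invented names for illustration only, not any real library API:

```python
# A toy bucketed hash table: the hash function computes a bucket index
# directly from the key, and keys that collide into the same bucket are
# chained together in a list.
class ToyHash:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._bucket(key)]
        for pair in bucket:
            if pair[0] == key:        # key already present: replace value
                pair[1] = value
                return
        bucket.append([key, value])   # collision handling by chaining

    def get(self, key):
        for k, v in self.buckets[self._bucket(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = ToyHash()
t.put("kai", 1)
t.put("en", 2)
print(t.get("kai"))  # 1
```

Lookup goes straight to one bucket and scans only its (short) chain, which is why retrievals stay fast even as the table grows.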
[14/62] from: joel:neely:fedex at: 5-Sep-2002 17:18
Hi, Gregg and all...
Having a furiously busy week... not much time to keep up with the
list, I'm afraid! :-(
Gregg Irwin wrote:
> ... rather than doing comparisons of items in a linear or tree
> structure, you generate a location from the item/key and that's
<<quoted lines omitted: 3>>
> hashing process also has to handle "collisions", where two keys
> generate the same result when hashed.
Good summary! I might only add that one can think of a hashed
structure as a collection of "buckets", where the keying function
tells which bucket to look in, and should uniformly distribute the
items across all buckets (usually in a pseudo-random fashion).
There's a whole art devoted to constructing "perfect hash functions"
for specific sets of values (e.g. the key words in a programming
language), but perfection normally breaks when more key values are
added to the set.
> <<..., and could hash be used to speed up sort? >>
>
No. Because hashing essentially randomizes the keys across the
collection of buckets, the keys that end up in the same bucket
normally don't have a nice relationship vis-a-vis some obvious
sort order.
-jn-
[15/62] from: louisaturk:coxinet at: 5-Sep-2002 18:01
Hi Anton,
At 05:08 AM 9/6/2002 +1000, you wrote:
>Here's what was meant, I think.
>string: {
<<quoted lines omitted: 13>>
>foreach [one two three] blk [print [one two three]]
>Anton.
Oh, I see. But there is a problem in that some of the lines are as follows:
003 apo twn presbuterwn kai arcierewn kai grammatewn
002 umwn merimnwn dunatai prosqeinai epi thn hlikian
002 umin anasthsei kurioj o qeoj umwn ek twn adelfwn
with an unknown number of spaces. Sorry, this is my fault for not
noticing that all the sample data I originally gave contained a number and
two words, and therefore was not representative of all the data.
Louis
[16/62] from: al:bri:xtra at: 6-Sep-2002 16:43
Here's my quick version:
Rebol []
Data: {003 apo twn presbuterwn kai arcierewn kai grammatewn
002 umwn merimnwn dunatai prosqeinai epi thn hlikian
002 umin anasthsei kurioj o qeoj umwn ek twn adelfwn
}
write %Sort.txt Data
Data: sort/skip map read/lines %Sort.txt function [Line [string!]] [Loaded]
[
Loaded: load/next trim Line
reduce [
Loaded/1
trim Loaded/2
]
] 2
probe Data
halt
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[17/62] from: anton:lexicon at: 6-Sep-2002 16:07
*Right* then, try this:
; data must start with newline to help the parse rule
string: {
454 en tw
395 en th
313 kai o
175 oi de asdf
314 eij thn
174 eij ton
124 kai ouk dffd fder
123 kai thn
219 ek tou
160 kai en tty ttq qzfcv
}
blk: make block! 1000 ; or 2 * number of lines in the string
; that's instead of [copy []], a speed increase for insert
whitespace: charset " ^-" ; spaces and tabs
parse/all string [
some [
"^/" ; each line starts with a newline
some whitespace
copy number to " "
(insert tail blk to integer! number)
" "
copy string to "^/" ; the rest of the line
(insert tail blk string)
]
]
sort/skip blk 2
foreach [one two] blk [print [one two]]
Anton.
[18/62] from: carl:cybercraft at: 6-Sep-2002 20:27
On 06-Sep-02, [SunandaDH--aol--com] wrote:
> Louis:
>> Your sort worked perfectly. Thanks also for the explanation.
<<quoted lines omitted: 8>>
> I'm glad to hear it worked, even if it took three hours.
> In my experience, you can speed things up with more memory.
Or, perhaps, by using less?
I've come late into this thread, but I take it Louis has large text
files which he wants to sort by the first three characters in the
line, which are of the form "001", "002", "001" etc?
Why not, instead of loading the whole files in, just build an index to
the lines in the file and sort that? For instance...
write %test.txt {001 a b c
002 cc dd ee ff
001 g h i
003 jj kk ll mm
002 n m o p}
This then allows us to treat the file on disk as a series...
>> file: open/lines %test.txt
>> print file/1
001 a b c
>> print file/3
001 g h i
>> print file/4
003 jj kk ll mm
>> close file
So, to create an unsorted index based on the first three letters of
each line...
file-index: []
file: open/lines %test.txt
forall file [
append file-index copy/part file/1 3
append file-index index? file
]
close file
After running the above, file-index contains...
>> file-index
== ["001" 1 "002" 2 "001" 3 "003" 4 "002" 5]
That we can now sort...
>> sort/skip file-index 2
== ["001" 1 "001" 3 "002" 2 "002" 5 "003" 4]
and use to print out the file with its lines sorted...
file: open/lines %test.txt
foreach [code line] file-index [
print file/:line
]
close file
Which, when run, returns...
001 a b c
001 g h i
002 cc dd ee ff
002 n m o p
003 jj kk ll mm
How long REBOL would take to sort 29,688 of those pairs of values I
don't know (maybe the strings would be better converted to integers
before sorting?), but it should cut down a lot on the memory use.
This all assumes I've understood the problem right, of course. (:
--
Carl Read
[19/62] from: carl:cybercraft at: 6-Sep-2002 23:23
On 06-Sep-02, [SunandaDH--aol--com] wrote:
> But when it comes to working out what is actually faster none of us
> has much of an internal model of how Rebol goes about doing things.
> All we can do is speculate and experiment.
I've just had a play with some randomly generated data, and I suspect
with my method the loading of the data might take a good amount of
time compared to the actual sorting, and in that case using a list
instead of a block would help to speed the loading up. Lists are
slightly different to blocks though, so you should read up on them
before just assuming they're like a block. For instance, you might
load a block like this...
>> blk: []
== []
>> for n 1 9 1 [append blk n]
== [1 2 3 4 5 6 7 8 9]
whereas with a list, insert could be used as it moves the index with
each insert...
>> lst: to-list []
== make list! []
>> for n 1 9 1 [insert lst n]
== make list! []
>> head lst
== make list! [1 2 3 4 5 6 7 8 9]
and that would be faster than using append on a list.
--
Carl Read
[20/62] from: rotenca:telvia:it at: 6-Sep-2002 14:48
Hi,
> >I think that he can parse the string building a sortable block and then
rejoin
> >the data in a string.
>
> Would you please give an example?
My idea is like Anton's. Andrew came up with an interesting idea, but how
fast is it?
If the whole line had a fixed length, we could use another, faster method;
here I presume lines are of different lengths.
Here I assume that the number strings are right-aligned and of fixed length,
and that the separators are spaces (not tabs), to be a little faster.
REBOL []
probe r: 10000 ;10000 lines
string: make string! 2000000
string: head insert/dup tail string
{ 454 en tw
12395 en th
313 kai o
175 oi de asdf
313 eij thn
174 eij ton
124 kai ouk dffd fder
4123 kai thn
219 ek tou
160 kai en tty ttq qzfcv
} r / 10
blk: make block! 2 * r ; 2 * number of lines in the string
s2: has [num rest] [
clear blk
parse/all string [
some [
copy num [any " " to " "]
copy rest thru newline
(insert insert tail blk num rest)
]
]
to string! sort/skip blk 2
]
probe s2
---
Ciao
Romano
[21/62] from: gscottjones:mchsi at: 6-Sep-2002 10:22
Hi, everyone,
My two cents worth:
I recreated a data set similar to Louis's (mine being total nonsense, of
course :-) with the following algorithm:
loop 29688 [
write/append/lines %//windows/desktop/data1.txt rejoin [
" "
random/only {1234567890}
random/only {1234567890}
random/only {1234567890}
" "
random {eawk acef 32233}
]
]
It's a quick hack, so stop laughing! A sample of the data looks like:
...
977 3 2e33aewacf2k
638 e ak33we2c3a 2f
355 ak cefa233we3 2
216 e ec3w3aa2f2k3
422 3fc2eeaa33w k2
462 wa3 ae3e2fk32c
238 wea 3ack2 23f3e
616 2aa3 f3k3w2e ec
658 w 32 ckfa33a2ee
537 3e 3fw2e acak23
592 33 c3ef22we aka
858 e3w3 cea2k2fa 3
453 efca kw223ae33
...
and the resulting file size was 637kb.
As a rough comparison, to see if I was on track, I adapted Sunanda's code
(assuming I got the final version correctly):
;;;;;;;;;;;;;;;;;;;;;;;;
t: now/time
sort/compare sorted-data: copy data func [a b /local a-key b-key ] [
a-key: join second parse/all a " " ["-" index? find data a]
b-key: join second parse/all b " " ["-" index? find data b]
;print [a-key " " b-key]
return a-key < b-key
]
print now/time - t
;;;;;;;;;;;;;;;;;;;;;;;;
I didn't do a dedicated run (I needed to do a few other things), but I
interrupted the run after 4 hours on my 500mhz Celeron 128MB Win98.
Next, I drew a few ideas from the contributions, but I was determined to
let it run in memory. Here was the resulting code:
;;;;;;;;;;;;;;;;;;;;;;;;
rebol []
full-time: now/time
data: read/lines %//windows/desktop/data1.txt
data-blk: copy []
foreach datum data [
append data-blk second parse/all datum " "
append data-blk datum
]
sort-time: now/time
sort/skip data-blk 2
print now/time - sort-time
foreach [key value] data-blk [
write/lines/append %//windows/desktop/sorted-data.txt value
]
print now/time - full-time
print "done"
ask ""
;;;;;;;;;;;;;;;;;;;;;;;;
The idea was to parse through the data only once, constructing a new
key-and-value pairing data block. Then I did a simple sort/skip by 2.
Time for the sort (alone) was 1 to 2 seconds on my machine, and total
runtime about 33 seconds. Sample of the resulting data:
...
000 ae332ae ckw 3f2
000 ea a32c323w fek
000 3ac33 wee2 2afk
000 3 22kee33w acaf
000 2a2e3cak f e33w
000 ak 3e2fe3wa2c3
000 ae2 2efw3ca3 3k
000 ee2ca3w 3ka3f2
000 2 afewe333cak2
000 ce2ew3a3a3 k2f
000 kcaa3f23 w e32e
000 3e2kcwf33 2eaa
000 e3e 2waf cak323
000 332e ak2eca3fw
000 3fek2wa3c a23 e
000 32ck e ea23faw3
000 akf33 eace3w2 2
000 332ew2 ka3eca f
...
I am assuming that this is considered a stable sort based on the first
alphanumeric "word" only.
Comments welcome.
Respectfully,
--Scott Jones
[22/62] from: greggirwin:mindspring at: 6-Sep-2002 11:36
Hi Scott,
Good deal! I used your data generator and then did this:
data: read/lines %data-1.txt
compare-items: func [a b /local aa bb] [
aa: to integer! first parse a none
bb: to integer! first parse b none
either aa = bb [0][either aa < bb [-1][1]]
]
t: now/time/precise
sort/compare data :compare-items
print now/time/precise - t
halt
which gave me these results for 3 runs (on a P900 w/384 Meg of RAM):
1. 0:00:20.099
2. 0:00:20.269
3. 0:00:20.099
Changing the comparisons to use Sunanda's approach (assuming equality is
least likely): either aa < bb [-1][either aa > bb [1][0]]
1. 0:00:19.508
2. 0:00:19.528
3. 0:00:19.518
Making the data a hash! didn't speed it up, and trying to make it a list!
didn't work. Even a plain SORT on the list didn't work, and inserted 'end
for some items. I haven't investigated.
The data looks stable at a glance. Not verified though.
>> data
== [" 000 c3wef22aeka 33" " 000 aw333afk c2e 2e" " 000 ake3c2we3 3a2f" "
000 wa2k2c ee3 3fa3" " 000 wa fc32a2k3ee 3" " 000 3k3e32...
--Gregg
[23/62] from: greggirwin:mindspring at: 6-Sep-2002 11:53
Scott, et al
OK, I didn't think much before, so here's another whack that preps the data
before sorting. Prep time is included in timings.
t: now/time/precise
data: make block! 2 * length? raw-data: read/lines %data-1.txt
foreach item raw-data [
append data reduce [to integer! first parse item none item]
]
compare-items: func [a b /local aa bb] [
either a < b [-1][either a > b [1][0]]
]
sort/skip/compare data 2 :compare-items
data: extract next data 2
print now/time/precise - t
halt
And the results:
1. 0:00:02.964
2. 0:00:02.974
3. 0:00:02.974
--Gregg
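Gregg's prep-then-sort approach is the classic decorate-sort-undecorate pattern: compute each line's key once, sort on the keys, then discard them, instead of re-parsing on every comparison. A minimal Python sketch, with sample lines borrowed from the data:

```python
lines = [" 454 en tw", " 175 oi de", " 313 kai o", " 175 eij ton"]

# Decorate: compute each line's integer key exactly once.
decorated = [(int(line.split()[0]), line) for line in lines]

# Sort on the key only; Python's stable sort keeps equal keys
# (the two 175 lines) in their original order.
decorated.sort(key=lambda pair: pair[0])

# Undecorate: drop the keys again.
result = [line for _, line in decorated]
print(result)
```

The win is the same one Gregg measured: the per-line parsing cost is paid n times instead of O(n log n) times inside the comparator.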
[24/62] from: louisaturk:coxinet at: 6-Sep-2002 16:04
Hi Everybody,
You guys are great! I've been following this thread with amazement. I am
so interested I can hardly keep my mind on other urgent projects which
unfortunately I cannot put aside right now.
Anyway, here is the data to be sorted if perchance it might be useful for
timing purposes (691KB):
http://www.pusatberita.com/test.txt
Louis
PS And many thanks to all of you for helping me like you always do.
[25/62] from: al:bri:xtra at: 7-Sep-2002 10:49
With Louis' data and this script:
Rebol []
Then: now/time
Data: sort/skip map read/lines %test.txt
function [Line [string!]] [Loaded] [
trim Line
if not empty? Line [
Loaded: load/next Line
reduce [
Loaded/1
Line
]
]
] 2
write/lines %Sorted.txt map Data
func [N [integer!] Line [string!]] [
Line
]
print now/time - Then
halt
I got a time of two seconds (0:00:02) on my machine with Rebol/View. It's a
AMD Athlon XP 1600+, 1.40GHz with 512MB of RAM.
The first line of %Sorted.txt is:
002 umwn merimnwn dunatai prosqeinai epi thn hlikian
and the last line is:
521 tou qeou
I suspect that my machine's hardware makes most of the difference in time.
Louis, if you want, I can email the sorted text back to you. Just let me
know.
I hope that helps!
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[26/62] from: joel:neely:fedex at: 6-Sep-2002 18:29
Hi, Louis, and everybody,
I've been covered up, so I may have missed something, but here's my
suggestion for a different approach (forgive me if somebody has
mentioned this already or if I've misunderstood the problem!)
Louis A. Turk
wrote:
> Anyway, here is the data to be sorted if perchance it might be
> useful for timing purposes (691KB):
>
> http://www.pusatberita.com/test.txt
>
After glancing at your data, I decided to try a version that just
eliminates the sorting entirely. Let me restate the problem in
that fashion. We have:
- a file of text lines, each of which contains:
- a single space
- a three digit number (with zero padding on the left)
- a single space
- some words
and we desire
- the lines rearranged by the leading number, but otherwise in
the same order as the original data.
Since there are relatively few three-digit integers, I thought I'd
try setting up a collection of blocks, so that block 1 will hold
all lines beginning with 001, block 2 will hold all lines beginning
with 002, etc. Here's the code, including timers...
8<--------
REBOL []
buffer: []
t0: now/time/precise
foreach item read/lines %pusatberita.text [
nr: to-integer copy/part next item 3
while [
nr > length? buffer
][
insert/only tail buffer copy []
]
append buffer/:nr item
]
t1: now/time/precise
print to-decimal t1 - t0
8<--------
When I run the above on my desk at work, using a downloaded copy
of your data, it takes about one second. To iterate through the
data lines in the desired order, we can do something like
foreach group buffer [
foreach line group [
print line ;; or whatever you wish to do with it
]
]
Someone who's already been benchmarking might try comparing this
with what's already been proposed, using the data from your web
site.
-jn-
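Joel's no-sort approach is a distribution (bucket) sort: because the keys are small three-digit integers, each line can be dropped straight into its bucket in a single pass, and reading the buckets in key order yields the lines sorted, with equal keys still in their original sequence. A minimal Python sketch with made-up lines:

```python
lines = [" 002 umin ...", " 001 apo ...", " 002 umwn ...", " 001 kai ..."]

# One bucket per possible three-digit key (000-999).
buckets = [[] for _ in range(1000)]
for line in lines:
    nr = int(line[1:4])         # the three digits after the leading space
    buckets[nr].append(line)    # equal keys keep their input order

# Concatenating the buckets in key order gives a stable result
# in a single O(n) pass, with no comparison sort at all.
result = [line for bucket in buckets for line in bucket]
print(result)
```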
[27/62] from: carl:cybercraft at: 7-Sep-2002 11:16
On 07-Sep-02, Gregg Irwin wrote:
> Hi Scott,
> Good deal! I used your data generator and then did this:
<<quoted lines omitted: 21>>
> list! didn't work. Even just a plain SORT on the list didn't work,
> and inserted 'end for some items. Haven't investigated.
You don't have to look far...
>> x: to-list [1 2 3]
== make list! [1 2 3]
>> sort x
== make list! [1 2 end]
>> head x
== make list! [1 2 end]
!
That's on the non-beta View. Do the beta REBOLs give the same
results?
Anyway, using Scott's data-file, here's my leaving-file-on-the-disk
approach. I expect it's slower than loading the whole file into
memory, but might be better for super-huge files...
REBOL []
data1: %data1.txt ; Set paths to where you want.
data2: %data2.txt
; Scott's data creation code...
if not exists? data1 [
loop 29688 [
write/append/lines data1 rejoin [
" "
random/only {1234567890}
random/only {1234567890}
random/only {1234567890}
" "
random {eawk acef 32233}
]
]
]
if exists? data2 [delete data2] ; *** Deletes previous runs!
; Sorting...
t: now/time/precise
file-index: copy []
file: open/lines data1
forall file [
append file-index reduce [copy/part file/1 4 index? file]
]
close file
sort/skip file-index 2
file: open/lines data1
foreach [code line] file-index [
write/append/lines data2 file/:line
]
close file
print now/time/precise - t
--
Carl Read
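Carl's leave-the-file-on-disk idea amounts to sorting a small (key, line-number) index rather than the full lines. A Python sketch of the same two-pass scheme, using a temporary file and hypothetical sample data; a truly disk-bound version would seek per line instead of calling `readlines` in the second pass:

```python
import os
import tempfile

data = " 002 cc dd\n 001 a b\n 003 jj kk\n 001 g h\n"
src = tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt")
src.write(data)
src.close()

# Pass 1: build a small index of (4-char key, line number) pairs --
# much lighter than holding every full line decorated with its key.
with open(src.name) as f:
    index = [(line[:4], n) for n, line in enumerate(f)]
index.sort()  # sorts on the key; ties break on line number, so it's stable

# Pass 2: emit the lines in index order.
with open(src.name) as f:
    all_lines = f.readlines()
result = [all_lines[n].rstrip("\n") for _, n in index]
print(result)
os.unlink(src.name)
```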
[28/62] from: reffy:ulrich at: 6-Sep-2002 18:58
I apologize ..
Didn't mean to send that to the whole list !
Dick
[29/62] from: reffy:ulrich at: 6-Sep-2002 18:53
-- Attached file included as plaintext by Listar --
That sounds like a fast machine you have Andrew.
I would like for you to run a test if you would.
SOURCE CODE of what I am sending:
/* QSORT.C: This program reads the name of a file
* to be sorted, from the command line. The lines
* are read into memory then sorted and then written
* to stdout.
*
* Note: The sort is of the entire ASCII line and does
* not parse the line in any way.
*
* Sample:
* qsort sortme.txt > sorted.txt
*/
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
#define MAX_LINE_SIZE 1024
int compare( const void *arg1, const void *arg2 );
char line[MAX_LINE_SIZE];
char ** lines;
int lcnt = 0;
FILE * stream = NULL;
time_t start, finish;
double elapsed_time;
void CountTheLines(char *);
void ReadTheLines(char *);
void SortTheLines(void);
void OutputTheLines(void);
int main( int argc, char **argv )
{
if (argc != 2)
{
printf("Usage: %s nameoffiletosort\n", argv[0]);
exit(1);
}
time( &start );
CountTheLines(argv[1]);
ReadTheLines(argv[1]);
SortTheLines();
OutputTheLines();
time( &finish );
elapsed_time = difftime( finish, start );
printf( "\nSorted %d lines in %6.0f seconds.\n", lcnt, elapsed_time );
return 0;
}
void CountTheLines(char *fn)
{
// Count the lines in the file
stream = fopen(fn, "r");
if(stream == NULL)
{
printf( "Unable to open input file\n" );
exit(1);
}
while (fgets(line, MAX_LINE_SIZE, stream) != NULL)
lcnt++;
fclose(stream);
// Allocate a place to put the lines
lines = (char **)malloc(lcnt*sizeof(char *));
}
void ReadTheLines(char *fn)
{
// Read lines into memory
lcnt = 0;
stream = fopen(fn, "r");
if(stream == NULL)
{
printf( "Unable to open input file\n" );
exit(1);
}
while (fgets(line, MAX_LINE_SIZE, stream) != NULL)
lines[lcnt++] = strdup(line);
fclose(stream);
}
void SortTheLines()
{
// Sort the lines
qsort( (void *)lines, (size_t)lcnt, sizeof( char * ), compare );
}
void OutputTheLines()
{
int i;
// Output the sorted lines
for( i = 0; i < lcnt; ++i )
printf( "%s", lines[i] );
}
int compare( const void *arg1, const void *arg2 )
{
/* Compare all of both strings: */
return _stricmp( * ( char** ) arg1, * ( char** ) arg2 );
}
Download NeoPlanet at http://www.neoplanet.com
-- Binary/unsupported file stripped by Listar --
-- Type: Application/Octet-Stream
[30/62] from: carl:cybercraft at: 7-Sep-2002 12:53
On 07-Sep-02, Louis A. Turk wrote:
> Hi Everybody,
> You guys are great! I've been following this thread with amazement.
<<quoted lines omitted: 3>>
> useful for timing purposes (691KB):
> http://www.pusatberita.com/test.txt
On this very slow Amiga, my leave-file-on-the-disk method took 28
minutes with your data Louis. However, the vast majority of that
time was taken up with the final reading-lines-from-disk and
writing-lines-back-to-disk loop. Index creation took 75 seconds and
sorting 31 seconds. I assume the saving of the data could be sped up
by reading in blocks of lines (say 1000) before writing them out,
which would stop the switching from read to write for each line.
Others with more experience of REBOL's random-access of files
probably know a more sensible method for doing this than I used...
And a question for those others: Does...
file: open/lines file-name
load the whole file into memory anyway? As I didn't notice any
disk-reading while the index was being built. If so, it sort of
makes the whole point of attempting to work with the file on disk a
bit of a waste of effort.
PS. And it's quite fun and kind of hypnotic to scroll fast through
Louis's data after it's been sorted. Especially after you get past
the first few numbered lines. (:
--
Carl Read
[31/62] from: carl:cybercraft at: 7-Sep-2002 13:01
On 07-Sep-02, Joel Neely wrote:
> Since there are relatively few three-digit integers, I thought I'd
> try setting up a collection of blocks, so that block 1 will hold
> all lines beginning with 001, block 2 will hold all lines beginning
> with 002, etc.
Very, very nice Joel. In fact annoyingly so. (;
--
Carl Read
[32/62] from: carl:cybercraft at: 7-Sep-2002 18:42
On 07-Sep-02, [SunandaDH--aol--com] wrote:
> Joel:
>> Someone who's already been benchmarking might try comparing this
<<quoted lines omitted: 6>>
> 2.25 -- Joel's bridge sort
> You are today's lucky benchmark winner!!!
[snip]
> I'm kinda interested in how these perform on machines with far less
> memory than Joel's or mine.
Well, it was time for a cuppa anyway... (: So, on a slowww Amiga I
got...
33:33.21 -- Sunanda's parse in the Sort
02:26.75 -- Gregg's parse before the sort
01:17.20 -- Joel's bridge sort
And, out of interest, I converted my index-sort to work with the file
in memory, (since you kindly left out disk-access in the timing:),
and got a result of 01:48.26. Joel's method still wins though. (:
I've 32megs, but didn't have any memory problems.
Here's my code, with an updated verify tacked on the end. Note how I
saved the sorted block. Simpler than looping through it, and much
quicker, at least on this machine.
;; Carl -- index sort
;; -------------------
unset 'sorted-data
recycle
report-item/start "index sort"
file-index: array 2 * length? data
ptr: 1
foreach item data [
poke file-index ptr copy/part item 4
poke file-index ptr + 1 item
ptr: ptr + 2
]
sorted-data: extract next sort/skip file-index 2 2
report-item/end "index sort"
write/lines %sorted-data-index.txt sorted-data
unset 'sorted-data
recycle
;; Verify sort results
;; ===================
report-item/start "verifying results are the same"
parsed-file: read %sorted-data-parse.txt
pre-parsed-file: read %sorted-data-pre-parsed.txt
bridge-file: read %sorted-data-bridge.txt
index-file: read %sorted-data-index.txt
either all [parsed-file = pre-parsed-file
parsed-file = bridge-file
parsed-file = index-file
]
[print "got same results"]
[print "bad code in there somewhere"]
report-item/end "verifying results are the same"
print "done"
--
Carl Read
[33/62] from: carl:cybercraft at: 7-Sep-2002 22:14
On 07-Sep-02, [SunandaDH--aol--com] wrote:
> Your code makes fewer assumptions than Joel's -- yours is sorting on
> the first four characters of a line. That should be in the format
> "b999" (b=space 9=digit) but your code will work if it isn't
I made the assumption that it'd either be space + 3 digits or 3 digits
+ space - but I should've included a trim just to be sure in case
there was the occasional leading space missing. The important thing
though is that Louis has a good few routines to tweak now. He could
change the format of the code to anything he wished and easily modify
these to suit.
As to the degradation - I wouldn't know what causes that. I've
detected garbage collection with previous speed tests, but never a
general slowdown over time.
--
Carl Read
[34/62] from: g:santilli:tiscalinet:it at: 7-Sep-2002 12:41
Hi Carl,
On Saturday, September 7, 2002, 2:53:19 AM, you wrote:
CR> And a question for those others: Does...
CR> file: open/lines file-name
CR> load the whole file into memory anyway?
It does, if you don't use /DIRECT. (Actually, in theory it should
only load a portion of it, based on the amount of free memory
available; i.e. if it fits into memory, it loads it entirely.
However, it always loaded the whole file into memory for me.)
You can check it by using:
>> file: open/lines/read %AUTOEXEC.BAT
>> file/state/inBuffer
== ["SET BLASTER=A220 I7 D1 H5 P330 T6" "SET CTSYN=C:\WINDOWS" "C:\PROGRA~1\CREATIVE\SBLIVE\DOSDRV\SBEINIT.COM"
{mode con codepage...
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
[35/62] from: al:bri:xtra at: 7-Sep-2002 23:21
This script is a bit shorter:
Rebol []
Then: now/time
write/lines %Sorted.txt extract/index sort/skip map read/lines %test.txt
func [Line [string!]] [
trim Line
if not empty? Line [
reduce [
first load/next Line
Line
]
]
] 2 2 2
print now/time - Then
halt
Still no difference in time though, about 2 seconds.
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[36/62] from: al:bri:xtra at: 7-Sep-2002 23:25
With Joel's script, slightly modified:
buffer: []
t0: now/time/precise
foreach item read/lines %test.txt [
nr: to-integer copy/part next item 3
while [
nr > length? buffer
][
insert/only tail buffer copy []
]
append buffer/:nr item
]
t1: now/time/precise
print to-decimal t1 - t0
halt
Unfortunately, I get:
** Script Error: Invalid argument:
** Where: to-integer
** Near: to integer! :value
>>
I think that's because of the couple of empty lines at the end.
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[37/62] from: al:bri:xtra at: 7-Sep-2002 23:29
With Carl's script:
data1: %test.txt
data2: %Sorted.txt
t: now/time/precise
file-index: copy []
file: open/lines data1
forall file [
append file-index reduce [copy/part file/1 4 index? file]
]
close file
sort/skip file-index 2
file: open/lines data1
foreach [code line] file-index [
write/append/lines data2 file/:line
]
close file
print now/time/precise - t
halt
I get this:
0:00:10.125
proving Carl's right:
> I expect it's slower than loading the whole file into memory, but might be
> better for super-huge files...
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[38/62] from: al:bri:xtra at: 7-Sep-2002 23:35
Unfortunately, the binary has been stripped out by the list software.
I suspect the C program will be slower than my Rebol script, because the C
program reads the file twice, where my Rebol script only reads the file
once. Disk I/O is usually the slowest part of any algorithm.
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[39/62] from: reffy:ulrich at: 7-Sep-2002 7:28
Just for your information,
I quadrupled that file (to be sorted) until there were 118,751 lines.
The little C program I sent sorted it in less than 1 second.
Dick
[40/62] from: reffy:ulrich at: 7-Sep-2002 7:31
Just for your information,
Sorted 237,504 lines in 2 seconds.
Dick
[41/62] from: carl:cybercraft at: 8-Sep-2002 1:02
On 07-Sep-02, Gabriele Santilli wrote:
> Hi Carl,
> On Saturday, September 7, 2002, 2:53:19 AM, you wrote:
<<quoted lines omitted: 5>>
> available; i.e. if it fits into memory, it loads it entirely.
> However, it always loaded the whole file into memory for me.)
Ah, thanks Gabriele. I just should've read the refinements' list a
bit better.
> You can check it by using:
>>> file: open/lines/read %AUTOEXEC.BAT
Oh no I can't. (;
--
Carl Read
[42/62] from: rotenca:telvia:it at: 7-Sep-2002 14:57
Hi, Sunanda
> It shows some interesting deterioration for long-running Rebol consoles:
because memory use increases more and more.
> But look at the way my sort deteriorates!! It is possible that my code gets
> slugged by an internal garbage-collection, and if we ran it enough times any
> of the benchmarks could be hit by that.
The best thing would be to run each test's code on its own.
At the end you will find my two tests. I need raw data, so I must re-read the
test file.
They work well under the new beta (1.2.5); in my tests the beta is twice as
fast on this sort of code.
Here are my tests, where you can see the result of system/stats and the
allocated memory increase:
Rebol 1.2.5.3.1 - Celeron 333 RAM 128 MB Windows 98 first edition
reading data : 0:00:00.11 memory allocated: 6465424
Scott : 0:00:02.74 memory allocated: 10714448
Carl : 0:00:02.58 memory allocated: 12752096
Romano : 0:00:01.54 memory allocated: 15060400
Romano-int : 0:00:01.7 memory allocated: 17671880
Joel : 0:00:01.49 memory allocated: 22394888
got same results
done
In other tests I noticed that Joel's code uses 3 MB more memory than the others.
-----------code-----------
;; Romano -- parse sort
;; -------------------
data: read %../public/www.pusatberita.com/test.txt
report-item/start "Romano"
sorted-data: []
parse/all data [
some [
copy num [any " " to " "]
copy rest thru newline
(insert insert tail sorted-data num rest)
]
]
sorted-data: sort/skip sorted-data 2
report-item/end "Romano"
write %sorted-data-romano.txt sorted-data
unset 'sorted-data
recycle
;; Romano -- parse sort int
;; -------------------
data: read %../public/www.pusatberita.com/test.txt
report-item/start "Romano-int"
blk: []
parse/all data [
some [
h: any " " copy num integer!
:h copy rest thru newline
(insert insert tail blk to integer! num rest)
]
]
blk: sort/skip blk 2
report-item/end "Romano-int"
sorted-data: copy ""
foreach [x x2] blk [insert tail sorted-data x2]
write %sorted-data-romano-int.txt sorted-data
unset 'sorted-data
recycle
; for the next tests:
data: read/lines %../public/www.pusatberita.com/test.txt
-----end of code ----
---
Ciao
Romano
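[Editor's note] Romano's approach above is a decorate-sort-undecorate pattern: parse each line into a [key line] pair, sort the pairs with sort/skip, then keep only the lines. A hedged Python sketch of the same idea (the sample lines are invented):

```python
# Decorate-sort-undecorate: the idea behind sorting [key line] pairs
# with REBOL's sort/skip.
lines = [" 313 kai o", " 096 ei de", " 175 oi de"]

# Decorate: attach the numeric key to each line.
pairs = [(int(line.split()[0]), line) for line in lines]

# Sort on the key only; Python's sort is stable, so equal keys keep
# their original relative order (as sort/skip does in REBOL).
pairs.sort(key=lambda p: p[0])

# Undecorate: keep just the lines.
sorted_lines = [line for _, line in pairs]
```

The decoration step is the only place the line format matters, so changing the key (e.g. to sort descending) touches one expression.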
[43/62] from: gscottjones:mchsi at: 7-Sep-2002 10:09
Hi, all,
Interesting thread. Internet access was down for 16 hours, so I missed out
on some of the fun. :-(
I do want to clarify one point in my email of yesterday. I never wished to
imply that Sunanda's initial code was "slow". I was unsure why Louis' run
took so long. I was trying to recreate the condition so that I could
compare, but I gave up after 4 hours because I assumed *I miscoded*
Sunanda's example, figuring that I had sent the code off into the weeds.
Gregg's response clearly indicated that I *had* done something wrong. Ten
thousand apologies to Sunanda. I should have been more careful in my
writing.
--Scott Jones
[44/62] from: joel:neely:fedex at: 7-Sep-2002 10:05
Hi, Romano,
Romano Paolo Tenca wrote:
> Here are my tests where you can see the result of system/stats
> and the allocated memory increase:
<<quoted lines omitted: 9>>
> In other tests I noticed that Joel's code uses 3 MB more memory
> than the others.
No surprise there, as the "bucket-grouping" approach creates a
second structure containing the same strings, rather than
rearranging the strings within a single block. That's a fairly
common space-for-time tradeoff. This reminds me of the sign in
a programming office...
Computer software developed here:
Good, Fast, Cheap
(pick any two!)
-jn-
--
; Joel Neely joeldotneelyatfedexdotcom
REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip
do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] {
| e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]
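[Editor's note] The bucket-grouping approach Joel refers to can be sketched in Python (the 3-digit key width and the sample lines are assumptions carried over from the thread):

```python
# Bucket grouping: one bucket per possible 3-digit prefix, no comparison
# sort needed. The buckets form a second structure holding the same
# strings -- the space-for-time tradeoff mentioned above.
lines = [" 313 kai o", " 175 oi de", " 313 kai touj", " 096 ei de"]

buckets = [[] for _ in range(1000)]   # keys are 000..999
for line in lines:
    nr = int(line[1:4])               # skip the leading space, take 3 digits
    buckets[nr].append(line)          # appending preserves original tie order

# Read the buckets out in key order to get the grouped result.
grouped = [line for bucket in buckets for line in bucket]
```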
[45/62] from: joel:neely:fedex at: 7-Sep-2002 10:00
Hi, Sunanda,
[SunandaDH--aol--com] wrote:
> Though I'm not sure I'd recommend your method to Louis unless
> he's absolutely certain that that first field is:
<<quoted lines omitted: 8>>
> only worth it if the performance gain is utterly unliveable
> without.
Completely valid concerns; I thought the data bore a strong
resemblance to the desired output described in another thread
about parsing words out of a long chunk of text, so I assumed
that Louis was actually creating the original "unsorted" data
himself. Clearly the production use of an approach like the
one I posted would require that the interface and assumptions
be precisely documented, so that any change upstream could be
checked against the requirements of the downstream code.
-jn-
--
; Joel Neely joeldotneelyatfedexdotcom
REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip
do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] {
| e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]
[46/62] from: g:santilli:tiscalinet:it at: 7-Sep-2002 21:15
Hi Carl,
On Saturday, September 7, 2002, 3:02:53 PM, you wrote:
CR> Oh no I can't. (;
You know, I didn't mean you needed to use THAT file to check it.
;-)
But well, %/S/Startup-Sequence is longer so it will be a better
test case. ;-)
Regards,
Gabriele.
--
Gabriele Santilli <[g--santilli--tiscalinet--it]> -- REBOL Programmer
Amigan -- AGI L'Aquila -- REB: http://web.tiscali.it/rebol/index.r
[47/62] from: al:bri:xtra at: 8-Sep-2002 9:39
My latest version is:
Rebol []
Then: now/time/precise
write/lines %Sorted.txt sort read/lines %test.txt
print now/time/precise - Then
halt
This takes: 0:00:00.321 (it does vary a bit, +/- a small number.)
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
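[Editor's note] Sorting the raw lines works here only because the numeric prefixes are zero-padded to a fixed width, so lexicographic order on the whole line agrees with numeric order on the key; the tradeoff is that ties are now broken by the rest of the line instead of by original position. A small Python illustration (sample lines invented):

```python
# Zero-padded fixed-width keys: lexicographic order of whole lines
# matches numeric order of the keys.
lines = ["096 proj ton", "454 en tw", "103 ek twn", "096 ei de"]

by_line = sorted(lines)                           # sort whole lines
by_key = sorted(lines, key=lambda l: int(l[:3]))  # stable sort on key only

# The key sequences agree...
assert [l[:3] for l in by_line] == [l[:3] for l in by_key]
# ...but ties are ordered differently: by_line orders "096 ei de" before
# "096 proj ton", while the stable by_key keeps the original tie order.
```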
[48/62] from: al:bri:xtra at: 8-Sep-2002 10:26
My latest version is:
Rebol []
Then: now/time/precise
write/lines %Sorted.txt sort read/lines %Big.txt
print now/time/precise - Then
halt
With a test file that's 20 times as large (13.4MB), I get a time of
0:00:14.25. With Dick's C program, it's 6 seconds.
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[49/62] from: rotenca:telvia:it at: 8-Sep-2002 2:24
Hi Andrew,
> With a test file that's 20 times as large (13.4MB), I get a time of
> 0:00:14.25. With Dick's C program, it's 6 seconds.
good, but that way you sort on the whole line.
Your previous code used a map function which was not included.
---
Ciao
Romano
[50/62] from: rotenca:telvia:it at: 8-Sep-2002 2:15
Hi Sunanda,
> For this and other reasons, I have some doubts about Rebol's ability to run
> 24x7 while pounding loads of data. Any really big designs out there should
> look closely at this issue.
I am not sure. For example, at least under 1.2.5.3.1, my two tests, after some
oscillations, do not change memory allocation (and without any recycle
command).
Your test is rock-stable in memory allocation without recycle. I did not
try other tests.
Often (I think always) continuous memory increases are only the result of a bug
in user code (blocks and strings are the first to check).
Finally, in my tests I did not see any variation in the timing of your test,
and so far we are not sure that memory allocation continues to grow when we
repeat all the tests.
> Thanks for your timings. I get (on my machine) a run time of about 6.5
> seconds. 25% faster than Gregg's but still some way to go to beat Scott and
> Joel. It's that parse again. You only do one compared to Gregg's 750,000 (or
> so). I guess it's not the number of invocations: it's the total data that
> has to be parsed that counts.
I suspect (I'm sure :-) you are using a non-beta Rebol release. I have similar
results under 1.2.3.1.3.
Automatic Rebol memory allocation up to 1.2.3 is buggy; to limit problems,
blocks need to be preallocated with make block!.
This is another round of tests under 1.2.5.3.1, this time also with your code,
which I forgot last time:
1.2.5.3.1 Celeron 333 RAM 128 MB Windows 98 first edition
reading data : 0:00:00.22 memory allocated: 6465424
Sunanda : 0:00:45.92 memory allocated: 9435920
Scott : 0:00:02.8 memory allocated: 13418112
Carl : 0:00:02.09 memory allocated: 13586848
Romano : 0:00:01.37 memory allocated: 15895152
Romano-int : 0:00:01.43 memory allocated: 18477920
Joel : 0:00:01.54 memory allocated: 23200928
got same results
done
And this is a test under 1.2.3.1 with my code optimized for this version (code
follows):
1.2.1.3.1 Celeron 333 RAM 128 MB Windows 98 first edition
reading data : 0:00:00.22 memory allocated: 6071976
Sunanda : 0:00:59.1 memory allocated: 8868200
Scott : 0:00:11.26 memory allocated: 12222608
Carl : 0:00:03.84 memory allocated: 12563472
Romano : 0:00:01.37 memory allocated: 14783712
Romano-int : 0:00:01.38 memory allocated: 17348192
Joel : 0:00:03.85 memory allocated: 18710144
got same results
done
Here I'm faster than Joel. But Joel's code can also be optimized.
-----------code-----------
;; Romano -- parse sort for 1.2.1.3.1
;; -------------------
data: read %../public/www.pusatberita.com/test.txt
report-item/start "Romano"
sorted-data: make block! 80000 ;changed for 1.2.1.3.1
parse/all data [
some [
copy num [any " " to " "]
copy rest thru newline
(insert insert tail sorted-data num rest)
]
]
sort/skip sorted-data 2
report-item/end "Romano"
write %sorted-data-romano.txt sorted-data
unset 'sorted-data
recycle
;; Romano -- parse sort int for 1.2.1.3.1
;; -------------------
data: read %../public/www.pusatberita.com/test.txt
report-item/start "Romano-int"
blk: make block! 80000 ;changed for 1.2.1.3.1
parse/all data [
some [
h: any " " copy num integer!
:h copy rest thru newline
(insert insert tail blk to integer! num rest)
]
]
sort/skip blk 2
report-item/end "Romano-int"
sorted-data: copy ""
foreach [x x2] blk [insert tail sorted-data x2]
write %sorted-data-romano-int.txt sorted-data
unset 'sorted-data
recycle
; for the next tests:
data: read/lines %../public/www.pusatberita.com/test.txt
-----end of code ----
---
Ciao
Romano
[51/62] from: rotenca:telvia:it at: 8-Sep-2002 2:45
Hi Sunanda,
> 1. Parse is slow -- take it out of the critical path (as Gregg and Romano
do)
> or eliminate it (as Scott and Joel do).
> 2. It's not the number of parse's (Greg has one per line; Romano has one for
> the whole dataset) so much as the volume of data to be parsed
I do not agree. The greatest problems are in memory allocation. You can see my
other message to verify the results of code optimized for memory allocation
under 1.2.1.3.1.
I think parse is one of the fastest constructs in Rebol.
> It's been fun!
I agree.
> Sunanda.
---
Ciao
Romano
[52/62] from: al:bri:xtra at: 8-Sep-2002 13:59
Romano wrote:
> good, but that way you sort on the whole line.
Well, it was so much faster that way... :)
> Your previous code used a map funtion which was not included.
Here it is:
Rebol [
Name: 'Map
Title: "Map"
File: %"Map.r"
Author: "Andrew Martin"
eMail: [Al--Bri--xtra--co--nz]
Web: http://valley.150m.com
Date: 26/August/2002
Version: 1.1.0
Purpose: {Maps or applies the function to all elements of the series.}
Category: [util 1]
Acknowledgements: [
"Joel Neely"
"Ladislav"
]
Example: [
Map func [n [number!]] [n * n] [1 2 3]
;== [1 4 9]
Map [1 2 3] func [n [number!]] [n * n]
;== [1 4 9]
Map [1 2 3 4 5 6] func [a] [print [a]]
;1
;2
;3
;4
;5
;6
;== []
Map [1 2 3 4 5 6] func [a b] [print [a b]]
;1 2
;3 4
;5 6
;== []
Map [1 2 3 4 5 6] func [a b c] [print [a b c]]
;1 2 3
;4 5 6
;== []
]
Requires: %Arguments.r
]
Map: function [
{Maps or applies the function to all elements of the series.}
Arg1 [any-function! series!]
Arg2 [any-function! series!]
/Only "Inserts the result of the function as a series."
][
Result Results Function Series
][
any [
all [
any-function? :Arg1 series? :Arg2
(Function: :Arg1 Series: :Arg2)
]
all [
any-function? :Arg2 series? :Arg1
(Function: :Arg2 Series: :Arg1)
]
throw make error! reduce [
'script 'cannot-use rejoin [
{"} mold 'Map " " mold type? :Arg1 {"}
]
rejoin [
{"} mold type? :Arg2 {"}
]
]
]
Results: make Series length? Series
do compose/deep [
foreach [(Arguments :Function)] Series [
if all [
not unset? set/any 'Result Function (Arguments :Function)
not none? Result
] [
either only [
insert/only tail Results :Result
][
insert tail Results :Result
]
]
]
]
Results
]
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[53/62] from: joel:neely:fedex at: 7-Sep-2002 22:06
Hi, Romano, et alia,
Romano Paolo Tenca wrote:
> > 1. Parse is slow -- take it out of the critical path (as Gregg
> > and Romano do) or eliminate it (as Scott and Joel do).
> > 2. It's not the number of parse's (Greg has one per line; Romano
> > has one for the whole dataset) so much as the volume of data to
> > be parsed
>
> I do not agree. The greatest problems are in memory allocation.
>
...
> I think parse is one of the fastest constructs in Rebol.
>
Certainly memory management (especially when constructing blocks
in interpreted expressions as I did) becomes a very significant
component of the total run time as the size/number of blocks
continues to increase.
As for the rest, I must disagree with *both* you and Sunanda.
(Some trick, eh? ;-) I don't attach meaning to absolute phrases
such as "PARSE is slow" or "PARSE is fast", but instead think in
terms such as "PARSE is slower/faster than _____ for doing ____".
For example, if I replace the phrase
to-integer copy/part next item 3
in my original grouping algorithm with
to-integer first parse item none
two things happen:
1) The issues of leading whitespace and exact number of digits
in the numeric-looking prefix now go away.
2) The processing slows down by about 20% (takes about 20% longer
to run).
So, *in*this*case*, PARSE is slower than explicit copying for doing
the job of pulling out a fixed-length/position prefix. Generality
almost always has a price, and performance optimization almost
always increases brittleness.
I'd prefer to rephrase Sunanda's points as:
1) Move as much processing as possible out of inner loops.
2) Process as little data as possible to get the work done.
regardless of how the processing is invoked. In the second point,
doing a whitespace-driven parse of the entire line (or, worse, of
the entire file) is overkill, since we don't need the rest of the
line parsed/broken down for the task as specified.
Reminding myself to take a clue from the first point, we can use
the implicit looping of PARSE, and the fact that the prefixes are
only three digits long to come up with something like this:
8<--------
t0: now/time/precise
buffer: copy/deep array/initial 999 [[]]
digit: charset "0123456789"
parse/all read f [
some [
some " "
copy nr some digit
copy data [thru "^/" | to end] (
append pick buffer to-integer nr data
)
]
]
t1: now/time/precise
print to-decimal t1 - t0
8<--------
which, by my tests, actually runs a bit (~7%) faster than my
first version (which extended the buffer within the inner loop
whenever needed). Feeding that same idea back into the first
version produces something like this:
8<--------
t0: now/time/precise
buffer: copy/deep array/initial 999 [[]]
foreach item read/lines f [
nr: to-integer copy/part next item 3
append buffer/:nr item
]
t1: now/time/precise
print to-decimal t1 - t0
8<--------
which runs about 9% faster than the original.
-jn-
--
; Joel Neely joeldotneelyatfedexdotcom
REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip
do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] {
| e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]
[54/62] from: carl:cybercraft at: 8-Sep-2002 17:09
On 08-Sep-02, Andrew Martin wrote:
> Romano wrote:
>> good, but that way you sort on the whole line.
> Well, it was so much faster that way... :)
But they were in a very pretty order Andrew, which that would've
stuffed up. (;
From near the end of the sorted list...
032 eij ierosoluma
032 eij ton oikon
032 basileian tou
032 oi farisaioi
032 oi arciereij
032 tou swmatoj
032 en tw ierw
032 tw ierw
032 de autw
033 kai apokriqeij
033 eij ierousalhm
033 touj maqhtaj
033 twn agiwn
033 th kardia
033 pasi toij
033 tw ihsou
033 en toutw
033 auton en
033 ex umwn
033 ean tij
033 oti ou
034 apokriqeij de
034 tw pneumati
034 pantaj touj
034 kaqwj kai
034 ton laon
034 met emou
034 oti egw
034 dia thn
034 kai ai
034 oun o
See? Ascending length per number.
--
Carl Read
[55/62] from: al:bri:xtra at: 8-Sep-2002 18:50
Carl Read wrote:
> But they were in a very pretty order Andrew, which that would've stuffed
up. (;
> See? Ascending length per number.
I wonder if that's important to Louis? :)
Has anyone worked out what language the words are?
Andrew Martin
ICQ: 26227169 http://valley.150m.com/
[56/62] from: joel:neely:fedex at: 8-Sep-2002 6:47
Hi, all,
Andrew Martin wrote:
> Carl Read wrote:
> > See? Ascending length per number.
>
Would you believe "descending"? !!!
> I wonder if that's important to Louis? :)
>
> Has anyone worked out what language the words are?
>
It's Greek to me... ;-)
-jn-
(seriously)
--
; Joel Neely joeldotneelyatfedexdotcom
REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip
do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] {
| e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]
[57/62] from: carl:cybercraft at: 9-Sep-2002 0:53
On 08-Sep-02, Joel Neely wrote:
> Hi, all,
> Andrew Martin wrote:
<<quoted lines omitted: 3>>
>>
> Would you believe "descending"? !!!
One ascends from the bottom up Joel - and as you go up each group of
numbers, the lengths of the strings get longer... (;
--
Carl Read
[58/62] from: joel:neely:fedex at: 8-Sep-2002 7:35
Note to self: be more complete!
I should have pointed out that the modified algorithm below splits
the group number from the remainder of the line; therefore, some
post-processing might be needed to re-attach the three-digit prefix
to the data (unless, of course, the subsequent processing could use
the bucket index for that purpose). So the time comparison may not
be completely fair.
-jn-
Joel Neely wrote:
> Reminding myself to take a clue from the first point, we can use
> the implicit looping of PARSE, and the fact that the prefixes are
<<quoted lines omitted: 18>>
> first version (which extended the buffer within the inner loop
> whenever needed)...
--
; Joel Neely joeldotneelyatfedexdotcom
REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip
do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] {
| e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]
[59/62] from: joel:neely:fedex at: 8-Sep-2002 7:29
Hi, Sunanda,
Very nice summary!
May I pick one small nit and reminisce for a moment?
[SunandaDH--aol--com] wrote:
> ... thread has developed into two parallel tracks, with
> a third on the way.
>
> TRACK 1
> Has several of us looking at increasingly faster ways to
> pre-process the lines to get stable sort keys.
>
I should have been more clear about the goal of my submission.
I meant to imply (but I didn't state well) that we need to be
careful about the force of habit, as it is VERY powerful. We
all jumped to the conclusion, based on Louis's first question,
that we needed to sort the data.
We've even found a way to call the solution I proposed a kind
of sort.
Actually, I began thinking about the fact that most of the
order that Louis requested was already present (i.e., the lines
with identical prefixes were already in the desired order,
relative to each other), which led me simply to group the data
by the one field that had actionable differences, the prefix.
Ergo, a solution that actually doesn't involve sorting at all.
Now for the reminiscence.
I didn't remember it (consciously... ;-) until writing this
particular email, but there's a really lovely book titled
_Programming_Pearls_ by Jon Bentley of AT&T Bell Labs (the
copy in front of me is the first edition by Addison Wesley,
(c) 1986, but I believe there's a newer version out). The
book is a collection of eponymous columns Bentley wrote for
CACM.
The very first column, titled "Cracking the Oyster", opens
as follows:
The programmer's question was simple: "How do I sort
on disk?" Before I tell you how I made my first
mistake, let me give you a chance to do better than
I did. What would you have said?
DON'T SCROLL DOWN
UNTIL
YOU
HAVE
THOUGHT
ABOUT
YOUR
ANSWER
TO
BENTLEY'S
QUESTION
,
PLEASE
!
He then continues:
My mistake was to answer his question. I gave him a
thumbnail sketch on how to sort on disk. My suggestion
that he dig into Knuth's classic _Sorting_and_Searching_
met with less than enthusiasm -- he was more concerned
about solving the problem than with furthering his
education. I then told him about the disk sorting
program in Chapter 4 of Kernighan and Plauger's
_Software_Tools_. Their program consists of about two
hundred lines of Ratfor code in twelve procedures;
translating that into several hundred lines of FORTRAN
and testing the code would have taken about a week.
(Nothing like dating yourself, right? ;-)
Bentley goes on to describe how he (fortunately) kept asking
questions until the *REAL* nature of the problem became clear,
as summarized below:
- The system was used for re-organizing political districts;
the data were the precinct numbers that made up a district.
- Each value was an integer (a precinct number).
- It was illegal for a value to appear more than once.
- There would be up to 27,000 values (in the range 1..27,000).
- The system only had 1K-2K sixteen-bit words of free memory
at that point. (I told you I was dating myself! ;-)
Given this restatement of the problem, *now* how would you
advise the programmer? (other than offering him a ride in your
time machine, of course! ;-)
-jn-
--
; Joel Neely joeldotneelyatfedexdotcom
REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip
do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] {
| e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]
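[Editor's note] For the record, the answer Bentley eventually arrives at in that column is a bit-vector sort: since the values are distinct integers in a known range, one bit per possible value suffices, and reading the set bits back in index order emits them sorted. A hedged Python sketch (the sample precinct numbers are invented):

```python
# Bit-vector sort: 27,000 possible values need only 27,000 bits,
# roughly 1.7K sixteen-bit words -- it fits the tiny memory budget.
N = 27000
bits = bytearray((N + 7) // 8)        # the bit vector, all zeros initially

def mark(v):
    """Set the bit for precinct number v (1-based, distinct by assumption)."""
    bits[(v - 1) // 8] |= 1 << ((v - 1) % 8)

for v in [204, 7, 26900, 3]:          # each value appears at most once
    mark(v)

# Scan the bits in index order: the set ones come out sorted.
sorted_values = [v for v in range(1, N + 1)
                 if bits[(v - 1) // 8] & (1 << ((v - 1) % 8))]
```

The distinctness constraint is what makes this work: one bit per value is enough only because no value needs a count.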
[60/62] from: reffy:ulrich at: 8-Sep-2002 8:53
I am thinking Rebol for the most part is very well put together internally.
Interpreters are notoriously slow for some things and fast for others.
In general, the avoidance of dynamic memory allocation (i.e. using C's malloc and such)
is a big winner.
I took the quick hack and, with a half million lines, a half million mallocs
occur and a million lines are read. Yep, I actually count them first. :-)
Yes, I could rewrite the little qsort.c such that it didn't traverse the lines twice,
once to count.
I could use 2 mallocs and one I/O effort to read the file into memory in one big chunk,
and yes there would be a good improvement.
But we all know that this sort requires little intelligence to make it happen. So the
gain is really not there, unless it is to be done with multiple files (merge and sort)
and done very often, in which case writing a second version of qsort.c might be
a big winner for anyone who might want to call it from Rebol.
As is often the case, the clever solution is the big winner, like realizing that
a parse is not needed in this particular case.
Dick
[61/62] from: reffy:ulrich at: 8-Sep-2002 9:14
The zero-left-filled number width is fixed, is it not? i.e. 3 digits.
[62/62] from: joel:neely:fedex at: 8-Sep-2002 19:34
YMMV, but I typically traverse indexed data structures in order of
increasing indices; as the index goes up, the length of the data
string goes down.
-jn-
Carl Read wrote:
> On 08-Sep-02, Joel Neely wrote:
> > Hi, all,
<<quoted lines omitted: 12>>
--
; Joel Neely joeldotneelyatfedexdotcom
REBOL [] do [ do func [s] [ foreach [a b] s [prin b] ] sort/skip
do function [s] [t] [ t: "" foreach [a b] s [repend t [b a]] t ] {
| e s m!zauafBpcvekexEohthjJakwLrngohOqrlryRnsctdtiub} 2 ]
Notes
- Quoted lines have been omitted from some messages.
View the message alone to see the lines that have been omitted