r3wp [groups: 83 posts: 189283]
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

World: r3wp

[!REBOL3]

Gregg
17-Jul-2011
[9190x15]
Found a small test suite.
test: func [block] [
	print [mold/only :block newline tab mold do block]
]	

test [split "1234567812345678" 4]
;== ["1234" "5678" "1234" "5678"]

test [split "1234567812345678" 3]
;== ["123" "456" "781" "234" "567" "8"]
test [split "1234567812345678" 5]
;== ["12345" "67812" "34567" "8"]

test [split/into [1 2 3 4 5 6] 2]
;== [[1 2 3] [4 5 6]]
test [split/into "1234567812345678" 2]
;== ["12345678" "12345678"]

test [split/into "1234567812345678" 3]
;== ["12345" "67812" "345678"]
test [split/into "1234567812345678" 5]
;== ["123" "456" "781" "234" "5678"]

test [split [1 2 3 4 5 6] [2 1 3]]
;== [[1 2] [3] [4 5 6]]
test [split "1234567812345678" [4 4 2 2 1 1 1 1]]
;== ["1234" "5678" "12" "34" "5" "6" "7" "8"]
test [split first [(1 2 3 4 5 6 7 8 9)] 3]
;== [(1 2 3) (4 5 6) (7 8 9)]
test [split #{0102030405060708090A} [4 3 1 2]]
;== [#{01020304} #{050607} #{08} #{090A}]

test [split [1 2 3 4 5 6] [2 1]]
;== [[1 2] [3]]

test [split [1 2 3 4 5 6] [2 1 3 5]]
;== [[1 2] [3] [4 5 6] []]

test [split [1 2 3 4 5 6] [2 1 6]]
;== [[1 2] [3] [4 5 6]]

test [split [1 2 3 4 5 6] [3 2 2 -1 -4 3 -2]]
;== [[1 2 3] [4 5] [6] [6] [2 3 4 5] [2 3 4] [3 4]]	

test [split "abc,de,fghi,jk" #","]
;== ["abc" "de" "fghi" "jk"]
test [split "abc<br>de<br>fghi<br>jk" <br>]
;== ["abc" "de" "fghi" "jk"]

test [split "abc|de/fghi:jk" charset "|/:"]
;== ["abc" "de" "fghi" "jk"]

test [split "abc^M^Jde^Mfghi^Jjk" [crlf | #"^M" | newline]]
;== ["abc" "de" "fghi" "jk"]
test [split "abc     de fghi  jk" [some #" "]]
;== ["abc" "de" "fghi" "jk"]
The original was written before MAP-EACH and the new COLLECT. Here 
is the source I have, updated to use those as the current version 
does, but with the last rule reverted to the original.

Related cc reports: 
    http://issue.cc/r3/1096
    http://issue.cc/r3/690

split: func [

    "Split a series into pieces; fixed or variable size, fixed number, 
    or at delimiters"
    series	[series!] "The series to split"

    dlm		[block! integer! char! bitset! any-string!] "Split size, delimiter(s), 
    or rule(s)." 

    /into	"If dlm is an integer, split into n pieces, rather than pieces 
    of length n."
    /local size count mk1 mk2
][
    either all [block? dlm  parse dlm [some integer!]] [
            map-each len dlm [
                either positive? len [
                    copy/part series series: skip series len
                ] [
                    series: skip series negate len

                    ; return unset so that nothing is added to output
                    ()
                ]
            ]
    ][
        size: dlm   ; alias for readability
        collect [
            parse/all series case [
                all [integer? size into] [

                    if size < 1 [cause-error 'Script 'invalid-arg size]
                    count: size - 1
                    size: round/down divide length? series size
                    [

                        count [copy series size skip (keep/only series)]
                        copy series to end (keep/only series)
                    ]
                ]
                integer? dlm [

                    if size < 1 [cause-error 'Script 'invalid-arg size]

                    [any [copy series 1 size skip (keep/only series)]]
                ]

                'else [ ; = any [bitset? dlm  any-string? dlm  char? dlm]

                    [any [mk1: some [mk2: dlm break | skip] (keep copy/part mk1 mk2)]]
                ]
            ]
        ] 
    ]
]
>> split "a.b.c" "."
== ["a" "b" "c"]

>> split "c c" " "
== ["c" "c"]

>> split "1," " "
== ["1,"]

>> split "1,2" " "
== ["1,2"]

>> split "c,c" ","
== ["c" "c"]

>> split/into "" 1
== [""]

>> split/into "" 2
== ["" ""]

>>  split "This! is a. test? to see " charset "!?."
== ["This" " is a" " test" " to see "]
The test case that fails with this is where the delimiter is the 
last char. You don't get an empty field at the end. 

>> split "1,2,3," ","
== ["1" "2" "3"]
I found some notes that at one point to/thru broke for block and 
bitset targets.
ROUND not returning an integer broke some things too. i.e. currently 
broken.
Another bug has crept in somewhere along the way:

    series: skip series negate len


The NEGATE messes up the skip of the already negative value, which 
breaks cases like this:

    >> split [1 2 3 4 5 6] [3 2 2 -2 2 -4 3]
    == [[1 2 3] [4 5] [6] [5 6] [3 4 5]]
Updated SPLIT. Please test, add tests cases, comment, and critique. 
If you look at the special processing section, and are offended by 
it, feel free to impove it. Well, feel free to improve any of it.


I haven't checked the doc page to see if all the examples work as 
they are doc'd, which also needs to be done.
split: func [

    "Split a series into pieces; fixed or variable size, fixed number, 
    or at delimiters"
    series	[series!] "The series to split"

    dlm		[block! integer! char! bitset! any-string!] "Split size, delimiter(s), 
    or rule(s)." 

    /into	"If dlm is an integer, split into n pieces, rather than pieces 
    of length n."

    /local size piece-size count mk1 mk2 res fill-val add-fill-val
][
    either all [block? dlm  parse dlm [some integer!]] [
        map-each len dlm [
            either positive? len [
                copy/part series series: skip series len
            ] [
                series: skip series len
                ; return unset so that nothing is added to output
                ()
            ]
        ]
    ][
        size: dlm   ; alias for readability
        res: collect [
            parse/all series case [
                all [integer? size  into] [

                    if size < 1 [cause-error 'Script 'invalid-arg size]
                    count: size - 1

                    piece-size: to integer! round/down divide length? series size
                    if zero? piece-size [piece-size: 1]
                    [

                        count [copy series piece-size skip (keep/only series)]
                        copy series to end (keep/only series)
                    ]
                ]
                integer? dlm [

                    if size < 1 [cause-error 'Script 'invalid-arg size]

                    [any [copy series 1 size skip (keep/only series)]]
                ]

                'else [ ; = any [bitset? dlm  any-string? dlm  char? dlm]

                    [any [mk1: some [mk2: dlm break | skip] (keep/only copy/part mk1 
                    mk2)]]
                ]
            ]
        ]

        ;-- Special processing, to handle cases where the spec'd more items 
        in

        ;   /into than the series contains (so we want to append empty items),

        ;   or where the dlm was a char/string/charset and it was the last 
        char

        ;   (so we want to append an empty field that the above rule misses).
        fill-val: does [copy either any-block? series [[]] [""]]
        add-fill-val: does [append/only res fill-val]
        case [
            all [integer? size  into] [

                ; If the result is too short, i.e., less items than 'size, add
                ; empty items to fill it to 'size.

                ; We loop here, because insert/dup doesn't copy the value inserted.
                if size > length? res [
                    loop (size - length? res) [add-fill-val]
                ]
            ]
            ; integer? dlm [
            ; ]

            'else [ ; = any [bitset? dlm  any-string? dlm  char? dlm]

                ; If the last thing in the series is a delimiter, there is an

                ; implied empty field after it, which we add here.
                case [
                    bitset? dlm [

                        ; ATTEMPT is here because LAST will return NONE for an 

                        ; empty series, and finding none in a bitest is not allowed.

                        if attempt [find dlm last series] [add-fill-val]
                    ]
                    char? dlm [
                        if dlm = last series [add-fill-val]
                    ]
                    string? dlm [
                        if all [
                            find series dlm
                            empty? find/last/tail series dlm
                        ] [add-fill-val]
                    ]
                ]
            ]
        ]
                
        res
    ]
]
test: func [block expected-result /local res] [
    if error? try [
        print [mold/only :block newline tab mold res: do block]

        if res <> expected-result [print [tab 'FAILED! tab 'expected mold 
        expected-result]]
    ][
        print [mold/only :block newline tab "ERROR!"]
    ]
]
test [split "1234567812345678" 4]  ["1234" "5678" "1234" "5678"]


test [split "1234567812345678" 3]  ["123" "456" "781" "234" "567" 
"8"]
test [split "1234567812345678" 5]  ["12345" "67812" "34567" "8"]

test [split/into [1 2 3 4 5 6] 2]       [[1 2 3] [4 5 6]]
test [split/into "1234567812345678" 2]  ["12345678" "12345678"]

test [split/into "1234567812345678" 3]  ["12345" "67812" "345678"]

test [split/into "1234567812345678" 5]  ["123" "456" "781" "234" 
"5678"]

test [split/into "123" 6]       ["1" "2" "3" "" "" ""]
test [split/into [1 2 3] 6]     [[1] [2] [3] [] [] []] 



test [split [1 2 3 4 5 6] [2 1 3]]                  [[1 2] [3] [4 
5 6]]

test [split "1234567812345678" [4 4 2 2 1 1 1 1]]   ["1234" "5678" 
"12" "34" "5" "6" "7" "8"]

test [split first [(1 2 3 4 5 6 7 8 9)] 3]          [(1 2 3) (4 5 
6) (7 8 9)]

test [split #{0102030405060708090A} [4 3 1 2]]      [#{01020304} 
#{050607} #{08} #{090A}]

test [split [1 2 3 4 5 6] [2 1]]                [[1 2] [3]]


test [split [1 2 3 4 5 6] [2 1 3 5]]            [[1 2] [3] [4 5 6] 
[]]


test [split [1 2 3 4 5 6] [2 1 6]]              [[1 2] [3] [4 5 6]]


test [split [1 2 3 4 5 6] [3 2 2 -2 2 -4 3]]    [[1 2 3] [4 5] [6] 
[5 6] [3 4 5]]


test [split "abc,de,fghi,jk" #","]              ["abc" "de" "fghi" 
"jk"]

test [split "abc<br>de<br>fghi<br>jk" <br>]     ["abc" "de" "fghi" 
"jk"]

test [split "a.b.c" "."]     ["a" "b" "c"]
test [split "c c" " "]       ["c" "c"]
test [split "1,2,3" " "]     ["1,2,3"]
test [split "1,2,3" ","]     ["1" "2" "3"]
test [split "1,2,3," ","]    ["1" "2" "3" ""]
test [split "1,2,3," charset ",."]    ["1" "2" "3" ""]
test [split "1.2,3." charset ",."]    ["1" "2" "3" ""]


test [split "abc|de/fghi:jk" charset "|/:"]                     ["abc" 
"de" "fghi" "jk"]


test [split "abc^M^Jde^Mfghi^Jjk" [crlf | #"^M" | newline]]     ["abc" 
"de" "fghi" "jk"]

test [split "abc     de fghi  jk" [some #" "]]                  ["abc" 
"de" "fghi" "jk"]
A quick scan of the docs showed that negative skip val usage changed 
from the original design. I will revert the negate on those to match 
the doc'd behavior.
split: func [

    "Split a series into pieces; fixed or variable size, fixed number, 
    or at delimiters"
    series	[series!] "The series to split"

    dlm		[block! integer! char! bitset! any-string!] "Split size, delimiter(s), 
    or rule(s)." 

    /into	"If dlm is an integer, split into n pieces, rather than pieces 
    of length n."

    /local size piece-size count mk1 mk2 res fill-val add-fill-val
][
    either all [block? dlm  parse dlm [some integer!]] [
        map-each len dlm [
            either positive? len [
                copy/part series series: skip series len
            ] [
                series: skip series negate len
                ; return unset so that nothing is added to output
                ()
            ]
        ]
    ][
        size: dlm   ; alias for readability
        res: collect [
            parse/all series case [
                all [integer? size  into] [

                    if size < 1 [cause-error 'Script 'invalid-arg size]
                    count: size - 1

                    piece-size: to integer! round/down divide length? series size
                    if zero? piece-size [piece-size: 1]
                    [

                        count [copy series piece-size skip (keep/only series)]
                        copy series to end (keep/only series)
                    ]
                ]
                integer? dlm [

                    if size < 1 [cause-error 'Script 'invalid-arg size]

                    [any [copy series 1 size skip (keep/only series)]]
                ]

                'else [ ; = any [bitset? dlm  any-string? dlm  char? dlm]

                    [any [mk1: some [mk2: dlm break | skip] (keep/only copy/part mk1 
                    mk2)]]
                ]
            ]
        ]

        ;-- Special processing, to handle cases where the spec'd more items 
        in

        ;   /into than the series contains (so we want to append empty items),

        ;   or where the dlm was a char/string/charset and it was the last 
        char

        ;   (so we want to append an empty field that the above rule misses).
        fill-val: does [copy either any-block? series [[]] [""]]
        add-fill-val: does [append/only res fill-val]
        case [
            all [integer? size  into] [

                ; If the result is too short, i.e., less items than 'size, add
                ; empty items to fill it to 'size.

                ; We loop here, because insert/dup doesn't copy the value inserted.
                if size > length? res [
                    loop (size - length? res) [add-fill-val]
                ]
            ]
            ; integer? dlm [
            ; ]

            'else [ ; = any [bitset? dlm  any-string? dlm  char? dlm]

                ; If the last thing in the series is a delimiter, there is an

                ; implied empty field after it, which we add here.
                case [
                    bitset? dlm [

                        ; ATTEMPT is here because LAST will return NONE for an 

                        ; empty series, and finding none in a bitest is not allowed.

                        if attempt [find dlm last series] [add-fill-val]
                    ]
                    char? dlm [
                        if dlm = last series [add-fill-val]
                    ]
                    string? dlm [
                        if all [
                            find series dlm
                            empty? find/last/tail series dlm
                        ] [add-fill-val]
                    ]
                ]
            ]
        ]
                
        res
    ]
]
; Old design for negative skip vals

;test [split [1 2 3 4 5 6] [3 2 2 -2 2 -4 3]]    [[1 2 3] [4 5] [6] 
[5 6] [3 4 5]]
; New design for negative skip vals
test [split [1 2 3 4 5 6] [2 -2 2]]             [[1 2] [5 6]]
Steeve
18-Jul-2011
[9205x4]
Seems you wrecked the behavior when a parse rule is fulfilled.

[split] should keep the matched parts, you do the contrary (exclusion), 
why this change ?.
Ok, I see now you turned it back to the primary behavior. But it 
should be discussed at first.
I vote for the include behavior.
It makes sense because whatever new junk sequences are added in the 
source, the macthing process will continue to collect the expected 
tokens.
It makes sense because whatever new junk sequences are added in the 
source, the macthing process will continue to collect the expected 
tokens.
Gregg
18-Jul-2011
[9209]
Could you provide examples of what you mean? The original design 
was flexible, but perhaps not as useful. I understand why it was 
changed, and think it's better for general use.
Steeve
18-Jul-2011
[9210x2]
Well, I just read the code.

You replaced this:

    [any [mk1: some [mk2: dlm break | skip] (emit copy/part mk1 mk2)]]
by this:

    [any [mk1: [to dlm mk2: dlm | to end mk2:] (keep copy/part mk1 mk2)]]


In the first case: the rule is used to extract the matching sequences

In the second case, the rule is used to exclude the matching sequences.
Sorry, In fact it's the contrary (swap the 2 cases)
Gregg
18-Jul-2011
[9212]
OK. I'm not vested in the implementation, just the results. Feel 
free to improve things and make it more elegant. As long as the tests 
all pass, or we agree on behavior changes, I don't have a problem.
Steeve
18-Jul-2011
[9213]
Well...Seems I don't know how to make it  clear, as usual :-)
It's you who want to change the behavior.
But it's Ok I guess, since no one else complained :-)
Gregg
18-Jul-2011
[9214x2]
Hmmm, I thought I reverted to the current behavior (which was not 
the original behavior), aside from bug fixes.
If you could use a test case to explain, I might get it. I'm slow 
today.
Steeve
18-Jul-2011
[9216]
To me the current behavior is the one I have in the current code 
of R3
Gregg
18-Jul-2011
[9217]
And what behavior did I change (as a test case)?
Steeve
18-Jul-2011
[9218]
oK wait a little, I will do my best ;-)
Gregg
18-Jul-2011
[9219]
Maybe you can find one on http://www.rebol.com/r3/docs/functions/split.html
that shows it?
Steeve
18-Jul-2011
[9220x4]
current behavior:
split "-a-a'" ["a"]
>> ["a" "a"]

yours:
split "-a-a" ["a"]
>> ["-" "-"]
not tested though, I just read the code
hmmm, Seems It's me who is totaly wreckled
Ok forget my big mouth, if you can
Gregg
18-Jul-2011
[9224x2]
So, you're saying that you want to specify a delimiter, and have 
it keep that? In any case, that's not the current behavior:

>> split "-a-a'" ["a"]
== ["-" "-" "'" ""]

Here's mine:

>> split "-a-a'" ["a"]
== ["-" "-" "'"]
So, my final version above seems OK then?
Steeve
18-Jul-2011
[9226]
Yeah, I just taken my pills, I'm fine now
Gregg
18-Jul-2011
[9227x3]
LOL. :-)
I don't know the current system for submitting patches to R3. Once 
more people sign off on it, maybe BrianH will show up and see if 
we can get it in there.
Thanks for taking a look at it Steeve. Pills or not.
Steeve
18-Jul-2011
[9230x4]
I will probably rewrite it completly before though
Gezz, I wanted tp say HE not I
*wanted to say He (Brian)
Ok, time to go bed, it seems
sqlab
19-Jul-2011
[9234]
It seems the parse behaviour changed somewhere a few times..
Nevertheless I think split is overloaded and overcomplicated.

The parse rules should better go in a PARSE Common Patterns library, 
that is included with Rebol,
not unlike http://www.rebol.com/article/0508.html
Gregg
19-Jul-2011
[9235]
How would you redesign it?
BrianH
19-Jul-2011
[9236x4]
Well, for one thing, complex, flexible mezzanine functions tend to 
be slowed down by the conditional code that determines the actual 
behavior desired in a particular case. There are real advantages 
to seperating a complex function into multiple smaller, simpler functions. 
This makes it so the choices about which set of behavior to use are 
made by the programmer at development time instead of by the interpreter 
at runtime. SPLIT is a really large function that does many different 
things, so it's a good candidate for such a function split.
I liked Carl's idea of a SPLIT function that takes a series and returns 
the series from the head to the offset, and from the offset to the 
end. Like this:

    split: func [series [series!]] [reduce [copy/part head series series 
    copy series]]

At most, add a option to control the copying. Then have a seperate 
function split on a delimeter, another split into a number of parts, 
etc.
There are some mezzanine functions that have to be large and complex 
for other reasons. For instance, a couple of the LOAD subfunctions 
need to have functionality bundled together for security purposes. 
This doesn't seem to be the case with SPLIT.
It's one of the ironies of R3 that for a language that touts its 
ability to create user-designed dialects, inside the R3 mezzanine 
code, dialected functions are often too slow to be efficient enough 
for inclusion. This is why most of the builtin dialects are implemented 
in native code, through natives or commands. A dialect needs to be 
efficient enough to merit its use as opposed to the procedural equivalent, 
and easy enough to comprehend that the users of the dialect are likely 
to use it, rather than a simpler alternative. Developers' minds have 
overhead too.