World: r3wp
[Parse] Discussion of PARSE dialect
Maxim 28-Sep-2006 [1440x3] | simple and clean, good idea! |
I'm just starting to be able to actually USE parse for dialecting. So far I've been almost solely using it to replace regexp functionality. | |
So many years of REBOLing (since Core 1.2), and parse still remains largely untamed by me. | |
Graham 29-Sep-2006 [1443x9] | I thought this was a simple task: parsing a CSV file... |
COHEN ,"WILLIAM ",""," 305782","123 "C" AVENUE","CORONADO ","CA","92118","560456788","(619)555-2730","( ) - 0","08/22/1927","M","SHARP CORONADO/MISSI","","","","","POLLICK","JAMES ","","MOUNTAIN","RODERICK ","", | |
This seems to be a difficult line, as there is an embedded quote, viz. "123 "C" AVENUE" | |
This is Gabriele's published parser:
CSV-parser: make object! [
    line-rule: [field any [separator field]]
    field: [[quoted-string | string] (insert tail fields any [f-val copy ""])]
    string: [copy f-val any str-char]
    quoted-string: [{"} copy f-val any qstr-char {"} (replace/all f-val {""} {"})]
    str-char: none
    qstr-char: [{""} | separator | str-char]
    fields: []
    f-val: none
    separator: #";"
    set 'parse-csv-line func [
        "Parses a CSV line (returns a block of strings)"
        line [string!]
        /with sep [char!] "The separator between fields"
    ] [
        clear fields
        separator: any [sep #";"]
        str-char: complement charset join {"} separator
        parse/all line line-rule
        copy fields
    ]
] | |
which was written to cope with embedded quotes, but fails where there is an empty field eg , "" , | |
This is Joel Neely's from the same day ...
readcsv: make object! [
    all-records: copy []
    one-record: copy []
    one-segment: copy ""
    one-field: copy ""
    noncomma: complement charset ","
    nonquote: complement charset {"}
    segment: [
        copy one-segment any nonquote
        (if found? one-segment [append one-field one-segment])
    ]
    quoted: [
        {"} (one-field: copy "")
        segment
        any [{""} (append one-field {"}) segment]
        {"}
    ]
    unquoted: [copy one-field any noncomma]
    field: [[quoted | unquoted] (append one-record one-field)]
    record: [field any ["," field]]
    run: func [f [file!] /local line] [
        all-records: copy []
        foreach line read/lines f [
            one-record: copy []
            either parse/all line record [
                append/only all-records one-record
            ][
                print ["parse failed:" line]
            ]
        ]
        all-records
    ]
] | |
which reports an error with this line. | |
This might fix Gabriele's parser ..
CSV-parser: make object! [
    line-rule: [field any [separator field]]
    field: [[quoted-string | string] (insert tail fields any [f-val copy ""])]
    string: [copy f-val any str-char]
    quoted-string: [{"} copy f-val any qstr-char {"} (if found? f-val [replace/all f-val {""} {"}])]
    str-char: none
    qstr-char: [{""} | separator | str-char]
    fields: []
    f-val: none
    separator: #";"
    set 'parse-csv-line func [
        "Parses a CSV line (returns a block of strings)"
        line [string!]
        /with sep [char!] "The separator between fields"
    ] [
        clear fields
        separator: any [sep #";"]
        str-char: complement charset join {"} separator
        parse/all line line-rule
        copy fields
    ]
] | |
perhaps not. | |
sqlab 29-Sep-2006 [1452] | Why don't you use split? |
Gabriele 29-Sep-2006 [1453x2] | graham, iirc my version is meant to handle embedded quotes when properly escaped, i.e. you should have "123 ""C"" AVENUE" there for it to work. |
I actually wonder why quotes are used in that line. They are only needed if the field contains the separator. | |
Graham 29-Sep-2006 [1455] | split will work if there are no embedded commas I guess |
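A minimal sketch of the splitting idea, assuming REBOL 2 and assuming that "split" here can be approximated by calling parse with a delimiter string instead of a rule block. The sample line and the word names `line` and `fields` are made up for illustration; the fields contain no embedded commas or quotes, which is the case Graham says should work.

```rebol
REBOL []

; Hypothetical sample line: no quotes and no commas inside fields.
line: {COHEN,WILLIAM,305782,CORONADO,CA}

; With a string instead of a rule block, parse treats the characters
; of that string as delimiters and returns a block of strings.
; /all stops parse from also splitting on whitespace.
fields: parse/all line ","

probe fields
```

Fields with an unescaped embedded quote, like the "123 "C" AVENUE" one above, are a different matter and are not covered by this sketch.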
Anton 3-Oct-2006 [1456] | What's the parse rule to go backwards ? -1 skip ? |
Oldes 3-Oct-2006 [1457x2] | maybe this will help: x: [1 2 3 4 5] parse x [any [x: set d number! (probe x probe d x: next x) :x]] |
you can set the x to another position if you need | |
Anton 3-Oct-2006 [1459] | Ah yes - very good :) |
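The same set-word / get-word mechanism answers the "go backwards" question directly. The following is only a sketch, assuming REBOL 2 parse; `data`, `pos` and `ok` are hypothetical names chosen for this illustration.

```rebol
REBOL []

data: "abc"

ok: parse/all data [
    2 skip              ; the input position is now at "c"
    pos:                ; store the current position in POS
    (pos: back pos)     ; step POS one character back, to "bc"
    :pos                ; resume parsing from POS
    "b"                 ; so "b" can be matched again
    to end
]

print ok
```

The set-word stores the position, the paren runs ordinary REBOL code that can move it in either direction, and the get-word tells parse to continue from wherever the word now points.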
Maxim 3-Oct-2006 [1460x3] | my god, I think I finally -get- Parse... call me the village idiot. I used to use parse, now I also understand subconsciously it ;-) |
that should read "... I also understand it subconsciously" | |
(parse rule inversion ;-) | |
Izkata 3-Oct-2006 [1463] | That's a ~very~ good example, Oldes... it should be put in the docs somewhere (if it isn't already.) I didn't understand how get-words and set-words worked in parse, either, before.. |
Volker 3-Oct-2006 [1464] | Nice demo of parse-position main features :) |
Rebolek 4-Oct-2006 [1465] | I've got the following PARSE problem: I have a string, "<good tag><bad tag><other tag><good tag>", and I want to keep "good tag" but change the "<" and ">" of the other tags to, let's say, "X" (I need to change them to HTML entities, but that doesn't matter now). So the result will look like: "<good tag>Xbad tagXXother tagX<good tag>". I've been working on it for the last few hours but still haven't found a solution. Is there one? |
Anton 4-Oct-2006 [1466] |
string: "<good tag><bad tag><other tag><good tag>"
entity: "<ENTITY>"
parse/all string [
    any [
        to "<" start: skip to ">" end: skip
        (if not find copy/part start end "good tag" [
            change/part start entity 1
            ; fix up END (for when your entity is other than a 1-character long string)
            end: skip end (length? entity) - 1
            change/part end entity 1
            ; fix up END again
            end: skip end (length? entity) - 1
        ])
        :end skip
    ]
    to end
]
string
;== {<good tag><ENTITY>bad tag<ENTITY><ENTITY>other tag<ENTITY><good tag>} |
Rebolek 4-Oct-2006 [1467x3] | Anton, nice, thanks. But I also need it to work on this: string: "<good tag><bad tag> 3 > 5 <other tag><good tag with something inside>". I almost got it, but that non-symmetric "3 > 5" is still a problem for me. |
I'll probably replace everything and then just revert the "good tag" back. It's not very elegant, but... | |
(hm, 3 > 5. my examples are not very 'real-life' :-)) | |
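The "replace everything, then revert" plan could look roughly like this. It is a sketch only, assuming REBOL 2 and assuming the replacements are the HTML entities for "<" and ">"; the word `text` is a hypothetical name, and only the exact tag "<good tag>" is restored.

```rebol
REBOL []

text: copy "<good tag><bad tag> 3 > 5 <other tag><good tag>"

; Step 1: escape every angle bracket, good or bad.
replace/all text "<" "&lt;"
replace/all text ">" "&gt;"

; Step 2: turn the escaped form of the wanted tag back into a real tag.
replace/all text "&lt;good tag&gt;" "<good tag>"

print text
; prints: <good tag>&lt;bad tag&gt; 3 &gt; 5 &lt;other tag&gt;<good tag>
```

A good tag with attributes or other content inside would not match the literal string in step 2, so that case would still need a parse rule.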
Anton 4-Oct-2006 [1470] | Such unmatched tags cause a headache for any parser. |
Rebolek 4-Oct-2006 [1471] | YES |
Anton 4-Oct-2006 [1472x2] | What are the HTML entities by the way ? |
&lt; and &gt; ? | |
BrianH 4-Oct-2006 [1474] | Yes. |
Rebolek 4-Oct-2006 [1475] | Anton: yes. I have to check a lot of XML files full of errors (actually it's Vista documentation, so it's understandable...) |
Anton 4-Oct-2006 [1476x3] | Ok, give this a burl. |
string: "<good tag><bad tag> 3 > 5 <other tag><good tag with something inside>"
; a nastier test string (this assignment overrides the one above)
string: " > >> < <<good tag><bad tag> 3 > 5 <other tag><good tag etc> >> > "
; (1) search for end tags >, they are erroneous so replace them
; (2) search for start tags <, if there is more than one, replace all except the last one
; (3) search for end tag >, check tag body and replace if necessary
entity: "&entity;"
ntag: complement charset "<>" ; non tag
parse/all result: copy string [
    any [
        ; (1)
        any [
            any ntag start: ">" end: (
                change/part start entity 1
                end: skip start length? entity
                ;print [1 index? start]
            ) :end
        ]
        ; (2)
        (start: none stop?: none)
        any [
            any ntag start: "<" end: ;(print [2 mold start])
            any ntag "<" (
                ;print "found a second start tag"
                change/part start entity 1
                end: skip start length? entity
                ;(print [2.1 mold copy/part start end])
                start: none
            ) :end
        ]
        (if none? start [stop?: 'break])
        stop?
        ; ok, we found at least one start tag
        ;(print ["OK we found at least one start tag" mold start])
        :start skip
        ; (3)
        any ntag end: ">" ;(print [3 mold copy/part start end])
        (if not find copy/part start end "good tag" [
            ;print ["found a bad tag" mold copy/part start end]
            change/part start entity 1
            ; fix up END (for when your entity is other than a 1-character long string)
            end: skip end (length? entity) - 1
            change/part end entity 1
            ; fix up END again
            end: skip end (length? entity) - 1
        ])
        :end skip
    ]
    to end
]
result | |
All you need to do now is define two separate entity strings for < and > and then use the right one when replacing. | |
Rebolek 4-Oct-2006 [1479] | great, I'll test it, thanks |
Anton 4-Oct-2006 [1480x2] | Holy ---- ! where did two and a half hours go ? |
oh no.. maybe I only spent one and a half hours on it, but still...! | |
Rebolek 4-Oct-2006 [1482] | Erhm sorry ;) |
Anton 4-Oct-2006 [1483] | Ahh don't worry about that. |
Ladislav 4-Oct-2006 [1484x2] | this looks like an alternative: |
result: copy "" ; copy, so repeated runs do not append to the same literal string
parse/all string [
    any [
        ; starting good tag
        copy s ["<good tag" thru ">"] (append result s) |
        ; ending good tag
        "</good tag>" (append result "</good tag>") |
        ; entity replacement
        "<" (append result "&lt;") |
        ">" (append result "&gt;") |
        ; any other character is copied through unchanged
        copy s skip (append result s)
    ]
]
print result | |
Volker 4-Oct-2006 [1486] | In this case you may also look at load/markup ;) |
Tomc 4-Oct-2006 [1487] | what Volker said.
s: "<good tag><bad tag> 3 > 5 <other tag><good tag with something inside>"
b: load/markup s
while [not tail? b][
    either tag? first b [
        either find/match first b "good tag" [
            print first b
        ][
            print rejoin ["X" to string! first b "X"]
        ]
    ][
        print first b
    ]
    b: next b
] |
Oldes 5-Oct-2006 [1488x2] | I think there is some limit in load/markup - I would not use it for large data |
And Rebolek, you can use this code of mine to remove unwanted tags (it was already posted here a few days before, but with a small bug; this version should be OK, as I'm using it):
remove-tags: func [html /except allowed-tags /local new x tag name tagchars][
    if not string? html [return html]
    new: make string! length? html
    tagchars: charset [#"a" - #"z" #"A" - #"Z"]
    parse/all html [
        any [
            copy x to {<} copy tag thru {>} (
                ; keep the text that precedes the tag
                if not none? x [insert tail new x]
                ; keep the tag itself only when /except is used and its name is allowed
                if all [
                    except
                    parse/all tag ["<" opt #"/" copy name some tagchars to end]
                    find allowed-tags name
                ][
                    insert tail new tag
                ]
            )
        ]
        ; keep whatever follows the last tag
        copy x to end (if not none? x [insert tail new x])
    ]
    new
] | |