World: r3wp

[Parse] Discussion of PARSE dialect

Maxim 28-Sep-2006 [1440x3]	simple and clean, good idea!
	I'm just starting to be able to actually USE parse for dialecting. So far I've been almost solely using it to replace regexp functionality.
	so many years of reboling (since core 1.2), and still parse remains largely untamed by myself.
Graham 29-Sep-2006 [1443x9]	This was I thought a simple task .. to parse a csv file....
	COHEN ,"WILLIAM ",""," 305782","123 "C" AVENUE","CORONADO ","CA","92118","560456788","(619)555-2730","( ) - 0","08/22/1927","M","SHARP CORONADO/MISSI","","","","","POLLICK","JAMES ","","MOUNTAIN","RODERICK ","",
	this seems to be a difficult line as there is an embedded quote viz "123 "c" Avenue"
	this is Gabriele's published parser
	CSV-parser: make object! [
	    line-rule: [field any [separator field]]
	    field: [[quoted-string | string] (insert tail fields any [f-val copy ""])]
	    string: [copy f-val any str-char]
	    quoted-string: [{"} copy f-val any qstr-char {"} (replace/all f-val {""} {"})]
	    str-char: none
	    qstr-char: [{""} | separator | str-char]
	    fields: []
	    f-val: none
	    separator: #";"
	    set 'parse-csv-line func [
	        "Parses a CSV line (returns a block of strings)"
	        line [string!]
	        /with sep [char!] "The separator between fields"
	    ] [
	        clear fields
	        separator: any [sep #";"]
	        str-char: complement charset join {"} separator
	        parse/all line line-rule
	        copy fields
	    ]
	]
	which was written to cope with embedded quotes, but fails where there is an empty field eg , "" ,
	This is Joel Neely's from the same day ...
	readcsv: make object! [
	    all-records: copy []
	    one-record: copy []
	    one-segment: copy ""
	    one-field: copy ""
	    noncomma: complement charset ","
	    nonquote: complement charset {"}
	    segment: [
	        copy one-segment any nonquote
	        (if found? one-segment [append one-field one-segment])
	    ]
	    quoted: [
	        {"} (one-field: copy "")
	        segment
	        any [{""} (append one-field {"}) segment]
	        {"}
	    ]
	    unquoted: [copy one-field any noncomma]
	    field: [[quoted | unquoted] (append one-record one-field)]
	    record: [field any ["," field]]
	    run: func [f [file!] /local line] [
	        all-records: copy []
	        foreach line read/lines f [
	            one-record: copy []
	            either parse/all line record [
	                append/only all-records one-record
	            ][
	                print ["parse failed:" line]
	            ]
	        ]
	        all-records
	    ]
	]
	which reports an error with this line.
	this might fix Gabriele's parser ..
	CSV-parser: make object! [
	    line-rule: [field any [separator field]]
	    field: [[quoted-string | string] (insert tail fields any [f-val copy ""])]
	    string: [copy f-val any str-char]
	    quoted-string: [{"} copy f-val any qstr-char {"} (if found? f-val [replace/all f-val {""} {"}])]
	    str-char: none
	    qstr-char: [{""} | separator | str-char]
	    fields: []
	    f-val: none
	    separator: #";"
	    set 'parse-csv-line func [
	        "Parses a CSV line (returns a block of strings)"
	        line [string!]
	        /with sep [char!] "The separator between fields"
	    ] [
	        clear fields
	        separator: any [sep #";"]
	        str-char: complement charset join {"} separator
	        parse/all line line-rule
	        copy fields
	    ]
	]
	perhaps not.
sqlab 29-Sep-2006 [1452]	Why don't you use split?
Gabriele 29-Sep-2006 [1453x2]	graham, iirc my version is meant to handle embedded quotes when properly escaped, i.e. you should have "123 ""C"" AVENUE" there for it to work.
Gabriele 29-Sep-2006 [1453x2]	i actually wonder why are quotes used in that line. they are only needed if the field contains the separator.
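A minimal sketch of the escaping Gabriele describes, assuming the usual CSV convention in which a quote inside a quoted field is written as two quotes. The words `field` and `inner` are illustrative only and are not taken from either parser quoted earlier.

```rebol
; Hypothetical example: decode one properly escaped CSV field by hand.
field: {"123 ""C"" AVENUE"}

; Drop the outer quotes, then collapse each doubled quote into a single one.
inner: copy/part next field back tail field
replace/all inner {""} {"}

print inner
```

With the unescaped form from Graham's file ("123 "C" AVENUE") there is no doubled quote to collapse, so a parser following this convention has no reliable way to tell where the field ends.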
Graham 29-Sep-2006 [1455]	split will work if there are no embedded commas I guess
Anton 3-Oct-2006 [1456]	What's the parse rule to go backwards ? -1 skip ?
Oldes 3-Oct-2006 [1457x2]	maybe this will help: x: [1 2 3 4 5] parse x [any [x: set d number! (probe x probe d x: next x) :x]]
Oldes 3-Oct-2006 [1457x2]	you can set the x to another position if you need
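A small sketch of the same idea applied to Anton's question about going backwards, assuming REBOL 2 parse semantics: a set-word captures the current input position, ordinary code in parens can move that position, and a get-word makes parse continue from it. The words `here` and `rest` are made up for the illustration.

```rebol
; Hypothetical example: step the parse position back by one element.
parse/all "abcd" [
    2 skip                    ; input is now at "cd"
    here: (here: back here)   ; capture the position, then move it back one
    :here                     ; parsing resumes at "bcd"
    copy rest to end
]

print rest
```

Here `rest` should end up as "bcd", i.e. the rule consumed two characters and then re-read one of them.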
Anton 3-Oct-2006 [1459]	Ah yes - very good :)
Maxim 3-Oct-2006 [1460x3]	my god, I think I finally -get- Parse... call me the village idiot. I used to use parse, now I also understand subconsciously it ;-)
	that should read "... I also understand it subconsciously"
	(parse rule inversion ;-)
Izkata 3-Oct-2006 [1463]	That's a ~very~ good example, Oldes... it should be put in the docs somewhere (if it isn't already.) I didn't understand how get-words and set-words worked in parse, either, before..
Volker 3-Oct-2006 [1464]	Nice demo of parse-position main features :)
Rebolek 4-Oct-2006 [1465]	I've got the following PARSE problem: I have a string - "<good tag><bad tag><other tag><good tag>" and I want to keep "good tag" and change the "<>" in the other tags to, let's say, "X" (I need to change them to HTML entities, but that doesn't matter now). So the result will look like: "<good tag>Xbad tagXXother tagX<good tag>" I've been working on it for the last few hours but still haven't found a solution. Is there one?
Anton 4-Oct-2006 [1466]	string: "<good tag><bad tag><other tag><good tag>"
	entity: "<ENTITY>"
	parse/all string [
	    any [
	        to "<" start: skip to ">" end: skip
	        (if not find copy/part start end "good tag" [
	            change/part start entity 1
	            ; fix up END (for when your entity is other than a 1-character long string)
	            end: skip end (length? entity) - 1
	            change/part end entity 1
	            ; fix up END again
	            end: skip end (length? entity) - 1
	        ]) :end skip
	    ]
	    to end
	]
	string
	;== {<good tag><ENTITY>bad tag<ENTITY><ENTITY>other tag<ENTITY><good tag>}
Rebolek 4-Oct-2006 [1467x3]	Anton, nice, thanks. But I also need it to work on this: string: "<good tag><bad tag> 3 > 5 <other tag><good tag with something inside>". I almost got it, but that non-symmetric "3 > 5" is still a problem for me.
	I'll probably replace everything and then just revert the "good tag" back. It's not very elegant, but...
	(hm, 3 > 5. my examples are not very 'real-life' :-))
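One possible reading of the "replace everything, then revert the good tags" fallback, as a hedged sketch rather than Rebolek's actual code. It assumes REBOL 2, that the entities wanted are `&lt;` and `&gt;`, that a good tag always starts with `<good tag`, and that a good tag contains no stray `>` of its own. The words `text`, `open-pos` and `close-pos` are invented for the example.

```rebol
; Hypothetical sketch: escape every angle bracket, then restore the good tags.
text: copy "<good tag><bad tag> 3 > 5 <other tag><good tag with something inside>"

replace/all text "<" "&lt;"
replace/all text ">" "&gt;"

parse/all text [
    any [
        to "&lt;good tag" open-pos:
        (change/part open-pos "<" 4)     ; "&lt;" is 4 characters long
        :open-pos
        to "&gt;" close-pos:
        (change/part close-pos ">" 4)    ; "&gt;" is 4 characters long
        :close-pos
    ]
    to end
]

print text
```

The unbalanced "3 > 5" is harmless here because every bracket is escaped first and only the brackets belonging to a good tag are put back. Input that already contains `&lt;good tag` as literal text would be wrongly reverted, so this is only suitable where that cannot happen.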
Anton 4-Oct-2006 [1470]	Such unmatched tags cause a headache for any parser.
Rebolek 4-Oct-2006 [1471]	YES
Anton 4-Oct-2006 [1472x2]	What are the HTML entities by the way ?
Anton 4-Oct-2006 [1472x2]	&lt; and &gt; ?
BrianH 4-Oct-2006 [1474]	Yes.
Rebolek 4-Oct-2006 [1475]	Anton: yes. I have to check a lot of XML files full of errors (actually it's Vista documentation, so it's understandable...)
Anton 4-Oct-2006 [1476x3]	Ok, give this a burl.
	string: "<good tag><bad tag> 3 > 5 <other tag><good tag with something inside>"
	string: " > >> < <<good tag><bad tag> 3 > 5 <other tag><good tag etc> >> > "
	; (1) search for end tags >, they are erroneous so replace them
	; (2) search for start tags <, if there is more than one, replace all except the last one
	; (3) search for end tag >, check tag body and replace if necessary
	entity: "&entity;"
	ntag: complement charset "<>" ; non tag
	parse/all result: copy string [
	    any [
	        ; (1)
	        any [
	            any ntag start: ">" end: (
	                change/part start entity 1
	                end: skip start length? entity
	                ;print [1 index? start]
	            ) :end
	        ]
	        ; (2)
	        (start: none stop?: none)
	        any [
	            any ntag start: "<" end: ;(print [2 mold start])
	            any ntag "<" (
	                ;print "found a second start tag"
	                change/part start entity 1
	                end: skip start length? entity
	                ;(print [2.1 mold copy/part start end])
	                start: none
	            ) :end
	        ]
	        (if none? start [stop?: 'break]) stop?
	        ; ok, we found at least one start tag
	        ;(print ["OK we found at least one start tag" mold start])
	        :start skip
	        ; (3)
	        any ntag end: ">" ;(print [3 mold copy/part start end])
	        (if not find copy/part start end "good tag" [
	            ;print ["found a bad tag" mold copy/part start end]
	            change/part start entity 1
	            ; fix up END (for when your entity is other than a 1-character long string)
	            end: skip end (length? entity) - 1
	            change/part end entity 1
	            ; fix up END again
	            end: skip end (length? entity) - 1
	        ]) :end skip
	    ]
	    to end
	]
	result
	All you need to do now is define two separate entity strings for < and > and then use the right one when replacing.
Rebolek 4-Oct-2006 [1479]	great, I'll test it, thanks
Anton 4-Oct-2006 [1480x2]	Holy ---- ! where did two and a half hours go ?
Anton 4-Oct-2006 [1480x2]	oh no.. maybe I only spent one and a half hours on it, but still...!
Rebolek 4-Oct-2006 [1482]	Erhm sorry ;)
Anton 4-Oct-2006 [1483]	Ahh don't worry about that.
Ladislav 4-Oct-2006 [1484x2]	this looks like an alternative:
Ladislav 4-Oct-2006 [1484x2]	result: ""
	parse/all string [
	    any [
	        ; starting good tag
	        copy s ["<good tag" thru ">"] (append result s) |
	        ; ending good tag
	        "</good tag>" (append result "</good tag>") |
	        ; entity replacement
	        "<" (append result "&lt;") |
	        ">" (append result "&gt;") |
	        copy s skip (append result s)
	    ]
	]
	print result
Volker 4-Oct-2006 [1486]	In this case you may also look at load/markup ;)
Tomc 4-Oct-2006 [1487]	what Volker said.
	s: "<good tag><bad tag> 3 > 5 <other tag><good tag with something inside>"
	b: load/markup s
	while [not tail? b][
	    either tag? first b [
	        either find/match first b "good tag"
	            [print first b]
	            [print rejoin ["X" to string! first b "X"]]
	    ][print first b]
	    b: next b
	]
Oldes 5-Oct-2006 [1488x2]	I think there is some limit in load/markup - I would not use it for large data
Oldes 5-Oct-2006 [1488x2]	And Rebolek, you can use this code of mine to remove unwanted tags (it was already posted here a few days before, but with a little bug - this version should be OK as I'm using it)
	remove-tags: func [html /except allowed-tags /local new x tag name tagchars][
	    if not string? html [return html]
	    new: make string! length? html
	    tagchars: charset [#"a" - #"z" #"A" - #"Z"]
	    parse/all html [
	        any [
	            copy x to {<} copy tag thru {>} (
	                if not none? x [insert tail new x]
	                if all [
	                    except
	                    parse/all tag ["<" opt #"/" copy name some tagchars to end]
	                    find allowed-tags name
	                ][
	                    insert tail new tag
	                ]
	            )
	        ]
	        copy x to end (if not none? x [insert tail new x])
	    ]
	    new
	]