World: r3wp

[Parse] Discussion of PARSE dialect

[unknown: 5] 2-Sep-2007 [2226x10]	parse data "^-"
	The problem is that some of these broken html tags cause parse to not work correctly. The tags will contain double quotes (The result of an export from oscommerce).
	sorry i was using parse/all data "^-"
	Doesn't even have to be broken tags it appears
	Just when a quote is preceding the tag
	data: {my string^-"<span style="font: 12px arial;>some text</span>"}
	>> parse/all data "^-" == ["my string" "<span style=" {font: 12px arial;>some text</span>"}]
	Notice you get it breaking the string even where there is NOT a tab.
	Is this a bug?
	I've looked at this some more, and it only seems to be a problem when the quote precedes the <span> tag. If you move the quote around, you get the expected parsing.
btiffin 2-Sep-2007 [2236x2]	Your example still doesn't seem to jibe with the documentation. Reading the docs, I would expected two strings in the output block: "my string" and the rest, in braces. It has something to do with a double quote starting a parse sequence. {"abc"def} parses as ["abc" "def"]; { "abc"def"} parses as a single string as expected: [{ "abc"def}]
	Typos: expected = expect; second example was supposed to be { "abc"def}. The space after the brace seems to trigger different behaviour than {" with no space after the brace. Any character, actually; the bad behaviour is only with a brace immediately followed by a double quote.
[unknown: 5] 3-Sep-2007 [2238]	btiffin use this example: data: {my string^-"<span style="font: 12px arial;>some text</span>"}
btiffin 3-Sep-2007 [2239]	Yeah, I think the weird parsing behavior is due to the fact that the tab separator is followed immediately by a token that begins with a double quote. If you change the data to ... ^- "<span... (note the space after the tab), the behaviour changes, giving >> parse/all data2 to string! tab == ["my string" { "<span style="font: 12px arial;>some text</span>"}] as I would expect. You've uncovered something here. parse seems dependent on quote as the first symbol in a token.
[unknown: 5] 3-Sep-2007 [2240x2]	Yeah but inserting the extra space is a crude workaround that still requires extra processing to then remove the space that was added. You think this is a bug with parse?
	I would add it to Rambo but not sure if it is one just yet.
RobertS 3-Sep-2007 [2242]	is there something I could test on 2.7.5 ?
[unknown: 5] 3-Sep-2007 [2243x2]	sure test this:
	parse/all data: {my string^-"<span style="font: 12px arial;>some text</span>"} "^-"
btiffin 3-Sep-2007 [2245]	Paul; re; Inserting extra space... No sorry, didn't mean to imply that. Just pointing out that you've discovered a bug; afaik.
[unknown: 5] 3-Sep-2007 [2246]	Yeah, I messaged Gabriele - want him to take a look at it.
Gabriele 4-Sep-2007 [2247]	it's not a bug - parse without a rule is meant for csv parsing, and quotes delimit a field. it's not as useful as it was intended to be, but it's intentional behavior. you need to provide your own rule if you don't want quotes to be parsed.
btiffin 4-Sep-2007 [2248]	Umm, that's not quite what is happening here, imho. parse/all {"abc"def} to string! tab should return [{"abc"def}], should it not? ["abc" {def}] seems wrong. parse/all { "abc"def} to string! tab returns [{ "abc"def}] as expected. The quote being in the first position affects the parse behaviour that much?
[unknown: 5] 4-Sep-2007 [2249]	Yeah, it definitely seems like odd behavior to me. Also, isn't the TAB string the rule? Maybe I don't get what you're saying, Gabriele.
PeterWood 4-Sep-2007 [2250]	Paul: I don't think the TAB string counts as a rule. It is a parameter supplying a specified delimiter when using parse for splitting strings (paraphrasing the User Guide).
Gabriele 4-Sep-2007 [2251]	tab is the delimiter. " after the delimiter (which also means " as first char) means that the field is delimited by quotes. as i said, it was intended to parse csv files easily; however, i think it gets in the way more often than not. there should at least be a refinement to disable this. in any case, currently the only way around it is using your own rule.
[unknown: 5] 4-Sep-2007 [2252]	So how would you fix this problem with a rule?
Tomc 4-Sep-2007 [2253]	data: {my string^-"<span style="font: 12px arial;>some text</span>"}
	rule: [copy token [to tab | to end] (insert/only tail result token) skip rule]
	parse/all data [(result: copy []) rule]
	result
	== ["my string" {"<span style="font: 12px arial;>some text</span>"}]
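A non-recursive variant of the same idea can be sketched with a charset. This is only a sketch, not anything posted in the thread: `non-tab` and `fields` are names made up for this example, and the `any [token copy ""]` guard is there on the assumption that `copy` may yield none when a field is empty.

```rebol
REBOL [Title: "Tab split with an explicit rule (sketch)"]

data: {my string^-"<span style="font: 12px arial;>some text</span>"}

non-tab: complement charset "^-"    ; hypothetical name: every character except tab
fields: copy []                     ; hypothetical name: collected fields

parse/all data [
    ; first field, then zero or more "tab + field" pairs
    copy token any non-tab (append fields any [token copy ""])
    any [
        tab copy token any non-tab (append fields any [token copy ""])
    ]
]

probe fields
```

Because the rule never treats a double quote specially, the quoted tag should stay in one field, the same two-field result Tomc shows above.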
btiffin 4-Sep-2007 [2254]	Thanks for the clarification Gabriele. Tomc et al; we rebols really really need a place for long term storage of these types of 'work arounds' :)
Tomc 4-Sep-2007 [2255x2]	don't see it as a workaround. If I am using parse/all I always have a rule in a block and not a simple string.
	I see using parse data string as a shortcut that I can rarely afford
btiffin 4-Sep-2007 [2257x2]	It's kinda why I put work around in quotes, as it isn't really a workaround, more a means to an end.
	Still think we need a hints and tips pile somewhere :)
[unknown: 5] 5-Sep-2007 [2259]	thanks Tomc.
PatrickP61 5-Sep-2007 [2260x2]	Hi all, have any of you written a parser to handle .rtf files? I am trying to create a simple template file that I can parse against to identify Underlined, Bold, Italic, or Regular field values.
	Example: (Since I cannot bold, italicize, or underline within AltME, please pretend to see what I'm saying.)
	Config file (when I typed the following using WordPad) looks like this:
	Default Arial font 10 * Regular Courier New font 11 * Italic * Bold
	Bold Italic * Regular Underline * Regular Strikeout
	Regular Underline Strikeout
	Bold Italic Underline Strikeout
	Same file when using Notepad to view:
	{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fswiss\fcharset0 Arial;}{\f1\fmodern\fprq1\fcharset0 Courier New;}}
	{\colortbl ;\red0\green0\blue0;}
	{\\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\f0\fs20 Default Arial font 10 \cf1\f1\fs22 Regular Courier New font 11 * \i Italic * \b\i0 Bold\par
	\i Bold Italic * \ul\b0\i0 Regular Underline * \ulnone\strike Regular Strikeout\par
	\ul Regular Underline Strikeout\par
	\b\i Bold Italic Underline Strikeout\par
	\cf0\ulnone\b0\i0\strike0\f0\fs20\par
	}
	I can guess that \fs20 refers to the default Arial font, while \fs22 is the Courier New font.
	\i italic
	\b bold
	\ul underline
	\par may mean newline
	I am not sure what form I want the parser to return the results in, and was wondering if someone has already made a generic parser for .rtf files, or can point me to info about them?
	I found some info in Wiki
Tomc 5-Sep-2007 [2262]	http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnrtfspec/html/rtfspec.asp
PatrickP61 5-Sep-2007 [2263]	Wow, I never realized how incredibly extensive RTF is. The ONLY thing I need is to identify the character position and length of Regular, Italic, Bold, Underline, or Strikeout and the text, so in my above example, maybe the parser could return this:
	Note: birsu stands for Bold, Italic, Regular, Strikeout, Underline.
	Line  Pos   Len   birsu  Text
	1     1     24    ..r..  "Default Arial font 10 * "
	1     25    (n)   ..r..  "Regular Courier New font 11 * "
	1     (..)  (..)  .i...  "Italic * "
	1     (..)  (..)  b....  "Bold"(newline)  <-- note \i0 turns off italic
	2     1     14    bi...  "Bold Italic * "  <-- note \b is still in effect from a previous setting
	2     15    (..)  ..r.u  "Regular Underline * "  <-- note \i\b is turned off.
	2     (..)  (..)  ..rs.  "Regular Strikeout"(newline)
	3     1     (..)  ..rsu  "Regular Underline Strikeout"(newline)
	4     1     (..)  bi.su  "Bold Italic Underline Strikeout"(newline)
	Ideas on how to do this as a start?
Gregg 10-Sep-2007 [2264]	First, you may need to spend some time with PARSE, so you're really comfortable with it. Taking on something like RTF--even just a subset--is going to be a sizable task. I would start by identifying the escapes (backslash words) and figuring out how you're going to maintain state as attributes are applied and removed.
PatrickP61 10-Sep-2007 [2265]	Hey Gregg -- That is just what I've been doing. I have identified the following:
	1. Any printable \, { or } shows up in the RTF escaped with a backslash, as \\, \{ or \}. Any remaining \, { or } is part of an RTF command.
	2. { and } delimit groups within the RTF: an open brace starts a group and a close brace ends it. The semicolon terminates sub-parameters of a particular command.
	3. \xxx always identifies a particular command, with an optional number appended to it. Example: \b means bold, while \b0 means bold off.
	What I am toying with is to define simple rules that break a string of RTF commands and embedded text into two parts, the command part and a parameter part (some parameters may be a block of multiple values). I'm studying the Parse command to see what I can do simply and progress from there.
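The three observations above can be turned into a first tokenizer along these lines. It is a minimal sketch under stated assumptions, not a full RTF parser: the names `letters`, `digits`, `plain` and `tokens` are invented for this example, group braces are simply skipped rather than tracked, and things like negative parameters or `\'hh` escapes are not handled. Control words come out as `[name parameter]` blocks and text runs as plain strings.

```rebol
REBOL [Title: "RTF tokenizer (sketch)"]

; sample taken from the Notepad view quoted earlier in the thread
rtf: {\i Italic * \b\i0 Bold\par \i Bold Italic * \ul\b0\i0 Regular Underline}

letters: charset [#"a" - #"z" #"A" - #"Z"]
digits:  charset "0123456789"
plain:   complement charset "\{}"     ; anything that is not \ { or }
tokens:  copy []

parse/all rtf [
    any [
        ; escaped literal character: \\ \{ or \}
        "\" copy txt ["\" | "{" | "}"] (append tokens txt)
        ; control word, optional numeric parameter, one optional delimiting space
        | "\" copy name some letters
          [copy num some digits | (num: none)]
          opt " "
          (append/only tokens reduce [name num])
        ; run of ordinary text
        | copy txt some plain (append tokens txt)
        ; anything else (group braces, unsupported escapes) is skipped
        | skip
    ]
]

foreach token tokens [probe token]
```

Keeping the parameter as a separate value (none when absent) makes it straightforward to treat `\b` as "on" and `\b0` as "off" when maintaining the attribute state Gregg mentions.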
Steeve 16-Oct-2007 [2266x2]	I know your script, Gabriele, and other similar scripts; I just think we could be more concise in writing a grammar by using reflexive rules.
	I am aware that it makes the parser harder to understand, but it is just an intellectual exercise for the moment.
Graham 16-Nov-2007 [2268x4]	How to reliably break a block of text up by whitespace?
	I tried parse/all text "^/^- " but I still get large blocks of text as one
	I guess I have to use charsets of whitespace and non-whitespace
	just seems that it should be easier to split up a block of text by the whitespace
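The charset approach mentioned here can be sketched as follows. The sample string and the names `space-chars`, `token-chars` and `tokens` are assumptions made for the example; only space, tab and newline are treated as whitespace.

```rebol
REBOL [Title: "Whitespace split with charsets (sketch)"]

sample: {one "two"  {three}^-four^/five}

space-chars: charset " ^-^/"              ; space, tab, newline
token-chars: complement space-chars       ; everything else, including " and {
tokens: copy []

parse/all sample [
    any [
        some space-chars
        | copy token some token-chars (append tokens token)
    ]
]

probe tokens
```

Since the rule block only looks at the two charsets, quotes and braces get no special treatment and should stay inside their tokens.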
Sunanda 16-Nov-2007 [2272]	Have you tried parse/all trim/lines "..." " "
Graham 16-Nov-2007 [2273x2]	it's getting fooled by "{" chars I think
	parse doesn't like " and {
Sunanda 16-Nov-2007 [2275]	That rings a bell --- I vaguely remember having to do stuff like replacing " or } with to-char 0 before doing some parses, and then changing back afterwards. That works if you have no to-char 0 in your strings