World: r3wp
[Parse] Discussion of PARSE dialect
[unknown: 5] 1-Sep-2007 [2220x2] | Tom, for parse I only want to parse/all data tab. The problem is that parse will break apart html tags and more. I don't want to parse out tags because they need to be left intact to some extent. |
It just seems to me that parse/all data tab doesn't ONLY split at the tabs but also breaks at these double quotes. | |
Tomc 2-Sep-2007 [2222x3] | Paul how are you defining tab? it seems to work for me. |
>> str == {some chars "" some more chars"" and more} >> parse/all str "^-" == [{some chars "" some more chars"" and more}] | |
there are no breaks at the double quotes. | |
[unknown: 5] 2-Sep-2007 [2225x11] | It looks like it breaks on html tags that might be broken. For example, I was testing parse on a tab-delimited file and performing the following parse: |
parse data "^-" | |
The problem is that some of these broken html tags cause parse to not work correctly. The tags will contain double quotes (The result of an export from oscommerce). | |
sorry i was using parse/all data "^-" | |
Doesn't even have to be broken tags it appears | |
Just when a quote is preceding the tag | |
data: {my string^-"<span style="font: 12px arial;>some text</span>"} | |
>> parse/all data "^-" == ["my string" "<span style=" {font: 12px arial;>some text</span>"}] | |
Notice you get it breaking the string even where there is NOT a tab. | |
Is this a bug? | |
I've looked at this some more and it only seems to be a problem if the quote is preceding the <span> tag. If you move the quote around you get the expected parsing. | |
btiffin 2-Sep-2007 [2236x2] | Your example still doesn't seem to jibe with the documentation. Reading the docs, I would expected two strings in the output block: "my string" and the rest, in braces. It has something to do with a double quote starting a parse sequence. {"abc"def} parses as ["abc" "def"], while { "abc"def"} parses as a single string as expected: [{ "abc"def}] |
Typos: "expected" should be "expect", and the second example was supposed to be { "abc"def}. The space after the brace seems to trigger different behaviour than {" with no space after the brace. Any character does, actually; the bad behaviour occurs only with a brace immediately followed by a double quote. | |
[unknown: 5] 3-Sep-2007 [2238] | btiffin use this example: data: {my string^-"<span style="font: 12px arial;>some text</span>"} |
btiffin 3-Sep-2007 [2239] | Yeah, I think the weird parsing behavior is due to the fact that the tab separator is followed immediately by a token that begins with a double quote. If you change the data to ... ^- "<span... (note the space after the tab), the behaviour changes, giving >> parse/all data2 to string! tab == ["my string" { "<span style="font: 12px arial;>some text</span>"}] As I would expect. You've uncovered something here. parse seems dependent on quote as the first symbol in a token. |
[unknown: 5] 3-Sep-2007 [2240x2] | Yeah but inserting the extra space is a crude workaround that still requires extra processing to then remove the space that was added. You think this is a bug with parse? |
I would add it to Rambo but not sure if it is one just yet. | |
RobertS 3-Sep-2007 [2242] | is there something I could test on 2.7.5 ? |
[unknown: 5] 3-Sep-2007 [2243x2] | sure test this: |
parse/all data: {my string^-"<span style="font: 12px arial;>some text</span>"} "^-" | |
btiffin 3-Sep-2007 [2245] | Paul; re; Inserting extra space... No sorry, didn't mean to imply that. Just pointing out that you've discovered a bug; afaik. |
[unknown: 5] 3-Sep-2007 [2246] | Yeah, I messaged Gabriele - want him to take a look at it. |
Gabriele 4-Sep-2007 [2247] | it's not a bug - parse without a rule is meant for csv parsing, and quotes delimit a field. it's not as useful as it was intended to be, but it's intentional behavior. you need to provide your own rule if you don't want quotes to be parsed. |
btiffin 4-Sep-2007 [2248] | Umm, that's not quite what is happening here, imho. parse/all {"abc"def} to string! tab should return [{"abc"def}], should it not? ["abc" {def}] seems wrong. parse/all { "abc"def} to string! tab returns [{ "abc"def}] as expected. The quote being in the first position affects the parse behaviour that much? |
[unknown: 5] 4-Sep-2007 [2249] | Yeah, it definitely seems like odd behavior to me. Also, isn't the TAB string the rule? Maybe I don't get what you're saying, Gabriele. |
PeterWood 4-Sep-2007 [2250] | Paul: I don't think the TAB string counts as a rule. It is a parameter supplying a specified delimiter when using parse for splitting strings (paraphrasing the User Guide). |
Gabriele 4-Sep-2007 [2251] | tab is the delimiter. " after delimiter (which also means " as first char) means that the field is delimited by quotes. as i said, it was intended to parse csv files easily, however, i think it gets in the way more often than not. there should at least be a refinement to disable this. in any case, currently the only way around it is using your own rule. |
[unknown: 5] 4-Sep-2007 [2252] | So how would you fix this problem with a rule? |
Tomc 4-Sep-2007 [2253] | data: {my string^-"<span style="font: 12px arial;>some text</span>"} rule: [copy token [to tab | to end] (insert/only tail result token) skip rule] parse/all data [(result: copy []) rule] result == ["my string" {"<span style="font: 12px arial;>some text</span>"}] |
btiffin 4-Sep-2007 [2254] | Thanks for the clarification Gabriele. Tomc et al; we rebols really really need a place for long term storage of these types of 'work arounds' :) |
Tomc 4-Sep-2007 [2255x2] | don't see it as a workaround. If I am using parse/all I always have a rule in a block and not a simple string. |
I see using parse data string as a shortcut that I can rarely afford | |
btiffin 4-Sep-2007 [2257x2] | It's kinda why I put work around in quotes, as it isn't really a workaround, more a means to an end. |
Still think we need a hints and tips pile somewhere :) | |
[unknown: 5] 5-Sep-2007 [2259] | thanks Tomc. |
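Editor's sketch: the tab-splitting rule discussed above can also be written without recursion. This is only an illustration, assuming REBOL 2 semantics for parse/all; the names split-on-tab, field and result are made up for the example and do not come from the thread.

```rebol
REBOL []

; Split a string on tab only. Double quotes get no special treatment,
; because an explicit rule block is used instead of a delimiter string.
split-on-tab: func [data [string!] /local result field] [
    result: copy []
    parse/all data [
        any [
            ; TO TAB fails once no tab remains, which ends the loop
            copy field to tab
            ; an empty field may be copied as NONE, so fall back to ""
            (append result any [field copy ""])
            skip    ; step over the tab itself
        ]
        ; whatever follows the last tab is the final field
        copy field to end
        (append result any [field copy ""])
    ]
    result
]

data: {my string^-"<span style="font: 12px arial;>some text</span>"}
probe split-on-tab data
; == ["my string" {"<span style="font: 12px arial;>some text</span>"}]
```

Each pass through the loop consumes at least the tab character, so the rule always advances, and the quote after the tab stays part of the second field.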
PatrickP61 5-Sep-2007 [2260x2] | Hi all, have any of you written a parser to handle .rtf files? I am trying to create a simple template file that I can parse against to identify Underlined, Bold, Italic, or Regular field values. Example (since I cannot bold, italicize, or underline within AltME, please pretend to see what I'm saying).
Config file, as it looks when typed in WordPad:
Default Arial font 10 * Regular Courier New font 11 * Italic * Bold
Bold Italic * Regular Underline * Regular Strikeout
Regular Underline Strikeout
Bold Italic Underline Strikeout
The same file when viewed in Notepad:
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fswiss\fcharset0 Arial;}{\f1\fmodern\fprq1\fcharset0 Courier New;}}
{\colortbl ;\red0\green0\blue0;}
{\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\f0\fs20 Default Arial font 10 * \cf1\f1\fs22 Regular Courier New font 11 * \i Italic * \b\i0 Bold\par
\i Bold Italic * \ul\b0\i0 Regular Underline * \ulnone\strike Regular Strikeout\par
\ul Regular Underline Strikeout\par
\b\i Bold Italic Underline Strikeout\par
\cf0\ulnone\b0\i0\strike0\f0\fs20\par
}
I can guess that fs20 refers to the default Arial font, while fs22 is the Courier New font.
\i italic
\b bold
\ul underline
\par may mean newline
I am not sure what I want the parser to return as results, and was wondering if someone has already made a generic parser of .rtf files, or can point me to info regarding them? |
I found some info in Wiki | |
Tomc 5-Sep-2007 [2262] | http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnrtfspec/html/rtfspec.asp |
PatrickP61 5-Sep-2007 [2263] | Wow, I never realized how incredibly extensive RTF is. The ONLY thing I need is to identify the character position and length of Regular, Italic, Bold, Underline, or Strikeout and the text, so in my above example, maybe the parser could return this:
Note: birsu stands for Bold, Italic, Regular, Strikeout, Underline.
Line Pos  Len  birsu Text
1    1    24   ..r.. "Default Arial font 10 * "
1    25   (n)  ..r.. "Regular Courier New font 11 * "
1    (..) (..) .i... "Italic * "
1    (..) (..) b.... "Bold"(newline) <-- note \i0 turns off italic
2    1    14   bi... "Bold Italic * " <-- note \b is still in effect from a previous setting
2    15   (..) ..r.u "Regular Underline * " <-- note \i\b is turned off.
2    (..) (..) ..rs. "Regular Strikeout"(newline)
3    1    (..) ..rsu "Regular Underline Strikeout"(newline)
4    1    (..) bi.su "Bold Italic Underline Strikeout"(newline)
Ideas on how to do this as a start? |
Gregg 10-Sep-2007 [2264] | First, you may need to spend some time with PARSE, so you're *really* comfortable with it. Taking on something like RTF--even just a subset--is going to be a sizable task. I would start by identifying the escapes (backslash words) and figuring out how you're going to maintain state as attributes are applied and removed. |
PatrickP61 10-Sep-2007 [2265] | Hey Gregg -- That is just what I've been doing. I have identified the following:
1. All printable \ { and } characters show up in RTF as a backslash followed by the special character, like \\ \{ or \}. Any remaining \, {, or } will be RTF commands.
2. { } and ; identify groupings: the open brace starts a group and the close brace terminates it within the RTF. The semicolon is used to terminate sub parameters for a particular command.
3. \xxx will always identify a particular command with an optional number appended to it. Example: \b means bold while \b0 means bold off.
What I am toying with is to define simple rules to break apart a string of the RTF commands and embedded text into two parts, the command part and a parameter part (some parameters may be a block of multiple values). I'm studying the Parse command to see what I can do simply and progress from there. |
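Editor's sketch: the three observations above could be turned into a first-pass tokenizer roughly like this. It is a hedged illustration under REBOL 2 parse/all semantics, not a complete RTF reader; the function name tokenize-rtf and the token labels (command, text, group) are invented here, negative numeric arguments are not handled, and control symbols such as \* are simply skipped.

```rebol
REBOL []

letter:  charset [#"a" - #"z" #"A" - #"Z"]
digit:   charset "0123456789"
special: charset "\{}"
plain:   complement charset "\{}^/"   ; text: anything but \ { } and newline

; Break an RTF string into a block of [type value ...] tokens.
tokenize-rtf: func [rtf [string!] /local tokens ch name num txt] [
    tokens: copy []
    parse/all rtf [
        any [
            ; point 1: escaped literal characters \\ \{ \}
            "\" copy ch special (append/only tokens reduce ['text ch])
            ; point 3: \word with an optional number; NUM is NONE when absent.
            ; A single space after the command is a delimiter, not text.
            | "\" copy name some letter copy num any digit opt " "
              (append/only tokens reduce ['command name num])
            ; any other control symbol (for example \*) is skipped
            | "\" skip
            ; point 2: group delimiters
            | copy ch ["{" | "}"] (append/only tokens reduce ['group ch])
            ; raw line breaks in the file are not part of the text
            | newline
            | copy txt some plain (append/only tokens reduce ['text txt])
        ]
    ]
    tokens
]

probe tokenize-rtf {\b\i Bold Italic * \ul\b0\i0 Regular\par}
```

Walking the resulting block while keeping a small set of on/off flags (b, i, ul, strike) would be one way to maintain the attribute state Gregg mentions.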
Steeve 16-Oct-2007 [2266x2] | i know your script, Gabriele, and other similar scripts; i just think we could write a grammar more concisely using reflexive rules |
I am aware that it increases the complexity of the parser understanding but it is just an intellectual exercise for the moment | |
Graham 16-Nov-2007 [2268x2] | How to reliably break a block of text up by whitespace? |
I tried parse/all text "^/^- " but I still get large blocks of text as one | |
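Editor's sketch: one way to split on whitespace with an explicit rule rather than a delimiter string. It assumes REBOL 2 parse/all semantics and treats only space, tab and newline as whitespace; the names split-on-whitespace, whitespace and non-space are made up for the example. Whether the quote handling Gabriele describes earlier in this thread is what produced the large blocks is a guess, but an explicit rule avoids it either way.

```rebol
REBOL []

whitespace: charset " ^-^/"          ; space, tab, newline
non-space:  complement whitespace

; Return a block of the whitespace-separated pieces of TEXT.
split-on-whitespace: func [text [string!] /local words word] [
    words: copy []
    parse/all text [
        any [
            some whitespace          ; skip a run of separators
            | copy word some non-space (append words word)
        ]
    ]
    words
]

probe split-on-whitespace {one two^-three^/"four" five}
; == ["one" "two" "three" {"four"} "five"]
```

Both alternatives consume at least one character, so the loop stops cleanly at the end of the input.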