World: r3wp
[Parse] Discussion of PARSE dialect
btiffin 3-Sep-2007 [2239] | Yeah, I think the weird parsing behavior is due to the fact that the tab separator is followed immediately by a token that begins with a double quote. If you change the data to ... ^- "<span... (note the space after the tab), the behaviour changes, giving >> parse/all data2 to string! tab == ["my string" { "<span style="font: 12px arial;>some text</span>"}] as I would expect. You've uncovered something here. parse seems dependent on quote as the first symbol in a token. |
[unknown: 5] 3-Sep-2007 [2240x2] | Yeah but inserting the extra space is a crude workaround that still requires extra processing to then remove the space that was added. You think this is a bug with parse? |
I would add it to Rambo but not sure if it is one just yet. | |
RobertS 3-Sep-2007 [2242] | is there something I could test on 2.7.5 ? |
[unknown: 5] 3-Sep-2007 [2243x2] | sure test this: |
parse/all data: {my string^-"<span style="font: 12px arial;>some text</span>"} "^-" | |
btiffin 3-Sep-2007 [2245] | Paul; re; Inserting extra space... No sorry, didn't mean to imply that. Just pointing out that you've discovered a bug; afaik. |
[unknown: 5] 3-Sep-2007 [2246] | Yeah, I messaged Gabriele. I want him to take a look at it. |
Gabriele 4-Sep-2007 [2247] | it's not a bug - parse without a rule is meant for csv parsing, and quotes delimit a field. it's not as useful as it was intended to be, but it's intentional behavior. you need to provide your own rule if you don't want quotes to be parsed. |
btiffin 4-Sep-2007 [2248] | Umm, that's not quite what is happening here, imho. parse/all {"abc"def} to string! tab should return [{"abc"def}], should it not? ["abc" {def}] seems wrong. parse/all { "abc"def} to string! tab returns [{ "abc"def}] as expected. The quote being in the first position affects the parse behaviour that much? |
[unknown: 5] 4-Sep-2007 [2249] | Yeah, it definitely seems like odd behavior to me. Also, isn't the TAB string the rule? Maybe I don't get what you're saying, Gabriele. |
PeterWood 4-Sep-2007 [2250] | Paul: I don't think the TAB string counts as a rule. It is a parameter supplying a specified delimiter when using parse for splitting strings (paraphrasing the User Guide). |
Gabriele 4-Sep-2007 [2251] | tab is the delimiter. " after the delimiter (which also means " as first char) means that the field is delimited by quotes. as i said, it was intended to parse csv files easily; however, i think it gets in the way more often than not. there should at least be a refinement to disable this. in any case, currently the only way around it is using your own rule. |
[unknown: 5] 4-Sep-2007 [2252] | So how would you fix this problem with a rule? |
Tomc 4-Sep-2007 [2253] | data: {my string^-"<span style="font: 12px arial;>some text</span>"}
rule: [copy token [to tab | to end] (insert/only tail result token) skip rule]
parse/all data [(result: copy []) rule]
result
["my string" {"<span style="font: 12px arial;>some text</span>"}] |
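A non-recursive variant of the same idea can be sketched as follows. It is only a minimal sketch, assuming REBOL 2 parse semantics (in particular, that COPY sets its variable to none on an empty match); the function name split-on-tab is hypothetical and not part of REBOL. The rule splits on TAB only and treats double quotes as ordinary characters.

```rebol
REBOL []

data: {my string^-"<span style="font: 12px arial;>some text</span>"}

; split-on-tab is a hypothetical helper name, not a built-in.
; It splits TEXT on TAB and gives quotes no special meaning.
split-on-tab: func [text [string!] /local result field] [
    result: copy []
    parse/all text [
        any [
            ; every field that is followed by a TAB
            copy field to tab skip
            (append result any [field copy ""])
        ]
        ; the last field, which has no trailing TAB
        copy field to end
        (append result any [field copy ""])
    ]
    result
]

probe split-on-tab data
```

With the sample data this should give a block of two strings, the second one still carrying its quotes. Looping with ANY rather than recursing keeps the rule flat, which may matter for inputs with very many fields.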
btiffin 4-Sep-2007 [2254] | Thanks for the clarification Gabriele. Tomc et al; we rebols really really need a place for long term storage of these types of 'work arounds' :) |
Tomc 4-Sep-2007 [2255x2] | don't see it as a workaround. If I am using parse/all I always have a rule in a block and not a simple string. |
I see using parse data string as a shortcut that I can rarely afford | |
btiffin 4-Sep-2007 [2257x2] | It's kinda why I put work around in quotes, as it isn't really a workaround, more a means to an end. |
Still think we need a hints and tips pile somewhere :) | |
[unknown: 5] 5-Sep-2007 [2259] | thanks Tomc. |
PatrickP61 5-Sep-2007 [2260x2] | Hi all, Have any of you written a parser to handle .rtf files? I am trying to create a simple template file that I can parse against to identify Underlined, Bold, Italic, or Regular field values. Example: (Since I cannot Bold, Italic, or underline within Altme, please pretend to see what I'm saying).
Config File: (when I typed the following using WordPad) looks like this
Default Arial font 10 * Regular Courier New font 11 * Italic * Bold
Bold Italic * Regular Underline * Regular Strikeout
Regular Underline Strikeout
Bold Italic Underline Strikeout
Same file when using Notepad to view:
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fswiss\fcharset0 Arial;}{\f1\fmodern\fprq1\fcharset0 Courier New;}}
{\colortbl ;\red0\green0\blue0;}
{\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\f0\fs20 Default Arial font 10 * \cf1\f1\fs22 Regular Courier New font 11 * \i Italic * \b\i0 Bold\par
\i Bold Italic * \ul\b0\i0 Regular Underline * \ulnone\strike Regular Strikeout\par
\ul Regular Underline Strikeout\par
\b\i Bold Italic Underline Strikeout\par
\cf0\ulnone\b0\i0\strike0\f0\fs20\par
}
I can guess that fs20 refers to the default Arial font, while FS22 is the Courier New font. \i italic \b bold \ul underline \par may mean newline
I am not sure what I want the parser to return the results as, and was wondering if someone has already made a generic parser of .rtf files, or can point me to info regarding them? |
I found some info in Wiki | |
Tomc 5-Sep-2007 [2262] | http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnrtfspec/html/rtfspec.asp |
PatrickP61 5-Sep-2007 [2263] | Wow, I never realized how incredibly extensive RTF is. The ONLY thing I need is to identify the character position and length of Regular, Italic, Bold, Underline, or Strikeout and the text, so in my above example, maybe the parser could return this:
Note: birsu stands for Bold, Italic, Regular, Strikeout, Underline.
Line Pos  Len  birsu Text
1    1    24   ..r.. "Default Arial font 10 * "
1    25   (n)  ..r.. "Regular Courier New font 11 * "
1    (..) (..) .i... "Italic * "
1    (..) (..) b.... "Bold"(newline) <-- note \i0 turns off italic
2    1    14   bi... "Bold Italic * " <-- note \b is still in effect from a previous setting
2    15   (..) ..r.u "Regular Underline * " <-- note \i\b is turned off.
2    (..) (..) ..rs. "Regular Strikeout"(newline)
3    1    (..) ..rsu "Regular Underline Strikeout"(newline)
4    1    (..) bi.su "Bold Italic Underline Strikeout"(newline)
Ideas on how to do this as a start? |
Gregg 10-Sep-2007 [2264] | First, you may need to spend some time with PARSE, so you're *really* comfortable with it. Taking on something like RTF--even just a subset--is going to be a sizable task. I would start by identifying the escapes (backslash words) and figuring out how you're going to maintain state as attributes are applied and removed. |
PatrickP61 10-Sep-2007 [2265] | Hey Gregg -- That is just what I've been doing. I have identified the following:
1. All printable \ { and } characters show up in RTF escaped with a backslash, as \\ \{ or \}. Any remaining \, {, or } is an RTF command.
2. { } and ; identify groupings: an open brace starts a group and a close brace terminates it. The semicolon terminates sub-parameters for a particular command.
3. \xxx always identifies a particular command, with an optional number appended to it. Example: \b means bold while \b0 means bold off.
What I am toying with is to define simple rules to break apart a string of RTF commands and embedded text into two parts, the command part and a parameter part (some parameters may be a block of multiple values). I'm studying the Parse command to see what I can do simply and progress from there. |
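The three observations above can be turned into a first tokenizing pass, sketched below. This is only a starting point under stated assumptions: it targets REBOL 2, the names tokenize-rtf, letter, digit and plain are hypothetical, and it covers just the subset described here (escaped characters, control words with an optional number, braces, plain text), not the full RTF specification. As a side note, in the RTF specification \fsN sets the font size in half-points, while \fN selects an entry in the font table.

```rebol
REBOL []

letter: charset [#"a" - #"z" #"A" - #"Z"]
digit:  charset "0123456789"
; plain text is anything that is not a backslash, brace or line break
plain:  complement charset "\{}^/^M"

; tokenize-rtf is a hypothetical helper: it flattens an RTF string into
; a block of the form [cmd "b" none  cmd "i" "0"  text "Bold" ...]
tokenize-rtf: func [rtf [string!] /local out name arg run] [
    out: copy []
    parse/all rtf [
        any [
            ; observation 1: escaped \ { } are literal text
            "\\" (repend out ['text copy "\"])
            | "\{" (repend out ['text copy "{"])
            | "\}" (repend out ['text copy "}"])
            ; observation 3: \word with an optional numeric argument;
            ; a single space after the command is a delimiter, not text
            | "\" copy name some letter
              copy arg [opt "-" any digit]
              opt " "
              (repend out ['cmd name arg])
            ; other control symbols such as \* are skipped in this sketch
            | "\" skip
            ; observation 2: braces open and close groups
            | "{" (append out 'open)
            | "}" (append out 'close)
            | copy run some plain (repend out ['text run])
            ; line breaks in the file carry no meaning here
            | skip
        ]
    ]
    out
]

probe tokenize-rtf {\b\i0 Bold\par \i Bold Italic * \ul\b0\i0 Regular}
```

A second pass could then walk this block, keep a small state object for bold, italic, underline and strikeout, and emit position and length for each text run. Separating tokenizing from state tracking keeps each rule small.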
Steeve 16-Oct-2007 [2266x2] | i know your script Gabriele and other similar scripts , i just think we could be more concise to write a grammar using reflexive rules |
I am aware that it increases the complexity of the parser understanding but it is just an intellectual exercise for the moment | |
Graham 16-Nov-2007 [2268x4] | How to reliably break a block of text up by whitespace? |
I tried parse/all text "^/^- " but I still get large blocks of text as one | |
I guess I have to use charsets of whitespace and non-whitespace | |
just seems that it should be easier to split up a block of text by the whitespace | |
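The charset approach mentioned above can be sketched as follows. It is a minimal sketch assuming REBOL 2; split-ws, whitespace and non-space are hypothetical names. Because the rule is written out explicitly, " and { are treated like any other non-whitespace character.

```rebol
REBOL []

whitespace: charset " ^-^/^M"        ; space, tab, newline, carriage return
non-space:  complement whitespace

; split-ws is a hypothetical helper, not a built-in
split-ws: func [text [string!] /local result word] [
    result: copy []
    parse/all text [
        any [
            some whitespace
            | copy word some non-space (append result word)
        ]
    ]
    result
]

probe split-ws {say "hi" {there}^-now^/done}
```

The sample string should come back as five items, with the quoted and braced words kept as plain tokens.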
Sunanda 16-Nov-2007 [2272] | Have you tried parse/all trim/lines "..." " " |
Graham 16-Nov-2007 [2273x2] | it's getting fooled by "{" chars I think |
parse doesn't like " and { | |
Sunanda 16-Nov-2007 [2275] | That rings a bell --- I vaguely remember having to do stuff like replacing " or } with to-char 0 before doing some parses, and then changing back afterwards. That works if you have no to-char 0 in your strings |
Graham 16-Nov-2007 [2276] | I'll have to go back over my old scripts where I solved this before :( |
Oldes 16-Nov-2007 [2277] | If I remember well, this behaviour is because of CSV parsing - parse with delimiters (rules as a string) was designed mainly for that case. |
Graham 16-Nov-2007 [2278x2] | I'll try Gregg's split function |
Nice to have code snippets on line when the brain is too tired to create one's own | |
Brock 22-Nov-2007 [2280x3] | What's wrong with this? I'm trying to retrieve the "area" query string parameter out of this web log record... test: {10.200.55.63 - - [22/Oct/2007:10:32:57 -0500] "GET /irj/servlet/prt/portal/prtroot/com.cpc.km.Redirect?userid=KALEFBM&area=chm&Rurl=http://bjzprd /sellserve/displaysalesupdate.aspx?id=3815" 302 182} |
with the following parse statement... parse test [ thru "area=" copy new-area [to " " | to "?" | to "&"] to end (if debug? [print new-area]) ] | |
I expect the return to be just the characters chm; however, the remainder of the query string text is also being transferred. So the to "&" is not being considered within the rule. | |
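One likely explanation is that alternatives in a parse rule are tried in order and the first one that succeeds wins, so [to " " | to "?" | to "&"] stops at whichever alternative matches first in the list, not at the nearest delimiter in the data. A minimal sketch of one way around this, assuming REBOL 2, is to copy a run of characters that are not delimiters. The names value-end and value-char are hypothetical.

```rebol
REBOL []

test: {10.200.55.63 - - [22/Oct/2007:10:32:57 -0500] "GET /irj/servlet/prt/portal/prtroot/com.cpc.km.Redirect?userid=KALEFBM&area=chm&Rurl=http://bjzprd /sellserve/displaysalesupdate.aspx?id=3815" 302 182}

; characters that end a query-string value in this log format (an assumption)
value-end:  charset { ?&"}
value-char: complement value-end

new-area: none
parse/all test [
    thru "area="
    copy new-area some value-char   ; stops at the nearest delimiter
    to end
]
print new-area
```

For the sample record this should print chm. Using parse/all keeps spaces significant, which the delimiter set above relies on.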
Chris 22-Nov-2007 [2283] | I don't think you can use copy in that way. |
Brock 22-Nov-2007 [2284] | meaning I would need to have 3 thru... copy... to... rules? |
Steeve 22-Nov-2007 [2285] | parse/all test [thru "&area=" copy val to "&"] print val |
Chris 22-Nov-2007 [2286x3] | Hmm, no - I'm wrong. Try parse/all first though (for to " ") |
Or, instead of parse, do -- select decode-cgi find/tail string "?" to-set-word 'area | |
string = test | |