r3wp [groups: 83 posts: 189283]

World: r3wp

[Parse] Discussion of PARSE dialect

btiffin
3-Sep-2007
[2239]
Yeah, I think the weird parsing behavior is due to the fact that 
the tab separator is followed immediately by a token that begins 
with a double quote.  If you change the data to ... ^- "<span...  (note 
the space after the tab),  the behaviour changes, giving
>> parse/all data2 to string! tab

== ["my string" { "<span style="font: 12px arial;>some text</span>"}]

As I would expect.  You've uncovered something here.  parse seems 
dependent on quote as the first symbol in a token.
[unknown: 5]
3-Sep-2007
[2240x2]
Yeah but inserting the extra space is a crude workaround that still 
requires extra processing to then remove the space that was added. 
 You think this is a bug with parse?
I would add it to Rambo but not sure if it is one just yet.
RobertS
3-Sep-2007
[2242]
is there something I could test on 2.7.5 ?
[unknown: 5]
3-Sep-2007
[2243x2]
sure test this:
parse/all data: {my string^-"<span style="font: 12px arial;>some text</span>"} "^-"
btiffin
3-Sep-2007
[2245]
Paul;  re; Inserting extra space... No sorry, didn't mean to imply 
that.  Just pointing out that you've discovered a bug; afaik.
[unknown: 5]
3-Sep-2007
[2246]
Yeah, I messaged Gabriele - I want him to take a look at it.
Gabriele
4-Sep-2007
[2247]
it's not a bug - parse without a rule is meant for csv parsing, and 
quotes delimit a field. it's not as useful as it was intended to 
be, but it's intentional behavior. you need to provide your own rule 
if you don't want quotes to be parsed.
btiffin
4-Sep-2007
[2248]
Umm, that's not quite what is happening here. imho.

parse/all {"abc"def} to string! tab  should return [{"abc"def}]  
should it not?  ["abc" {def}] seems wrong.

parse/all { "abc"def} to string! tab  returns [{ "abc"def}] as expected. 
 The quote being in the first position affects the parse behaviour 
that much?
[unknown: 5]
4-Sep-2007
[2249]
Yeah, it definitely seems like odd behavior to me.  Also, isn't the 
TAB string the rule?  Maybe I don't get what you're saying, Gabriele.
PeterWood
4-Sep-2007
[2250]
Paul: I don't think the TAB string counts as a rule. It is a parameter 
supplying a specified delimiter when using parse for splitting strings 
(paraphrasing the User Guide).
Gabriele
4-Sep-2007
[2251]
tab is the delimiter. " after the delimiter (which also means " as the 
first char) means that the field is delimited by quotes. as i said, it 
was intended to parse csv files easily, however, i think it gets 
in the way more often than not. there should at least be a refinement 
to disable this. in any case, currently the only way around it is 
using your own rule.
[unknown: 5]
4-Sep-2007
[2252]
So how would you fix this problem with a rule?
Tomc
4-Sep-2007
[2253]
data: {my string^-"<span style="font: 12px arial;>some text</span>"}

rule: [copy token [to tab | to end] (insert/only tail result token) skip rule]
parse/all data [(result: copy []) rule]
result


 ["my string" {"<span style="font: 12px arial;>some text</span>"}]
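
A non-recursive variant of the same idea, as a minimal sketch assuming REBOL 2 parse semantics (the names fields and field are made up for illustration). It collects every tab-terminated field in a loop and then takes the remainder as the last field, so the quote characters are never treated specially:

```rebol
data: {my string^-"<span style="font: 12px arial;>some text</span>"}
fields: copy []

parse/all data [
    ; every field that is followed by a tab
    any [
        copy field to tab (append fields any [field copy ""])
        skip    ; step over the tab itself
    ]
    ; the last field runs to the end of the input
    copy field to end (append fields any [field copy ""])
]

probe fields
```

The any [field copy ""] guard is there because copy can leave the variable as none when a field is empty (two tabs in a row, or a trailing tab).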
btiffin
4-Sep-2007
[2254]
Thanks for the clarification Gabriele.  Tomc et al; we rebols really 
really need a place for long term storage of these types of 'work 
arounds'  :)
Tomc
4-Sep-2007
[2255x2]
don't see it as a workaround. If I am using parse/all I always have 
a rule in a block and not a simple string.
I see using parse data string as a shortcut that I can rarely afford
btiffin
4-Sep-2007
[2257x2]
It's kinda why I put work around in quotes, as it isn't really a 
workaround, more a means to an end.
Still think we need a hints and tips pile somewhere  :)
[unknown: 5]
5-Sep-2007
[2259]
thanks Tomc.
PatrickP61
5-Sep-2007
[2260x2]
Hi all,    Have any of you written a parser to handle .rtf files?


I am trying to create a simple template file that I can parse against 
to identify Underlined, Bold, Italic, or Regular field values.


Example:  (Since I cannot Bold, Italic, or underline within Altme, 
please pretend to see what I'm saying).

 Config File: (when I typed the following using WordPad) looks like 
 this


Default Arial font 10 * Regular Courier New font 11 * Italic * Bold
Bold Italic * Regular Underline * Regular Strikeout
Regular Underline Strikeout
Bold Italic Underline Strikeout

	Same file when using Notepad to view:


{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fswiss\fcharset0 
Arial;}{\f1\fmodern\fprq1\fcharset0 Courier New;}}
{\colortbl ;\red0\green0\blue0;}

{\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\f0\fs20 
Default Arial font 10 * \cf1\f1\fs22 Regular Courier New font 11 
* \i Italic * \b\i0 Bold\par

\i Bold Italic * \ul\b0\i0 Regular Underline * \ulnone\strike Regular 
Strikeout\par
\ul Regular Underline Strikeout\par
\b\i Bold Italic Underline Strikeout\par
\cf0\ulnone\b0\i0\strike0\f0\fs20\par
}


I can guess that fs20 refers to the default Arial font, while FS22 
is the Courier New font.
\i italic
\b bold
\ul underline
\par may mean newline
 

I am not sure of what I want the parser to return the results as 
and was wondering if someone has already made a generic parser of 
.rtf files, or can point me out to info regarding them?
I found some info in Wiki
Tomc
5-Sep-2007
[2262]
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnrtfspec/html/rtfspec.asp
PatrickP61
5-Sep-2007
[2263]
Wow, I never realized how incredibly extensive RTF is.


The ONLY thing I need is to identify the character position and length 
of Regular, Italic, Bold, Underline, or Strikeout and the text, so 
in my above example, maybe the parser could return this:  Note: birsu 
stands for Bold, Italic, Regular, Strikeout, Underline.
Line  Pos   Len   birsu  Text
1     1     24    ..r..  "Default Arial font 10 * "
1     25    (n)   ..r..  "Regular Courier New font 11 * "
1     (..)  (..)  .i...  "Italic * "
1     (..)  (..)  b....  "Bold"(newline)                             <-- note \i0 turns off italic
2     1     14    bi...  "Bold Italic * "                            <-- note \b is still in effect from a previous setting
2     15    (..)  ..r.u  "Regular Underline * "                      <-- note \b0\i0 turn bold and italic off
2     (..)  (..)  ..rs.  "Regular Strikeout"(newline)
3     1     (..)  ..rsu  "Regular Underline Strikeout"(newline)
4     1     (..)  bi.su  "Bold Italic Underline Strikeout"(newline)

Ideas on how to do this as a start?
Gregg
10-Sep-2007
[2264]
First, you may need to spend some time with PARSE, so you're *really* 
comfortable with it. Taking on something like RTF--even just a subset--is 
going to be a sizable task. I would start by identifying the escapes 
(backslash words) and figuring out how you're going to maintain state 
as attributes are applied and removed.
PatrickP61
10-Sep-2007
[2265]
Hey Gregg -- That is just what I've been doing.  I have identified 
the following:

1. All printable \ { and } characters show up in RTF preceded by a 
backslash, like \\   \{  or \} . Any remaining \, {, or } will be 
RTF commands.

2.  {  }  and ; identify groupings with the open brace and terminating 
the group with close brace within the RTF.  The semicolon is used 
to terminate sub parameters for a particular command.

3.  \xxx  will always identify a particular command with an optional 
number appended to it.  Example: \b  means bold while \b0 means bold 
off.


What I am toying with is to define simple rules to break apart a 
string of the RTF commands and embedded text into two parts, the 
command part and a parameter part.  (some parameters may be a block 
of multiple values).


I'm studying the Parse command to see what I can do simply and progress 
from there.
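
The three observations above could be turned into a first tokenizer along these lines. This is only a rough sketch, assuming REBOL 2 parse semantics; the sample string and every name in it (sample, tokens, letter, digit, special, text-char) are made up for illustration, and it ignores most of the RTF specification:

```rebol
sample: {\b Bold \i and italic\b0\i0 plain}

letter:    charset [#"a" - #"z" #"A" - #"Z"]
digit:     charset "0123456789"
special:   charset "\{}"
text-char: complement special

tokens: copy []
parse/all sample [
    any [
        ; escaped literal character: \\ \{ or \}
        "\" copy ch special (repend tokens ['text ch])
        ; control word with optional numeric argument, e.g. \b or \b0;
        ; one space after the command is a delimiter, not text
        | "\" copy name some letter copy arg [opt "-" any digit] opt " "
            (repend tokens ['command name arg])
        ; any other control symbol (such as \*) is skipped in this sketch
        | "\" skip
        ; group delimiters
        | copy ch ["{" | "}"] (repend tokens ['group ch])
        ; a run of plain text
        | copy txt some text-char (repend tokens ['text txt])
    ]
]

probe tokens
```

Each token is stored as a type word followed by its value, and commands carry a third item that holds the numeric argument, or none when there is no number. Tracking which attributes are currently on (the state Gregg mentions) would be a separate pass over this block.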
Steeve
16-Oct-2007
[2266x2]
i know your script Gabriele and other similar scripts , i just think 
we could be more concise to write a grammar using reflexive rules
I am aware that it increases the complexity of the parser understanding 
but it is just an intellectual exercise for the moment
Graham
16-Nov-2007
[2268x4]
How to reliably break a block of text up by whitespace?
I tried parse/all text "^/^- " but I still get large blocks of text 
as one
I guess I have to use charsets of whitespace and non-whitespace
just seems that it should be easier to split up a block of text by 
the whitespace
Sunanda
16-Nov-2007
[2272]
Have you tried 
 parse/all trim/lines "..." " "
Graham
16-Nov-2007
[2273x2]
it's getting fooled by "{" chars I think
parse doesn't like " and  {
Sunanda
16-Nov-2007
[2275]
That rings a bell --- I vaguely remember having to do stuff like 
replacing 
   " or }
with
    to-char 0
before doing some parses, and then changing back afterwards.
That works if you have no to-char 0 in your strings
Graham
16-Nov-2007
[2276]
I'll have to go back over my old scripts where I solved this before 
:(
Oldes
16-Nov-2007
[2277]
If I remember well, this behaviour is because of CSV parsing - parse 
with delimiters (rules as a string) was designed mainly for that 
case.
Graham
16-Nov-2007
[2278x2]
I'll try Gregg's split function
Nice to have code snippets on line when the brain is too tired to 
create one's own
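
For reference, the charset approach mentioned earlier might look like the sketch below. It assumes REBOL 2 parse semantics, and the sample text and the names whitespace, non-space, and words are invented for illustration. Because a block rule is used instead of a delimiter string, " and { are treated as ordinary characters:

```rebol
text: {say "hi"^-to {all}^/of you}

whitespace: charset " ^-^/"
non-space:  complement whitespace

words: copy []
parse/all text [
    any [
        some whitespace
        | copy word some non-space (append words word)
    ]
]

probe words
```

With this sample the block should end up holding six strings, with the quotes and braces kept as part of the words they belong to.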
Brock
22-Nov-2007
[2280x3]
What's wrong with this?  I'm trying to retrieve the "area" query 
string parameter out of this web log record...

test: {10.200.55.63 - - [22/Oct/2007:10:32:57 -0500] "GET /irj/servlet/prt/portal/prtroot/com.cpc.km.Redirect?userid=KALEFBM&area=chm&Rurl=http://bjzprd/sellserve/displaysalesupdate.aspx?id=3815" 302 182}
with the following parse statement...
parse test [
	thru "area="
	copy new-area
	[to " " | to "?" | to "&"]
	to end
	(if debug? [print new-area])
]
I expect the return to be just the characters   chm, however the 
remainder of the querystring text is also being transferred.  So the 
   to "&"     is not being considered within the rule.
Chris
22-Nov-2007
[2283]
I don't think you can use copy in that way.
Brock
22-Nov-2007
[2284]
meaning I would need to have 3   thru... copy... to...  rules?
Steeve
22-Nov-2007
[2285]
parse/all test [thru "&area=" copy val to "&"]
print val
Chris
22-Nov-2007
[2286x3]
Hmm, no - I'm wrong.  Try parse/all first though (for to " ")
Or, instead of parse, do -- select decode-cgi find/tail string "?" 
to-set-word 'area
string = test
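
On the original rule: alternatives in a parse block are tried in order and the first one that succeeds wins, so to "&" is only reached if both earlier to rules fail. One way to stop at whichever delimiter comes first is to copy a run of non-delimiter characters instead. A minimal sketch, assuming REBOL 2 parse semantics, with a shortened, made-up log line and a hypothetical value-char charset:

```rebol
line: {10.0.0.1 - - [22/Oct/2007:10:32:57 -0500] "GET /app?userid=abc&area=chm&Rurl=http://host/page" 302 182}

; characters that may appear in the value: anything except the delimiters
value-char: complement charset { &?"}

parse/all line [
    thru "area="
    copy new-area some value-char
    to end
]

print new-area
```

Here the copy stops at the & that follows chm, because & is not in value-char.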