World: r3wp

[Parse] Discussion of PARSE dialect

[unknown: 5]
31-Aug-2007
[2214x2]
OK, I ran into an issue.  Is there an easy way to parse a string that 
has two double quotes together in it, such as {some chars "" some more 
chars"" and more}?
I need each pair of quotes reduced to a single quote, and the parse 
needs to keep the string section intact because it is often part of 
an HTML tag.
Robert
1-Sep-2007
[2216x2]
Paul, do a search & replace up front. Much simpler than creating 
complex parse rules.
I often use this pattern: do some basic action on the parse input, 
parse the first round, then do some other processing, then use 
parse again. Much simpler and faster to get where you want to go.
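
A minimal sketch of the up-front search & replace step described here, assuming the goal is simply to collapse each pair of adjacent double quotes into one. The word `str` is illustrative, and note that `replace/all` modifies the string in place:

```rebol
str: {some chars "" some more chars"" and more}

; collapse every "" pair into a single " before any parsing is done
replace/all str {""} {"}

probe str   ; {some chars " some more chars" and more}
```

Work on a `copy` of the input if the original string has to be kept.
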
Tomc
1-Sep-2007
[2218]
Paul, what rule are you using for your parse?
[unknown: 5]
1-Sep-2007
[2219x3]
Thanks Robert, I'll look into that further. I did play with replace, 
but because they were quotes it seemed that parse/all still wanted 
to break apart at a quote even though I told it only tabs.
Tom, for parse I only want to parse/all data tab.  The problem is that 
parse will break apart HTML tags and more.  I don't want to parse 
out tags because they need to be left intact to some extent.
It just seems to me that parse/all data tab doesn't ONLY split at 
the tabs but also breaks at these double quotes together.
Tomc
2-Sep-2007
[2222x3]
Paul, how are you defining tab?  It seems to work for me.
>> str
== {some chars "" some more chars"" and more}
>> parse/all str "^-"
== [{some chars "" some more chars"" and more}]
There are no breaks at the double quotes.
[unknown: 5]
2-Sep-2007
[2225x11]
It looks like it breaks on HTML tags that might be broken.  For example, 
I was testing parse on a tab-delimited file and performing the 
following parse:
parse data "^-"
The problem is that some of these broken html tags cause parse to 
not work correctly.  The tags will contain double quotes (The result 
of an export from oscommerce).
sorry i was using parse/all data "^-"
Doesn't even have to be broken tags it appears
Just when a quote is preceding the tag
data: {my string^-"<span style="font: 12px arial;>some text</span>"}
>> parse/all data "^-"

== ["my string" "<span style=" {font: 12px arial;>some text</span>"}]
Notice you get it breaking the string even where there is NOT a tab.
Is this a bug?
I've looked at this some more and it only seems to be a problem if 
the quote immediately precedes the <span> tag.  If you move the quote 
around you get the expected parsing.
btiffin
2-Sep-2007
[2236x2]
Your example still doesn't seem to jibe with the documentation.  
Reading the docs, I would expected two strings in the output block. 
 "my string" and the rest, in braces.  It has something to do with 
a double quote starting a parse sequence.   {"abc"def} parses as 
["abc" "def"]  { "abc"def"} parses as a single string as expected 
[{ "abc"def}]
typos; expected = expect  second example was supposed to be { "abc"def}


The space after the brace seems to trigger different behaviour than 
{" with no space after the brace.  Any character actually, the bad 
behaviour is only with brace immediately followed by double quote.
[unknown: 5]
3-Sep-2007
[2238]
btiffin, use this example:
data: {my string^-"<span style="font: 12px arial;>some text</span>"}
btiffin
3-Sep-2007
[2239]
Yeah, I think the weird parsing behavior is due to the fact that 
the tab separator is followed immediately by a token that begins 
with a double quote.  If you change the data to ... ^- "<span...  (note 
the space after the tab),  the behaviour changes, giving
>> parse/all data2 to string! tab

== ["my string" { "<span style="font: 12px arial;>some text</span>"}]

As I would expect.  You've uncovered something here.  parse seems 
dependent on quote as the first symbol in a token.
[unknown: 5]
3-Sep-2007
[2240x2]
Yeah but inserting the extra space is a crude workaround that still 
requires extra processing to then remove the space that was added. 
 You think this is a bug with parse?
I would add it to Rambo but not sure if it is one just yet.
RobertS
3-Sep-2007
[2242]
is there something I could test on 2.7.5 ?
[unknown: 5]
3-Sep-2007
[2243x2]
sure test this:
parse/all data: {my string^-"<span style="font: 12px arial;>some text</span>"} "^-"
btiffin
3-Sep-2007
[2245]
Paul;  re; Inserting extra space... No sorry, didn't mean to imply 
that.  Just pointing out that you've discovered a bug; afaik.
[unknown: 5]
3-Sep-2007
[2246]
Yeah, I messaged Gabriele. I want him to take a look at it.
Gabriele
4-Sep-2007
[2247]
it's not a bug - parse without a rule is meant for csv parsing, and 
quotes delimit a field. it's not as useful as it was intended to 
be, but it's intentional behavior. you need to provide your own rule 
if you don't want quotes to be parsed.
btiffin
4-Sep-2007
[2248]
Umm, that's not quite what is happening here. imho.

parse/all {"abc"def} to string! tab  should return [{"abc"def}]  
should it not?  ["abc" {def}] seems wrong.

parse/all { "abc"def} to string! tab  returns [{ "abc"def}] as expected. 
 The quote being in the first position affects the parse behaviour 
that much?
[unknown: 5]
4-Sep-2007
[2249]
Yeah, it definitely seems like odd behavior to me.  Also, isn't the 
TAB string the rule?  Maybe I don't get what you're saying, Gabriele.
PeterWood
4-Sep-2007
[2250]
Paul: I don't think the TAB string counts as a rule. It is a parameter 
supplying a specified delimiter when using parse for splitting strings 
(paraphrasing the User Guide).
Gabriele
4-Sep-2007
[2251]
tab is the delimiter. " after delimiter (which also means " as first 
char) means that the field is delimited by quotes. as i said, it 
was intended to parse csv files easily, however, i think it gets 
in the way more often than not. there should at least be a refinement 
to disable this. in any case, currently the only way around it is 
using your own rule.
[unknown: 5]
4-Sep-2007
[2252]
So how would you fix this problem with a rule?
Tomc
4-Sep-2007
[2253]
data: {my string^-"<span style="font: 12px arial;>some text</span>"}

; copy up to the next tab (or to the end), store the field, skip the tab, recurse
rule: [copy token [to tab | to end] (insert/only tail result token) 
skip rule]
parse/all data [(result: copy []) rule]
result


 ["my string" {"<span style="font: 12px arial;>some text</span>"}]
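
An iterative variant of the rule above, offered as a sketch and not as the canonical way to do it. It uses `any` instead of recursion, so the parse reaches the end of the input and does not rely on the final `skip` failing. The words `field` and `result` are illustrative. The `any [field copy ""]` guard assumes that an empty field (two tabs in a row) may leave `field` set to none, as REBOL 2 does:

```rebol
data: {my string^-"<span style="font: 12px arial;>some text</span>"}
result: copy []

parse/all data [
    ; every field that is followed by a tab
    any [copy field to tab skip (append result any [field copy ""])]
    ; the last field, which runs to the end of the input
    copy field to end (append result any [field copy ""])
]

probe result
```

Because the rule is a block and not a plain delimiter string, the CSV-style quote handling never comes into play, and the data should split only at the tab.
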
btiffin
4-Sep-2007
[2254]
Thanks for the clarification Gabriele.  Tomc et al; we rebols really 
really need a place for long term storage of these types of 'work 
arounds'  :)
Tomc
4-Sep-2007
[2255x2]
don't see it as a workaround. If I am using parse/all I always have 
a rule in a block and not a simple string.
I see using parse data string as a shortcut that I can rarely afford
btiffin
4-Sep-2007
[2257x2]
It's kinda why I put work around in quotes, as it isn't really a 
workaround, more a means to an end.
Still think we need a hints and tips pile somewhere  :)
[unknown: 5]
5-Sep-2007
[2259]
thanks Tomc.
PatrickP61
5-Sep-2007
[2260x2]
Hi all,    Have any of you written a parser to handle .rtf files?


I am trying to create a simple template file that I can parse against 
to identify Underlined, Bold, Italic, or Regular field values.


Example:  (Since I cannot Bold, Italic, or underline within Altme, 
please pretend to see what I'm saying).

 Config File: (when I typed the following using WordPad) looks like 
 this


Default Arial font 10 * Regular Courier New font 11 * Italic * Bold
Bold Italic * Regular Underline * Regular Strikeout
Regular Underline Strikeout
Bold Italic Underline Strikeout

	Same file when using Notepad to view:


{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fswiss\fcharset0 
Arial;}{\f1\fmodern\fprq1\fcharset0 Courier New;}}
{\colortbl ;\red0\green0\blue0;}

{\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\f0\fs20 
Default Arial font 10 * \cf1\f1\fs22 Regular Courier New font 11 
* \i Italic * \b\i0 Bold\par

\i Bold Italic * \ul\b0\i0 Regular Underline * \ulnone\strike Regular 
Strikeout\par
\ul Regular Underline Strikeout\par
\b\i Bold Italic Underline Strikeout\par
\cf0\ulnone\b0\i0\strike0\f0\fs20\par
}


I can guess that \f0 and \f1 select the Arial and Courier New fonts, 
while \fs20 and \fs22 give the font sizes (10 and 11) in half-points.
\i italic
\b bold
\ul underline
\par may mean newline
 

I am not sure what form I want the parser to return the results in, 
and was wondering if someone has already made a generic parser for 
.rtf files, or can point me to info about them?
I found some info in Wiki
Tomc
5-Sep-2007
[2262]
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnrtfspec/html/rtfspec.asp
PatrickP61
5-Sep-2007
[2263]
Wow, I never realized how incredibly extensive RTF is.


The ONLY thing I need is to identify the character position and length 
of Regular, Italic, Bold, Underline, or Strikeout and the text, so 
in my above example, maybe the parser could return this:  Note: birsu 
stands for Bold, Italic, Regular, Strikeout, Underline.
Line  Pos   Len   birsu  Text
1     1     24    ..r..  "Default Arial font 10 * "
1     25    (n)   ..r..  "Regular Courier New font 11 * "
1     (..)  (..)  .i...  "Italic * "
1     (..)  (..)  b....  "Bold"(newline)                             <-- note \i0 turns off italic
2     1     14    bi...  "Bold Italic * "                            <-- note \b is still in effect from a previous setting
2     15    (..)  ..r.u  "Regular Underline * "                      <-- note \i and \b are turned off
2     (..)  (..)  ..rs.  "Regular Strikeout"(newline)
3     1     (..)  ..rsu  "Regular Underline Strikeout"(newline)
4     1     (..)  bi.su  "Bold Italic Underline Strikeout"(newline)

Ideas on how to do this as a start?
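
One possible starting point, as a rough sketch and not a real RTF parser: walk the text with parse, flip a flag whenever a \b, \i, \strike, \ul or \ulnone control word goes by, and record the current flags with each run of plain text. Every name here (`set-flag`, `flags`, `runs`, `b-on` and so on) is made up for illustration. The sketch assumes the header groups (font table, colour table, generator) have already been stripped, and it does not yet track line, position or length, handle line breaks inside the RTF source, or restore state at a closing brace:

```rebol
letters: charset [#"a" - #"z" #"A" - #"Z"]
digits:  charset "0123456789"
plain:   complement charset "\{}"      ; anything that is not RTF syntax

b-on: i-on: s-on: u-on: false          ; current formatting state
runs: copy []                          ; collected [flags text flags text ...]

; build the five-character birsu string from the current state
flags: does [
    rejoin [
        either b-on ["b"] ["."]
        either i-on ["i"] ["."]
        either any [b-on i-on] ["."] ["r"]
        either s-on ["s"] ["."]
        either u-on ["u"] ["."]
    ]
]

; a control word with argument 0 (e.g. \b0) switches the property off
set-flag: func [name [string!] arg [none! string!] /local on?] [
    on?: arg <> "0"
    switch name [
        "b"      [b-on: on?]
        "i"      [i-on: on?]
        "strike" [s-on: on?]
        "ul"     [u-on: on?]
        "ulnone" [u-on: false]
    ]
]

; \word, optional numeric argument, optional single delimiting space
control: [
    "\" copy name some letters
    (arg: none) opt [copy arg [opt "-" some digits]]
    opt " "
    (set-flag name arg)
]

body: {\i Italic * \b\i0 Bold\par}

parse/all body [
    any [
        control
        | "\" skip                     ; control symbols such as \*
        | "{" | "}"                    ; group nesting is ignored here
        | copy txt some plain (repend runs [flags txt])
    ]
]

probe runs
```

On this fragment the block should come out as [".i..." "Italic * " "b...." "Bold"], which matches the first line of the table above. Line and position counters could be updated in the same paren that stores each run, with \par resetting the position and bumping the line number.
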