r3wp [groups: 83 posts: 189283]

World: r3wp

[Parse] Discussion of PARSE dialect

[unknown: 5]
1-Sep-2007
[2220x2]
Tom, I only want to split on tabs with parse/all data tab. The problem 
is that parse also breaks apart HTML tags and more. I don't want the 
tags split, because they need to be left intact to some extent.
It seems to me that parse/all data tab doesn't ONLY split on 
the tabs but also breaks at the double quotes.
Tomc
2-Sep-2007
[2222x3]
Paul, how are you defining tab? It seems to work for me.
>> str
== {some chars "" some more chars"" and more}
>> parse/all str "^-"
== [{some chars "" some more chars"" and more}]
There are no breaks at the double quotes.
[unknown: 5]
2-Sep-2007
[2225x11]
It looks like it breaks on HTML tags that might be broken. For example, 
I was testing parse on a tab-delimited file and performing the 
following parse:
parse data "^-"
The problem is that some of these broken HTML tags cause parse to 
not work correctly. The tags contain double quotes (the result 
of an export from osCommerce).
Sorry, I was using parse/all data "^-"
It doesn't even have to be broken tags, it appears.
Just when a quote precedes the tag:
data: {my string^-"<span style="font: 12px arial;>some text</span>"}
>> parse/all data "^-"

== ["my string" "<span style=" {font: 12px arial;>some text</span>"}]
Notice that it breaks the string even where there is NOT a tab.
Is this a bug?
I've looked at this some more and it only seems to be a problem if 
the quote precedes the <span> tag. If you move the quote around, 
you get the expected parsing.
btiffin
2-Sep-2007
[2236x2]
Your example still doesn't seem to jibe with the documentation. 
Reading the docs, I would expect two strings in the output block: 
"my string" and the rest, in braces. It has something to do with 
a double quote starting a parse sequence. {"abc"def} parses as 
["abc" "def"], while { "abc"def} parses as a single string as expected: 
[{ "abc"def}]
The space after the brace seems to trigger different behaviour than 
{" with no space after the brace.  Any character actually, the bad 
behaviour is only with brace immediately followed by double quote.
[unknown: 5]
3-Sep-2007
[2238]
btiffin, use this example:
data: {my string^-"<span style="font: 12px arial;>some text</span>"}
btiffin
3-Sep-2007
[2239]
Yeah, I think the weird parsing behavior is due to the fact that 
the tab separator is followed immediately by a token that begins 
with a double quote. If you change the data to ...^- "<span... (note 
the space after the tab), the behaviour changes, giving
>> parse/all data2 to string! tab

== ["my string" { "<span style="font: 12px arial;>some text</span>"}]

As I would expect.  You've uncovered something here.  parse seems 
dependent on quote as the first symbol in a token.
[unknown: 5]
3-Sep-2007
[2240x2]
Yeah but inserting the extra space is a crude workaround that still 
requires extra processing to then remove the space that was added. 
 You think this is a bug with parse?
I would add it to Rambo but not sure if it is one just yet.
RobertS
3-Sep-2007
[2242]
is there something I could test on 2.7.5 ?
[unknown: 5]
3-Sep-2007
[2243x2]
Sure, test this:
parse/all data: {my string^-"<span style="font: 12px arial;>some text</span>"} "^-"
btiffin
3-Sep-2007
[2245]
Paul;  re; Inserting extra space... No sorry, didn't mean to imply 
that.  Just pointing out that you've discovered a bug; afaik.
[unknown: 5]
3-Sep-2007
[2246]
Yeah, I messaged Gabriele; I want him to take a look at it.
Gabriele
4-Sep-2007
[2247]
it's not a bug - parse without a rule is meant for csv parsing, and 
quotes delimit a field. it's not as useful as it was intended to 
be, but it's intentional behavior. you need to provide your own rule 
if you don't want quotes to be parsed.
btiffin
4-Sep-2007
[2248]
Umm, that's not quite what is happening here. imho.

parse/all {"abc"def} to string! tab  should return [{"abc"def}]  
should it not?  ["abc" {def}] seems wrong.

parse/all { "abc"def} to string! tab  returns [{ "abc"def}] as expected. 
Does the quote being in the first position affect the parse behaviour 
that much?
[unknown: 5]
4-Sep-2007
[2249]
Yeah, it definitely seems like odd behavior to me. Also, isn't the 
TAB string the rule? Maybe I don't get what you're saying, Gabriele.
PeterWood
4-Sep-2007
[2250]
Paul: I don't think the TAB string counts as a rule. It is a parameter 
supplying a specified delimiter when using parse for splitting strings 
(paraphrasing the User Guide).
Gabriele
4-Sep-2007
[2251]
tab is the delimiter. " after delimiter (which also means " as first 
char) means that the field is delimited by quotes. as i said, it 
was intended to parse csv files easily, however, i think it gets 
in the way more often than not. there should at least be a refinement 
to disable this. in any case, currently the only way around it is 
using your own rule.
[unknown: 5]
4-Sep-2007
[2252]
So how would you fix this problem with a rule?
Tomc
4-Sep-2007
[2253]
data: {my string^-"<span style="font: 12px arial;>some text</span>"}

; copy up to the next tab (or to the end), store the field,
; skip the tab, then recurse for the next field
rule: [copy token [to tab | to end] (insert/only tail result token) skip rule]
parse/all data [(result: copy []) rule]
result
== ["my string" {"<span style="font: 12px arial;>some text</span>"}]
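
The same idea can also be written without recursion. This is only a sketch: split-on-tab is a hypothetical name, and turning empty fields into empty strings (copy can yield none for an empty match) is a choice made here, not part of the rule above.

```rebol
split-on-tab: func [
    "Sketch: split a string on tabs only, leaving quotes and tags alone"
    data [string!]
    /local result field
] [
    result: copy []
    parse/all data [
        any [
            ; a field followed by a tab; SKIP consumes the tab itself
            copy field to tab skip (append result any [field copy ""])
        ]
        ; whatever remains after the last tab is the final field
        copy field to end (append result any [field copy ""])
    ]
    result
]

probe split-on-tab {my string^-"<span style="font: 12px arial;>some text</span>"}
```

With the sample data this should give the same two fields as the recursive rule, with the quoted span left in one piece.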
btiffin
4-Sep-2007
[2254]
Thanks for the clarification Gabriele.  Tomc et al; we rebols really 
really need a place for long term storage of these types of 'work 
arounds'  :)
Tomc
4-Sep-2007
[2255x2]
I don't see it as a workaround. If I am using parse/all I always have 
a rule in a block and not a simple string.
I see using parse data string as a shortcut that I can rarely afford
btiffin
4-Sep-2007
[2257x2]
It's kinda why I put work around in quotes, as it isn't really a 
workaround, more a means to an end.
Still think we need a hints and tips pile somewhere  :)
[unknown: 5]
5-Sep-2007
[2259]
thanks Tomc.
PatrickP61
5-Sep-2007
[2260x2]
Hi all,    Have any of you written a parser to handle .rtf files?


I am trying to create a simple template file that I can parse against 
to identify Underlined, Bold, Italic, or Regular field values.


Example:  (Since I cannot Bold, Italic, or underline within Altme, 
please pretend to see what I'm saying).

 Config File: (when I typed the following using WordPad) looks like 
 this


Default Arial font 10 * Regular Courier New font 11 * Italic * Bold
Bold Italic * Regular Underline * Regular Strikeout
Regular Underline Strikeout
Bold Italic Underline Strikeout

	Same file when using Notepad to view:


{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fswiss\fcharset0 Arial;}{\f1\fmodern\fprq1\fcharset0 Courier New;}}
{\colortbl ;\red0\green0\blue0;}
{\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\f0\fs20 Default Arial font 10 * \cf1\f1\fs22 Regular Courier New font 11 * \i Italic * \b\i0 Bold\par
\i Bold Italic * \ul\b0\i0 Regular Underline * \ulnone\strike Regular Strikeout\par
\ul Regular Underline Strikeout\par
\b\i Bold Italic Underline Strikeout\par
\cf0\ulnone\b0\i0\strike0\f0\fs20\par
}


I can guess that fs20 refers to the default Arial font, while FS22 
is the Courier New font.
\i italic
\b bold
\ul underline
\par may mean newline
 

I am not sure of what I want the parser to return the results as 
and was wondering if someone has already made a generic parser of 
.rtf files, or can point me out to info regarding them?
I found some info in Wiki
Tomc
5-Sep-2007
[2262]
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnrtfspec/html/rtfspec.asp
PatrickP61
5-Sep-2007
[2263]
Wow, I never realized how incredibly extensive RTF is.


The ONLY thing I need is to identify the character position and length 
of Regular, Italic, Bold, Underline, or Strikeout and the text, so 
in my above example, maybe the parser could return this:  Note: birsu 
stands for Bold, Italic, Regular, Strikeout, Underline.
Line  Pos   Len   birsu  Text
1     1     24    ..r..  "Default Arial font 10 * "
1     25    (n)   ..r..  "Regular Courier New font 11 * "
1     (..)  (..)  .i...  "Italic * "
1     (..)  (..)  b....  "Bold"(newline)   <-- note \i0 turns off italic
2     1     14    bi...  "Bold Italic * "   <-- note \b is still in effect from a previous setting
2     15    (..)  ..r.u  "Regular Underline * "   <-- note \i and \b are turned off
2     (..)  (..)  ..rs.  "Regular Strikeout"(newline)
3     1     (..)  ..rsu  "Regular Underline Strikeout"(newline)
4     1     (..)  bi.su  "Bold Italic Underline Strikeout"(newline)

Ideas on how to do this as a start?
Gregg
10-Sep-2007
[2264]
First, you may need to spend some time with PARSE, so you're *really* 
comfortable with it. Taking on something like RTF--even just a subset--is 
going to be a sizable task. I would start by identifying the escapes 
(backslash words) and figuring out how you're going to maintain state 
as attributes are applied and removed.
PatrickP61
10-Sep-2007
[2265]
Hey Gregg -- That is just what I've been doing.  I have identified 
the following:

1. Every printable \, { and } shows up in RTF escaped with a backslash, 
as \\, \{ or \}. Any remaining \, { or } belongs to an RTF command.

2. { and } identify groupings within the RTF: the open brace starts 
a group and the close brace terminates it. The semicolon is used 
to terminate sub-parameters for a particular command.

3. \xxx will always identify a particular command, with an optional 
number appended to it. Example: \b means bold, while \b0 means bold 
off.


What I am toying with is to define simple rules to break apart a 
string of the RTF commands and embedded text into two parts, the 
command part and a parameter part.  (some parameters may be a block 
of multiple values).


I'm studying the Parse command to see what I can do simply and progress 
from there.
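
As a starting point, the three observations above can be sketched as a PARSE rule that tracks state while it scans. Everything here is an assumption for illustration: scan-rtf and its [bold? italic? text] result layout are hypothetical, only \b and \i are tracked, and groups, line and position counting, underline and strikeout are left out.

```rebol
letter:  charset [#"a" - #"z" #"A" - #"Z"]
digit:   charset "0123456789"
special: charset "\{}"
plain:   complement special

scan-rtf: func [
    "Sketch: return a block of [bold? italic? text] triples"
    rtf [string!]
    /local runs bold? italic? name arg text on?
] [
    runs: copy []
    bold?: italic?: false
    ; a control word turns its attribute off only when its number is 0
    on?: func [arg] [not all [arg  arg = "0"]]
    parse/all rtf [
        any [
            ; command: backslash, letters, optional number, optional space
            "\" copy name some letter  copy arg any digit  opt " " (
                switch name [
                    "b" [bold?: on? arg]
                    "i" [italic?: on? arg]
                ]
            )
            ; escaped printable character: \\  \{  \}
          | "\" copy text special (append runs reduce [bold? italic? text])
            ; run of ordinary text, tagged with the current state
          | copy text some plain (append runs reduce [bold? italic? text])
            ; anything else (group braces, \* and the like) is skipped
          | skip
        ]
    ]
    runs
]

probe scan-rtf {Plain \i Italic * \b\i0 Bold\par}
```

On that fragment the result should hold three runs: "Plain " with nothing set, "Italic * " with italic on, and "Bold" with bold on and italic off again. Fed the whole file, it would also report the font table entries as text, so the header groups need handling before this is useful.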
Steeve
16-Oct-2007
[2266x2]
I know your script, Gabriele, and other similar scripts; I just think 
we could write a grammar more concisely using reflexive rules.
I am aware that it makes the parser harder to understand, 
but it is just an intellectual exercise for the moment.
Graham
16-Nov-2007
[2268x2]
How to reliably break a block of text up by whitespace?
I tried parse/all text "^/^- " but I still get large blocks of text 
as one
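
One way to sidestep the delimiter-string form here is a rule block built on a charset. This is a sketch, assuming "whitespace" means space, tab, newline and carriage return; split-whitespace is a hypothetical name. If the text contains double quotes, the quote handling of the delimiter form discussed earlier in this thread may be one cause of the large chunks, though that is a guess without seeing the data.

```rebol
whitespace: charset " ^-^/^M"
non-space:  complement whitespace

split-whitespace: func [
    "Sketch: split text into runs of non-whitespace characters"
    text [string!]
    /local words word
] [
    words: copy []
    parse/all text [
        any [
            some whitespace                       ; drop runs of separators
          | copy word some non-space (append words word)
        ]
    ]
    words
]

probe split-whitespace {one "two three"^-four^/five}
```

Because the rule never treats a double quote specially, a quoted phrase is split at its spaces like any other text.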