World: r3wp

[Parse] Discussion of PARSE dialect

Josh
3-Jun-2008
[2570x5]
I'm finally digging into parse now, but I have a question about HTML. 
  Big idea: pulling the data out of an HTML table (made in Word, ugh!). 
 Where I am stuck: is there a way to create a rule for opening tags 
such as <tr> that include a lot of formatting, i.e. <tr style="mso........> 
?   I want to pull the info in between the opening and closing tags.
Here is some data:
<tr style='mso-yfti-irow:6;mso-row-margin-left:.18%;mso-row-margin-right:20.4%'>

  <td style='mso-cell-special:placeholder;border:none;padding:0in 0in 
  0in 0in'
  width="0%"><p class='MsoNormal'>&nbsp;</td>

  <td width="23%" colspan=2 style='width:23.6%;padding:0in 0in 0in 
  0in'>

  <p class=MsoNormal><span style='font-family:"Lucida Sans Unicode"'>&nbsp;MNDLDA09Mar03a_e<o:p></o:p></span></p>
  </td>
with the </tr> at the end
I came up with a rule:  [some [thru "<td" thru ">" y: to "</td>" 
(a: remove-each tag load/markup y [tag? tag])]]  but it seems to 
not be as efficient as it could be.
Brock
3-Jun-2008
[2575]
wouldn't you use "copy y " instead of "y:"?
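
(A minimal sketch of the difference Brock is pointing at, assuming REBOL 2. The `data` string and the words `y-pos` and `y-copy` are made up for illustration and are not from Josh's document. A set-word only marks a position, so the word refers to everything from there to the end of the input, while `copy` extracts just the matched span.)

```rebol
data: "<td>one</td><td>two</td>"   ; hypothetical sample input

; set-word: marks a position only; the word refers to the rest of the input
parse data [thru "<td" thru ">" y-pos: to "</td>" to end]
probe y-pos     ; "one</td><td>two</td>"

; copy: extracts just the span matched by the following rule
parse data [thru "<td" thru ">" copy y-copy to "</td>" to end]
probe y-copy    ; "one"
```

If that reading is right, `load/markup y` in the original rule would process everything from the current cell to the end of the input on each pass, which may account for part of the slowness.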
Geomol
3-Jun-2008
[2576x2]
Josh, if you do a load/markup on the whole string, you get a block 
with tags and strings. You can then pick the string from the block, 
maybe doing TRIM on them to sort out newlines and spaces. Like:

blk: load/markup your-data
foreach f blk [if all [string? f "" <> trim f] [print f]]
If you wanna use PARSE, you can do something like:


parse your-data [some ["<" thru ">" | copy y to "<" (if "" <> trim y [print y])]]
Chris
3-Jun-2008
[2578x2]
I've been toying with this to obtain a very parsable "dialect" -- 
my goal being to scrape live game updates from a certain sports web 
site (for personal use, natch).  It's reliant on 'parse-xml though, 
so ymmv....

do http://www.ross-gill.com/r/scrape.r
probe load-xml some-xml
Result is a little like:

	from -- <tag attr="attribute">Content</tag>
	to -- <tag> /attr attribute "Content"
Anton
4-Jun-2008
[2580]
Josh, using the REMOVE-EACH very often is what makes your parse slow. 
A remove operation in the middle of a large string is slow, and you 
are doing many removes. That's why the others suggested using copy.
Josh
6-Jun-2008
[2581]
Thanks for the input.  I will have to play around with those later 
as I am trying to get this finished up and then I can go back and 
clean up the code. The data is minimal enough for the script to finish 
in under a second anyway.   Parse is pretty sweet.   Makes this much 
neater than the alternative
Anton
7-Jun-2008
[2582]
No worries.
amacleod
30-Jun-2008
[2583]
I'm trying to copy some text from the position found while parsing 
a document.
I'm using something like: 


rule: [some digit copy text to newline]    (where "digit" has been 
defined as all digits 0 to 9)

 This copies everything after the digit. How would I copy the digit 
 itself as well?
Brock
30-Jun-2008
[2584x2]
would it not simply be....    to some digit    instead of what you 
have above?  I'll start playing around and see if I can be of any 
help (if you haven't already figured it out)
Not as easy as it seemed to be.  Will take more time than I have 
right now.
amacleod
30-Jun-2008
[2586]
Is there a difference between using "to" and "thru"?
[unknown: 5]
30-Jun-2008
[2587]
yes
Graham
30-Jun-2008
[2588]
is this block parsing?
[unknown: 5]
30-Jun-2008
[2589]
"to" goes up to the point and "thru" includes the point
amacleod
30-Jun-2008
[2590x2]
No
So to newline does not include the newline?
[unknown: 5]
30-Jun-2008
[2592]
no it wouldn't
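
(A small sketch of the to/thru difference, assuming REBOL 2; the sample string and the words `a` and `b` are made up. `parse/all` is used so that whitespace, including the newline, is treated as ordinary input.)

```rebol
str: "abc^/def"   ; hypothetical sample: two lines separated by a newline

; TO stops just before the newline, so the copy excludes it
parse/all str [copy a to newline to end]
probe a    ; "abc"

; THRU moves past the newline, so the copy includes it
parse/all str [copy b thru newline to end]
probe b    ; "abc^/"
```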
Graham
30-Jun-2008
[2593x2]
rule: [ digit copy text to newline skip ]
parse stuff [ some rule ]
digits: [ some digit ]
rule: [ digits ... ]
amacleod
30-Jun-2008
[2595]
Graham, the digit String would be included in the copied text with 
this?
Graham
30-Jun-2008
[2596x3]
nope
because it matches digit and the cursor moves on
past the digit
amacleod
30-Jun-2008
[2599]
Right. Any way to capture the digit?
[unknown: 5]
30-Jun-2008
[2600]
you can always do something like set n number!
Graham
30-Jun-2008
[2601x2]
rule: [ copy d thru digits  .... ]
He's using string parsing .. not block parsing
[unknown: 5]
30-Jun-2008
[2603]
yeah can't use set then.
amacleod
30-Jun-2008
[2604]
I'll try that Graham. Thanks
Graham
30-Jun-2008
[2605x3]
or
non-digits: complement digit
parse [ copy digit-text to non-digits copy text to newline skip ]
and correct syntax helps :)
[unknown: 5]
30-Jun-2008
[2608x2]
>> str: "193920347REBOL ROCKS!^/"
== "193920347REBOL ROCKS!^/"

>> parse str  compose [some (charset "0123456789") text: copy text thru newline]
== true
>> text
== "REBOL ROCKS!^/"
Something like that?
Brock
30-Jun-2008
[2610]
he was looking for the number and the string though.
amacleod
30-Jun-2008
[2611x2]
No

I have a text document with section numbers in front:

2. Hello
2.1 Hello Again
2.1.1 Hello already
3. Goodbye

I want the section number included in the copy.
It need not be included in the same copy, just as long as I can record 
it.
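
(One possible sketch for the single-line case only, assuming REBOL 2 string parsing with `parse/all`. The names `number-chars`, `num`, `doc` and `sections` are hypothetical, and every line is assumed to start with a section number, be followed by one space, and end with a newline. The point is that `copy` is placed in front of the rule that matches the number, so the number is captured before the cursor moves on.)

```rebol
number-chars: charset "0123456789."   ; digits plus the dots in "2.1.1"
doc: "2. Hello^/2.1 Hello Again^/2.1.1 Hello already^/3. Goodbye^/"
sections: copy []

parse/all doc [
    some [
        copy num some number-chars     ; capture the section number itself
        " "
        copy text to newline skip      ; capture the rest of the line, then step over the newline
        (append sections reduce [num text])
    ]
]

probe sections
; ["2." "Hello" "2.1" "Hello Again" "2.1.1" "Hello already" "3." "Goodbye"]
```

This does not handle continuation lines that lack a leading number.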
[unknown: 5]
30-Jun-2008
[2613]
So you just want each line then really?
amacleod
30-Jun-2008
[2614x2]
Well it gets a little more complicated.
Some parts of the document will be multiline.
I thought it would be a simple thing that I was missing. I may need 
to re-think the formatting of the document.
[unknown: 5]
30-Jun-2008
[2616x2]
So even if something is multiline you would still want each line 
of the multiline correct?
Or do you mean a multiline might look something like this:

2.1 Hello
       Goodbye

Where the second line doesn't have the preceding number?
amacleod
30-Jun-2008
[2618]
Yes, and formatting may need to be retained
[unknown: 5]
30-Jun-2008
[2619]
Ahhh yes that gets a bit more complicated.