World: r3wp

[Parse] Discussion of PARSE dialect

amacleod
16-May-2008
[2559]
I'll play with this . Thanks
BrianH
16-May-2008
[2560x2]
If you are making your decisions on a per-line basis, you might consider 
doing a read/lines and parsing each line individually, maintaining 
your own state to tell you where you are in the greater document. 
It's the only way to parse documents greater than memory in size.
Well, at least the only way that doesn't rely on deep magic :)
Chris
16-May-2008
[2562]
Reminder: [integer!] is shorthand for [some digit] : )
PeterWood
17-May-2008
[2563]
...but only for values between -2**31 and 2**31 - 1

>> parse [1] [integer!]
== true

>> parse reduce [2147483647] [integer!]
== true

>> parse reduce [2147483648] [integer!]
== false
Chris
17-May-2008
[2564]
String parsing too: parse "1234" [integer!] == true
Anton
17-May-2008
[2565]
BrianH, eh? read/lines would still try to read the whole document, 
wouldn't it?

Or are you just suggesting that as a way which is then easily modified 
to allow larger-than-memory documents?
Gregg
17-May-2008
[2566]
I think the string parsing behavior might go away in R3, Chris. Without 
support for other types as well, not many people seem to use it.
Chris
17-May-2008
[2567]
That would suck -- I use it.  Seems like a common enough scenario....
BrianH
19-May-2008
[2568]
I mean you can do open/lines/direct and stream - then you would only 
need the memory for one line and a state machine.
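
A minimal sketch of that streaming approach, assuming REBOL 2. The file name %big-doc.txt, the state names and the "Section" marker are hypothetical, and it assumes PICK on a /direct port returns none at end of file:

```rebol
digit: charset "0123456789"
state: 'header                          ; hypothetical state variable
port: open/lines/direct %big-doc.txt    ; hypothetical file name

while [line: pick port 1] [             ; assumed: none at end of file ends the loop
    switch state [
        ; wait for a line that starts with the (made-up) section marker
        header [if parse/all line ["Section" to end] [state: 'body]]
        ; inside the section, print lines that begin with digits
        body   [if parse/all line [some digit to end] [print line]]
    ]
]
close port
```

Only the current line and the STATE word are held in memory at any time.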
Anton
20-May-2008
[2569]
Right, that makes sense.
Josh
3-Jun-2008
[2570x5]
I'm finally digging into parse now, but I have a question about HTML. 
Big idea: pulling the data out of an HTML table (made in Word--ugh!). 
Where I am stuck: is there a way to create a rule for opening tags 
such as <tr> that include a lot of formatting, i.e. <tr style="mso........>? 
I want to pull the info in between the opening and closing tags.
Here is some data:
<tr style='mso-yfti-irow:6;mso-row-margin-left:.18%;mso-row-margin-right:20.4%'>

  <td style='mso-cell-special:placeholder;border:none;padding:0in 0in 
  0in 0in'
  width="0%"><p class='MsoNormal'>&nbsp;</td>

  <td width="23%" colspan=2 style='width:23.6%;padding:0in 0in 0in 
  0in'>

  <p class=MsoNormal><span style='font-family:"Lucida Sans Unicode"'>&nbsp;MNDLDA09Mar03a_e<o:p></o:p></span></p>
  </td>
with the </tr> at the end
I came up with a rule:  [some [thru "<td" thru ">" y: to "</td>" 
(a: remove-each tag load/markup y [tag? tag])]]  but it seems to 
not be as efficient as it could be.
Brock
3-Jun-2008
[2575]
wouldn't you use "copy y " instead of "y:"?
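
A sketch of the COPY variant, assuming REBOL 2. The sample string is a cut-down stand-in for the Word markup, and the words ROWS and PARTS are made up for illustration. It also assumes no cell is empty, since COPY would then set Y to none:

```rebol
html: {<td width="0%"><p class='MsoNormal'>&nbsp;</td><td colspan=2><p><span>Item A</span></p></td>}
rows: copy []

parse/all html [
    any [
        ; skip the opening tag and its attributes, then copy only this cell
        thru "<td" thru ">" copy y to "</td>"
        (
            parts: load/markup y               ; block of tags and strings
            remove-each item parts [tag? item] ; keep only the strings
            append/only rows parts
        )
    ]
    to end
]
probe rows
```

With COPY, load/markup only sees the text of one cell, not everything from Y: to the end of the document.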
Geomol
3-Jun-2008
[2576x2]
Josh, if you do a load/markup on the whole string, you get a block 
with tags and strings. You can then pick the strings from the block, 
maybe doing TRIM on them to strip newlines and spaces. Like:

blk: load/markup your-data
foreach f blk [if all [string? f "" <> trim f] [print f]]
If you wanna use PARSE, you can do something like:


parse your-data [some ["<" thru ">" | copy y to "<" (if "" <> trim 
y [print y])]]
Chris
3-Jun-2008
[2578x2]
I've been toying with this to obtain a very parsable "dialect" -- 
my goal being to scrape live game updates from a certain sports web 
site (for personal use, natch).  It's reliant on 'parse-xml though, 
so ymmv....

do http://www.ross-gill.com/r/scrape.r
probe load-xml some-xml
Result is a little like:

	from -- <tag attr="attribute">Content</tag>
	to -- <tag> /attr attribute "Content"
Anton
4-Jun-2008
[2580]
Josh, calling REMOVE-EACH so often is what makes your parse slow. 
A remove operation in the middle of a large string is slow, and you 
are doing many removes. That's why the others suggested using copy.
Josh
6-Jun-2008
[2581]
Thanks for the input.  I will have to play around with those later 
as I am trying to get this finished up and then I can go back and 
clean up the code. The data is minimal enough for the script to finish 
in under a second anyway.   Parse is pretty sweet.   Makes this much 
neater than the alternative
Anton
7-Jun-2008
[2582]
No worries.
amacleod
30-Jun-2008
[2583]
I'm trying to copy some text from the position found while parsing 
a document.
I'm using something like: 


rule: [some digit copy text to newline]    (-- where "digit" has been 
defined as all digits 0 to 9)

This copies everything after the digits. How would I copy the digits 
themselves as well?
Brock
30-Jun-2008
[2584x2]
would it not simply be....    to some digit    instead of what you 
have above?  I'll start playing around and see if I can be of any 
help (if you haven't already figured it out)
Not as easy as it seemed to be.  Will take more time than I have 
right now.
amacleod
30-Jun-2008
[2586]
Is there a difference between using "to" and "thru"?
[unknown: 5]
30-Jun-2008
[2587]
yes
Graham
30-Jun-2008
[2588]
is this block parsing?
[unknown: 5]
30-Jun-2008
[2589]
"to" stops just before the match and "thru" moves past it, so thru includes the match
amacleod
30-Jun-2008
[2590x2]
No
So to newline does not include the newline?
[unknown: 5]
30-Jun-2008
[2592]
no it wouldn't
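
A short illustration of the difference, assuming REBOL 2 string parsing (the words A, B and REST are made up):

```rebol
parse/all "abc^/def" [copy a to newline copy rest to end]
; a is "abc"; rest is "^/def" (the newline has not been consumed)

parse/all "abc^/def" [copy b thru newline copy rest to end]
; b is "abc^/"; rest is "def"
```

To stop before the newline and then step over it, follow `to newline` with `skip`.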
Graham
30-Jun-2008
[2593x2]
rule: [ digit copy text to newline skip ]
parse stuff [ some rule ]
digits: [ some digit ]
rule: [ digits ... ]
amacleod
30-Jun-2008
[2595]
Graham, would the digit string be included in the copied text with 
this?
Graham
30-Jun-2008
[2596x3]
nope
because it matches digit and the cursor moves on
past the digit
amacleod
30-Jun-2008
[2599]
Right. Any way to capture the digit?
[unknown: 5]
30-Jun-2008
[2600]
you can always do something like set n number!
Graham
30-Jun-2008
[2601x2]
rule: [ copy d digits  .... ]
He's using string parsing .. not block parsing
[unknown: 5]
30-Jun-2008
[2603]
yeah can't use set then.
amacleod
30-Jun-2008
[2604]
I'll try that Graham. Thanks
Graham
30-Jun-2008
[2605x3]
or
non-digits: complement digit
parse [ copy digit-text to non-digits copy text to newline skip ]
and correct syntax helps :)
[unknown: 5]
30-Jun-2008
[2608]
>> str: "193920347REBOL ROCKS!^/"
== "193920347REBOL ROCKS!^/"

>> parse str compose [some (charset "0123456789") text: copy text thru newline]
== true
>> text
== "REBOL ROCKS!^/"
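
To keep the digits as well, one option is to give COPY a sub-rule that covers them. A sketch assuming REBOL 2 string parsing; the words WHOLE and NUM are made up for illustration:

```rebol
digit: charset "0123456789"
str: "193920347REBOL ROCKS!^/"

; copy the whole line, digits included
parse/all str [copy whole [some digit to newline] skip]
; whole is "193920347REBOL ROCKS!"

; or capture the two parts separately
parse/all str [copy num some digit copy text to newline skip]
; num is "193920347"; text is "REBOL ROCKS!"
```

In both rules the trailing `skip` steps over the newline, so the rule can be repeated with SOME for multi-line input.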