World: r3wp

[Parse] Discussion of PARSE dialect

BrianH
16-May-2008
[2539]
Any reason that the headings with one number have a trailing period 
and the rest don't?
amacleod
16-May-2008
[2540]
BrianH, sorry Brian, the text above is just from a random and simpler 
section of the document.

If I copied from the beginning, the first line would not have a 
number at all.
BrianH
16-May-2008
[2541]
Actually, the inconsistency affects the parse rules. I ask again...
amacleod
16-May-2008
[2542]
I thought you meant the document heading...

No reason, but my rules account for it. The rules work in simpler 
tests.
BrianH
16-May-2008
[2543]
Are you creating the documents or are others doing so? For that matter, 
does it just go to 3 levels of numbers?
amacleod
16-May-2008
[2544]
The docs come from PDFs that I have converted to text and tried 
to reformat by hand to the simplest form while preserving the structure 
of the doc. In addition to sections, sub-sections and sub-sub-sections 
there are numbered lists, letter lists, photos/diagrams, and tables 
to deal with. I thought I'd start with sorting out the sections and 
tackle the rest later.
BrianH
16-May-2008
[2545]
Well, first of all you need to put the longer matches first in your 
alternates, so they will be tested first.
amacleod
16-May-2008
[2546x2]
In the above code the following will work:
format_section: [copy rest to newline (print reduce [rest])]

but this fails:

format_section: [copy sec section copy rest to newline (print reduce [sec rest])]

"longer matches" ... This is where I get lost in parse.

What do you mean?
BrianH
16-May-2008
[2548]
It checks the alternates (sections separated by | ) in order. If 
there is ambiguity, the way to get it to go for the longest match 
is to check for that match first.
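A minimal sketch of that behavior (the heading numbers and the digits bitset here are just assumptions for the example):

digits: charset "0123456789"

parse "1.2" [some digits | some digits "." some digits]
; == false : the short alternate matches "1", ".2" is left over,
;            and PARSE does not come back to try the longer alternate

parse "1.2" [some digits "." some digits | some digits]
; == true : with the longer alternate first, the whole input is consumed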
amacleod
16-May-2008
[2549]
so check sub-sub-sections then sub-section then sections in that 
order?
BrianH
16-May-2008
[2550]
Yes, or combine them (which I will demonstrate).
amacleod
16-May-2008
[2551]
How does parse evaluate the rule and document?


Does it check each rule through the whole doc or line first, then go 
back and check with the next alternate? And so on?
BrianH
16-May-2008
[2552x4]
section: [some digits (level: 1) ["." some digits (level: level + 1) | "."]]
Note that I did the longer alternate first.
But I made a mistake.
section: [some digits (level: 1) [some ["." some digits (level: level + 1)] | "."]]
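A quick way to try that corrected rule (the charset, the LEVEL reset, and the test strings are assumptions added for the example):

digits: charset "0123456789"
level: 0
section: [some digits (level: 1) [some ["." some digits (level: level + 1)] | "."]]

parse "2.3.1" [section]   ; == true, LEVEL ends up as 3 (a sub-sub-section)
parse "4." [section]      ; == true, a one-number heading with trailing period, LEVEL is 1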
amacleod
16-May-2008
[2556x2]
This will give me a hit on any section, sub-section, or sub-sub-section?


I may want to do something different depending on which it is. Does this 
allow me to?
Sorry, that's what the level: is for?
BrianH
16-May-2008
[2558]
Yup.
amacleod
16-May-2008
[2559]
I'll play with this. Thanks.
BrianH
16-May-2008
[2560x2]
If you are making your decisions on a per-line basis, you might consider 
doing a read/lines and parsing each line individually, maintaining 
your own state to tell you where you are in the greater document. 
It's the only way to parse documents greater than memory in size.
Well, at least the only way that doesn't rely on deep magic :)
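A rough sketch of that per-line approach (the file name, the reuse of the section rule, and the state handling are assumptions for the example, not amacleod's actual code):

digits: charset "0123456789"
level: 0

foreach line read/lines %document.txt [
    either parse/all line [
        some digits (level: 1)
        [some ["." some digits (level: level + 1)] | "."]
        copy title to end
    ][
        print ["heading, level" level ":" trim title]
    ][
        ; not a section heading; body text, lists, tables, etc. handled here
    ]
]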
Chris
16-May-2008
[2562]
Reminder: [integer!] is shorthand for [some digit] : )
PeterWood
17-May-2008
[2563]
...but only for values between -2**31 and 2**31 - 1

>> parse [1] [integer!]
== true

>> parse reduce [2147483647] [integer!]
== true

>> parse reduce [2147483648] [integer!]
== false
Chris
17-May-2008
[2564]
String parsing too: parse "1234" [integer!] == true
Anton
17-May-2008
[2565]
BrianH, eh? read/lines would still try to read the whole document, 
wouldn't it?

Or are you just suggesting that as a way which is then easily modified 
to allow larger-than-memory documents?
Gregg
17-May-2008
[2566]
I think the string parsing behavior might go away in R3, Chris. Without 
support for other types as well, not many people seem to use it.
Chris
17-May-2008
[2567]
That would suck -- I use it.  Seems like a common enough scenario....
BrianH
19-May-2008
[2568]
I mean you can do open/lines/direct and stream - then you would only 
need the memory for one line and a state machine.
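Possibly something like this, though the exact /direct port idiom is an assumption here and worth checking against the docs:

port: open/direct/lines/read %big-document.txt
while [lines: copy/part port 1] [    ; one line at a time; NONE at end of file (assumption)
    line: first lines
    ; per-line PARSE rules and state updates would go here
]
close port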
Anton
20-May-2008
[2569]
Right, that makes sense.
Josh
3-Jun-2008
[2570x5]
I'm finally digging into parse now, but I have a question about HTML. 
Big idea: pulling the data out of an HTML table (made in Word--ugh!). 
Where I am stuck: is there a way to create a rule for opening tags 
such as <tr> that include a lot of formatting, i.e. <tr style="mso........> 
? I want to pull the info in between the opening and closing tags.
Here is some data:
<tr style='mso-yfti-irow:6;mso-row-margin-left:.18%;mso-row-margin-right:20.4%'>

  <td style='mso-cell-special:placeholder;border:none;padding:0in 0in 
  0in 0in'
  width="0%"><p class='MsoNormal'>&nbsp;</td>

  <td width="23%" colspan=2 style='width:23.6%;padding:0in 0in 0in 
  0in'>

  <p class=MsoNormal><span style='font-family:"Lucida Sans Unicode"'>&nbsp;MNDLDA09Mar03a_e<o:p></o:p></span></p>
  </td>
with the </tr> at the end
I came up with a rule: [some [thru "<td" thru ">" y: to "</td>" (a: remove-each tag load/markup y [tag? tag])]] but it seems to not be as efficient as it could be.
Brock
3-Jun-2008
[2575]
Wouldn't you use "copy y" instead of "y:"?
Geomol
3-Jun-2008
[2576x2]
Josh, if you do a load/markup on the whole string, you get a block 
with tags and strings. You can then pick the strings from the block, 
maybe doing TRIM on them to sort out newlines and spaces. Like:

blk: load/markup your-data
foreach f blk [if all [string? f "" <> trim f] [print f]]
If you wanna use PARSE, you can do something like:

parse your-data [some ["<" thru ">" | copy y to "<" (if "" <> trim y [print y])]]
Chris
3-Jun-2008
[2578x2]
I've been toying with this to obtain a very parsable "dialect" -- 
my goal being to scrape live game updates from a certain sports web 
site (for personal use, natch).  It's reliant on 'parse-xml though, 
so ymmv....

do http://www.ross-gill.com/r/scrape.r
probe load-xml some-xml
Result is a little like:

	from -- <tag attr="attribute">Content</tag>
	to -- <tag> /attr attribute "Content"
Anton
4-Jun-2008
[2580]
Josh, using REMOVE-EACH so often is what makes your parse slow. 
A remove operation in the middle of a large string is slow, and you 
are doing many removes. That's why the others suggested using copy.
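For what it's worth, a copy-based variant of Josh's rule as a sketch (HTML-DATA and CELLS are invented names; the point is that Y now holds only one cell's contents, so nothing large gets scanned or modified):

cells: copy []
parse html-data [
    some [
        thru "<td" thru ">" copy y to "</td>" (
            cell: load/markup y
            remove-each item cell [tag? item]   ; strip tags from the small per-cell block
            append/only cells cell
        )
    ]
]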
Josh
6-Jun-2008
[2581]
Thanks for the input.  I will have to play around with those later 
as I am trying to get this finished up and then I can go back and 
clean up the code. The data is minimal enough for the script to finish 
in under a second anyway.   Parse is pretty sweet.   Makes this much 
neater than the alternative
Anton
7-Jun-2008
[2582]
No worries.
amacleod
30-Jun-2008
[2583]
I'm trying to copy some text from the position found while parsing 
a document.
I'm using something like:

rule: [some digit copy text to newline]    (where "digit" has been 
defined as all digits 0 to 9)

This copies everything after the digit. How would I copy the digit 
itself as well?
Brock
30-Jun-2008
[2584x2]
Would it not simply be...    to some digit    instead of what you 
have above? I'll start playing around and see if I can be of any 
help (if you haven't already figured it out).
Not as easy as it seemed to be. Will take more time than I have 
right now.
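One way to keep the digits in the copy, as a sketch (wrapping the whole match in a single COPY; DIGIT is assumed to be the bitset amacleod describes):

digit: charset "0123456789"

parse/all "123 heading text^/more text" [
    copy text [some digit to newline]   ; COPY around the sub-rule includes the digits
    to end
]
; TEXT is now "123 heading text"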
amacleod
30-Jun-2008
[2586]
Is there a difference between using "to" and "thru"?
[unknown: 5]
30-Jun-2008
[2587]
yes
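The short version, as a sketch: TO stops before the thing it finds, THRU stops after it.

parse/all "abc:def" [to ":" copy rest to end]    ; TO stops before the colon, REST = ":def"
parse/all "abc:def" [thru ":" copy rest to end]  ; THRU stops after the colon, REST = "def"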
Graham
30-Jun-2008
[2588]
is this block parsing?