World: r3wp

[Parse] Discussion of PARSE dialect

amacleod 16-May-2008 [2559]	I'll play with this . Thanks
BrianH 16-May-2008 [2560x2]	If you are making your decisions on a per-line basis, you might consider doing a read/lines and parsing each line individually, maintaining your own state to tell you where you are in the greater document. It's the only way to parse documents greater than memory in size.
BrianH 16-May-2008 [2560x2]	Well, at least the only way that doesn't rely on deep magic :)
Chris 16-May-2008 [2562]	Reminder: [integer!] is shorthand for [some digit] : )
PeterWood 17-May-2008 [2563]	..but only for values between -2^31 and 2^31 - 1: >> parse [1] [integer!] == true >> parse reduce [2147483647] [integer!] == true >> parse reduce [2147483648] [integer!] == false
Chris 17-May-2008 [2564]	String parsing too: parse "1234" [integer!] == true
Anton 17-May-2008 [2565]	BrianH, eh? read/lines would still try to read the whole document wouldn't it ? Or are you just suggesting that as a way which is then easily modified to allow larger than memory documents?
Gregg 17-May-2008 [2566]	I think the string parsing behavior might go away in R3 Chris. Without support for other types as well, not many people seem to use it.
Chris 17-May-2008 [2567]	That would suck -- I use it. Seems like a common enough scenario....
BrianH 19-May-2008 [2568]	I mean you can do open/lines/direct and stream - then you would only need the memory for one line and a state machine.
Anton 20-May-2008 [2569]	Right, that makes sense.
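A minimal sketch of the streaming approach BrianH describes, assuming REBOL 2 port behaviour (where `pick port 1` on a direct line port is assumed to return none at end of file). The file name %sample.txt, the `digit` charset and the `count` variable are hypothetical, chosen only for illustration:

```rebol
REBOL []

; Hypothetical sample file so the sketch is self-contained.
write %sample.txt "Title^/1 first item^/2 second item^/"

digit: charset "0123456789"
count: 0    ; our own state, kept across lines

; /direct avoids buffering the whole file, /lines hands back one line at a time.
port: open/direct/lines %sample.txt
while [line: pick port 1] [
    ; Parse each line on its own; only this line is held in memory.
    if parse/all line [some digit to end] [count: count + 1]
]
close port

print count    ; number of lines that start with digits
```

Only one line and the state variable are live at any time, which is what lets this style handle documents larger than memory.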
Josh 3-Jun-2008 [2570x5]	I'm finally digging into parse now, but I have a question about HTML. Big idea: pulling the data out of an HTML table (made in Word--ugh!). Where I am stuck: Is there a way to create a rule for opening tags such as <tr> that include a lot of formatting: i.e. <tr style="mso........> ? I want to pull the info in between the opening and closing tags.
	Here is some data:
	<tr style='mso-yfti-irow:6;mso-row-margin-left:.18%;mso-row-margin-right:20.4%'> <td style='mso-cell-special:placeholder;border:none;padding:0in 0in 0in 0in' width="0%"><p class='MsoNormal'> </td> <td width="23%" colspan=2 style='width:23.6%;padding:0in 0in 0in 0in'> <p class=MsoNormal><span style='font-family:"Lucida Sans Unicode"'> MNDLDA09Mar03a_e<o:p></o:p></span></p> </td>
	with the </tr> at the end
	I came up with a rule: [some [thru "<td" thru ">" y: to "</td>" (a: remove-each tag load/markup y [tag? tag])]] but it seems to not be as efficient as it could be.
Brock 3-Jun-2008 [2575]	wouldn't you use "copy y " instead of "y:"?
Geomol 3-Jun-2008 [2576x2]	Josh, if you do a load/markup on the whole string, you get a block with tags and strings. You can then pick the string from the block, maybe doing TRIM on them to sort out newlines and spaces. Like: blk: load/markup your-data foreach f blk [if all [string? f "" <> trim f] [print f]]
Geomol 3-Jun-2008 [2576x2]	If you wanna use PARSE, you can do something like: parse your-data [some ["<" thru ">" | copy y to "<" (if "" <> trim y [print y])]]
Chris 3-Jun-2008 [2578x2]	I've been toying with this to obtain a very parsable "dialect" -- my goal being to scrape live game updates from a certain sports web site (for personal use, natch). It's reliant on 'parse-xml though, so ymmv.... do http://www.ross-gill.com/r/scrape.r probe load-xml some-xml
Chris 3-Jun-2008 [2578x2]	Result is a little like: from -- <tag attr="attribute">Content</tag> to -- <tag> /attr attribute "Content"
Anton 4-Jun-2008 [2580]	Josh, using the REMOVE-EACH very often is what makes your parse slow. A remove operation in the middle of a large string is slow, and you are doing many removes. That's why the others suggested using copy.
Josh 6-Jun-2008 [2581]	Thanks for the input. I will have to play around with those later as I am trying to get this finished up and then I can go back and clean up the code. The data is minimal enough for the script to finish in under a second anyway. Parse is pretty sweet. Makes this much neater than the alternative
Anton 7-Jun-2008 [2582]	No worries.
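A rough sketch combining the suggestions above: use `copy` instead of a position marker, then let `load/markup` split each cell into tags and strings and keep only the strings. It assumes REBOL 2; the `html` sample is a shortened version of Josh's data, and the names `cells`, `cell`, `text` and `item` are made up for the example:

```rebol
REBOL []

; Shortened, hypothetical sample in the shape of the Word-generated table row.
html: {<tr style='mso-yfti-irow:6'> <td width="23%" colspan=2> <p class=MsoNormal><span> MNDLDA09Mar03a_e<o:p></o:p></span></p> </td> </tr>}

cells: copy []

parse/all html [
    any [
        ; Skip the opening tag and whatever attributes it carries.
        thru "<td" thru ">"
        copy cell to "</td>" (
            if cell [
                ; load/markup yields a block of tag! and string! values.
                text: copy ""
                foreach item load/markup cell [
                    if string? item [append text item]
                ]
                append cells trim text
            ]
        )
    ]
    to end
]

probe cells
```

Nothing is removed from the source string here, so the repeated mid-string removals Anton mentions are avoided.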
amacleod 30-Jun-2008 [2583]	I'm trying to copy some text from the position found while parsing a document. I'm using something like: rule: [some digit copy text to newline] (where "digit" has been defined as all digits 0 to 9). This copies everything after the digit. How would I copy the digit itself as well?
Brock 30-Jun-2008 [2584x2]	would it not simply be.... to some digit instead of what you have above? I'll start playing around and see if I can be of any help (if you haven't already figured it out)
Brock 30-Jun-2008 [2584x2]	Not as easy as it seemed to be. Will take more time than I have right now.
amacleod 30-Jun-2008 [2586]	Is there a difference between using "to" and "thru"?
[unknown: 5] 30-Jun-2008 [2587]	yes
Graham 30-Jun-2008 [2588]	is this block parsing?
[unknown: 5] 30-Jun-2008 [2589]	to goes to the point and thru includes the point
amacleod 30-Jun-2008 [2590x2]	No
amacleod 30-Jun-2008 [2590x2]	So to newline does not include the newline?
[unknown: 5] 30-Jun-2008 [2592]	no it wouldn't
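The to/thru difference can be seen in a small REBOL 2 sketch. The sample string and the word names `a` to `d` are only for illustration; `parse/all` is used so that whitespace, including the newline, is not skipped:

```rebol
REBOL []

str: "abc^/def"

; TO stops in front of the newline, so the newline stays in the remaining input.
parse/all str [copy a to newline copy b to end]
probe a    ; "abc"
probe b    ; "^/def"

; THRU moves past the newline, so the newline lands in the copied text.
parse/all str [copy c thru newline copy d to end]
probe c    ; "abc^/"
probe d    ; "def"
```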
Graham 30-Jun-2008 [2593x2]	rule: [ digit copy text to newline skip ] parse stuff [ some rule ]
Graham 30-Jun-2008 [2593x2]	digits: [ some digit ] rule: [ digits ... ]
amacleod 30-Jun-2008 [2595]	Graham, the digit String would be included in the copied text with this?
Graham 30-Jun-2008 [2596x3]	nope
	because it matches digit and the cursor moves on
	past the digit
amacleod 30-Jun-2008 [2599]	Right. Any way to capture the digit?
[unknown: 5] 30-Jun-2008 [2600]	you can always do something like set n number!
Graham 30-Jun-2008 [2601x2]	rule: [ copy d thru digits .... ]
Graham 30-Jun-2008 [2601x2]	He's using string parsing .. not block parsing
[unknown: 5] 30-Jun-2008 [2603]	yeah can't use set then.
amacleod 30-Jun-2008 [2604]	I'll try that Graham. Thanks
Graham 30-Jun-2008 [2605x3]	or
	non-digits: complement digit parse [ copy digit-text to non-digits copy text to newline skip ]
	and correct syntax helps :)
[unknown: 5] 30-Jun-2008 [2608]	>> str: "193920347REBOL ROCKS!^/" == "193920347REBOL ROCKS!^/" >> parse str compose [some (charset "0123456789") text: copy text thru newline] == true >> text == "REBOL ROCKS!^/"
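One possible way to capture the digits as well, sketched for REBOL 2 string parsing. The word names `num`, `text` and `line` are hypothetical; the sample string is the one from the console session above:

```rebol
REBOL []

digit: charset "0123456789"
str: "193920347REBOL ROCKS!^/"

; Capture the leading digits and the rest of the line separately.
parse/all str [copy num some digit copy text to newline skip]
probe num     ; "193920347"
probe text    ; "REBOL ROCKS!"

; Or wrap both parts in one sub-rule to capture digits and text together.
parse/all str [copy line [some digit to newline] skip]
probe line    ; "193920347REBOL ROCKS!"
```

Putting `copy` in front of `some digit` is what keeps the digits: in the original rule they were matched first and the cursor had already moved past them before the copy started.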