World: r3wp


[Parse] Discussion of PARSE dialect

Josh 3-Jun-2008 [2570x5]	I'm finally digging into parse now, but I have a question about HTML. Big idea: pulling the data out of an HTML table (made in Word--ugh!). Where I am stuck: Is there a way to create a rule for opening tags such as <tr> that include a lot of formatting: i.e. <tr style="mso........> ? I want to pull the info in between the opening and closing tags.
	Here is some data:
	<tr style='mso-yfti-irow:6;mso-row-margin-left:.18%;mso-row-margin-right:20.4%'> <td style='mso-cell-special:placeholder;border:none;padding:0in 0in 0in 0in' width="0%"><p class='MsoNormal'> </td> <td width="23%" colspan=2 style='width:23.6%;padding:0in 0in 0in 0in'> <p class=MsoNormal><span style='font-family:"Lucida Sans Unicode"'> MNDLDA09Mar03a_e<o:p></o:p></span></p> </td>
	with the </tr> at the end
	I came up with a rule: [some [thru "<td" thru ">" y: to "</td>" (a: remove-each tag load/markup y [tag? tag])]] but it seems to not be as efficient as it could be.
Brock 3-Jun-2008 [2575]	wouldn't you use "copy y " instead of "y:"?
Geomol 3-Jun-2008 [2576x2]	Josh, if you do a load/markup on the whole string, you get a block with tags and strings. You can then pick the string from the block, maybe doing TRIM on them to sort out newlines and spaces. Like: blk: load/markup your-data foreach f blk [if all [string? f "" <> trim f] [print f]]
Geomol 3-Jun-2008 [2576x2]	If you wanna use PARSE, you can do something like: parse your-data [some ["<" thru ">" | copy y to "<" (if "" <> trim y [print y])]]
Chris 3-Jun-2008 [2578x2]	I've been toying with this to obtain a very parsable "dialect" -- my goal being to scrape live game updates from a certain sports web site (for personal use, natch). It's reliant on 'parse-xml though, so ymmv.... do http://www.ross-gill.com/r/scrape.r probe load-xml some-xml
Chris 3-Jun-2008 [2578x2]	Result is a little like: from -- <tag attr="attribute">Content</tag> to -- <tag> /attr attribute "Content"
Anton 4-Jun-2008 [2580]	Josh, using the REMOVE-EACH very often is what makes your parse slow. A remove operation in the middle of a large string is slow, and you are doing many removes. That's why the others suggested using copy.
Josh 6-Jun-2008 [2581]	Thanks for the input. I will have to play around with those later as I am trying to get this finished up and then I can go back and clean up the code. The data is minimal enough for the script to finish in under a second anyway. Parse is pretty sweet. Makes this much neater than the alternative
Anton 7-Jun-2008 [2582]	No worries.
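A minimal sketch of the copy-based variant suggested above, assuming REBOL 2. The sample string is shortened from Josh's data, and the names `html`, `cells`, and `text` are hypothetical. With `y:` the word marks a position, so `load/markup y` would see everything from that point to the end of the input on every pass; `copy y` hands it only the current cell.

```rebol
REBOL []

; Shortened sample input (illustrative, trimmed from the data above)
html: {<tr style='mso-yfti-irow:6'> <td width="0%"><p class='MsoNormal'> </td> <td width="23%"><p class=MsoNormal><span> MNDLDA09Mar03a_e</span></p> </td></tr>}

cells: copy []

parse/all html [
    some [
        thru "<td" thru ">"        ; skip the opening tag and its attributes
        copy y to "</td>"          ; copy only this cell's content
        (
            if y [                 ; in REBOL 2 an empty copy yields none
                text: copy ""
                ; keep the strings from the cell, leave the tags behind
                foreach item load/markup y [
                    if string? item [append text item]
                ]
                if not empty? trim text [append cells text]
            ]
        )
    ]
    to end
]

probe cells
```

Nothing is removed from a series here; each cell is copied out once and only its string parts are joined.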
amacleod 30-Jun-2008 [2583]	I'm trying to copy some text from the position found while parsing a document. I'm using something like: rule: [some digit copy text to newline] (where "digit" has been defined as all digits 0 to 9). This copies everything after the digit. How would I copy the digit itself as well?
Brock 30-Jun-2008 [2584x2]	would it not simply be.... to some digit instead of what you have above? I'll start playing around and see if I can be of any help (if you haven't already figured it out)
Brock 30-Jun-2008 [2584x2]	Not as easy as it seemed to be. Will take more time than I have right now.
amacleod 30-Jun-2008 [2586]	Is there a difference between using "to" and "thru"?
[unknown: 5] 30-Jun-2008 [2587]	yes
Graham 30-Jun-2008 [2588]	is this block parsing?
[unknown: 5] 30-Jun-2008 [2589]	to goes to the point and thru includes the point
amacleod 30-Jun-2008 [2590x2]	No
amacleod 30-Jun-2008 [2590x2]	So to newline does not include the newline?
[unknown: 5] 30-Jun-2008 [2592]	no it wouldn't
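A small sketch of the TO/THRU difference described above, assuming REBOL 2 (where `/all` makes PARSE treat whitespace as ordinary characters). The sample string and the words `a` and `b` are made up for illustration.

```rebol
REBOL []

line: "2.1 Hello Again^/3. Goodbye^/"

; TO stops just before the match, so the newline is not copied
parse/all line [copy a to newline to end]

; THRU stops just after the match, so the newline is copied
parse/all line [copy b thru newline to end]

probe a   ; the text "2.1 Hello Again", no newline
probe b   ; the same text followed by the newline
```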
Graham 30-Jun-2008 [2593x2]	rule: [ digit copy text to newline skip ] parse stuff [ some rule ]
Graham 30-Jun-2008 [2593x2]	digits: [ some digit ] rule: [ digits ... ]
amacleod 30-Jun-2008 [2595]	Graham, the digit String would be included in the copied text with this?
Graham 30-Jun-2008 [2596x3]	nope
	because it matches digit and the cursor moves on
	past the digit
amacleod 30-Jun-2008 [2599]	Right. Any way to capture the digit?
[unknown: 5] 30-Jun-2008 [2600]	you can always do something like set n number!
Graham 30-Jun-2008 [2601x2]	rule: [ copy d thru digits .... ]
Graham 30-Jun-2008 [2601x2]	He's using string parsing .. not block parsing
[unknown: 5] 30-Jun-2008 [2603]	yeah can't use set then.
amacleod 30-Jun-2008 [2604]	I'll try that Graham. Thanks
Graham 30-Jun-2008 [2605x3]	or
	non-digits: complement digit parse [ copy digit-text to non-digits copy text to newline skip ]
	and correct syntax helps :)
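One possible way to capture both the digits and the rest of the line in string parsing, as a sketch assuming REBOL 2. COPY is placed in front of the digit rule so the matched digits are kept instead of skipped. The words `num` and `text` and the sample string are illustrative only.

```rebol
REBOL []

digit: charset "0123456789"
str: "193920347REBOL ROCKS!^/"

parse/all str [
    copy num some digit      ; COPY wraps the rule, so the digits are kept
    copy text to newline     ; rest of the line, newline excluded
    skip                     ; step over the newline
]

probe num    ; the digits "193920347"
probe text   ; the text "REBOL ROCKS!"
```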
[unknown: 5] 30-Jun-2008 [2608x2]	>> str: "193920347REBOL ROCKS!^/"
	== "193920347REBOL ROCKS!^/"
	>> parse str compose [some (charset "0123456789") text: copy text thru newline]
	== true
	>> text
	== "REBOL ROCKS!^/"
[unknown: 5] 30-Jun-2008 [2608x2]	Something like that?
Brock 30-Jun-2008 [2610]	he was looking for the number and the string though.
amacleod 30-Jun-2008 [2611x2]	No I have a text document with section numbers in front:
	2. Hello
	2.1 Hello Again
	2.1.1 Hello already
	3. Goodbye
	I want the section number included in the copy
amacleod 30-Jun-2008 [2611x2]	It need not be included in the same copy just as long as I can record it.
[unknown: 5] 30-Jun-2008 [2613]	So you just want each line then really?
amacleod 30-Jun-2008 [2614x2]	Well it gets a little more complicated. Some parts of the document will be multilined.
amacleod 30-Jun-2008 [2614x2]	I thought it would be a simple thing that I was missing. I may need to re-think the formatting of the document.
[unknown: 5] 30-Jun-2008 [2616x2]	So even if something is multiline you would still want each line of the multiline correct?
[unknown: 5] 30-Jun-2008 [2616x2]	Or do you mean a multiline might look something like this:
	2.1 Hello
	Goodbye
	Where the second line doesn't have the preceding number?
amacleod 30-Jun-2008 [2618]	Yes and formatting may need to be retained
[unknown: 5] 30-Jun-2008 [2619]	Ahhh yes that gets a bit more complicated.
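A rough sketch of how the multi-line case might be handled, assuming REBOL 2 and assuming that a continuation line never starts with a digit and is never blank. The rule names (`sec-num`, `rest-of-line`, `cont-start`), the word `sections`, and the sample document are all hypothetical. Each body is copied with its newlines and indentation intact, so the formatting is retained.

```rebol
REBOL []

digit: charset "0123456789"
; first character of a continuation line: anything but a digit or a newline
cont-start: complement charset "0123456789^/"

; section numbers such as "2.", "2.1" or "2.1.1"
sec-num: [some digit any ["." some digit] opt "."]
rest-of-line: [to newline skip]

doc: {2. Hello
2.1 Hello Again
  continued on a second line
2.1.1 Hello already
3. Goodbye
}

sections: copy []

parse/all doc [
    some [
        copy num sec-num
        any " "
        ; first line of the section plus any continuation lines
        copy body [rest-of-line any [cont-start rest-of-line]]
        (append sections reduce [num body])
    ]
]

foreach [num body] sections [print [mold num mold body]]
```

Under these assumptions a blank line or a continuation line that begins with a digit would stop or mislead the rule, so a real document would likely need a stricter definition of where a section starts.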