World: r3wp
[Parse] Discussion of PARSE dialect
Josh 3-Jun-2008 [2570x5] | I'm finally digging into parse now, but I have a question about HTML. Big idea: pulling the data out of an HTML table (made in Word--ugh!). Where I am stuck: Is there a way to create a rule for opening tags such as <tr> that include a lot of formatting, i.e. <tr style="mso........> ? I want to pull the info in between the opening and closing tags. |
Here is some data: | |
<tr style='mso-yfti-irow:6;mso-row-margin-left:.18%;mso-row-margin-right:20.4%'> <td style='mso-cell-special:placeholder;border:none;padding:0in 0in 0in 0in' width="0%"><p class='MsoNormal'> </td> <td width="23%" colspan=2 style='width:23.6%;padding:0in 0in 0in 0in'> <p class=MsoNormal><span style='font-family:"Lucida Sans Unicode"'> MNDLDA09Mar03a_e<o:p></o:p></span></p> </td> | |
with the </tr> at the end | |
I came up with a rule: [some [thru "<td" thru ">" y: to "</td>" (a: remove-each tag load/markup y [tag? tag])]] but it seems to not be as efficient as it could be. | |
Brock 3-Jun-2008 [2575] | wouldn't you use "copy y " instead of "y:"? |
Geomol 3-Jun-2008 [2576x2] | Josh, if you do a load/markup on the whole string, you get a block with tags and strings. You can then pick the strings from the block, maybe doing TRIM on them to sort out newlines and spaces. Like:
blk: load/markup your-data
foreach f blk [if all [string? f "" <> trim f] [print f]] |
If you wanna use PARSE, you can do something like: parse your-data [some ["<" thru ">" | copy y to "<" (if "" <> trim y [print y])]] | |
Chris 3-Jun-2008 [2578x2] | I've been toying with this to obtain a very parsable "dialect" -- my goal being to scrape live game updates from a certain sports web site (for personal use, natch). It's reliant on 'parse-xml though, so ymmv.... do http://www.ross-gill.com/r/scrape.r probe load-xml some-xml |
Result is a little like: from -- <tag attr="attribute">Content</tag> to -- <tag> /attr attribute "Content" | |
Anton 4-Jun-2008 [2580] | Josh, using the REMOVE-EACH very often is what makes your parse slow. A remove operation in the middle of a large string is slow, and you are doing many removes. That's why the others suggested using copy. |
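
The COPY-based approach suggested above can be sketched as follows. This is a minimal sketch, assuming REBOL 2 and a shortened, hypothetical version of the row Josh posted; the words `html`, `cells`, `cell` and `text` are illustrative and not from the original discussion. Because `copy cell to "</td>"` captures only one cell, LOAD/MARKUP works on that small string, whereas `y:` merely marks a position, so `load/markup y` would process everything from there to the end of the input.

```rebol
REBOL []

; Hypothetical sample row, shortened from the data posted above.
html: {<tr style='mso-yfti-irow:6'><td width="0%"><p class='MsoNormal'> </td><td width="23%"><p class=MsoNormal><span>MNDLDA09Mar03a_e</span></p></td></tr>}

cells: copy []

parse/all html [
    any [
        thru "<td" thru ">"        ; skip the opening tag and its attributes
        copy cell to "</td>"       ; cell holds only this cell's markup
        (
            text: copy ""
            ; COPY yields NONE for an empty match, so guard before loading
            if cell [
                foreach item load/markup cell [
                    if string? item [append text item]   ; keep strings, drop tags
                ]
            ]
            if not empty? trim text [append cells text]
        )
    ]
    to end
]

probe cells
```

The rule never removes anything from a series; it only appends the strings it wants to keep, which avoids the repeated removals discussed above.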
Josh 6-Jun-2008 [2581] | Thanks for the input. I will have to play around with those later as I am trying to get this finished up and then I can go back and clean up the code. The data is minimal enough for the script to finish in under a second anyway. Parse is pretty sweet. Makes this much neater than the alternative |
Anton 7-Jun-2008 [2582] | No worries. |
amacleod 30-Jun-2008 [2583] | I'm trying to copy some text from the position found while parsing a document. I'm using something like: rule: [some digit copy text to newline] (where "digit" has been defined as all digits 0 to 9). This copies everything after the digit. How would I copy the digit itself as well? |
Brock 30-Jun-2008 [2584x2] | would it not simply be.... to some digit instead of what you have above? I'll start playing around and see if I can be of any help (if you haven't already figured it out) |
Not as easy as it seemed to be. Will take more time than I have right now. | |
amacleod 30-Jun-2008 [2586] | Is there a difference between using "to" and "thru"? |
[unknown: 5] 30-Jun-2008 [2587] | yes |
Graham 30-Jun-2008 [2588] | is this block parsing? |
[unknown: 5] 30-Jun-2008 [2589] | to goes to the point and thru includes the point |
amacleod 30-Jun-2008 [2590x2] | No |
So to newline does not include the newline? | |
[unknown: 5] 30-Jun-2008 [2592] | no it wouldn't |
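
The difference between TO and THRU described above can be seen in a small string-parsing sketch. It assumes REBOL 2; the words `str`, `a`, `b`, `rest-a` and `rest-b` are made up for illustration.

```rebol
REBOL []

str: "abc^/def"

; TO stops in front of the newline, so the newline is left in the input.
parse/all str [copy a to newline copy rest-a to end]

; THRU moves past the newline, so the copied text includes it.
parse/all str [copy b thru newline copy rest-b to end]

probe a        ; "abc"
probe rest-a   ; "^/def"
probe b        ; "abc^/"
probe rest-b   ; "def"
```

This is why a rule such as `[copy text to newline skip]` needs the trailing SKIP: after TO, the newline still has to be consumed before the next line can be matched.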
Graham 30-Jun-2008 [2593x2] | rule: [ digit copy text to newline skip ] parse stuff [ some rule ] |
digits: [ some digit ] rule: [ digits ... ] | |
amacleod 30-Jun-2008 [2595] | Graham, the digit String would be included in the copied text with this? |
Graham 30-Jun-2008 [2596x3] | nope |
because it matches digit and the cursor moves on | |
past the digit | |
amacleod 30-Jun-2008 [2599] | Right. Any way to capture the digit? |
[unknown: 5] 30-Jun-2008 [2600] | you can always do something like set n number! |
Graham 30-Jun-2008 [2601x2] | rule: [ copy d thru digits .... ] |
He's using string parsing .. not block parsing | |
[unknown: 5] 30-Jun-2008 [2603] | yeah can't use set then. |
amacleod 30-Jun-2008 [2604] | I'll try that Graham. Thanks |
Graham 30-Jun-2008 [2605x3] | or |
non-digits: complement digit parse [ copy digit-text to non-digits copy text to newline skip ] | |
and correct syntax helps :) | |
[unknown: 5] 30-Jun-2008 [2608x2] |
>> str: "193920347REBOL ROCKS!^/"
== "193920347REBOL ROCKS!^/"
>> parse str compose [some (charset "0123456789") text: copy text thru newline]
== true
>> text
== "REBOL ROCKS!^/" |
Something like that? | |
Brock 30-Jun-2008 [2610] | he was looking for the number and the string though. |
amacleod 30-Jun-2008 [2611x2] | No, I have a text document with section numbers in front:
2. Hello
2.1 Hello Again
2.1.1 Hello already
3. Goodbye
I want the section number included in the copy. |
It need not be included in the same copy, just as long as I can record it. | |
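
One way to record the section number separately from the text is to give the number its own COPY. The following is a minimal sketch, assuming REBOL 2 string parsing and one section per line; `section-number`, `doc`, `num` and `text` are hypothetical names, and the sample mirrors the lines quoted above.

```rebol
REBOL []

digit: charset "0123456789"

; Digits, optional dotted sub-levels, optional trailing dot: 2. / 2.1 / 2.1.1
section-number: [some digit any ["." some digit] opt "."]

doc: {2. Hello
2.1 Hello Again
2.1.1 Hello already
3. Goodbye
}

parse/all doc [
    any [
        copy num section-number    ; the number is captured on its own
        any " "                    ; spaces between number and text
        copy text to newline       ; the rest of the line, without the newline
        skip                       ; consume the newline itself
        (print [mold num mold text])
    ]
]
```

Wrapping the matching rule in COPY is what keeps the digits: a bare `some digit` matches and moves the cursor past them without storing anything.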
[unknown: 5] 30-Jun-2008 [2613] | So you just want each line then really? |
amacleod 30-Jun-2008 [2614x2] | Well, it gets a little more complicated. Some parts of the document will be multilined. |
I thought it would be a simple thing that I was missing. I may need to re-think the formatting of the document. | |
[unknown: 5] 30-Jun-2008 [2616x2] | So even if something is multiline you would still want each line of the multiline correct? |
Or do you mean a multiline might look something like this:
2.1 Hello
Goodbye
where the second line doesn't have the preceding number? | |
amacleod 30-Jun-2008 [2618] | Yes, and formatting may need to be retained. |
[unknown: 5] 30-Jun-2008 [2619] | Ahhh yes that gets a bit more complicated. |
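
The multi-line case could be approached by treating every line that does not start with a section number as a continuation of the previous section. This is only a sketch under stated assumptions: REBOL 2, a section number always followed by a space, and no continuation line that itself begins with digits and a space. The words `continuation`, `next-rule`, `sections`, `num` and `body` are hypothetical and do not come from the discussion.

```rebol
REBOL []

digit: charset "0123456789"
section-number: [some digit any ["." some digit] opt "."]

; Hypothetical sample: the indented second line belongs to section 2.1.
doc: {2.1 Hello
    Goodbye
3. Done
}

; REBOL 2 PARSE has no NOT keyword, so the lookahead is emulated:
; NEXT-RULE becomes an always-failing rule when a section number is seen,
; which makes the whole CONTINUATION rule fail and backtrack.
continuation: [
    [section-number " " (next-rule: [end skip]) | (next-rule: [])]
    next-rule
    to newline skip
]

sections: copy []

parse/all doc [
    any [
        copy num section-number " "
        copy body [to newline skip any continuation]   ; newlines and indentation kept
        (append sections reduce [num body])
    ]
]

foreach [num body] sections [print [mold num mold body]]
```

Because `body` is copied verbatim, including line breaks and leading spaces, the original formatting of each section is retained.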