World: r3wp
[Parse] Discussion of PARSE dialect
Chris 11-May-2008 [2522x2] | Assuming you want to assign values to function locals from the external parse rules, you can a) bind as you are doing, b) create a larger context for the function encompassing your rules or c) compile the parse rule, either on creation of the function or for each instance. a) rule: [set tag tag!] test: func [data /local tag][bind rule 'data parse data rule tag] b) test: use [tag][ rule: [set tag tag!] func [data][parse data rule tag] ] c) rule: [set tag tag!] test: func [data /local tag] compose/only [parse data (rule) tag] Also, note that when you bind, it alters the original block -- no need to reassign to a new word. |
When it comes to complex rules, I opt for b). At that, I'd go for context [] where there are a lot of associated words... | |
Henrik 12-May-2008 [2524] | the function is recursive, so that may put a twist on b). I forgot that detail with BIND on a) so thanks for that. c) seems to work best. |
amacleod 15-May-2008 [2525x4] | I'm just not getting the hang of parsing. I've read tutorials and looked at scripts, but when I try to adapt them to my work it fails. |
I'm trying to parse a text document that I've formatted into lines of text with blank lines between, similar to the make-doc format | |
Most lines begin with a section number (2.), or a sub-section (2.3) or a sub-sub-section (2.3.5). | |
I've got rules to find each: (some digit "." some space) etc. and it works. I've been able to copy the text following with (copy text thru end) but how do I copy the section number? | |
Oldes 15-May-2008 [2529x2] | ch_section: charset "0123456789." parse/all "2.1.3 line" [copy section some ch_section copy rest to end] probe reduce [section rest] ;== ["2.1.3" " line"] |
or something like that: ch_digits: charset "0123456789" r_section: [pos1: some [some ch_digits opt #"."] pos2: (section: copy/part pos1 pos2)] parse/all "2.3.4 line" [r_section copy rest to end] probe reduce [section rest] ;== ["2.3.4" " line"] | |
BrianH 16-May-2008 [2531x3] | If the section numbers always end with a period, you can do this: some [some digits "."] If the section numbers don't end with period you can do this: some digits any ["." some digits] |
Look up recursive descent parsing, and take note of the difference between left recursion and right recursion. | |
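The two section-number patterns BrianH gives can be compared side by side. This is a minimal sketch, assuming REBOL 2 and that `digits` is a charset of decimal digits (the thread does not show his definition); the rule names `with-period` and `without-period` are made up for illustration.

```rebol
REBOL []

; Assumption: digits is a character set, so "some digits" means one or more digit characters.
digits: charset "0123456789"

; Every number is followed by a period, e.g. "2." or "2.3.5."
with-period: [some [some digits "."]]

; Periods only separate the numbers, e.g. "2" or "2.3.5"
without-period: [some digits any ["." some digits]]

probe parse/all "2.3.5." with-period      ; true
probe parse/all "2.3.5"  without-period   ; true
probe parse/all "2.3.5"  with-period      ; false: the final "5" has no period after it
probe parse/all "2.3.5." without-period   ; false: the trailing "." is left unconsumed
```

Neither rule alone covers a document that mixes "3." with "3.1", which is the inconsistency the later messages deal with.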
Chris 16-May-2008 [2534] | Don't want to add too much, but with parse you can really build up a vocabulary based on the patterns you know: section: [integer! ["." | 1 4 ["." integer!]]] ; -- or whatever rule covers all permutations chars-sp: charset " " space: [some chars-sp] parse/all [copy sn section space [to newline | to end]] Vocabularies are easy to wrap in their own context too. Note also that [integer!] is a shorthand for [some digit] -- very useful : ) |
amacleod 16-May-2008 [2535x4] | Oldes, thanks for your suggestion. It works when I do a simple one line rule as you suggested but when I try to use multiple rules it fails. Example of what I'm trying to do: Example of the text document: |
3. CONSTRUCTION OF PORTABLE ALUMINUM LADDERS 3.1 Aluminum ladders are divided into two basic types of construction, viz:, solid beam and truss. 3.1.1 Solid Beam Aluminum Construction- This type of ladder has a solid side rail construction with aluminum rungs connecting with the side rails at fourteen inch intervals. The connection is generally either by a welded joint between rung and side rails, or by an expansion plug pinching the rung tightly to the side rails and internal backup plates. (Figure 2 A) 3.1.2 Aluminum Truss Construction- In the aluminum truss design, the top and bottom rails are connected to rung assemblies or rung blocks by rivets. The rungs are either welded or expansion plugged to the rung plate assemblies, which are supported by the top and bottom rails. (Figure 2B) 3.2 The base of the portable aluminum ladder is provided with either steel spikes or swiveling rubber safety shoes and aluminum spikes. For ladders equipped with the swiveling device, the rubber pads should be utilized when the ladder is to be raised and used on hard surfaces. (Figure 2A, 2B) 3. CONSTRUCTION OF PORTABLE ALUMINUM LADDERS | |
space: charset " ^-" spaces: [some space] chars: complement charset " ^-^/" digit: charset "0123456789" digits: [some digit] section: [digits "." some space] sub-sec: [digits "." digits spaces] sub-sub-sec: [digits "." digits "." digits spaces] rules: [heading some parts done] (where heading is the first line of the text file) parts: [newline | section format_section | sub-section | sub-sub-section] format_section: copy sec section copy rest to newline (print reduce [sec rest]) | |
If I use the format_section code directly with parse it works, but I get nothing when I redirect it to another line. The above code is similar to what Carl used in his text to html script. | |
BrianH 16-May-2008 [2539] | Any reason that the headings with one number have a trailing period and the rest don't? |
amacleod 16-May-2008 [2540] | BrianH, sorry Brian, the text above is just from a random and simpler section of the document. If I copied from the beginning, the first line would not have a number at all. |
BrianH 16-May-2008 [2541] | Actually, the inconsistency affects the parse rules. I ask again... |
amacleod 16-May-2008 [2542] | I thought you meant the document heading... No reason, but my rules account for it. The rules work in simpler tests.. |
BrianH 16-May-2008 [2543] | Are you creating the documents or are others doing so? For that matter, does it just go to 3 levels of numbers? |
amacleod 16-May-2008 [2544] | The docs come from PDFs that I have converted to text and tried to reformat by hand to the simplest form while preserving the structure of the doc. In addition to sections, sub-sections and sub-sub-sections there are numbered lists, letter lists, photos/diagrams, and tables to deal with. I thought I'd start with sorting out the sections and tackle the rest later. |
BrianH 16-May-2008 [2545] | Well, first of all you need to put the longer matches first in your alternates, so they will be tested first. |
amacleod 16-May-2008 [2546x2] | in the above code the following will work: format_section: [copy rest to newline (print reduce [rest ]) but this fails: format_section: [copy sec section copy rest to newline (print reduce [sec rest]) |
longer matches... This is where I get lost in parse. What do you mean? | |
BrianH 16-May-2008 [2548] | It checks the alternates (sections separated by | ) in order. If there is ambiguity, the way to get it to go for the longest match is to check for that match first. |
amacleod 16-May-2008 [2549] | so check sub-sub-sections then sub-section then sections in that order? |
BrianH 16-May-2008 [2550] | Yes, or combine them (which I will demonstrate). |
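The ordering point can be seen with two alternates that share a prefix. This is a minimal sketch, assuming REBOL 2; the rule names `short-first` and `long-first` are hypothetical and not taken from amacleod's script.

```rebol
REBOL []

digit: charset "0123456789"
digits: [some digit]

; Same two alternates, different order. PARSE takes the first one that matches.
short-first: [digits "." | digits "." digits]
long-first:  [digits "." digits | digits "."]

parse/all "3.1 Aluminum" [copy num short-first copy rest to end]
probe reduce [num rest]   ; ["3." "1 Aluminum"]

parse/all "3.1 Aluminum" [copy num long-first copy rest to end]
probe reduce [num rest]   ; ["3.1" " Aluminum"]
```

With the shorter alternate first, PARSE is satisfied by "3." and never tries the longer one, so the "1" ends up in the body text.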
amacleod 16-May-2008 [2551] | How does parse evaluate the rule and document? Does it check each rule through the whole doc, or line first, and then go back and check with the next alternate, and so on? |
BrianH 16-May-2008 [2552x4] | section: [some digits (level: 1) ["." some digits (level: level + 1) | "."]] |
Note that I did the longer alternate first. | |
But I made a mistake. | |
section: [some digits (level: 1) [some ["." some digits (level: level + 1)] | "."]] | |
amacleod 16-May-2008 [2556x2] | This will give me a hit on any section or sub or sub sub? I may want to do something different depending on each. Does this allow me to? |
Sorry, that is what level: is for? | |
BrianH 16-May-2008 [2558] | Yup. |
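To see how `level` comes out for each kind of heading, the corrected combined rule can be run against three sample lines. This is a minimal sketch, assuming REBOL 2 and that `digits` is a charset; the sample strings are shortened from the ladder text, and `result` is a hypothetical collector.

```rebol
REBOL []

digits: charset "0123456789"   ; assumption: a charset, as "some digits" suggests

; BrianH's corrected rule: count one level per dotted number, or accept a lone trailing period.
section: [some digits (level: 1) [some ["." some digits (level: level + 1)] | "."]]

result: copy []
foreach line ["3. CONSTRUCTION" "3.1 Aluminum ladders" "3.1.1 Solid Beam"] [
    if parse/all line [copy num section copy rest to end] [
        repend result [level num]
    ]
]
probe result   ; [1 "3." 2 "3.1" 3 "3.1.1"]
```

A `switch level [...]` after the match would then be one way to do something different for sections, sub-sections and sub-sub-sections.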
amacleod 16-May-2008 [2559] | I'll play with this . Thanks |
BrianH 16-May-2008 [2560x2] | If you are making your decisions on a per-line basis, you might consider doing a read/lines and parsing each line individually, maintaining your own state to tell you where you are in the greater document. It's the only way to parse documents greater than memory in size. |
Well, at least the only way that doesn't rely on deep magic :) | |
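The per-line approach might look roughly like this. It is a sketch under assumptions: the literal block `lines` stands in for the result of `read/lines` on a real file, `digits` is assumed to be a charset, and the flat `counts` block is a made-up way of keeping state between lines.

```rebol
REBOL []

digits: charset "0123456789"
section: [some digits (level: 1) [some ["." some digits (level: level + 1)] | "."]]

; Stand-in for: lines: read/lines %some-file.txt
lines: [
    "3.1 Aluminum ladders are divided"
    "into two basic types."
    ""
    "3.2 The base of the ladder"
]

; State kept across lines: section number followed by how many text lines belong to it.
counts: copy []
foreach line lines [
    either parse/all line [copy num section to end] [
        repend counts [num 1]                  ; a new section starts here
    ][
        if all [not empty? line  not empty? counts] [
            ; continuation line: bump the count of the most recent section
            change back tail counts 1 + last counts
        ]
    ]
]
probe counts   ; ["3.1" 2 "3.2" 1]
```

Because each line is parsed on its own, only the current line and the small state block need to be held while the loop runs.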
Chris 16-May-2008 [2562] | Reminder: [integer!] is shorthand for [some digit] : ) |
PeterWood 17-May-2008 [2563] | ..but only for values between -2**31 to 2**31 -1 >> parse [1] [integer!] == true >> parse reduce [2147483647] [integer!] == true >> parse reduce [2147483648] [integer!] == false |
Chris 17-May-2008 [2564] | String parsing too: parse "1234" [integer!] == true |
Anton 17-May-2008 [2565] | BrianH, eh? read/lines would still try to read the whole document wouldn't it ? Or are you just suggesting that as a way which is then easily modified to allow larger than memory documents? |
Gregg 17-May-2008 [2566] | I think the string parsing behavior might go away in R3 Chris. Without support for other types as well, not many people seem to use it. |
Chris 17-May-2008 [2567] | That would suck -- I use it. Seems like a common enough scenario.... |
BrianH 19-May-2008 [2568] | I mean you can do open/lines/direct and stream - then you would only need the memory for one line and a state machine. |
Anton 20-May-2008 [2569] | Right, that makes sense. |
Josh 3-Jun-2008 [2570x2] | I'm finally digging into parse now, but I have a question about HTML. Big idea: pulling the data out of an HTML table (made in Word--ugh!). Where I am stuck: Is there a way to create a rule for opening tags such as <tr> that include a lot of formatting: i.e. <tr style="mso........> ? I want to pull the info in between the opening and closing tags. |
Here is some data: | |