World: r3wp

[Parse] Discussion of PARSE dialect

["test" '| 1 1 123]
I think my parse rules use lots of temporary variables. How do you 
prefer to hide these?
1: Hide them from whom, and why?

In general, if you want to hide something about your parse rules, 
you need to hide the parse rules altogether. That is not to say that 
it is a good idea; I've found that in most cases where someone wants 
to hide some code or variables in REBOL, they really want to do something 
else and the something else depends on the circumstances. What do 
you hope to accomplish?

2: You have to be careful with temporary variables.

REBOL parse rules are often recursive, and the temporary variables 
used with them are not. You have to be extra careful to not recurse 
to another trip through the same parse rule before you are done with 
the temporary variables in the first round, or put off setting the 
temps until just before they are used. It's not as hard as it sounds.
1. Hide them from myself :) I don't mind having lots of global variables 
in a small script, but I really don't like it in larger programs. 
To keep things well organized I prefer if variables aren't valid 
in a larger context than necessary, to avoid overwriting, accidental 
use etc. Does context [ ... ] add a lot of overhead, btw? Maybe I should 
try to use that more often.
2. I don't use a lot of recursion so far. some [...] usually works 
equally well in my applications. But it's definitely a valid point, 
and I'll try to keep it in mind
The only execution overhead of context is when it is built - nothing 
extra at runtime. The memory overhead is minimal. Every word is defined 
in a context, even the global ones. Overall, using an object to wrap 
the temporary variables that your rules use is not a bad idea. As 
long as you are doing this to better manage your program and reduce 
the scope of errors, it is great.
context [ ] is just a shortcut for make object! [ ] and it's 
great. The more we hide in objects, the easier it will be to share, 
or at the least, to use code from a variety of developer sources. 
Programming in the Many is important in our context, as there are 
relatively few of us in the "many" so far. So when even our small 
stuff is shareable, we all win.
I often use contexts with parsers, to contain the rules.
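A minimal sketch of wrapping a rule and its temporaries in a context, assuming REBOL 2 (hence parse/all). The names parser, pair-rule and parse-pair are made up for illustration and do not come from the discussion:

```rebol
parser: context [
    ; temporaries are words of this object, not global words
    name: value: none

    ; the rule block is bound to the object, so COPY sets the object's words
    pair-rule: [copy name to "=" skip copy value to end]

    parse-pair: func [text [string!]] [
        either parse/all text pair-rule [reduce [name value]] [none]
    ]
]

probe parser/parse-pair "width=100"   ; == ["width" "100"]
```

Any global words called name or value are left untouched, since the rule only ever sets the words inside the object.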
what about 'use

tmp: 1 use [tmp][ parse "test" [copy tmp to end (probe tmp) ]] probe 
Does a bind/copy on its code block every time it is used.
That kind of overhead is usually only worth it when you can't get 
rid of concurrent use any other way.
Wait, USE may not copy in R2 - that could be even worse.
I should probably not use code evaluation so much directly 
in the parse rule block, and rather call a function if I need a lot 
of temp variables to process the action.
>> parse [>] [>]
== false
>> parse [>] ['>]
** Syntax Error: Invalid word -- '>

How do you parse that block?
(note this block can only be made without a space at the end in rebol 
I'm using helper words like:
	slash: to-lit-word first [/]
	dslash: to-lit-word "//"
	rShift: to-lit-word ">>"
	UrShift: to-lit-word ">>>"
	_greater: to-lit-word ">"
	_less: to-lit-word "<"
	_noteql: to-lit-word "<>"
	_lesseql: to-lit-word "<="
	_greatereql: to-lit-word ">="
nice, thanks
>> parse to-block load "<" [_less]
== true
yep, works
if I have a rule-block that does not exist in the same context as 
the main parse block, is there a simple way to rebind it without 
composing it into the main parse block? my current solution is to 
bind it to a temp block and use the temp block as a rule in the main 
parse block, which is less than optimal, I think.
set 'html-gen func [
    "Low level HTML dialect"

    data [none! string! tag! url! number! time! date! get-word! word! 
    /local cmd blk header row-blk start-tag dr tr pr wr
  ] [
    if get-word? data [data: get data]

    if any [url? data string? data number? data word? data time? data 
    date? data] [out data return true]
    if none? data [return true]

    dr: bind data-rules 'data ; this is the easiest way? can we not bind 
    ; directly in the parse block?
    tr: bind tag-rules 'data
    pr: bind page-rules 'data
    wr: bind word-rules 'data
    parse data [any [cmd: [dr | tr | pr | wr]]]
the last five lines in the function are the important ones.
Assuming you want to assign values to function locals from the external 
parse rules, you can a) bind as you are doing, b) create a larger 
context for the function encompassing your rules or c) compile the 
parse rule, either on creation of the function or for each instance.

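; a) bind the external rule to the function's context
rule: [set tag tag!]
test: func [data /local tag][bind rule 'data parse data rule tag]

; b) wrap the rule and the function in a larger context
test: use [tag][
    rule: [set tag tag!]
    func [data][parse data rule tag]
]

; c) compile the rule into the function body
rule: [set tag tag!]
test: func [data /local tag] compose/only [parse data (rule) tag]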

Also, note that when you bind, it alters the original block -- no 
need to reassign to a new word.
When it comes to complex rules, I opt for b).  At that, I'd go for 
context [] where there are a lot of associated words...
the function is recursive, so that may put a twist on b). I forgot 
that detail with BIND on a) so thanks for that. c) seems to work 
I'm just not getting the hang of parsing. I've read tutorials and 
looked at scripts, but when I try to adapt them to my work it fails.
I'm trying to parse a text document that I've formatted into lines 
of text with blank lines between, similar to the make doc format.
Most lines begin with a section number (2.), or a sub-section (2.3) 
or a sub-sub-section (2.3.5).
I've got rules to find each: (some digit "." some space) etc. and 
it works. I've been able to copy the text following with (copy text 
thru end) but how do I copy the section number?
ch_section: charset "0123456789."

parse/all "2.1.3 line" [copy section some ch_section copy rest to 
end] probe reduce [section rest] ;== ["2.1.3" " line"]
or something like that:
ch_digits: charset "0123456789"

r_section: [pos1: some [some ch_digits opt #"."] pos2: (section: 
copy/part pos1 pos2)]

parse/all "2.3.4 line" [r_section copy rest to end] probe reduce 
[section rest] ;== ["2.3.4" " line"]
If the section numbers always end with a period, you can do this:
    some [some digits "."]
If the section numbers don't end with period you can do this:
    some digits any ["." some digits]
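Putting those two patterns together, here is a rough sketch that accepts both "2." and "2.3.5" and copies the number and the remaining text. It assumes REBOL 2 (parse/all); digit, sec-num, num and rest are illustrative names only:

```rebol
digit: charset "0123456789"

; digits, then any number of ".digits" groups, then an optional trailing "."
sec-num: [some digit any ["." some digit] opt "."]

parse/all "2.3.5 Ladders" [copy num sec-num some " " copy rest to end]
probe reduce [num rest]   ; == ["2.3.5" "Ladders"]

parse/all "3. Types" [copy num sec-num some " " copy rest to end]
probe reduce [num rest]   ; == ["3." "Types"]
```

When the inner block ["." some digit] fails after matching the ".", parse backs up to before the ".", which is what lets opt "." pick up a trailing period.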
Look up recursive descent parsing, and take note of the difference 
between left recursion and right recursion.
Don't want to add too much, but with parse you can really build up 
a vocabulary based on the patterns you know:

 section: [integer! ["." | 1 4 ["." integer!]]] ; -- or whatever rule 
 ; covers all permutations
	chars-sp: charset " " space: [some chars-sp]

	parse/all [copy sn section space [to newline | to end]]

Vocabularies are easy to wrap in their own context too.  Note also 
that [integer!] is a shorthand for [some digit] -- very useful : 
Oldes, thanks for your suggestion. It works when I do a simple one 
line rule as you suggested but when I try to use multiple rules it 

Example of what I'm trying to do:
Example of the text document:

3.1 Aluminum ladders are divided into two basic types of construction, 
viz:, solid beam and truss.

3.1.1  Solid Beam Aluminum Construction- This type of ladder has 
a solid side rail construction with aluminum rungs connecting with 
the side rails at fourteen inch intervals. The connection is generally 
either by a welded joint between rung and side rails, or by an expansion 
plug pinching the rung tightly to the side rails and internal backup 
plates. (Figure 2 A)

3.1.2  Aluminum Truss Construction- In the aluminum truss design, 
the top and bottom rails are connected to rung assemblies or rung 
blocks by rivets. The rungs are either welded or expansion plugged 
to the rung plate assemblies, which are supported by the top and 
bottom rails. (Figure 2B)

3.2 The base of the portable aluminum ladder is provided with either 
steel spikes or swiveling rubber safety shoes and aluminum spikes. 
For ladders equipped with the swiveling device, the rubber pads should 
be utilized when the ladder is to be raised and used on hard surfaces. 
(Figure 2A, 2B)
space: charset " ^-"
spaces: [some space]
chars: complement charset " ^-^/"
digit: charset "0123456789"
digits: [some digit]
section: [digits "." some space]
sub-sec: [digits "." digits spaces]
sub-sub-sec: [digits "." digits "." digits spaces]

rules: [heading some parts done] (where heading is the first line 
of the text file)

parts: [newline | section format_section | sub-section | sub-sub-section]

format_section: [copy sec section copy rest to newline (print reduce 
[sec rest])]
If I use the format_section code directly with parse it works, but I 
get nothing when I redirect it to another line.

The above code is similar to what Carl used in his text to html script.
Any reason that the headings with one number have a trailing period 
and the rest don't?
BrianH, sorry Brian, the text above is just from a random and simpler 
section of the document.

If I copied from the beginning, the first line would not have a 
number at all.
Actually, the inconsistency affects the parse rules. I ask again...
I thought you meant the document heading...

No reason but my rules account for it. The rules work in simpler 
Are you creating the documents or are others doing so? For that matter, 
does it just go to 3 levels of numbers?
The docs come from PDFs that I have converted to text and tried 
to reformat by hand to the simplest form while preserving the structure 
of the doc. In addition to sections, sub-sections and sub-sub-sections, 
there are numbered lists, letter lists, photos/diagrams, and tables 
to deal with. I thought I'd start with sorting out the sections and 
tackle the rest later.
Well, first of all you need to put the longer matches first in your 
alternates, so they will be tested first.
in the above code the following will work:
format_section: [copy rest to newline (print reduce [rest])]

but this fails: 

format_section: [copy sec section copy rest to newline (print reduce 
[sec rest])]
longer matches...

This is where I get lost in parse.

What do you mean?
It checks the alternates (sections separated by | ) in order. If 
there is ambiguity, the way to get it to go for the longest match 
is to check for that match first.
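A small sketch of why the order matters, assuming REBOL 2 (parse/all). The rule names num, short-first and long-first are made up for illustration, and the rules are deliberately simplified so that a shorter alternative is a prefix of a longer one:

```rebol
digit: charset "0123456789"
num: [some digit]

; shortest alternative listed first
short-first: [num | num "." num | num "." num "." num]

; longest alternative listed first
long-first: [num "." num "." num | num "." num | num]

probe parse/all "3.1.2" short-first   ; == false
probe parse/all "3.1.2" long-first    ; == true
```

In short-first the first alternative matches just "3" and succeeds, so the later alternatives are never tried and parse stops short of the end of the input. In long-first the three-part pattern gets its chance before the shorter ones.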
so check sub-sub-sections then sub-section then sections in that 