World: r3wp

[Parse] Discussion of PARSE dialect

BrianH
6-Nov-2008
[2889x2]
In theory you could even do something like block parsing on event 
ports, like SAX pull. Same seekable restrictions apply - no backtracking 
or position setting or getting unless the port supports seeking.
That would shunt the cache management into the port scheme :)
Anton
6-Nov-2008
[2891x2]
Ah, that makes sense. My model of how parse would handle ports was 
wrong. I was assuming it would work just like string parse, except 
working on a limited buffer, supplied by the port.
Block parsing ? How are you going to do that when you can't even 
see the final ']'  in the buffer yet ?
BrianH
6-Nov-2008
[2893x2]
With seekable ports the buffering is handled by the ports, rather 
than provided by them. I wonder if there will be cache control APIs 
:)
By "something like block parsing", I mean ports that return other 
REBOL values than bytes or characters can be parsed as if the values 
were contained in a block and being parsed there. Any buffering of 
these values would be handled by the port scheme code. Only whole 
REBOL values would be returned by such ports, so any inner blocks 
returned would be parsed by INTO as actual blocks.
Anton
6-Nov-2008
[2895]
Hmm.. that could work. I suppose the outermost block that usually 
encompasses loaded rebol data would have to be "ignored".
BrianH
6-Nov-2008
[2896x2]
No, it would be virtual :)
Actually, there are no [ and ] in REBOL blocks once they are loaded. 
Block parse works on data structures.
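A minimal sketch of the point above, assuming REBOL 2 style block parsing. Once a value is loaded there are no bracket characters left to scan for, so the rule matches values by datatype, and INTO descends into a nested block that already exists as a value. The sample data is made up for illustration.

```rebol
; Hypothetical sample data: the inner [2 3] is a block value,
; not the characters "[" and "]".
data: [1 [2 3] "x"]

; Match an integer, descend into the inner block, then match a string.
print parse data [integer! into [some integer!] string!]   ; prints true
```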
Anton
6-Nov-2008
[2898]
'Virtual' is the right word.
Pekr
6-Nov-2008
[2899]
I was thinking along the same lines as Anton - that it would work like 
parsing a string, using some limited buffer ...
BrianH
6-Nov-2008
[2900x2]
Ports don't work like series in R3. If anything, port PARSE would 
simplify port handling by making seekable ports act more like series.
I gotta suggest this to Carl :)
Anton
6-Nov-2008
[2902]
At least if you could add "3.12 port parsing" to the Parse_Project 
page... :)
Pekr
6-Nov-2008
[2903]
OTOH - I never did any binary format parsing. Oldes has some experience 
here IIRC. Dunno how encoders/decoders will be built, maybe those 
will be in native C code anyway ...
Tomc
6-Nov-2008
[2904]
the potential for backtracking is initiated by setting a placeholder, 
i.e. :here

caching only as far back as the earliest current placeholder may 
be sufficient
BrianH
6-Nov-2008
[2905x5]
There are three operations that can cause you to change your position 
from the standard forward-on-recognition: get-words (:a), alternation 
( | ) and REVERSE. You can check for alternation because it will 
always be within the current rule block. Get-words and REVERSE may 
be in inner blocks that may change.
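Two of those position changes can be sketched with what already exists in REBOL 2 string parsing. REVERSE is only a proposal, so it is left out here. The input strings and the word HERE are made up for illustration.

```rebol
; Alternation: when "ab" fails to match, PARSE backtracks to where
; the alternatives started and tries "ac" from the same position.
print parse "ac" ["ab" | "ac"]                        ; prints true

; Set-word / get-word: HERE: remembers a position and :HERE jumps back
; to it, so the same three characters are matched by two different rules.
print parse "aaab" [here: some "a" :here 3 "a" "b"]   ; prints true
```

Both cases need the earlier input to still be reachable, which is why a port being parsed would have to seek or keep a cache.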
Here's an example of what you could do with the PARSE proposals:

use [r d f] [ ; External words from standard USE statement
    parse f: read d: %./ r: [
        use [d1 f p] [ ; These words override the outer words
            any [
            ; Check for directory filename
                (d1: d) ; This maintains a recursive directory stack
                p: ; Save the position
                change [ ; This rule must be matched before the change happens
                    ; Set f to the filename if it is a directory else fail
                    set f into file! [to end reverse "/" to end]
                    ; f is a directory filename, so process it
                    (
                        d: join d f ; Add the directory name to the current path
                        f: read d   ; Read the directory into a block
                    )
                    ; f is now a block of filenames.
                ] f ; The file is now the block read above
                :p  ; Go back to the saved position
                into block! r ; Now recurse into the new block
                (d: d1) ; Pop the directory stack
            ; Otherwise backtrack and skip
                | skip
            ] ; end any
        ] ; end use
    ] ; end parse
    f ; This is the expanded directory block
]
I could probably save that p position word using FAIL and backtracking 
:)
Here's a revised version with more of the PARSE proposals:

use [r d res] [ ; External words from standard USE statement
    parse res: read d: %./ r: [
        use [ds f] [ ; These words override the outer words
            any [
            ; Check for directory filename
                (ds: d) ; This maintains a recursive directory stack
                [ ; Save the position through alternation
                    change [ ; This rule must be matched before the change happens
                        ; Set f to the filename if it is a directory else fail
                        set f into file! [to end reverse "/" to end]
                        ; f is a directory filename, so process it
                        (
                            d: join d f ; Add the directory name to the current path
                            f: read d   ; Read the directory into a block
                        )
                        ; f is now a block of filenames.
                    ] f ; The file is now the block read above
					fail ; Backtrack to the saved position
					|
					into block! r ; Now recurse into the new block
				]
                (d: ds) ; Pop the directory stack
            ; Otherwise backtrack and skip
                | skip
            ] ; end any
        ] ; end use
    ] ; end parse
    res ; This is the expanded directory block
]
Sorry, somehow those became tabs :(
Pekr
6-Nov-2008
[2910]
Don't know why, but most of the time when parsing CSV structure I 
have to do something like:

parse/all append item ";" ";" 


Simply put, to get all columns, I need to add the last semicolon 
to the input string ...
BrianH
6-Nov-2008
[2911]
Show an example string that requires that hack and maybe we can help.
Pekr
6-Nov-2008
[2912]
http://www.rebol.net/cgi-bin/rambo.r?id=3813&
BrianH
6-Nov-2008
[2913]
I remember that. It shouldn't be as much of a problem when the ordinal 
functions return none rather than out-of-bounds errors....
Still, I'll bring it up.
Tomc
6-Nov-2008
[2914]
comes from data using separators instead of terminators ... I use 
'| and have a command line "tailpipe" script to fix data
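The separator versus terminator distinction can be sketched with an explicit rule instead of the appended ";". This is only a sketch, assuming REBOL 2 string parsing. SPLIT-FIELDS is a hypothetical helper name, not an existing function.

```rebol
; Hypothetical helper: treat ";" as a separator, so the text after the
; last ";" (even when empty) is still returned as a field.
split-fields: func [line [string!] /local fields field] [
    fields: copy []
    parse/all line [
        ; Every field that is followed by a separator
        any [copy field to ";" skip (append fields any [field copy ""])]
        ; The final field, which has no separator after it.
        ; COPY may set FIELD to none on an empty match, hence the ANY.
        copy field to end (append fields any [field copy ""])
    ]
    fields
]

probe split-fields "a;b;"   ; intended result: ["a" "b" ""]
```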
Steeve
7-Nov-2008
[2915]
is that all folks ?
BrianH
7-Nov-2008
[2916x5]
Aside from a bugfix in the last example I gave (forgot the only) 
I would say yes for now. There will be more changes when Carl gets 
back to this so that we can discuss his proposals. Everyone else's 
proposals seem to have been covered except THROW (which also needs 
Carl's feedback). Incorporating COLLECT and KEEP into PARSE is both 
unnecessary and doesn't help at all for building hierarchical structures. 
PARSE doesn't have anything to do with parsing REBOL's syntax, so 
Graham's problems are out of scope. If you have more ideas, this group or 
the same group in the alpha world is the place to bring them up.
Changes to simple parsing (not rule-based) are out of scope, but 
have been brought up nonetheless. Parsing or ports is also out of 
scope for the proposals document, but will also be brought up. Everything 
in Gabriele's PARSE REP page has been covered or rejected (except 
THROW).
Here is the page with the PARSE syntax requests - see for yourself: 
http://www.rebol.net/wiki/Parse_Project
Parsing or ports -> Parsing of ports
That page is it unless we get more suggestions. We haven't decided 
what makes the cut yet even for those.
Steeve
7-Nov-2008
[2921x2]
hum (i have to be a little bit rude), i just read your response on 
rebol.net about whether or not to turn RETURN into a more 
generalized EMIT function (as i proposed).

I will not discuss the difficulty of implementing that idea (i 
don't have the sources). But what i can say is that a COLLECT behaviour 
will be more useful than all the return and break/return stuff u posted.

Have you inspected the scripts on Rebol.org recently ? If u had, 
you would see that many coders use parsing to collect data.

The problem Graham, is that when i read your arguments, i have the 
unpleasant impression that you alone decide if an idea is 
bad or good.

The narrow minded sentence "Incorporating COLLECT and KEEP into 
PARSE is both unnecessary and doesn't help at all for building hierarchical 
structures" suggests that you have not widely used parse in your code. 
I don't think you are the best person here to make these choices. 
Many script contributors on Rebol.org have made some masterful pieces 
using parse (not you).

So when you reject an idea you should be more sensitive to this 
simple fact: many people here have equal or better experience 
with parsing than you.
by the way, many people have proposed the ideas you posted in the 
wiki (just read some scripts on Rebol.org). you should be a little 
less quick to credit yourself with ideas that have been around for 
several years.
Anton
7-Nov-2008
[2923]
(Steeve, I think you are addressing BrianH, not Graham.)
Steeve
7-Nov-2008
[2924x3]
really ?
oh my...
yes i was talking to BrianH, what do u mean ?
Anton
7-Nov-2008
[2927]
You wrote above, "The problem Graham, is that  when i read your arguments..."
Steeve
7-Nov-2008
[2928x2]
oh i see, my apologies to Graham
I was a little upset when I wrote it ;-)
Pekr
8-Nov-2008
[2930x2]
uhmm, well, Steeve, as for me, if my proposal is going to be implemented, 
I don't care if I am credited or not. Because - parser REPs have been 
floating around here and there for some 8 years maybe :-) As for BrianH 
and his judgements - he might not be better at parse than others, but I 
would not try to upset him - BrianH is our guru here. Along with Gabriele 
and Cyphre, and after the loss of Ladislav, he is one of the most skilled 
rebollers. I think that his intention is to help make REBOL better. He 
might also be the one who will bring a JIT or compiler in the future, and 
he understands the consequences of what he suggests ...
I have to ask - what ppl are you referring to, regarding rebol.org? 
Why are they not here, or posting to the blog? BrianH might be quick 
in his decisions, because Carl selected him to collect the ideas, 
so let's forgive him a little bit of guru behaviour :-) And in the 
end, it is Carl who decides whether a REP is going to be implemented or 
not. If you have another pov on some REP, why not talk about it 
here, where more ppl can judge?
BrianH
8-Nov-2008
[2932]
I'm not angry, promise :)
Pekr
8-Nov-2008
[2933]
:-) OK
BrianH
8-Nov-2008
[2934x4]
Nonetheless, I think I need to apologize to Steeve, especially in 
the original sense of explanation.
I am the editor of the PARSE proposals.


It was decided that I perform this role because Carl is focused on 
the GUI work right now and someone qualified had to do it. With Carl 
busy and Ladislav not here, I am the one left who has the most background 
in parsing and the most understanding of what can be done efficiently 
and what can't. When the PARSE REPs of old were discussed, I was 
right there in the conversation and the originator of about half 
of them, mostly based on my experience with other parsers and parser 
generators. Because of this I am well aware of the original motivation 
behind them, and have had many years to think them through. It's 
just a head start, really.


I am also the author of the current implementation of COLLECT and 
KEEP, based on Gabriele's original idea, which was a really great 
idea. It is also really limited. Collecting information and building 
data structures out of it is the basic function that programming 
languages do, and something that REBOL is really good at. I am not 
in any way denigrating the importance of building data structures. 
I certainly did not mean to imply that your appreciation of that 
important task was in any way less important.


The role of an editor is not just to collect proposals, but to make 
sure they fit with the overall goal of the project. This sometimes 
means rejecting proposals, or reshaping them. This is not a role 
that I am sorry about - someone has to do it to make our tool better. 
We are not Perl, this is not anything goes, we actually try to make 
the best decisions here. I hate to seem the bad guy sometimes, but 
someone has to do it :(


PARSE is a portion of REBOL that is dedicated to a particular role. 
It recognizes patterns in data, extracts some of the data, and then 
calls out to the DO dialect to do something with the data. It doesn't 
really do anything to the data itself - everything happens in the 
DO dialect code in the parens. It is fairly simple really, and from 
carefully designed simplicity it gets a heck of a lot of power and 
speed. That is its strength.
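A small sketch of that division of labour, assuming REBOL 2 string parsing. The rule only recognizes digits and extracts them with COPY, while the arithmetic happens in DO dialect code inside the paren. The input string and the words DIGITS, NUM and TOTAL are made up for illustration.

```rebol
digits: charset "0123456789"
total: 0

parse/all "10,20,30" [
    some [
        copy num some digits                ; PARSE recognizes and extracts
        (total: total + to-integer num)     ; DO dialect code does the work
        opt ","
    ]
]

print total   ; prints 60
```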


The thing that a lot of people don't remember when making improvements 
to a dialect like PARSE is that PARSE is only one part of REBOL. 
If something doesn't go into PARSE, it can go into another part of 
REBOL. We have to consider the language as a whole when we are doing 
things like this.

Here is the overall rationale for the PARSE dialect proposals:

- All new features need to be simple to explain and use, and fast 
at runtime.
- A good feature would be one of these:

  - An extremely powerful enhancement of PARSE's language recognition.

  - A fix to a design flaw in an existing feature, or a compatibility 
  fix.

  - A serious improvement to a sufficiently common use case, or common 
  error.


The reason I didn't want to put COLLECT and KEEP into PARSE is because 
it is a small part of a much bigger problem that really needs a lot 
of flexibility. Different structure collection and building situations 
require different behavior. It just so happens that the DO dialect 
is much better suited to solving this particular problem than the 
PARSE dialect is. Remember, PARSE is a native dialect, and as such 
is rather fixed.


There are some PARSE proposals that make parse actually do something 
with the data itself: CHANGE, INSERT and REMOVE. We were very careful 
when we designed those proposals. In particular, we wanted to provide 
the bare minimum that would be necessary to handle some very common 
idioms that are usually done wrong, even by the best PARSE programmers. 
Sometimes we add stuff into REBOL that is just there to solve a commonly 
messed up problem, so that a well debugged solution would be there 
for people to choose instead of trying to solve it again themselves, 
badly. (This is why the MOVE function got added to R3 and 2.7.6, 
btw.) Even with that justification those features might not make 
it into PARSE because they change the role of PARSE from recognition 
to modification. I have high hopes, though.


Another proposal that might not make it into PARSE is RETURN. RETURN 
is another ease-of-use addition. In particular, the thing it makes 
easy is stopping the parse in the middle to return some recognized 
information. However, it changes the return characteristics of PARSE 
in ways that may have unpredictable results, and may not have enough 
benefit. The proposal that has a better chance of making it is BREAK/return, 
though I'd like to see both (we can hope, right?).


Most of the REPs from Gabriele's doc have been covered. Most of them 
have been changed because we have had time in the last several years 
to give them some thought; the only unchanged ones are NOT and FAIL, 
so far. Some have been rejected because they just weren't going to 
work at all (8 and 12). THROW and DO are still under discussion - 
the proposals won't work as is, but the ideas behind them have merit. 
The rest have been debated and changed into good proposals. Note 
that the DO proposal would be rejected outright for R2, but R3's 
changes to word binding make it possible to make it safe (as figured 
out during a conversation with Anton this evening).


There are other features that are not really changes to the PARSE 
dialect, and so are out of scope for these proposals. That doesn't 
mean that they won't be implemented, just that they are a separate 
subject. That includes delimiter parsing (sorry, Petr), tracing (sorry, 
Henrik), REBOL language syntax (sorry, Graham), and port parsing 
(sorry, Steeve, Anton, Doc, Tomc, et al). If it makes you feel better, 
while discussing the subject with Anton here I figured out a way 
to do port parsing with the R3 port model (it wouldn't work with 
the R2 port model). I will bring these all up with Carl when it comes 
to that.


I hope that this makes the situation and my position on the subject 
clearer. I'm sorry for any misunderstandings that arose during this 
process.
Note that I am quite familiar with collecting data from hierarchical 
and other structures and putting that data into hierarchical and 
other data structures. I have done this with PARSE, with DO dialect 
code, and with a combination of the two. I have found that PARSE 
is good for recognition, but DO dialect code is best for the construction. 
A mix of both is usually the best strategy. You can use the existing 
COLLECT and KEEP with PARSE quite well. PARSE is not a standalone 
dialect - it is meant to be integrated with other dialects, particularly 
the DO dialect that gets executed in the parens.
However, most of my contributions to REBOL.org were lost during one 
of their reorgs years ago and I have been mostly contributing in 
other ways lately. Like helping people out here and writing REBOL's 
mezzanine functions. I barely go to REBOL.org anymore except to search 
the code there for mezzanine usage so that I know what is safe to 
change. Outside of work that goes into REBOL community projects, 
most of my scripts have been either one-offs or under NDA lately. 
Sorry.
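The mix described above, PARSE for recognition and the existing COLLECT and KEEP for construction, can be sketched as follows. This assumes a REBOL build that provides the COLLECT mezzanine (R3, or an R2 build that includes it). The sample data is made up. The rule is written inside the COLLECT body because that is where KEEP is bound.

```rebol
probe collect [
    use [n] [
        parse [1 "a" 2 "b" 3] [
            any [
                set n integer! (keep n)   ; recognized by PARSE, kept by DO code
                | skip                    ; ignore everything else
            ]
        ]
    ]
]
; intended result: [1 2 3]
```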
Sunanda
8-Nov-2008
[2938]
BrianH -- is it possible to incorporate the TRACE/DEBUG suggestion 
as part of the doc? Parse is so complex/deep/subtle that it needs 
some transparency.
See my earlier message above, or here:
http://www.rebol.org/aga-display-posts.r?post=r3wp210x2855