World: r3wp

[Parse] Discussion of PARSE dialect

Oldes
5-Oct-2006
[1489x2]
And Rebolek, you can use this code of mine to remove unwanted tags (it's 
already here - posted a few days before - but with a little bug - 
this version should be OK, as I'm using it)

remove-tags: func [html /except allowed-tags /local new x tag name tagchars][
	if not string? html [return html]
	new: make string! length? html
	tagchars: charset [#"a" - #"z" #"A" - #"Z"]
	parse/all html [
		any [
			; copy the text up to the next tag, then the tag itself
			copy x to {<} copy tag thru {>} (
				if not none? x [insert tail new x]
				; with /except, keep the tag if its name is in allowed-tags
				if all [
					except
					parse/all tag ["<" opt #"/" copy name some tagchars to end]
					find allowed-tags name
				][insert tail new tag]
			)
		]
		; append whatever follows the last tag
		copy x to end (if not none? x [insert tail new x])
	]
	new
]
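For example, usage looks something like this (a quick sketch with made-up HTML; the results shown are what the code above should give, not captured console output):

>> remove-tags {<p>Hello <b>world</b>!</p>}
== "Hello world!"
>> remove-tags/except {<p>Hello <b>world</b>!</p>} ["b"]
== "Hello <b>world</b>!"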
I'm thinking about improving it to be able to remove unwanted tag attributes 
as well
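One rough way to do that (just an untested sketch reusing the inner tag parse; strip-attributes is a name made up here) would be to rebuild each kept tag from its name only:

strip-attributes: func [tag [string!] /local name tagchars][
	tagchars: charset [#"a" - #"z" #"A" - #"Z"]
	; keep only "<name>" or "</name>", dropping any attributes
	either parse/all tag ["<" opt #"/" copy name some tagchars to end][
		rejoin ["<" either #"/" = second tag ["/"][""] name ">"]
	][tag]
]
; e.g. strip-attributes {<a href="x.html">} should give "<a>"

Calling that on each tag kept by the /except branch would remove the attributes while keeping the allowed tags.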
Rebolek
5-Oct-2006
[1491x2]
Thanks to everybody, I used Ladislav's example, as it is easily extensible 
to support more HTML entities than just "<" and ">"
Oldes, I'm not removing any tags, I'm just 'translating' unwanted 
tags to html-entities
Oldes
5-Oct-2006
[1493x3]
It's up to you how you modify it, do what you need :-)
And if you are converting html-entities, you may find this useful: 
http://oldes.multimedia.cz/rebol/html-entities_latest.rip
Do you think it would be possible to make a BNF to PARSE-rules converter?
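For example, an (E)BNF production and its parse-rules counterpart already look quite close (just an illustration, not converter output):

; EBNF:   digits ::= digit { digit }
;         number ::= digits [ "." digits ]
digit:  charset "0123456789"
digits: [some digit]
number: [digits opt ["." digits]]

>> parse "3.14" number
== true

A converter would mostly have to map production names to set-words, { } to ANY, [ ] to OPT, while alternation carries over directly as |.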
Rebolek
5-Oct-2006
[1496]
Don't know. Is 'parse Turing-complete? :)
Oldes
5-Oct-2006
[1497x4]
With such a converter we should theoretically be able to easily parse 
any language
http://www.garshol.priv.no/download/text/bnf.html
...There are actually lots of programs that can be given (E)BNF grammars 
as input and automatically produce code for parsers for the given 
grammar. In fact, this is the most common way to produce a compiler: 
by using a so-called compiler-compiler that takes a grammar as input 
and produces parser code in some programming language....
It looks like an interesting project for long winter evenings :-)
Anton
5-Oct-2006
[1501x4]
Well, I just spent two days making a matching algorithm for searching 
file contents, and I was considering making a "compile-rules" function 
(possibly similar to Gabriele's or someone else's). Looks like I don't 
have to make that for now, but my mind is in this place at the moment. 
I long for the day when I don't have to use filesystems at all (which 
obviates the need for file search programs) - hopefully we can stick 
all our info in a database soon. Probably an associative database.
While on this topic - was it Gregg or Sunanda who made a mini dialect 
for a file contents matcher? That's the algorithm I just made, and 
I'm now interested to review other implementations. While developing 
I also came to an apparent crossroads, a choice between a simple, 
"digital", logical algorithm or a more "fuzzy" algorithm with a ranking 
system like Google's. This reminded me of a discussion a while back 
where this point was made.
Example of the dialect:

This finds all the files containing "anton" and "gabriele" but not 
"anthropomorphic":
	find-file *.r ["anton" "gabriele" not "anthropomorphic"]
This finds all the files containing "reichart" or "oldes":
	find-file *.r [some ["reichart" "oldes"]]
NOT, ALL and SOME can be combined recursively.
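A rough sketch of how one file's contents could be checked against such a block (my own names and code, not Anton's actual implementation; %file.r is just illustrative):

match-content?: func [
	text [string!] conditions [block!]
	/any-of "require at least one match (SOME) instead of all (ALL)"
	/local value total hits
][
	total: hits: 0
	parse conditions [
		some [
			'not set value string! (
				total: total + 1
				if not find text value [hits: hits + 1]
			)
			| 'some set value block! (
				total: total + 1
				if match-content?/any-of text value [hits: hits + 1]
			)
			| 'all set value block! (
				total: total + 1
				if match-content? text value [hits: hits + 1]
			)
			| set value string! (
				total: total + 1
				if find text value [hits: hits + 1]
			)
		]
	]
	either any-of [hits > 0][hits = total]
]
; e.g.  match-content? read %file.r ["anton" "gabriele" not "anthropomorphic"]

find-file itself would then just collect the filenames for which this returns true.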
Gregg
5-Oct-2006
[1505x2]
That one wasn't me (AFAIK). I did mini dialects for matching file 
dates and sizes, but not contents. I have an old AWK dialect, but 
that's as close as I've come.
WRT BNF, it should be possible. I think Brett Handley did it, or 
the reverse, at one point; might be on codeconscious.com, not sure. 
I've also done something similar, for ABNF. It was built for a client, 
so I'd have to ask if it could be released. ABNF is what is used 
in a lot of RFCs, so it could be used on a lot of things for standards 
interop.
BrianH
5-Oct-2006
[1507]
The syntax of (E)BNF languages varies widely, widely enough that 
the parse dialect itself can be considered an EBNF language.
Robert
9-Oct-2006
[1508]
The main problem I see is that a "normal" BNF parser checks all rules 
in parallel and uses the first match, whereas PARSE uses a sequential 
approach, trying rules in order and using the first match. So, the rule 
for using PARSE is: always put the maximum-width matching rule first.

For example, you want to parse for:
.
..
...

You need to put the ... first. Otherwise the rule will match 
a single . and be fired three times.
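A quick console illustration of that:

>> parse "..." [some ["..." (print 3) | ".." (print 2) | "." (print 1)]]
3
== true
>> parse "..." [some ["." (print 1) | ".." (print 2) | "..." (print 3)]]
1
1
1
== true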
Pekr
9-Oct-2006
[1509]
which approach is considered better?
Gregg
9-Oct-2006
[1510]
Neither is better. :-)
BrianH
10-Oct-2006
[1511x2]
Actually Robert, "normal" BNF parsers usually have similar restrictions 
to the parse dialect, only more so. Shift-reduce parsers like yacc 
need the maximum width rule first; recursive-descent parsers need 
to be refactored extensively (in a way that is too complicated to 
go into now). The parse dialect is recursive-descent with backtracking, 
which in theory is less restricted than either LR (shift-reduce) 
or LL (recursive-descent). I tend to do LL refactoring on my parse 
rules just because that makes them faster, but it's nice that it 
is not always required, that I can do LR-style rules if I need to.
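For instance, a left-factoring like this (my illustration, not Brian's actual rules) turns a backtracking set of alternatives into an LL-style rule that scans the common prefix only once:

; backtracking style: when "printer" fails, parse rewinds and re-scans "print"
rule: ["printer" | "printing" | "print"]

; left-factored, LL-style: the common prefix is matched once
rule: ["print" opt ["er" | "ing"]]

>> parse "printing" ["print" opt ["er" | "ing"]]
== true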
Perhaps you are thinking of lexers that convert a source syntax with 
restrictions similar to those of regular expressions into a state 
machine. Those could be thought to operate in parallel (not really, 
but close enough), but the languages they accept are quite restricted 
compared to full parsers, let alone the parse dialect.
Maxim
10-Oct-2006
[1513]
<sigh>  and I thought I was getting parse  ;-)  reading Brian's post... 
well, makes me reconsider  ;-)
BrianH
10-Oct-2006
[1514]
Sorry, I came to the parse dialect from a history of using and making 
parser generators. It's annoying that the behavior of parse and the 
tricks you can use to optimize your parse rules have all of these 
arcane CS terms referring to them. At least the parse dialect is 
a lot more flexible than most of those parser generators, and easier 
to write, use and debug too.
james_nak
10-Oct-2006
[1515]
I have an easy one for you gurus. Let's say I want to parse a file 
and get all the "www..." out of it. The thing is that they end in 
either a space or a linefeed. How do I do a (written in pseudo parse 
to give you an idea) "to "www" copy tag to 'either a linefeed or 
a space'"? I've tried charsets, vars, blocks but the best I can do 
is one or the other. Note, finding the "www" is the easy part, it's 
ending the string that is giving me fits. Thanks in advance.
Oldes
10-Oct-2006
[1516x2]
>> stoppers: charset " ^/^M^-"
== make bitset! #{
0026000001000000000000000000000000000000000000000000000000000000
}
>> validchars: complement stoppers
== make bitset! #{
FFD9FFFFFEFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
}
>> test: {www.c.x xxxx www.x.x^/xxxx www.q.q}
== "www.c.x xxxx www.x.x^/xxxx www.q.q"

>> parse/all test [any [to "www" copy url some validchars (probe url)]]
www.c.x
www.x.x
www.q.q
== true
it should be parse/all test [any [to "www" copy url some validchars 
(probe url)] to end] if you want to get true even if the string 
does not end with the url
james_nak
10-Oct-2006
[1518]
Oldes, thanks. I get it. You're one smart cookie. Gracias.
Maxim
27-Oct-2006
[1519x7]
I am almost sure this question has been asked many times before... it's 
my turn  :-)


is there a way for a parse rule to detect situations in which it 
should fail, because we have a succeeding rule which we know will 
match?
ex:  I am parsing:  ABCZXYXYCBA
I have rules to parse ABC explicitly and a fall-back which can parse 
anything.
but I'd like to detect in my fall-back that it should stop, because 
I know I'm at the end.
the length of the string is not known until I hit the trailing CBA
note... the example is simple; consider each character a different 
matching condition.
also, in reality, each letter in the above over-simplification is 
a word... not just one char (and there is overlap), so I can't just 
match charsets.
Gabriele
27-Oct-2006
[1526]
is BREAK what you are looking for?
Coccinelle
27-Oct-2006
[1527x2]
I'm not sure if I understand, but perhaps this:
rule: ["CBA" | skip rule]
parse "ABCZXYXYCBA" ["ABC" rule]
Ladislav
28-Oct-2006
[1529x2]
Maxim: I suppose the trouble is that your fall-back rule accepts 
the empty string? If that is the case, then a rule like:

    [any [fall-back]]

is an "endless loop". Therefore you may need something like:

    [any [end break | fall-back]]

to be able to stop
on the other hand, this may not be enough in some cases (e.g. if 
the fall-back rule isn't able to get to the end)
Maxim
28-Oct-2006
[1531]
the break seems to be what I am looking for. I'll test something out 
and if it's not conclusive I will come back with a better example 
:-)  thanks guys.
Graham
25-Nov-2006
[1532]
Posted on reboltalk ...

>> parse/case "AAABBBaaaBBBAAAaaa" "A"
== ["" "" "" "BBBaaaBBB" "" "" "aaa"]

how come there are only two "" after the BBBaaaBBB?
Henrik
25-Nov-2006
[1533]
>> parse/case "AAABBBaaaAAA" "A"
== ["" "" "" "BBBaaa" "" ""]
>> parse/case "BAAABBBaaaAAA" "A"
== ["B" "" "" "BBBaaa" "" ""]
>> parse/case "BA" "A"
== ["B"]

hmmm...
Ladislav
25-Nov-2006
[1534]
it's OK, because every A means one closing #"^"" - each A terminates 
the current string. The first A was used to close the preceding "...a" string
Anton
25-Nov-2006
[1535]
Yep, makes sense to me.
Ingo
26-Nov-2006
[1536]
This may make it easier for some, just exchange the "A"s for "," 
and mentally read it like you would read a csv file:

>> parse/case ",,,BBBaaaBBB,,,aaa" ","
== ["" "" "" "BBBaaaBBB" "" "" "aaa"]
Anton
26-Nov-2006
[1537]
It's like cutting a piece of wood. You only cut twice but you end 
up with three pieces.
Maxim
26-Nov-2006
[1538]
but parse does have an inconsistency:
>> parse/all "/1/2/3/" "/"
==  ["" "1" "2" "3"]

>> parse/all "/1/2/3" "/"
== ["" "1" "2" "3"]


two different strings on entry, the same output. IMHO the first 
example should have an extra trailing "" in the block.