World: r3wp
[Parse] Discussion of PARSE dialect
Anton 4-Oct-2006 [1476x3] | Ok, give this a burl. |
string: "<good tag><bad tag> 3 > 5 <other tag><good tag with something inside>" string: " > >> < <<good tag><bad tag> 3 > 5 <other tag><good tag etc> >> > " ; (1) search for end tags >, they are erroneous so replace them ; (2) search for start tags <, if there is more than one, replace all except the last one ; (3) search for end tag >, check tag body and replace if necessary entity: "&entity;" ntag: complement charset "<>" ; non tag parse/all result: copy string [ any [ ; (1) any [ any ntag start: ">" end: ( change/part start entity 1 end: skip start length? entity ;print [1 index? start] ) :end ] ; (2) (start: none stop?: none) any [ any ntag start: "<" end: ;(print [2 mold start]) any ntag "<" ( ;print "found a second start tag" change/part start entity 1 end: skip start length? entity ;(print [2.1 mold copy/part start end]) start: none ) :end ] (if none? start [stop?: 'break]) stop? ; ok, we found at least one start tag ;(print ["OK we found at least one start tag" mold start]) :start skip ; (3) any ntag end: ">" ;(print [3 mold copy/part start end]) (if not find copy/part start end "good tag" [ ;print ["found a bad tag" mold copy/part start end] change/part start entity 1 ; fix up END (for when your entity is other than a 1-character long string) end: skip end (length? entity) - 1 change/part end entity 1 ; fix up END again end: skip end (length? entity) - 1 ]) :end skip ] to end ] result | |
All you need to do now is define two separate entity strings for < and > and then use the right one when replacing. | |
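For example (a sketch with invented names; the code above uses the generic placeholder "&entity;"), the two entity strings could simply be the standard HTML entities:
    lt-entity: "&lt;"   ; used when replacing a stray "<"
    gt-entity: "&gt;"   ; used when replacing a stray ">"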
Rebolek 4-Oct-2006 [1479] | great, I'll test it, thanks |
Anton 4-Oct-2006 [1480x2] | Holy ---- ! where did two and a half hours go ? |
oh no.. maybe I only spent one and a half hours on it, but still...! | |
Rebolek 4-Oct-2006 [1482] | Erhm sorry ;) |
Anton 4-Oct-2006 [1483] | Ahh don't worry about that. |
Ladislav 4-Oct-2006 [1484x2] | this looks like an alternative: |
result: "" parse/all string [ any [ ; starting good tag copy s ["<good tag" thru ">"] (append result s) | ; ending good tag "</good tag>" (append result "</good tag>") | ; entity replacement "<" (append result "<") | ">" (append result ">") | copy s skip (append result s) ] ] print result | |
Volker 4-Oct-2006 [1486] | In this case you may also look at load/markup ;) |
Tomc 4-Oct-2006 [1487] | what Volker said.
    s: "<good tag><bad tag> 3 > 5 <other tag><good tag with something inside>"
    b: load/markup s
    while [not tail? b] [
        either tag? first b [
            either find/match first b "good tag" [
                print first b
            ][
                print rejoin ["X" to string! first b "X"]
            ]
        ][
            print first b
        ]
        b: next b
    ] |
Oldes 5-Oct-2006 [1488x3] | I think there is some limit in load/markup - I would not use it for large data |
And Rebolek, you can use this code of mine to remove unwanted tags (it's already here - posted a few days before - but with a little bug - this version should be OK, as I'm using it):
    remove-tags: func [html /except allowed-tags /local new x tag name tagchars] [
        if not string? html [return html]
        new: make string! length? html
        tagchars: charset [#"a" - #"z" #"A" - #"Z"]
        parse/all html [
            any [
                copy x to {<} copy tag thru {>} (
                    if not none? x [insert tail new x]
                    if all [
                        except
                        parse/all tag ["<" opt #"/" copy name some tagchars to end]
                        find allowed-tags name
                    ][
                        insert tail new tag
                    ]
                )
            ]
            copy x to end (if not none? x [insert tail new x])
        ]
        new
    ] | |
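A small usage sketch (the input string and expected results are invented, assuming the function behaves as described):
    remove-tags "<b>bold</b> and <script>evil()</script>"
    ; == "bold and evil()"              (all tags stripped, text kept)
    remove-tags/except "<b>bold</b> and <script>evil()</script>" ["b"]
    ; == {<b>bold</b> and evil()}       (only tags in the allowed list are kept)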
I'm thinking about improving it to be able to remove unwanted tag attributes as well | |
Rebolek 5-Oct-2006 [1491x2] | Thanks to everybody, I used Ladislav's example, as it is easily extendible to support more HTML entities than just "<" and ">" |
Oldes, I'm not removing any tags, I'm just 'translating' unwanted tags to html-entities | |
Oldes 5-Oct-2006 [1493x3] | It's up to you how you modify it, do what you need :-) |
And if you are converting html-entities, you may find this useful: http://oldes.multimedia.cz/rebol/html-entities_latest.rip | |
Do you think it would be possible to make a BNF to PARSE RULES converter? | |
Rebolek 5-Oct-2006 [1496] | Don't know. Is 'parse Turing-complete? :) |
Oldes 5-Oct-2006 [1497x4] | With such a converter we should theoretically be able to easily parse any language |
http://www.garshol.priv.no/download/text/bnf.html | |
...There are actually lots of programs that can be given (E)BNF grammars as input and automatically produce code for parsers for the given grammar. In fact, this is the most common way to produce a compiler: by using a so-called compiler-compiler that takes a grammar as input and produces parser code in some programming language.... | |
It looks like an interesting project for long winter evenings :-) | |
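As a small illustration of the correspondence (a sketch, not from the thread): an EBNF production such as digits ::= digit { digit } maps quite directly onto a PARSE rule:
    digit: charset "0123456789"
    digits: [some digit]
    parse/all "12345" digits    ; == true
A converter would mainly have to translate the (E)BNF notation for sequence, alternation, repetition and optional parts into the corresponding PARSE words.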
Anton 5-Oct-2006 [1501x4] | Well, I just spent two days making a matching algorithm for searching file contents, and I was considering making a "compile-rules" function (possibly similar to Gabriele's or someone else's). Looks like I don't have to make that for now, but my mind is in this place at the moment. I long for the day when I don't have to use filesystems at all (which obviates the need for file search programs) - hopefully we can stick all our info in a database soon. Probably an associative database. |
While on this topic - was it Gregg or Sunanda who made a mini dialect for a file contents matcher? That's the algorithm I just made, and I'm now interested to review other implementations. While developing I also came to an apparent cross-roads, a choice between a simple, "digital", logical algorithm and a more "fuzzy" algorithm with a ranking system like Google's. This reminded me of a discussion a while back where this point was made. | |
Example of the dialect. This finds all the files containing "anton" and "gabriele" but not "anthropomorphic":
    find-file *.r ["anton" "gabriele" not "anthropomorphic"]
This finds all the files containing "reichart" or "oldes":
    find-file *.r [some ["reichart" "oldes"]] | |
NOT, ALL and SOME can be combined recursively. | |
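A speculative sketch (not Anton's implementation; all names here are invented) of how such a spec block might be evaluated against the text of one file - the top level and ALL blocks require every item to match, SOME requires at least one item of its block, and NOT negates the item that follows it:
    match-one: func [text item] [
        either block? item [match-spec? text item] [found? find text item]
    ]
    match-spec?: func [text [string!] spec [block!] /local result item sub] [
        result: true                          ; implicit ALL at this level
        while [not tail? spec] [
            item: first spec
            spec: next spec
            either word? item [
                switch item [
                    not [                     ; NOT: the next item must not match
                        item: first spec  spec: next spec
                        if match-one text item [result: false]
                    ]
                    all [                     ; ALL: every item of the next block must match
                        item: first spec  spec: next spec
                        if not match-spec? text item [result: false]
                    ]
                    some [                    ; SOME: at least one item of the next block must match
                        item: first spec  spec: next spec
                        sub: false
                        foreach s item [if match-one text s [sub: true]]
                        if not sub [result: false]
                    ]
                ]
            ][
                if not match-one text item [result: false]
            ]
        ]
        result
    ]
    ; usage sketch (file name is hypothetical):
    text: read %some-script.r
    match-spec? text ["anton" "gabriele" not "anthropomorphic"]
    match-spec? text [some ["reichart" "oldes"]]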
Gregg 5-Oct-2006 [1505x2] | That one wasn't me (AFAIK). I did mini dialects for matching file dates and sizes, but not contents. I have an old AWK dialect, but that's as close as I've come. |
WRT BNF, it should be possible. I think Brett Handley did it, or the reverse, at one point; might be on codeconscious.com, not sure. I've also done something similar, for ABNF. It was built for a client, so I'd have to ask if it could be released. ABNF is what is used in a lot of RFCs, so it could be used on a lot of things for standards interop. | |
BrianH 5-Oct-2006 [1507] | The syntax on (E)BNF languages varies widely, widely enough that the parse dialect itself can be considered an EBNF language. |
Robert 9-Oct-2006 [1508] | The main problem I see is that a "normal" BNF parser checks all rules in parallel and uses the first match, whereas PARSE tries the alternatives sequentially and uses the first one that matches. So the rule when using PARSE is: always put the widest-matching rule first. For example, if you want to parse for: . .. ... you need to put the "..." first. Otherwise the rule will match a single "." first and be fired three times. |
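For example (a small sketch of the point, not from the thread):
    dots: ["..." (print "ellipsis") | ".." (print "two dots") | "." (print "dot")]
    parse/all "..." [some dots]    ; prints "ellipsis" once
    dots: ["." (print "dot") | ".." (print "two dots") | "..." (print "ellipsis")]
    parse/all "..." [some dots]    ; prints "dot" three times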
Pekr 9-Oct-2006 [1509] | which approach is considered better? |
Gregg 9-Oct-2006 [1510] | Neither is better. :-) |
BrianH 10-Oct-2006 [1511x2] | Actually Robert, "normal" BNF parsers usually have similar restrictions to the parse dialect, only more so. Shift-reduce parsers like yacc need the maximum width rule first; recursive-descent parsers need to be refactored extensively (in a way that is too complicated to go into now). The parse dialect is recursive-descent with backtracking, which in theory is less restricted than either LR (shift-reduce) or LL (recursive-descent). I tend to do LL refactoring on my parse rules just because that makes them faster, but it's nice that it is not always required, that I can do LR-style rules if I need to. |
Perhaps you are thinking of lexers that convert a source syntax with restrictions similar to those of regular expressions into a state machine. Those could be thought to operate in parallel (not really, but close enough), but the languages they accept are quite restricted compared to full parsers, let alone the parse dialect. | |
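As an illustration of the LL-style refactoring mentioned above (left-factoring a shared prefix out of the alternatives so it is only scanned once; the rules here are invented examples):
    ; before: both alternatives re-match the "get-" prefix
    rule: ["get-word" (print "get-word") | "get-path" (print "get-path")]
    ; after: the shared prefix is matched once
    rule: ["get-" ["word" (print "get-word") | "path" (print "get-path")]]
    parse/all "get-path" rule    ; prints "get-path"
Both versions accept the same input; the refactored one simply avoids backtracking over the common prefix.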
Maxim 10-Oct-2006 [1513] | <sigh> and I thought I was getting parse ;-) reading Brian's post... well, makes me reconsider ;-) |
BrianH 10-Oct-2006 [1514] | Sorry, I came to the parse dialect from a history of using and making parser generators. It's annoying that the behavior of parse and the tricks you can use to optimize your parse rules have all of these arcane CS terms referring to them. At least the parse dialect is a lot more flexible than most of those parser generators, and easier to write, use and debug too. |
james_nak 10-Oct-2006 [1515] | I have an easy one for you gurus. Let's say I want to parse a file and get all the "www..." out of it. The thing is that they end in either a space or a linefeed. How do I do a (written in pseudo parse to give you an idea) "to "www" copy tag to 'either a linefeed or a space'"? I've tried charsets, vars, blocks but the best I can do is one or the other. Note, finding the "www" is the easy part, it's ending the string that is giving me fits. Thanks in advance. |
Oldes 10-Oct-2006 [1516x2] | >> stoppers: charset " ^/^M^-"
    == make bitset! #{0026000001000000000000000000000000000000000000000000000000000000}
    >> validchars: complement stoppers
    == make bitset! #{FFD9FFFFFEFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF}
    >> test: {www.c.x xxxx www.x.x^/xxxx www.q.q}
    == "www.c.x xxxx www.x.x^/xxxx www.q.q"
    >> parse/all test [any [to "www" copy url some validchars (probe url)]]
    www.c.x
    www.x.x
    www.q.q
    == true |
It should be:
    parse/all test [any [to "www" copy url some validchars (probe url)] to end]
if you want to receive TRUE even if the string does not end with the URL | |
james_nak 10-Oct-2006 [1518] | Oldes, thanks. I get it. You one smart cookie. Gracias. |
Maxim 27-Oct-2006 [1519x7] | I am almost sure this question has been asked many times before... it's my turn :-) Is there a way for a parse rule to detect situations in which it should fail, because we have a succeeding rule which we know will match? |
ex: I am parsing: ABCZXYXYCBA | |
I have rules to parse ABC explicitly and a fall-back which can parse anything. | |
but I'd like to detect in my fall-back that it should stop, because I know I'm at the end. | |
the length of the string is not known until I hit the trailing CBA | |
note... the example is simplified; consider each character a different matching condition. | |
also, in reality, each letter in the above over-simplification is a word... not just one char (and there is overlap) so I can't just match charsets. | |
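One common workaround (a sketch, not an answer from the thread): try the terminating rule before the fall-back inside the loop, and let the fall-back consume only a single element at a time, so the terminator gets first chance at every position:
    str: "ABCZXYXYCBA"
    parse/all str [
        "ABC"
        some [
            "CBA" end (print "reached the trailing CBA")              ; the succeeding rule, tried first
            |
            here: skip (print ["fall-back consumed:" copy/part here 1])
        ]
    ]
In the real case each alternative would be a rule matching a whole word rather than a single character.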