World: r3wp
[Parse] Discussion of PARSE dialect
BrianH 24-May-2007 [1801] | That should be pretty fast, and it doesn't involve huge binary temporaries. |
Gregg 24-May-2007 [1802] | http://www.codeconscious.com/rebol/scripts/bitsets.r |
BrianH 24-May-2007 [1803x3] | To compare those, it looks like repeat x 256 [ c: to-char x - 1 if find b c [append s c] ] would be faster than for i 0 255 1 [ if parse/all to-string test-char: to-char i reduce [ bitset ] [ append result test-char ] ] because of the reduce, the to-string, the parse, and the use of the mezzanine for instead of the native repeat. |
Yours is more flexible, though. | |
Lots of other interesting stuff on that site. | |
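[Editor's sketch] BrianH's one-liner above can be spelled out as follows. This is a minimal sketch assuming REBOL 2; the sample bitset b is an assumption made up for illustration and is not taken from the bitsets.r script.

```rebol
; Hypothetical sample bitset; any charset would do here.
b: charset [#"a" - #"e"]

; Collect every character that is a member of the bitset.
s: copy ""
repeat x 256 [
    c: to-char x - 1            ; repeat counts from 1, chars run 0..255
    if find b c [append s c]    ; find on a bitset tests membership
]
print s
```

The loop uses only natives (repeat, find, append), which is the reason given above for expecting it to beat the for/parse version.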
Gregg 24-May-2007 [1806] | Yes, Brett has built a lot of very cool stuff. Haven't seen him around for a while though. |
Rebolek 26-May-2007 [1807] | So, this is my first attempt to do regular expressions in REBOL. Type on your console: do http://bolek.techno.cz/reb/regex.r Some things are missing and it can sometimes run into an endless loop when it shouldn't, so please be benevolent :) But at least the email regex can be translated and parsed successfully. |
Henrik 26-May-2007 [1808] | it's quite small |
Oldes 26-May-2007 [1809x2] | nice:)... just you should print regex not regexp in the example (or rename the function) |
and... it would be good to have just a function which returns the translated Rebol parse block | |
Graham 26-May-2007 [1811] | regex rules look quite complicated! |
Rebolek 26-May-2007 [1812x3] | Graham yes, they are :) |
Oldes: regex vs. regexp on google is 12mil vs 8mil, so that's the reason, but I can rename it (it was called regexp yesterday ;) | |
And yes, function returning just parse rules will be done, this is just a work in progress | |
Oldes 26-May-2007 [1815x2] | I don't mind how you call the function.. just the printed one is different from the one used, which is confusing, as the first thing I did was copy and paste what you printed and got an error that the regexp function does not exist |
and anyway... 12 or 8 million google results is not a big difference if your page is not listed within the first 20 pages:) | |
Rebolek 26-May-2007 [1817] | oh I see now what you mean...that print is from test function, I'll change it. And that google score - I used it just to compare what name will be better, I in no way expect to be there with my page :) It's just what I do when I'm not sure how some word is spelled (svatba vs. svadba and so on ;) - I put both terms in Google and the one with better score is probably the right one ;) |
Oldes 26-May-2007 [1818x2] | you can use... http://www.googlefight.com/ or make a Rebol version... it's quite easy |
if you were talking about bitsets.... it reminded me that it would be good to have some common rules in Rebol3 available for parsing... | |
Rebolek 26-May-2007 [1820x2] | I had googlefight in Krabot, if you remember :) |
like characters and digits? | |
Oldes 26-May-2007 [1822] | I mean.... I have to write.... spaces: charset " ^/^-^M" and so on on so many places in code |
Rebolek 26-May-2007 [1823] | and whitespaces, yes |
Oldes 26-May-2007 [1824x2] | it could be working like in stylize in VID |
but maybe it's not so important... | |
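[Editor's sketch] One way to avoid redefining the same charsets in every script is to keep them in a single shared context. The following is a minimal sketch assuming REBOL 2; the name common-rules and the words inside it are hypothetical and are not part of any REBOL release.

```rebol
; Hypothetical shared rule set; all names here are illustrative only.
common-rules: context [
    spaces: charset " ^/^-^M"
    digits: charset [#"0" - #"9"]
    ; Rules defined inside the context can refer to its charsets directly.
    number: [any spaces some digits]
]

; The rule block is fetched by path before parse sees it.
print parse/all "  42" common-rules/number   ; true
```

Defining the composite rule inside the context keeps the words spaces and digits bound to the shared object, so callers only need the one path.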
Rebolek 26-May-2007 [1826] | in the file I posted is a function REGSET that converts a small bit of regex to a bitset, its syntax seems to be easier than charset's syntax (charset [#"a" - #"z" #"0" - #"9"] vs regset "a-z0-9") |
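[Editor's sketch] For comparison, this is how the charset form quoted above is used with parse. It is a minimal sketch assuming REBOL 2; the word alnum is an assumption, and REGSET itself is not reproduced here because its source is not part of this log.

```rebol
; The block form of charset takes char ranges written as  lo - hi.
alnum: charset [#"a" - #"z" #"0" - #"9"]

print parse/all "abc123" [some alnum]    ; true
print parse/all "abc-123" [some alnum]   ; false, "-" is not in the set
```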
Gregg 26-May-2007 [1827x2] | Very nice Boleslav! What regex engine/syntax are you going for compatibility with (if any)? Charset syntax is probably that way because it's a dialect, and Carl wanted a string as input to be easy, without escapes and such; just my guess. |
Graham, the best book I know of on regexs is Jeff Friedl's 'Mastering Regular Expressions'. He has an email validating regex (i.e. it just matches the RFC822 spec for an email address) which is almost 5K IIRC. | |
Rebolek 26-May-2007 [1829] | Thanks Gregg. I started with some examples from http://regular-expressions.info, just to see if I can do it, so after fixing bugs and adding some features still missing, I'll see what next, if anything. There are already some incompatibilities - caret ^ cannot be used in rebol strings the way it's used in regex, so I'm using tilde ~ instead. |
BrianH 26-May-2007 [1830x5] | There are several different regex dialects. Are you following one of those, or making another? |
(Running commentary as I read your code) | |
You should wrap your code in a context. | |
You should separate the regex compilation phase from its application phase, and just write a wrapper that calls both in order. The compilation phase is often more complex than just applying the results, so if you are using the regex repeatedly you should just compile it once. | |
i like regset, in theory. You seem to be applying block parse syntax to strings - it could be simpler. | |
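[Editor's sketch] The compile-once, apply-many split suggested above could look roughly like this. It is a minimal sketch assuming REBOL 2; compile-set, apply-rule and match-set are hypothetical names, and the "compiler" here only handles a plain string of allowed characters rather than real regex syntax.

```rebol
; Compilation phase (hypothetical): build the parse rule once.
compile-set: func [chars [string!]][
    reduce ['some charset chars]
]

; Application phase: run an already compiled rule against some text.
apply-rule: func [rule [block!] text [string!]][
    parse/all text rule
]

; Convenience wrapper that calls both phases in order.
match-set: func [chars [string!] text [string!]][
    apply-rule compile-set chars text
]

digits-rule: compile-set "0123456789"    ; compiled once
foreach text ["123" "4a5" "42"][
    print [text apply-rule digits-rule text]   ; reused for every input
]
```

The wrapper stays convenient for one-off matches, while repeated matching pays the compilation cost only once.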
Rebolek 26-May-2007 [1835] | I wanted to do some basic regex set, some common denominator, partly as an exercise in parse and partly to stop people looking at rebol saying "it has no regex". so now it has ;) So I'm not sure what dialect of regex is the best one. Block parse syntax to strings - you mean that char! ? Yes, it's probably not doing anything, the code needs some cleanup. I plan to wrap it in context, yes. |
BrianH 26-May-2007 [1836] | The char! 1 is useless. I am looking at that code now (reworking it). |
Rebolek 26-May-2007 [1837] | Oldes, I thought about just a translator from regex to parse rules and I'm not sure it will be easy, I'm using my 'tail-parse that matches rules in reversed order, which is better for regex syntax. Maybe there's some other way. |
BrianH 26-May-2007 [1838] | Reverse order is better? I think you aren't backtracking right in your generated rules. |
Rebolek 26-May-2007 [1839] | this is the problem with [some "a" "a"]. This is equivalent of "a*a" in regex which is perfectly valid, but problematic in parse. This is simple example, but it can get quite complicated so I'm not sure I can handle all cases. The reversed order seemed simpler. But you will probably prove me wrong :) |
BrianH 26-May-2007 [1840x2] | Are you supporting grouping? Haven't gotten to that point in the code yet. |
Do regex dialects support decreasing character ranges in their sets? | |
Rebolek 26-May-2007 [1842] | grouping - just the bitsets created with regset, groups like [cat|dog] not yet supported |
BrianH 26-May-2007 [1843x2] | Which regex dialect does grouping with [ and ], I thought they used ( and ). |
Are you supporting . wildcard characters? | |
Rebolek 26-May-2007 [1845x4] | oh yes you're right...just an old rebol habit, using square brackets everywhere :) |
wildcards - well dot is any character except newline | |
Right now it should support regex according to http://www.regular-expressions.info/quickstart.html | |
more or less | |
BrianH 26-May-2007 [1849x2] | BTW, "a*a" is directly equivalent to [any "a" "a"], not some. |
It still won't work directly. | |
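[Editor's sketch] The reason [any "a" "a"] does not work directly is that any is greedy and parse does not go back into a finished repetition to give characters up. A minimal sketch assuming REBOL 2 follows; the rule name a-then-a is hypothetical, and the recursive rule is just one possible workaround, not necessarily what either participant ended up using.

```rebol
; any consumes every "a", so the trailing "a" has nothing left to match.
print parse/all "aaa" [any "a" "a"]      ; false

; A recursive rule gets backtracking through the | alternative:
; try "a" followed by the rest, otherwise accept a final "a".
a-then-a: ["a" a-then-a | "a"]
print parse/all "aaa" a-then-a           ; true
print parse/all "a" a-then-a             ; true
```

When the innermost recursion fails at the end of the input, parse falls back to the second alternative one level up, which is the backtracking that the flat any form lacks.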