World: r3wp
[Parse] Discussion of PARSE dialect
older newer | first last |
florin 25-May-2010 [5022] | Processing text and rules seems so natural to rebol. I think I'm going to enjoy this. |
Gregg 25-May-2010 [5023] | Once you get used to PARSE, it's very hard to use other languages. :-) |
Fork 11-Jul-2010 [5024x2] | I've uploaded my old Whitespace interpreter that I implemented in PARSE to GitHub: http://github.com/hostilefork/whitespacers/raw/master/rebol/whitespace.r |
What's interesting about it is the comparison that can be made to implementations in other languages. I've collected implementations of whitespace interpreters in several other languages (by other people) in that project: http://github.com/hostilefork/whitespacers/ | |
Anton 30-Jul-2010 [5026] | Ok, continuing the discussion from "Performance" group, I'd like to ask for some help with parsing rebol format files. Basically, I'd like to be able to extract a block near the beginning or end of a file, while minimizing disk access. The files to be parsed could be large, so I don't want to load the entire contents, but chunks at a time. So my parse rule should be able to detect when the input has been exhausted and ask for another chunk. (When extracting a block near the end of a file, I'll have to parse in reverse, but I'll try to implement that later.) |
Oldes 30-Jul-2010 [5027] | And why you don't want to use the load/next which was advised? |
Anton 30-Jul-2010 [5028x4] | Using LOAD/NEXT, I still have to use a O(n^2) algorithm. I'd now like to do my own parse, which can be O(n). |
As far as I know, there is no way to instruct LOAD/NEXT to only read chunks from the file as necessary to load the next value. | |
Which is why, in that algorithm, I had to iteratively: load a chunk, append it and try LOAD/NEXT until it succeeded. Which gives the algorithm O(n^2) performance. | |
My question for this O(n) parse algorithm is: What rebol syntax do I need identify? I suppose all I need is: - Comments (lines beginning with semi-colon ; ) - Strings (single-line "" and multi-line {} ) watching out for escape sequences eg. {^}} - Blocks | |
Oldes 30-Jul-2010 [5032] | yes |
Anton 30-Jul-2010 [5033] | Have I missed anything ? |
Oldes 30-Jul-2010 [5034x2] | just that comment may not be on line beginning so far |
and you want to get just the first block? Skipping any content before it? | |
Anton 30-Jul-2010 [5036x2] | Ok, yes, it may be preceded by whitespace. That seems easy to deal with. |
Well, I'd like the algorithm to be general enough to get any number of block in the file. So far I need just the first, second and last blocks. | |
Oldes 30-Jul-2010 [5038] | And you need to parse the Altme's *.set files or something else as well? |
Anton 30-Jul-2010 [5039x3] | The *.set files are what I need it for, yes. |
I imagine it could be useful in other similar situations, so I'd like it to be pretty general. I suppose a bonus functionality is to be able to get nested blocks. (And a super bonus will be to get any datatype at any level, but I won't bother doing that until I need it.) | |
Um, actually, I remember years ago I wanted to load the rebol header of many files, avoiding loading the whole of each file. | |
Oldes 30-Jul-2010 [5042] | Also char! must be supported: #"[" |
Anton 30-Jul-2010 [5043x2] | So this algorithm could be used for that, too. |
Must it ? I think if I can parse single-line strings correctly, then a bracket inside won't cause a problem. This means I'll be basically ignoring datatypes which allow strings in their syntax, and just jumping to the string part. | |
Oldes 30-Jul-2010 [5045] | you are right |
Anton 30-Jul-2010 [5046x8] | I don't think there's any way to make any type with a literal bracket in it (except blocks, of course). (But I am worrying about that a bit.) |
I tried to make some words with a single unmatched literal bracket, or literal string delimiter, but I failed so far. They don't load, so they won't be in well-formed rebol format files. | |
And then I tried issues, files, and I can't do it there, either. | |
One caveat: Misidentifying as a block, types like (what are they called?) "inline types"? eg. #[none] If I don't recognise it as none! (or maybe issue!) , then I might accidentally take it as a block. | |
Or how about this: #[block! [hello]] I would probably want to extract [hello], but if I don't recognise this form, then I'd get [block! [hello]] instead. | |
But that's ok, I can probably tweak the parse rule to support those later. | |
Does anyone have any advice on how I should structure this algorithm? I don't feel confident as I haven't studied parsing theory deeply. http://en.wikipedia.org/wiki/Parsing Should I do lexical analysis and syntactic analysis separately ? I think I can do it all with just one parse, but it might not be a good idea. | |
I'll make a start. | |
Oldes 30-Jul-2010 [5054x3] | try this: http://box.lebeda.ws/~hmm/rebol/load-first-block.r but it's not well tested. You can make the chunk size better. It's just 200 bytes just to not find the block easily in the first one. |
(bigger) | |
It will not be bug free yet.. the case like #[block! [hello]] will fail now as I'm reviewing it. | |
Anton 30-Jul-2010 [5057] | Having a look. Thanks for posting that. |
Oldes 30-Jul-2010 [5058] | Also the chunk inside comment line will fail now.. so it's far to be complete yet. |
Anton 30-Jul-2010 [5059] | I have to say you pulled that out pretty quickly. |
Oldes 30-Jul-2010 [5060] | the load-last-block function would be more complicated. |
Anton 30-Jul-2010 [5061x2] | I just found something interesting. I remember Gabriele saying he thought PARSE would convert chars it encountered in its rule with strings before using, so these are equivalent: parse "a" [#"a"] parse "a" ["a"] (Of course, the first one is a char and not a string, so consumes less memory.) But I was just thinking it might be clearer to use strings instead of chars in the parse rule. Then I discovered you can use issues: parse "a" [#a] and the escape characters is interesting as you only need to type one of them in the issue: parse "^^" [#^] |
Anyway, that's a side-issue. | |
Oldes 30-Jul-2010 [5063] | I would not use it.. I think you chould use char! where you expect char!. btw... the script is updated, but would require some test-cases (for chunk bordes inside strings/blocks/comments) to make it safe for use. |
Anton 30-Jul-2010 [5064x2] | Thankyou very much for this contribution. |
It's time for me to sleep. Good night. | |
BrianH 30-Jul-2010 [5066] | Anton, the cost of disk reads dwarfs the cost of LOAD/next. And PARSE is much slower at loading REBOL data than LOAD. You might consider finding out the max size of the value you are loading, rounded up to multiples of 4096 (disk blocks), and just READ/part a bit more than that from the disk for each file. Then LOAD/next from the resulting string. There is no reason to do speculative reads once you have an upper bound on the size you will need to read. In a language like REBOL, minimizing disk reads sometimes means minimizing the number of calls to READ, not just the amount read. |
Oldes 30-Jul-2010 [5067x3] | My script is much more faster.. when I do: tm: func [count [integer!] code [block!] /local t][t: now/time/precise loop count code probe now/time/precise - t] tm 10 [ dir: %/F/REBOL\altme\worlds\rebol3\chat\ foreach file read dir [ load-first-block dir/:file ] ] tm 10 [ dir: %/F/REBOL\altme\worlds\rebol3\chat\ foreach file read dir [ first load/next dir/:file ] ] the result is: 0:00:01.047 0:00:06.39 |
of course with chunk size of 1024 | |
0:00:00.968 for chunk 512 | |
BrianH 30-Jul-2010 [5070x2] | Oldes, did you notice that I wrote READ/part, not READ ? |
And loading from the resulting string, not the file? | |
older newer | first last |