r3wp [groups: 83 posts: 189283]
  • Home
  • Script library
  • AltME Archive
  • Mailing list
  • Articles Index
  • Site search
 

World: r3wp

[Parse] Discussion of PARSE dialect

Oldes
30-Jul-2010
[5035]
and you want to get just the first block? Skipping any content before 
it?
Anton
30-Jul-2010
[5036x2]
Ok, yes, it may be preceded by whitespace. That seems easy to deal 
with.
Well, I'd like the algorithm to be general enough to get any number 
of block in the file. So far I need just the first, second and last 
blocks.
Oldes
30-Jul-2010
[5038]
And you need to parse the Altme's *.set files or something else as 
well?
Anton
30-Jul-2010
[5039x3]
The *.set files are what I need it for, yes.
I imagine it could be useful in other similar situations, so I'd 
like it to be pretty general.

I suppose a bonus functionality is to be able to get nested blocks.

(And a super bonus will be to get any datatype at any level, but 
I won't bother doing that until I need it.)
Um, actually, I remember years ago I wanted to load the rebol header 
of many files, avoiding loading the whole of each file.
Oldes
30-Jul-2010
[5042]
Also char! must be supported:  #"["
Anton
30-Jul-2010
[5043x2]
So this algorithm could be used for that, too.
Must it ?

I think if I can parse single-line strings correctly, then a bracket 
inside won't cause a problem.

This means I'll be basically ignoring datatypes which allow strings 
in their syntax, and just jumping to the string part.
Oldes
30-Jul-2010
[5045]
you are right
Anton
30-Jul-2010
[5046x8]
I don't think there's any way to make any type with a literal bracket 
in it (except blocks, of course). (But I am worrying about that a 
bit.)
I tried to make some words with a single unmatched literal bracket, 
or literal string delimiter, but I failed so far. They don't load, 
so they won't be in well-formed rebol format files.
And then I tried issues, files, and I can't do it there, either.
One caveat:

Misidentifying as a block, types like (what are they called?) "inline 
types"?
eg.  #[none]

If I don't recognise it as none! (or maybe issue!) , then I might 
accidentally take it as a block.
Or how about this:
#[block! [hello]]

I would probably want to extract  [hello], but if I don't recognise 
this form, then I'd get  [block! [hello]]  instead.
But that's ok, I can probably tweak the parse rule to support those 
later.
Does anyone have any advice on how I should structure this algorithm?

I don't feel confident as I haven't studied parsing theory deeply.
http://en.wikipedia.org/wiki/Parsing

Should I do lexical analysis and syntactic analysis separately ?

I think I can do it all with just one parse, but it might not be 
a good idea.
I'll make a start.
Oldes
30-Jul-2010
[5054x3]
try this: http://box.lebeda.ws/~hmm/rebol/load-first-block.r

but it's not well tested. You can make the chunk size better. It's 
just 200 bytes just to not find the block easily in the first one.
(bigger)
It will not be bug free yet.. the case like #[block! [hello]] will 
fail now as I'm reviewing it.
Anton
30-Jul-2010
[5057]
Having a look. Thanks for posting that.
Oldes
30-Jul-2010
[5058]
Also the chunk inside comment line will fail now.. so it's far to 
be complete yet.
Anton
30-Jul-2010
[5059]
I have to say you pulled that out pretty quickly.
Oldes
30-Jul-2010
[5060]
the load-last-block function would be more complicated.
Anton
30-Jul-2010
[5061x2]
I just found something interesting.

I remember Gabriele saying he thought PARSE would convert chars it 
encountered in its rule with strings before using, so these are equivalent:
	parse "a" [#"a"]
	parse "a" ["a"]

(Of course, the first one is a char and not a string, so consumes 
less memory.)

But I was just thinking it might be clearer to use strings instead 
of chars in the parse rule.
Then I discovered you can use issues:
	parse "a" [#a]

and the escape characters is interesting as you only need to type 
one of them in the issue:
	parse "^^" [#^]
Anyway, that's a side-issue.
Oldes
30-Jul-2010
[5063]
I would not use it.. I think you chould use char! where you expect 
char!. btw... the script is updated, but would require some test-cases 
(for chunk bordes inside strings/blocks/comments) to make it safe 
for use.
Anton
30-Jul-2010
[5064x2]
Thankyou very much for this contribution.
It's time for me to sleep. Good night.
BrianH
30-Jul-2010
[5066]
Anton, the cost of disk reads dwarfs the cost of LOAD/next. And PARSE 
is much slower at loading REBOL data than LOAD. You might consider 
finding out the max size of the value you are loading, rounded up 
to multiples of 4096 (disk blocks), and just READ/part a bit more 
than that from the disk for each file. Then LOAD/next from the resulting 
string. There is no reason to do speculative reads once you have 
an upper bound on the size you will need to read. In a language like 
REBOL, minimizing disk reads sometimes means minimizing the number 
of calls to READ, not just the amount read.
Oldes
30-Jul-2010
[5067x3]
My script is much more faster.. when I do:

tm: func [count [integer!] code [block!] /local t][t: now/time/precise 
loop count code probe now/time/precise - t]
tm 10 [
	dir: %/F/REBOL\altme\worlds\rebol3\chat\
	foreach file read dir [
		load-first-block dir/:file
	]
]

tm 10 [
	dir: %/F/REBOL\altme\worlds\rebol3\chat\
	foreach file read dir [
		first load/next dir/:file
	]
]

the result is:
0:00:01.047
0:00:06.39
of course with chunk size of 1024
0:00:00.968 for chunk 512
BrianH
30-Jul-2010
[5070x3]
Oldes, did you notice that I wrote READ/part, not READ ?
And loading from the resulting string, not the file?
If you are reading from a hard drive there is no point to using a 
chunk size of less than 4096. For floppies, 512 will do.
Oldes
30-Jul-2010
[5073x3]
load-first-block2: func[file /local port buffer result tmp ][
	port: open/direct file
	buffer: copy ""
	result: none
	chunks: 0
	until [
		chunks: chunks + 1
		if any [
			chunks > 10
			none? tmp: copy/part port 512
		] [close port return none]
		insert tail buffer tmp
		not error? try [result: first load/next buffer]
	]
	close port
	result
]
is faster... a little bit... 0:00:00.859
But we are in "parse" topic so we were making parse solution.
BrianH
30-Jul-2010
[5076]
LOAD parses.
Oldes
30-Jul-2010
[5077x3]
ok.. the precise topic is: Parse (Discussing of PARSE dialect)... 
anyway.. I started as well asking why not to use load/next.
btw.. the second function is not complete as it does not check for 
block but any rebol value.
Btw... what would be best way to get the last block of the file without 
loading complete file?
Steeve
30-Jul-2010
[5080]
impossible
BrianH
30-Jul-2010
[5081x2]
Your solution is similar to what I suggested, but is missing a couple 
speedups:

- Getting an estimate of how many characters the value would take 
on the high end, and using that as the initial read part.

- Chunk value reflecting hard disk sector value (the OS may load 
in 4096 byte blocks)
Last block... OPEN/seek perhaps, then SKIP ?
Anton
31-Jul-2010
[5083x2]
BrianH, finding out the size that is sure to encompass the desired 
block in all my input files requires prescanning the entirety of 
all of the files at least once. That's a good optimization for my 
specific case, but I want to make the function general enough that 
it can be used in other situations where the data may not be so consistent, 
and the desired block may not be always near the beginning or end.
So I'm going to persist with PARSE.