World: r3wp

[Parse] Discussion of PARSE dialect

sqlab
6-Mar-2006
[903x2]
So it is

parse [1 2 3 4 b 5] [some [set val number! v: number! :v (print val)] to end (?? val)]
too late.(
Oldes
7-Mar-2006
[905x4]
Maybe someone will find this useful:
count-word-frequency: func[
	"Counts word frequency from the given text"
	text [string!] "text to analyse"
	/exclude ex [block!] "words which should not be counted"
	/local counts f word wordchars nonwordchars
][
	counts: make hash! 100000

 wordchars: charset [#"a" - #"z" #"A" - #"Z" "̊؎ύѪ"]
	nonwordchars: complement wordchars
	parse/all text [
		any nonwordchars
		any [
			copy word some wordchars (
				;probe word
				if any [not exclude none? find ex word][
					either none? f: find/tail counts word [
						repend counts [ word 1 ]
					][
						change f (f/1 + 1)
					]
				]
			)
			any nonwordchars
		]
	]
	counts: to-block counts
	sort/skip/compare/reverse counts 2 2
	new-line/skip counts true 2
]
If you know of other chars that should be included in the words, 
please let me know. It should now be complete for the Czech language, 
and I hope for Spanish too (as I use it to count Spanish words :).
found missing czech chars->  wordchars: charset [#"a" - #"z" #"A" 
- #"Z" "̊؎ύѪ"]
Oldes
13-Mar-2006
[909]
Is this a bug?
parse/all {"some words"} {" }
;== ["some words"]
parse/all {and "some words"} {" }
;== ["and" "some words"]
parse {and "some words"} {" }
;== ["and" "some" "words"]
parse {"some words"} {" }
;== ["some words"]
Geomol
13-Mar-2006
[910]
Good question! It's in a tough corner of REBOL - parsing. REBOL is 
in many ways more like a human language, than a computer language. 
Strictly speaking, you can argue, that those examples have a bug 
or two, but can you live with it? The behaviour might make it difficult 
to parse input strings, written by humans, because people write all 
sorts of things. (If it can go wrong, it will.)


Try changing the quotation marks to something else and see the results 
change, like:

>> parse/all {Xsome wordsX}{X }
== ["" "some" "words"]
Gabriele
13-Mar-2006
[911]
parse, without a rule, treats quotes specially. this is to allow 
parse to be used directly with things like csv data.
Oldes
14-Mar-2006
[912x2]
I think it's a bug! I was trying to use this to divide a large string 
into words and found that I got whole phrases in the result, instead of 
just words. It's a problem only if you have the divider on the edge.
In Geomol's example I would expect the result to be ["some" "words"], 
so it must be a bug - it's inconsistent.
Gabriele
14-Mar-2006
[914]
this behavior is the one intended by Carl. so, it's so by design, 
and not a bug. but, you may try to convince Carl that you don't like 
it. ;)
Oldes
14-Mar-2006
[915x5]
I still think it's a bug - I cannot see the difference between parse 
and parse/all in this example. If Carl doesn't want to fix it, no problem 
for me, I used a more complicated rule to do the same thing. I just still 
think it's a bug, and that it will confuse more people in the future as 
well.
But the truth is that in CSV it is logical to have: parse {,d ,d} {,} 
== ["" "d" "d"]
and parse {,"a b, d"  ,d} {,} == ["" "a b, d" "d"]  (so Carl is probably 
right ;-)
But the documentation should say that quotes are very special 
characters for this type of parsing!
There is also a bug in the docs: http://www.rebol.com/docs/core23/rebolcore-15.html
(section 2 - Simple Splitting) -> there is the sentence: "To avoid that 
action, you can use the /any refinement." It should say /all, as 
there is no /any refinement in parse!
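For splitting a string into plain words without the special quote handling discussed above, an explicit rule is one option. This is only a sketch, not the rule Oldes actually used; the function name `split-words` and the choice of delimiters (space and double quote) are assumptions made for illustration.

```rebol
; Hypothetical helper, not from the thread: split TEXT on spaces and
; double quotes, treating both as ordinary delimiters.
split-words: func [text [string!] /local words word delim non-delim] [
    words: copy []
    delim: charset { "}              ; space and double quote
    non-delim: complement delim
    parse/all text [
        any delim                    ; skip leading delimiters
        any [
            copy word some non-delim (append words word)
            any delim                ; skip delimiters after each word
        ]
    ]
    words
]

probe split-words {and "some words"}
; == ["and" "some" "words"]
```

Because the rule never gives quotes a grouping role, a delimiter on the edge of the string no longer changes the result.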
Graham
14-Mar-2006
[920]
oldes, rambo the documentation problem.
Oldes
14-Mar-2006
[921]
done
Thr
4-Apr-2006
[922]
.
Oldes
28-Apr-2006
[923]
I think it would be good to have some standard place for common parsing 
rules and charsets used in parse rules, like 'digits, 'spaces and 
others. What do you think?
Anton
28-Apr-2006
[924]
I like the idea in theory, but what are standard parse rules ? There's 
an argument already - look, I'm arguing ! :)

I would prefer to call the "digit" rule "digits". Also, for this 
example, it's faster to define and be clear with it:
	digit: charset "0123456789"
than being abstract: (even though it would become well known):
	digit: system/parse/rules/digit
JaimeVargas
28-Apr-2006
[925]
Oldes, a regex context would be a good addition, where the regexes are the 
basic rules for numbers, white space, *words* and their negations.
Oldes
28-Apr-2006
[926x5]
anton: I think that such parse rules don't have to be global variables; 
the names could still be usable inside a parse block. But it would probably 
be a security issue
regex would be very nice
the problem with the idea is, that we are mixing code and parse rules
but at least spaces and digits could be used - it means charsets 
- which could be available during parse without need to define it 
all the time
(but it was just an idea for how to improve the 'parse function)
Anton
28-Apr-2006
[931]
Hmmm......
Gregg
28-Apr-2006
[932x4]
I've thought about that as well. There are some base charsets we 
could probably standardize on, and that would be good (IMO). Beyond 
a few basics, though, consensus gets tough.
The singular/plural argument seems easy, but isn't (IMO); DIGITS 
could be done as SOME DIGIT, and you could argue that things like 
2 DIGITS reads better, though 1 DIGITS does not. You could double-define 
it, but that gets ugly too. So, what about DIG? That doesn't imply 
any singularity, though it's a bit terse, and not a full word (or, 
rather, the wrong full word).
I'm all for proposing some basics though. Worst case, you can override 
them, which is no more work than we do today.
space/spc
whitespace/wsp
alpha
digit(s)
alpha-num	; should digit be num?
ctl/control
non-US-ASCII/high-ASCII
quoted-string
escaped-char    ; what is the escape though; REBOL ^, C \, etc.?

What other standard sets would we want?
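A few of the base charsets from the list above could be defined roughly like this. This is a sketch only: the names follow Gregg's list, but the exact contents of each set (for example, which characters count as whitespace) are assumptions, and nothing here is an agreed standard.

```rebol
; Illustrative definitions only; the contents of each set are assumptions.
spc:       charset " "
wsp:       charset " ^-^/^M"         ; space, tab, newline, carriage return
alpha:     charset [#"a" - #"z" #"A" - #"Z"]
digit:     charset "0123456789"
alpha-num: union alpha digit

; The singular name reads naturally with a repeat count or SOME:
print parse/all "42" [2 digit]                           ; true
print parse/all "abc 123" [some alpha some wsp some digit] ; true
```

Defining the sets as plain words keeps them easy to override, which matches the point that the worst case is no more work than today.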
Sunanda
28-Apr-2006
[936]
(I was sure I'd posted this just after Oldes' message... But it 
ain't there now... Maybe it's in the wrong group)
Andrew has a nice starter set:

http://www.rebol.org/cgi-bin/cgiwrap/rebol/view-script.r?script=common-parse-values.r

And I know he has extended that list extensively to include things 
like email address and URL
Gregg
28-Apr-2006
[937x2]
It would be great (again, IMO), if we had parse rules for REBOL datatypes. 
For those that want the power of block parsing, with the ability 
to load strings that aren't valid REBOL, it would be very handy.
Good starter set! I forgot about that. Thanks Sunanda.
Graham
28-Apr-2006
[939x2]
the problem I find with block parsing is the rigid interpretation 
of datatypes.
So, if Rebol gets the datatype wrong ( and real world data is dirty 
), you're screwed.
Gregg
28-Apr-2006
[941]
That's the tradeoff. :\
Graham
28-Apr-2006
[942x3]
real world data is dirty ..
Maybe there should be no invalid datatypes .... everything can be 
converted to a datatype
if the parser thinks a datatype is invalid, well, let's call it an 
invalid! datatype!!
Gregg
28-Apr-2006
[945]
I think that's where string parsing comes in, and where having rules 
for REBOL datatypes would ease the pain.
Graham
28-Apr-2006
[946x3]
I do screen validation by datatypes ( for data input ).  If the user 
enters an invalid datatype ... ..
anyway, I think rebol should recognise all data ..
have a catchall for stuff it thinks is wrong
Oldes
30-Apr-2006
[949x2]
I agree with you Graham, I was mentioning this many times, that there 
could be something to handle datatype exceptions
About the spaces charset - most people do not know that we have one 
more space char - the non-breaking space:  >> to-char 160 <<
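A spaces charset that also covers the non-breaking space might look like the following. This is a minimal sketch; the name `spaces` and the other members of the set are assumptions.

```rebol
; Hypothetical charset: ordinary space, tab, newline and char 160
; (the non-breaking space mentioned above).
spaces: charset reduce [#" " tab newline to-char 160]

; Build a test string with a non-breaking space between "a" and "b".
text: join "a" [to-char 160 "b"]
print parse/all text ["a" some spaces "b"]   ; true
```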
Volker
1-May-2006
[951x2]
How about another way: integrate datatypes into the string parser. Basically 
a load/next and a check for the type.
Then we can write (note I parse a string): 
parse "1 a , #2" [ integer! word! "," issue! ]
'invalid! has a problem: it's easy to recognize where the wrong part 
starts, but harder to recognize where the wrong part ends.
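The "load/next and check for type" step can be approximated today with a small helper outside of parse. This is a rough sketch of the idea, not an implementation of the proposed parse feature; the helper name `next-of-type` is hypothetical.

```rebol
; Hypothetical helper (not from the thread): load one value from the
; front of STR and return the rest of the string if that value has the
; wanted datatype, otherwise NONE.
next-of-type: func [str [string!] type [datatype!] /local res] [
    ; LOAD/NEXT returns a block: [first-value rest-of-string]
    if error? try [res: load/next str] [return none]
    either type = type? first res [second res] [none]
]

rest: next-of-type "1 a , #2" integer!   ; loads the integer 1
rest: next-of-type rest word!            ; loads the word a
```

When the input cannot be loaded, the helper only reports failure at the current position. It has no way to say where the bad part ends, which is the difficulty noted above.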