World: r3wp

[Parse] Discussion of PARSE dialect

Ammon
17-Apr-2009
[3643]
Using your code to do the same thing...

match: func [ data rules ] [
	parse rules [
		SOME [
			set L lit-word! blk: ( either equal? L reduce first data [ data: next data ] [ blk: tail blk ] ) :blk |
			set W word! ( set :W first data  data: next data )
		]
	]
]
Graham
23-Apr-2009
[3644]
I'd like to take an English sentence and tidy it up.  I want to automatically 
apply English grammar to it ... so capitalize the first letter after 
a period, and remove extraneous spaces, e.g. a space before a comma. 
 Has anyone done anything like this with 'parse?
Ammon
24-Apr-2009
[3645]
Not yet but I've been thinking about it for quite a while now... 
 I think I have a pretty good idea what the parse rules should look 
like but I haven't written any code for it yet.
Steeve
24-Apr-2009
[3646]
Good start...

letter: charset [#"a" - #"z" #"A" - #"Z"]
dirt: complement letter
word: [some letter]
clean: [here: dirt :here (remove here)]
space: [here: (insert here #" ") skip]
capital: [here: letter (uppercase/part here 1)]
sentence: [
	some [
		  capital opt word break
		| clean
	]
	any [
		  [#";" | #","] any clean space word
		| #"." any clean space capital opt word
		| #" " word
		| clean
	]
]

parse/all text: {test  test . test;; test ..test } sentence
probe text
>>"Test test. Test; test. Test"
Janko
24-Apr-2009
[3647x2]
I once made auto-capitalising of first words for some bot ... it 
wasn't anything special; I can find the code and send it to you
ah, Steeve's already works
Steeve
24-Apr-2009
[3649]
Has to be enhanced indeed
Graham
24-Apr-2009
[3650]
Hey, nice start ...
Steeve
24-Apr-2009
[3651]
indeed, i'm nice
Graham
24-Apr-2009
[3652x2]
:)
have to add #"'" ie. ' to the letter charset
Steeve
24-Apr-2009
[3654x2]
#"-" too, and what about the numbers?
for #"'" you should add a rule to remove spaces
Janko
24-Apr-2009
[3656]
Mine was meant so I could make pretty texts with all upper case in 
some search engine ... maybe it doesn't work that great in all cases.

smart-uc-after: func [ str sep ] [

	parse str [ ANY [ thru sep mark: ( uppercase/part trim mark 1 insert mark " " ) :mark ] ]
	str
] 

smart-case: func [ str ] [
	calc-with X [ 	
		[ lowercase str ]
		[ uppercase/part X 1 ]
		[ smart-uc-after X "." ]
		[ smart-uc-after X "?" ]
		[ smart-uc-after X "!" ]
]]
>> smart-case "HI HOW ARE YOU! we will go. bye!"
== "Hi how are you! We will go. Bye! "
Graham
24-Apr-2009
[3657]
numbers aren't usually part of words.  Unless it's a trademark like 
3M
Janko
24-Apr-2009
[3658x2]
but mine is also worse because it does 3 parses instead of one like 
Steeve's
calc-with: func [ 'wrd bs ] [  foreach b bs [ set wrd do b ] ] ; it uses this func also
Graham
24-Apr-2009
[3660]
Steeve's looks faster :)
Janko
24-Apr-2009
[3661]
yes, I agree :)
Steeve
24-Apr-2009
[3662x4]
this is the rule for #"'" 
| #"'" any clean word
with that you suppress unwanted spaces.
it'  s a good day
 --> "it's a good day"
so don't add #"'" as a valid letter
Graham
24-Apr-2009
[3666]
ahh ...
Steeve
24-Apr-2009
[3667]
do as you want... :-)
Graham
24-Apr-2009
[3668x2]
trailing "." or "," gets lost
Also, I think I have to add ' to the letter charset because words ending 
in s can have a trailing ' for possession ...
Steeve
24-Apr-2009
[3670]
but what if they have inserted a space after or before ' ?
Graham
24-Apr-2009
[3671]
so, Miles' wallet and not Miles's wallet
Steeve
24-Apr-2009
[3672x4]
hum ok, but you could handle that specific case with a different rule
of course, for the trailing #".", just add 
| #"." end
or better change the rule: 
| #"." any clean [end | space capital opt word]
parse is just amazing for such simple grammar.
A simple addition and it's doing all you want.
Pekr
3-May-2009
[3676]
Have I found a parse bug?

1)

>> parse/all {zybc} [ some ["b" break | "y"  break | skip] copy result thru "c" (print result)]
bc
== true

2)

>> parse/all {zybc} [ some ["b" break| "y"  break | skip] copy result thru "c" (print result)]
** Script Error: break| has no value

** Near: parse/all "zybc" [some ["b" break| "y" break | skip] copy result thru "c" (print result)]

3)

>> parse/all {zybc} [ some ["b" break | "y"  break| skip] copy result thru "c" (print result)]
== false


Such stupid bugs are really making the testing process difficult. 
I wondered for at least 5 minutes why the result of case 3 was wrong, 
and then I tried adding a space after the second break, and the code 
was corrected. How is it that the second break| does not report an error? ;-)
shadwolf
3-May-2009
[3677x3]
3) is like 2): you put the | too close to the second break. I noticed 
strange reactions in rebol 2 with a multi-case find too
in rebol 2, for example, if you do if not find str any ["!" ";" ] 
that will work if you have str with "!" in it but not with ";", but 
if you invert the position of your find arguments like this [";" "!"] 
then you detect when str has ";" in it but not when it has "!" 
in it
>> str: "tot!;"
== "tot!;"
>> if not find str any["!" ";"] [ print "found it!!"]
== none
>> str: "tot;"
== "tot;"
>> if not find str any["!" ";"] [ print "found it!!"]
found it!!
Pekr
3-May-2009
[3680x2]
Shadwolf - but that is your bug ;-) Simply put, you are trying to mix parse-like 
behaviour with how 'any behaves. 'any and 'all are just functions, 
so 'any returns the first value in its block that evaluates as true, so any 
["!" ";"] always returns "!", because a string is evaluated as 'true.
... so the code above behaves correctly, because in the second case 
your string does not contain "!"
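
A minimal sketch of Pekr's point, assuming REBOL 2. The helper name has-any? is hypothetical and not from the thread. It shows that any simply hands back the first string, and one possible way to test for either character with a charset instead:

```rebol
REBOL [Title: "any versus find (sketch)"]

; ANY evaluates its block and returns the first value that is not none/false.
; A string always counts as true, so the second string is never reached.
probe any ["!" ";"]                 ; prints "!"

; Hypothetical helper (an assumption, not from the thread):
; true when STR contains at least one of the characters in CHARS.
has-any?: func [str [string!] chars [string!]] [
    found? find str charset chars
]

probe has-any? "tot!;" "!;"         ; true
probe has-any? "tot;"  "!;"         ; true
probe has-any? "tot"   "!;"         ; false
```

Using a charset keeps it to a single find call; writing any [find str "!" find str ";"] would be another way to get the same test.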
Dockimbel
3-May-2009
[3682x2]
Pekr, try to run your 2) and 3) in trace mode, you'll see that there's 
no bug, parse rules evaluation looks consistent to me.
In 3), the second 'break| doesn't report an error because it's never 
evaluated. The rule fails on the first input character when trying 
to match "y", and 'skip is never reached. In 2), 'skip helps consume 
the input until the "b" character, which leads to evaluating 'break| 
and raises the error.
Pekr
3-May-2009
[3684]
yes, you might be right doc. But - it is really very difficult to 
track down for a user. It almost looks like a scanner bug, but it is 
not. What actually happens in case 3) is that "break|" is being 
considered a regular word, which just does not have a value. That 
also means that 'skip is not a separate alternative of the OR expression. So, 
the 'some block fails on not matching "y" ....
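
The scanner behaviour described above can be checked with a short sketch, assuming REBOL 2. It only inspects how the rule block is loaded, then reruns the failing rule from case 3 with the copy part replaced by to end for brevity:

```rebol
REBOL [Title: "break| is one word (sketch)"]

; No space before the bar: the scanner yields a single word, break|
probe type? first [break|]            ; prints word!
probe length? ["y" break| skip]       ; prints 3 ("y", break|, skip)

; With the space, BREAK and | are two separate words
probe length? ["y" break | skip]      ; prints 4

; Case 3 therefore has only two alternatives: ["b" break] and ["y" break| skip].
; On "zybc" neither matches at "z", so SOME fails before break| is ever fetched.
probe parse/all "zybc" [some ["b" break | "y" break| skip] to end]   ; prints false
```

Because parse only looks up a word when it reaches it, an unset word such as break| stays silent on any path that is never taken.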
Graham
16-May-2009
[3685x3]
Here's a parse question for the experts.
If I have a document with headings eg. a: b: .. z: and text optionally 
under each heading ... would it be possible to use parse to collect 
all the text from each heading if the headings are in any order and 
some headings with no text are optionally missing?
Each heading can only occur once in the document.
Maxim
16-May-2009
[3688]
sure
Graham
16-May-2009
[3689]
Ok, let me rephrase that .. sure it's possible, but I can imagine 
it would be quite complicated
Maxim
16-May-2009
[3690x2]
now was that a question of the "can you give me the solution" kind?
actually it can be done quite simply... depends on the headers themselves...
Graham
16-May-2009
[3692]
It's a little complicated because the headers can have spaces in 
them.