World: r3wp

[Parse] Discussion of PARSE dialect

Steeve
16-May-2009
[3732]
Hmm...
Maxim
16-May-2009
[3733x3]
implementing the latter solution... this is easier
here you go  :-)


data: {CC:
Patient complains of sore throat.

HPI:
ONSET: Sudden, TIMING: Constant, DURATION: 3 days

INTENSITY: Moderate, QUALITY: Burning, MODIFYING FACTORS: head position

CURRENT MEDICATIONS:
TYLENOL W/ CODEINE NO. 3 300MG;30MG 1-2 po q 4-6 hrs prn "pain"
cyclobenzaprine Oral Tablet 10 MG 1 tab po TID prn "muscle spasm"

MEDICAL HISTORY:
Rheumatic heart disease, unspec. 391.9
Eczema, atopic dermatitis 691.8
dyslipidemia

ALLERGIES:
Penicillin - allergy: Allergy
Penicillin - allergy: Allergy
Penicillin - anaphylactic reaction
lovastatin - allergy: allergic
macrodantin - 1 po BID

SURGERIES:
}

data: parse/all data "^/"


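header-lbl: ["CC" | "HPI" | "ONSET" | "INTENSITY" | "CURRENT MEDICATIONS"
| "MEDICAL HISTORY" | "ALLERGIES" | "SURGERIES"]

spec: []
foreach line data [
	unless parse/all line [
		copy hdr [header-lbl ":"]
		here:
		(
			; "CURRENT MEDICATIONS:" becomes the set-word CURRENT-MEDICATIONS:
			append spec to-set-word head remove back tail replace/all hdr " " "-"
			; the rest of the header line starts the field's value
			append spec copy/part here tail line
		)
		; consume the rest of the line, otherwise PARSE returns false for
		; "ONSET: Sudden, ..." and the line is appended a second time below
		to end
	][
		; not a header line: append it to the current field
		if string? item: last spec [
			append item line
		]
	]

]

probe context spec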
ok for you?
Steeve
16-May-2009
[3736]
Assuming SRC: contains the source text, it seems to work too:

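header-char: complement charset "^/:"	; any char except newline and colon
EOL2: rejoin [newline newline]	; a blank line ends a section
parse/all src [
	some [
		; header name: spaces become "-" so the name can load as a word
		some [pos: #" " (change pos #"-") | header-char]
		; the newline after the colon becomes " {", opening a string
		#":" pos: newline (change/part pos " {" 1)
		; the blank line (or the end of the text) gets "} ", closing the string
		[to EOL2 | to end] pos: (change pos "} ") skip skip
	]
]
probe construct to block! src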
Graham
16-May-2009
[3737x2]
Yes ... but I'm going to have to study Steeve's
to see why it doesn't work yet
Steeve
16-May-2009
[3739]
it will not work if you have CRLF instead of newlines in the source.
Is that the case?
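
A minimal sketch of guarding against that, assuming the text sits in a string named SRC as above. The sample value is made up for illustration; the idea is simply to turn every CR LF pair into a single newline before the parse rules run.

```rebol
; Hypothetical sample: the same kind of text, but with CRLF line endings.
src: "CC:^M^/Sore throat.^M^/^M^/HPI:^M^/3 days^M^/"

; Replace every CR LF pair with a single newline, in place.
replace/all src "^M^/" "^/"

probe src
```

After this step a blank line is two consecutive newline characters again, which is what the EOL2 rule looks for.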
Graham
16-May-2009
[3740]
I just copied it from here.
Steeve
16-May-2009
[3741]
i mean for your source data, not for my code
Graham
16-May-2009
[3742]
that's what I meant .. I just copied the source data from here.
Steeve
16-May-2009
[3743x2]
ok, it works for me
i retry
Graham
16-May-2009
[3745x3]
working now.
Actually yours appears to be the better solution because you don't 
specify the headers
and just pick it up from the formatting of the text
Steeve
16-May-2009
[3748]
yep
Graham
16-May-2009
[3749]
well, I'm impressed :)
Steeve
16-May-2009
[3750]
you should not
Graham
16-May-2009
[3751]
sadly I am.
Graham
17-May-2009
[3752]
the parser dies when there is something like "2.5mg" in the text 
with an invalid decimal error.
Steeve
17-May-2009
[3753x3]
should not, give the data please
There is no reason, the content is enclosed in a string before being 
loaded.
If it fails, it's because the whole grammar has changed
probably blank lines are inserted in the content (where they should 
not)
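
A small sketch of that point, assuming REBOL 2 and a made-up one-field string (DOSE is a hypothetical header name). Loaded bare, a token like 2.5MG is read as a malformed number; once the content sits inside braces it is an ordinary string.

```rebol
; Bare text: the loader tries to read 2.5MG as a number and fails.
print error? try [to block! "DOSE: 2.5MG BASE"]

; Braced text: the content is just a string, so loading succeeds.
obj: construct to block! "DOSE: {2.5MG BASE}"
print obj/dose
```

So an invalid decimal error suggests the rewrite stopped early and left part of the text outside the braces.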
Graham
17-May-2009
[3756]
{CC:
This is the presenting complaint.


HPI:
Developed over a few days

CURRENT MEDICATIONS:
METHOTREXATE SODIUM EQ 2.5MG BASE once weekly
METHOTREXATE SODIUM EQ 2.5MG BASE once weekly
Plaquenil 200 mg two daily
Prednisone 5 mg od
Salazopyrin EN 500 mg  two bd with food
Ultram Oral Tablet 50 MG qid prn
}
Steeve
17-May-2009
[3757x4]
ok i test that
at first sight, i can say there are too many blank lines
Right, i added skipping of useless newlines.

parse/all src [
	some [
		any newline
		some [pos: #" " (change pos #"-") | header-char]
		#":" pos: newline (change/part pos " {" 1)
		[to EOL2 | to end] pos: (change pos "} ") skip skip
	]
]

Could you figure it ?
Anticipated fails:

- if blank lines are inserted in the content (because blank lines 
should only be used as delimiters between headers).
- if header names can't be converted to words.
Maxim
17-May-2009
[3761]
afaik... my solution works flawlessly.  we could easily extend the 
header info so it recognises headers without naming them explicitly.
Steeve
17-May-2009
[3762]
In fact i could easily extend my solution to prevent those errors 
and throw safe errors if the parsing failed.
It takes 5 minutes to do.

But adding such exceptions or other sub-rules is so easy that i don't 
see the point of guarding against those cases in advance.

It's my philosophy when i write parsing rules.

They are so easy to extend, there is no reason to anticipate those 
cases by guessing what is in the mind of the final user.
We have to extend the grammar? 
Ok, give me 5 minutes.
Graham
17-May-2009
[3763x2]
The thing is that the user can type what they want ... so have to 
be prepared for anything.
All I ask is that they type the headers in correctly.
Steeve
17-May-2009
[3765x2]
I'm not a magician, i can't figure out all the cases if the given specifications 
are incomplete.

Everybody has a job to do, it's not mine to work from wrong specifications.
If you can't prevent them from inserting blank lines in the content, then 
Maxim's solution should be used instead.
With a list of authorized headers.
Graham
17-May-2009
[3767]
It's free text ... no way can I prevent users from doing this.
Steeve
17-May-2009
[3768x2]
So you can't use automatic recognition of unspecified headers. Easy 
to figure.
if headers are not distinguishable from free text, there is no solution
Graham
17-May-2009
[3770]
Not if I use Max's method .. but the headers can be obtained from 
the original object specifications.
Steeve
17-May-2009
[3771]
do so
Maxim
17-May-2009
[3772]
the header-lbl rule in my example could be changed so it matches 
up to the first colon, but then, there is a flaw in that the text 
can also include something that LOOKS like a header and then you 
can have a stray value in the object...


in the original example data you posted... this would be hard to 
tackle...

Penicillin - allergy:
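
To make the flaw concrete, here is a small sketch that runs one line from the sample data through both kinds of rule. KNOWN-HDR is a shortened, hypothetical stand-in for the explicit header list.

```rebol
line: "Penicillin - allergy: Allergy"

; Generic rule: everything up to the first colon counts as a header.
probe parse/all line [copy hdr to ":" skip to end]
probe hdr

; Explicit list (shortened here): the same line is not taken for a header.
known-hdr: ["ALLERGIES" | "SURGERIES"]
probe parse/all line [known-hdr ":" to end]
```

The generic rule should accept the line and capture "Penicillin - allergy" as a header name, which is the stray value mentioned above. The explicit list should reject it.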
Graham
17-May-2009
[3773x2]
That was my original way of doing things.
I built the rule from the object and then parsed the data .. but 
my way relied on the headers being in the correct order.
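
A sketch of building such a rule from an object, assuming REBOL 2 (where FIRST on an object returns its word list, starting with SELF). The TEMPLATE object and its field names are hypothetical stand-ins for the real specification.

```rebol
; Hypothetical template object; the field names stand in for the real ones.
template: context [CC: HPI: CURRENT-MEDICATIONS: ALLERGIES: none]

header-lbl: copy []
foreach word next first template [      ; NEXT skips the SELF word (REBOL 2)
    ; CURRENT-MEDICATIONS becomes the label "CURRENT MEDICATIONS"
    append header-lbl replace/all form word "-" " "
    append header-lbl '|
]
remove back tail header-lbl              ; drop the trailing |

probe header-lbl
probe parse/all "CURRENT MEDICATIONS:" [header-lbl ":"]
```

Because the result is a list of alternatives, a line-by-line check against it does not care in which order the headers appear in the text.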
Maxim
17-May-2009
[3775]
I started on steeve's course and had similar new-line issues, which 
is why I decided to parse line by line.
Steeve
17-May-2009
[3776x3]
can't the headers be prefixed? it would be so easy to handle...
Parsing line by line is not the solution (nor the problem) there.

All you can do line by line can be rolled into a single parsing flow. 
It's just a matter of your skill in using parse.
i saw many people proposing to parse line by line in many topics 
here.
I don't get it. 
It's slower and wastes memory for nothing.

They seem to be afraid of the use of any/some parsing loops, i don't 
understand why.
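
For comparison, a rough sketch of what a single-pass version with an explicit header list might look like. It is not either poster's code: the sample text and the shortened HEADER-LBL are made up, and it assumes a header only counts when it starts a line and is followed by a colon.

```rebol
; Hypothetical sample and header list, shortened from the ones above.
data: "CC:^/Sore throat.^/^/HPI:^/ONSET: Sudden^/^/INTENSITY: Moderate^/"
header-lbl: ["CC" | "HPI"]

spec: copy []
parse/all data [
    any [
        ; a known header followed by a colon starts a new field
        copy hdr header-lbl ":" (
            append spec to-set-word replace/all hdr " " "-"
            append spec copy ""
        )
        ; anything else, up to and including the newline, is content
        | copy line [to newline skip | some skip] (
            if string? item: last spec [append item line]
        )
    ]
]
probe construct spec
```

Blank lines inside a field are kept as content here instead of acting as delimiters, so the one-pass and line-by-line versions differ mainly in where the line splitting happens.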
Maxim
17-May-2009
[3779]
it's just MUCH easier in doing it line by line because the context 
of the parse isn't the same.  a parse rule going astray in multi-line 
doesn't react the same as for a single line which has a context of 
"this has a header" | "this doesn't"


I'm not saying my solution can't be done using only one parse, only 
that the rules are that much simpler.  in my first tests, handling 
the first and last headers needed special treatment, ultimately forcing 
me to add new rules, and generally making the whole much more complex.
Steeve
17-May-2009
[3780]
i never had to cut data into lines when parsing, and i will never 
have to
Maxim
17-May-2009
[3781]
steeve I did a 4000 line parse rule... outperforming C code.  but 
I'm pragmatic.  if the rules are going to be 50% smaller, and 100% 
bug free. then that's the better solution.