World: r3wp
[Parse] Discussion of PARSE dialect
Steeve 16-May-2009 [3732] | Hmm... |
Maxim 16-May-2009 [3733x3] | implementing later solution... this is easier |
here you go :-)
data: {CC: Patient complains of sore throat. HPI: ONSET: Sudden, TIMING: Constant, DURATION: 3 days INTENSITY: Moderate, QUALITY: Burning, MODIFYING FACTORS: head position CURRENT MEDICATIONS: TYLENOL W/ CODEINE NO. 3 300MG;30MG 1-2 po q 4-6 hrs prn "pain" cyclobenzaprine Oral Tablet 10 MG 1 tab po TID prn "muscle spasm" MEDICAL HISTORY: Rheumatic heart disease, unspec. 391.9 Eczema, atopic dermatitis 691.8 dyslipidemia ALLERGIES: Penicillin - allergy: Allergy Penicillin - allergy: Allergy Penicillin - anaphylactic reaction lovastatin - allergy: allergic macrodantin - 1 po BID SURGERIES: }
data: parse/all data "^/"
header-lbl: ["CC" | "HPI" | "ONSET" | "INTENSITY" | "CURRENT MEDICATIONS" | "MEDICAL HISTORY" | "ALLERGIES" | "SURGERIES"]
spec: []
foreach line data [
    unless parse/all line [
        copy hdr [header-lbl ":"] here: (
            append spec to-set-word head remove back tail replace/all hdr " " "-"
            append spec copy/part here tail line
        )
    ][
        if string? item: last spec [append item line]
    ]
]
probe context spec | |
ok for you? | |
Steeve 16-May-2009 [3736] | Assuming SRC: contains the source text, it seems to work too:
header-char: complement charset "^/:"
EOL2: rejoin [newline newline]
parse/all src [
    some [
        some [pos: #" " (change pos #"-") | header-char] #":"
        pos: newline (change/part pos " {" 1)
        [to EOL2 | to end] pos: (change pos "} ") skip skip
    ]
]
probe construct to block! src |
Graham 16-May-2009 [3737x2] | Yes ... but I'm going to have to study Steeve's |
to see why it doesn't work yet | |
Steeve 16-May-2009 [3739] | it will not work if you have CRLF instead of newlines in the source. Is that the case? |
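A minimal sketch of one way to rule out the CRLF case, assuming the text sits in a string called src as in Steeve's message. His rules match a bare newline, so a carriage return in front of it keeps them from matching. The sample string below is made up for illustration.

```rebol
; Hypothetical sample with Windows-style CRLF line endings.
src: "CC:^M^/sore throat^M^/^M^/HPI:^M^/three days^M^/"

; Turn every CR LF pair into a bare newline before parsing.
replace/all src "^M^/" "^/"

; FIND returns none when no carriage return is left in the string.
probe none? find src "^M"
```

The normalization has to run before the parse, because the rules edit src in place and rely on the positions of the newlines.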
Graham 16-May-2009 [3740] | I just copied it from here. |
Steeve 16-May-2009 [3741] | i mean for your source data, not for my code |
Graham 16-May-2009 [3742] | that's what I meant .. I just copied the source data from here. |
Steeve 16-May-2009 [3743x2] | ok, it works for me |
i retry | |
Graham 16-May-2009 [3745x3] | working now. |
Actually yours appears to be the better solution because you don't specify the headers | |
and just pick it up from the formatting of the text | |
Steeve 16-May-2009 [3748] | yep |
Graham 16-May-2009 [3749] | well, I'm impressed :) |
Steeve 16-May-2009 [3750] | you should not |
Graham 16-May-2009 [3751] | sadly I am. |
Graham 17-May-2009 [3752] | the parser dies with an invalid decimal error when there is something like "2.5mg" in the text. |
Steeve 17-May-2009 [3753x3] | should not, give the data please |
There is no reason, the content is enclosed in a string before being loaded. If it fails, it's because the whole grammar has changed | |
probably blank lines are inserted in the content (where they should not be) | |
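A small sketch of the point about strings, assuming REBOL 2 and using a made-up field name (dose). Loaded as bare text, 2.5MG is scanned as a malformed number. Wrapped in braces it is ordinary string content, which is what the rewritten source is supposed to contain by the time it is loaded.

```rebol
; Bare text: LOAD tries to scan 2.5MG as a value and raises a syntax error,
; so TRY returns an error value.
probe error? try [load "dose: 2.5MG"]

; Braced text: the same characters are just the content of a string,
; so CONSTRUCT can build an object from the loaded block.
probe construct load "dose: {2.5MG}"
```

If the error still appears on real data, that suggests some of the content ended up outside the braces, for example because a blank line inside a section closed the string early.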
Graham 17-May-2009 [3756] | {CC: This is the presenting complaint. HPI: Developed over a few days CURRENT MEDICATIONS: METHOTREXATE SODIUM EQ 2.5MG BASE once weekly METHOTREXATE SODIUM EQ 2.5MG BASE once weekly Plaquenil 200 mg two daily Prednisone 5 mg od Salazopyrin EN 500 mg two bd with food Ultram Oral Tablet 50 MG qid prn } |
Steeve 17-May-2009 [3757x4] | ok i test that |
at first sight, I can say there are too many blank lines | |
Right, I added skipping of useless newlines.
parse/all src [
    some [
        any newline
        some [pos: #" " (change pos #"-") | header-char] #":"
        pos: newline (change/part pos " {" 1)
        [to EOL2 | to end] pos: (change pos "} ") skip skip
    ]
]
Can you figure it out? | |
Anticipated failures: - if blank lines are inserted in the content (because blank lines should only be used as delimiters between headers). - if header names can't be converted to words. | |
Maxim 17-May-2009 [3761] | afaik... my solution works flawlessly. we could easily extend the header info so it recognises headers without naming them explicitly. |
Steeve 17-May-2009 [3762] | In fact I could easily extend my solution to prevent those errors and throw safe errors if the parsing failed. It takes 5 minutes to do. But adding such exceptions or other sub-rules is so easy that I don't see the point of guarding against those cases up front. It's my philosophy when I write parsing rules: they are so easy to extend that there is no reason to anticipate those cases by guessing what is in the mind of the final user. We have to extend the grammar? Ok, give me 5 minutes. |
Graham 17-May-2009 [3763x2] | The thing is that the user can type what they want ... so have to be prepared for anything. |
All I ask is that they type the headers in correctly. | |
Steeve 17-May-2009 [3765x2] | I'm not a magician, I can't figure out all the cases if the given specifications are incomplete. Everybody has a job to do, it's not mine to work on wrong specifications. |
If you can't prevent them from inserting blank lines in the content, then Maxim's solution should be used instead, with a list of authorized headers. | |
Graham 17-May-2009 [3767] | It's free text ... no way can I prevent users from doing this. |
Steeve 17-May-2009 [3768x2] | So you can't use automatic recognition of unspecified headers. Easy to figure. |
if headers are not distinguishable from free text, there is no solution | |
Graham 17-May-2009 [3770] | Not if I use Max's method .. but the headers can be obtained from the original object specifications. |
Steeve 17-May-2009 [3771] | do so |
Maxim 17-May-2009 [3772] | the header-lbl rule in my example could be changed so it matches up to the first colon, but then, there is a flaw in that the text can also include something that LOOKS like a header and then you can have a stray value in the object... in the original example data you posted... this would be hard to tackle... Penicillin - allergy: |
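A minimal sketch of the flaw Maxim describes, assuming a rule that treats everything up to the first colon on a line as a header. The rule is an illustration, not his posted code, and the three sample lines are taken from the example data.

```rebol
; Any character except a colon may appear in a candidate header name.
header-char: complement charset ":"

foreach line [
    "ALLERGIES:"
    "Penicillin - allergy: Allergy"
    "dyslipidemia"
][
    ; The rule succeeds when the line has some text, a colon, then anything.
    either parse/all line [copy hdr some header-char ":" to end] [
        print ["header-like:" mold hdr]
    ][
        print ["plain text: " mold line]
    ]
]
```

Both "ALLERGIES:" and the Penicillin line satisfy the rule, so the second one would become a stray field in the object unless headers are checked against a known list.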
Graham 17-May-2009 [3773x2] | That was my original way of doing things. |
I built the rule from the object and then parsed the data .. but my way relied on the headers being in the correct order. | |
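A rough sketch of deriving the header alternatives from an object, assuming REBOL 2 (where first on an object returns its words, starting with self) and a hypothetical template object. Because the result is a list of alternatives like Maxim's header-lbl, it does not depend on the order in which the headers are typed.

```rebol
; Hypothetical template; the real object specification is not shown in the thread.
template: context [cc: hpi: current-medications: medical-history: none]

header-lbl: copy []
foreach word next first template [      ; NEXT skips the leading self word
    ; current-medications -> "CURRENT MEDICATIONS"
    append header-lbl replace/all uppercase form word "-" " "
    append header-lbl '|
]
remove back tail header-lbl             ; drop the trailing |

probe header-lbl

; The generated block can be used as a sub-rule, as in Maxim's example.
probe parse/all "MEDICAL HISTORY: none" [header-lbl ":" to end]
```

String matching in parse ignores case by default, so the uppercase call only keeps the generated rule looking like the hand-written one.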
Maxim 17-May-2009 [3775] | I started on Steeve's course and had similar new-line issues, which is why I decided to parse line by line. |
Steeve 17-May-2009 [3776x3] | can't the headers be prefixed? it would be so easy to handle... |
Parsing line by line is not the solution (nor the problem) there. Everything you can do line by line can be rolled into a single parsing flow. It's just a matter of your skill in using parse. | |
I've seen many people propose parsing line by line in many topics here. I don't get it. It's slower and wastes memory for nothing. They seem to be afraid of using any/some parsing loops, I don't understand why. | |
Maxim 17-May-2009 [3779] | it's just MUCH easier doing it line by line because the context of the parse isn't the same. a parse rule going astray in multi-line doesn't react the same as for a single line, which has a context of "this has a header" | "this doesn't". I'm not saying my solution can't be done using only one parse, only that the rules are that much simpler. in my first tests, handling the first and last headers needed special treatment, ultimately forcing me to add new rules, and generally making the whole thing much more complex. |
Steeve 17-May-2009 [3780] | i never had to cut data into lines when parsing, and i will never have to |
Maxim 17-May-2009 [3781] | steeve I did a 4000 line parse rule... outperforming C code. but I'm pragmatic. if the rules are going to be 50% smaller, and 100% bug free. then that's the better solution. |