World: r3wp
[Parse] Discussion of PARSE dialect
older newer | first last |
BrianH 3-Dec-2011 [6014x3] | Here's the R2 version, though I haven't promoted the emitter to an option yet: load-csv: funct [ "Load and parse CSV-style delimited data. Returns a block of blocks." [catch] source [file! url! string! binary!] /binary "Don't convert the data to string (if it isn't already)" /with "Use another delimiter than comma" delimiter [char! string! binary!] /into "Insert into a given block, rather than make a new one" output [block!] "Block returned at position after the insert" ] [ ; Read the source if necessary if any [file? source url? source] [throw-on-error [ source: either binary [read/binary source] [read source] ]] unless binary [source: as-string source] ; No line conversion ; Use either a string or binary value emitter emit: either binary? source [:as-binary] [:as-string] ; Set up the delimiter unless with [delimiter: #","] valchar: remove/part charset [#"^(00)" - #"^(FF)"] join crlf delimiter ; Prep output and local vars unless into [output: make block! 1] line: [] val: make string! 0 ; Parse rules value: [ ; Value surrounded in quotes {"} (clear val) x: to {"} y: (insert/part tail val x y) any [{"} x: {"} to {"} y: (insert/part tail val x y)] {"} (insert tail line emit copy val) | ; Raw value x: any valchar y: (insert tail line emit copy/part x y) ] ; as-string because R2 doesn't parse binary that well parse/all as-string source [any [ end break | (line: make block! length? line) value any ["," value] [crlf | cr | lf | end] (output: insert/only output line) ]] also either into [output] [head output] (source: output: line: val: x: y: none) ; Free the locals ] All my tests pass, though they're not comprehensive; maybe you'll come up with more. Should I add support for making the row delimiter an option too? |
>> load-csv {^M^/" a""", a""^Ma^/^/} == [[""] [{ a"} { a""}] ["a"] [""]] >> load-csv/binary to-binary {^M^/" a""", a""^Ma^/^/} == [[#{}] [#{206122} #{20612222}] [#{61}] [#{}]] | |
The R3 version will be simpler and faster because of the PARSE changes and better binary handling. However, url handling might be trickier because READ/string is ignored by all schemes at the moment. | |
Steeve 3-Dec-2011 [6017] | Don't forget to post your script on rebol.org when finished :-) |
Gregg 4-Dec-2011 [6018] | Thanks for posting Brian. I second Steeve's suggestion, though I'll snag it here for testing. |
BrianH 4-Dec-2011 [6019] | The one above misses one of the Excel-like bad data handling patterns. Plus, I've added a few features, like multi-load, more option error checking , and R3 versions. I'll post them on REBOL.org today. |
BrianH 5-Dec-2011 [6020x3] | Full version with other CSV functions posted here: http://www.rebol.org/view-script.r?script=csv-tools.r |
Making the end-of-line delimiter an option turned out to be really tricky, too tricky to be worth it. The code and time overhead from just processing the option itself was pretty significant. It would be a better idea to make that kind of thing into a separate function which requires the delimiters to be specified, or a generator that takes a set of delimiters and generates a function to handle that specific set. | |
Nonetheless, this LOAD-CSV even handles multichar field delimiter options; in R2 that requires some interesting PARSE tricks :) | |
Henrik 5-Dec-2011 [6023] | Well, now, Brian, this looks very convenient. :-) I happen to be needing a better CSV parser, than the one I have here, but it needs to not convert cell values away from string, and I also need to parse partially, or N number of lines. Is this possible with this one? |
BrianH 5-Dec-2011 [6024] | It doesn't do conversion from string (or even from binary with LOAD-CSV/binary). This doesn't have a /part option but that is a good idea, especially since you can't just READ/lines a CSV file because it treats newlines differently depending on whether the value is in quotes or not. If you want to load incrementally (and can break up the lines yourself, for now) then LOAD-CSV supports the standard /into option. |
Henrik 5-Dec-2011 [6025x2] | can it be told to stop parsing after N lines instead? as far as I can tell from the source: (output: insert/only output line), it could do that. |
since you can't just READ/lines a CSV file - yes, mine does that, and that's no good. | |
BrianH 5-Dec-2011 [6027] | Yes, that is a possibility, but there yet. Resuming would be a problem because you'd have to either save a continuation position or reparse. Maybe something like LOAD/next would work here, preferably like the way R3's LOAD/next was before it was removed in favor of TRANSCODE/next. Making the /into option work with /next and /part would be interesting. |
Henrik 5-Dec-2011 [6028] | I don't really need anything but having the ability to parse the first 100 lines of a file and doing that many times, so I don't care so much about continuation. This is for real-time previews of large CSV files (> 10000 lines). |
BrianH 5-Dec-2011 [6029] | Funny, for my purposes it has to get over 100000 lines before it starts to be large :) |
Henrik 5-Dec-2011 [6030x2] | Well, sure, but I like to have complete control over things like that, so I usually settle for showing only the first 100 lines. |
(better response time, when the user abuses import adjustment buttons) | |
BrianH 5-Dec-2011 [6032x2] | Which do you prefer as a /next style? set [output data] load-csv/into data output or output: load-csv/into/next data output 'data |
Sorry that first one was: set [output data] load-csv/into/next data output | |
Henrik 5-Dec-2011 [6034x2] | second one looks ok |
output: load-csv/into/next data output 'data | |
BrianH 5-Dec-2011 [6036x2] | The latter makes chaining of the data to other functions easier, but requires a variable to hold the continuation; however, you usually use a variable for that anyway. The former makes it easier to chain both values (and looks nicer to R2 fans), but the only function you normally chain both values to is SET, so that's of limited value. |
The main problem with /part is that the current code reads the whole file into memory before parsing, and the parsing itself has miniscule overhead compared to the file overhead. Really doing /part properly might require incremental file reading, to the extent that that works (how well does it work for the http scheme on R3?). | |
Henrik 5-Dec-2011 [6038x2] | That's fine by me, as I read the file into memory once due to the need for one-time UTF-8 conversion, so that will happen outside LOAD-CSV. |
I need to go to bed. If there are more questions, I'll be back tomorrow. | |
BrianH 5-Dec-2011 [6040] | LOAD has the same problem on R2 and R3. The continuation returned would be an offset reference to the entire data in the file, at the position after the part parsed. |
BrianH 6-Dec-2011 [6041] | http://www.rebol.org/view-script.r?script=csv-tools.rupdated, with the new LOAD-CSV /part option. The LOAD-CSV /part option takes two parameters: - count: The maximum number of decoded lines you want returned. - after: A word that will be set to the position of the data after the decoded portion, or none. If you are loading from a file or url then the entire data is read, and after is set to a position in the read data. If you are converting from binary then in R2 after is set an offset of an as-string alias of the binary, and in R3 after is set to an offset of the original binary. R3 does binary conversion on a per-value basis to avoid having to allocate a huge chunk of memory for a temporary, and R2 just does string aliasing for the same reason. Be careful to expect that if you are passing the value assigned to after to anything else than LOAD-CSV (which can handle it either way). |
Gregg 6-Dec-2011 [6042] | Thanks for posting that Brian! |
BrianH 6-Dec-2011 [6043x3] | I was a little concerned about making /part take two parameters, since it doesn't anywhere else, but the only time you need that continuation value is when you do /part, and you almost always need it then. Oh well, I hope it isn't too confusing :) |
This pass-by-word convention is a little too C-like for my tastes. If only we had multivalue return without overhead, like Lua and Go. | |
Just tweaked it to add any-path! support to the after parameter in the R3 version, since R3 supports SET any-path!. | |
ChristianE 7-Dec-2011 [6046] | Do you consider LOAD-CSV { " a " , " b " , " c " } yielding [[{ " a " } { " b " } { " c " }]] to be on spec? It says that spaces are part of a field's value, yet it states that fields may be enclosed in double quotes. I'd rather expected [[" a " " b " " c "]] as a result. The way it is, LOAD-CSV in such cases parses unescaped double quotes as part of the value, IMHO that's not conforming with the spec. |
BrianH 7-Dec-2011 [6047x3] | The values are only considered to be surrounded by quotes if those quotes are directly next to the commas; otherwise, the quotes are data. In the case you give above, according to the spec the quotes in the data should not be allowed - they are bad syntax. However, since the spec in the RFC doesn't define what to do in the case of data that doesn't match the spec, I decided to match the error fallback behavior of Excel, the most widely used CSV handler. Most of the other tools I've tried match the same behavior. |
I considered making a /strict option to make it trigger errors in that case, but then reread the RFC and checked the behavior again, and realized that noone took the spec that strictly. Most tools either behave exactly the same as my LOAD-CSV (because that's how Excel behaves), or completely fail when there are any quotes in the file, like PARSE data "," and PARSE/all data ",". | |
The RFC is fairly loose and incomplete documentation of the observed behavior of most CSV handling tools. Excel's behavior is the real defacto standard, for better or worse. | |
Pekr 8-Dec-2011 [6050] | BrianH: one of my guys returned from some MS training, and he pointed me out to LogParser. It seems even some guys at MS are kind of dialecting :-) It looks like SQL, and you can query logs, and do some nice stuff around them .... http://technet.microsoft.com/en-us/library/ee692659.aspx |
Endo 8-Dec-2011 [6051] | SELECT ... INTO Chart.gif Nice addition to SQL :) |
Sunanda 8-Dec-2011 [6052] | Debugging some live code here .... I wasn't expecting 'parse to drop the last space in the second case here: parse/all " a" " " == ["" "a"] parse/all " a " " " == ["" "a"] So after the parse, it seems that " a" = " a " Any thoughts on a quick work around? Thanks! |
PeterWood 8-Dec-2011 [6053] | Very crudely adding an additional space if the last character is space: >> s: " a " == " a " >> if #" " = last s [append s " " ] == " a " >> parse/all s " == [ a" ""] |
Sunanda 8-Dec-2011 [6054] | Crude maybe, yet looks effective -- thanks! |
Henrik 18-Dec-2011 [6055x2] | BrianH, testing csv-tools.r now. Is this a bug?: >> to-iso-date 18-Dec-2011/14:57:11 ** Script Error: Invalid path value: hour ** Where: ajoin ** Near: p0 date/hour 2 ":" p0 >> system/version == 2.7.8.3.1 |
Also it seems that TO-CSV does not like blocks for cells. | |
BrianH 18-Dec-2011 [6057x4] | Yeah, blocks for cells are so far outside the data model of everything else that uses CSV files that TO-CSV was written to assume that you forgot to put an explicit translation to a string or binary in there (MOLD, FORM, TO-BINARY), or more likely that the block got in there by accident. Same goes for functions and a few other types. |
As for that TO-ISO-DATE behavior, yes, it's a bug. Surprised I didn't know that you can't use /hour, /minute and /second on date! values with times in them in R2. It can be fixed by changing the date/hour to date/time/hour, etc. I'll update the script on REBOL.org. | |
Having to put an explicit conversion from blocks, parens, objects, maps, errors, function types, structs, routines and handles, reminds you that you would need to explicitly convert them back when you LOAD-CSV. Or more often, triggers valuable errors that tell you that unexpected data made it in to your output. | |
TO-ISO-DATE fixed on REBOL.org | |
Henrik 18-Dec-2011 [6061] | Thanks |
GrahamC 18-Dec-2011 [6062x2] | dunno if it's faster but to left pad days and months, I add 100 to the value and then do a next, followed by a form ie. regarding you p0 function |
eg. next form 100 + date/month | |
older newer | first last |