
World: r3wp

[Parse] Discussion of PARSE dialect

BrianH
2-Dec-2011
[5995x2]
CSV is not supposed to be forgiving of spaces around commas. Even 
the "" escaping, which produces a " character in the middle of a 
"-surrounded value, is supposed to be turned off when the comma, 
beginning of line, or end of line has spaces next to it.
For the purposes of discussion I'll put the CSV data inside {}, so 
you can see the ends, and the results in a block of line blocks.

This: { "a" }
should result in this: [[{ "a" }]]

This: { "a
b" }
should result in this: [[{ "a}] [{b" }]]

This: {"a
b"}
should result in this: [[{a
b}]]

This: {"a ""b"" c"}
should result in this: [[{a "b" c}]]

This: {a ""b"" c}
should result in this: [[{a ""b"" c}]]

This: {"a", "b"}
should result in this: [["a" { "b"}]]
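[Editor's note: the rules above match what Python's csv module does by default, where quoting is only recognized when the quote character is the very first character of a field, so the cases can be sanity-checked against it. This is an illustration added for this archive, not part of the original discussion.]

```python
import csv
import io

def rows(text):
    """Parse CSV text and return a list of row lists."""
    return list(csv.reader(io.StringIO(text)))

# A quote that is not the first character of the field stays literal.
print(rows(' "a" \n'))        # [[' "a" ']]
# "" escaping works only inside a properly quoted field.
print(rows('"a ""b"" c"\n'))  # [['a "b" c']]
print(rows('a ""b"" c\n'))    # [['a ""b"" c']]
# A newline inside a quoted field stays inside the value.
print(rows('"a\nb"\n'))       # [['a\nb']]
```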
Gregg
2-Dec-2011
[5997x4]
load-csv: func [
        "Parse newline delimited CSV records"
        input [file! string!]
        /local p1 p2 lines
    ] [
        lines: collect line [
            parse input [

                some [p1: [to newline | to end] p2: (line: copy/part p1 p2) skip]
            ]
        ]
        collect/only rec [
            foreach line lines [
                if not empty? line [rec: parse/all line ","]
            ]
        ]
    ]
Argh. Shouldn't just post the first one I find. Ignore that. It doesn't 
handle file!.
load-csv: func [
    "Load and parse a delimited text file."
    source [file! string!]
    /with
        delimiter
    /local lines
][
    if not with [delimiter: ","]

    lines: either file? source [read/lines source] [parse/all source "^/"]
    remove-each line lines [empty? line]
    if empty? lines [return copy []]
    head forall lines [
        change/only lines parse/all first lines delimiter
    ]
]
I did head down the path of trying to handle all the things REBOL 
does wrong with quoted fields and such, but I have always found a 
way to avoid dealing with it.
Ashley
2-Dec-2011
[6001]
load-csv fails to deal with these 3 simple (and for me, common) cases:

1,"a
b"
2,"a""b"
3,

>> load-csv %test.csv
== [["1" "a"] [{b"}] ["2" "a" "b"] ["3"]]

I've reverted to an in situ brute force approach:

c: make function! [data /local s] [
		all [find data "|" exit]
		s: false
		repeat i length? trim data [
			switch pick data i [
				#"^""	[s: complement s]
				#","	[all [not s poke data i #"|"]]
				#"^/"	[all [s poke data i #" "]]
			]
		]
		remove-each char data [char = #"^""]

		all [#"|" = last data insert tail data #"|"]	; only required if we're going to parse the data
		parse/all data "|^/"
]

which has 4 minor limitations:

1) the data can't contain the delimiter you're going to use ("|" in 
my case)

2) it replaces quoted returns with another character (" " in my code)

3) it removes all quote (") characters (to allow SQLite .import and 
parse/all to function correctly)
4) Individual values are not trimmed (e.g. "a ,b" -> ["a " "b"])


If you can live with these limitations then the big benefit is that 
you can omit the last two lines and have a string that is import 
friendly for SQLite (or SQL Server) ... this is especially important 
when dealing with large (100MB+) CSV files! ;)
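[Editor's note: a rough Python sketch of the same brute-force pass, added for illustration; the names are mine. It shares the four limitations listed above, including dropping all quote characters, so a""b collapses to ab.]

```python
def brute_split(data, delim="|"):
    """One-pass CSV flattener in the spirit of the REBOL code above:
    toggle an in-quotes flag, turn unquoted commas into a stand-in
    delimiter, turn quoted newlines into spaces, and drop all quotes."""
    if delim in data:
        raise ValueError("data already contains the stand-in delimiter")
    in_quotes = False
    out = []
    for ch in data.strip():
        if ch == '"':
            in_quotes = not in_quotes   # toggle quoted state, drop the quote
        elif ch == ',' and not in_quotes:
            out.append(delim)           # unquoted comma -> stand-in delimiter
        elif ch == '\n' and in_quotes:
            out.append(' ')             # quoted newline -> space
        else:
            out.append(ch)
    return [row.split(delim) for row in ''.join(out).split('\n') if row]

print(brute_split('1,"a\nb"\n2,"a""b"\n3,\n'))
```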
BrianH
2-Dec-2011
[6002x2]
Individual values should not be trimmed if you want the loader to 
be CSV compatible. However, since TRIM is modifying you can post-process 
the values pretty quickly if you like.
I'm working on a fully standards-compliant full-file LOAD-CSV - actually 
two, one for R2 and one for R3. Need them both for work. For now 
I'm reading the entire file into memory before parsing it, but I 
hope to eventually make the reading incremental so there's more room 
in memory for the results.
Ashley
3-Dec-2011
[6004]
Actually, 4) above is easily solved by adding an additional switch 
case:

	#" "	[all [not s poke data i #"^""]]

This will ensure "a , b" -> ["a" "b"]
BrianH
3-Dec-2011
[6005x2]
But it doesn't ensure that "a , b" -> ["a " " b"]. It doesn't work 
if it trims the values.
It needs to handle "" escaping too, but only in the case where values 
are quoted. Anyway, I have the function mostly done. I'll polish 
it up tomorrow.
Ashley
3-Dec-2011
[6007]
it doesn't work if it trims the values.

 - that may not be the standard, but when you come across values like:

	1, 2, 3


the intent is quite clear (they're numbers) ... if we retained the 
leading spaces then we'd be treating these values (erroneously) as 
strings. There's a lot of malformed CSV out there! ;)
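[Editor's note: Python's csv module leaves this exact choice to the caller via skipinitialspace, which shows both behaviours side by side; an illustration added for this archive.]

```python
import csv
import io

# Default: leading spaces after the delimiter are part of the value.
strict = list(csv.reader(io.StringIO("1, 2, 3\n")))
# skipinitialspace=True trims the space after each delimiter.
lenient = list(csv.reader(io.StringIO("1, 2, 3\n"), skipinitialspace=True))
print(strict)   # [['1', ' 2', ' 3']]
print(lenient)  # [['1', '2', '3']]
```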
BrianH
3-Dec-2011
[6008x2]
I figure that dealing with malformed data, or even converting the 
strings to other values, is best done post-process. Might as well 
take advantage of modifiable blocks.
I'm putting LOAD-CSV in the %rebol.r of my dbtools, treating it like 
a mezzanine. That's why I need R2 and R3 versions, because they use 
the same %rebol.r with mostly the same functions. My version is a 
little more forgiving than the RFC above, allowing quotes to appear 
in non-quoted values. I'm making sure that it is exactly as forgiving 
on load as Excel, Access and SQL Server, resulting in exactly the 
same data, spaces and all, because my REBOL scripts at work are drop-in 
replacements for office automation processes. If anything, I don't 
want the loader to do value conversion because those other tools 
have been a bit too presumptuous about that, converting things to 
numbers that weren't meant to be. It's better to do the conversion 
explicitly, based on what you know is supposed to go in that column.
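[Editor's note: a hypothetical sketch of that explicit post-process conversion, added for illustration; the names are mine. Declaring one converter per column keeps values like "007" from being mangled by guessing.]

```python
# Convert each column explicitly with a declared converter,
# instead of letting the loader guess the types.
def convert_columns(rows, converters):
    return [[conv(value) for conv, value in zip(converters, row)]
            for row in rows]

loaded = [["1", "007", "2.5"]]
# int for column 1; str keeps "007" intact (no presumptuous
# number conversion); float for column 3.
print(convert_columns(loaded, [int, str, float]))  # [[1, '007', 2.5]]
```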
Kaj
3-Dec-2011
[6010]
Sounds like a job for a dialect that specifies what is supposed to 
be in the columns
BrianH
3-Dec-2011
[6011]
Because of R2's crappy binary parsing (yes, you can put binary data 
in CSV files) I used an emitter function in the R2 version. This 
could easily be exported to an option, to let you provide your own 
emitter function which does whatever conversion you want.
Gregg
3-Dec-2011
[6012]
As far as standards compliance, I didn't know there was a single 
standard. ;-)
BrianH
3-Dec-2011
[6013x4]
There's an ad hoc de facto standard, but it's pretty widely supported. 
I admit, the binary support came as a bit of a surprise :)
Here's the R2 version, though I haven't promoted the emitter to an 
option yet:

load-csv: funct [
	"Load and parse CSV-style delimited data. Returns a block of blocks."
	[catch]
	source [file! url! string! binary!]
	/binary "Don't convert the data to string (if it isn't already)"
	/with "Use another delimiter than comma"
	delimiter [char! string! binary!]
	/into "Insert into a given block, rather than make a new one"
	output [block!] "Block returned at position after the insert"
] [
	; Read the source if necessary
	if any [file? source url? source] [throw-on-error [
		source: either binary [read/binary source] [read source]
	]]
	unless binary [source: as-string source] ; No line conversion
	; Use either a string or binary value emitter
	emit: either binary? source [:as-binary] [:as-string]
	; Set up the delimiter
	unless with [delimiter: #","]

 valchar: remove/part charset [#"^(00)" - #"^(FF)"] join crlf delimiter
	; Prep output and local vars
	unless into [output: make block! 1]
	line: [] val: make string! 0
	; Parse rules
	value: [
		; Value surrounded in quotes
		{"} (clear val) x: to {"} y: (insert/part tail val x y)
		any [{"} x: {"} to {"} y: (insert/part tail val x y)]
		{"} (insert tail line emit copy val) |
		; Raw value
		x: any valchar y: (insert tail line emit copy/part x y)
	]
	; as-string because R2 doesn't parse binary that well
	parse/all as-string source [any [
		end break |
		(line: make block! length? line)
		value any ["," value] [crlf | cr | lf | end]
		(output: insert/only output line)
	]]
	also either into [output] [head output]
		(source: output: line: val: x: y: none) ; Free the locals
]


All my tests pass, though they're not comprehensive; maybe you'll 
come up with more. Should I add support for making the row delimiter 
an option too?
>> load-csv {^M^/" a""", a""^Ma^/^/}
== [[""] [{ a"} { a""}] ["a"] [""]]
>> load-csv/binary to-binary {^M^/" a""", a""^Ma^/^/}
== [[#{}] [#{206122} #{20612222}] [#{61}] [#{}]]
The R3 version will be simpler and faster because of the PARSE changes 
and better binary handling. However, url handling might be trickier 
because READ/string is ignored by all schemes at the moment.
Steeve
3-Dec-2011
[6017]
Don't forget to post your script on rebol.org when finished :-)
Gregg
4-Dec-2011
[6018]
Thanks for posting Brian. I second Steeve's suggestion, though I'll 
snag it here for testing.
BrianH
4-Dec-2011
[6019]
The one above misses one of the Excel-like bad-data handling patterns. 
Plus, I've added a few features, like multi-load, more option error 
checking, and R3 versions. I'll post them on REBOL.org today.
BrianH
5-Dec-2011
[6020x3]
Full version with other CSV functions posted here: http://www.rebol.org/view-script.r?script=csv-tools.r
Making the end-of-line delimiter an option turned out to be really 
tricky, too tricky to be worth it. The code and time overhead from 
just processing the option itself was pretty significant. It would 
be a better idea to make that kind of thing into a separate function 
which requires the delimiters to be specified, or a generator that 
takes a set of delimiters and generates a function to handle that 
specific set.
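[Editor's note: the generator idea can be sketched as a closure that bakes one fixed delimiter pair into a splitting function; a hypothetical Python illustration, not BrianH's code, and deliberately naive (no quoting support).]

```python
def make_splitter(field_delim, record_delim):
    """Return a splitter specialized for one fixed delimiter pair."""
    def split(text):
        records = text.split(record_delim)
        return [rec.split(field_delim) for rec in records if rec]
    return split

# Generate a function for one specific delimiter set, then reuse it.
split_csv = make_splitter(",", "\n")
print(split_csv("a,b\nc,d\n"))  # [['a', 'b'], ['c', 'd']]
```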
Nonetheless, this LOAD-CSV even handles multichar field delimiter 
options; in R2 that requires some interesting PARSE tricks :)
Henrik
5-Dec-2011
[6023]
Well, now, Brian, this looks very convenient. :-) I happen to be 
needing a better CSV parser, than the one I have here, but it needs 
to not convert cell values away from string, and I also need to parse 
partially, or N number of lines. Is this possible with this one?
BrianH
5-Dec-2011
[6024]
It doesn't do conversion from string (or even from binary with LOAD-CSV/binary). 
This doesn't have a /part option but that is a good idea, especially 
since you can't just READ/lines a CSV file because it treats newlines 
differently depending on whether the value is in quotes or not. If 
you want to load incrementally (and can break up the lines yourself, 
for now) then LOAD-CSV supports the standard /into option.
Henrik
5-Dec-2011
[6025x2]
can it be told to stop parsing after N lines instead? as far as I 
can tell from the source: (output: insert/only output line), it could 
do that.
since you can't just READ/lines a CSV file
 - yes, mine does that, and that's no good.
BrianH
5-Dec-2011
[6027]
Yes, that is a possibility, but it's not there yet. Resuming would be a problem 
because you'd have to either save a continuation position or reparse. 
Maybe something like LOAD/next would work here, preferably like the 
way R3's LOAD/next was before it was removed in favor of TRANSCODE/next. 
Making the /into option work with /next and /part would be interesting.
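[Editor's note: the /next-style continuation being discussed can be sketched like this in Python; a hypothetical illustration assuming the caller has already broken the input into lines, as mentioned above, with naive comma splitting and no quoting.]

```python
def load_part(lines, count, start=0):
    """Decode at most `count` lines starting at index `start`.
    Returns (rows, after), where `after` is the index to resume
    from, or None when the input is exhausted."""
    end = min(start + count, len(lines))
    rows = [line.split(',') for line in lines[start:end] if line]
    after = end if end < len(lines) else None
    return rows, after

data = ["a,b", "c,d", "e,f"]
rows, after = load_part(data, 2)
print(rows, after)  # [['a', 'b'], ['c', 'd']] 2
# Resume from the saved continuation position:
rows, after = load_part(data, 2, after)
print(rows, after)  # [['e', 'f']] None
```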
Henrik
5-Dec-2011
[6028]
I don't really need anything but having the ability to parse the 
first 100 lines of a file and doing that many times, so I don't care 
so much about continuation. This is for real-time previews of large 
CSV files (> 10000 lines).
BrianH
5-Dec-2011
[6029]
Funny, for my purposes it has to get over 100000 lines before it 
starts to be large :)
Henrik
5-Dec-2011
[6030x2]
Well, sure, but I like to have complete control over things like 
that, so I usually settle for showing only the first 100 lines.
(better response time, when the user abuses import adjustment buttons)
BrianH
5-Dec-2011
[6032x2]
Which do you prefer as a /next style?
	set [output data] load-csv/into data output
or
	output: load-csv/into/next data output 'data
Sorry that first one was:
	set [output data] load-csv/into/next data output
Henrik
5-Dec-2011
[6034x2]
second one looks ok
output: load-csv/into/next data output 'data
BrianH
5-Dec-2011
[6036x2]
The latter makes chaining of the data to other functions easier, 
but requires a variable to hold the continuation; however, you usually 
use a variable for that anyway. The former makes it easier to chain 
both values (and looks nicer to R2 fans), but the only function you 
normally chain both values to is SET, so that's of limited value.
The main problem with /part is that the current code reads the whole 
file into memory before parsing, and the parsing itself has miniscule 
overhead compared to the file overhead. Really doing /part properly 
might require incremental file reading, to the extent that that works 
(how well does it work for the http scheme on R3?).
Henrik
5-Dec-2011
[6038x2]
That's fine by me, as I read the file into memory once due to the 
need for one-time UTF-8 conversion, so that will happen outside LOAD-CSV.
I need to go to bed. If there are more questions, I'll be back tomorrow.
BrianH
5-Dec-2011
[6040]
LOAD has the same problem on R2 and R3. The continuation returned 
would be an offset reference to the entire data in the file, at the 
position after the part parsed.
BrianH
6-Dec-2011
[6041]
http://www.rebol.org/view-script.r?script=csv-tools.r updated, with 
the new LOAD-CSV /part option.

The LOAD-CSV /part option takes two parameters:
- count: The maximum number of decoded lines you want returned.

- after: A word that will be set to the position of the data after 
the decoded portion, or none.


If you are loading from a file or url then the entire data is read, 
and after is set to a position in the read data. If you are converting 
from binary then in R2 after is set an offset of an as-string alias 
of the binary, and in R3 after is set to an offset of the original 
binary. R3 does binary conversion on a per-value basis to avoid having 
to allocate a huge chunk of memory for a temporary, and R2 just does 
string aliasing for the same reason. Keep that in mind if you are 
passing the value assigned to after to anything other than LOAD-CSV 
(which can handle it either way).
Gregg
6-Dec-2011
[6042]
Thanks for posting that Brian!
BrianH
6-Dec-2011
[6043x2]
I was a little concerned about making /part take two parameters, 
since it doesn't anywhere else, but the only time you need that continuation 
value is when you do /part, and you almost always need it then. Oh 
well, I hope it isn't too confusing :)
This pass-by-word convention is a little too C-like for my tastes. 
If only we had multivalue return without overhead, like Lua and Go.