
World: r3wp

[Parse] Discussion of PARSE dialect

BrianH
5-Dec-2011
[6033]
Sorry, that first one was:
	set [output data] load-csv/into/next data output
Henrik
5-Dec-2011
[6034x2]
second one looks ok
output: load-csv/into/next data output 'data
BrianH
5-Dec-2011
[6036x2]
The latter makes chaining of the data to other functions easier, 
but requires a variable to hold the continuation; however, you usually 
use a variable for that anyway. The former makes it easier to chain 
both values (and looks nicer to R2 fans), but the only function you 
normally chain both values to is SET, so that's of limited value.
The main problem with /part is that the current code reads the whole 
file into memory before parsing, and the parsing itself has minuscule
overhead compared to the file overhead. Really doing /part properly 
might require incremental file reading, to the extent that that works 
(how well does it work for the http scheme on R3?).
Henrik
5-Dec-2011
[6038x2]
That's fine by me, as I read the file into memory once due to the 
need for one-time UTF-8 conversion, so that will happen outside LOAD-CSV.
I need to go to bed. If there are more questions, I'll be back tomorrow.
BrianH
5-Dec-2011
[6040]
LOAD has the same problem on R2 and R3. The continuation returned 
would be an offset reference to the entire data in the file, at the 
position after the part parsed.
BrianH
6-Dec-2011
[6041]
http://www.rebol.org/view-script.r?script=csv-tools.r updated, with 
the new LOAD-CSV /part option.

The LOAD-CSV /part option takes two parameters:
- count: The maximum number of decoded lines you want returned.

- after: A word that will be set to the position of the data after 
the decoded portion, or none.


If you are loading from a file or url then the entire data is read, 
and after is set to a position in the read data. If you are converting 
from binary then in R2 after is set to an offset of an as-string alias 
of the binary, and in R3 after is set to an offset of the original 
binary. R3 does binary conversion on a per-value basis to avoid having 
to allocate a huge chunk of memory for a temporary, and R2 just does 
string aliasing for the same reason. Keep that in mind if you are 
passing the value assigned to after to anything other than LOAD-CSV 
(which can handle it either way).
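
For example, batch-loading might look like this (a minimal sketch; the 
file name and the PROCESS handler are hypothetical, and it assumes the 
after word ends up none once the data is exhausted):

	rest: %big-log.csv
	while [rest] [
		rows: load-csv/part rest 100 'rest  ; up to 100 decoded lines; REST set to the data after them, or none
		foreach row rows [process row]      ; PROCESS is a placeholder for your own per-row handler
	]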
Gregg
6-Dec-2011
[6042]
Thanks for posting that Brian!
BrianH
6-Dec-2011
[6043x3]
I was a little concerned about making /part take two parameters, 
since it doesn't anywhere else, but the only time you need that continuation 
value is when you do /part, and you almost always need it then. Oh 
well, I hope it isn't too confusing :)
This pass-by-word convention is a little too C-like for my tastes. 
If only we had multivalue return without overhead, like Lua and Go.
Just tweaked it to add any-path! support to the after parameter in 
the R3 version, since R3 supports SET any-path!.
ChristianE
7-Dec-2011
[6046]
Do you consider LOAD-CSV { " a " , " b " , " c " } yielding [[{ " 
a " } { " b " } { " c " }]] to be on spec? It says that spaces are 
part of a field's value, yet it states that fields may be enclosed 
in double quotes. I'd rather expected [[" a " " b " " c "]] as a 
result. The way it is, LOAD-CSV in such cases parses unescaped double 
quotes as part of the value, IMHO that's not conforming with the 
spec.
BrianH
7-Dec-2011
[6047x3]
The values are only considered to be surrounded by quotes if those 
quotes are directly next to the commas; otherwise, the quotes are 
data. In the case you give above, according to the spec the quotes 
in the data should not be allowed - they are bad syntax. However, 
since the spec in the RFC doesn't define what to do in the case of 
data that doesn't match the spec, I decided to match the error fallback 
behavior of Excel, the most widely used CSV handler. Most of the 
other tools I've tried match the same behavior.
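For comparison (illustrative; the unspaced case follows the quotes-next-to-commas 
rule, the spaced case matches the result reported above):

>> load-csv {"a","b"}
== [["a" "b"]]
>> load-csv { "a" , "b" }
== [[{ "a" } { "b" }]]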
I considered making a /strict option to make it trigger errors in 
that case, but then reread the RFC and checked the behavior again, 
and realized that no one took the spec that strictly. Most tools either
behave exactly the same as my LOAD-CSV (because that's how Excel 
behaves), or completely fail when there are any quotes in the file, 
like PARSE data "," and PARSE/all data ",".
The RFC is fairly loose and incomplete documentation of the observed 
behavior of most CSV handling tools. Excel's behavior is the real 
de facto standard, for better or worse.
Pekr
8-Dec-2011
[6050]
BrianH: one of my guys returned from some MS training, and he pointed 
me to LogParser. It seems even some guys at MS are kind of dialecting 
:-) It looks like SQL, and you can query logs, and do some nice stuff 
around them ....

http://technet.microsoft.com/en-us/library/ee692659.aspx
Endo
8-Dec-2011
[6051]
SELECT ... INTO Chart.gif
Nice addition to SQL :)
Sunanda
8-Dec-2011
[6052]
Debugging some live code here .... I wasn't expecting 'parse to drop 
the last space in the second case here:
        parse/all " a" " "
        == ["" "a"]
        parse/all " a " " "
        == ["" "a"]
So after the parse, it seems that " a" = " a "

Any thoughts on a quick workaround? Thanks!
PeterWood
8-Dec-2011
[6053]
Very crudely, adding an additional space if the last character is 
a space:

>> s: " a "
== " a "
>> if #" " = last s [append s " "]
== " a  "
>> parse/all s " "
== ["" "a" ""]
Sunanda
8-Dec-2011
[6054]
Crude maybe, yet looks effective -- thanks!
Henrik
18-Dec-2011
[6055x2]
BrianH, testing csv-tools.r now.

Is this a bug?:

>> to-iso-date 18-Dec-2011/14:57:11
** Script Error: Invalid path value: hour
** Where: ajoin
** Near: p0 date/hour 2 ":" p0
>> system/version
== 2.7.8.3.1
Also it seems that TO-CSV does not like blocks for cells.
BrianH
18-Dec-2011
[6057x4]
Yeah, blocks for cells are so far outside the data model of everything 
else that uses CSV files that TO-CSV was written to assume that you 
forgot to put an explicit translation to a string or binary in there 
(MOLD, FORM, TO-BINARY), or more likely that the block got in there 
by accident. Same goes for functions and a few other types.
As for that TO-ISO-DATE behavior, yes, it's a bug. Surprised I didn't 
know that you can't use /hour, /minute and /second on date! values 
with times in them in R2. It can be fixed by changing the date/hour 
to date/time/hour, etc. I'll update the script on REBOL.org.
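The underlying R2 behavior, for reference:

>> d: 18-Dec-2011/14:57:11
== 18-Dec-2011/14:57:11
>> d/time/hour    ; works in R2, where d/hour raises the "Invalid path value" error above
== 14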
Having to put an explicit conversion from blocks, parens, objects, 
maps, errors, function types, structs, routines and handles reminds 
you that you would need to explicitly convert them back when you 
LOAD-CSV. Or, more often, it triggers valuable errors that tell you 
that unexpected data made it into your output.
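For instance, a round-trip sketch of that explicit conversion (illustrative 
values; fields come back from LOAD-CSV as strings):

>> to-csv reduce [1 mold [a b] "x"]
== {1,"[a b]","x"}
>> load second first load-csv {1,"[a b]","x"}
== [a b]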
TO-ISO-DATE fixed on REBOL.org
Henrik
18-Dec-2011
[6061]
Thanks
GrahamC
18-Dec-2011
[6062x2]
Dunno if it's faster, but to left-pad days and months, I add 100 to 
the value and then do a NEXT followed by a FORM, i.e. regarding your 
p0 function
e.g. next form 100 + date/month
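At the console, the trick looks like this:

>> next form 100 + 7
== "07"
>> next form 100 + 12
== "12"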
BrianH
18-Dec-2011
[6064]
It's worth timing. I'll try both, in R2 and R3.
GrahamC
19-Dec-2011
[6065]
and the outcome was?
BrianH
19-Dec-2011
[6066x2]
Twice the speed using your method :)
Updated on REBOL.org to use new method.
GrahamC
20-Dec-2011
[6068]
Yeah, generally math is faster than using logic. An old Forth trick.
BrianH
20-Dec-2011
[6069]
Added a TO-CSV /with delimiter option, in case commas aren't your 
thing. It only specifies the field delimiter, not the record delimiter, 
since TO-CSV only makes CSV lines, not whole files.
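An illustrative call (hypothetical values; output inferred from the 
formatter behavior shown below):

>> to-csv/with [1 "a" 2.5] #";"
== {1;"a";2.5}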
Endo
20-Dec-2011
[6070]
I'm using it to prepare data to bulk insert into a SQL Server table 
using the BCP command line tool.

I need to make some changes, like a /no-quote refinement to not quote 
string values, because there is no option in BCP to say that my data 
has quoted string values.
BrianH
20-Dec-2011
[6071]
Be careful: if you don't quote string values then the character set 
of your values can't include CR, LF or your delimiter. It requires 
so many changes that it would be more efficient to add new formatter 
functions to the associated FUNCT/with object, then duplicate the 
code in TO-CSV that calls the formatter. Like this:

to-csv: funct/with [
	"Convert a block of values to a CSV-formatted line in a string."
	data [block!] "Block of values"
	/with "Specify field delimiter (preferably char, or length of 1)"
	delimiter [char! string! binary!] {Default ","}
	; Empty delimiter, " or CR or LF may lead to corrupt data
	/no-quote "Don't quote values (limits the characters supported)"
] [
	output: make block! 2 * length? data
	delimiter: either with [to-string delimiter] [","]
	either no-quote [
		unless empty? data [append output format-field-nq first+ data]
		foreach x data [append append output delimiter format-field-nq :x]
	] [
		unless empty? data [append output format-field first+ data]
		foreach x data [append append output delimiter format-field :x]
	]
	to-string output
] [
	format-field: func [x [any-type!] /local qr] [
		; Parse rule to put double-quotes around a string, escaping any inside
		qr: [return [insert {"} any [change {"} {""} | skip] insert {"}]]
		case [
			none? :x [""]
			any-string? :x [parse copy x qr]
			:x = #"^(22)" [{""""}]
			char? :x [ajoin [{"} x {"}]]
			money? :x [find/tail form x "$"]
			scalar? :x [form x]
			date? :x [to-iso-date x]
			any [any-word? :x binary? :x any-path? :x] [parse to-string :x qr]
			'else [cause-error 'script 'expect-set reduce [
				[any-string! any-word! any-path! binary! scalar! date!] type? :x
			]]
		]
	]
	format-field-nq: func [x [any-type!]] [
		case [
			none? :x [""]
			any-string? :x [x]
			money? :x [find/tail form x "$"]
			scalar? :x [form x]
			date? :x [to-iso-date x]
			any [any-word? :x binary? :x any-path? :x] [to-string :x]
			'else [cause-error 'script 'expect-set reduce [
				[any-string! any-word! any-path! binary! scalar! date!] type? :x
			]]
		]
	]
]


If you want to add error checking to make sure the data won't be 
corrupted, you'll have to pass in the delimiter to format-field-nq 
and trigger an error if it, CR or LF are found in the field data.
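
Used that way, the no-quote path would behave roughly like this 
(illustrative values; output inferred from format-field-nq above):

>> to-csv/no-quote [1 "a b" 2.5]
== {1,a b,2.5}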
Henrik
20-Dec-2011
[6072]
Is this related to what you wrote above?

>> to-csv [34]
== {""""}
BrianH
20-Dec-2011
[6073x3]
Nope, that's a bug in the R2 version only. Change this:
			:x = #"^(22)" [{""""}]
to this:
			:x == #"^(22)" [{""""}]

Another incompatibility between R2 and R3 that I forgot :(
I'll update the script on REBOL.org.
Weirdly enough, = and =? return true in that case in R2, but only 
== returns false; false is what I would expect for =? at least.
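At the R2 console, for reference:

>> 34 = #"^(22)"
== true
>> 34 == #"^(22)"
== false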
Updated, Henrik.
Henrik
20-Dec-2011
[6076]
Thanks.
Endo
20-Dec-2011
[6077]
Thanks BrianH
BrianH
20-Dec-2011
[6078x2]
Note that that was a first-round mockup of the R3 version, Endo. 
If you want to make an R2 version, download the latest script and 
edit it similarly.
Have you looked into the native type formatting of bcp? It might 
be easier to make a more precise data file that way.
Endo
20-Dec-2011
[6080x2]
It uses a format file; it is very strict, but there is no way to set 
a quote character for fields.
Native format works well if you export from one SQL Server and import 
into another.
BrianH
20-Dec-2011
[6082]
I figure it might be worth it (for me at some point) to do some test 
exports in native format in order to reverse-engineer the format, 
then write some code to generate that format ourselves. I have to 
do a lot of work with SQL Server, so it seems inevitable that such 
a tool will be useful at some point, or at least the knowledge gained 
in the process of writing it.