r3wp [groups: 83 posts: 189283]

World: r3wp

[Parse] Discussion of PARSE dialect

Gabriele
1-Dec-2011
[5969]
(mm, not sure why the copy/paste was messed up. i hope you get the 
idea anyway.)
Endo
1-Dec-2011
[5970x2]
I just did the same thing:


t: "abc56xyz" parse/all t [some [x: non-digit (prin first x remove x x: back x) :x | skip]] head t
a bit more clear:

t: "abc56xyz" parse/all t [some [x: non-digit (x: back remove x) :x | skip]] head t
Gabriele
1-Dec-2011
[5972]
note that copying the whole thing is probably faster than removing 
multiple times. also, doing several chars at once instead of one 
at a time is faster.
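
As a rough illustration of Gabriele's point, here is a minimal sketch in Python (not REBOL, and not code from the thread; the function name is made up). It builds a new string from whole runs of digits instead of removing non-digits from the original one character at a time. Whether this is actually faster in REBOL depends on the input, as discussed in the thread.

```python
import re


def keep_digits(text: str) -> str:
    """Return only the digits of text, built in a single pass."""
    # Collect whole runs of digits at once, then join them into one new
    # string, instead of deleting non-digits one character at a time.
    return "".join(re.findall(r"[0-9]+", text))


print(keep_digits("abc56xyz"))  # prints 56
```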
Endo
1-Dec-2011
[5973x2]
It depends on the input, but if it's a long text with many runs of 
chars to insert/remove, your way will be faster. Thanks.
Oh, I think there's no need for "back":

t: "abc56xyz" parse/all t [some [x: non-digit (remove x) :x | skip]] head t
Dockimbel
1-Dec-2011
[5975]
Endo: in your first attempt, your second rule in SOME block is not 
making the input advance when the end of the string is reached because 
(remove "") == "", so it enters an infinite loop. A simple fix could 
be:


t: "abc56xyz" parse/all t [any [digit (prin "d") | x: skip (prin "." remove x) :x]]


(remember to correctly reset the input cursor when modifying the 
parsed series) 


As others have suggested, there are more efficient ways to achieve this 
trimming.
Endo
1-Dec-2011
[5976x2]
Strange: I tried to remove the whole run at once, but it's 
slower than the other version:


aaa: [t: "abc56def7" parse/all t [some [x: some non-digit y: (remove/part x y) :x | skip]] head t]

bbb: [t: "abc56def7" parse/all t [some [x: non-digit (remove x) :x | skip]] head t]
>> benchmark2 aaa bbb ;(executes block 10'000'000 times.)
Execution time for the #1 job: 0:00:11.719
Execution time for the #2 job: 0:00:11.265
#1 is slower than #2 by factor ~ 1.04030181979583
Doc: Thank you. I tried to do it that way (advancing the series position) 
but couldn't. I may add some more things, so I want to do it with parse 
instead of other ways. And I want to learn parse better :)
Thanks for all!
Ashley
1-Dec-2011
[5978]
Anyone written anything to parse csv into an import-friendly stream?

Something like:

a,      b ,"c","d1
d2",a ""quote"",",",

a|b|c|d1^/d2|a "quote"|,|


(I'm trying to load CSV files dumped from Excel into SQLite and SQL 
Server ... these changes will be in the next version of my SQLite 
driver)
Endo
1-Dec-2011
[5979]
Geomol: It would be nice if trim/with supported charsets.

I would also love to have a "trace/parse", just like trace/net, 
which gives info about parse steps instead of all trace output.
Hmm, I should add this to the wish list, I think :)
Gregg
1-Dec-2011
[5980]
Ashley, not sure exactly what you're after. I use simple LOAD-CSV 
and BUILD-DLM-STR funcs to convert each direction.
BrianH
2-Dec-2011
[5981x8]
I use a TO-CSV function that does type-specific value formatting. 
The dates in particular, to be Excel-compatible. Was about to make 
a LOAD-CSV function - haven't needed it yet.
Here's the R2 version of TO-CSV and TO-ISO-DATE (Excel compatible):

to-iso-date: funct/with [
	"Convert a date to ISO format (Excel-compatible subset)"
	date [date!] /utc "Convert zoned time to UTC time"
] [
	if utc [date: date + date/zone date/zone: none] ; Excel doesn't support the Z suffix
	either date/time [ajoin [
		p0 date/year 4 "-" p0 date/month 2 "-" p0 date/day 2 " "  ; or T
		p0 date/hour 2 ":" p0 date/minute 2 ":" p0 date/second 2  ; or offsets
	]] [ajoin [
		p0 date/year 4 "-" p0 date/month 2 "-" p0 date/day 2
	]]
] [
	p0: func [what len] [ ; Function to left-pad a value with 0
		head insert/dup what: form :what "0" len - length? what
	]
]

to-csv: funct/with [
	"Convert a block of values to a CSV-formatted line in a string."
	[catch]
	data [block!] "Block of values"
] [
	output: make block! 2 * length? data
	unless empty? data [append output format-field first+ data]
	foreach x data [append append output "," format-field get/any 'x]
	to-string output
] [
	format-field: func [x [any-type!]] [case [
		none? get/any 'x [""]
		any-string? get/any 'x [ajoin [{"} replace/all copy x {"} {""} {"}]]
		all [char? get/any 'x x = #"^""] [{""""}]
		char? get/any 'x [ajoin [{"} x {"}]]
		scalar? get/any 'x [form x]
		date? get/any 'x [to-iso-date x]
		any [any-word? get/any 'x any-path? get/any 'x binary? get/any 'x] [
			ajoin [{"} replace/all to-string :x {"} {""} {"}]
		]
		'else [throw-error 'script 'invalid-arg get/any 'x]
	]]
]


There is likely a faster way to do these. I have R3 variants of these 
too.
Especially since I forgot that APPEND isn't native in R2 :(
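
For readers who want to check the quoting rules outside REBOL, here is a minimal Python sketch of the same idea. It is an assumption-laden illustration, not a port of the functions above: the names `format_field` and `to_csv_line` are hypothetical, and only None, strings, numbers and dates are handled. Strings are wrapped in quotes with embedded quotes doubled, None becomes an empty field, and dates use the zero-padded "YYYY-MM-DD" or "YYYY-MM-DD HH:MM:SS" form that TO-ISO-DATE produces.

```python
from datetime import date, datetime


def format_field(value) -> str:
    """Format one value as a CSV field (illustrative subset of types)."""
    if value is None:
        return ""
    # datetime is a subclass of date, so it must be checked first.
    if isinstance(value, datetime):
        return (f"{value.year:04d}-{value.month:02d}-{value.day:02d} "
                f"{value.hour:02d}:{value.minute:02d}:{value.second:02d}")
    if isinstance(value, date):
        return f"{value.year:04d}-{value.month:02d}-{value.day:02d}"
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        # Quote the string and double any embedded quote characters.
        return '"' + value.replace('"', '""') + '"'
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def to_csv_line(values) -> str:
    """Join the formatted fields with commas (no trailing newline)."""
    return ",".join(format_field(v) for v in values)


print(to_csv_line(["a", None, 'say "hi"', 3, date(2011, 12, 2)]))
# prints "a",,"say ""hi""",3,2011-12-02
```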
Gregg, could you post your LOAD-CSV ?
Here's a version that works in R3, tested against your example code:
>> a: deline read clipboard://
== {a,      b ,"c","d1
d2",a ""quote"",",",}

>> use [x] [collect [parse/all a [some [[{"} copy x [to {"} any [{""} to {"}]] {"} (keep replace/all x {""} {"}) | copy x [to "," | to end] (keep x)] ["," | end]]]]]
== ["a" "      b " "c" "d1^/d2" {a ""quote""} "," ""]


But it didn't work in R2, leading to an endless loop. So here's the 
version refactored for R2 that also works in R3

>> use [value x] [collect [value: [{"} copy x [to {"} any [{""} to {"}]] {"} (keep replace/all any [x ""] {""} {"}) | copy x [to "," | to end] (keep any [x ""])] parse/all a [value any ["," value]]]]
== ["a" "      b " "c" "d1^/d2" {a ""quote""} "," ""]


Note that if your loader returns the second value as "b" rather than 
"      b ", it isn't CSV compatible, nor is it if it treats {""} as an 
escape in values that aren't themselves surrounded by quotes. Also, 
newlines are allowed in values that are surrounded by quotes, so you 
can't do READ/lines and parse line by line; you have to parse the whole file.
I'm sure that the proposed PARSE for Topaz would allow the rule to 
be even smaller than the R3 version, because it includes COLLECT 
[KEEP] as PARSE operations.
That operation would be a great thing to add to the R3 Parse Proposals 
:)
I copied Ashley's example data into a file and checked against several 
commercial CSV loaders, including Excel and Access. Same results 
as the parsers above.
PeterWood
2-Dec-2011
[5989]
Brian - it may be here - http://snippets.dzone.com/posts/show/1281
Endo
2-Dec-2011
[5990]
BrianH: I tested parsing CSV (R2 version); there is just a little 
problem with a space between the comma and the quote:


parse-csv: func [a][ use [value x] [collect [value: [{"} copy x [to {"} any [{""} to {"}]] {"} (keep replace/all any [x ""] {""} {"}) | copy x [to "," | to end] (keep any [x ""])] parse/all a [value any ["," value]]]]]

parse-csv {"a,b", "c,d"}  ;there is a space after the comma
== ["a,b" { "c} {d"}]   ;wrong result.


I know it is a problem on CSV input, but I think you can easily fix 
it and then parse-csv function will be perfect.
Ashley
2-Dec-2011
[5991]
Also this case:

	{"a,b" ,"c,d"} ; space *before* comma

This case

	"a, b"


can be dealt with by replacing "keep any" with "keep trim any" ... 
but Brian's func handles 95% of the real-life test cases I've thrown 
at it so far, so a big thanks from me.
Endo
2-Dec-2011
[5992]
These are also a bit strange:
>> parse-csv {"a", "b"}
== ["a" { "b"}]
>> parse-csv { "a" ,"b"}
== [{ "a" } "b"]
>> parse-csv {"a" ,"b"}
== ["a"]
BrianH
2-Dec-2011
[5993x4]
If there is a space after the comma and before the ", the " is part 
of the value. The " character is only used as a delimiter if it is 
directly next to the comma.
My func handles 100% of the CSV standard - http://tools.ietf.org/html/rfc4180
- at least for a single line. To really parse CSV you need a full-file 
parser, because you have to consider that newlines in values surrounded 
by quotes are counted as part of the value, but if the value is not 
surrounded completely by quotes (including leading and trailing spaces) 
then newlines are treated as record separators.
CSV is not supposed to be forgiving of spaces around commas. Even 
the "" escaping to get a " character in the middle of a " surrounded 
value is supposed to be turned off when the comma, beginning of line, 
or end of line have spaces next to them.
For the purposes of discussion I'll put the CSV data inside {}, so 
you can see the ends, and the results in a block of line blocks.

This: { "a" }
should result in this: [[{ "a" }]]

This: { "a
b" }
should result in this: [[{ "a}] [{b" }]]

This: {"a
b"}
should result in this: [[{a
b}]]

This: {"a ""b"" c"}
should result in this: [[{a "b" c}]]

This: {a ""b"" c}
should result in this: [[{a ""b"" c}]]

This: {"a", "b"}
should result in this: [["a" { "b"}]]
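
The cases above can be checked with a small strict splitter. What follows is a minimal Python sketch (not REBOL and not the function discussed in this thread; `load_csv` and `_read_value` are hypothetical names). It assumes that a quote only opens a quoted value when it is the very first character of the field, that "" inside a quoted value stands for one quote, and that CRLF, CR or LF outside quotes ends a record. Raising an error when text follows a closing quote is a choice made for this sketch only.

```python
def _read_value(text: str, pos: int, delimiter: str):
    """Read one field starting at pos; return (value, next position)."""
    n = len(text)
    if pos < n and text[pos] == '"':
        # Quoted value: runs to the closing quote, "" means one quote.
        parts = []
        i = pos + 1
        while True:
            j = text.find('"', i)
            if j == -1:
                raise ValueError("unterminated quoted value")
            parts.append(text[i:j])
            if text.startswith('""', j):
                parts.append('"')
                i = j + 2
            else:
                return "".join(parts), j + 1
    # Raw value: everything up to the delimiter or a line break,
    # with spaces and quote characters kept as they are.
    i = pos
    while i < n and text[i] not in "\r\n" and text[i] != delimiter:
        i += 1
    return text[pos:i], i


def load_csv(text: str, delimiter: str = ",") -> list:
    """Parse a whole CSV text into a list of rows (lists of strings)."""
    rows = []
    pos, n = 0, len(text)
    while pos < n:
        row = []
        while True:
            value, pos = _read_value(text, pos, delimiter)
            row.append(value)
            if pos < n and text[pos] == delimiter:
                pos += 1
                continue
            break
        # Consume exactly one record separator: CRLF, CR or LF.
        if text.startswith("\r\n", pos):
            pos += 2
        elif pos < n and text[pos] in "\r\n":
            pos += 1
        elif pos < n:
            raise ValueError(f"unexpected text after value at {pos}")
        rows.append(row)
    return rows


print(load_csv('"a", "b"'))  # prints [['a', ' "b"']]
```

Because the whole text is scanned rather than split into lines first, a newline inside a quoted value stays part of that value.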
Gregg
2-Dec-2011
[5997x4]
load-csv: func [
        "Parse newline delimited CSV records"
        input [file! string!]
        /local p1 p2 lines
    ] [
        lines: collect line [
            parse input [

                some [p1: [to newline | to end] p2: (line: copy/part p1 p2) skip]
            ]
        ]
        collect/only rec [
            foreach line lines [
                if not empty? line [rec: parse/all line ","]
            ]
        ]
    ]
Argh. Shouldn't just post the first one I find. Ignore that. It doesn't 
handle file!.
load-csv: func [
    "Load and parse a delimited text file."
    source [file! string!]
    /with
        delimiter
    /local lines
][
    if not with [delimiter: ","]

    lines: either file? source [read/lines source] [parse/all source "^/"]
    remove-each line lines [empty? line]
    if empty? lines [return copy []]
    head forall lines [
        change/only lines parse/all first lines delimiter
    ]
]
I did head down the path of trying to handle all the things REBOL 
does wrong with quoted fields and such, but I have always found a 
way to avoid dealing with it.
Ashley
2-Dec-2011
[6001]
load-csv fails to deal with these 3 simple (and for me, common) cases:

1,"a
b"
2,"a""b"
3,

>> load-csv %test.csv
== [["1" "a"] [{b"}] ["2" "a" "b"] ["3"]]

I've reverted to an in situ brute force approach:

c: make function! [data /local s] [
		all [find data "|" exit]
		s: false
		repeat i length? trim data [
			switch pick data i [
				#"^""	[s: complement s]
				#","	[all [not s poke data i #"|"]]
				#"^/"	[all [s poke data i #" "]]
			]
		]
		remove-each char data [char = #"^""]

		all [#"|" = last data insert tail data #"|"]	; only required if we're going to parse the data
		parse/all data "|^/"
]

which has 4 minor limitations:

1) the data can't contain the delimiter you're going to use ("|" in 
my case)

2) it replaces quoted returns with another character (" " in my code)

3) it removes all quote (") characters (to allow SQLite .import and 
parse/all to function correctly)
4) Individual values are not trimmed (e.g. "a ,b" -> ["a " "b"])


If you can live with these limitations then the big benefit is that 
you can omit the last two lines and have a string that is import 
friendly for SQLite (or SQL Server) ... this is especially important 
when dealing with large (100MB+) CSV files! ;)
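
The brute-force pass described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a port: `to_pipe_delimited` is a hypothetical name, it returns a new string instead of modifying the data in place, and it raises an error where the original simply exits when the data already contains "|". It shares limitations 1 to 3: the delimiter must not occur in the data, quoted newlines become spaces, and every quote character is dropped.

```python
def to_pipe_delimited(data: str) -> str:
    """Rewrite CSV text as '|'-delimited text in one pass."""
    if "|" in data:
        raise ValueError("data already contains the target delimiter")
    out = []
    in_quotes = False
    for ch in data.strip():
        if ch == '"':
            # Toggle the quoted state and drop the quote itself.
            in_quotes = not in_quotes
        elif ch == "," and not in_quotes:
            out.append("|")   # field separator outside quotes
        elif ch == "\n" and in_quotes:
            out.append(" ")   # newline inside a quoted value
        else:
            out.append(ch)
    return "".join(out)


print(to_pipe_delimited('a,"b,c",d'))  # prints a|b,c|d
```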
BrianH
2-Dec-2011
[6002x2]
Individual values should not be trimmed if you want the loader to 
be CSV compatible. However, since TRIM is modifying you can post-process 
the values pretty quickly if you like.
I'm working on a fully standards-compliant full-file LOAD-CSV - actually 
two, one for R2 and one for R3. Need them both for work. For now 
I'm reading the entire file into memory before parsing it, but I 
hope to eventually make the reading incremental so there's more room 
in memory for the results.
Ashley
3-Dec-2011
[6004]
Actually, 4) above is easily solved by adding an additional switch 
case:

	#" "	[all [not s poke data i #"^""]]

This will ensure "a , b" -> ["a" "b"]
BrianH
3-Dec-2011
[6005x2]
But it doesn't assure that "a , b" -> ["a " " b"]. It doesn't work 
if it trims the values.
It needs to handle "" escaping too, but only in the case where values 
are quoted. Anyway, I have the function mostly done. I'll polish 
it up tomorrow.
Ashley
3-Dec-2011
[6007]
"it doesn't work if it trims the values" - that may not be the standard, but when you come across values like:

	1, 2, 3


the intent is quite clear (they're numbers) ... if we retained the 
leading spaces then we'd be treating these values (erroneously) as 
strings. There's a lot of malformed CSV out there! ;)
BrianH
3-Dec-2011
[6008x2]
I figure that dealing with malformed data, or even converting the 
strings to other values, is best done post-process. Might as well 
take advantage of modifiable blocks.
I'm putting LOAD-CSV in the %rebol.r of my dbtools, treating it like 
a mezzanine. That's why I need R2 and R3 versions, because they use 
the same %rebol.r with mostly the same functions. My version is a 
little more forgiving than the RFC above, allowing quotes to appear 
in non-quoted values. I'm making sure that it is exactly as forgiving 
on load as Excel, Access and SQL Server, resulting in exactly the 
same data, spaces and all, because my REBOL scripts at work are drop-in 
replacements for office automation processes. If anything, I don't 
want the loader to do value conversion because those other tools 
have been a bit too presumptuous about that, converting things to 
numbers that weren't meant to be. It's better to do the conversion 
explicitly, based on what you know is supposed to go in that column.
Kaj
3-Dec-2011
[6010]
Sounds like a job for a dialect that specifies what is supposed to 
be in the columns
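
A very small version of that idea, explicit per-column conversion as a post-processing step, might look like the Python sketch below. It is only an illustration: `convert_columns` is a hypothetical name, and a plain list of converter functions stands in for the dialect. The rows are modified in place, and a column whose converter is None is left as the loader produced it, spaces and all.

```python
def convert_columns(rows: list, converters: list) -> list:
    """Apply converters[i] to column i of every row, in place."""
    for row in rows:
        for i, convert in enumerate(converters):
            # Skip columns without a converter and short rows.
            if convert is not None and i < len(row):
                row[i] = convert(row[i])
    return rows


rows = [["1", " 2", "x "], ["3", "4", " y"]]
convert_columns(rows, [int, int, None])
print(rows)  # prints [[1, 2, 'x '], [3, 4, ' y']]
```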
BrianH
3-Dec-2011
[6011]
Because of R2's crappy binary parsing (yes, you can put binary data 
in CSV files) I used an emitter function in the R2 version. This 
could easily be exported to an option, to let you provide your own 
emitter function which does whatever conversion you want.
Gregg
3-Dec-2011
[6012]
As far as standards compliance, I didn't know there was a single 
standard. ;-)
BrianH
3-Dec-2011
[6013x4]
There's an ad-hoc de facto standard, but it's pretty widely supported. 
I admit, the binary support came as a bit of a surprise :)
Here's the R2 version, though I haven't promoted the emitter to an 
option yet:

load-csv: funct [
	"Load and parse CSV-style delimited data. Returns a block of blocks."
	[catch]
	source [file! url! string! binary!]
	/binary "Don't convert the data to string (if it isn't already)"
	/with "Use another delimiter than comma"
	delimiter [char! string! binary!]
	/into "Insert into a given block, rather than make a new one"
	output [block!] "Block returned at position after the insert"
] [
	; Read the source if necessary
	if any [file? source url? source] [throw-on-error [
		source: either binary [read/binary source] [read source]
	]]
	unless binary [source: as-string source] ; No line conversion
	; Use either a string or binary value emitter
	emit: either binary? source [:as-binary] [:as-string]
	; Set up the delimiter
	unless with [delimiter: #","]
	valchar: remove/part charset [#"^(00)" - #"^(FF)"] join crlf delimiter
	; Prep output and local vars
	unless into [output: make block! 1]
	line: [] val: make string! 0
	; Parse rules
	value: [
		; Value surrounded in quotes
		{"} (clear val) x: to {"} y: (insert/part tail val x y)
		any [{"} x: {"} to {"} y: (insert/part tail val x y)]
		{"} (insert tail line emit copy val) |
		; Raw value
		x: any valchar y: (insert tail line emit copy/part x y)
	]
	; as-string because R2 doesn't parse binary that well
	parse/all as-string source [any [
		end break |
		(line: make block! length? line)
		value any [delimiter value] [crlf | cr | lf | end]
		(output: insert/only output line)
	]]
	also either into [output] [head output]
		(source: output: line: val: x: y: none) ; Free the locals
]


All my tests pass, though they're not comprehensive; maybe you'll 
come up with more. Should I add support for making the row delimiter 
an option too?
>> load-csv {^M^/" a""", a""^Ma^/^/}
== [[""] [{ a"} { a""}] ["a"] [""]]
>> load-csv/binary to-binary {^M^/" a""", a""^Ma^/^/}
== [[#{}] [#{206122} #{20612222}] [#{61}] [#{}]]
The R3 version will be simpler and faster because of the PARSE changes 
and better binary handling. However, url handling might be trickier 
because READ/string is ignored by all schemes at the moment.
Steeve
3-Dec-2011
[6017]
Don't forget to post your script on rebol.org when finished :-)
Gregg
4-Dec-2011
[6018]
Thanks for posting Brian. I second Steeve's suggestion, though I'll 
snag it here for testing.