Fast way to remove all non-numerical chars from a string
[1/15] from: kpeters:otaksoft at: 21-Sep-2007 15:45
Hi ~
what is a very fast way to remove all non-numerical characters from
a given string? I will have to process almost a million of these, so speed
matters.
I.e., "(250) 764-0929" -> "2507640929"
TIA,
Kai
[2/15] from: carl::cybercraft::co::nz at: 22-Sep-2007 11:46
On Friday, 21-September-2007 at 15:45:16 Kai Peters wrote,
>Hi ~
>
>what is a very fast way to remove all non-numerical characters from
>a given string? I will have to process almost a million of these, so speed
>matters.
>
>I.e., "(250) 764-0929" -> "2507640929"
I'm not sure how fast this would compare to other methods, but give it a go...
First, create a string containing all characters, less the ten numerals...
chrs: ""
repeat n 256 [append chrs to-char n - 1]
chrs: exclude chrs "1234567890"
Then trim your strings thus...
trim/with "(250) 764-0929" chrs
Hmmm. Well - it might work, depending on the type of characters in your string. It
works on your example, but not on a string made up of random characters. Can anyone
explain if that's expected behaviour? ie...
>> chrs: ""
== ""
>> repeat n 256 [append chrs to-char n - 1]
== {^@^A^B^C^D^E^F^G^H^-
^K^L^M^N^O^P^Q^R^S^T^U^V^W^X^Y^Z^[^\^]^!^_ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^^...
>> chrs: exclude chrs "1234567890"
== {^@^A^B^C^D^E^F^G^H^-
^K^L^M^N^O^P^Q^R^S^T^U^V^W^X^Y^Z^[^\^]^!^_ !"#$%&'()*+,-./:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^^_`{|}~^~...
>> str: ""
== ""
>> loop 1000 [append str to-char random 255]
== {^\^-^H^!^Ly8f?^Q垧H^_m4^ZV-^-OF^{-W^SwF^_38^H^\aV^NՌWuj)V^C^~^\%^B'...
>> trim/with "(250) 764-0929" chrs
== "2507640929"
>> trim/with str chrs
== {y8fm4w38aujft0iltp9wlnegnxsr2lub31hal3uyoczbqnbrly5yesva4...
???
-- Carl Read.
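A possible explanation for the behaviour Carl asks about, offered as a sketch rather than a confirmed diagnosis: the console output of the exclude call above jumps from the backtick straight to "{", so the lowercase letters are missing from chrs. That would fit the set functions comparing strings case-insensitively by default, and it matches the lowercase letters that survive in the random-string result. Assuming REBOL 2, where exclude accepts a /case refinement, the variant below keeps both cases in the strip list; the name non-digits is made up for this illustration.

```rebol
rebol []

; Build the full 256-character string exactly as in Carl's message.
chrs: copy ""
repeat n 256 [append chrs to-char n - 1]

; Assumption: /case makes EXCLUDE treat "a" and "A" as different characters,
; so the lowercase letters stay in the list of characters to strip.
non-digits: exclude/case chrs "1234567890"

; With both cases present, trim/with should leave only the digits.
print trim/with copy "(250) 764-0929 abc XYZ" non-digits
```

If the diagnosis is right, the same call on a string of random characters should also return digits only.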
[3/15] from: pwawood::gmail::com at: 22-Sep-2007 11:32
In addition to Carl's solution, you could try parse, based on Romano's
advice that "if you want speed, parse is your friend".
>> input-string: "(250) 764-0929"
== "(250) 764-0929"
;; Create a bitset! of numeric digits to use in parse
>> digit: charset [#"0" - #"9"]
== make bitset! #{
000000000000FF03000000000000000000000000000000000000000000000000
}
;; create a string! the length of the input-string in which to collect the output
>> output-string: make string! length? input-string
== ""
;; parse the input-string to collect only numeric digits
>> parse input-string [any [copy next-digit digit (insert tail output-string next-digit) | skip]]
== true
;;check the result
>> probe output-string
2507640929
== "2507640929"
If all your strings are very long, it might be worth using a temporary
hash! in which to collect the digits within the parse and then convert
it to a string afterwards. (The only problem with using hash! is that
it has been removed from Rebol 3 and I don't know enough to tell if its
successor can be used in the same way).
>> temp: make hash! length? input-string
== make hash! []
>> parse input-string [any [copy next-digit digit (insert tail temp next-digit) | skip]]
== true
>> output-string: to string! temp
== "2507640929"
I haven't compared the speed of the two approaches.
Regards
Peter
On Saturday, September 22, 2007, at 06:45 am, Kai Peters wrote:
[4/15] from: petr::krenzelok::seznam::cz at: 22-Sep-2007 14:17
Hi,
well, as always with REBOL, here is another, totally different approach, the short one :-)
start: now/time/precise
loop 1'000'000 [remove-each val "1234a5b6v77" [any [val < #"0" val > #"9"]]]
now/time/precise - start
== 0:00:06.255
But - parse is usually the fastest method, otoh remove-each is native,
it could compare rather well. Parse is more flexible for more
complicated set-ups though ...
Cheers,
-pekr-
[5/15] from: carl:cybercraft at: 23-Sep-2007 10:19
On Saturday, 22-September-2007 at 14:17:45 Petr Krenzelok wrote,
>Hi,
>well, as with REBOL, another, totally different approach, the short one :-)
<<quoted lines omitted: 5>>
>it could compare rather well. Parse is more flexible for more
>complicated set-ups though ...
Being native's the reason I thought of using trim too - and with good reason, it seems...
The script...
-----------------
rebol []
print "remove-each..."
start: now/time/precise
loop 1'000'000 [remove-each val "1234a5b6v77" [any [val < #"0" val > #"9"]]]
print now/time/precise - start
print "trim/with..."
chrs: ""
repeat n 256 [append chrs to-char n - 1]
chrs: exclude chrs "1234567890"
start: now/time/precise
loop 1'000'000 [trim/with "1234a5b6v77" chrs]
print now/time/precise - start
-----------------
And the results...
>> do %/c/program files/rebol/view/test.r
Script: "Untitled" (none)
remove-each...
0:00:07.313
trim/with...
0:00:02.343
Now if only trim worked as expected! (Someone else can add the parse test... And Kai,
like to give us some real-world results?)
-- Carl Read.
[6/15] from: Tom:Conlin:g:mail at: 22-Sep-2007 15:32
that is what I was seeing as well
Pekr's can be sped up with bitsets ...
filter: charset "0123456789"
start: now/time/precise
loop 1'000'000 [remove-each char "(250) 764-0929" [not find filter char]]
now/time/precise - start
parse with integer! was faster than remove-each but slower than trim
the trouble with including parse is that you have to do something with the
results, and that is not being done in these speed tests, so it is not
really a fair comparison
rule: [copy num integer!(something num) | skip]
start: now/time/precise
loop 1'000'000 [parse/all "(250) 764-0929"[some rule]]
now/time/precise - start
a better picture of what the data is really like, where it is coming from
and going to ... would help
Carl Read wrote:
[7/15] from: Tom::Conlin::gmail::com at: 22-Sep-2007 17:22
minutely faster than trim/with
digit: charset "0123456789"
noise: complement digit
start: now/time/precise
rule: [digit | here: some noise there:(remove/part :here :there) :here]
loop 1'000'000 [parse/all "(250) 764-0929"[some rule]]
now/time/precise - start
Tom wrote:
[8/15] from: carl:cybercraft at: 23-Sep-2007 12:59
On Saturday, 22-September-2007 at 17:22:49 Tom wrote,
>minutely faster than trim/with
>digit: charset "0123456789"
<<quoted lines omitted: 3>>
>loop 1'000'000 [parse/all "(250) 764-0929"[some rule]]
>now/time/precise - start
Ahah! I'm finding it minutely slower...
remove-each...
0:00:08.734
trim/with...
0:00:02.203
remove-each using bitsets...
0:00:07.032
parse...
0:00:02.265
but still an excellent advert for parse. And unlike trim, it doesn't easily break...
>> str: ""
== ""
>> loop 50 [append str to-char random 255]
== {^!V^[lwGpf~<8~^S|U#^Q|o$]Oy/#|Y^\je!}
>> loop 1'000'000 [parse/all str [some rule]]
== true
>> str
== "8"
So Kai - up to you for the real-world results, though parse looks to be the best choice.
-- Carl Read.
[9/15] from: pwawood:gma:il at: 23-Sep-2007 9:59
Tom
That's a good example of using parse to modify its input string:
thanks. (I was struggling to come up with this approach myself).
I modified the rule to search for strings of digits rather than
individual ones (some digit instead of digit); there was a 30 per cent
reduction in the time taken. If Kai's data has a very high percentage
of digits, this small improvement may be significant.
The revised code is:
digit: charset "0123456789"
noise: complement digit
start: now/time/precise
rule: [some digit | here: some noise there:(remove/part :here :there) :here]
loop 1'000'000 [parse/all "(250) 764-0929"[some rule]]
now/time/precise - start
Peter
On Sunday, September 23, 2007, at 08:22 am, Tom wrote:
[10/15] from: Tom::Conlin::gmail::com at: 22-Sep-2007 19:48
I got 03.265 for parse and 03.391 for trim all
in this range it could be due to the vagaries
of the operating system tasks
with Peter Wood's 'some improvement I see
== 0:00:02.047
trim/with is still coming in at 0:00:03.391 on multiple runs
Carl Read wrote:
[11/15] from: gregg::pointillistic::com at: 23-Sep-2007 11:17
>>> loop 1'000'000 [parse/all "(250) 764-0929"[some rule]]
Keep in mind that you're acting on the same string every time here.
If all the numbers are formatted exactly the same, hardcoding the
rules might be fastest, e.g.
remove skip remove/part skip remove s 3 2 3
But only Kai can say how important the speed is. Processing a million
inputs once may be no big deal, but if it has to happen in a loop, in
under x amount of time, we may need to optimize much further.
-- Gregg
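Gregg's two points can be tried out with a short script. This is only a sketch, assuming REBOL 2 and the sample number from the thread: the first part runs the hardcoded one-liner on a fresh copy (it reads inside-out: drop "(", skip three digits, drop ") ", skip three digits, drop "-"), and the second part times Tom's and Peter's parse rule with a copy taken on every pass. The copy matters because a literal string inside a loop body is modified in place, so after the first iteration there is no noise left to remove. The name sample and the smaller loop count are choices made for this illustration.

```rebol
rebol []

digit: charset "0123456789"
noise: complement digit
rule: [some digit | here: some noise there: (remove/part :here :there) :here]

sample: "(250) 764-0929"

; Gregg's fixed-format one-liner, applied to a fresh copy of the sample.
s: copy sample
remove skip remove/part skip remove s 3 2 3
print s    ; 2507640929

; Timing with a fresh copy per iteration, so every pass has noise to remove.
; The cost of COPY is included in the measured time.
start: now/time/precise
loop 100'000 [parse/all copy sample [some rule]]
print ["parse on fresh copies:" now/time/precise - start]
```

The one-liner only works when every input has exactly this layout, which is the trade-off Gregg describes.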
[12/15] from: edoconnor::gmail::com at: 23-Sep-2007 17:40
On 9/23/07, Gregg Irwin wrote:
> But only Kai can say how important the speed is. Processing a million
> inputs once may be no big deal, but if it has to happen in a loop, in
> under x amount of time, we may need to optimize much further.
For further (non REBOL) reading, here's a recent article on the great
blog CodingHorror which is relevant here.
http://www.codinghorror.com/blog/archives/000957.html
Regards,
Ed
[13/15] from: kpeters:otaksoft at: 24-Sep-2007 10:25
Wow - this seemingly "little" question really sparked some responses!
I like it when that happens because it really shows off the brilliance of
Rebol and the people mastering it.
All solutions will go into my library collection since they all shine in
their own way and I can learn from all of them - so I thank you all.
As you likely have guessed, I asked because I need to re-format phone numbers.
The vast majority of these will arrive formatted by various people according to what
they consider proper formatting - sometimes quite creative and riddled with typos as
well.
At any time, I have to be prepared for the occasional complete junk string.
The numbers may reside in MySQL tables or in text files with one phone record (number
& address) per line. Each of these tables or text files will be processed exactly once
(as far as the phone number standardizing goes) - speed is important but an extra handful
of seconds per file (containing between 500,000 and 1,000,000 numbers) won't hurt anybody.
The phone numbers are stored with a max of 15 characters each prior to processing - these
strings will be overwritten with a standardized phone number string if they contain a
valid number and will be emptied otherwise.
For now, all phone numbers hail from
North America - so valid lengths are
a) 7 digits - local number
b) 10 digits - area code included
c) 11 digits - leading 1 in front of area code
Here's the function logic I intend to use:
1) Lose all non-numerical characters from ph#-string
2) If length not in (7,10,11) return empty string because phone# is invalid
3) If length = 11 and first char = 1 then chop off first char // now only 2 possibilities left
4) If length = 10 then
frame the three leftmost digits with a pair of parentheses
insert a '1' in front
5) Insert hyphen before fourth character from the end of string
Does this sound like a good strategy or are there other, maybe radically different (but
speedy) ways to do this?
TIA,
Kai
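The five steps Kai lists can be sketched as a single function. This is a minimal illustration, not a tested production routine: it assumes REBOL 2 (for parse/all), reuses the in-place parse rule from earlier in the thread for step 1, and treats an 11-digit string that does not start with 1 as invalid, which the step list leaves open. The name standardize-phone and the spacing-free output format (chosen so the result fits the 15-character field) are assumptions.

```rebol
rebol []

digit: charset "0123456789"
noise: complement digit

; Hypothetical helper: returns a standardized number, or "" if invalid.
standardize-phone: func [
    raw [string!] "Phone number as entered"
    /local num here there
][
    num: copy raw
    ; 1) strip everything that is not a digit, in place
    parse/all num [
        any [some digit | here: some noise there: (remove/part here there) :here]
    ]
    ; 2) only 7, 10 or 11 digits are acceptable
    if not find [7 10 11] length? num [return copy ""]
    ; 3) an 11-digit number must start with 1, which is then dropped
    ;    (rejecting other 11-digit strings is an assumption)
    if 11 = length? num [
        if #"1" <> first num [return copy ""]
        remove num
    ]
    ; 4) ten digits: frame the area code and put a 1 in front
    if 10 = length? num [
        insert skip num 3 ")"
        insert num "1("
    ]
    ; 5) hyphen before the fourth character from the end
    insert skip tail num -4 "-"
    num
]

print standardize-phone "(250) 764-0929"    ; 1(250)764-0929
print standardize-phone "764.0929"          ; 764-0929
```

Working on a copy keeps the caller's string untouched; if the stored value is meant to be overwritten anyway, the copy can be dropped.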
[14/15] from: gregg::pointillistic::com at: 24-Sep-2007 13:07
Hi Kai,
KP> As you likely have guessed, I asked because I need to re-format
KP> phone numbers.
Here is some very old code I remembered I had here. Use what you can.
It was designed for interactive UI use, checking and reformatting
numbers as users entered them, hence the object support; not optimized
for speed in any way.
-- Gregg
ctx-phone-entry: context [
    set 'format-phone-number func [
        num [string! issue! object!] "String or object with /text value"
        /def-area-code area-code [string! integer!]
        /local left right mid obj res data
    ] [
        ; small string helpers
        left: func [s len] [copy/part s len]
        right: func [s len] [copy skip tail s negate len]
        mid: func [s start len] [copy/part at s start len]
        if object? num [obj: num num: obj/text]
        res: either data: parse-phone-num num [
            ; discard leader if it's there.
            if all [11 = length? data/num data/num/1 = #"1"] [
                data/num: right data/num 10
            ]
            rejoin [
                rejoin switch/default length? data/num [
                    7 [compose [
                        (either area-code [rejoin ["(" area-code ") "]] [])
                        left data/num 3 "-" right data/num 4
                    ]]
                    10 [[
                        "(" left data/num 3 ") "
                        mid data/num 4 3 "-" right data/num 4
                    ]]
                ] [[data/num]]
                reduce either data/ext [[" ext" trim data/ext]] [""]
                reduce either data/pin [[" pin" trim data/pin]] [""]
            ]
        ] [num]
        ; write the result back when called with an object (e.g. a face)
        if obj [
            obj/text: res
            attempt [if 'face = obj/type [show obj]]
        ]
        res
    ]
    set 'parse-phone-num func [
        num [string! issue!]
        /local digit digits sep _ext_ ch nums pin ext rules
    ] [
        digit: charset "0123456789"
        digits: [some digit]
        sep: charset "()-._"
        _ext_: ["ext" opt "." | "x"]
        nums: copy ""
        rules: [
            any [
                some [sep | copy ch digit (append nums ch)]
                | _ext_ copy ext digits
                | "pin" copy pin digits
            ]
            end
        ]
        ; returns [num "..." ext ... pin ...] or none if the input does not parse
        either parse trim num rules
            [reduce ['num nums 'ext ext 'pin pin]]
            [none]
    ]
    set 'well-formed-phone-number? func [num /local data] [
        either none? data: parse-phone-num num [false] [
            any [
                found? find [7 10] length? data/num
                all [11 = length? data/num data/num/1 = #"1"]
            ]
        ]
    ]
]
[15/15] from: Tom::Conlin::gmail::com at: 24-Sep-2007 22:57
Kai Peters wrote:
...
> For now, all phone numbers hail from
> North America - so valid lengths are
>
> a) 7 digits - local number
> b) 10 digits - area code included
> c) 11 digits - leading 1 in front of area code
a slightly more rigid grammar to catch bogus numbers
digit: charset "0123456789"
octit: charset "23456789"
qudit: charset "0123"
sep: ["-"|"."|"_"|"/"] ;;; whatever
exchange: [octit 2 digit]
subscriber: [4 digit]
;;; 7 digit
phone-number: [exchange opt sep subscriber]
;;; 10 digit
area-code: [opt "(" octit qudit digit opt ")" phone-number]
;;; 11 digit
long-distance: [ "1" opt sep area-code]
rule: [ long-distance | area-code | phone-number]
Notes
- Quoted lines have been omitted from some messages.