World: r3wp

[View] discuss view related issues

Graham
18-Apr-2008
[7632x4]
Anton, I'd like fully automatic, and yes, grayscale.
I think the algorithm can assume that if the line advances more than 
1/3 of the way across the depth, there is no junk.
I personally don't need the horizontal edges cropped, as it's usually 
vertical displacement that's a problem with faxes.
I remembered your auto-crop function but didn't recall where it was 
.. you shift websites so often!
Anton
19-Apr-2008
[7636x2]
You should just load-thru useful looking rebol urls when you see 
them here, then you can just scan your public cache.
Are there likely to be horizontal black lines (eg. borders) which 
should be considered junk ?
Graham
19-Apr-2008
[7638]
Not in my case and I think you might then come up against the problem 
of deciding what is a border and what is a character.
Anton
19-Apr-2008
[7639x3]
Yes, I would grade the scan line according to the ratio of black : white 
pixels on the line. Text is probably between 20-85% black pixels, 
and borders could perhaps be detected at > 95% black. Anyway, if 
you don't need it, that's much easier :)
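A minimal sketch of the scan-line grading idea described above, written in Python for illustration only (it is not taken from the REBOL scripts linked in this thread). It assumes a bitonal image stored as a list of rows, with 0 for black and 255 for white. The names `black_ratio` and `classify_row` are hypothetical, and the thresholds are simply the rough figures quoted in the message.

```python
def black_ratio(row, black=0):
    """Return the fraction of pixels in one scan line that are black."""
    if not row:
        return 0.0
    return sum(1 for p in row if p == black) / len(row)


def classify_row(row, text_range=(0.20, 0.85), border_min=0.95):
    """Grade one scan line by its share of black pixels.

    The thresholds are the rough guesses from the discussion
    (text 20-85% black, border above 95% black), not tuned values.
    """
    ratio = black_ratio(row)
    if ratio == 0:
        return "blank"
    if ratio > border_min:
        return "border"
    if text_range[0] <= ratio <= text_range[1]:
        return "text"
    return "unknown"
```

Rows that fall outside both bands (for example 90% black) are reported as "unknown" rather than forced into one class, since the thread does not say how such lines should be treated.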
I have something that's starting to work.

If we can preprocess the greyscale images so that they're bitonal 
(black and white), and denoised, then my algorithm has a chance.
http://anton.wildit.net.au/rebol/gfx/auto-crop-bitmap-text.r
http://anton.wildit.net.au/rebol/gfx/demo-auto-crop-bitmap-text.r
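The preprocessing step mentioned above (greyscale to bitonal, then denoise) could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the fixed threshold of 128, the choice to treat a black pixel with no black neighbours as noise, and the function names `to_bitonal` and `remove_isolated_pixels`. The thread does not say which method was actually used.

```python
def to_bitonal(gray, threshold=128):
    """Map each greyscale value (0-255) to pure black (0) or white (255)."""
    return [[0 if p < threshold else 255 for p in row] for row in gray]


def remove_isolated_pixels(img):
    """Turn black pixels white when none of their 8 neighbours is black."""
    height = len(img)
    width = len(img[0]) if height else 0
    out = [row[:] for row in img]
    for y in range(height):
        for x in range(width):
            if img[y][x] != 0:
                continue
            has_black_neighbour = any(
                img[ny][nx] == 0
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_black_neighbour:
                out[y][x] = 255
    return out
```

A single speck is removed while two adjacent dark pixels survive, which is a very crude filter; real fax noise would likely need something stronger.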
Graham
19-Apr-2008
[7642x2]
I'll give it a twirl
if they're placed in an anonymous context, how did you make them 
public?
Anton
19-Apr-2008
[7644x2]
Almost all my function libraries are in anonymous contexts.

Basically, DOing the library file (eg. do %auto-crop-bitmap-text.r) 
returns the context, and you just GET out the words you are interested 
in.
This job is eased a bit by my INCLUDE function.
You should be able to do this instead of using INCLUDE:


 auto-crop-bitmap-text: get in do %auto-crop-bitmap-text.r 'auto-crop-bitmap-text
Graham
19-Apr-2008
[7646]
ok
Anton
19-Apr-2008
[7647x2]
Now you can use the auto-crop-bitmap-text function.
It's really very simple. I've just got this include function which 
kind of hides the simplicity a bit (unfortunately). I wish something 
like that was built in to rebol.
Graham
19-Apr-2008
[7649]
ok, found some images that don't work.
Anton
19-Apr-2008
[7650]
cool, ... why not ?
Graham
19-Apr-2008
[7651]
I'll run them again ...
Anton
19-Apr-2008
[7652]
(note, the cropping is only off the top and bottom edges, ie. vertically 
cropped only.)
Graham
19-Apr-2008
[7653]
what does it do if the image is all white space?
Anton
19-Apr-2008
[7654]
good question. let me check...
Graham
19-Apr-2008
[7655]
what happens if you include two words above and below in the image?
Anton
19-Apr-2008
[7656x2]
All white does not do any cropping. It didn't find any "content" 
(non-white pixels) to crop to. What do you want it to do ?
If there's two words above and below in the image, like this:

	line one
	line two
	middle
	line four
	line five


then the 3 middle lines will be included, unless there is some white 
above the top line or some white below the bottom line (in which 
cases they will be included, respectively).
Graham
19-Apr-2008
[7658]
Hmm.... return none?
Anton
19-Apr-2008
[7659]
Let me add that to the to-do list...
Is this a common case, by the way ?
Graham
19-Apr-2008
[7660x3]
Yes
Let me send you some images ... that it appears to have failed on.
OK, sent. Some don't have the whitespace cropped at the top.
Anton
19-Apr-2008
[7663]
Currently the algorithm scans downwards and upwards simultaneously, 
looking for non-white content. When it doesn't find any, it has nowhere 
to crop to, so no cropping happens. I can change it so that when 
the scans bump into each other they set that as the "content found" 
position, and, the scan lines being right next to each other, will 
result in a 0-height crop region. I will check for that case and 
return none instead.
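The two-directional scan described above can be sketched as follows. This is a simplified Python illustration, not the REBOL implementation: it only looks for the first and last non-white rows, ignores the junk weighting discussed elsewhere in the thread, and uses a hypothetical function name `find_content_bounds`. It assumes a list of rows with 255 as white.

```python
def is_blank(row, white=255):
    """True when every pixel in the scan line is white."""
    return all(p == white for p in row)


def find_content_bounds(img):
    """Scan from the top and the bottom at the same time.

    Returns (top, bottom) as a half-open row range, or None when the
    two scans meet without finding any non-white row.
    """
    top, bottom = 0, len(img) - 1
    found_top = found_bottom = False
    while top <= bottom and not (found_top and found_bottom):
        if not found_top:
            if is_blank(img[top]):
                top += 1
            else:
                found_top = True
        if not found_bottom and top <= bottom:
            if is_blank(img[bottom]):
                bottom -= 1
            else:
                found_bottom = True
    if top > bottom:
        # The scans bumped into each other: nothing to crop to.
        return None
    return top, bottom + 1
```

Returning None for an all-white image leaves the caller to decide what to do, which matches the behaviour requested in the conversation.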
Graham
19-Apr-2008
[7664]
Some don't lose the rubbish at the bottom.
Anton
19-Apr-2008
[7665x3]
The first one shows this result, indeed. Let me analyse...
I understand the bug in my code. I did not implement the weighting 
quite correctly.
hmm.. more issues... it's complex when you want to scan from top 
and from bottom simultaneously.
Graham
19-Apr-2008
[7668]
Anton, I found that the OCR engine I am using needs a white space 
border, so I am padding the image back again with a little white 
space.
Anton
19-Apr-2008
[7669]
That would help my algorithm. Text which is right up against the 
edge is likely to be classified as 'junk'. When there is text at 
the top edge and text at the bottom edge only, then we have two possibly 
'content' texts. But which one is the content and which is the junk 
? The algorithm is forced to either make a choice (which it could 
do by choosing the larger one), or not choose at all (which is what 
currently happens), so including both as the 'content'. If you put 
just one line of white outside the text you consider 'content' then 
it will be surrounded by white and the algorithm will select it as 
'content'.
Graham
19-Apr-2008
[7670]
I would always select the larger ...
Anton
20-Apr-2008
[7671x2]
Rewritten algorithm (selects the larger now).
load-thru/update these two:
http://anton.wildit.net.au/rebol/gfx/auto-crop-bitmap-text.r
http://anton.wildit.net.au/rebol/gfx/demo-auto-crop-bitmap-text.r
And download this new test script:
http://anton.wildit.net.au/rebol/gfx/test-auto-crop-bitmap-text.r
You can fiddle with the last script to make it load your 6 test files 
(which all yield correct looking results).
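One way to read "selects the larger" is to split the image into runs of consecutive non-white rows and keep the tallest run. The Python sketch below takes that reading as an assumption; the rewritten REBOL script may measure "larger" differently. The names `content_runs` and `largest_run` are hypothetical.

```python
def content_runs(img, white=255):
    """Return half-open (start, end) row ranges of consecutive non-white rows."""
    runs = []
    start = None
    for y, row in enumerate(img):
        blank = all(p == white for p in row)
        if not blank and start is None:
            start = y
        elif blank and start is not None:
            runs.append((start, y))
            start = None
    if start is not None:
        runs.append((start, len(img)))
    return runs


def largest_run(img):
    """Pick the tallest run of content rows, or None for an all-white image."""
    runs = content_runs(img)
    if not runs:
        return None
    # On a tie, max() keeps the first (topmost) run.
    return max(runs, key=lambda r: r[1] - r[0])
```

With junk text touching the top and bottom edges and a taller block in the middle, this picks the middle block.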
Graham
20-Apr-2008
[7673x2]
Cool.
if the region is blank, your scan routine returns none, and then 
the crop errors.
Anton
20-Apr-2008
[7675]
Oops, forgot the simplest input.
Anton
21-Apr-2008
[7676x2]
I've fixed that oversight. Update these files:
	auto-crop-bitmap-text.r 
	test-auto-crop-bitmap-text.r
The above update also cleans up loose words in the auto-crop-bitmap-text.r 
file.
Graham
21-Apr-2008
[7678x2]
I added a /pad option to mine so that it returns the text with a 
white space border.
which is needed for some ocr engines
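Padding a cropped image back out with a white border can be sketched like this. It is a Python illustration of the idea only, not the /pad refinement mentioned above; the function name `pad_with_white` and the default margin of 2 pixels are assumptions.

```python
def pad_with_white(img, margin=2, white=255):
    """Surround a list-of-rows image with a white border of `margin` pixels."""
    if not img:
        return []
    width = len(img[0]) + 2 * margin
    padded = [[white] * width for _ in range(margin)]
    for row in img:
        padded.append([white] * margin + list(row) + [white] * margin)
    padded.extend([white] * width for _ in range(margin))
    return padded
```

The input image is left unmodified; a new list of rows is returned.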
Anton
21-Apr-2008
[7680x2]
/border makes more sense, doesn't it ?
maybe not...