
scanning for archival and OCR


scanning for archival and OCR

David H. Durgee
Now that I have my Canon PIXMA MG2120 working as a scanner, I am looking
at using it to scan some documents for archival purposes.  My intent is
to scan some old bills and other documents and database them in some
manner before disposing of the originals.  Since they are mostly bills,
they tend to be B&W, with the possible exception of some logos or
letterhead/footer elements in color.

I am trying to determine how best to scan and save these documents. I
would like to be able to reprint them in the future if that becomes
necessary, implying a high quality scan.  But I am also concerned with
the size of the saved documents on my system.  What is the best file
format for saving such documents?  I would imagine that 16 or possibly
even 8 colors would suffice to represent the documents, assuming the
palette can be specified.  The documents are likely to be mostly
white-space, so I would expect them to be highly compressible.

Given that I am interested in databasing them, I would also think that
using an OCR program to extract the content would be useful.  I tried a
few things on a small document and encountered problems, given its
mixed content.  Are there any specific recommendations for scanning a
document with OCR in mind?  As I am running Linux I am looking at
tesseract and cuneiform at this point; at least one of them wants a B&W
document as input, and the results appear to vary quite a bit with the
scan resolution.

Am I dealing with conflicting goals?  Will I need to scan each document
twice or more to accomplish what I want here?

Dave

--
sane-devel mailing list: [hidden email]
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/sane-devel
Unsubscribe: Send mail with subject "unsubscribe your_password"
             to [hidden email]

Re: scanning for archival and OCR

Jeremy Johnson
The Perl application gscan2pdf will probably do what you need:
http://gscan2pdf.sourceforge.net/

I use a shell script "bscan" for scanning to PNM and then converting to
e.g. PDF.  Since my scanner scans better in 8-bit grayscale than in
1-bit B&W, I scan in 8-bit grayscale @ 300dpi and then convert to
bitonal black & white using djvu wavelet compression (option -BW in my
script):
bscan --mode=8-bit --shades=2 --page=Legal --comp=lzw -BW FILE

Sometimes I may need to use a photo scanner with high optical resolution (e.g.
an Epson with 24-bit grayscale). If I need to scan in color, I usually scan to
pnm then convert to djvu using c44, e.g.:
bscan --mode=color --shades=truecolor --page=Letter -c44 --djvutopdf=25 FILE

http://www.acjlaw.net:8080/~jeremy/Ricoh/usage_bscan.html
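Since bscan is the author's private script, an equivalent grayscale-scan-to-bitonal-DjVu step can be sketched with standard tools (scanimage from SANE, pamditherbw from netpbm, cjb2 from DjVuLibre); the function name and filenames below are hypothetical, not part of bscan:

```shell
#!/bin/sh
# Sketch of the grayscale -> bitonal DjVu step described above.
scan_to_bitonal_djvu() {
    base=$1
    # 8-bit grayscale scan at 300 dpi, as in the post
    scanimage --mode Gray --resolution 300 > "$base.pgm"
    # threshold the grayscale image down to 1-bit black & white
    pamditherbw -threshold "$base.pgm" > "$base.pbm"
    # bitonal (JB2) DjVu compression
    cjb2 -dpi 300 "$base.pbm" "$base.djvu"
}

# invoked only when a basename argument is given, e.g.:
#   scan_to_bitonal_djvu bill-2013-01
if [ $# -gt 0 ]; then scan_to_bitonal_djvu "$1"; fi
```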

I haven't had much luck with any of the open-source OCR programs.
Maybe 90% accuracy at best, and only on straight B/W text with no
logos, rules, or underlines, with all text horizontal and in the same
font weight and shape.



Re: scanning for archival and OCR

Martin Dengler
In reply to this post by David H. Durgee
On Tue, Jan 22, 2013 at 12:34:05PM -0500, David H. Durgee wrote:
> I am trying to determine how best to scan and save these documents.

I have found the following process to be useful:

Scanner input, jpg (or pdf)
  |
  v
tidy up image using 'unpaper'
  |
  v
convert to grayscale via ppmtopgm -> pamtotiff
  |
  v
OCR using tesseract

Tesseract can also embed the OCR text in the PDF (search for
"tesseract hocr").

This is a makefile I use to automate that process, starting from a PDF
(image only) generated by my scanner:

http://www.martindengler.com/proj/scan-post-process-Makefile

...like so:

make -f scan-post-process-Makefile  $(basename input.pdf .pdf)-processed

Tesseract isn't perfect, but it's pretty good.
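For reference, the steps diagrammed above can be sketched as a small script (hypothetical filenames; assumes unpaper, netpbm, and tesseract are installed -- a sketch, not the Makefile itself):

```shell
#!/bin/sh
# Sketch of the unpaper -> grayscale -> TIFF -> tesseract pipeline.
ocr_page() {
    base=$1
    unpaper "$base.ppm" "$base-clean.ppm"             # deskew/clean the scan
    ppmtopgm "$base-clean.ppm" | pamtotiff > "$base.tif"  # grayscale TIFF
    # "pdf" asks tesseract for a searchable PDF (image + text layer);
    # plain "tesseract in.tif out" would produce out.txt instead
    tesseract "$base.tif" "$base-processed" pdf
}

# invoked only when a basename argument is given, e.g.:
#   ocr_page input            # -> input-processed.pdf
if [ $# -gt 0 ]; then ocr_page "$1"; fi
```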


Martin


Re: scanning for archival and OCR

Bugzilla from michkloo@xs4all.nl
In reply to this post by David H. Durgee

I have been scanning for archival purposes and will tell you how I
achieve this.
1. xsane lets you choose the file type used when writing a scan to
   disk.  For archival purposes I use PDF.  I made a scan for archival
   with the name vkweer1.pdf.  For the next scan I wanted to save,
   xsane proposed the name vkweer2.pdf; the number goes up by 1 for
   each subsequent scan.  After finishing the scanning, I combine the
   separate scans into one file with ghostscript.  An example of
   combining PDFs with ghostscript:
   gs -q -sPAPERSIZE=a4 -dNOPAUSE -dBATCH -sDEVICE=pdfwrite
   -sOutputFile=vkweerc.pdf vkweer1.pdf vkweer2.pdf

  This command combines the two input files (vkweer1.pdf and
  vkweer2.pdf) into one output file (vkweerc.pdf) which is much
  smaller than the original input files:
        ls -l vkw*
-rw-r--r-- 1 julien users   682131 Jan 23 11:51 vkweerc.pdf
-rw-r----- 1 julien users 14439950 Jan 23 11:46 vkweer1.pdf
-rw-r----- 1 julien users 14439950 Jan 23 11:47 vkweer2.pdf

As you can see, the output of gs (the file vkweerc.pdf) is less than
1/42 the combined size of the input files vkweer1.pdf and vkweer2.pdf.
Great for archival purposes ;-) .  In the xsane setup I specified that
PDFs be saved zlib-compressed, but apparently ghostscript does a much
better job as far as compression is concerned.

Hope this is useful to you.
Julien





--
Julien Michielsen
julien_at_michkloo.xs4all.nl



Re: scanning for archival and OCR

Rolf Bensch-2
In reply to this post by David H. Durgee
Hi David,

I would install scanbd (http://sourceforge.net/projects/scanbd/).
Then you can use the scanner buttons to control scanning with SANE.
scanbd calls shell scripts, so you can write your own scripts to match
your needs.

You can ask this mailing list if you have problems with scanbd (but
please search the list archives first ;) ).

Cheers,
Rolf




Re: scanning for archival and OCR

Jeremy Johnson
In reply to this post by Bugzilla from michkloo@xs4all.nl
Hmmm, I guess I learn something new every day.  I wouldn't have
suspected that ghostscript's pdfwrite device could concatenate PDFs
and save so much through recompression.

So I just did a test, scanning some tax forms in 8-bit grayscale to
z[0001 ... 0019].pdf using xsane and then combining them using both gs
and pdftk.

The results:

$ du -csh z00??.pdf
388K    z0001.pdf
1.2M    z0002.pdf
1.3M    z0003.pdf
1.1M    z0004.pdf
1.6M    z0005.pdf
892K    z0006.pdf
724K    z0007.pdf
908K    z0008.pdf
1.3M    z0009.pdf
728K    z0010.pdf
556K    z0011.pdf
196K    z0012.pdf
1.4M    z0013.pdf
196K    z0014.pdf
580K    z0015.pdf
472K    z0016.pdf
376K    z0017.pdf
920K    z0018.pdf
1.3M    z0019.pdf
16M     total


# Now concatenate using pdftk
$ pdftk z00??.pdf cat output PDFTK.pdf
$ ls -sh PDFTK.pdf
16M PDFTK.pdf

# Concatenate using ghostscript's re-write
$ gs -q  -dNOPAUSE -dBATCH -sDEVICE=pdfwrite  -sOutputFile=GS.pdf z00??.pdf
$ ls -sh GS.pdf
8.5M GS.pdf

Of course, pdftk allows mixing paper sizes.  Ghostscript's writer will
truncate pages that are larger than the default or specified page
size.  I'm not sure whether ghostscript can write PDFs with mixed
paper sizes.

For good measure, I also tried pdfjam/pdfjoin/pdflatex and it too just
concatenates the pdfs into a 16M file.
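Incidentally, gs's pdfwrite device also exposes preset compression levels via -dPDFSETTINGS, which can shrink image-heavy PDFs further by downsampling embedded images (a sketch; filenames hypothetical):

```shell
#!/bin/sh
# Re-write a PDF through ghostscript with a compression preset.
# -dPDFSETTINGS presets: /screen (images downsampled to 72 dpi),
# /ebook (150 dpi), /printer (300 dpi), /prepress (300 dpi,
# color-preserving).
shrink_pdf() {
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
       -dPDFSETTINGS=/ebook \
       -sOutputFile="$2" "$1"
}

# invoked only when input and output filenames are given, e.g.:
#   shrink_pdf GS.pdf GS-small.pdf
if [ $# -ge 2 ]; then shrink_pdf "$1" "$2"; fi
```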




Re: scanning for archival and OCR

Jeremy Johnson
I think ghostscript must be writing one shared dictionary for the
whole document instead of one dictionary per page.  If I take my 16M
PDFTK.pdf and re-write it using ghostscript, ghostscript produces an
8.5M file:

$  gs -q  -dNOPAUSE -dBATCH -sDEVICE=pdfwrite  -sOutputFile=NEW.pdf PDFTK.pdf
   **** Warning: considering '0000000000 XXXXX n' as a free entry.
   [18 more identical warnings omitted]

   **** This file had errors that were repaired or ignored.
   **** The file was produced by:
   **** >>>> itext-paulo-155 (itextpdf.sf.net-lowagie.com) <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

$ ls -sh NEW.pdf
8.5M NEW.pdf


Re: scanning for archival and OCR

David H. Durgee
In reply to this post by Martin Dengler
After reading the responses in this thread I decided to try something
out.  I picked an old, 3-page telephone bill on 10.3 cm x 17 cm pages.
I scanned this bill with my MG2120 at 1200 dpi full color and saved
the PNM files from xsane for later processing.  Each file is 114,489K
in size, as these are high-res full-color scans.  I then opened each
of them in Gimp, increased the contrast by 30 points, and saved the
results as JPG with a quality setting of 0 to maximize compression.
This produced files of 310-340K each.  I then used convert to create
PDF versions of them, ranging from 312K to 342K in size.  Finally I
used pdftk to create a single 989K PDF file.  Interestingly, using the
gs command from another post in this thread created a single 2,969K
PDF file!
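The manual GIMP step could also be batched with ImageMagick (a sketch with hypothetical filenames, not what was actually run; -brightness-contrast 0x30 raises contrast by 30 with no brightness change, roughly like GIMP's +30 contrast):

```shell
#!/bin/sh
# Batch version of the contrast + JPEG + PDF steps described above.
pnm_to_pdf_page() {
    base=$1
    # boost contrast, then compress hard; lower -quality = smaller file
    convert "$base.pnm" -brightness-contrast 0x30 \
            -quality 20 "$base.jpg"
    # wrap the JPEG in a single-page PDF
    convert "$base.jpg" "$base.pdf"
}

# invoked only when a basename argument is given, e.g.:
#   pnm_to_pdf_page page1
if [ $# -gt 0 ]; then pnm_to_pdf_page "$1"; fi
```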

The page images resulting from this process are very readable, although
bleed-through from printing on the back of each page results in some
strange variations in the background shading.  If anyone has suggestions
for alternative processing of the page images that might produce better
results, I will test them on this bill.  I might also try a
lower-resolution scan; while I imagine that would produce smaller
files, I suspect their readability may be impaired.

Dave


Re: scanning for archival and OCR

Jeremy Johnson
Generally, 1200 dpi resolution for text would be overkill unless you have a
document with extremely tiny print (1-2 point instead of 10-12 point).

They used to recommend 150 dpi or even 75 dpi for scanning documents
containing just plain text. But I scan at 300 dpi and also print ordinarily at
300 dpi which for me is adequate quality for plain text documents.

My photo scanner can scan at 2400 dpi but my printer can only print at
a maximum of 1200 dpi.  Scanning at a higher resolution than the one
you print at can be useful for enlarging a portion of the image;
otherwise it's probably just a waste of disk space.
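The disk-space cost of resolution grows quadratically with dpi. A rough back-of-the-envelope for an uncompressed letter-size (8.5 x 11 in) scan, computed in tenths of an inch to stay in integer shell arithmetic:

```shell
#!/bin/sh
# Uncompressed size of a letter-size scan:
#   bytes = (dpi * 8.5) * (dpi * 11) * bytes_per_pixel
scan_bytes() {
    dpi=$1; bpp=$2   # bpp: 1 = 8-bit grayscale, 3 = 24-bit color
    echo $(( dpi * dpi * 85 * 110 / 100 * bpp ))
}

scan_bytes 300 1    # 300 dpi grayscale -> 8415000 bytes (~8 MB)
scan_bytes 1200 3   # 1200 dpi color  -> 403920000 bytes (~404 MB)
```

So a 1200 dpi color scan of a full page costs roughly fifty times the raw bytes of a 300 dpi grayscale one, before any compression.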

Similarly, I ordinarily wouldn't scan bills, invoices, receipts, etc. in
color, since a Black&White 1-bit image would suffice for my needs. If someone
were to ask me for a copy of a receipt or check, even a G3 fax would probably
be good enough.

If I have a document with pages mixing text and color graphics/photos, I
ordinarily scan at full color depth and use djvu wavelet compression, which
generates reasonably small file sizes without sacrificing too much text clarity.

A typical grayscale scan of a black & white letter-sized document
results, after binarization, in a PDF file size of 30-40K per page
(depending on line spacing, line length, text size, text weight,
etc.).  A typical color scan with djvu wavelet compression is about
10X as large (again depending on the mix of text and graphics).



--
sane-devel mailing list: [hidden email]
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/sane-devel
Unsubscribe: Send mail with subject "unsubscribe your_password"
             to [hidden email]
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
