The following short document describes how research mathematicians can go about creating an online archive or collected works for themselves. We assume that the following resources are available:
Note that the tools described in (2) are available with most GNU/Linux distributions, such as Debian and Red Hat.
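Before starting, it may help to check which of the command-line tools used later in this document are actually installed. The following is a minimal sketch, not part of the original procedure; the function name missing_tools is made up for illustration, and it only assumes a POSIX shell.

```shell
#!/bin/sh
# missing_tools CMD...: print each command that is not found on the PATH.
# Returns 0 if every command was found, 1 otherwise.
# (missing_tools is a hypothetical helper name.)
missing_tools() {
    rc=0
    for tool in "$@"
    do
        if ! command -v "$tool" > /dev/null 2>&1
        then
            echo "$tool"
            rc=1
        fi
    done
    return $rc
}

# Example: check the commands that appear in this document.
# missing_tools scanimage convert pbmreduce cjb2 djvm djvused
```

If the function prints nothing, all the listed commands were found; any name it prints needs to be installed from your distribution's packages.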
If your paper is already available as a PostScript(TM), .dvi, TeX or LaTeX file, then life is easy: you can link the file directly from your list of publications. Even then, you may want to examine the following packages to make the viewer’s job easier.
Occasionally, you may have a file in Microsoft(TM) Word, ChiWriter or some other format. The simplest procedure is to use the original program to generate PostScript(TM). However, it is also possible to convert these documents to text using programs such as antiword or even strings. After that, you are only slightly better off than you would be after applying Optical Character Recognition to scanned documents (see the end of the next section).
We now deal with papers that are only available in print form. First of all you need to make sure that SANE is properly installed and configured. Run the command scanimage -T from a command prompt. This will perform a sequence of tests and should give “PASS” for all these tests. On the other hand you may not be so lucky:
Note that some people may suggest that you work with xscanimage at this point, but it is very inconvenient for what you want to do; the command line is quicker.
Next you need to find out the precise way in which to scan the paper so that one
corner of the paper is at (0,0). For a “Flat Bed” scanner this is usually one of the
corners of the glass “bed”. Measure the height h and width w of the paper to be
scanned in millimeters. Now enter the command
scanimage --auto-threshold --mode lineart --resolution 50 \
    -l 0 -t 0 -x w -y h > /tmp/test.pbm
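If you only know the paper size in inches, the width and height can be converted to millimeters first (1 inch is 25.4mm). The helper below is a small sketch; the name in_to_mm is made up for illustration, it relies on awk being installed, and whether your scanner backend accepts fractional millimeters is an assumption you should check.

```shell
#!/bin/sh
# in_to_mm INCHES: print the length in millimeters, to one decimal place.
# (in_to_mm is a hypothetical helper name; 1 inch = 25.4mm.)
in_to_mm() {
    awk -v inches="$1" 'BEGIN { printf "%.1f\n", inches * 25.4 }'
}

# Example: a US Letter sheet is 8.5in by 11in.
# in_to_mm 8.5   prints 215.9
# in_to_mm 11    prints 279.4
# An A4 sheet needs no conversion: it is 210mm by 297mm.
```

The two numbers obtained this way take the place of w and h in the scanimage command.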
We are now set to scan the pages one by one in a simple fashion that is only possible with command-line invocations. The following script was the originally suggested solution (before the availability of djvulibre); a better solution is outlined below. This script will automatically number your pages and convert them to a browser-friendly format; all you need to do is feed a new page when prompted and interrupt with a “Control-C” when done. Run this script with w and h as the command-line parameters.
#!/bin/sh
# Scan pages one at a time into numbered PNG files plus thumbnails.
# Parameters: paper width and height in millimeters.
if [ $# -lt 2 ]
then
    echo Give Width and Height as parameters
    exit 1
fi
WD=$1
HT=$2
# This directory should not exist!
if [ -d /tmp/paperscan ]
then
    echo Someone has already created /tmp/paperscan
    echo Edit the script and change the directory name
    exit 1
fi
mkdir /tmp/paperscan
cd /tmp/paperscan
i=1
while true
do
    echo Feed the next page into the scanner or Ctrl-C to exit
    scanimage --auto-threshold --mode lineart --resolution 100 \
        -l 0 -t 0 -x "$WD" -y "$HT" > page.pbm
    # Full-size page for viewing
    convert -mono pbm:page.pbm png:page$i.png
    # Thumbnail, reduced by a factor of 5
    pbmreduce 5 page.pbm | convert -mono pbm:- png:thumbpage$i.png
    rm page.pbm
    i=`expr $i + 1`
done
To make these pages available via your web server, move them to a separate directory under your home-page directory; don’t forget to clear out the directory /tmp/paperscan! Under your home page you can create the HTML files as per the following templates: the first is the index page of thumbnails, and the second is the page for an individual scan (shown here for page 2).
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<title>My fundamental paper</title>
</head>
<body>
<br>Each of the thumbnails below is a scan made for screen-based viewing
of the paper on "My fundamental equation". Unfortunately this is for
graphical mode viewing only.
<hr>
<A href="node1.html"><img src="thumbpage1.png" alt="Page 1"></A>
<A href="node2.html"><img src="thumbpage2.png" alt="Page 2"></A>
..... and so on
<hr>
</body>
</html>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>My fundamental equation</TITLE>
<LINK REL="next" HREF="node3.html">
<LINK REL="previous" HREF="node1.html">
<LINK REL="up" HREF="index.html">
</HEAD>
<BODY>
<hr>
<img src="page2.png" alt="Page 2">
<hr>
</BODY>
</HTML>
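Writing these files by hand becomes tedious for a long paper. The following sketch generates the index page and the per-page files from the two templates above. The function name make_pages and its two arguments (page count and title) are made up for illustration; it assumes the images are named page1.png, thumbpage1.png and so on, as produced by the scanning script, and it does not escape special HTML characters in the title.

```shell
#!/bin/sh
# make_pages N TITLE: write index.html and node1.html ... nodeN.html
# into the current directory. (make_pages is a hypothetical helper name.)
make_pages() {
    n=$1
    title=$2
    # Index page: one linked thumbnail per scanned page.
    {
        echo '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">'
        echo "<html> <head> <title>$title</title> </head> <body>"
        echo '<hr>'
        i=1
        while [ "$i" -le "$n" ]
        do
            echo "<A href=\"node$i.html\"><img src=\"thumbpage$i.png\" alt=\"Page $i\"></A>"
            i=$((i + 1))
        done
        echo '<hr> </body> </html>'
    } > index.html

    # One page per scan, with next/previous links where they exist.
    i=1
    while [ "$i" -le "$n" ]
    do
        prev=$((i - 1))
        next=$((i + 1))
        {
            echo '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">'
            echo "<HTML> <HEAD> <TITLE>$title</TITLE>"
            if [ "$i" -lt "$n" ]
            then
                echo "<LINK REL=\"next\" HREF=\"node$next.html\">"
            fi
            if [ "$i" -gt 1 ]
            then
                echo "<LINK REL=\"previous\" HREF=\"node$prev.html\">"
            fi
            echo '<LINK REL="up" HREF="index.html">'
            echo "</HEAD> <BODY> <hr> <img src=\"page$i.png\" alt=\"Page $i\"> <hr> </BODY> </HTML>"
        } > "node$i.html"
        i=$next
    done
}

# Example, run inside the directory holding the images of a 14-page paper:
# make_pages 14 "My fundamental paper"
```

The generated index page omits the explanatory sentence from the first template; add it by hand if you want it.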
If you wish to additionally convert to PDF format you can employ the command convert -mono -adjoin -page A4 -cache 32 page*.png doc.pdf. Note that this requires quite a lot of spare room in the /tmp directory and a lot of memory (if you have less RAM you can experiment with -cache 16). The process is rather slow and produces large-ish PDF files (one 14 page document came out as a 2.5MB PDF file).
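One caveat with the pattern page*.png: the shell expands it in lexical order, so page10.png comes before page2.png once a paper has ten or more pages. The sketch below lists the files in numeric order instead. The function name pages_in_order is made up for illustration, and it assumes the pages are numbered consecutively from 1, as the scanning script does.

```shell
#!/bin/sh
# pages_in_order: print page1.png, page2.png, ... one per line, in numeric
# order, stopping at the first number for which no file exists.
# (pages_in_order is a hypothetical helper name.)
pages_in_order() {
    i=1
    while [ -f "page$i.png" ]
    do
        echo "page$i.png"
        i=$((i + 1))
    done
}

# Example: pass the ordered list to the PDF conversion.
# convert -mono -adjoin -page A4 -cache 32 $(pages_in_order) doc.pdf
```

The unquoted $(pages_in_order) is deliberate: the file names contain no spaces, so the shell splits the output into one argument per page.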
We can convert documents scanned at 300dpi directly into DjVu documents, which also have thumbnails! As above, we scan the pages one by one in a simple fashion that is only possible with command-line invocations. The following script is a better solution than the one outlined above. It will automatically convert your pages into a browser-friendly format called DjVu; all you need to do is feed a new page when prompted and answer “n” when done. Run this script with w and h as the command-line parameters.
#!/bin/sh
# Scan pages one at a time at 300dpi and collect them into a DjVu document.
# Parameters: paper width and height in millimeters.
if [ $# -lt 2 ]
then
    echo Give Width and Height as parameters
    exit 1
fi
WD=$1
HT=$2
# This directory should not exist!
if [ -d /tmp/paperscan ]
then
    echo Someone has already created /tmp/paperscan
    echo Edit the script and change the directory name
    exit 1
fi
mkdir /tmp/paperscan
cd /tmp/paperscan
i=1
ans=y
pagelist=
while [ "$ans" = "y" ]
do
    echo Feed the next page into the scanner
    scanimage --auto-threshold --mode lineart --resolution 300 \
        -l 0 -t 0 -x "$WD" -y "$HT" | \
        cjb2 -dpi 300 -clean -loose - page$i.djvu
    pagelist="$pagelist page$i.djvu"
    i=`expr $i + 1`
    printf "Another page? (Y/n) "
    read ans
    # An empty answer or "Y" counts as yes, matching the prompt's default
    case "$ans" in
        ""|Y) ans=y ;;
    esac
done
# Now convert the pages into a bundled document
djvm -c bundle.djvu $pagelist
# We could stop here but it is probably a good
# idea to create an unbundled document as well
djvused -e 'save-indirect index.djvu' bundle.djvu
On completion, you can save the document in a separate directory under your home page and offer the file index.djvu as the index file for this directory. The file bundle.djvu can also be offered as an all-in-one document.
At some stage we will become more ambitious and adventurous and examine using Optical Character Recognition (OCR) programs to convert the scanned document to LaTeX!
Chapter 1 of my thesis, “On the Canonical Ring of a curve”, and Chapter 2 of my thesis, “The Kuga-Satake correspondence”. The original was printed using ChiWriter on a dot-matrix printer. This copy was scanned at 200dpi and reduced as above (the pbmreduce option -value 0.75 was used to touch it up a bit).
The same are also available as DjVu documents “On the Canonical Ring ...” and “The Kuga-Satake correspondence”.