Thursday, December 25, 2008

Windows Vista greets Moore's law: 64-bit desktop is here

Windows Vista was sold on security. This pitch didn't work and that is a good thing. You don't sell security. Security is bought out of fear, it is reactionary, not innovative. However you put it, hype on security is lame. You need more than fear to have a compelling value proposition, something like features and convenience. Vista didn't have these and most of us stayed with Windows XP.

But guess what happened in 2008? Moore's law ran over 32-bit operating systems. We have more memory than the 32-bit x86 architecture can handle without issues. Many of us started with 2GB of RAM and now have 4GB. Even my 60-year-old parents bought a new ultra-cheap desktop with 3GB preinstalled.

In 2009, 4GB is common and 8GB is probably what you really want. This means that you'll need a 64-bit operating system. In Windows, your options will mostly be limited to Vista 64-bit right now and Windows 7 later. There is no way around it and that is why Vista 64-bit does not need to be sold. It will sell itself.
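
The arithmetic behind the 4GB wall is short enough to sketch. This is only an illustration: it assumes a flat address space with one byte per address, ignores extensions such as PAE, and the function name is made up for this post.

```python
def addressable_gib(address_bits: int) -> float:
    """Return how many GiB can be addressed with the given pointer width."""
    return 2 ** address_bits / 2 ** 30


print(addressable_gib(32))  # 4.0
print(addressable_gib(64))  # far more than any desktop will hold for a while
```

In practice a 32-bit Windows typically sees somewhat less than the full 4 GiB, because part of that address range is reserved for devices.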

Even if it's a forced update, 64-bit Vista promises security and stability improvements compared to XP. All 64-bit drivers will be signed, hopefully resulting in fewer blue screens.

-mika-

Friday, December 19, 2008

150€ Document Management System

I just got a Lexmark X9575 for 150€. A store in Finland was offering these at prices 100€ below anyone else. So I took the bait. This post is a review of the device, sort of. 
Lexmark X9575
Even I couldn't produce a cheaper looking product image

Unfortunately, there is all this 'big picture' stuff that distracts from writing a clean product review. It looks like there is a long-term economic decline ahead of us and people are basically hoping that the next big thing will take us out of it. There are bets on green technology, but generally we are clueless at recognizing useful tech in its infancy. I don't know if the next big thing will be hybrid cars or solar panels, but I am pretty sure that there will be a trend towards getting rid of useless stuff. Like printed paper. We are gradually getting rid of paper documents, for good. There are signs of this everywhere: enterprise document management systems, e-invoicing, and now, cheap personal document scanners that this blog post is about.

I definitely want to get rid of my papers, and if you are in the same boat as I am, it is only practical issues that keep us from starting to scan our archives. There are four technical and process problems we have to solve first.


1. How to scan documents fast?

In document quality, every scanner for the past 20 years has been good enough. The problem is speed. Nobody will wait-and-turn their document archives one page at a time. I think soon there will be enough demand for personal document scanning services, with monster scanners and all, but right now, the solution is to buy an automatic document feeder (ADF). 

You get ADFs with special purpose document scanners for over 500€, or combined with a multifunction printer/scanner/fax/copier starting at 90€. My price point for home was closer to 100€ than 500€, including the other multifunction features. For office, a separate document scanner is better.

Multifunction device makers advertise ADFs by size. ADF tray size is a nice number to print and compare. Bigger is better, but you shouldn't look at this too much. If size is the only information you get while comparing products, then go with the largest. But I suggest searching the internet for user experiences. An ADF is useless if it gets stuck frequently, no matter how large a tray it has attached. Unfortunately, there is no good data on ADF quality.

Hope this helps a bit: I have now scanned approximately 1000 pages with the Lexmark X9575 ADF, with 1 to 50 docs in the tray.   [Dec. 15. 2008 - Jan. 4. 2009]
  • ADF gets stuck most often when it takes papers in at a slightly wrong angle. Papers in the tray rotate easily when they are loose. Putting more papers in the ADF tray gives better support.
  • ADF has got stuck two times with a torn paper
  • Don't feed documents with sticker glue left on them. The glue trails will stick to the scanner plate and you'll have to clean it.
  • It is not enough to unstaple papers. Papers that were once stapled tend to stick together even after being unstapled, and flow through the ADF as one. Separate them in the paper stack.
  • 50 docs is the maximum load for well-preserved smooth-surface documents. For papers that look used, 30 is a more realistic figure
[Update 2.5.2009] With approximately 2000 documents scanned:
  • ADF gets stuck with folded papers easily. By folded, I mean A3 folded once to a four-sided A4. Either fold the sheets tightly or use the flatbed instead of the ADF
  • ADF gets stuck if the front end of the paper is torn, i.e. the end that is fed to the ADF. You should feed it the "better shape end" first and let the software rotate it later
  • The paper's width determines whether it is going to rotate in the ADF. A4 is the minimum width that the tray slider can support
Folded paper


A tip: don't scan with your primary machine. Scanning is a perfect background job. Right now I am writing this with a laptop while the scanner chews through a 50-document load to a desktop box. OCR and PDF generation can take longer than scanning on older machines.

A tip: You can't always use the ADF, for example when scanning receipts, photos, or anything small. In these cases the scanner needs to be near the computer and you have to open and close the scanner lid frequently. Put the scanner in a place where it does not bother you or your loved ones AND where you can have a computer nearby and operate the scanner. With WLAN in the multifunction device, your scanning computer can easily be a laptop.

 
2. How to be sure your documents will be readable forever?

You want to avoid a personal digital dark age, sure. I don't know how paranoid you are but I am willing to trust two formats: JPEG images and PDF/A documents. JPEG is ok. The documents I scan need to be readable but not reproducible in fine detail. However, I like to print scans as original-looking images, not as OCR-generated Word documents. With PDF you can get both image and text, which is all you need.

Choosing the best PDF format is not simple. You want to have a PDF that has the scanned image on top of OCR-generated searchable text. One way to do this is with PDF/A, a format specifically designed for archiving purposes. Basically PDF/A is more conservative than a normal PDF with respect to size optimizations. You get pictures and fonts embedded inside the document. I have no idea if this will be the format of personal archives but that is a problem for another day, one that hopefully will never come.

Bad news: PDF/A support is not included with the default software, at least not with the Lexmark X9575. With the bundled ABBYY FineReader Sprint Plus, the Lexmark can produce PDF 1.3 with text below the image. That may very well be good enough for you. I had issues with this "scan to PDF" feature with over 30 documents in the tray. FineReader crashed with "Out of memory", with a 1.5GB process size! This is a sound reason to switch to 64-bit / 8GB+ machines.

ABBYY FineReader Pro costs 159€, handles PDF/A, and produces smaller documents than Sprint. Here's a feature matrix. A caveat: FineReader Pro needs TWAIN or WIA drivers and does not work over Lexmark WiFi.

A tip [Jan. 4. 2009]: If you have an ADF, scanning is more about software than hardware. You should evaluate alternatives beyond the scanner's default software. I couldn't stick with the Lexmark default apps for anything other than simple one-document tasks. Scanning multiple documents directly to PDFs is not simple. On the other hand, my previous Canon MP150 scanner had a decent MP Navigator application for image scanning. It was a lot better than the Lexmark Productivity Suite, which cannot even handle tasks with multiple images. With Canon I only needed Picasa to complement it for document renaming. With Lexmark I'm currently testing Hamrick VueScan for image scanning/naming tasks.


3. How to search your scanned documents?

Answer: With a good file naming convention and indexing based on PDF text.

I have used the following file naming convention:
yyyy.MM.dd. Sender. Description.

Example: 2007.06.30. Microsoft. MCPD Certificate.jpg
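
If you rename many scans, the convention is easy to generate instead of typing it by hand. Here is a minimal sketch; the function name and its parameters are my own invention for illustration, not part of any scanning tool.

```python
from datetime import date


def archive_name(doc_date: date, sender: str, description: str, ext: str = "pdf") -> str:
    """Build 'yyyy.MM.dd. Sender. Description.ext' from its parts."""
    stamp = doc_date.strftime("%Y.%m.%d")
    return f"{stamp}. {sender.strip()}. {description.strip()}.{ext}"


print(archive_name(date(2007, 6, 30), "Microsoft", "MCPD Certificate", "jpg"))
# 2007.06.30. Microsoft. MCPD Certificate.jpg
```

Zero-padded months and days are what keep a plain alphabetical sort in chronological order.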

I need to find this document later, and I might use Windows Search and try something like ".net certification". The result would be a miss if the search were looking for complete words. With thousands of documents you will have trouble finding what you want. That is why it is useful to have a chronologically sortable naming scheme. In my opinion, backed by cognitive psychology, it is natural for humans to search linearly in time. You can certainly remember that the cert exam was in 2007, perhaps in late spring, and start looking from there in your document folder. And it makes naming folders easier too. You just name your document folder by year:



If you store other people's documents, you have a separate root folder for each. Documents belonging to multiple persons should be copied to each one's folder.
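
Filing by year can be automated once the names follow the convention. The sketch below assumes that every file to be filed starts with a four-digit year and that an "inbox" folder of freshly renamed scans exists; both function names and the inbox idea are made up for this example.

```python
import shutil
from pathlib import Path


def year_folder(root: Path, filename: str) -> Path:
    """Return root/<yyyy> for a name that starts with 'yyyy.MM.dd.'."""
    year = filename[:4]
    if not year.isdigit():
        raise ValueError(f"no leading year in {filename!r}")
    return root / year


def file_into_year_folders(inbox: Path, root: Path) -> None:
    """Move each dated file from inbox into root/<yyyy>/, creating folders as needed."""
    for path in sorted(inbox.iterdir()):
        if not path.is_file():
            continue
        try:
            target = year_folder(root, path.name)
        except ValueError:
            continue  # not renamed yet, leave it in the inbox
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target / path.name))
```

For another person's documents you would call the same function with that person's root folder.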

But hey, this is data management from the 80's. Today, you want to search the contents of your scanned documents, and OCR with PDF content indexing should do this. In the example case, I remember a funny little detail. The cert was printed with a Bill Gates signature.



See, maybe Bill was busy sending certs around the world at the time. So if you search for "Gates", the document should be found. With good OCR even the "Bill Gates" signature should be indexed without any manual effort.
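
To make the difference between file-name search and content search concrete, here is a toy complete-word matcher. It is not how Windows Search works internally; the function names are hypothetical and the OCR text is a made-up stand-in for what an OCR engine might extract from the certificate.

```python
import re


def words(text: str) -> set:
    """Lower-cased set of alphanumeric words found in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(query: str, filename: str, ocr_text: str = "") -> bool:
    """True if every query word is a complete word in the file name or OCR text."""
    return words(query) <= words(filename + " " + ocr_text)


name = "2007.06.30. Microsoft. MCPD Certificate.jpg"
ocr = "Microsoft Certified Professional Developer Bill Gates"  # made-up OCR output

print(matches(".net certification", name))  # False
print(matches("Gates", name))               # False
print(matches("Gates", name, ocr))          # True
```

With the name alone, both queries miss. Only the indexed text layer makes "Gates" a hit.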


Help needed: What is the best way to rename a PDF while you can see its contents, without locking the PDF file? I have tried Presto! PageManager and Nuance PaperPort, and they kinda suck. Thumbnails are too small.
Picasa is great for naming JPGs. I want a similar tool for PDFs.

A tip: If the doc doesn't state its date, guess it and mark the unknown parts, like 2006.12.XX or 1980.XX.YY.
A tip: Scan each document to a single PDF but combine them later to logical multi-page PDFs.
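
The guessed-date placeholders from the first tip still sort sensibly, because letters sort after digits. A small sketch, with a hypothetical helper name, that builds the date part and shows the resulting order:

```python
from typing import Optional


def date_stamp(year: int, month: Optional[int] = None, day: Optional[int] = None) -> str:
    """Return 'yyyy.MM.dd', using XX/YY placeholders for an unknown month or day."""
    if month is None:
        return f"{year:04d}.XX.YY"
    if day is None:
        return f"{year:04d}.{month:02d}.XX"
    return f"{year:04d}.{month:02d}.{day:02d}"


stamps = [date_stamp(2006, 12), date_stamp(2006, 12, 24), date_stamp(1980), date_stamp(2006, 11, 2)]
print(sorted(stamps))
# ['1980.XX.YY', '2006.11.02', '2006.12.24', '2006.12.XX']
```

Documents with a guessed date end up last within their month or year, which is a reasonable place to look for them.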

4. How to dispose of your scanned originals?

Quick answer: you shred them and recycle. Keep only the most important.

I have a small shredder. It is useless. A waste of time and energy for anything except must-destroy docs like the ones you take home from the office.

Instead of shredding, I use a cardboard box where I throw everything to be destroyed and recycled later. You must have a place for centralized, secure document disposal. For example, sneak the documents into your workplace and use its disposal facilities.


Final words

I have been scanning documents for three years now, first with a Canon Pixma MP150 and now with the Lexmark X9575. It takes time and you should value your time properly. Don't go into it yet if you're not willing to fight technical issues. In a few years you will probably get your documents picked up from your house anyway, with scans delivered, secured and backed up somehow.

So why bother? Because of the thrill, man. No matter what the business cycle is today, we as a people are finally getting rid of paper, for good. And only by understanding the details involved do you prepare yourself for the future.

So, how about the Lexmark X9575? Here are my conclusions:

What I wanted
  • Reliable, fast ADF
  • No-hassle automatic scan-to-PDF. The resulting PDFs should be "image with text below" and wait in a folder to be renamed manually later.
  • Reliable software with automation features
  • Silent operation, or WLAN to put the machine away from me

What I got
  • Good enough ADF. My scan machine is old enough to be the speed bottleneck, not the scanner. Actually, the ADF is reliable enough that it could be bigger, like 100 docs.
  • Hassle with scanning A4 size docs with the default software. Letter/Legal scanning should be ok. No points for Lexmark i18n. Shame on them.
  • Buggy default software. But this is what you get for less than 500€. You just have to learn to deal with it. Scan-to-PDF automation didn't work with a full ADF load. Losing 50 scans sucks.
  • Silent operation AND WLAN, good


-mika-


P.S. I intend to add more tips for the scanning process as I get more experience with the current tools.