OCR'd magazine scans
Hi all,
I've been working on OCR'ing some magazine scans over the last few weeks, and combining the images together into a searchable, indexed PDF for each issue. I've started with Crash, and I've now got the entire lot done. Although there are indexes available on the web (not least at WoS), having the PDFs indexed and "local" makes it very easy to find certain words or subjects. They look great on the iPad too :)
Here's a 10-page sample of some pages from issue 39:
http://bit.ly/baDI1d (temporary link, won't work forever :) )
The purpose of this post is to ask:
- Am I duplicating something that's already been done by others?
- Does the sample look OK? It's pretty good for me using OS X Preview: the OCR isn't 100% perfect, as you'd expect (example: the LM advert), and the paragraph detection occasionally gets confused, but it's still fine for searching.
- If it hasn't already been done, is this something that anyone else would find useful? Or is it just me? :)
The file size per "issue" is roughly 50% more than the equivalent JPG size - I've had to use "high quality" compression to avoid adding further compression artefacts to the already lossy JPGs.
I'm planning to OCR and "compile" Your Spectrum, Your Sinclair, Sinclair User and PCG next. If this is of interest to others, I'll happily share the results.
Post edited by KenD on
Comments
If you start on any of those drop me a PM so we don't duplicate.
---
APOCALYPSE SEGA
Uoyd Uoyd.
Accordin According to al all th the new news an and report reports tha that I'v I've rea read lately lately, th the Da Day of th the
Gam Game Consol Console is fas fast approaching approaching. Thei Their arriva arrival no now see seem inevitable inevitable, so
I though thought
I can see games consoles being marvellous devices for creative home
entertainment - one day. But then they will be interactive, worked by
computers, incorporating digitised video, sound recording, perhaps
---
... although the text from the left-hand column seems to have developed something of a stutter.
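For what it's worth, the stutter in that pasted text follows a simple pattern: each word appears twice, and the first copy is usually missing its last letter. A minimal Python sketch of a clean-up pass, assuming that pattern holds (the function name `remove_stutter` and the `min_len` parameter are made up for illustration), drops any word that is a prefix of the word that follows it:

```python
def remove_stutter(text: str, min_len: int = 2) -> str:
    """Drop a word when the next word starts with it (e.g. "th the" -> "the").

    Heuristic only: it assumes the stutter pattern seen in the pasted sample
    and will also eat legitimate pairs such as "in into".
    """
    tokens = text.split()
    kept = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        # Skip this token if it is a (possibly truncated) copy of the next one.
        if len(tok) >= min_len and nxt.startswith(tok):
            continue
        kept.append(tok)
    return " ".join(kept)


sample = ("Accordin According to al all th the new news an and report reports "
          "tha that I'v I've rea read lately lately, th the Da Day of th the")
print(remove_stutter(sample))
# -> According to all the news and reports that I've read lately, the Day of the
```

It's only a workaround for text that has already been pasted out; it does nothing about whatever the viewer is doing with the text layer in the first place.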
I'm amazed it works at all, considering all the problems which I had OCRing the pages from Your Spectrum (at "Your Spectrum" Unofficial Archive) - although I had a cheap scanner and mediocre OCR software which crashed several times a day, and had to test the re-drafted HTML on buggy old versions of Explorer and Navigator. There's some impressive software available these days.
I'd see these PDF projects as being complementary to the existing magazine archive/tribute sites (WoS, YSRnRY, CTOE, SUMO, YrUA, etc.), as they provide differing "experiences" of the material.
That's odd - here's the copy-and-paste of the same paragraph using OS X Preview:
Maybe Acrobat v5 is parsing the file differently? Do you have a more recent version you can test with, or maybe try in Foxit Reader?
I OCR'd the Crash mags years ago so that I could get the ASCII review text into my database. The text was then painstakingly copied and pasted from the PDF into various tools to clean it all up. The resulting data can be viewed on my website if you click on my sig (best approach is to go into Stats then Magazine Stats and click on the "Open" links.) So yes, it has been done previously but not with the approach you're taking! I had no interest in creating PDFs for viewing, they were simply the best tool to get the text into a database. Finished Issue 94 of Crash last night so 4 more to go...
The quality looks really good, much better than my efforts. May I ask which OCR package you have been using? I used OmniPage 14 and 15 which was apparently one of the best on the market. The results up to about issue 85 are pretty good. Then Crash started using horrendous colour schemes and crappy fonts so it made life much more difficult - primarily paragraph detection as you said :-( Spelling can also be a problem as the OCR software estimates certain words incorrectly.
I would definitely be interested, especially for the non Crash mags as it could potentially save me an absolute ton of work! You can contact me via my website or PM on here if you want to correspond.
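As a rough illustration of the kind of clean-up that pasted OCR text tends to need before it goes into a database, here is a small Python sketch. It isn't the set of tools described above (those aren't named); the function name `clean_ocr_text` and the specific rules are assumptions. It rejoins words hyphenated across line breaks, unwraps lines inside a paragraph, and keeps blank lines as paragraph breaks:

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy pasted OCR text: join hyphenated breaks, unwrap lines, squeeze spaces."""
    # "con-\nsoles" -> "consoles" (this also merges genuinely hyphenated words
    # that happen to break at the hyphen, so check the results by eye).
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # A single newline is a wrapped line; a blank line is a paragraph break.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs left over from column layout.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


page = "I can see games con-\nsoles being   marvellous\ndevices.\n\nOne day."
print(clean_ocr_text(page))
```

Spelling errors and paragraphs detected in the wrong order still need a human eye; this only deals with the mechanical line-wrapping debris.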
On-topic:
PDFs are nice and this project is certainly great, but every time I think of Sinclair magazine searching, this project comes to mind:
http://mhoogle.speccy.org/
It's a very powerful online search engine that enables users to find content from the Spanish magazine "Microhobby". They now have Firefox and IE8 search plug-ins/add-ons as well. If the work that's being done by you and others could also be used for something similar to Mhoogle I'm sure it would be very useful for the Speccy community.
I have nothing against searchable PDFs, but it's a lot better if you can search the entire magazine collection and find that article, review, ad, etc. It's also faster than opening a PDF file. You can then open just that page (JPG) or open the entire magazine in PDF format for online or offline reading.
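To make the idea concrete, a collection-wide search can be sketched as an inverted index over the OCR'd text of each page. This is only a guess at the general technique, not how Mhoogle actually works; the function names and the sample page texts below are made up for illustration:

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")


def build_index(pages):
    """Map each word to the set of (issue, page) pairs it appears on."""
    index = defaultdict(set)
    for location, text in pages.items():
        for word in WORD.findall(text.lower()):
            index[word].add(location)
    return index


def search(index, query):
    """Return the sorted locations that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Invented sample text, standing in for the OCR layer of each page.
pages = {
    ("Crash 39", 12): "Games consoles are coming",
    ("Crash 39", 13): "Lloyd Mangram on consoles",
    ("Crash 40", 5): "Lloyd writes again",
}
index = build_index(pages)
print(search(index, "lloyd consoles"))  # -> [('Crash 39', 13)]
```

Each hit identifies an issue and page number, which is enough to open just that page's JPG or jump to the page in the PDF.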
The OCR'd PDFs of Crash are now available by torrent:
Torrent file:
http://torrents.thepiratebay.org/5729639/Crash_Magazine_issues_01-98_complete_OCR_d_PDF.5729639.TPB.torrent
Magnet link: magnet:?xt=urn:btih:13e34e96290b344232f066c56a5c3eb57752a373&dn=Crash+Magazine+issues+01-98+complete+OCR%27d+PDF
IMPORTANT: The original torrent above has a slight problem in issue 1 - pages 7 and 8 are low quality and almost impossible to read. I can't change the original file at the moment as it's still being seeded: here's a single-file torrent with the fixed issue 1:
http://torrents.thepiratebay.org/5731423/Crash_Magazine_Issue_01_OCR_d_PDF.5731423.TPB.torrent
Magnet: magnet:?xt=urn:btih:612897600fe78548a003482c3102296687b4fe5c&dn=Crash+Magazine+Issue+01+OCR%27d+PDF&tr=http%3A%2F%2Ftracker.publicbt.com%2Fannounce
The files are relatively big - as discussed, I've struggled to get them down any smaller without hurting image quality. However, any decent torrent client should be able to selectively download individual files within the torrent if you're only after a couple of issues.
I'll keep this torrent seeded for as long as possible - however, my upstream bandwidth isn't great, so it would help everyone greatly if anyone who downloads also seeds for as long as they are able.
Coming soon: the Your Spectrum collection. I'll "announce" it here when they're ready to avoid cluttering up the forum with separate threads.
Cheers
Ken
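If anyone wants to check which torrent a magnet link in the post above refers to before loading it into a client, the link is just a query string, so it can be unpacked with the Python standard library. A minimal sketch (the function name `parse_magnet` is made up, and it only handles the `xt`, `dn` and `tr` fields that appear in these links):

```python
from urllib.parse import parse_qs

MAGNET_PREFIX = "magnet:?"
BTIH_PREFIX = "urn:btih:"


def parse_magnet(uri: str) -> dict:
    """Pull the info hash, display name and trackers out of a magnet link."""
    if not uri.startswith(MAGNET_PREFIX):
        raise ValueError("not a magnet link")
    # parse_qs decodes '+' to a space and %27 to an apostrophe.
    params = parse_qs(uri[len(MAGNET_PREFIX):])
    xt = params.get("xt", [""])[0]
    return {
        "info_hash": xt[len(BTIH_PREFIX):] if xt.startswith(BTIH_PREFIX) else None,
        "name": params.get("dn", [None])[0],
        "trackers": params.get("tr", []),
    }


link = ("magnet:?xt=urn:btih:612897600fe78548a003482c3102296687b4fe5c"
        "&dn=Crash+Magazine+Issue+01+OCR%27d+PDF"
        "&tr=http%3A%2F%2Ftracker.publicbt.com%2Fannounce")
info = parse_magnet(link)
print(info["name"])       # -> Crash Magazine Issue 01 OCR'd PDF
print(info["info_hash"])  # -> 612897600fe78548a003482c3102296687b4fe5c
```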
What worries me is whether the listings will be recognised as ASCII characters in that process, given how small the characters are in the listings printed in magazines like Your Sinclair, Your Spectrum, Your Computer, Microhobby and more...
Uh. Talking about Your Computer, its large page size for a good part of its history could be a burden: the resulting files would be big, and converting it into OCR'd PDFs might take a lot of time. I'd welcome it all the same.
Has anybody done 'INPUT' or 'The Home Computer Course'?
And before you say they are denied... I know... but so are Ultimate and I have them.
The Home Computer Course and The Home Computer Advanced Course were scanned and OCRed with OmniPage 17 Pro (beta version, in 2008/9).
They are in a PDF format, which has the scanned image and the OCRed text combined.
Some issues of INPUT have been done, but are denied. I'm currently testing OmniPage 18 (beta version) using the ZX Spectrum Micro-Prolog Primer (300 pages) as the test document, because its font is small and hard to read.
Apologies, I just realised I hadn't answered your question. I'm using ABBYY FineReader 10, which - given the source images - seems to be doing a good job. As you say, the later Crash issues are a nightmare - they went a bit mad trying to be "trendy" and it's like an explosion in a typeface factory.
The next collections available will be (in this order):
Your Spectrum
Your Sinclair
Personal Computer Games (simply because it was my favourite multi-format mag :) )
Sinclair User
After that, I'm open to requests - Martijn has very generously allowed me access to the scans archive, so I'm happy to OCR any and all of the magazines stored there if anyone's interested. Please let me know if there's any you'd like OCR'd and shared.
the rest is good though (Issue 1)
No worries Ken. I'm sure a lot of people are very grateful for the work you have put into the OCRing... Any chance you can post the URL for the fixes here?
Please keep seeding this, people. I'm not on the fastest connection, but will seed as much as I can (that is, once I get everything downloaded!).
:-)
http://torrents.thepiratebay.org/5731423/Crash_Magazine_Issue_01_OCR_d_PDF.5731423.TPB.torrent
Magnet: magnet:?xt=urn:btih:612897600fe78548a003482c3102296687b4fe5c&dn=Crash+Magazine+Issue+01+OCR%27d+PDF&tr=http%3A%2F%2Ftracker.publicbt.com%2Fannounce
I know, but judging by the file list for each magazine I assume there are listings that haven't been preserved yet. So there's still a lot of work to do on them.
I'm unhappy to learn that OCR is the only easy way of copying all those programs that are still left to be typed in. In the meantime, I still have a tape with a few listings already saved, waiting for a good time to finish it and fill it with as many as I can, using real computers and a big dose of patience.
I'm not sure I'll transfer all the type-ins to TZX myself once that big job is finished, but I'll find someone to do it if need be. This 90-minute tape will hold ZX Spectrum listings from Mundo Spectrum, ZX, Your Computer, Personal Computer News, Home Computer Weekly and other mags...
http://calibre-ebook.com/
Tutorial on how to use it:
http://www.trickyways.com/2010/04/how-to-convert-pdf-to-epub-format/
I managed to convert the first issue of Crash to .epub format (google/HTC), ready to plonk on my phone.
Problem:
I get garbled junk when converting pages. Anybody else wanna try?
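For anyone who does want to try, calibre also installs a command-line tool, `ebook-convert`, which takes an input file and an output file, so a whole folder of issues can be converted in one go. A minimal Python sketch, assuming `ebook-convert` is on the PATH (the helper names and the `Crash_01.pdf` file name are made up for illustration):

```python
import subprocess
from pathlib import Path


def convert_command(pdf_path: str) -> list:
    """Build the ebook-convert command for one PDF, writing the .epub beside it."""
    src = Path(pdf_path)
    return ["ebook-convert", str(src), str(src.with_suffix(".epub"))]


def convert(pdf_path: str) -> None:
    """Run the conversion; raises CalledProcessError if ebook-convert fails."""
    subprocess.run(convert_command(pdf_path), check=True)


print(convert_command("Crash_01.pdf"))
# -> ['ebook-convert', 'Crash_01.pdf', 'Crash_01.epub']
```

This only automates the conversion. As far as I know the EPUB is built from the PDF's text, so any OCR or paragraph-detection errors are likely to carry straight through, which may be where the garbled output comes from.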