OCR'd magazine scans

edited July 2011 in Sinclair Miscellaneous
Hi all,

I've been working on OCR'ing some magazine scans over the last few weeks, and combining the images together into a searchable, indexed PDF for each issue. I've started with Crash, and I've now got the entire lot done. Although there are indexes available on the web (not least at WoS), having the PDFs indexed and "local" makes it very easy to find certain words or subjects. They look great on the iPad too :)

Here's a 10-page sample of some pages from issue 39:

http://bit.ly/baDI1d (temporary link, won't work forever :) )

The purpose of this post is to ask:
  • Am I duplicating something that's already been done by others?
  • Does the sample look OK? It's pretty good for me using OS X Preview: the OCR isn't 100% perfect, as you'd expect (example: the LM advert), and the paragraph detection occasionally gets confused, but it's still fine for searching.
  • If not, is this something that anyone else would find useful? Or is it just me? :)

The file size per "issue" is roughly 50% more than the equivalent JPG size - I've had to use "high quality" compression to avoid adding to the already lossy JPGs.

I'm planning to OCR and "compile" Your Spectrum, Your Sinclair, Sinclair User and PCG next. If this is of interest to others, I'll happily share the results.
Post edited by KenD on

Comments

  • edited July 2010
    I am working my way through OCRing the weekly mags, Home Computing Weekly, Popular Computing Weekly and (not started yet) New Computer Express.

    If you start on any of those drop me a PM so we don't duplicate.
  • edited July 2010
    I don't know how relevant this is, or practicable, but I once uploaded a scan from PCW to Evernote and it OCR'd and indexed it quite nicely - zero effort from me. The problems with this approach are obvious, but I thought I'd throw it in there just in case it sparks off an idea in someone else!
  • edited July 2010
    Could you please explain how this works? I downloaded crash39sample.pdf into Acrobat (v5.0) and tried searching for text in it, but got no matches on anything. The document appears to be a selection of graphical objects; eg. if I use the Text Select Tool to select "APOCALYPSE SEGA" on the first page I get back:
    !"#$!%&"'() '(*!)
    L8E3?%
    
    I fear that I might be missing something obvious. Does the document use features not supported in Acrobat v5.0? - although I'd normally expect to get a warning when I opened the file if that was the case.
  • edited July 2010
    if I use the Text Select Tool to select "APOCALYPSE SEGA" on the first page I get back:
    !"#$!%&"'() '(*!)
    L8E3?%
    
    Apologies - entirely my fault. For some reason, the selection of the main PDF I copied out has mangled text. I've fixed it and reuploaded the file; can you try redownloading it and see if it's any better now?
  • edited July 2010
    That's better ...
    ---
    APOCALYPSE SEGA
    Uoyd Uoyd.
    Accordin According to al all th the new news an and report reports tha that I'v I've rea read lately lately, th the Da Day of th the
    Gam Game Consol Console is fas fast approaching approaching. Thei Their arriva arrival no now see seem inevitable inevitable, so
    I though thought
    I can see games consoles being marvellous devices for creative home
    entertainment - one day. But then they will be interactive, worked by
    computers, incorporating digitised video, sound recording, perhaps
    ---
    ... although the text from the left-hand column seems to have developed something of a stutter.

    I'm amazed it works at all, considering all the problems which I had OCRing the pages from Your Spectrum (at "Your Spectrum" Unofficial Archive) - although I had a cheap scanner and mediocre OCR software which crashed several times a day, and had to test the re-drafted HTML on buggy old versions of Explorer and Navigator. There's some impressive software available these days.

    I'd see these PDF projects as being complementary to the existing magazine archive/tribute sites (WoS, YSRnRY, CTOE, SUMO, YrUA, etc.), as they provide differing "experiences" of the material.
  • edited July 2010
    ---
    APOCALYPSE SEGA
    Uoyd Uoyd.
    Accordin According to al all th the new news an and report reports tha that I'v I've rea read lately lately, th the Da Day of th the
    Gam Game Consol Console is fas fast approaching approaching. Thei Their arriva arrival no now see seem inevitable inevitable, so
    I though thought
    I can see games consoles being marvellous devices for creative home
    entertainment - one day. But then they will be interactive, worked by
    computers, incorporating digitised video, sound recording, perhaps
    ---
    ... although the text from the left-hand column seems to have developed something of a stutter.

    That's odd - here's the copy-and-paste of the same paragraph using OS X Preview:
    Uoyd. According to all the news and reports that I've read lately, the Day of the Game Console is fast approaching. Their arrival now seem inevitable, so I thought I would voice my opinion as to what is likely to happen.
    At first sight, dedicated games consoles seem to have much to offer the player; the graphics and sound effects will far surpass those achiev- able on our home computers, which were never really designed as games machines. But is this in itself enough? To answer this I think we must go back to the age-old problem of what makes a game a good one.
    

    Maybe Acrobat v5 is parsing the file differently? Do you have a more recent version you can test with, or maybe try in Foxit Reader?
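For what it's worth, the stutter in the Acrobat 5 output quoted above looks regular: most words are preceded by a clipped copy of themselves. Here is a minimal Python sketch of a clean-up pass, assuming that pattern holds throughout. The function name `collapse_stutter` is made up for illustration, and this is only a guess at a workaround, not a description of what Acrobat or Preview does internally.

```python
def collapse_stutter(text: str) -> str:
    """Drop any word that is a prefix of the word immediately after it."""
    words = text.split()
    kept = []
    for i, word in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        # "Accordin According" -> keep only "According";
        # "lately lately," -> keep only "lately,"
        if nxt is not None and nxt.startswith(word):
            continue
        kept.append(word)
    return " ".join(kept)


sample = "Thei Their arriva arrival no now see seem inevitable inevitable, so"
print(collapse_stutter(sample))
```

The heuristic is crude: it would also swallow genuine neighbours such as "a an" or "the then", so it suits search indexing better than anything meant to be read.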
  • edited July 2010
    Works fine in Reader 9 on Windows, can't say I'm surprised if Acrobat v5 is having issues, it's an ancient version that doesn't support a lot of the functionality in modern PDFs.
  • edited July 2010
    I would love to be able to view PDF versions of the old Speccy mags! So yeah, keep up the good work! :)
  • edited July 2010
    I was put off Foxit Reader 4 by the comments on CNET about all the unwanted system modifications made by the installation procedure, and I was put off Adobe Reader 9 by its size, so Acrobat 5 which I bought some time ago is the latest PDF application which I have. Not to worry, as the text search option works OK with your sample file, which is the significant feature.
  • edited July 2010
    Looks really good.
  • edited July 2010
    KenD wrote: »
    Am I duplicating something that's already been done by others?

    I OCR'd the Crash mags years ago so that I could get the ASCII review text into my database. The text was then painstakingly copied and pasted from the PDF into various tools to clean it all up. The resulting data can be viewed on my website if you click on my sig (best approach is to go into Stats then Magazine Stats and click on the "Open" links.) So yes, it has been done previously but not in the approach you're taking! I had no interest in creating PDF's for viewing, they were simply the best tool to get the text into a database. Finished Issue 94 of Crash last night so 4 more to go...
    Does the sample look OK? It's pretty good for me using OS X Preview: the OCR isn't 100% perfect, as you'd expect (example: the LM advert), and the paragraph detection occasionally gets confused, but it's still fine for searching.

    The quality looks really good, much better than my efforts. May I ask which OCR package you have been using? I used OmniPage 14 and 15 which was apparently one of the best on the market. The results up to about issue 85 are pretty good. Then Crash started using horrendous colour schemes and crappy fonts so it made life much more difficult - primarily paragraph detection as you said :-( Spelling can also be a problem as the OCR software estimates certain words incorrectly.
    If not, is this something that anyone else would find useful? Or is it just me? :)

    I would definitely be interested, especially for the non Crash mags as it could potentially save me an absolute ton of work! You can contact me via my website or PM on here if you want to correspond.
  • edited July 2010
    About the latest Acrobat Reader: If you install it the "normal way" it does add a few extra things like Adobe Air, but it's possible to download just the reader as an offline installer and it works ok.

    On-topic:

    PDFs are nice and this project is certainly great, but every time I think of Sinclair magazine searching, this project comes to mind:

    http://mhoogle.speccy.org/

    It's a very powerful online search engine that enables users to find content from the Spanish magazine "Microhobby". They now have Firefox and IE8 search plug-ins/add-ons as well. If the work that's being done by you and others could also be used for something similar to Mhoogle I'm sure it would be very useful for the Speccy community.

    I have nothing against searchable PDFs, but it's a lot better if you can search the entire magazine collection and find that article, review, ad, etc. It's also faster than opening a PDF file. You can then open just that page (JPG) or open the entire magazine in PDF format for online or offline reading.
  • edited July 2010
    Hi all,

    The OCR'd PDFs of Crash are now available by torrent:

    Torrent file:
    http://torrents.thepiratebay.org/5729639/Crash_Magazine_issues_01-98_complete_OCR_d_PDF.5729639.TPB.torrent
    Magnet link: magnet:?xt=urn:btih:13e34e96290b344232f066c56a5c3eb57752a373&dn=Crash+Magazine+issues+01-98+complete+OCR%27d+PDF

    IMPORTANT: The original torrent above has a slight problem in issue 1 - pages 7 and 8 are low quality and almost impossible to read. I can't change the original file at the moment as it's still being seeded: here's a single-file torrent with the fixed issue 1:
    http://torrents.thepiratebay.org/5731423/Crash_Magazine_Issue_01_OCR_d_PDF.5731423.TPB.torrent
    Magnet: magnet:?xt=urn:btih:612897600fe78548a003482c3102296687b4fe5c&dn=Crash+Magazine+Issue+01+OCR%27d+PDF&tr=http%3A%2F%2Ftracker.publicbt.com%2Fannounce

    The files are relatively big - as discussed, I've struggled to get them down any smaller without hurting image quality. However, any decent torrent client should be able to selectively download individual files within the torrent if you're only after a couple of issues.

    I'll keep this torrent seeded for as long as possible - however, my upstream bandwidth isn't great, so it would help everyone greatly if anyone who downloads also seeds for as long as they are able.

    Coming soon: the Your Spectrum collection. I'll "announce" it here when they're ready to avoid cluttering up the forum with separate threads.

    Cheers

    Ken
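As an aside for anyone unsure what the magnet links in this post contain: the `xt` field carries the BitTorrent info hash and `dn` a human-readable display name. A small sketch using only the Python standard library follows; the helper name `parse_magnet` is hypothetical, and it handles only the fields seen in the links above.

```python
from urllib.parse import parse_qs


def parse_magnet(uri: str) -> dict:
    """Pull the info hash, display name and trackers out of a magnet URI."""
    if not uri.startswith("magnet:?"):
        raise ValueError("not a magnet URI")
    # Everything after the first "?" is an ordinary query string.
    fields = parse_qs(uri.partition("?")[2])
    xt = fields.get("xt", [""])[0]
    prefix = "urn:btih:"
    return {
        "info_hash": xt[len(prefix):] if xt.startswith(prefix) else None,
        "name": fields.get("dn", [None])[0],
        "trackers": fields.get("tr", []),
    }


link = ("magnet:?xt=urn:btih:13e34e96290b344232f066c56a5c3eb57752a373"
        "&dn=Crash+Magazine+issues+01-98+complete+OCR%27d+PDF")
print(parse_magnet(link))
```

A link with no `tr` entry, like the first one here, relies on the client finding peers without a tracker.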
  • edited July 2010
    No seeds mate!
  • edited July 2010
    Sorry, showing my lack of knowledge about torrents - I've updated the links above, can you try again?
  • edited July 2010
    OK, got 1 seed, downloading now. It's going to take a while at this speed, but I'll seed for as long again when complete. Cheers
  • edited July 2010
    Please be patient - I only get around 80k/s upload, and there are currently 8 people downloading. However, once it gets going the magic of BitTorrent should ensure that the "pieces" get distributed amongst peers and my upstream should be less of an issue.
  • edited July 2010
    Nice job. I'm just downloading it. The next magazines I'd like to see OCR'd are those with lots of BASIC listings, so anyone can copy them into BASin and convert them easily into TAP, Z80 or TZX, instead of getting bored typing them in or blaming their scanner for the typical errors of the OCR process. (16B0 FOD n-0 TO 25: READ ~: P0KE 57B0B,~+n: NEXT n)

    What worries me is whether the listings will be recognised as ASCII characters in that process, given the small size of the characters in the listings printed in magazines like Your Sinclair, Your Spectrum, Your Computer, Microhobby and more...

    Uh. Talking about Your Computer, the large page size of the magazine for a good part of its run could be a burden: big resulting files and maybe a lot of time spent converting it into OCR'd PDFs. I'd welcome it all the same.
  • edited July 2010
    You don't need to type in any listings from Your Spectrum or Your Sinclair, as they're already all available typed in at TTFn, as are many others - and getting usable program listings from OCR is a pipe dream except in the most optimum of conditions. Your example and worry highlight the problem exactly.
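To make the problem concrete, here is a rough Python sketch of a checker that flags the most obvious OCR damage in a listing: upper-case tokens that mix letters and digits, such as a line number with a letter in it or a keyword with a zero for an O. The function name `suspicious_tokens` and the rule itself are assumptions for illustration only. It is a heuristic, not a Sinclair BASIC parser, and it cannot repair anything.

```python
import re

# An upper-case/digit token that contains at least one digit AND at least
# one upper-case letter, e.g. "16B0" or "P0KE". Lower-case variable names
# such as "n" or "a1" are deliberately ignored.
MIXED_TOKEN = re.compile(r"\b(?=\w*\d)(?=\w*[A-Z])[A-Z0-9]+\b")


def suspicious_tokens(line: str) -> list:
    """Return tokens in an OCR'd BASIC line that look misread."""
    return MIXED_TOKEN.findall(line)


ocr_line = "16B0 FOD n-0 TO 25: READ ~: P0KE 57B0B,~+n: NEXT n"
print(suspicious_tokens(ocr_line))
```

Note what it misses in the example quoted above: "FOD" (presumably FOR), the "~" standing in for a variable, and "n-0" for "n=0" all pass unflagged, which rather supports the point that OCR'd listings still need a human to check them.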
  • edited July 2010
    Thanks Downloading now

    has anybody done 'INPUT' or 'The Home Computer Course'

    And before you say they are denied...I Know....but so are Ultimate and I have them.
  • edited August 2010
    ASH-II wrote: »
    Thanks Downloading now

    has anybody done 'INPUT' or 'The Home Computer Course'

    And before you say they are denied...I Know....but so are Ultimate and I have them.

    The Home Computer Course and The Home Computer Advanced Course were scanned and OCRed with OmniPage 17 Pro (beta version, in 2008/9).
    They are in a PDF format, which has the scanned image and the OCRed text combined.

    Some issues of INPUT have been done, but are denied. I'm currently testing OmniPage 18 (beta version) using the ZX Spectrum Micro-Prolog Primer (300 pages) as the test document, because its font is small and hard to read.
  • edited August 2010
    Vampyre wrote: »

    The quality looks really good, much better than my efforts. May I ask which OCR package you have been using? I used OmniPage 14 and 15 which was apparently one of the best on the market. The results up to about issue 85 are pretty good. Then Crash started using horrendous colour schemes and crappy fonts so it made life much more difficult - primarily paragraph detection as you said :-( Spelling can also be a problem as the OCR software estimates certain words incorrectly.

    Apologies, I just realised I hadn't answered your question. I'm using ABBYY FineReader 10, which - given the source images - seems to be doing a good job. As you say, the later Crash issues are a nightmare - they went a bit mad trying to be "trendy" and it's like an explosion in a typeface factory.

    The next collections available will be (in this order):

    Your Spectrum
    Your Sinclair
    Personal Computer Games (simply because it was my favourite multi-format mag :) )
    Sinclair User

    After that, I'm open to requests - Martijn has very generously allowed me access to the scans archive, so I'm happy to OCR any and all of the magazines stored there if anyone's interested. Please let me know if there's any you'd like OCR'd and shared.
  • edited August 2010
    Just checked through issue 1 of Crash and pages 7 and 8 are really low quality. I can't read the Deathchase review at all

    the rest is good though (Issue 1)
  • edited August 2010
    Nightmare - for some reason those pages in the original file have their dimensions set wrongly. I can't change the torrent without cutting off anyone who's currently downloading it, so I've seeded a torrent with a fixed copy of issue 1 and updated the original post with the link to the file. Sorry for the confusion everyone, I knew it was going too smoothly :)
  • edited August 2010
    KenD wrote: »
    Nightmare - for some reason those pages in the original file have their dimensions set wrongly. I can't change the torrent without cutting off anyone who's currently downloading it, so I've seeded a torrent with a fixed copy of issue 1 and updated the original post with the link to the file. Sorry for the confusion everyone, I knew it was going too smoothly :)

    No worries Ken. I'm sure a lot of people are very grateful for the work you have put into the OCRing... Any chance you can print here the URL for the fixes?

    Please keep seeding this people. I'm not on the fastest connection, but will seed as much as I can (that is once I get everything downloaded!).

    :-)
  • edited August 2010
    No problem - here's the torrent for the fixed Issue 1:
    http://torrents.thepiratebay.org/5731423/Crash_Magazine_Issue_01_OCR_d_PDF.5731423.TPB.torrent
    Magnet: magnet:?xt=urn:btih:612897600fe78548a003482c3102296687b4fe5c&dn=Crash+Magazine+Issue+01+OCR%27d+PDF&tr=http%3A%2F%2Ftracker.publicbt.com%2Fannounce
  • edited August 2010
    Thanks Ken :-)
  • edited August 2010
    You don't need to type in any listings from Your Spectrum or Your Sinclair, as they're already all available typed in at TTFn, as are many others - and getting usable program listings from OCR is a pipe dream except in the most optimum of conditions. Your example and worry highlight the problem exactly.

    I know, but judging by each magazine's file list I assume there are listings that haven't been preserved yet. So there's still a lot of work to do with them.

    I'm unhappy to learn that OCR, the only easy way of copying all those programs, is no help for the ones still left to be typed in. In the meantime, I still have a tape with a few listings already saved, waiting for a good time to finish it and fill it with as many as I can, using real computers and a big dose of patience.

    I'm not sure I'll transfer all the type-ins to TZX myself once that big job is finished, but I'll find someone to do it if need be. This 90-minute tape will hold ZX Spectrum listings from Mundo Spectrum, ZX, Your Computer, Personal Computer News, Home Computing Weekly and other mags...
  • edited August 2010
    BTW, I already have 86.7% of the torrent downloaded. It's going so well that it will probably be ready tomorrow to save all the files onto two standard DVDs.
  • edited August 2010
    Following Ken's wonderful work on the OCR scanning of the Crash issues, I've been pondering a utility to convert OCR'd, text-searchable PDFs into ebooks and came across this little gem. It's an open source PDF-to-ebook converter called 'calibre', which you can find here:

    http://calibre-ebook.com/

    Tutorial how to use it:

    http://www.trickyways.com/2010/04/how-to-convert-pdf-to-epub-format/

    I managed to convert the first issue of Crash to .epub format (google/HTC) ready to plonk on my phone.

    Problem:

    I get garbled junk when converting pages. Anybody else wanna try?
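For anyone who wants to try the same conversion without the GUI, calibre also installs a command-line tool called `ebook-convert`, which takes an input file and an output file. Below is a hedged Python sketch that wraps it. The file name `crash01.pdf` and both function names are placeholders, and it assumes calibre is installed with `ebook-convert` on the PATH.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(pdf_path: str) -> List[str]:
    """Build the ebook-convert call that writes an .epub next to the PDF."""
    epub_path = Path(pdf_path).with_suffix(".epub")
    return ["ebook-convert", pdf_path, str(epub_path)]


def convert(pdf_path: str) -> None:
    """Run the conversion, failing clearly if calibre is not installed."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; is calibre installed?")
    subprocess.run(build_convert_command(pdf_path), check=True)


# Placeholder file name; prints the command rather than running it.
print(build_convert_command("crash01.pdf"))
```

The conversion works from the hidden OCR text layer rather than the page images, so it is probably the multi-column magazine layout and the OCR mistakes that produce the garbled output. Batch-converting the whole collection is unlikely to fix that on its own.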