In a recent "Ask HN: What are you working on?" thread, I mentioned I was working on OCRing a large book:
https://news.ycombinator.com/item?id=41971614
The post generated some interest, so I thought I would keep HN posted.
The book is Saint-Simon’s Memoirs -- an invaluable historical account of the French court under Louis XIV, full of wit and sharp observations, and of incredible literary value. I'm OCRing the reference edition, published between 1879 and 1930, which contains a lot of commentary and footnotes: 45 volumes, ~27,000 pages.
Here's a link to a blog post that describes the techniques used so far (the project is still ongoing):
https://blog.medusis.com/38_Adventures+in+OCR.html
You can also go straight to the results here:
https://divers.medusis.net/boislisle/pub
This web app (not optimized for mobile, sorry) solves a tricky problem: preloading images efficiently. In short, preloading the next image isn't enough, because browsers repaint an image when it is moved or scaled. They also don't paint it at all while visibility is hidden or opacity is zero, and only paint once those values change. On an average or slow machine, that paint takes a visible amount of time. But if an image is simply behind another element, it does get painted, and removing the covering element or changing the z-index does not trigger a repaint.
(Preloading is important because it lets one review results fast; having to wait 150-200 ms between images is simply discouraging.)
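For the curious, here is a minimal sketch of the "keep the next image behind the current one" idea. It is not the app's actual code: `ImageLayer`, `LayerStack`, and `next` are hypothetical names, and it assumes at least two `<img>` elements absolutely positioned at the same place and size, fully opaque and visible, so that only the z-index differs between them.

```typescript
// Structural stand-in for HTMLImageElement, so the sketch also runs outside a browser.
interface ImageLayer {
  src: string;
  style: { zIndex: string };
}

// Hypothetical helper (an assumption, not the app's code): a ring of stacked layers.
// The top layer shows the current page; the layers below already hold the next pages.
class LayerStack {
  private layers: ImageLayer[];
  private urls: string[];
  private page = 0; // index in `urls` of the page currently on top
  private top = 0;  // index in `layers` of the layer currently on top

  constructor(layers: ImageLayer[], urls: string[]) {
    this.layers = layers;
    this.urls = urls;
    // Fill every layer up front: layer i holds page i.
    layers.forEach((layer, i) => {
      if (i < urls.length) layer.src = urls[i];
    });
    this.restack();
  }

  // Highest z-index for the current layer, then decreasing for upcoming pages.
  private restack(): void {
    const n = this.layers.length;
    for (let k = 0; k < n; k++) {
      this.layers[(this.top + k) % n].style.zIndex = String(n - k);
    }
  }

  // Show the next page by reordering z-indexes only; nothing is moved,
  // scaled, hidden, or faded. Returns the index of the page now on top.
  next(): number {
    if (this.page + 1 >= this.urls.length) return this.page; // already at the end
    const n = this.layers.length;
    const recycled = this.layers[this.top];
    this.page += 1;
    this.top = (this.top + 1) % n;
    this.restack();
    // The old top layer is now at the bottom: reuse it to load a page further ahead.
    const upcoming = this.page + n - 1;
    if (upcoming < this.urls.length) recycled.src = this.urls[upcoming];
    return this.page;
  }
}
```

In a browser, the layers would be real `<img>` elements, which match `ImageLayer` structurally. The number of layers decides how many pages ahead are kept loaded and painted.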
Would love to hear feedback; happy to answer any question!