Using Old Streets to Make New Inroads to Data: Part 1 Cole Hudson Wayne State University Hi, my name is Cole Hudson. I’m with Wayne State University, and I’m here to briefly talk about a project we are embarking upon at Wayne State.
Motor City Madness We have lots of streets. They’ve been around for a long time. Their numbering and names have changed. This is a problem for researchers.
A book to the rescue! And nope! There is a book called The Old and new house numbers: new house numbers effective January 1st, 1921. It’s a book full of tables of streets, each with a street name at the top, a column for the old street number, and a column for the new street number. There’s only one copy. It’s at the Detroit Public Library across the street. There is a PDF of it on a random site that lots of people use, but it’s a pretty poor scan of the book. We wanted to fix this. Our project, which is just in the beginning stages, will scan this book, transcribe the charts in the book into parseable data that could be plugged into a database, and build a web app that lets people match up pre-1921 addresses with modern street addresses. Oh, and we’re going to apply for a grant to pay a student intern to help us with the data work. Here’s a quick overview of what we’ve done. We asked our buddies at the Detroit Public Library if we could borrow the book, and they were cool enough to let us scan it for our project.
What we’ve done Scanned the book. Made an interface that helps us cut out the street address tables in the book. In the process of applying for a grant.
Book So, here’s what the digitized book looks like. We crop out sections of the book using a JavaScript tool that captures the coordinates of a box we draw. It then sends those coordinates to a table in a database. We serve that section of the image using our Loris IIIF image server, which has no problem returning a region of an image from its coordinates. Finally, all of this feeds our OCR process and data checkup.
Crop and Display Here’s what the box drawing tool gets us: a table with a link to the section of the page that we’ve identified as having relevant text.
The URL https://digital.library.wayne.edu/loris/fedora:wayne:detroithousenumbers_Page_3|JP2/144,612,253,194/full/0/default.jpg
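The URL on this slide follows the IIIF Image API pattern of identifier, region (x,y,width,height), size, rotation, and quality.format. As a rough sketch of how a stored box could be turned into such a link, here is a small Python example. The `CropBox` class and `iiif_region_url` function are hypothetical names made up for illustration, not the project's actual code, and the identifier is left unencoded only to mirror the URL as shown.

```python
from dataclasses import dataclass

# Base path of the Loris server, taken from the URL shown on the slide.
LORIS_BASE = "https://digital.library.wayne.edu/loris"


@dataclass
class CropBox:
    """One drawn box, as it might be stored in the database (hypothetical schema)."""
    identifier: str  # image identifier as it appears in the slide's URL
    x: int           # left edge of the box, in pixels
    y: int           # top edge of the box, in pixels
    width: int
    height: int


def iiif_region_url(box: CropBox, base: str = LORIS_BASE) -> str:
    """Build a IIIF Image API URL for the region described by the box.

    Size "full", rotation "0", and "default.jpg" match the slide's URL.
    """
    region = f"{box.x},{box.y},{box.width},{box.height}"
    return f"{base}/{box.identifier}/{region}/full/0/default.jpg"


# The box from the slide: page 3, region 144,612,253,194.
box = CropBox("fedora:wayne:detroithousenumbers_Page_3|JP2", 144, 612, 253, 194)
url = iiif_region_url(box)
print(url)
```

Keeping only the coordinates in the database means no cropped image files have to be saved; the image server cuts out the region on request.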
Sample Image Then here’s the image itself. Now it’s ready for OCR’ing with our Abbyy Recognition Server software.
Next Steps OCR book chunks with Abbyy Recognition Server (looking for some CSV output). Hire a work study student for data cleanup. Bulk ingest OCR’d data into DB. Build a public web interface, probably in Python. Lots of promising bulk comparison ability using the pandas Python library.
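To give a feel for the kind of bulk comparison pandas allows, here is a minimal sketch of matching old addresses against a crosswalk table. Nothing has been built yet, so everything here is an assumption: the column names (`street`, `old_number`, `new_number`) are hypothetical, and the street names and numbers are invented placeholders, not data from the book.

```python
import pandas as pd

# Placeholder crosswalk: one row per old/new number pair on a street.
# These values are invented for illustration only.
crosswalk = pd.DataFrame({
    "street": ["Example Ave", "Example Ave", "Sample St"],
    "old_number": [10, 12, 5],
    "new_number": [1010, 1016, 505],
})

# Pre-1921 addresses a researcher wants to look up (also invented).
queries = pd.DataFrame({
    "street": ["Example Ave", "Sample St", "Sample St"],
    "old_number": [12, 5, 99],
})

# A left merge keeps every query row; indicator=True adds a "_merge"
# column showing which rows found a match in the crosswalk.
matched = queries.merge(
    crosswalk,
    on=["street", "old_number"],
    how="left",
    indicator=True,
)

print(matched)
```

Rows marked `left_only` in the `_merge` column had no match, which would be a handy way to flag addresses that need a human to check the transcription.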
Hope to have more next year! Acknowledgments: Alexandra Sarkozy (team lead), Graham Hukill, Jodi Coulter, Clayton Hayes. Look for an update at Code4Lib Midwest 2019! Thank you! Cole Hudson, Digital Publishing Librarian, Wayne State University Libraries @colehud