Snakes in the Archive

I got my first lessons in programming in graduate school. The language was Python, which I later discovered to be the introductory language for many up-and-coming developers. Having dived into a variety of tools since then, I completely understand why. Python emphasizes the creation of concise, human-readable code. With its numerous community-driven libraries, Python has also proven to be an extremely flexible tool for handling various tasks. A huge reason I picked Wisconsin Public Television as my top choice of residencies was that it offered the chance to gain programming experience in an archival setting. It’s far easier to learn a language when you have a defined task or end result you’re trying to achieve than when you’re learning endless bits of isolated syntax.

Throughout my coursework I was constantly told how “powerful” a language Python was, a concept that didn’t hit me until I began my residency. Here in Wisconsin we have multiple micro-processes running in the background, most of which are built in Windows cmd. To get a good idea of the differences between the two, take the example of iterating through files. At WPT, many processes begin with a variation of the following script for iterating through every file of a designated type in a directory. This particular script runs .png files through an OCR tool, Tesseract (more on that later):

cmdIteration.png

This is basically saying:

  • Find every .png file in a directory and send the file string to a new text file: dir2.txt
  • Create a numbered list of the raw file strings and send it to another text file: fsn2.txt
  • From this numbered list, find a total count and send it to another file: find2.txt
  • Create a variable which auto increments by 1 each time we iterate through a list, so we can locate each numbered file in said list
  • Then create ANOTHER text file which references the file we’re currently at in our list.
  • Finally, use this file to run tesseract, repeating the process until we’ve exhausted the list

Phew!

Looking at the script’s output sheds some more light on what’s going on. We’re generating several temp files which are constantly being updated and referenced by one another. The first time I saw this in action, I was incredibly weirded out by how clunky it seemed. Why do you need to generate so many files when Python can keep this info in memory? As you can imagine, this method can quickly fill a directory if you’re utilizing multiple loops, creating a mess of files to sort through. Even worse, each OCR we perform on an image generates its own unique text file, when we’d ideally like all our data concatenated into a single output file.

cmdOutputFiles.png

Turning to Python, we can accomplish all of the above and more in the same number of lines and without the creation of temp files. Instead of writing text files to reference image files, we can easily create a list of these files and call them with a variable (in this case “images”). Additionally, we can funnel the OCR results for every image into one cohesive output file. And the best part? Python can perform the process MUCH faster than cmd.

pythonIteration.png
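For readers who can’t see the screenshot, here is a minimal sketch of the same idea. It is not the exact script pictured above: the function name `ocr_directory`, the optional `ocr` parameter, and the `--- filename ---` separator lines are my own illustrative choices. It assumes Pillow and Pytesseract are installed and that the Tesseract binary is on your PATH.

```python
from pathlib import Path


def ocr_directory(image_dir, output_file, ocr=None):
    """OCR every .png in image_dir and write all of the text to one file.

    `ocr` is a hypothetical hook: any callable that takes a path and
    returns text. When omitted, Pytesseract is used.
    """
    if ocr is None:
        # Imported here so the loop can be tried without Tesseract installed.
        from PIL import Image
        import pytesseract

        def ocr(path):
            with Image.open(path) as img:
                return pytesseract.image_to_string(img)

    # The whole file list lives in memory; no dir2.txt or fsn2.txt needed.
    images = sorted(Path(image_dir).glob("*.png"))

    with open(output_file, "w", encoding="utf-8") as out:
        for image in images:
            out.write(f"--- {image.name} ---\n")  # marks which image the text came from
            out.write(ocr(image).strip() + "\n")

    return len(images)
```

Calling `ocr_directory("frames", "credits.txt")` would leave a single `credits.txt` behind, with no numbered lists or counter files along the way.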

This is actually an edited snippet of a script I wrote at WPT for generating metadata from video files. As we all know, metadata creation is a notoriously slow task, and I wanted a means of automating the process to some degree. I vaguely knew about Optical Character Recognition (OCR) tools, which convert image text into machine-encoded text, and I wanted to apply one to the task of capturing credit information at the beginning and end of our shows. I discovered two Python libraries (Pytesseract and Pillow) for integrating Google’s open source OCR tool, Tesseract, and decided to go from there. I wrote up a very simple .bat file that calls ffmpeg to take a snapshot of each video we want to OCR at one-second intervals. The Python script is then called from the .bat file. To emphasize again how much I love Python over cmd: if ffmpeg is creating one picture per second for four minutes of video, running OCR against these images in cmd yields 240 individual text files, which is 239 more than we want.
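The snapshot step lives in a .bat file at WPT, but the same call can be sketched in Python. Treat everything here as an assumption rather than the actual script: the helper names, the `frame_%04d.png` naming pattern, and the use of ffmpeg’s `fps` video filter to grab one frame per second.

```python
import subprocess
from pathlib import Path


def build_snapshot_command(video, out_dir, fps=1):
    """Build an ffmpeg command that saves `fps` frames per second as PNGs."""
    pattern = str(Path(out_dir) / "frame_%04d.png")  # hypothetical naming scheme
    return ["ffmpeg", "-i", str(video), "-vf", f"fps={fps}", pattern]


def expected_frames(duration_seconds, fps=1):
    """Rough number of images to expect; ffmpeg may be off by a frame."""
    return int(duration_seconds * fps)


def take_snapshots(video, out_dir, fps=1):
    """Run ffmpeg; requires the ffmpeg binary to be on the PATH."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_snapshot_command(video, out_dir, fps), check=True)
```

With these helpers, `expected_frames(4 * 60)` gives the 240 images mentioned above, all of which can then be fed through a single OCR loop.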

PytessOutput.png

The final product was far more accurate than I expected, although it’s not without its flaws (but then again, no OCR is perfect). I began writing the same script in cmd to forgo a Python/library installation on other computers, but again, the process was far less concise. Halfway through writing Wintesser (the Windows equivalent), the script was already taking twice as long to run and generating far more temp files than I wanted. Pytesser is now being used at WPT and is available on my GitHub. Feel free to check it out!

Speaking of metadata, we’re currently preparing for a large-scale migration off of our Access database. Access has proven to be a less-than-ideal platform for finding and representing data, so I’ve been writing micro Python scripts to parse out the content we want. From these we were able to discover some pretty specific facts, like how many unique titles we have and how many thousands of titles we have left to digitize. To get ready for the big crunch, we’ve been preparing crosswalk CSVs for applying the PBCore schema to our records. Getting data to look exactly the way you want it to can be a huge headache, and thinking of the scripting required to do so has been somewhat overwhelming. But of course I’m also excited, as this is exactly the kind of experience I was hoping to gain during my residency.
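To give a flavor of what those micro scripts look like, here is a small sketch run against a CSV export. The column names (`Title`, `Description`, `TapeID`, `Digitized`), the crosswalk mapping, and the three sample rows are all invented for illustration; they are not our real schema or data. Only the PBCore element names on the right-hand side of the mapping come from the standard.

```python
import csv
import io

# Hypothetical crosswalk: Access export column -> PBCore element name
CROSSWALK = {
    "Title": "pbcoreTitle",
    "Description": "pbcoreDescription",
    "TapeID": "pbcoreIdentifier",
}


def summarize_titles(rows):
    """Return (number of unique titles, unique titles with no digitized copy)."""
    all_titles = set()
    digitized = set()
    for row in rows:
        title = row["Title"].strip()
        if not title:
            continue  # skip records with no title
        all_titles.add(title)
        if row["Digitized"].strip().lower() == "yes":
            digitized.add(title)
    return len(all_titles), len(all_titles - digitized)


def apply_crosswalk(row, crosswalk=CROSSWALK):
    """Rename mapped columns to PBCore names; unmapped columns are dropped."""
    return {new: row[old] for old, new in crosswalk.items() if old in row}


# Made-up sample standing in for an Access export
sample = io.StringIO(
    "Title,Description,TapeID,Digitized\n"
    "Show A,Pilot,T001,yes\n"
    "Show A,Pilot dub,T002,no\n"
    "Show B,Special,T003,no\n"
)
rows = list(csv.DictReader(sample))
unique_count, remaining = summarize_titles(rows)
print(unique_count, remaining)  # prints: 2 1
print(apply_crosswalk(rows[0]))
```

A title counts as digitized here if any one of its records is marked “yes”, which is a design choice; a real migration would need to decide that rule per collection.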
