Professional Development Funds Report

One of the unexpected bonuses of this residency is the professional development fund that’s earmarked for each of us. Considering the entire residency is calibrated for professional development, this extra cushion is really nice. I’ve used it to renew my SAA membership. I’ve registered for the LibTech Conference in March.

I also registered to audit a metadata class in a fit of… well, to be honest, I think in a fit of Imposter Syndrome. In the first few weeks, I felt like I had been thrown into the deep end of the pool before I’d learned to swim. It’s not MPR; this happens with any job, any position. It usually takes me about six weeks to acclimate to a new position, especially for something like this. I should have just given myself some time to get a little more comfortable.

I love my alma mater, St. Catherine University, and I have definitely benefited from their programs. But I can’t decide if this class helped me or not. On the one hand, I’m not sure I learned anything from the course. On the other hand, it renewed my own confidence in my knowledge and skillset. I don’t have any new tools from the course, but maybe that’s because I already had the skills all along. I’m confident that everyone else in the class learned a great deal! The class wasn’t the problem, but the course description didn’t match what I took away from it.

So I jumped the gun in auditing this class and spent a good chunk of my funds on it. It’s certainly not the fault of St. Kate’s, and I guess it served its purpose in that I do feel better about my own skills. The lesson is: live and learn! And give yourself six weeks before you decide to burn through over half of your professional development funds.

This post was written by Kate McManus, resident at Minnesota Public Radio.

Posted in admin | Tagged , | 1 Comment

AAPB NDSR Webinars

We have another resident-hosted webinar coming up in a few weeks!

Intro to Data Manipulation with Python CSV (Adam Lott)
Thursday, February 23
3:00 PM EST

Archivists frequently inherit huge amounts of messy and disorganized data. More often than not, this data does not fit into established organizational systems, and manually reconstructing data can be a monumental task. Luckily, there are tools for automating these processes. This webinar, open to the public, will focus on one such tool, the Python CSV module, which can handle the large-scale manipulation of CSV data. Using this module, we will extract content from a large CSV, isolating specific assets that have yet to be digitized. Basic concepts of Python programming will also be covered, so those with little experience can follow along and create code of their own.
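The webinar will use its own files, but as a rough sketch of the kind of task described (the column names below are invented for illustration, not from the webinar), the Python csv module can isolate not-yet-digitized assets like this:

```python
import csv
import io

# A small stand-in for a large inventory export (field names are invented).
raw = io.StringIO(
    "asset_id,title,digitized\n"
    "101,Morning Show 1982-04-01,yes\n"
    "102,Evening News 1985-11-12,\n"
    "103,Jazz Hour 1990-07-30,\n"
)

# Collect the assets that have yet to be digitized
# (an empty "digitized" column here means not yet done).
not_digitized = [
    row for row in csv.DictReader(raw)
    if not row["digitized"].strip()
]

for row in not_digitized:
    print(row["asset_id"], row["title"])
```

The same pattern scales to a real file by swapping `io.StringIO` for `open("export.csv")`.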

Required Downloads for Following Along:
  • Python 2.7.12 will be used; if you’d like to follow along, download it in advance
  • A text editor of your choosing (the webinar will be done using Atom)

Register for “Intro to Data Manipulation with Python CSV”

And if you missed our January webinars, you can catch up now:

“Challenges of Removable Media in Digital Preservation,” presented by Eddy Colloton, AAPB NDSR resident at Louisiana Public Broadcasting
- view the slides
- watch the recording

“Demystifying FFmpeg/FFplay,” presented by Andrew Weaver, AAPB NDSR resident at CUNY TV
- view the slides
- watch the recording


Before & after XML to PBCore in ResourceSpace

I’m interested in learning about different applications of ResourceSpace for audiovisual digital preservation and collection management and wanted to explore PBCore XML data exports. Creating PBCore XML is possible in ResourceSpace, but it is dependent on each installation’s metadata field definitions and data model. Out of the box, ResourceSpace allows mapping of fields to Dublin Core fields only.

Before: default XML file created in ResourceSpace

After: PBCore XML formatting for data fields

There was talk on an old thread on the ResourceSpace Google Group about the possibility of offering PBCore templates (sets of predefined PBCore metadata fields), since none exists currently. I did not create KBOO’s archive management database with all possible PBCore metadata fields; instead, it was important for me to let KBOO enter information in a streamlined, simplified format without all fields open for editing. I can imagine that a template would restrict users to entering data a certain way, and it may not offer the best flexibility for various organizations.

Data created in ResourceSpace is flat, so it exports to CSV in a nice, readable way, but any hierarchical relationships (i.e., PBCore asset instantiation, essence track, and child fields) need to be defined in the metadata mapping and XML export file.

I learned some important things when building off of code from the function “update_xml_metadump”:

  • “Order by” metadata field order matters. It’s easier to reuse this function if the order of metadata fields follows the PBCore element order and hierarchy.
  • Entering/storing dates formatted as YYYY-MM-DD makes things easier. In ResourceSpace, I defined the date fields as text and put in tooltip notes for users to always enter dates as YYYY-MM-DD. I also defined a value filter. A value filter allows data entered and stored as YYYY-MM-DD to display in different ways, such as MM/DD/YYYY.
  • It is important to finalize the use of all ResourceSpace tools (such as Exiftool, ffmpeg, staticsync) because this may affect use, display, and order of metadata fields.
  • It was a real challenge to figure out the structure of the data in the database and how the original function loops through it, so that I could loop appropriately to put the data into a hierarchical structure. My end result comes from “might” rather than “right,” meaning someone with more advanced knowledge of ResourceSpace could probably make the php file cleaner. I ended up creating a separate function for each special hierarchical set of data I needed: one function for the asset data, one for the physical instantiation, one for the preservation instantiation, etc. Each function is called based on an expected required data field; for example, the preservation instantiation loop will only run if a preservation filename exists.
  • Overall, if you know what you’re looking at, you’ll notice that my solution is not scalable “as is” but hopefully this information provides ideas and tips on how to get your own PBCore XML export going in ResourceSpace.
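The actual export functions are PHP inside ResourceSpace and aren’t reproduced here; purely to illustrate the flat-to-hierarchical idea in a compact way (the element names below are simplified from PBCore and the field names are invented), a sketch in Python:

```python
import xml.etree.ElementTree as ET

def build_pbcore(record):
    """Turn one flat, ResourceSpace-style row into a hierarchical,
    PBCore-like element. Names are simplified for illustration."""
    root = ET.Element("pbcoreDescriptionDocument")
    ET.SubElement(root, "pbcoreTitle").text = record.get("title", "")

    # The instantiation is only built if its required field exists,
    # mirroring the "run only if a preservation filename exists" rule.
    if record.get("preservation_filename"):
        inst = ET.SubElement(root, "pbcoreInstantiation")
        ET.SubElement(inst, "instantiationIdentifier").text = record["preservation_filename"]
        ET.SubElement(inst, "instantiationDate").text = record.get("date", "")
    return root

record = {"title": "Evening News", "preservation_filename": "news_pres.wav",
          "date": "1985-11-12"}
xml_out = ET.tostring(build_pbcore(record), encoding="unicode")
print(xml_out)
```

One function per hierarchical set, each gated on a required field, keeps the flat CSV-style data and the nested XML in sync.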

The work done:

1. Reviewed metadata field information and defined metadata mapping definitions in the config.php file

2. Created a new php file based on the ResourceSpace function “update_xml_metadump,” which exports XML data in the default template and also offers renaming tags mapped to Dublin Core tags.

3. Created a new php file to call the new PBCore XML metadump function, based on the existing /pages/tools/update_xml_metadump.php file

4. Ran the update file. XML files are exported to the /filestore directory.

This post was written by Selena Chau, resident at KBOO Community Radio.



First of all, I can’t believe we’re over halfway through this residency. It just doesn’t seem possible!

Secondly, MPR just celebrated its 50th anniversary, and I would be remiss if I didn’t acknowledge it. Happy Birthday, Minnesota Public Radio! Between Classical, News, and the Current, there is always more to explore, and your archive is off the hook. I feel really lucky to be part of the push to make it as accessible as possible. In the past few months, here in the archive we have been getting ready for the 50th Anniversary celebrations and have gotten a great deal of support to make stories accessible, especially to the newsroom. Our projects, including the NHPRC grant that I continue to dive into, are a way to highlight the rich culture and diversity here in Minnesota, and MPR’s commitment to tell the stories of this state.

Part of this residency is diving into the work culture at our host institutions, and let me just say: everyone at MPR has been wonderful. I feel welcome; I feel supported. I go to all the work meetings, coffee breaks, etc. So nothing was going to keep me from the annual Cabaret last month! I met people from all over the station and even some board members, caught up with folks I don’t see very often, went to the talent show (it is unfair how talented some of these folks are), and ate some amazing fancy mac & cheese. And then there was a rock show in the parking garage. Because of course there was!

Working here feels like a gift but in true MPR fashion, they have a present for all of us, courtesy of jeremy messersmith:

MPR celebrated all week, but the party will actually last all year and the archives team is delighted to be a part of that.

This post was written by Kate McManus, resident at Minnesota Public Radio.


Register for our upcoming webinars

We have two free, resident-hosted, open-to-the-public webinars coming up in January!

Challenges of Removable Media in Digital Preservation (Eddy Colloton)
Thursday, January 12th, 3:00 PM ET

Removable storage media could be considered the most ubiquitous of digital formats. From floppy disks to USB flash drives, these portable, inexpensive, and practical devices have been relied upon by all manner of content producers. Unfortunately, removable media is rarely designed with long-term storage in mind. Optical media is easy to scratch, flash drives can “leak” electrons, and floppy disks degrade over time. Each of these formats is unique and carries its own risks. This webinar, open to the public, will focus on floppy disks, optical media, and flash drives from a preservation perspective. The discussion will include a brief description of the way information is written and stored on such formats, before detailing solutions and technology for retrieving data from these unreliable sources.

Register for “Challenges of Removable Media in Digital Preservation”

Demystifying FFmpeg/FFplay (Andrew Weaver)
Thursday, January 26th, 3:00 PM ET

The FFmpeg/FFplay combination is a surprisingly multifaceted tool that can be used in myriad ways within A/V workflows.  This webinar will present an introduction to basic FFmpeg syntax and applications (such as basic file transcoding) before moving into examples of alternate uses.  These include perceptual hashing, OCR, visual/numerical signal analysis and filter pads.

Register for “Demystifying FFmpeg/FFplay”


Playing with Pandas: CSV metadata transformations

Being able to start this blog post with a panda gif makes me incredibly happy. However, what I’m really talking about is Pandas, the data analysis library for Python. I’m intrigued by the possibilities of programming for libraries and archives, so I’m sharing my beginner’s journey. Being a beginner, I can tell you that learning how to put together this script took a lot of reading and hands-on learning time. The non-spoiler is that through this journey I realized there is much, much more for me to learn about Python. I’d also love to know more ways in which libraries and archives are implementing programming solutions! Please let me know if you know of any.


There are a lot of cool things scripting and programming can do to save manual work and people time, especially in archives and libraries. Libraries and archives work with growing amounts of data and the challenge of keeping it accessible. There are often metadata schemas, metadata crosswalks, and data preparation work involved in moving information from one system to another. KBOO has a database with a particular structure. AAPB has a database with a particular structure. Finding a way to transform data from one structure to another programmatically would be great!

A refresher on why we do this:

Metadata interoperability goals

Adam is my go-to guy in our cohort for Python questions. And, he is giving a public webinar on Python basics, so mark your calendars:

Thursday 2/23/2017, 3:00 PM EST

Webinar: “Intro to Data Manipulation with Python CSV” (Adam Lott)

Links to the public webinars will be posted on Twitter and the AAPB blog, and promoted through various public listservs.

Our NDSR cohort keeps in touch on archiving and digital preservation topics in our Slack channel, so I would wonder aloud whether something could be done with Python, send Adam some files, and he would send back the answer, explain it genially, and field my follow-up questions. I didn’t want to take advantage of his generosity, so it was time for me to do some Python learning on my own. Along the way, though, I learned that I was really using Pandas more than standard Python programming.

For people who want to dabble without installing Python on their own computers yet, there are online Python environments you can try.


  • A basic understanding of the command line interface is required. The command line is different for Windows than Mac.
  • Even though I installed both Python 2.7 and 3.5 on my Mac, I’m not in a place to decide on one over the other. I learned that many developers are using Python 3.5, but existing tools written in 2.7 may not be compatible with it. The newest version of Pandas is compatible with multiple versions of Python.
  • Why not use Python’s csv module for data manipulation? There’s probably a parallel solution using Python CSV. However, I found answers to my questions with Pandas more easily than with Python CSV. That being said, Pandas is extremely robust and probably a more powerful tool than I need for my exploration with data manipulation.
  • There are probably multiple ways within Python to get to the same end result; I haven’t explored all the libraries and modules.

My goal was to take a csv and reformat it. After installing Python and Pandas, I searched Google for basic Python introductory exercises involving reading a csv to get used to writing the code. I read a bunch of stackoverflow messages and looked at examples others were working with. For me, using the logic of programming languages to get to an end result is like a puzzle. My awesome roommate is into jigsaw puzzles, so I think of this “figuring things out” learning time period as doing puzzles as well.

Some basics with writing a Python script:

  • Use a *.py extension. I am using TextWrangler on my Mac and Notepad++ on the PC at work (both free)
  • Import python modules at the top of your code
  • I add comments to keep track of notes to myself
  • Learn the terminology, logic, and formatting. Python may not be the most natural way for a human to think, but it is consistent and well defined. For example, a dataframe is “a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table.” Or, think of the data type string as similar to plain text.
  • Try things! Give yourself time to have the language make sense to you. If something is not right, the language will produce an error which can guide you to fixing your script.
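To make those basics concrete, here is a minimal *.py file of the kind I started with (the contents are just an illustration, not the actual script):

```python
# first_script.py - comments like this are notes to myself.
# Imports go at the top of the code.
import pandas as pd

# A dataframe is like a spreadsheet: labeled columns, potentially
# of different types (here, strings and integers).
df = pd.DataFrame({"title": ["Morning Show", "Jazz Hour"],
                   "year": [1982, 1990]})

# Quick sanity checks while learning: column types and table size.
print(df.dtypes)
print(df.shape)
```

Running it and reading any error messages is a good way to let the language start making sense.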

My goal: csv database export => data in AAPB’s Excel template format.

What do I want to do with the data?

  • Keep a smaller set of the original csv columns
  • Add new columns with known data values, per AAPB request
  • Reformat some column data. Separate Subject terms held in one column into multiple columns, and do the same for Contributor names.
  • Match the AAPB column headers as much as possible
  • Replace ‘nan’ with ” (I learned later that programming languages have specific ways of managing missing data)

What I came up with:

I ended up separating the work into three parts. Mostly this was because I wanted the subjects and contributors columns to be split into different columns based on a comma delimiter, but I didn’t want an entire row of data to be split on a comma delimiter (think about commas used in a description field).

So, I opened the original csv data file, split the subject column data, split the contributor data, and then worked on the rest of the data. If I were to develop this further, I would work on splitting each Contributor Role from the Contributor Name field. Currently this data is entered in a consistent manner, i.e.:

Frances Moore Lappe (Guest)
Patricia Welch (KBOO Host)
Jim Hartnet (KBOO Recording Engineer)
Debra Perry (KBOO Host)

Theoretically the data could be split again on the “(” character and the end parentheses could be trimmed.
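As a sketch of that splitting logic (the column names and sample row below are reconstructed for illustration, not the actual KBOO export), Pandas can expand the delimited fields like this:

```python
import pandas as pd

df = pd.DataFrame({
    "Description": ["Talk show; guests discuss food, farming, and policy"],
    "Subject": ["food, agriculture, politics"],
    "Contributor": ["Frances Moore Lappe (Guest)"],
})

# Split only the Subject column on commas, one new column per term,
# leaving commas in fields like Description untouched.
subjects = df["Subject"].str.split(",", expand=True)
subjects = subjects.apply(lambda col: col.str.strip())

# Split "Name (Role)" on the "(" character, trimming the end parenthesis.
contribs = df["Contributor"].str.extract(r"^(?P<name>[^(]*)\((?P<role>[^)]*)\)")
contribs["name"] = contribs["name"].str.strip()

# Replace missing values before export, so they don't appear as 'nan'.
subjects = subjects.fillna("")
print(subjects.iloc[0].tolist())
print(contribs.iloc[0].tolist())
```

Rows with fewer terms than the widest row get `NaN` in the extra columns, which is why the `fillna("")` step matters before exporting.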

The Python script and test csv files are on my github. Big thanks to @HenryBorchers for looking at my script, suggesting cleaner formatting to make the code easier to reuse down the road, and giving me tips along the way.

Because things seem more difficult in Windows, here are the steps I followed to install Python 2.7 and Pandas on Windows 7 at KBOO.

Download and install Python 2.7 at C:\Python27\


Then run:

C:\Python27\Scripts\pip.exe install numpy
C:\Python27\Scripts\pip.exe install python-dateutil
C:\Python27\Scripts\pip.exe install pytz
C:\Python27\Scripts\pip.exe install pandas

Here’s a blog post that gives an example that you can get hands-on with:

This post was written by Selena Chau, resident at KBOO Community Radio.


Just Ask For Help Already!

I do not know everything. Nor does anyone else.

Huge surprise, I’m sure. Lemme give you a second to pick your jaw up off the floor and recompose yourself.

Despite the fact that this foregone conclusion is painfully obvious, I certainly struggle with admitting it to people sometimes. Especially if the particular thingy that I do not totally understand is something that I wish I did know.

When this happens in social situations, I am guilty of trying to “pass” as someone who knows this information. I’m terrified that acknowledging that I don’t know how to write a script for that particular problem, or that I have not seen that really cool foreign film, or that I have never heard of this historical event, will be met with “WHAT??? You haven’t heard of XYZ?”

Instead, I will just nod my head and softly say “yeah, yeah,” while hoping no one presses me for details (they usually don’t).

Hey, by the way, when someone says they don’t know something PLEASE DON’T belittle them for it. I don’t care how much you loved that punk band in high school, not everyone has heard of them. Don’t act like someone just told you they didn’t know the sky was blue.

But I’m not talking about social situations in this post.

I want to talk about when I don’t know stuff that I wish I knew, so that I could do my job better (and faster).
Often in this situation I insist that I can just “boot-strap” this problem. I can teach myself, I can solve this without help, I am unstoppable.

Then, when I ultimately cannot pile-drive my ignorance out of existence through sheer force of will, I feel defeated. I put myself down. This often takes the form of comparing myself to someone else, who has, likely through the use of the dark arts (but certainly not through asking for help), managed to acquire the skills/knowledge I desire. I am clearly less than this person, as they have learned this thing that I cannot learn.

The author David Foster Wallace lived and worked in my hometown of Bloomington/Normal, Illinois, and I have always felt like that somehow explained how relatable I find his writing (totally plausible that he is just a good writer, and good writing just feels relatable – like a little peek inside the author’s head – but don’t ruin this for me). In an interview (that eventually became a book, that eventually became a movie), Although of Course You End Up Becoming Yourself: A Road Trip With David Foster Wallace, DFW describes his struggle with unhappiness, and his misguided approach to addressing this unhappiness, as “very American.”

When talking with David Lipsky:

“DFW: …Or, then, for two weeks I wouldn’t drink, and I’d run ten miles every morning. You know that kind of desperate, like very American, ‘I will fix this somehow, by taking radical action.’

And uh, you know, that lasted for a, that lasted for a couple of years.

DL: Like Jennifer Beals, more or less. In Flashdance, solving Pittsburgh.

DFW: And it’s weird. I think a lot of it comes out of sports training. You know? (In Schwarzenegger voice) ‘If there is a problem, I vill train myself out of it. I vill get up early. I vill vork harder.’ And that shit worked on me when I was a kid, but you know…”

That’s what I mean by “boot-strapping” a problem. Somehow the solution is just going to present itself to me because I did a bunch of pushups or something. Or the digital preservation equivalent of pushups, google searches.

I’ve found over the course of my residency at LPB that I get better answers, faster, when I just admit I don’t know how to do something and would like to know how to do that thing. One of my first experiences with this during my residency came up when I was trying to figure out how to run MediaInfo on all of the web-encoded access files we keep on our server. I had worked out a process with LPB’s Web IT Manager, and was proudly tweeting about it. Turns out, there was a faster and easier way to do the same thing, and Kieran O’Leary from the Irish Film Archive was nice enough to clue me in.

Thanks to Kieran we ended up modifying this script to run recursively through our LTO-6 tapes as well.

Like this:

for /r %%A in ("*.mp4", "*.mxf") do mediainfo --output=PBCore2 "%%A" > "%%~nxA.xml"

This command is written for Windows command line, often stored in a batch file or “.bat.” They’re new to me, so this might be a little rudimentary to a seasoned Windows user. I found this resource helpful for figuring out all the different arguments and parameters (what all those goofy % and ~ signs mean):

I’m also looking to implement the MediaConch application into the archival workflows here at LPB. MediaConch does many things, but at LPB I would like to use it as a form of automated quality control, through the policy checker feature. The policy checker essentially checks an input file against a set of rules, a policy, and tells the user whether the file conforms to that policy or not. You can set up a policy manually, for instance, “all files must be in a .mov container and have a bit rate of 5 Mbps or higher,” or you can use a file as a template to create a policy. Here at LPB, we make many files that will all be encoded the same way, so creating a policy from an existing file seemed like the way to go. If I knew how to do that.

I had read that it was possible, but couldn’t figure out how to do it. After scrolling around fruitlessly through the MediaConch GUI (there’s a CLI version as well), I decided to publicly declare my ignorance, albeit timidly.

Dave Rice, audiovisual archivist at CUNY TV, was quick to jump in with the answer to my question. Not only did I get the answer I sought pretty much instantly, but former NDSR resident, current mass digitization coordinator at NYPL, and distributor of sage advice Dinah Handel was there to set me straight on my needless sheepishness. Finally, Jérôme Martinez is the lead developer of MediaInfo and MediaConch, so now my feedback can be used to help develop later versions of the software. (In retrospect, I could have done this more directly on Github using the “Issues” tab on the software’s Github repo.)

Once one has created a MediaConch policy and runs a file against it, a report is created detailing whether the file “passed” or “failed” the policy, and why. MediaConch exports reports in a variety of formats: you can create an HTML file that presents the report in an easy-to-read format (for a human), or an XML file that can be interpreted more easily by a machine. I wanted to see if we could parse the XML reports automatically to create some sort of “red flag” for our transfer engineer when a file failed our policy. I emailed Ashley Blewer, a developer for NYPL who is on the MediaConch team and a friend of mine, and asked her if she had any ideas on how I might accomplish my “red flag” idea. See?? Asking for help! I’m learning (I hope).

Ashley clued me into a project another friend of ours was working on (isn’t it nice to have friends?) at her internship at CUNY TV. Savannah Campbell, current NYU MIAP student and CUNY TV intern, has been using XMLStarlet to parse the XML output of MediaConch reports at CUNY TV. You can see some of her code here.
If you’re looking at that code and you’re like, “um, what?” So was I! I just emailed Savannah and said basically that.
Savannah clued me into XMLStarlet, the different flags you can add to the XMLStarlet command, and how they work (I also needed some help recognizing the MediaInfo XPath).

It was easy enough to repurpose Savannah’s code into our own processes here at LPB. It’s still a little up in the air whether we will employ this method for reviewing files; we may just create HTML files and have our engineer quickly open them to review the report. But XMLStarlet is clearly a really powerful tool that I’m glad I learned about either way.
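Savannah’s actual XMLStarlet commands aren’t reproduced here, but to illustrate the same “red flag” idea in Python terms, here is a sketch against a simplified stand-in report (the real MediaConch XML schema is namespaced and more involved than this):

```python
import xml.etree.ElementTree as ET

# A simplified stand-in for a policy report; element names and the
# outcome attribute are illustrative, not the actual MediaConch schema.
report = """<report>
  <media ref="show_101.mxf"><policy outcome="pass"/></media>
  <media ref="show_102.mxf"><policy outcome="fail"/></media>
</report>"""

root = ET.fromstring(report)

# Red-flag any file whose policy outcome is not "pass".
failures = [m.get("ref") for m in root.iter("media")
            if m.find("policy").get("outcome") != "pass"]

for ref in failures:
    print("RED FLAG:", ref)
```

XMLStarlet does the equivalent selection from the command line, which makes it easy to drop into a batch workflow.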

A post from one of my cohort members on this blog was super helpful to me, too. I had used md5deep before, but only on a really small scale. It was great to be able to use Lorena’s post as a reference before I dove into unfamiliar waters. We’re using md5deep slightly differently at LPB, so when I was struggling to build off of what I had learned from Lorena, emboldened by my success asking for help earlier in the residency, I emailed Dinah Handel about it, as well as asking the rest of the cohort through our Slack channel. Dinah worked on migrating files from one generation of LTO tape to another during her residency, and used checksums to verify that the transfer was complete and successful. I’m going to be trying something similar soon, as we’re moving files off of a RAID and onto an LTO tape. Dinah suggested creating manifests of checksums from all the files before and after the transfer, and then comparing the two lists.
Thanks Dinah! That’s the plan.

But, while I was testing this process, I was running into some trouble. When you run md5deep recursively through a directory, it doesn’t always process the files in the same order. This sucks, because when you try to compare the manifests programmatically on Windows (using the “FC” command), it sees them as totally different simply because the lines are in a different order (FC compares files line by line). Thankfully, I’m not in this alone.

I was talking to my fellow AAPB NDSR residents about this, and Andrew Weaver suggested piping the results through the “sort” function. The “pipe” command takes the results from one command and pushes them through a second process. Andrew’s suggestion totally paid off. The command I’m using to create a manifest looks like this:

md5deep64 -r -b -e "INPUT DIRECTORY" | sort > "OUTPUT FILE PATH #1.txt"


Soooo, I tested this command yesterday, and it turns out it doesn’t work for LTO tapes. I think the “*” wildcard sends md5deep on a rampage, trying to create checksums for everything it can get its grubby little hands on.
LTO stands for Linear Tape-Open, and the “Linear” part means that files are written linearly. When a computer is reading an LTO tape, it can only read one file at a time. So when md5deep tries to create checksums for three files at once, it basically just spins the tape around in the drive, getting nothing done.

Instead, I’m using a modified version of the script we’re using for creating MediaInfo files, which processes one file at a time, appending each new checksum to the same text file, and then sorting the results after the fact:

for /r %%A in ("*") do md5deep64 -b -e "%%A" >> "checksum_manifest_for_RAID.txt"
sort "checksum_manifest_for_RAID.txt" /O "checksum_manifest_for_RAID.txt"
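Once the before- and after-transfer manifests exist, the comparison step Dinah suggested can be sketched in Python (the hashes and filenames below are made up for illustration). Comparing the manifests as sets also sidesteps the line-ordering problem entirely:

```python
# Compare two checksum manifests without worrying about line order,
# by treating each "HASH  filename" line as a set member.
before = """d41d8cd98f00b204e9800998ecf8427e  show_101.mxf
9e107d9d372bb6826bd81d3542a419d6  show_102.mxf"""

after = """9e107d9d372bb6826bd81d3542a419d6  show_102.mxf
d41d8cd98f00b204e9800998ecf8427e  show_101.mxf"""

def manifest_set(text):
    # Split each line into (hash, filename) on the first run of whitespace.
    return {tuple(line.split(None, 1))
            for line in text.splitlines() if line.strip()}

# Anything in "before" but not in "after" failed to transfer intact.
missing = manifest_set(before) - manifest_set(after)
print("missing or changed:", missing)
```

An empty result means every file made it across with its checksum unchanged; in a real run, `before` and `after` would be read from the two manifest text files.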

This post was written by Eddy Colloton, resident at Louisiana Public Broadcasting.
