Save the Data!

On Saturday, February 25th, I met an archivist friend for lunch and then we went to our first DataRescue session, hosted by the University of Minnesota. The event went from Friday afternoon to Saturday evening.

I was going to go on Friday, but switched my plans midweek based on the weather forecast. On Friday, the metro area was supposed to see up to a foot of snow, which I was happy about because the week before we had seen temperatures in the 60s. Instead, we got a few flakes and temperatures in the mid-thirties. I shouldn’t have to tell you how unusual that is for Minnesota in February. The promise of working at the climate change data rescue helped me channel some of my frustration. Some things, like science, shouldn’t be political.

The goal, of course, was to capture federal information that should remain free and available, and back it up on DataRefuge.Org. Our focus was primarily climate data, identified by professors and instructors at the University as being essential to their teaching and research.

When we got there, the room was nearly full of information professionals from all over the Twin Cities, working at every stage of the process: 1. Nomination, 2. Seeding, 3. Researchers, 4. Checkers & Baggers, 5. Harvesters. Here is the workflow on GitHub that we were working from.

We signed in and were directed to what I will call “Station 4.” We walked back to the table and, lo and behold, Valerie Collins from the D.C. NDSR cohort was there! Valerie and I have met before, and I was delighted to see her. She brought us up to speed, and my friend and I got to work. We checked datasets and then uploaded them to the Data Refuge site, creating records and filling in the metadata. It was archiving, it was cataloging, it was resistance, and it was fun. There were snacks, and cool, passionate people, and there was a real sense of being in a room where people knew how to use their skills to make sure knowledge about the reality of climate change stayed in the hands of the scientists and the people who need the information.

I hope they host another one.

This post was written by Kate McManus, resident at Minnesota Public Radio.


Moving Beyond the Allegory of the Lone Digital Archivist (& my day of Windows scripting at KBOO)

The “lone arranger” was a term I learned in my library science degree program, and I accepted it. I visualized hard-working, problem-solving solo archivists in small-staff situations, challenged with organizing, preserving, and providing access to the growing volumes of historically and culturally relevant materials that could be used by researchers. As much as the archives profession is about facilitating a deep relationship between researchers and records, the work to make archival records accessible to researchers needed to be completed first. The lone arranger term described professionals, myself among them, working alone and within known limitations to meet this charge. This reality has encouraged archivists without a team to band together and be creative about forming networks of professional support. The Society of American Archivists (SAA) has organized support for lone arrangers since 1999, and now has a full-fledged Roundtable where professionals can meet and discuss their challenges. Similarly, support for the lone digital archivist was the topic of a presentation I heard at the recent 2017 Code4Lib conference at UCLA by Elvia Arroyo-Ramirez, Kelly Bolding, and Faith Charlton of Princeton University.

Managing the digital record is a challenge that requires more attention, knowledge sharing, and training in the profession. At Code4Lib, digital archivists talked about how archivists on their teams did not know how to process born-digital works; this was a challenge, and, more than that, unacceptable in this day and age. It was pointed out that our degree programs didn’t offer the same support for digital archiving as they did for processing archival manuscripts and other ephemera. The NDSR program aims to close the gap on digital archiving and preservation, and the SAA has a Digital Archives Specialist credential program, but technology training in libraries and archives shouldn’t be limited to the few who are motivated to seek it out. Many jobs for archivists will be in small or medium-sized organizations, and we argued that processing born-digital works should always be considered part of archival responsibilities. Again, this was a conversation among proponents of digital archives work, and I recognize that it excludes many other thoughts and perspectives. The discussion would be more fruitful if it included individuals who feel there is a block to their learning and development in processing born-digital records, and if it focused on how to break down those barriers.

Code4Lib sessions (http://bit.ly/d-team-values, http://scottwhyoung.com/talks/participatory-design-code4lib-2017/) reinforced values of the library and archives profession, namely advocacy and empowering users. No matter how specialized an archival process is, digital or not, there is always a need to be able to talk about the work with people who know very little about archiving, whether they are stakeholders, potential funders, community members, or new team members. Advocacy is usually associated with external relations, but it is also an approach we can take when introducing colleagues to technology skills within our own library and archives teams. Many sessions at Code4Lib were highly technical, yet the conversation always circled back to helping users and staying in touch with humanity. When I say highly technical, I do not mean “scary.” Another session reminded us that technology can often cause anxiety, and can be misinterpreted as something that will solve all problems. When we talk to people, we should let them know what technology can do and what it can’t. The reality is that technology knowledge is attainable and shouldn’t be feared. It cannot solve every work challenge, but a new skill set and an understanding of technology can help us reach some solutions. It is a holistic process as well: the framing of challenges is a human-defined model, and finding ways to meet those challenges will also be human-driven. People will always brainstorm their best solutions with the tools and knowledge available to them, so let’s add digital archiving and preservation tools and knowledge to the mix.

And the Windows scripting part?

I was originally going to write about my checksum validation process on Windows, without Python, and then I went to Code4Lib, which was inspiring and thought-provoking. In the distributed cohort model, I am a lone archivist if you frame the picture around my host organization alone. But I primarily draw my knowledge from my awesome cohort members and from the growing professional network I’ve connected with on Twitter (Who knew? Not me.), so in that expanded view I am not a lone archivist at all. When I was challenged to validate a large number of checksums without the ability to install new programs on my work computer, I asked my colleagues for help. Below is my abridged process, showing how I was helped through an unfamiliar problem to a workable solution using not only my own ideas, but ideas from my colleagues. Or scroll all the way down for “Just the solution.”

KBOO recently received files back from a vendor who digitized some of our open-reel content. Hooray! Like any good post-digitization work, ours had to start with verification of the files, and that meant validating checksum hash values. Follow me on my journey through my day of PowerShell and the Windows command line.

Our deliverables included a preservation wav, a mezzanine wav, and an mp3 access file, plus related jpgs of the items, an xml file, and an md5 sidecar for each audio file. The audio filenames followed our filenaming convention, which was designated in advance, and the files related to a physical item were grouped in a folder named with the same convention.

md5deep can verify file hashes by comparing two reports created with the program, but I had to make some changes to the format of the checksum data before the two could be compared.

Can md5deep run recursively through folders? Yes, and it can recursively compare everything in a directory (and subdirectories) against a manifest.
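
A sketch of that manifest mode, using a hypothetical manifest.txt of known hashes (if I’m reading the flags right, -x lists files whose hashes are not in the manifest, while -m lists only the matches):

md5deep -r -x manifest.txt .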

Can md5deep selectively run on just .wav files? Not that I know of, so I’ll ask some people.

Twitter & Slack inquiry: Hey, do you have a batch process that runs on designated files recursively?

Response: You’d have to employ some additional software or commands like [some unix example]

@Nkrabben: Windows or Unix? How about Checksumthing?

Me: Windows, and I can’t install new programs, including Python at the moment

@private_zero: Hey! I’ve done something similar, but not on Windows. Try this PowerShell script that combines all sidecar files into one text file. And by the way, remember to sort the lines in the file so they match the sort order of the other file you’re comparing it to.

Me: Awesome! When I make adjustments for my particular situation, it works like a charm. Can PowerShell scripts be given a clickable icon so they run as easily as Windows batch files, in my work setup where I can’t install new things?

Answer: Don’t know… [Update: create a file with extension .ps1 and call that file from a .bat file]

@kieranjol: Hey! If you run this md5deep command it should run just on wav files.

Me: Hm, tried it but doesn’t seem like md5deep is set up to run with that combination of Windows parameters.

@private_zero: I tried running a command, seems like md5deep works recursively but not picking out just wav files. Additional filter needed.

My afternoon of PowerShell and the command line: research on FC (file compare), sort, and ways to remove characters in text files (the vendor’s sidecar files had an asterisk in front of every file name, which needed to be removed to match the output of an md5deep report).

??? moments:

It turns out that PowerShell’s UTF8 output option writes UTF-8 with a BOM, as compared to the ASCII/“plain” UTF output of the md5deep text files. That mismatch needed to be resolved before comparing the files.

The md5deep output that I created listed file names only, not paths, but that left space characters at the end of the lines! Those needed to be stripped out before comparing the files.
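
In hindsight, both gotchas could probably have been handled in one PowerShell pass. Here is a sketch with hypothetical file names, rather than the mixed route I actually took below:

# trim trailing spaces from each line and force plain ASCII output (no BOM)
(Get-Content .\md5.txt) | ForEach-Object { $_.TrimEnd() } | Set-Content -Encoding ASCII .\md5_clean.txt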

I tried to perform the same function as the PowerShell script in the Windows command line but kept hitting walls, so I went ahead with my solution of mixing PowerShell and command-line commands.

After I got six individual commands to run, I combined the PowerShell ones and the Windows command-line ones. Here is my process for validating checksums:

Just the solution:

It’s messy, yes, and there are better and cleaner ways to do this! I recently learned about this shell scripting guide that advocates for versioning, code reviews, continuous integration, static code analysis, and testing of shell scripts: https://dev.to/thiht/shell-scripts-matter

Create one big list of md5 hashes from the vendor’s individual sidecar files using PowerShell

–only include the preservation wav md5 sidecar files, look for them recursively through the directory structure, and combine their contents into vendormd5.txt. Also build a sorted list of the preservation wav files themselves, saved as mediapreserve_20170302.txt, for md5deep to hash in the next step. Finally, remove the asterisk (vendor formatting) so that the text matches the format of an md5deep output file; after the asterisks are removed, the vendor md5 hash values will be in the vendormd5edited.txt file.

open powershell

nav to new temp folder with vendor files

dir .\* -exclude *_mezz.wav.md5,*.xml,*.mp3,*.mp3.md5,*.wav,*_mezz.wav,*.jpg,*.txt,*.bat -rec | gc | out-file -Encoding ASCII .\vendormd5.txt

Get-ChildItem -Recurse A:\mediapreserve_20170302 -Exclude *_mezz.wav.md5,*.xml,*.mp3,*.mp3.md5,*.wav.md5,*_mezz.wav,*.jpg,*.bat,*.txt | where { !$_.PSisContainer } | Sort-Object name | Select FullName | ft -hidetableheaders | Out-File -Encoding "UTF8" A:\mediapreserve_20170302\mediapreserve_20170302.txt

(Get-Content A:\mediapreserve_20170302\vendormd5.txt) | ForEach-Object { $_ -replace '\*' } | set-content -encoding ascii A:\mediapreserve_20170302\vendormd5edited.txt

Create my md5 hashes to compare to the vendor’s

–run md5deep from inside the temp folder, using the Windows command line, on the txt list of wav files (-e shows a progress estimate, -b outputs bare filenames without paths, and -f reads the list of files to hash from mediapreserve_20170302.txt). Hashing multiple wav files will take a long time.

“A:\md5deep-4.3\md5deep.exe” -ebf mediapreserve_20170302.txt >> md5.txt

Within my new md5 value list text file, sort the lines by filename and trim the trailing space characters to match the format of the vendor checksum file. Then, compare my text file of hashes with the file containing the vendor’s hashes.

–I put in pauses to make sure the previous commands completed, and so I could follow the order of commands.

run combined-commands.bat batch file (which includes):

sort md5.txt /+34 /o md5sorteddata.txt
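rem note: /+34 starts each comparison at character 34, just past the 32-character md5 hash and its separator, so the lines sort by filename rather than by hash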

timeout /t 2 /nobreak

@echo off > md5sorteddata_1.txt & setLocal enableDELAYedeXpansioN

for /f "tokens=1* delims=]" %%a in ('find /N /V "" ^<md5sorteddata.txt') do (
    SET "str=%%b"
    for /l %%i in (1,1,100) do if "!str:~-1!"==" " set "str=!str:~0,-1!"
    >>md5sorteddata_1.txt <nul SET /P "l=!str!"
    >>md5sorteddata_1.txt echo.
)

timeout /t 5 /nobreak

fc /c A:\mediapreserve_20170302\vendormd5edited.txt A:\mediapreserve_20170302\md5sorteddata_1.txt

pause

The two files are identical, so all the data within them matches, and therefore all the checksums match. We’ve verified the integrity and authenticity of the files the vendor transferred to our server.

This post was written by Selena Chau, resident at KBOO Community Radio.


Louisiana Public Broadcasting Digital Preservation Plan

I have completed my NDSR AAPB residency at Louisiana Public Broadcasting (LPB)! While most of my cohort will continue chugging along for another few months, I sadly have to finish up a bit early. But I’m leaving for an exciting opportunity in the conservation department at the Denver Art Museum. I’m feeling good about the time I have spent here at LPB and the work that we have accomplished. I may even chime in on the ol’ AAPB NDSR blog again down the line, once I’ve had time to lean into some post-residency navel-gazing.

Please find my primary deliverable to LPB, the LPB Digital Preservation Plan, below. The objective of this document was to record the station’s current digital preservation procedures and to make recommendations for improvement. The plan discusses the benefits of creating MediaInfo and MediaConch reports, as well as fixity checks, and how to apply those tools in a production environment. It also describes the benefits of using uncompressed and lossless codecs for the preservation of analog video, the methodology and strategy behind planning for LTO tape generation migrations, the importance of collecting production documentation in audiovisual archiving, and much more. While the policies and procedures described in the plan are specific to LPB, I think there is certainly information to be gleaned from it whether you are working in a public broadcasting archive or not.

I want to offer my thanks to everyone at LPB for being so welcoming to a stranger from the north, and for helping me with so many aspects of my project. I want to offer a special thanks to my host mentor, Leslie Bourgeois, who, in spite of having a very difficult year due to the historic flooding that occurred in Baton Rouge in August of 2016, has been supportive and encouraging of my work here at LPB. I would also be remiss if I didn’t thank Rebecca Fraimow, the NDSR program coordinator, for constantly being there for me and the rest of the cohort over the last seven months. And of course a very special thanks to my NDSR cohort for letting me ask them questions, vent to them about my struggles, and share a barrage of my dumb jokes. I wish you all the best. AAPB NDSR 4 lyfe!

Download the Louisiana Public Broadcasting Digital Preservation Plan

This post was written by Eddy Colloton, resident at Louisiana Public Broadcasting.


AAPB NDSR March Webinars

We have two resident-hosted webinars coming up in March:

Through the Trapdoor: Metadata & Disambiguation in Fanfiction, a case study (Kate McManus)
Thursday, March 9th
3:00 PM EST

Within the last twenty years, fanfiction writing has transfigured from disparate communities, like webrings within GeoCities, into websites like AO3, providing a more intimate and interactive space for the wide world of web communities to gather and create. Recently, professionals who work to preserve and archive pop culture have begun to open dialogues with fan writers in the hope that these underground communities will share their labors of love, but doing so creates its own problems for both creators and archivists.

Rated: T
Characters: Gen
Tags: meta, archives, angst, rarepair

Disclaimer: Just playing in the sandbox. DO NOT SUE, I OWN NOTHING Please read and review!

Register for “Through the Trapdoor: Metadata and Disambiguation in Fanfiction”

ResourceSpace for Audiovisual Archiving (Selena Chau)
Thursday, March 23rd
3:00 PM EST

Selena Chau (AAPB NDSR Resident, KBOO Community Radio) and Nicole Martin (Senior Manager of Archives and Digital Systems, Human Rights Watch) share their assessment and uses of ResourceSpace as an open-source tool for metadata and asset management within digital preservation workflows.

Register for “ResourceSpace for Audiovisual Archiving”

And if you missed Adam Lott’s February webinar, “Intro to Data Manipulation with Python CSV,” you can catch up now by watching the recording or viewing the slides.


Advocating for Archives in a Production Environment

One of the new experiences I’ve had here at LPB, among the many that have accompanied moving to Baton Rouge, Louisiana (I have way more opinions about gumbo now), has been working for an archive in a production environment.

The primary purpose of a TV station is, believe it or not, to broadcast television programs, not necessarily to preserve and provide access to its archive. That said, it can be a two-birds/one-stone situation. However, given that archiving is not the first thing that springs to mind when one mentions a public broadcasting station, prioritizing archival functions in a production environment requires some justification.

Now, this isn’t all that different from a traditional archive. Whether one works at NARA or a local library, new preservation initiatives or procedures aren’t just rubber-stamped because the archivist remembered to ask nicely (although I think saying “please” and “thank you” can’t hurt). Archivists have to advocate for archiving.

After spending several months at LPB learning the archival workflow, I had some recommendations I wanted to make to (hopefully) improve digital preservation practices. My host mentor, Leslie Bourgeois, suggested we call a meeting with several department heads and discuss the feasibility of my recommendations.

I think it’s important to acknowledge that I had several advantages coming into this meeting. LPB had to apply to host me in my residency, and a part of the stated purpose of my position at LPB is to make digital preservation recommendations. In other words, there was an understanding that I would be suggesting some changes. Leslie also helped pave the way for the meeting by reminding the participants ahead of time that this was why they had asked me to come work at LPB (thanks, Leslie). The choice to apply for and host a National Digital Stewardship Residency also acknowledges the value of archiving and information science. I can imagine a situation in which pointing towards “best practices” in the field would have no weight at all. While LPB is certainly more concerned with how other public television stations are archiving their collections than with how a university library preserves its collection, there is interest at LPB in the best practices of audiovisual preservation. This was a big help for me, because I didn’t need to justify the significance of complying with standards or archival best practices.

So, what did I say, and how did I say it?

I decided not to use a PowerPoint. I wanted people to feel comfortable jumping in with questions or sharing their expertise, and I thought looking at a screen in a dark room could inhibit that. I did use a handout, though (sorry, trees). You can find that handout here.

[GIF: cat vs. printer, via GIPHY]

I began by outlining our current practices and archival processes. Everyone in the meeting was more or less familiar with these, but since most folks are very familiar with their own part of the workflow, and less aware of the other elements, it seemed to serve as a helpful reminder. It was also a good opportunity to demonstrate that I knew the lay of the land and was making these recommendations from an informed perspective.

After a quick review of our existing digital preservation procedures, I pointed out some of the risks inherent in our current practices. I tried to boil the risk down to a few key concepts, ideas that would be relatable even to someone unfamiliar with preservation. Below are some of those key concepts, and how they applied to our workflow.

Simplicity as Risk Aversion

The more things you do, the greater the chance that one of them will go wrong. Our current workflow involves moving media files across many different devices that send signals to many different places for different purposes. I suggested that the simpler the workflow, the better.

Provenance as Troubleshooting

To encourage more fixity checks, I pointed out our current inability to diagnose the point of file corruption or loss of data integrity in the event that a file does not perform successfully. If an error is encountered, there is no way to trace when the error took place, or to tell whether a previous version of the file contains the same error. If we had checksums from several points along the way, we could trace an error back to its origin.
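
For a sense of what that could look like in practice (a sketch only, with hypothetical file and log names rather than an actual LPB procedure), PowerShell’s built-in Get-FileHash cmdlet (PowerShell 4.0 and later) can record a checksum at each hand-off point and append it to a running log:

# hypothetical example: log an MD5 checksum for a file at this point in the workflow
Get-FileHash -Algorithm MD5 .\program_master.mxf |
    Select-Object @{n='Date';e={Get-Date -Format s}}, Path, Hash |
    Export-Csv -Append -NoTypeInformation .\fixity_log.csv

Comparing the hashes logged at different points would make it possible to pinpoint where a file changed.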

Planning for Updates as Planning for Obsolescence

Television stations are accustomed to the growing pains of technological evolution. Whether it is the shift from analog to digital or from standard definition to high definition, changes in the media landscape are obligatory in television, but not necessarily easy. Given this reality, pointing towards the obsolescence of a particular storage medium, or the need for more digital storage in the future, has more traction in a room full of professionals who have had to manage these types of changes in the past. The station’s current dependency on the XDCAM format, and the risks involved in that, was not a surprise to anyone.

Once I had explained the risks inherent in the station’s current practices, I outlined a series of potential solutions to those problems.

Lossless or Uncompressed File Formats

Best practices in the field of video preservation call for analog video to be migrated to digital formats through uncompressed or lossless digital video encoding. A fair question could be: “So what?” We currently migrate analog video to a high-quality video file format, IMX50. (For the digital video nerds, IMX50 is a proprietary Sony codec, but it’s basically MPEG-2 encoding, standardized in SMPTE 365M as the D-10 codec.) Why bother with uncompressed or lossless when IMX50 looks really good? A couple of reasons:

  1. You’re still losing information. Analog closed captions are encoded on “line 21” of the video signal, and compressing that signal can destroy the captions! Michael Grant of NYU Libraries gave a cool talk about this at the last AMIA conference.
  2. IMX50 looks good now, but… As we hear more about the introduction of 4K broadcasting, it’s painfully clear that standard definition video is going to look worse and worse, by comparison, over time.
  3. Storage will always be cheaper. In the meeting, it was nice to have two IT professionals in the room to back me up on this assertion. LTO-6 tape is already relatively inexpensive in terms of digital storage. When LPB migrates to LTO-8, ideally following the release of LTO-9, the station will be able to store five times as much data on one tape, likely for the same price or even less. Besides, mathematically lossless file formats like FFV1 allow for the same quality as an uncompressed file while taking up about half the space (see the sketch of an FFV1 transcode after this list). The effort to standardize FFV1 through the IETF’s CELLAR working group and other standards organizations and communities makes FFV1 even more appealing.
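
For the curious, here is a minimal sketch of the kind of FFmpeg command commonly used for an FFV1/Matroska transcode. The file names are hypothetical, this is not LPB’s workflow, and any real migration would be tested against local specifications first:

ffmpeg -i capture_uncompressed.mov -c:v ffv1 -level 3 -g 1 -slicecrc 1 -slices 16 -c:a copy preservation_master.mkv

The -level 3 flag selects FFV1 version 3, -g 1 keeps every frame independent, and -slicecrc 1 embeds per-slice checksums, which dovetails nicely with the fixity tracing discussed above.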

File-based submission to the Archive

We currently have a lot of material submitted to the archive on XDCAM disks, and we also receive documentation of productions (contracts, releases, footage logs, etc.) on paper. I argued that we should be moving towards file-based submission of material to the archive. Former NDSR resident Dinah Handel summarizes some of the advantages CUNY TV found in moving to a file-based workflow in the 2nd half of this blog post (the 1st half, by Mary Kidd, is great too). My primary argument for moving towards file-based submission centers on the ease of automating archival functions in a file-based environment. When files are locked on an XDCAM disk, an archivist cannot document the video file’s technical metadata by creating a MediaInfo report, or validate the file’s fixity, without an expensive and clunky interface between an XDCAM deck and a computer (which our archive does not have). Even with access to an XDCAM deck, the process could not be automated.
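
To give a sense of that automation once material is file-based, here is a rough sketch, with hypothetical paths and assuming the MediaInfo command-line tool is installed, that writes a technical metadata report alongside every video file in a submission folder:

# hypothetical example: generate a MediaInfo XML report for every .mxf file in a folder tree
Get-ChildItem -Recurse -Filter *.mxf D:\archive_submissions |
    ForEach-Object { & mediainfo --Output=XML $_.FullName | Out-File -Encoding UTF8 ($_.FullName + '.mediainfo.xml') }

The same loop could just as easily generate checksums, which is exactly the kind of work that cannot be automated while files are locked on XDCAM disks.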

I suggested that LPB-produced documentaries, like the recently released Deeply Rooted, be submitted to the archive on a hard drive. This submission would include the editing software’s project file, graphics produced for the documentary, and all accompanying documentation, such as scripts, supers lists, and contracts. I pointed to the advantages of digital information, and to how keeping digital records in their original form preserves connections, such as those between raw footage and a finished program, and saves time. The same is true for a word processing document over a printed-out version of a script.

So, how did it go?

Pretty well, I think! There seemed to be a lot of support for some of the ideas I put forward. We had a follow-up meeting a few weeks later to discuss which solutions we might explore, and how we might get there. I used a similar formula to prepare and present my ideas for that meeting. You can find the handout I made for that meeting here.

I would be surprised if LPB ends up following every single one of my recommendations, but steps have already been taken to implement some of them! It is certainly a move in the right direction, and I think the meetings helped raise the profile of some of our archival processes and procedures at the station. In hindsight, just getting everybody together in the same room to discuss what could be better about the workflow was one of the biggest advantages of the meeting. From there, things progressed pretty organically through teamwork. My suggestion of finding a higher-quality file format for the analog workflow inspired the team to start discussing how to collect a higher-quality version of our recently produced programs as well! Don’t underestimate your colleagues’ interest in what you do, and their ability to help you accomplish your goals.

[GIF: high five, via GIPHY]

This post was written by Eddy Colloton, resident at Louisiana Public Broadcasting.


Professional Development Funds Report

One of the unexpected bonuses of this residency is the professional development fund that’s earmarked for each of us. Considering the entire residency is calibrated for professional development, this extra cushion is really nice. I’ve used it to renew my SAA membership and to register for the LibTech Conference in March.

I also registered to audit a metadata class in a fit of… well, to be honest, I think in a fit of Imposter Syndrome. In the first few weeks, I felt like I had been thrown into the deep end of the pool before I had learned to swim. It wasn’t MPR; this happens with any job, any position. It usually takes me about six weeks to acclimate to a new position, especially for something like this. I should have just given myself some time to get a little more comfortable.

I love my alma mater, St. Catherine University, and I have definitely benefited from its programs. But I can’t decide if this class helped me or not. On the one hand, I’m not sure I learned anything from the course. On the other hand, it renewed my confidence in my knowledge and skill set. I don’t have any new tools from the course, but maybe that’s because I already had the skills all along. I’m confident that everyone else in the class learned a great deal! The class wasn’t the problem; the course description just didn’t match what I took away from it.

So I jumped the gun in auditing this class and spending a good chunk of my funds on it. It’s certainly not the fault of St. Kate’s, but I guess it served its purpose in that I do feel better about my own skills. The lesson is: live and learn! And give yourself six weeks before you decide to burn through over half of your professional development funds.

This post was written by Kate McManus, resident at Minnesota Public Radio.


AAPB NDSR Webinars

We have another resident-hosted webinar coming up in a few weeks!

Intro to Data Manipulation with Python CSV (Adam Lott)
Thursday, February 23
3:00 PM EST

Archivists frequently inherit huge amounts of messy and disorganized data. More often than not, this data does not fit into established organizational systems, and manually reconstructing data can be a monumental task. Luckily, there are tools for automating these processes. This webinar, open to the public, will focus on one such tool, the Python CSV module, which can handle the large-scale manipulation of CSV data. Using this module, we will extract content from a large CSV, isolating specific assets that have yet to be digitized. Basic concepts of Python programming will also be covered, so those with little experience can follow along and create code of their own.

Required Downloads for Following Along:
Python 2.7.12 will be used. If you’d like to follow along, it can be downloaded via https://www.python.org/downloads/
A text editor of your choosing (webinar will be done using Atom)
csv_data.csv

Register for “Intro to Data Manipulation with Python CSV”

And if you missed our January webinars, you can catch up now:

“Challenges of Removable Media in Digital Preservation,” presented by Eddy Colloton, AAPB NDSR resident at Louisiana Public Broadcasting
–          view the slides
–          watch the recording

“Demystifying FFmpeg/FFplay,” presented by Andrew Weaver, AAPB NDSR resident at CUNY TV
–          view the slides
–          watch the recording
