Trying New Things: Meditations on NDSR from the Symposium in DC

Way back in August, when I was already feeling overwhelmed by this residency, I decided that yes, I was going to join the planning committee for the NDSR Symposium. I don’t know why I thought taking on more obligations was a good idea, when I was already feeling like I might collapse under the weight of the residency. But I always have a high opinion of what Future Kate is capable of.

As far as planning a Symposium goes, this was a good one to start with. Everyone on the planning committee, staff, and advisory council was dedicated to making it a space where stakeholders across residencies could talk about the benefits of the program, the successes, and, of course, the missteps. We all really believe in these residencies, and the conversations we had over the two days of the Symposium were excellent.

On the planning side, I advocated for a group Slack for the planning committee. I tracked the very generous travel grant applications. We set up Gmail accounts for the applications (both travel and submissions for panels and presentations), as well as Google Sheets to track how much money we had in the pool. I also came up with some templated emails for the planning committee to use. If you got one, hey! I hope it was exactly like every other confirmation email you’ve ever gotten. Like I said, I had never done anything like this before, so everything was new to me. I also helped with registration and greeting, so that I could meet as many people as possible. Hi!

There were two upsets we didn’t anticipate. After initially agreeing to make opening remarks, Dr. Carla Hayden was unable to attend. A replacement speaker was quickly arranged, but I know I was really disappointed. She is the Librarian of Congress, however, so I understand there are demands on her time.

There was also a small issue with the chair delivery being sent to the wrong building. The staff at the Library of Congress soon got it sorted out, and skipping a break allowed us to get back on schedule. Of all the things that could go wrong, I’m glad it was something that wasn’t a disaster. Many of the early speakers said they enjoyed the extra mingling and networking.

Despite these minor hitches, the conversations during the Symposium were brilliant across the board. I enjoyed hearing about the diversity of projects. The challenges some speakers raised, about sustainability, communication, and effective mentorship, are challenges my cohort has faced. And even defining what exactly qualifies a project as part of an NDSR program challenged my thinking.

Of course, the AAPB crew presented our work from the past few months, which is always interesting. A goal of mine has been to become a better public speaker. The jury’s probably still out on that one, but it’s become less terrifying. The smaller group probably had something to do with that. We’ll see if I feel the same way when the Society of American Archivists’ annual conference rolls around in July.

This trip to Washington also offered me another opportunity, something I had never done before: requesting a meeting with my congressperson, Jason Lewis. I had met two of my hometown’s mayors, but this was the first time I had made an appointment with an elected official, and certainly the first time with anyone at the national level. With all the uncertainty around continued IMLS funding, I felt it would be worth a short but passionate conversation. During the last congressional recess, he had visited one of the local libraries here, so I thought the timing would be good.

It turned out to be not so great timing, as he wasn’t able to meet with me while I was there. I’m trying not to think it was intentional, and the Legislative Assistant who met with me took a polite interest.

Honestly, my anxiety didn’t even kick in; I wasn’t nervous at all. It helped that I came in with the full weight of my master’s degree, my ten years in library service, my position on my local library’s board, and, of course, my experience as a direct recipient of an IMLS grant focused on our digital heritage. I talked to the assistant for 20 minutes about the benefits of libraries and how IMLS supports them. I gave him articles and a printout of the American Library Association’s Code of Ethics. I got his card, and he’s already heard from me. I hope he’s being honest when he says he’s passing my concerns on to the congressperson.

I feel this trip to DC really benefited my understanding of the NDSR program as a whole. I tried a lot of new things, and I got a lot out of the conversations I had. I hope that I gave as much as I took with me.

As a resident who is wrapping up her project in the next two months, I only hope that I can continue to be connected to this wonderful community, and to advocate for our shared digital heritage.

This post was written by Kate McManus, resident at Minnesota Public Radio.


Resident Webinar Recording Roundup

With Lorena Ramírez-López’s presentation last month on “Whats, Whys, and How Tos of Web Archiving,” our AAPB NDSR webinar series has now concluded.  However, if you want to catch up on what you missed, I’m happy to share that recordings of all our resident-hosted webinars, along with slides and other resources, are now available for open viewing!

Challenges of Removable Media in Digital Preservation – Eddy Colloton
Removable storage media could be considered the most ubiquitous of digital formats. From floppy disks to USB flash drives, these portable, inexpensive and practical devices have been relied upon by all manner of content producers. Unfortunately, removable media is rarely designed with long-term storage in mind. Optical media is easy to scratch, flash drives can “leak” electrons, and floppy disks degrade over time. Each of these formats is unique and carries its own risks. This webinar, open to the public, will focus on floppy disks, optical media, and flash drives from a preservation perspective. The discussion will include a brief description of the way information is written and stored on such formats, before detailing solutions and technology for retrieving data from these unreliable sources.
(Originally aired January 12, 2017)
Recorded Webinar
Slides

Demystifying FFmpeg/FFprobe – Andrew Weaver
The FFmpeg/FFplay combination is a surprisingly multifaceted tool that can be used in myriad ways within A/V workflows.  This webinar will present an introduction to basic FFmpeg syntax and applications (such as basic file transcoding) before moving into examples of alternate uses.  These include perceptual hashing, OCR, visual/numerical signal analysis and filter pads.
(Originally aired January 26, 2017)
Recorded Webinar
Slides

Intro to Data Manipulation with Python CSV – Adam Lott
Archivists frequently inherit huge amounts of messy and disorganized data. More often than not, this data does not fit into established organizational systems, and manually reconstructing data can be a monumental task. Luckily, there are tools for automating these processes. This webinar, open to the public, will focus on one such tool, the Python CSV module, which can handle the large-scale manipulation of CSV data. Using this module, we will extract content from a large CSV, isolating specific assets that have yet to be digitized. Basic concepts of Python programming will also be covered, so those with little experience can follow along and create code of their own.
(Originally aired February 23, 2017)
Recorded Webinar
Slides

Through the Trapdoor: Metadata and Disambiguation in Fanfiction – Kate McManus
Within the last twenty years, fanfiction writing has transfigured from disparate communities, like webrings within GeoCities, into websites like AO3, which provide a more intimate and interactive space for the wide world of web communities to gather and create. Recently, professionals who work to preserve and archive pop culture have begun to open dialogues with fan writers in the hope that these underground communities will share their labors of love, but doing so creates its own problems for both creators and archivists.
(Originally aired March 9, 2017)
Rated: T
Characters: Gen
Tags: meta, archives, angst, rarepair
Disclaimer: Just playing in the sandbox. DO NOT SUE, I OWN NOTHING. Please read and review!
Recorded Webinar
Slides

ResourceSpace for Audiovisual Archiving – Selena Chau
Selena Chau (AAPB NDSR Resident, KBOO Community Radio) and Nicole Martin (Senior Manager of Archives and Digital Systems, Human Rights Watch) share their assessment and uses of ResourceSpace as an open-source tool for metadata and asset management within digital preservation workflows.
(Originally aired March 23, 2017)
Recorded Webinar
Slides
Demo videos: 1, 2, 3, 4

Whats, Whys, and How Tos of Web Archiving – Lorena Ramírez-López
Basic introduction of:
– what web archiving is 🤔
– why we should web archive ¯\_(ツ)_/¯
– how to web archive 🤓
in a very DIY/PDA style that’ll include practical methods of making your website more “archive-able” – or not! – and a live demo of webrecorder.io using our very own NDSR website as an example!
(Originally aired April 6, 2017)
Recorded Webinar
Slides
Transcript


Adventures in Perceptual Hashing

Background

One of the primary goals of my project at CUNY TV is to prototype a system of perceptual hashing that can be integrated into our archival workflows. Perceptual hashing is a method of identifying similar content using automated analysis; the goal is to eliminate the (often impossible) necessity of having a person look at every item one by one to make comparisons. Perceptual hashes function much like standard checksums, except that instead of comparing hashes to establish exact matches between files at the bit level, they establish similarity of content as it would be perceived by a viewer or listener.

There are many examples of perceptual hashing in use that you might already have encountered. If you have ever used Shazam to identify a song, then you have used perceptual hashing! Likewise, if you have done a reverse Google image search, that was also facilitated through perceptual hashing. And when you encounter a video on YouTube that has had its audio track muted due to a copyright violation, the match was probably detected via perceptual hashing.

One of the key differences between normal checksum comparisons and perceptual hashing is that perceptual hashes attempt to enable linking of original items with derivative or modified versions. For example, if you have a high-quality FLAC file of a song and make a copy transcoded to MP3, you will not be able to generate matching checksums from the two files. Likewise, if you added a watermark to a digital video, it would be impossible to compare the files using checksums. In both of these examples, even though the content is almost identical in a perceptual sense, as data it is completely different.
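A quick way to see this for yourself, assuming FFmpeg and the coreutils md5sum command are installed (the file name here is just a placeholder):

# transcode a FLAC file to MP3, then checksum both files
ffmpeg -i song.flac -codec:a libmp3lame -q:a 2 song.mp3
md5sum song.flac song.mp3
# the two hashes will not match, even though the audio sounds nearly identical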

Perceptual hashing methods seek to accommodate these types of changes by applying various transformations to the content before generating the actual hash, also known as a ‘fingerprint’. For example, the audio fingerprinting library Chromaprint converts all inputs to a sample rate of 11025 Hz before generating representations of the content as musical notes via frequency analysis, which are then used to create the final fingerprint. [1] The MPEG-7 fingerprint standard that I have been experimenting with for my project does something analogous, generating global fingerprints per frame, in both original and square aspect ratios, as well as several sub-fingerprints. [2] This allows comparisons to be made that are resistant to differences caused by factors such as lossy codecs and cropping.
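If you would like to see an audio fingerprint for yourself, Chromaprint ships with a small command line tool called fpcalc; a minimal example (the file name is a placeholder) looks like:

# generate a Chromaprint fingerprint from the first 120 seconds of a file
fpcalc -length 120 song.mp3
# prints DURATION= and FINGERPRINT= lines; perceptually similar recordings yield similar fingerprints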

Hands on Example

As of the version 3.3 release, the powerful open source tool FFmpeg has the ability to generate and compare MPEG-7 video fingerprints. What this means is that if you have the most recent version of FFmpeg, you are already capable of conducting perceptual hash comparisons! If you don’t have FFmpeg and are interested in installing it, there are excellent instructions for Apple, Linux, and Windows users available at Reto Kromer’s webpage.
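To check whether your build is new enough, you can look for the signature filter directly; this is harmless to run and needs no input files:

ffmpeg -version
ffmpeg -filters 2>/dev/null | grep signature
# if a line describing the "signature" filter appears, your build can generate MPEG-7 fingerprints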

For this example I used the following video from the University of Washington Libraries’ Internet Archive page as a source.

https://archive.org/embed/NarrowsclipH.264ForVideoPodcasting

From this source, I created a GIF out of a small excerpt and uploaded it to GitHub.

GIF

Since FFmpeg supports URLs as inputs, it is possible to test out the filter without downloading the sample files! Just copy the following command into your terminal and try running it! (Sometimes this might time out; in that case just wait a bit and try again).

Sample Fingerprint Command

ffmpeg -i https://ia601302.us.archive.org/25/items/NarrowsclipH.264ForVideoPodcasting/Narrowsclip-h.264.ogv -i https://raw.githubusercontent.com/privatezero/Blog-Materials/master/Tacoma.gif -filter_complex signature=detectmode=full:nb_inputs=2 -f null -

Generated Output

[Parsed_signature_0 @ 0x7feef8c34bc0] matching of video 0 at 80.747414 and 1 at 0.000000, 48 frames matching
[Parsed_signature_0 @ 0x7feef8c34bc0] whole video matching

What these results show is that even though the GIF was of lower quality and different dimensions, the FFmpeg signature filter was able to detect that it matched content in the original video starting around 80 seconds in!

For some further breakdown of how this command works (and lots of other FFmpeg commands), see the example at ffmprovisr.
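In the meantime, here is my own quick gloss of what each piece of the command is doing (same command as above, just annotated):

# -i (first input): the original video from the Internet Archive
# -i (second input): the GIF excerpt made from it
# signature=detectmode=full:nb_inputs=2 : fingerprint both inputs and compare them against each other
# -f null - : discard the decoded output; only the log lines reporting matches matter
ffmpeg -i https://ia601302.us.archive.org/25/items/NarrowsclipH.264ForVideoPodcasting/Narrowsclip-h.264.ogv -i https://raw.githubusercontent.com/privatezero/Blog-Materials/master/Tacoma.gif -filter_complex signature=detectmode=full:nb_inputs=2 -f null -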

Implementing Perceptual Hashing at CUNY TV

At CUNY TV we are interested in perceptual hashing to help identify redundant material, as well as to establish connections between shows that use identical source content. For example, by implementing system-wide perceptual hashing, it would be possible to take footage from a particular shoot and programmatically search for every production it had been used in. This would obviously be MUCH faster than viewing videos one by one looking for similar scenes.

As our goal is collection-wide searches, three elements had to be added to our existing preservation structures: a method for generating fingerprints on file ingest, a database for storing them, and a way to compare files against that database. Fortunately, one of these was already essentially in place. As other parts of my residency called for building a database to store other forms of preservation metadata, I had an existing database that I could modify with an additional table for perceptual hashes. The MPEG-7 system of hash comparisons uses a three-tiered approach to establish accurate links, starting with a ‘rough fingerprint’ and then moving on to more granular levels. For simplicity and speed (and because we don’t need accuracy down to fractions of a second), I decided to store only the components of the ‘rough fingerprint’ in the database. Each ‘rough fingerprint’ represents the information contained in a forty-five frame segment. Full fingerprints are stored as sidecar files in our archival packages.
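As a rough illustration of what such a table can hold (a hypothetical SQLite sketch with made-up column names, not our actual schema):

# one row per forty-five frame 'rough fingerprint' segment
sqlite3 fingerprints.db "CREATE TABLE IF NOT EXISTS rough_signatures (
  object_id TEXT,         -- identifier of the archival object the fingerprint came from
  start_frame INTEGER,    -- first frame of the forty-five frame segment
  rough_fingerprint TEXT  -- the rough fingerprint components for that segment
);"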

As digital preservation at CUNY TV revolves around a set of microservices, I was able to write a script for fingerprint generation and database insertion that can be run on individual files as well as inserted as needed into preservation actions (such as AIP generation). This script is available here at the mediamicroservices repository on GitHub. Likewise, my script for generating a hash from an input and comparing it against values stored in the database can be found here.
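The heart of the generation step is simply FFmpeg’s signature filter writing an XML sidecar; a stripped-down sketch of that one piece (not the actual microservice script, and the file names are placeholders) looks like:

# write a full MPEG-7 signature for input.mov to an XML sidecar file
ffmpeg -i input.mov -map 0:v -filter:v signature=format=xml:filename=input_signature.xml -f null -
# the rough fingerprint values can then be parsed out of the XML and inserted into the database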

I am now engaged in creating and ingesting fingerprints into the database for as many files as possible. As of this writing, there are around 1,700 videos represented in the database by over 1.7 million individual fingerprints. The image below is an example of current search output.

Breweryguy

During a search, fingerprints are generated and then compared against the database, with matches being reported in 500-frame chunks. Output is displayed both in the terminal and superimposed on a preview window. In this example, a 100-second segment of the input returned matches from three files, with the search (including fingerprint generation) taking about 25 seconds.

The most accurate results so far have come from footage of a more distinct nature (such as the example above). False positives occur more frequently with less visually unique sources, such as interviews conducted in front of a uniform black background. While there are still a few quirks in the system, testing so far has been successful at identifying footage reused across multiple broadcasts, with a manageably low number of false results. Overall, I am very optimistic about the potential enhancements that perceptual hashing can add to CUNY TV’s archival workflows!

This post was written by Andrew Weaver, resident at CUNY TV.


Reporting from the PNW: Online Northwest Conference

Hello, it’s your Pacific Standard Time/Pacific Northwest resident here, reporting on happenings in the Cascadia region. I was fortunate to learn about the Online Northwest conference (someone who attended my KBOO Edit-a-Thon event told me about it!), although at the last minute. This small, one-day conference was well organized, with the majority of the presenters and participants from Oregon and Washington, but also some from northern California, New York, and Virginia. It was held on Cesar Chavez’s birthday, which is a recognized holiday for KBOO. And P.S.: some of our recently digitized open reel audio made it into KBOO’s Cesar Chavez Day Special Programming, so check it out!

I learned that Online Northwest (#ONW17) had previously been held in Corvallis, took a hiatus last year, and was held at Portland State University (PSU) for the first time this year. There were four tracks: User Experience/Understanding Users, Design, Engagement/Impact, and Working with Data. Presenter slides are here: http://pdxscholar.library.pdx.edu/onlinenorthwest/2017/

First, I was fully engaged by Safiya Umoja Noble’s keynote, “Social Justice in LIS: Finding the Imperative to Act,” which highlighted the ongoing need for reevaluation and action in our work as information professionals to ensure that information is treated equitably in online catalogs, cataloging procedures, search algorithms, interface design, and personal interactions.

I attended Kelley McGrath’s (University of Oregon) session on metadata, where she talked about the potential of linked data and the semantic web, or the brave new metadata world. She talked about algorithms and machines that attempt to self-learn and present users with new information drawn from available linked data sources. Of course, that information can be fickle at times, since computers are largely unknowing of the data they are fed. The end-of-session questions almost reached the point where people were putting two and two together: one can say that algorithms are neutral, but we can’t ignore that humans choose and develop algorithms. Rarely do individuals critique implemented algorithms for human bias. This reminded me of Andreas Orphanides’s keynote at Code4Lib: a person initially frames the question that is being asked, and that influences the answer, or what is being looked at. Framing to focus on something specific requires excluding many other things. My takeaway from this session, on the heels of Dr. Noble’s keynote, was to be cautious and consider potential bias when using algorithms.

In The Design of Library Things: Creating a Cohesive Brand Identity for Your Library, Stacy Taylor and Maureen Rust (Central Washington University) talked about their brand identity process from start to finish, including their challenges, and shared their awesome resources. Important takeaways included considering the tone of communication and signage (what is the emotional state of readers when they see your signs?), styling materials consistently, defining the scope of the project, clearly separating standards (requirements) from best practices, and continuously making the case for, and being firm about, the needs and benefits of branding.

Open Oregon Resources: From Google Form to Interactive Web Apps, by Amy Hofer (Linn-Benton Community College) and Tamara Marnell (Central Oregon Community College), was fantastic! They walked through their use of Google Forms/Sheets and WordPress to share the open educational resources in use by professors at Oregon community colleges, along with ideas for how others could implement something similar, but better, given that technology advances and old versions age, and given the time and resources an implemented tool needs.

Chris Petersen from Oregon State University (OSU) presented on Creating a System for the Online Delivery of Oral History Content. He described ways in which he reused existing OSU resources (Kaltura, XML, TEI, a colleague knowledgeable in XSL/XSLT), common items in the “poor archivist’s” toolbox. I also thought his reference to the Cascadia Subduction Zone’s impending massive destruction, when arguing for the need to keep digital materials in multiple locations, was uniquely PNW. Like the Open Oregon Resources talk, this session was modeled as “here’s what I did, it clearly can be done better, and here’s what we’ll be doing or moving toward in the future.” Chris mentioned they’ll be moving to OHMS in the future; although it is less flexible than their current setup, it is more viable considering his workload. He also mentioned Express Scribe (inexpensive) as an in-house transcription tool and XSL-FO for formatting XML to PDF.

My favorite lightning talk was Electronic Marginalia (Lorena O’English, Washington State University), a case for web annotation. Tools I want to check out are hypothes.is (open source) and Genius Web Annotator.

This post was written by Selena Chau, resident at KBOO Community Radio.


AAPB NDSR April Webinar

Our last AAPB NDSR webinar is coming up this week!

Whats, Whys, and How Tos of Web Archiving
(Lorena Ramírez-López)
Thursday, April 6
3:00 PM EST

Basic introduction of:
– what web archiving is 🤔
– why we should web archive ¯\_(ツ)_/¯
– how to web archive 🤓

in a very DIY/PDA style that’ll include practical methods of making your website more “archive-able” – or not! – and a live demo of webrecorder.io using our very own NDSR website as an example: https://ndsr.americanarchive.org/!

Register for “Whats, Whys, and How Tos of Web Archiving”

And if you missed our March webinars from Kate McManus and Selena Chau, you can catch up now:

“Through the Trapdoor: Metadata & Disambiguation in Fanfiction, a case study,” presented by Kate McManus, AAPB NDSR resident at Minnesota Public Radio
–          view the slides
–          watch the recording

“ResourceSpace for Audiovisual Archiving,” presented by Selena Chau, AAPB NDSR resident at KBOO Community Radio, and Nicole Martin, Senior Manager of Archives and Digital Systems, Human Rights Watch
–          view the slides
–          watch the recording
–          watch the demo videos: 1, 2, 3, 4


Library Technology Conference

I had juuuuuuuuuuust enough in my professional development funds to attend my very first Library Technology Conference, at Macalester College in St. Paul. My local mentor, Jason Roy, recommended the conference to me. Though I could only afford to attend one day, it was such a good conference, and I did my best to get as much out of it as I could! Judging by the boost in my Twitter followers, I did a really good job.

The morning began with caffeine, sugar, and keynote speaker Lauren Di Monte, who talked about data, data collection, and the Internet of Things. I was hanging out on Twitter (#LTC2017) during the keynote, and a lot of the chatter was about the ethics of harvesting that information. Should we archive someone’s information from their smartphone, especially private information like their health tracking apps? This topic was probably too big for Di Monte to address fully in a keynote, but it set the tone for conversations all day. (Which is what you want from a keynote!)

The first breakout session I went to was “My Resources Are Digital But My Staff is Still Analog!” with Brian McCann, which was a really engaging discussion on training staff and patrons on emerging tech tools. The second I went to was slightly less helpful to me personally (I was really torn between this and another session, but that’s the way the conference crumbles). It was “Launching the Learning Commons: Digital Learning and Leadership Initiative” with Mandy Bellm. She did amazing work in her school district, revamping the library for 21st-century use, but I would have liked to hear more about her process and her experience writing the grants.

My third breakout session, “Digital Texts and Learning: Overcoming Barriers to Effective Use” with Brad Belbas and Dave Collins, turned out to be rather emotional. Digital text formats in theory have a ton of accessibility (ADA) capabilities, but ultimately, tech companies and digital publishers don’t seem to have any incentive to make their digital texts readable to as many people as possible! My tweets from this session were still being passed around a day later.

The final session of my LibTechConf experience was the cherry on top of a great day: “Giving a Voice to Audio Collections” with Christopher Doll and Joseph Letriz, who talked about their experience bringing oral histories into their online collections. Their project highlights the 100-year history of African American students at the University of Dubuque in Iowa. They covered the tools they used, such as the Oral History Metadata Synchronizer (OHMS), to get the project on its feet as quickly as possible. It was fascinating! Make sure you check it out: http://digitalud-dev.dbq.edu/omeka/exhibits/show/aheadofthecurve/introduction

Overall, I would absolutely recommend this conference. I met librarians and digital content folks from all over the country. There were tons of good presentations at all different levels and entry points. The conversations were fun and productive, and, as you might expect, the conversations on Twitter were just as rich and full as the presentations themselves.

This post was written by Kate McManus, resident at Minnesota Public Radio.


Professional Development Time Project: Audiorecorder

One of the incredible things about my NDSR residency is that it requires me to spend 20% of my time working on professional development. This allows me to devote quite a bit of time to learning skills and working on things of interest that would otherwise be outside the scope of my project. (For people who are considering applying to future NDSR cohorts, I can’t emphasize enough how great it is to have the equivalent of one day a week to focus your attention on growth in areas of your choosing).

A great way to spend some of this time was suggested to me by some personal audio projects I am working on. I realized that (A) I wanted a tool that would give me a range of signal monitoring capabilities while digitizing audio, and (B) I didn’t feel like paying for one. Solution: try to build one myself!

As I was familiar with using vrecord for video digitization, I decided to build a tool called ‘audiorecorder’ that followed the same approach: namely, using a shell script to pipe data between FFmpeg and ffplay. This allows data to be captured at high quality while simultaneously visualizing it. Looking at the filters available through FFmpeg, I decided to build an interface that would allow monitoring of peak levels, spectrum, and phase. This seemed doable and would give me pretty much everything I would need for both deck calibration and QC while digitizing. The layout I settled on looks like this:

audiorecorder interface
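To give a flavor of the underlying approach, here is a much simplified sketch of the capture-and-monitor pipeline (not the actual audiorecorder script; the avfoundation device ":0" is an assumption for a Mac, and the file name is a placeholder):

# capture 24-bit audio to a WAV file while piping a scrolling spectrum view to ffplay
ffmpeg -f avfoundation -i ":0" \
  -filter_complex "[0:a]showspectrum=s=640x256:slide=scroll[mon]" \
  -map 0:a -c:a pcm_s24le capture.wav \
  -map "[mon]" -c:v rawvideo -f nut - \
| ffplay -window_title "audio monitor" -f nut -i -
# the real tool stacks additional views (peak levels and phase) into the same monitoring window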

I also wanted audiorecorder to support embedding BWF metadata and to bypass any need to open the file in a traditional DAW for trimming silence. After quite a bit of experimentation, I hit upon a series of looped functions using FFmpeg that allow files to be manually trimmed at both the start and the end, with an option for automatic silence detection at the start. Being able to do this type of manipulation makes it possible to embed BWF metadata without worrying about having to re-embed it in an edited file.
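For a rough idea of what those two steps look like in plain FFmpeg terms (a simplified sketch, not the looped functions in the actual script; file names and metadata values are placeholders):

# 1. automatically trim silence from the start of a capture
ffmpeg -i capture.wav -af "silenceremove=start_periods=1:start_threshold=-60dB" trimmed.wav
# 2. embed BWF (bext) metadata while copying the trimmed audio unchanged
ffmpeg -i trimmed.wav -c:a copy -write_bext 1 \
  -metadata originator="Your Institution" \
  -metadata origination_date="2017-04-20" \
  -metadata coding_history="A=ANALOGUE,M=stereo,T=open reel deck" \
  final.wav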

To test the robustness of the digitization quality, I used the global analysis tools in the free trial version of WaveLab. After some hiccups early in the process, I have been pleased to find that my tool is performing reliably with no dropped samples! (Right now it is using a pretty large buffer. This contributes to latency, especially when only capturing one channel. One of my next steps will be to test how far I can roll that back while adding variable buffering depending on capture choices.)

Overall this has been a very fun process, and it has accomplished the goal of ‘professional development’ in several ways. I have been able to gain more experience both with scripting and with manipulating media streams. Since the project is hosted on the AMIA Open Source GitHub site, I have also been able to learn more about managing collaborative projects in an open-source context. While initially intimidating, it has been exciting to work on a tool in public and benefit from constant feedback and support. Plus, at the end of the day I am left with a free and open tool that fulfills my audio archiving needs!

Audiorecorder lives at https://github.com/amiaopensource/audiorecorder and is installable via Homebrew. Some of my next steps are to write a bit of real usage documentation and to ensure that audiorecorder can also be installed via Linuxbrew.

I would like to give particular thanks to the contributors who have helped with the project so far: Reto Kromer, Matt Boyd, and Dave Rice!

This post was written by Andrew Weaver, resident at CUNY TV.
