Adventures in Perceptual Hashing

Background

One of the primary goals of my project at CUNY TV is to prototype a system of perceptual hashing that can be integrated into our archival workflows. Perceptual hashing is a method of identifying similar content through automated analysis, with the goal of eliminating the (often impossible) necessity of having a person look at every item one by one to make comparisons. Perceptual hashes function much like standard checksums, except that instead of being compared to establish exact matches between files at the bit level, they are compared to establish similarity of content as it would be perceived by a viewer or listener.

There are many examples of perceptual hashing in use that you might already have encountered. If you have ever used Shazam to identify a song, then you have used perceptual hashing! Likewise, if you have ever done a reverse Google image search, that was also facilitated through perceptual hashing. And when you encounter a video on YouTube that has had its audio track muted due to a copyright claim, the match was probably detected via perceptual hashing.

One of the key differences between normal checksum comparisons and perceptual hashing is that perceptual hashes attempt to enable linking of original items with derivative or modified versions. For example, if you have a high quality FLAC file of a song and make a copy transcoded to MP3, you will not be able to generate matching checksums from the two files. Likewise, if you added a watermark to a digital video, it would be impossible to link the original and the watermarked copy using checksums. In both of these examples, even though the actual content is almost identical in a perceptual sense, as data it is completely different.
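
To make that concrete, here is a quick sketch using FFmpeg and the common md5sum utility (the filenames are just placeholders): transcode a FLAC to MP3 and hash both copies, and the two checksums will never match, even though the two files sound essentially the same.

# Transcode a lossless file to MP3, then hash both copies;
# the MD5 values will differ even though the audio content is nearly identical.
ffmpeg -i song.flac -codec:a libmp3lame -qscale:a 2 song.mp3
md5sum song.flac song.mp3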

Perceptual hashing methods seek to accommodate these types of changes by applying various transformations to the content before generating the actual hash, also known as a ‘fingerprint’. For example, the audio fingerprinting library Chromaprint converts all inputs to a sample rate of 11025 Hz before generating representations of the content in musical notes via frequency analysis, which are then used to create the final fingerprint. [1] The MPEG-7 fingerprint standard that I have been experimenting with for my project does something analogous, generating global fingerprints per frame with both original and square aspect ratios as well as several sub-fingerprints. [2] This allows comparisons that are resistant to differences caused by factors such as lossy codecs and cropping.
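
As a rough illustration on the audio side, Chromaprint ships with a small command-line tool called fpcalc that prints a fingerprint for any audio file FFmpeg can decode. A sketch, assuming fpcalc is installed and reusing the song.flac/song.mp3 pair from above: the two fingerprints will not be byte-identical, but most of their values will differ by only a few bits, which is exactly the kind of closeness a comparison tool measures.

# Print raw integer fingerprints for the lossless original and the lossy copy;
# similar content yields similar (not identical) fingerprints.
fpcalc -raw song.flac
fpcalc -raw song.mp3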

Hands-on Example

As of its version 3.3 release, the powerful open source tool FFmpeg has the ability to generate and compare MPEG-7 video fingerprints. What this means is that if you have the most recent version of FFmpeg, you are already capable of conducting perceptual hash comparisons! If you don’t have FFmpeg and are interested in installing it, there are excellent instructions for Apple, Linux, and Windows users available at Reto Kromer’s webpage.
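
If you already have FFmpeg and want to check whether your copy is new enough, a quick check (the grep step assumes a Unix-like shell) is to confirm that the version is 3.3 or higher and that the signature filter is listed:

# Confirm the FFmpeg version and the presence of the MPEG-7 signature filter.
ffmpeg -version
ffmpeg -filters 2>/dev/null | grep signature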

For this example I used the following video from the University of Washington Libraries’ Internet Archive page as a source.

https://archive.org/embed/NarrowsclipH.264ForVideoPodcasting

From this source, I created a GIF out of a small excerpt and uploaded it to Github.

GIF

Since FFmpeg supports URLs as inputs, it is possible to test out the filter without downloading the sample files! Just copy the following command into your terminal and try running it! (Sometimes this might time out; in that case just wait a bit and try again).

Sample Fingerprint Command

ffmpeg -i https://ia601302.us.archive.org/25/items/NarrowsclipH.264ForVideoPodcasting/Narrowsclip-h.264.ogv -i https://raw.githubusercontent.com/privatezero/Blog-Materials/master/Tacoma.gif -filter_complex signature=detectmode=full:nb_inputs=2 -f null -

Generated Output

[Parsed_signature_0 @ 0x7feef8c34bc0] matching of video 0 at 80.747414 and 1 at 0.000000, 48 frames matching
[Parsed_signature_0 @ 0x7feef8c34bc0] whole video matching

What these results show is that even though the GIF was of lower quality and different dimensions, the FFmpeg signature filter was able to detect that it matched content in the original video starting around 80 seconds in!

For some further breakdown of how this command works (and lots of other FFmpeg commands), see the example at ffmprovisr.

Implementing Perceptual Hashing at CUNY TV

At CUNY we are interested in perceptual hashing to help identify redundant material, as well as establish connections between shows utilizing identical source content. For example, by implementing systemwide perceptual hashing, it would be possible to take footage from a particular shoot and programmatically search for every production that it had been used in. This would obviously be MUCH faster than viewing videos one by one looking for similar scenes.

As our goal is collection-wide searching, three elements had to be added to our existing preservation structures: a method for generating fingerprints on file ingest, a database for storing them, and a way to compare files against that database. Fortunately, one of these was already essentially in place. Since other parts of my residency called for building a database to store other forms of preservation metadata, I had an existing database that I could modify with an additional table for perceptual hashes. The MPEG-7 system of hash comparisons uses a three-tiered approach to establish accurate links, starting with a ‘rough fingerprint’ and then moving on to more granular levels. For simplicity and speed (and because we do not need accuracy down to fractions of a second), I decided to store only the components of the ‘rough fingerprint’ in the database. Each ‘rough fingerprint’ represents the information contained in a forty-five-frame segment. Full fingerprints are stored as sidecar files in our archival packages.
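
As a rough sketch of that sidecar step (this is an illustration, not our actual microservice; the filenames are placeholders), FFmpeg’s signature filter can write the full MPEG-7 signature, coarse segments included, to an XML file:

# Write the MPEG-7 video signature of a single file to an XML sidecar.
ffmpeg -i input.mov -vf signature=format=xml:filename=input_signature.xml -f null -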

As digital preservation at CUNY TV revolves around a set of microservices, I was able to write a script for fingerprint generation and database insertion that can be run on individual files as well as inserted as needed into preservation actions (such as AIP generation). This script is available here at the mediamicroservices repository on GitHub. Likewise, my script for generating a hash from an input and comparing it against values stored in the database can be found here.

I am now creating and ingesting fingerprints into the database for as many files as possible. As of this writing, there are around 1700 videos represented in the database by over 1.7 million individual fingerprints. The image below is an example of current search output.

Breweryguy

During a search, fingerprints are generated and then compared against the database, with matches reported in 500-frame chunks. Output is displayed both in the terminal and superimposed on a preview window. In this example, a 100-second segment of the input returned matches from three files, with the search (including fingerprint generation) taking about 25 seconds.

The most accurate results so far have been derived from footage of a more distinct nature (such as the example above). False positives occur more frequently with less visually unique sources, such as interviews conducted in front of a uniform black background. While there are still a few quirks in the system, testing so far has been successful at identifying footage reused across multiple broadcasts with a manageably low number of false results. Overall, I am very optimistic about the potential enhancements that perceptual hashing can add to CUNY TV archival workflows!

This post was written by Andrew Weaver, resident at CUNY TV.


Reporting from the PNW: Online Northwest Conference

Hello, it’s your Pacific Standard Time/Pacific Northwest resident here reporting on happenings in the Cascadia region. I was fortunate to learn about the Online Northwest conference (someone who attended my KBOO Edit-a-Thon event told me about it!), albeit at the last minute. This small, one-day conference was well organized, with the majority of presenters and participants from Oregon and Washington, plus others from northern California, New York, and Virginia. It was held on Cesar Chavez’s birthday, which is a recognized holiday at KBOO. And P.S.: some of our recently digitized open reel audio made it into KBOO’s Cesar Chavez Day Special Programming, so check it out!

I learned that Online Northwest (#ONW17) had previously been held in Corvallis, was on hiatus last year, and was held at PSU for the first time this year. This year there were four tracks: User Experience/Understanding Users, Design, Engagement/Impact, and Working with Data. Presenter slides are here: http://pdxscholar.library.pdx.edu/onlinenorthwest/2017/

First, I was fully engaged by Safiya Umoja Noble’s keynote “Social Justice in LIS: Finding the Imperative to Act,” which highlighted the ongoing need for reevaluation and action in our work as information professionals to ensure equitable access to information in online catalogs, cataloging procedures, search algorithms, interface design, and personal interactions.

I attended Kelley McGrath’s (University of Oregon) session on metadata, where she talked about the potential of linked data and the semantic web, or the brave new metadata world. She discussed algorithms and machines that attempt to self-learn and present users with new information drawn from available linked data sources. Of course, that information can be fickle at times, since computers largely do not understand the data they are fed. The end-of-session questions came close to putting two and two together: one can say that algorithms are neutral, but we can’t ignore that humans choose and develop them, and implemented algorithms are rarely critiqued for human bias. This reminded me of Andreas Orphanides’s keynote at Code4Lib: a person initially frames the question being asked, and that influences the answer, or what is being looked at. Framing to focus on something specific requires excluding many other things. My takeaway from this session, on the heels of Dr. Noble’s keynote, was to be cautious and consider potential bias when using algorithms.

In “The Design of Library Things: Creating a Cohesive Brand Identity for Your Library,” Stacy Taylor and Maureen Rust (Central Washington University) talked about their brand identity process from start to finish, including their challenges, and shared their awesome resources. Important takeaways included considering the tone of communication and signage (what is the emotional state of readers when they see your signs?), styling materials consistently, defining the scope of the project, clearly separating standards (requirements) from best practices, and continuously making the case for, and being firm about, the needs for and benefits of branding.

“Open Oregon Resources: From Google Form to Interactive Web Apps” by Amy Hofer (Linn-Benton Community College) and Tamara Marnell (Central Oregon Community College) was fantastic! They walked through their use of Google Forms/Sheets and WordPress to share the open educational resources in use by professors at Oregon community colleges, along with ideas for how others could implement something similar, but better, keeping in mind that technology advances, old versions age, and any implemented tool has ongoing time and resource needs.

Chris Petersen from Oregon State University (OSU) presented on Creating a System for the Online Delivery of Oral History Content. He described ways in which he reused existing OSU resources (Kaltura, XML, TEI, a colleague knowledgeable in XSL/XSLT), common in the “poor archivist’s” toolbox. I also thought his reference to the Cascadia Subduction Zone’s impending massive destruction, when arguing for the need to keep copies of digital materials in multiple locations, was uniquely PNW. Like the Open Oregon Resources talk, this session was modeled as “here’s what I did, but it clearly can be done better; here’s what we’ll be doing or moving toward in the future.” Chris mentioned they’ll be using OHMS in the future; although it is less flexible than their current setup, it is more viable considering his workload. He also mentioned Express Scribe (inexpensive) as an in-house transcription tool and XSL-FO for formatting XML to PDF.

My favorite lightning talk was Electronic Marginalia (Lorena O’English, Washington State University): a case for web annotation. Tools I want to check out are hypothes.is (open source) and the Genius Web Annotator.

This post was written by Selena Chau, resident at KBOO Community Radio.


AAPB NDSR April Webinar

Our last AAPB NDSR webinar is coming up this week!

Whats, Whys, and How Tos of Web Archiving
(Lorena Ramírez-López)
Thursday, April 6
3:00 PM EST

Basic introduction of:
– what web archiving is 🤔
– why we should web archive ¯\_(ツ)_/¯
– how to web archive 🤓

in a very DIY/PDA style that’ll include practical methods of making your website more “archive-able” – or not! – and a live demo of webrecorder.io using our very own NDSR website as an example: https://ndsr.americanarchive.org/!

Register for “Whats, Whys, and How Tos of Web Archiving”

And if you missed our March webinars from Kate McManus and Selena Chau, you can catch up now:

“Through the Trapdoor: Metadata & Disambiguation in Fanfiction, a case study,” presented by Kate McManus, AAPB NDSR resident at Minnesota Public Radio
– view the slides
– watch the recording

“ResourceSpace for Audiovisual Archiving,” presented by Selena Chau, AAPB NDSR resident at KBOO Community Radio, and Nicole Martin, Senior Manager of Archives and Digital Systems, Human Rights Watch
– view the slides
– watch the recording
– watch the demo videos: 1, 2, 3, 4


Library Technology Conference

I had juuuuuuuuuuust enough in my professional development funds to attend my very first Library Technology Conference at Macalester College in St. Paul. My local mentor, Jason Roy, recommended the conference to me. Though I could only afford to attend one day, it was such a good conference, and I did my best to get as much out of it as I could! Judging by the boost in my Twitter followers, I did a really good job.

The morning began with caffeine, sugar, and keynote speaker Lauren Di Monte, who talked about data, data collection, and the Internet of Things. I was hanging out on Twitter (#LTC2017) during the keynote, and a lot of the chatter was about the ethics of harvesting that information. Should we archive someone’s information from their smartphone, especially private information like their health tracking apps? This topic was probably too big for Di Monte to settle during a keynote, but it set the tone for conversations all day. (Which is what you want from a keynote!)

The first breakout session I went to was “My Resources Are Digital But My Staff is Still Analog!” with Brian McCann, which was a really engaging discussion on training staff and patrons on emerging tech tools. The second I went to was slightly less helpful to me personally (I was really torn between this and another session, but that’s the way the conference crumbles). It was “Launching the Learning Commons: Digital Learning and Leadership Initiative” with Mandy Bellm. She did amazing work in her school district, revamping the library for 21st-century use, but I would have liked to hear more about her process and experience writing the grants.

My third breakout session, “Digital Texts and Learning: Overcoming Barriers to Effective Use” with Brad Belbas and Dave Collins, turned out to be rather emotional. In theory, digital text formats have a ton of ADA accessibility capabilities, but ultimately, tech companies and digital publishers don’t seem to have any incentive to make their digital texts readable by as many people as possible! My tweets from this session were still being passed around a day later.

The final session of my LibTechConf experience was the cherry on top of a great day: “Giving a Voice to Audio Collections” with Christopher Doll and Joseph Letriz, who talked about their experience bringing oral histories into their online collections. Their project highlights the 100-year history of African American students at the University of Dubuque in Iowa. They highlighted tools they used, such as the Oral History Metadata Synchronizer (OHMS), to get the project on its feet as quickly as possible. It was fascinating! Make sure you check it out: http://digitalud-dev.dbq.edu/omeka/exhibits/show/aheadofthecurve/introduction

Overall, I would absolutely recommend this conference. I met librarians and digital content folks from all over the country. There were tons of good presentations at all different levels and entry points. The conversations were fun and productive, and as you might expect, the conversations on Twitter were just as rich and full as the presentations themselves.

This post was written by Kate McManus, resident at Minnesota Public Radio.


Professional Development Time Project: Audiorecorder

One of the incredible things about my NDSR residency is that it requires me to spend 20% of my time working on professional development. This allows me to devote quite a bit of time to learning skills and working on things of interest that would otherwise be outside the scope of my project. (For people who are considering applying to future NDSR cohorts, I can’t emphasize enough how great it is to have the equivalent of one day a week to focus your attention on growth in areas of your choosing).

A great way to spend some of this time was suggested to me through some personal audio projects I am working on. I realized that (a) I wanted a tool that would give me a range of signal monitoring capabilities while digitizing audio, and (b) I didn’t feel like paying for one. Solution: try to build it myself!

As I was familiar with using vrecord for video digitization, I decided to build a tool called ‘audiorecorder’ that followed the same approach: using a shell script to pipe data between FFmpeg and ffplay. This allows data to be captured at high quality while it is simultaneously visualized. Looking at the filters available through FFmpeg, I decided to build an interface that allows monitoring of peak levels, spectrum, and phase. This seemed doable and would give me pretty much everything I would need for both deck calibration and QC while digitizing. The layout I settled on looks like this:

audiorecorder interface
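
For the curious, here is a heavily simplified sketch of the piping idea described above (this is not the actual audiorecorder code; it assumes a macOS audio input reached through FFmpeg’s avfoundation device, and the scope sizes are arbitrary):

# Capture audio to a 24-bit WAV while piping a spectrum/phase-scope video to ffplay for monitoring.
ffmpeg -f avfoundation -i ":0" \
  -filter_complex "[0:a]asplit=3[rec][s][p];[s]showspectrum=s=640x256:slide=scroll[spec];[p]avectorscope=s=256x256[phase];[spec][phase]hstack[mon]" \
  -map "[rec]" -c:a pcm_s24le capture.wav \
  -map "[mon]" -c:v rawvideo -f nut - | ffplay -window_title "monitor" -f nut -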

I also wanted audiorecorder to support embedding BWF metadata and to bypass any need to open the file in a traditional DAW to trim silence. After quite a bit of experimentation, I have hit upon a series of looped functions using FFmpeg that allow files to be manually trimmed at both the start and the end, with an option for automatic silence detection at the start. Being able to do this type of manipulation makes it possible to embed BWF metadata without worrying about having to re-embed it in an edited file.
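
The scripted workflow lives in the repository, but as a rough sketch of the underlying idea (the threshold and metadata values below are placeholders, not audiorecorder’s defaults), a single FFmpeg pass can both strip leading silence and embed bext metadata:

# Trim leading silence and embed BWF (bext) metadata in one pass.
ffmpeg -i capture.wav \
  -af "silenceremove=start_periods=1:start_threshold=-57dB" \
  -c:a pcm_s24le -write_bext 1 \
  -metadata originator="Example Archive" \
  -metadata coding_history="A=ANALOGUE,M=stereo,T=Studer A810" \
  capture_trimmed.wav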

To test the robustness of digitization quality I used the global analysis tools in the free trial version of Wavelab. After some hiccups early in the process I have been pleased to find that my tool is performing reliably with no dropped samples! (Right now it is using a pretty large buffer. This contributes to latency, especially when only capturing one channel. One of my next steps will be to test how far I can roll that back while adding variable buffering depending on capture choices).

Overall this has been a very fun process, and it has accomplished the goal of ‘Professional Development’ in several ways. I have been able to gain more experience both with scripting and with manipulating media streams. Since the project is hosted on the AMIA Open Source GitHub site, I have been able to learn more about managing collaborative projects in an open source context. While initially intimidating, it has been exciting to work on a tool in public and benefit from constant feedback and support. Plus, at the end of the day I am left with a free and open tool that fulfills my audio archiving needs!

Audiorecorder lives at https://github.com/amiaopensource/audiorecorder and is installable via Homebrew. Some of my next steps are to write a bit of real usage documentation and to ensure that it is also possible to install audiorecorder via Linuxbrew.
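
If you want to try it, installation should look roughly like this (the tap shown here is the AMIA Open Source tap; check the repository README for the exact, current instructions):

# Assumed tap/formula names; see the README for the authoritative install steps.
brew tap amiaopensource/amiaos
brew install audiorecorder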

I would like to give particular thanks to the contributors who have helped with the project so far, Reto Kromer, Matt Boyd and Dave Rice!

This post was written by Andrew Weaver, resident at CUNY TV.


Save the Data!

On Saturday, February 25th, I met an archivist friend for lunch and then we went to our first DataRescue session, hosted by the University of Minnesota. The event went from Friday afternoon to Saturday evening.

I was going to go on Friday, but switched my plans midweek based on the weather forecast. On Friday, the metro area was supposed to see up to a foot of snow, which I was happy about because the week before, we had seen temperatures in the 60s. Instead we got a few flakes and temperatures in the mid thirties. I shouldn’t have to tell you how unusual that is for Minnesota in February. The promise of working at the climate change data rescue helped me channel some of my own frustration. Some things, like science, shouldn’t be political.

The goal, of course, was to capture federal information that should remain free and available, and back it up on DataRefuge.Org. Our focus was primarily climate data, identified by professors and instructors at the University as being essential to their teaching and research.

When we got there, the room was nearly full of information professionals from all over the Twin Cities, working at all the different stages: 1. Nomination, 2. Seeding, 3. Researchers, 4. Checkers & Baggers, 5. Harvesters. Here is the workflow from GitHub that we were working from.

We signed in and were directed to what I will call “Station 4.” We walked back to the table and, lo and behold, Valerie Collins from the D.C. NDSR cohort was there! Valerie and I have met before, and I was delighted to see her. She got us up to speed, and my friend and I got to work. We checked datasets and then uploaded them to the Data Refuge site, creating records and filling in the metadata. It was archiving, it was cataloging, it was resistance, and it was fun. There were snacks and cool, passionate people, and there was a real sense of being in a room where people knew how to use their skills to make sure knowledge about the reality of climate change stayed in the hands of the scientists and the people who need the information.

I hope they host another one.

This post was written by Kate McManus, resident at Minnesota Public Radio.


Moving Beyond the Allegory of the Lone Digital Archivist (& my day of Windows scripting at KBOO)

The “lone arranger” was a term I learned in my library science degree program, and I accepted it. I visualized hard-working, problem-solving solo archivists in small-staff situations challenged with organizing, preserving, and providing access to the growing volumes of historically and culturally relevant materials that could be used by researchers. As much as the archives profession is about facilitating a deep relationship between researchers and records, the work to make archival records accessible to researchers needed to be completed first. The lone arranger term described professionals, myself among them, working alone and with known limitations to meet this charge. This reality has encouraged archivists without a team to band together and be creative about forming networks of professional support. The Society of American Archivists (SAA) has organized support for lone arrangers since 1999, and now has a full-fledged Roundtable for professionals to meet and discuss their challenges. Similarly, support for the lone digital archivist was the topic of a presentation I heard at the recent 2017 Code4Lib conference at UCLA, given by Elvia Arroyo-Ramirez, Kelly Bolding, and Faith Charlton of Princeton University.

Managing the digital record is a challenge that requires more attention, knowledge sharing, and training in the profession. At Code4Lib, digital archivists talked about how archivists on their teams did not know how to process born-digital works; this is a challenge, and more than that, it is unacceptable in this day and age. It was pointed out that our degree programs didn’t offer the same support for digital archiving as they did for processing archival manuscripts and other ephemera. The NDSR program aims to close the gap on digital archiving and preservation, and the SAA has a Digital Archives Specialist credential program, but technology training in libraries and archives shouldn’t be limited to the few who are motivated to seek it out. Many archivist jobs will be in small or medium-sized organizations, and we argued that processing born-digital works should always be considered part of archival responsibilities. Again, this was a conversation among proponents of digital archives work, and I recognize that it excludes many other thoughts and perspectives. The discussion would be more fruitful if it included individuals who may feel there is a block to their learning and development in processing born-digital records, and if it focused on how to break down those barriers.

Code4Lib sessions (http://bit.ly/d-team-values, http://scottwhyoung.com/talks/participatory-design-code4lib-2017/) reinforced values of the library and archives profession, namely advocacy and empowering users. No matter how specialized an archival process is, digital or not, there is always a need to be able to talk about the work to people who know very little about archiving, whether they are stakeholders, potential funders, community members, or new team members. Advocacy is usually associated with external relations, but is an approach that can be taken when introducing colleagues to technology skills within our library and archives teams. Many sessions at Code4Lib were highly technical, yet the conversation always circled back to helping the users and staying in touch with humanity. When I say highly technical, I do not mean “scary.” Another session reminded us that technology can often cause anxiety, and can be misinterpreted as something that can solve all problems. When we talk to people, we should let them know what technology can do, and what it can’t do. The reality is that technology knowledge is attainable and shouldn’t be feared. It cannot solve all work challenges but having a new skill set and understanding of technology can help us reach some solutions. It can be a holistic process as well. The framing of challenges is a human-defined model, and finding ways to meet the challenges will also be human driven. People will always brainstorm their best solutions with the tools and knowledge they have available to them—so let’s add digital archiving and preservation tools and knowledge to the mix.

And the Windows scripting part?

I was originally going to write about my checksum validation process on Windows, without Python, and then I went to Code4Lib, which was inspiring and thought-provoking. In the distributed cohort model, I am a lone archivist if you frame your perspective around my host organization. But I primarily draw my knowledge from my awesome cohort members and the growing professional network I have connected with on Twitter (Who knew? Not me.), so I am not a lone archivist in this expanded view. When I was challenged to validate a large number of checksums without the ability to install new programs on my work computer, I asked my colleagues for help. Below is my abridged process, showing how I was helped through an unfamiliar process to a workable solution using not only my ideas but also ideas from my colleagues. Or scroll all the way down for “Just the solution.”

KBOO recently received files back from a vendor who digitized some of our open-reel content. Hooray! As with any good post-digitization work, ours had to start with verification of the files, and this meant validating checksum hash values. Follow me on my journey through my day of PowerShell and the Windows command line.

Our deliverables included a preservation wav, a mezzanine wav, and an mp3 access file, plus related jpgs of the items, an xml file, and an md5 sidecar for each audio file. The audio filenames followed our filenaming convention, which was designated in advance, and the files related to a physical item were in a folder with the same naming convention.

md5deep can verify file hashes by comparing two reports created with the program, but I had to make some changes to the format of the checksum data before the reports could be compared.

Can md5deep run recursively through folders? Yes, and it can recursively compare everything in a directory (and subdirectories) against a manifest.
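
For example (a sketch, assuming a manifest.md5 of known hashes already exists), this checks everything under the current directory against the manifest and prints any file whose hash is not in it:

# Recursively hash the current directory and report files not matching the manifest.
md5deep -r -x manifest.md5 .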

Can md5deep selectively run on just .wav files? Not that I know of, so I’ll ask some people.

Twitter & Slack inquiry: Hey, do you have a batch process that runs on designated files recursively?

Response: You’d have to employ some additional software or commands like [some unix example]

@Nkrabben: Windows or Unix? How about Checksumthing?

Me: Windows, and I can’t install new programs, including Python at the moment

@private_zero: hey! I’ve done something similar but not on Windows. But try this PowerShell script that combines all sidecar files into one text file. And by the way, remember to sort the lines in the file so they match the sort of the other file you’re comparing it to.

Me: Awesome! With adjustments for my particular situation, it works like a charm. Can PowerShell scripts be given a clickable icon to run easily, like Windows batch files, in my work setup where I can’t install new things?

Answer: Don’t know… [Update: create a file with extension .ps1 and call that file from a .bat file]

@kieranjol: Hey! If you run this md5deep command it should run just on wav files.

Me: Hm, tried it but doesn’t seem like md5deep is set up to run with that combination of Windows parameters.

@private_zero: I tried running a command, seems like md5deep works recursively but not picking out just wav files. Additional filter needed.

My afternoon of PowerShell and the command line: research on FC (file compare), sort, and ways to remove characters from text files (the vendor had an asterisk in front of every file name in their sidecar files that needed to be removed to match the output of an md5deep report).

??? moments:

It turns out that using PowerShell forces output to UTF-8 with a BOM, as compared to the ASCII/‘plain’ UTF output of the md5deep text files. This needed to be resolved before comparing files.

The md5deep output that I created listed names only and not paths, but that left space characters at the end of the lines! Those needed to be stripped out before comparing files.

I tried to perform the same function as the PowerShell script in the Windows command line, but I was hitting walls, so I went ahead with my solution of mixing PowerShell and command-line commands.

After I got six individual commands to run, I combined the PowerShell ones and the Windows command-line ones; here is my process for validating checksums:

Just the solution:

It’s messy, yes, and there are better and cleaner ways to do this! I recently learned about this shell scripting guide that advocates for versioning, code reviews, continuous integration, static code analysis, and testing of shell scripts. https://dev.to/thiht/shell-scripts-matter

Create one big list of md5 hashes from the vendor’s individual sidecar files using PowerShell

–only include the preservation wav md5 sidecar files, look for them recursively through the directory structure, then sort them alphabetically. The combined file is named mediapreserve_20170302.txt. Remove the asterisk (vendor formatting) so that the text file matches the format of an md5deep output file. After removing the asterisk, the vendor md5 hash values will be in the vendormd5edited.txt file.

Open PowerShell

Navigate to the new temp folder containing the vendor files

dir .\* -exclude *_mezz.wav.md5,*.xml,*.mp3, *.mp3.md5,*.wav,*_mezz.wav,*.jpg,*.txt,*.bat -rec | gc | out-file -Encoding ASCII .\vendormd5.txt; Get-ChildItem -Recurse A:\mediapreserve_20170302 -Exclude *_mezz.wav.md5,*.xml,*.mp3, *.mp3.md5,*.wav.md5,*_mezz.wav,*.jpg,*.bat,*.txt | where { !$_.PSisContainer } | Sort-Object name | Select FullName | ft -hidetableheaders | Out-File -Encoding "UTF8" A:\mediapreserve_20170302\mediapreserve_20170302.txt; (Get-Content A:\mediapreserve_20170302\vendormd5.txt) | ForEach-Object { $_ -replace '\*' } | set-content -encoding ascii A:\mediapreserve_20170302\vendormd5edited.txt

Create my md5 hashes to compare to vendor’s

–run md5deep on the txt list of wav files from inside the temp folder using the Windows command line (this will take a long time when hashing multiple wav files)

"A:\md5deep-4.3\md5deep.exe" -ebf mediapreserve_20170302.txt >> md5.txt

Within my new md5 value list text file, sort my md5 hashes alphabetically and trim the trailing space characters to match the format of the vendor checksum file. Then compare my text file containing hashes with the file containing the vendor hashes.

–I put in pauses to make sure the previous commands completed, and so I could follow the order of commands.

Run the combined-commands.bat batch file (which includes):

sort md5.txt /+34 /o md5sorteddata.txt

timeout /t 2 /nobreak

@echo off > md5sorteddata_1.txt & setLocal enableDELAYedeXpansioN

for /f "tokens=1* delims=]" %%a in ('find /N /V "" ^<md5sorteddata.txt') do (
  SET "str=%%b"
  for /l %%i in (1,1,100) do if "!str:~-1!"==" " set "str=!str:~0,-1!"
  <nul >>md5sorteddata_1.txt SET /P "l=!str!"
  >>md5sorteddata_1.txt echo.
)

timeout /t 5 /nobreak

fc /c A:\mediapreserve_20170302\vendormd5edited.txt A:\mediapreserve_20170302\md5sorteddata_1.txt

pause

The two files are the same, so all of the data within them matches, and therefore all of the checksums match. With that, we’ve verified the integrity and authenticity of the files transferred from the vendor to our server.

This post was written by Selena Chau, resident at KBOO Community Radio.
