ingesting large objects to fedora with ruby and rest

Sat, Feb 2, 2013

In this tutorial I show how to use the pedal bin of sanity, into which you can drop your SOAP code, and instead use Ruby and REST to ingest large files into a Fedora repository.

It seemed a good idea at the time: use the Fedora SOAP API, as I'd used XMLBeans before and creating a client API from the WSDL is easy using Maven with XMLBeans. What you get is a very nice document-based API. For example, this is what adding a content datastream looks like with the XMLBeans API:

There's a ton more stuff to do before that, including messing around with XML cursors to insert the RELS-EXT, but I know my way around XML so it wasn't a problem. What was a problem, it turned out, was this line:

objDatastreamVersion.setBinaryContent(item.getBinaryContent());

Never has one simple, innocuous line of code caused so much hassle! The project I was working on started off simple with simple requirements, one of which was to limit file uploads to 10MB. Then the project took off and became popular, and the requirements changed, which is fine. That's how projects work, and working in an agile development framework lets you handle changes like that. One of the new requirements was to remove the file size limit on uploads. And that's when the trouble started. You know the story: the users are breaking the door down, the project manager is shouting through a loudhailer from behind some sandbags, and you're wondering what your next move is. You remember the old agile maxim of 'chuck out the first one', roll up your sleeves and dig into the Fedora source code to find out why the server is running out of memory when someone uploads a 70MB zip file.

The simplest SOAP call to ingest an object takes a complete FOXML document containing the base64-encoded zip file, which Fedora then extracts into a String. A 70MB String. All in a oner. So you can see where this is going. A better way is to get Fedora to create the object and then attach the various datastreams. This way, the binary content gets chunked at the Tomcat end and is dribbled into the object rather than stuffed down its thrapple. But I had no code to do that, and the prospect of working with the raw FOXML at each step, throwing in XML cursors and raw RDF parsing, was looking like too much. The reason I was using SOAP was the ease of use from XMLBeans and Java. Why Java? An initial requirement was that the front end had to talk to multiple repositories, so I wrote DRAKula (Digital Repository Abstraction Kit, err, oo-la-la!). This is middleware that can talk to Fedora and Intralibrary, but that requirement had been dropped. So I was free to drop the SOAP. But what to replace it with?

Well, Fedora has a nice REST API and I'd written the front end in Rails. I'd used the Ruby RestClient to communicate with DRAKula, so why not go the whole hog and use it to talk to Fedora directly? So I put DRAK back in his box and ported the code to Ruby as a Fedora class in the front end. And suddenly ingesting large objects became a breeze. Here's how to do it in Ruby.

In the following code the URL you'll be using is:

@fedora_api_m = "#{protocol}://#{username}:#{password}@#{url}/fedora"

e.g.

https://testuser:testpassword@localhost:8443/fedora

You first need to create the empty object and get Fedora to generate a new PID for you, which will include the repository namespace:

new_object_pid = RestClient.post "#{@fedora_api_m}/objects/new", nil

Then you can go ahead and update the default DC record for the object. This isn't very useful as it's just for Fedora housekeeping but it lets you use the simple search endpoint when developing. The OAIDC_NS is there to tell Nokogiri about the namespace. References to metadata are just a Metadata object the class gets from the front end.
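As a rough sketch of that step, here's one way the DC payload could be put together. It builds the XML by hand rather than with Nokogiri, purely so it runs with nothing but core Ruby. The Metadata struct, its three fields and the dc_xml helper are made up for illustration; only the two namespace URIs are the standard ones.

```ruby
# Standard namespace URIs for the oai_dc container and Dublin Core elements.
OAIDC_NS = 'http://www.openarchives.org/OAI/2.0/oai_dc/'
DC_NS    = 'http://purl.org/dc/elements/1.1/'

# Hypothetical stand-in for the metadata object handed over by the front end.
Metadata = Struct.new(:title, :description, :creator)

# Build a minimal oai_dc document for the given PID.
# String#encode(xml: :text) escapes &, < and > in the values.
def dc_xml(pid, metadata)
  fields = { 'title' => metadata.title,
             'description' => metadata.description,
             'creator' => metadata.creator }
  lines = fields.map do |name, value|
    "  <dc:#{name}>#{value.to_s.encode(xml: :text)}</dc:#{name}>"
  end
  <<~XML
    <oai_dc:dc xmlns:oai_dc="#{OAIDC_NS}" xmlns:dc="#{DC_NS}">
      <dc:identifier>#{pid.encode(xml: :text)}</dc:identifier>
    #{lines.join("\n")}
    </oai_dc:dc>
  XML
end

# With rest-client loaded, replacing the default DC datastream would then
# be something along the lines of:
#   RestClient.put "#{@fedora_api_m}/objects/#{pid}/datastreams/DC",
#                  dc_xml(pid, metadata), content_type: 'text/xml'
```

The PUT goes to the existing DC datastream because Fedora has already created one for the new object; you're replacing its content, not adding a datastream.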

After that I create a DCTERMS datastream which is heading along the path of custom metadata. I chose this as resources being imported from the Intralibrary had DCTERMS attached so I thought, what the heck, might as well use the same in Fedora. You're not limited to DCTERMS though. You can have any custom metadata you like.
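A minimal sketch of adding such a datastream, assuming the Fedora 3 REST parameters controlGroup, dsLabel and mimeType. The helper names, the "metadata" root element and the datastream label are my own choices for illustration, not anything Fedora requires.

```ruby
require 'uri'

DCTERMS_NS = 'http://purl.org/dc/terms/'

# Build the REST URL for a datastream on an existing object,
# with any extra parameters form-encoded into the query string.
def datastream_url(base, pid, ds_id, params = {})
  url = "#{base}/objects/#{pid}/datastreams/#{ds_id}"
  query = URI.encode_www_form(params)
  query.empty? ? url : "#{url}?#{query}"
end

# Wrap a hash of DCTERMS terms in a root element. The root element
# name "metadata" is an arbitrary choice for this sketch.
def dcterms_xml(terms)
  lines = terms.map do |name, value|
    "  <dcterms:#{name}>#{value.to_s.encode(xml: :text)}</dcterms:#{name}>"
  end
  %(<metadata xmlns:dcterms="#{DCTERMS_NS}">\n#{lines.join("\n")}\n</metadata>\n)
end

url = datastream_url('https://localhost:8443/fedora', 'demo:1', 'DCTERMS',
                     controlGroup: 'X',          # inline XML datastream
                     dsLabel: 'DCTERMS metadata',
                     mimeType: 'text/xml')

# With rest-client loaded, creating the datastream would be roughly:
#   RestClient.post url, dcterms_xml('title' => 'My resource'),
#                   content_type: 'text/xml'
```

Unlike the DC record this is a POST, because the DCTERMS datastream doesn't exist on the object yet.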

Again nice and easy. Next up is the actual content. The 70MB zip file. I've chosen to call this datastream 'original' as it's the original content. Note how Fedora will add the file extension to the downloaded file.
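To make the streaming visible, here's a sketch that uses Net::HTTP from the standard library instead of RestClient. It assumes a managed datastream (controlGroup 'M') and a base URL without embedded credentials; the content_request helper and its parameters are hypothetical. The point is body_stream: the file is read in chunks as it's sent, so nothing ever holds the whole zip in memory.

```ruby
require 'net/http'
require 'uri'

# Build (but don't send) a POST that streams an open file into a
# managed datastream. The caller owns the IO and closes it afterwards.
def content_request(base, pid, io, label:, ds_id: 'original',
                    mime_type: 'application/zip')
  query = URI.encode_www_form(controlGroup: 'M',
                              dsLabel: label,
                              mimeType: mime_type)
  uri = URI.parse("#{base}/objects/#{pid}/datastreams/#{ds_id}?#{query}")

  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = mime_type
  request['Content-Length'] = io.size.to_s
  # body_stream tells Net::HTTP to read from the IO while sending,
  # rather than loading the whole body into a String first.
  request.body_stream = io
  [uri, request]
end

# Sending it would look roughly like this (credentials as in the example URL):
#   File.open('big.zip', 'rb') do |io|
#     uri, request = content_request('https://localhost:8443/fedora',
#                                    pid, io, label: 'big.zip')
#     request.basic_auth('testuser', 'testpassword')
#     Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
#       http.request(request)
#     end
#   end
```

RestClient does much the same when you hand it an open File as the payload, so the one-liner version stays just as short.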

I then add an owner as a literal. What does 'literal' mean? Say you have the owner 'harrymcd'. If the relationship isn't literal then there has to be an object in Fedora called 'harrymcd' but with the repository prefix. That doesn't exist so instead you say it's a literal relationship. 'harrymcd' will be added as a string rather than an object reference in the RELS-EXT.
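Here's a sketch of what the literal relationship call could look like, assuming the Fedora 3 relationships endpoint and its subject, predicate, object and isLiteral parameters. The predicate URI below is a placeholder I've invented; substitute whatever your RELS-EXT vocabulary actually uses.

```ruby
require 'uri'

# Hypothetical predicate URI, for illustration only.
OWNER_PREDICATE = 'http://example.org/relations#owner'

# Build the URL that adds a literal-valued relationship to an object.
# isLiteral=true tells Fedora to store the value as a plain string
# instead of treating it as a reference to another object.
def literal_relationship_url(base, pid, predicate, value)
  query = URI.encode_www_form(subject: "info:fedora/#{pid}",
                              predicate: predicate,
                              object: value,
                              isLiteral: true)
  "#{base}/objects/#{pid}/relationships/new?#{query}"
end

url = literal_relationship_url('https://localhost:8443/fedora', 'demo:1',
                               OWNER_PREDICATE, 'harrymcd')

# With rest-client loaded, the call itself would be roughly:
#   RestClient.post url, nil
```

Everything travels in the query string here, so the POST body is empty.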

There's a lot of code I've omitted which adds the resource to collections based on the user's choice in the front end, as well as terms and conditions files which get uploaded at the same time and associated with the resource but you get the gist. Ho ho! Geddit? Gist!

Oh well, one must keep a sense of humour bubbling on the pot when one is working with middleware.
