Parsing Economic data with Clojure

I recently got it into my head that I want to do some economic research, so I set out to find a big set of data, which I quickly did in the form of the St. Louis Fed’s FRED API/service.

The API exposes quite a lot of functions for retrieving data, so many that I needed to decide on an approach for getting as much of the data as possible with as little work as possible. I finally decided to start from the category ids that are available on the page I linked to above.

With that in mind my code starts as follows:

(ns fred.core
  (:require
   [clojure.contrib.duck-streams :as ds]
   [clojure.xml :as xml]
   [clojure.contrib.str-utils2 :as string]
   [clojure.contrib.sql :as sql])
  (:import [java.io File]))

(def main-cats (list 23 1 9 10 15 32145 18 22 24 31 45 13 46 32263))

As you can see main-cats contains the ids mentioned above.

What we need now are a few lines that let us download XML files describing one of two things: sub categories or series. If the XML contains series we have a leaf category, for want of a better word; if it contains other categories we have a branch and need to fetch its contents in turn.

Let’s start out with the obvious functions for simply retrieving an XML file:

(defn get-rest [action args]
  (str "http://api.stlouisfed.org/fred/" action "?api_key=my-FRED-API-Key&realtime_start=1776-07-04&realtime_end=9999-12-31" args)) 

(defn slurp-cat [id]
  (slurp (get-rest "category/children" (str "&category_id=" id))))

Not much to add here; note that slurp is sufficient for these simple GET requests.
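Since get-rest glues the query string together by hand, any value containing spaces or ampersands would have to be escaped first. Here is a minimal sketch of how that could be done; query-string is a hypothetical helper of my own, not part of the code above, and the parameter names in the usage line are only for illustration.

```clojure
(require '[clojure.string :as str])
(import '[java.net URLEncoder])

;; Hypothetical helper: turns a sequence of [key value] pairs into an
;; escaped query string. A vector of pairs keeps the parameter order stable.
(defn query-string [params]
  (->> params
       (map (fn [[k v]]
              (str (URLEncoder/encode (name k) "UTF-8")
                   "="
                   (URLEncoder/encode (str v) "UTF-8"))))
       (str/join "&")))

;; Usage: spaces are escaped, plain ids pass through untouched.
(query-string [[:category_id 23] [:name "a b"]])
```

For the plain numeric ids used in this post the hand-built string works just as well, so treat this as optional.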

As you could probably infer from the problem description above, the natural solution is recursive. That is what I normally would have implemented, but in this case so much can go wrong (external data on the other side of the world can be a bitch) that I wanted something I could iterate and restart without having to begin from scratch. It’s also a one-off operation, so I didn’t mind running the algo a few times to get to the leaves of the category tree.
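The restartable part boils down to one rule: never fetch something that is already on disk. A minimal sketch of that rule, assuming a recent Clojure where spit lives in clojure.core and file helpers in clojure.java.io; save-once! and its fetch-fn argument are hypothetical names used only for this illustration.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: calls fetch-fn and writes its result to path,
;; but only if the file is not there yet. Returns true when it wrote
;; the file and false when it skipped the work.
(defn save-once! [path fetch-fn]
  (let [f (io/file path)]
    (if (.exists f)
      false
      (do (io/make-parents f)   ; create missing parent directories
          (spit f (fetch-fn))
          true))))
```

Because fetch-fn is only called when the file is missing, a crashed run can simply be started again and will pick up where it left off.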

Let’s start out with saving the main-cats:

(defn save-cat [id]
  (ds/spit
   (str "cats/cat" id ".xml")
   (slurp-cat id)))

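(defn get-main-cats []
  (.mkdir (File. "cats"))
  ;; doseq instead of map: map is lazy, so outside the REPL the
  ;; downloads might never actually run.
  (doseq [id main-cats]
    (save-cat id)))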

The duck streams library is responsible for saving the files in the cats folder. OK, now that we have the main categories saved, it’s time to start iterating until we have all of them.

(defn file-exists [fname]
  (.exists (File. fname)))

(defn cat-exists [id]
  (file-exists (str "cats/cat" id ".xml")))

(defn cat-ids-from-xml [f]
  (map
   #(:id (:attrs %))
   (:content (first (xml-seq (xml/parse f))))))

(defn get-ids-in-dir [dir]
  (flatten (map cat-ids-from-xml (rest (file-seq (File. dir))))))

(defn get-sub-cats []
  ;; doall forces the lazy seq so every download runs before we return
  (doall
   (map
    #(when-not (cat-exists %)
       (do (save-cat %) %))
    (get-ids-in-dir "cats"))))

So get-ids-in-dir will look in each file in the cats dir and extract all category ids that their XML contains. And as you know we already have the XML files of the main-cats downloaded so we have something to start with. We also know that they do not contain any data series, hence we can safely start running get-sub-cats to start to get all the categories in FRED.

Note that xml-seq does not belong to clojure.xml; it lives in clojure.core.

The xml-seq and xml/parse combo will return the XML as a mix of maps and vectors, which we navigate/manage by way of first, rest, flatten and map key lookups.
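To make that structure concrete, here is a small sketch that parses an XML string held in memory. The sample XML is made up to resemble a category listing and is not real FRED output; parse-str is a hypothetical helper for this example only.

```clojure
(require '[clojure.xml :as xml])
(import '[java.io ByteArrayInputStream])

;; Made-up sample shaped like a category listing; not real FRED output.
(def sample-xml
  "<categories><category id=\"101\" name=\"A\"/><category id=\"102\" name=\"B\"/></categories>")

;; Hypothetical helper: xml/parse accepts an InputStream, so we wrap
;; the string's bytes in one.
(defn parse-str [s]
  (xml/parse (ByteArrayInputStream. (.getBytes s "UTF-8"))))

(def parsed (parse-str sample-xml))

(:tag parsed)                              ; => :categories
(= parsed (first (xml-seq parsed)))        ; => true, the first node is the root
(map #(:id (:attrs %)) (:content parsed))  ; => ("101" "102")
```

Each element is a map with :tag, :attrs and :content keys, and :content holds the child elements in a vector, which is why the id lookup is just two key lookups per child.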

Note a few subtleties in get-sub-cats. We plug a sequence of category ids into the map, which will add nil to its own return list when the category in question already exists. If it does not exist, we save it and use do in order to make sure we return something other than nil, in this case the id in question. This allows us to repeatedly run get-sub-cats until it returns a list filled with only nil (when-not returns nil, not false, when its test is true). If we get that all-nil list back we know we have all categories.
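The "run it until everything is nil" idea can be tried without touching the network. The sketch below swaps the cats directory for an atom and the FRED tree for a tiny made-up map, so fake-tree, saved, fake-pass and fetch-all are all hypothetical stand-ins rather than part of the real code.

```clojure
;; Made-up category tree: parent id -> child ids. Not real FRED data.
(def fake-tree {"1" ["2" "3"], "2" ["4"], "3" [], "4" []})

;; Ids "downloaded" so far; plays the role of the cats directory.
(def saved (atom #{"1"}))

;; One pass, mirroring get-sub-cats: nil for ids we already have,
;; the id itself for ids saved during this pass.
(defn fake-pass []
  (doall
   (map #(when-not (contains? @saved %)
           (do (swap! saved conj %) %))
        (mapcat fake-tree @saved))))

;; Repeat passes until one comes back all nil; returns the pass count.
(defn fetch-all []
  (loop [passes 1]
    (if (every? nil? (fake-pass))
      passes
      (recur (inc passes)))))

(fetch-all) ; => 3 (two passes that find new ids, one that finds none)
```

Each pass only reaches one level further down the tree, so the number of passes grows with the depth of the tree rather than with the number of categories.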

Getting the series that the leaf categories contain, and then the observations of each series, is done with basically the same components and functions as above, so I won’t get into that; I would just be repeating myself.

The takeaway here is how to work with XML files and the native Java functions for handling directories and files.
