Clojure: Extracting child elements from an XML document with zip-filter
I’ve been following Nurullah Akkaya’s blog post about navigating XML documents using the Clojure zip-filter API and I came across an interesting problem in a document I’m parsing which goes beyond what’s covered in his post.
Nurullah provides a neat zip-str function which we can use to convert an XML string into a zipper object:
(require '[clojure.zip :as zip] '[clojure.xml :as xml])
(use '[clojure.contrib.zip-filter.xml])
;; Parse an XML string and wrap the resulting tree in a zipper.
(defn zip-str [s]
  (zip/xml-zip (xml/parse (java.io.ByteArrayInputStream. (.getBytes s)))))
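To get a feel for what zip-str returns, here is a minimal sketch that pokes at the zipper using only clojure.zip and clojure.xml, without zip-filter. The one-element document and the names tiny and first-child are made up for illustration.

```clojure
(require '[clojure.zip :as zip]
         '[clojure.xml :as xml])

;; Same helper as above, repeated so the snippet stands alone.
(defn zip-str [s]
  (zip/xml-zip (xml/parse (java.io.ByteArrayInputStream. (.getBytes s)))))

;; `tiny` is a hypothetical document used only for this illustration.
(def tiny (zip-str "<root><FirstName>Mark</FirstName></root>"))

;; zip/node returns the element map at the current location.
(:tag (zip/node tiny))        ; => :root

;; zip/down moves to the first child; its text lives under :content.
(def first-child (zip/node (zip/down tiny)))
(:tag first-child)            ; => :FirstName
(:content first-child)        ; => ["Mark"]
```

Each location wraps a plain map with :tag, :attrs and :content keys, which is what the zip-filter functions navigate for us.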
The fragment of the document I’m parsing looks like this:
(def test-doc (zip-str "<?xml version='1.0' encoding='UTF-8'?>
<root>
<Person>
<FirstName>Charles</FirstName>
<LastName>Kubicek</LastName>
</Person>
<Person>
<FirstName>Mark</FirstName>
<MiddleName>H</MiddleName>
<LastName>Needham</LastName>
</Person>
</root>"))
I wanted to get the full name of each person, so that I’d end up with a collection which looked like this:
("Charles Kubicek" "Mark H Needham")
My initial thinking was to get all the child elements of the Person element and operate on those:
(require '[clojure.contrib.zip-filter :as zf])
(xml-> test-doc :Person zf/children text)
Unfortunately that gives back all the names in one collection like so:
("Charles" "Kubicek" "Mark" "H" "Needham")
Since the MiddleName element is optional, we can’t tell from this flat list which names belong to which person!
A bit of googling led me to Stack Overflow, where Timothy Pratley suggests that we navigate only as far as the Person element and then pick out each of its child elements individually.
We can do that by mapping over the collection with a function which creates a vector for each Person containing all their names.
In pseudo-code this is what we want to do:
> (map magic-function (xml-> test-doc :Person))
(["Charles" "Kubicek"] ["Mark" "H" "Needham"])
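To make the grouping idea concrete, here is a rough sketch of one possible magic-function that works on the plain maps returned by clojure.xml/parse, with no zipper involved. The names parse-str, people-xml and person-names are hypothetical, and this is not the approach the rest of the post takes.

```clojure
(require '[clojure.xml :as xml])

;; Hypothetical helper: parse an XML string into nested maps.
(defn parse-str [s]
  (xml/parse (java.io.ByteArrayInputStream. (.getBytes s))))

;; The same two people as in test-doc, written on one line.
(def people-xml
  (parse-str (str "<root>"
                  "<Person><FirstName>Charles</FirstName>"
                  "<LastName>Kubicek</LastName></Person>"
                  "<Person><FirstName>Mark</FirstName>"
                  "<MiddleName>H</MiddleName>"
                  "<LastName>Needham</LastName></Person>"
                  "</root>")))

;; Collect the text of every child element of a single Person map.
(defn person-names [person]
  (vec (mapcat :content (:content person))))

;; Mapping per Person keeps each person's names together.
(def grouped (map person-names (:content people-xml)))
;; => (["Charles" "Kubicek"] ["Mark" "H" "Needham"])
```

Because the function runs once per Person, the missing MiddleName for Charles no longer causes any ambiguity.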
Timothy suggests the juxt function which is defined like so:
juxt: "Takes a set of functions and returns a fn that is the juxtaposition of those fns. The returned fn takes a variable number of args, and returns a vector containing the result of applying each fn to the args (left-to-right)."
A simple use of juxt could be to create some values containing my name:
((juxt #(str % " loves Clojure") #(str % " loves Scala")) "Mark")
Which returns:
["Mark loves Clojure" "Mark loves Scala"]
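Since keywords are functions of maps, juxt is also handy for pulling several fields out of one value. The sketch below uses made-up person maps (mark and charles) rather than XML, and shows that a missing field leaves a nil in its slot.

```clojure
;; Hypothetical person maps, used only for illustration.
(def mark {:first "Mark" :middle "H" :last "Needham"})
(def charles {:first "Charles" :last "Kubicek"})

;; One extractor built from three keyword lookups.
(def name-parts (juxt :first :middle :last))

(name-parts mark)     ; => ["Mark" "H" "Needham"]
(name-parts charles)  ; => ["Charles" nil "Kubicek"]
```

The result always has one slot per function, so optional fields show up as nil rather than being dropped.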
We can use juxt to build the vector of names for each person and then use clojure.string/join (http://clojuredocs.org/clojure_core/clojure.string/join) to combine them, separated by spaces.
The code to do this ends up looking like this:
(require '[clojure.string :as str])
(defn get-names [doc]
  (->> (xml-> doc :Person)
       (map (juxt #(xml1-> % :FirstName text)
                  #(xml1-> % :MiddleName text)
                  #(xml1-> % :LastName text)))
       (map (partial filter seq))
       (map (partial str/join " "))))
We use a filter on the second-to-last line to get rid of any nil values in the vector (e.g. when there is no middle name) and then join the remaining names with a space on the last line.
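The filtering step can be tried on its own with plain vectors. This small sketch uses hand-written input instead of parsed XML, and the names with-nil and unfiltered are made up for illustration.

```clojure
(require '[clojure.string :as str])

;; seq returns nil for both nil and "", so filter drops either one.
(filter seq ["Mark" nil "Needham"])   ; => ("Mark" "Needham")
(filter seq ["Mark" "" "Needham"])    ; => ("Mark" "Needham")

;; Filtering before joining gives a single space between the names.
(def with-nil ["Charles" nil "Kubicek"])
(str/join " " (filter seq with-nil))  ; => "Charles Kubicek"

;; Without the filter, the nil becomes an empty string and we
;; end up with two spaces in the middle.
(def unfiltered (str/join " " with-nil))
;; => "Charles  Kubicek"
```

If only nil values need removing, (remove nil? coll) would be an alternative that keeps empty strings.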
We can then call the function:
> (get-names test-doc)
("Charles Kubicek" "Mark H Needham")