Parsing large XML files in python without a billion gigs of ram


One of the attractions of working at a startup is that you can generally avoid having to contend with the technologies of yesteryear. Mainframes, leased lines, COBOL... no one should have to deal with this stuff anymore, right? Unfortunately, you will eventually collide with reality, and even though this is 2014 and we all know JSON is the light and the way, your helpdesk ticketing vendor, whom you have no control over, will tell you the only way to get bulk data out of their system is... an XML export.


OK, fine, whatever. This is a one-time thing, I don't need to care for and feed this XML, I just need to rip it apart and stick the data in Postgres where we can actually make use of it. Shouldn't be too hard, I'll just whip up a little python script...

import xml.etree.cElementTree as ET

tree = ET.parse('huge.xml')
for ticket_node in tree.findall('ticket'):
    #process ticket...

...which would work great if we were talking about a few MB, but when huge.xml is 1.3GB this approach just melts your laptop (on a 16GB MacBook Pro, once the python process took more than about 3GB the system became almost totally unresponsive, and it wasn't nearly done yet). Back to square one.
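(If you'd rather watch this happen with numbers than with a frozen UI, here's a quick sketch using the standard-library resource module — Unix-only, and the helper name is mine, not anything standard:)

```python
import resource
import xml.etree.ElementTree as ET

def parse_with_peak(path):
    """Parse the whole file, then report this process's peak memory."""
    tree = ET.parse(path)
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # note: ru_maxrss is kilobytes on Linux but bytes on macOS
    return tree, peak
```

Point it at the 1.3GB file and you can watch peak memory climb into the gigabytes before parse() even returns.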

First let's take a quick look at our data.

<?xml version="1.0" encoding="UTF-8"?>
<tickets report_date="20140217">
  <ticket>
    <!-- various ticket fields, some of which I want -->
    <comments type="array">
      <comment>
        <!-- various comment fields, some of which I want -->
      </comment>
      <!-- possibly more comment tags -->
    </comments>
  </ticket>
  <!-- many, many ticket tags -->
</tickets>

Not very complicated. As a whole it's not really a document, just a list of <ticket> nodes, and each of those is a small document, which I want to pick a few pieces out of.  I don't need to do any complex navigation of the tree, just grab a few things from each <ticket> as it's read then throw it away and read the next one.  Turns out ElementTree provides a tool for just this use case: iterparse().  Let's try again:

import xml.etree.cElementTree as ET

for event, element in ET.iterparse('huge.xml'):
    if event == 'end' and element.tag == 'ticket':
        #process ticket...

...what the?!  My laptop melted again!  This execution followed precisely the same memory-usage (and system responsiveness) arc as the parse-the-whole-file approach.  What gives?

OK, a little googling tells me that even though iterparse() emits elements as they are read, it's still building up a complete document tree in memory, just as it did when I tried parse() at first.  Several blog posts and Stack Overflow answers recommend adding element.clear() at the end of the loop as a way of cleaning up the objects you don't need anymore and thus limiting memory consumption.  I'll save you the trouble: it doesn't work. At all. Yet other blog posts, SO answers, and even an IBM whitepaper suggest a somewhat more thorough housekeeping job at the end of the loop:

import lxml.etree as ET #the IBM piece used lxml (getparent()/getprevious() are lxml-only) but I tried cElementTree also

for event, elem in ET.iterparse('huge.xml'):
    if event == 'end' and elem.tag == 'ticket':
        #process ticket...
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]

...argh! I melted ANOTHER laptop!

Why doesn't this work? Frankly, I have no idea.
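One partial clue I can offer, easy to verify with a toy document: element.clear() empties an element's children, attributes, and text, but the emptied element itself stays attached to the root, so iterparse's tree still grows by one (empty) node per ticket. A sketch:

```python
import xml.etree.ElementTree as ET
from io import StringIO

xml = '<tickets>' + '<ticket><id>1</id></ticket>' * 5 + '</tickets>'
context = ET.iterparse(StringIO(xml), events=('start', 'end'))
_, root = next(context)  # the first event is the start of the root element
for event, elem in context:
    if event == 'end' and elem.tag == 'ticket':
        elem.clear()     # empties the ticket...
print(len(root))         # ...but all the empty tickets still hang off the root; prints 5
```

That explains why clear() alone can't save you, though not why the getparent()/getprevious() version also failed here.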

Let me digress a moment on why I love python. As a DBA and systems engineer, I face a large number of one-off programming challenges. Move this from here to there. Munge this data. Migrate from this to that. This type of challenge is well suited to a problem-solving approach I think of as Brute Force Programming (apologies to whatever forgotten blogger I'm ripping this off from). In short, sometimes it's not worth spending any time at all thinking about building an elegant, maintainable solution.  Sometimes you just need to solve this particular problem, right now, and then forget about it. Python is utterly fantastic at this. The concise syntax, complete lack of ceremony, design philosophy of one-obvious-way, and wealth of libraries all contribute to a tool that's very easy to shape quickly to the form of whatever problem you happen to be facing. Even if the result is 10 times slower than an equivalent Java solution, if it takes 5 minutes to write instead of 5 hours, I've come out ahead, because man-hours are much more valuable at the margin than cpu-hours.

All that is a long-winded way of justifying the following brute force solution to my problem, which actually works, and doesn't melt laptops:

import xml.etree.cElementTree as ET

def process_buffer(buf):
    tnode = ET.fromstring(buf)
    #pull it apart and stick it in the database

inputbuffer = ''
with open('huge.xml','r') as inputfile:
    append = False
    for line in inputfile:
        if '<ticket>' in line:
            inputbuffer = line
            append = True
        elif '</ticket>' in line:
            inputbuffer += line
            append = False
            process_buffer(inputbuffer)
            inputbuffer = None
            del inputbuffer #probably redundant...
        elif append:
            inputbuffer += line

Not the most elegant, or efficient, or general solution, but it Worked. Just read the file by hand, exploiting the simplicity of its structure to slice its contents into manageable chunks before parsing, then parse and process each chunk independently, and finally make double sure to throw it all away once it isn't needed any longer.

The memory-consumption problem inherent in ingesting huge XML files in python is that holding the whole tree in memory is too expensive, and the commonly-used libraries want to do this no matter how much data you're dealing with, and even if you ask them to work iteratively. The commonly-proposed solutions involve mucking about in that tree and selectively clearing away memory as you go, but when this approach fails to work it may require a deeper dive into the guts of the ElementTree object model than you are prepared to take to solve the problem at hand. Instead, ignore the XML superstructure, extract the bits you want with basic file and string operations as if XML parsers didn't exist, then attack each of those bits individually.
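In case I ever need this again, the pattern generalizes to a small helper — a sketch under the same assumption the script above leans on, namely that each record's opening and closing tags sit on their own lines with no attributes (the function name and its contract are mine, not from any library):

```python
import xml.etree.ElementTree as ET

def iter_records(path, tag):
    """Yield one parsed Element per <tag>...</tag> block, reading line by line."""
    start, end = '<%s>' % tag, '</%s>' % tag
    buf = []
    with open(path, 'r') as f:
        for line in f:
            if start in line:
                buf = [line]    # a new record begins; reset the buffer
            elif end in line:
                buf.append(line)
                yield ET.fromstring(''.join(buf))  # parse just this one record
                buf = []
            elif buf:
                buf.append(line)  # accumulate lines inside the current record
```

Usage is just `for ticket in iter_records('huge.xml', 'ticket'): ...` — memory stays flat because only one record is ever alive at a time.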
