*********************************************************************** * Scalable and Efficient XML DOM Parsing - Midway between SAX and DOM * *********************************************************************** http://www.infosys.tuwien.ac.at/staff/hummer/docs/domParsing.txt October 14, 2011 Type(s) ------- * "Praktikum" (12 ECTS) * topic is extensible to the scope of a Master's thesis Description ------------ The Extensible Markup Language (XML) is the prevalent format for data exchange on the Web. XML documents are machine-readable and specialized parsers take over the job of reading and interpreting data from a stream of characters. Two distinctly different parsing methods have emerged in the past years: SAX (Simple API for XML) and DOM (Document Object Model). While SAX is stream-based and "stateless" and hence scales for XML documents of arbitrary size, DOM parsers create the full tree model of the XML document in memory and hence only provide limited scalability. Existing research papers have proposed a lazy loading approach to XML parsing (e.g., [1,2]), which is accomplished by parsing the XML document in a two-step procedure: firstly, the plain document structure is created with links to the actual node content; secondly, the document is progressively parsed and node contents are loaded upon first access. Also the popular Xerces parser provides a method for lazy DOM evaluation: thereby, the DOM is initialized with an empty root node and child nodes are only loaded upon request. These existing approaches are able to lazily load nodes, but often fail to keep the memory consumption limited. In fact, after all elements have been (lazily) loaded (e.g., after full traversal of the document tree), the whole document again needs to fit into the available RAM. Recently, more sophisticated solutions have been proposed which allow to dynamically swap node data between RAM and disk storage [4]. This is clearly a very promising approach for complex DOM processing with limited memory capacity. Another space-efficient XML DOM parser has been presented in [5]. Unfortunately, there are so far no (or few) publicly available open-source tools that satisfactorily perform this task. This project aims at providing an easy-to-use solution for parsing very large XML documents in compliance with the Java DOM parsing interface (org.w3c.dom.*). First, existing research work and freely available frameworks will be analyzed. Secondly, these available tools will be evaluated using an experiment benchmark setup. Based on the evaluation results, the goal is to implement a prototype for parsing very-large XML documents in DOM fashion. Thereby, the goal is to design the solution based on (or as an extension of) an existing framework. Finally, the implemented prototype will be compared against the existing tools and the lessons learned shall be documented. Skill Requirements ------------------ * Good Java programming skills * Basic I/O, file handling, database technology * XML, DOM, SAX, Xerces, ... * Debugging and profiling (CPU usage, Memory leaks etc.) of Java applications [1] M. Noga, S. Schott, W. Loewe. Lazy XML processing. ACM symposium on Document engineering (DocEng'02), ACM, 2002. http://doi.acm.org/10.1145/585058.585075 [2] F. Farfan, V. Hristidis, R. Rangaswami. Beyond Lazy XML Parsing. 18th International Conference on Database and Expert Systems Applications (DEXA'07), 2007. [3] http://xerces.apache.org/xerces-j/faq-write.html#faq-4 [4] M. Probst. Processing Arbitrarily Large XML using a Persistent DOM. Balisage: The Markup Conference. 2010. http://www.balisage.net/Proceedings/vol5/html/Probst01/BalisageVol5-Probst01.html [5] F. Wang, J. Li, H. Homayounfar. A space efficient XML DOM parser. Data and Knowledge Engineering 60, 1, 185-207. 2007. http://dx.doi.org/10.1016/j.datak.2006.01.002 If you are interested in this topic please contact Waldemar Hummer (hummer@infosys.tuwien.ac.at).