reading huge xml with xmlreader

sep 22 2019, 5:04am

The DOMDocument class works well for reading small XML files, but it loads the whole document into memory, so with large or huge XML files the script may stall and give you no error at all. For large XML files, use XMLReader instead: it reads the file node by node and keeps your server's memory usage low.
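
As a minimal sketch of that node-by-node reading, the helper below counts elements with a given name and reports parse errors after the loop instead of failing silently. The function name countElements and the use of exceptions are assumptions made for illustration; they are not part of the script further down.

```php
<?php
// Hypothetical helper: count elements named $name without loading the whole file.
function countElements(string $file, string $name): int
{
    if (!is_readable($file)) {
        throw new RuntimeException("cannot read $file");
    }

    // Collect libxml parse errors so they can be reported after the loop.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $reader = new XMLReader();
    if (!$reader->open($file)) {
        libxml_use_internal_errors($previous);
        throw new RuntimeException("cannot open $file");
    }

    $count = 0;
    // read() advances the cursor one node at a time.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $name) {
            $count++;
        }
    }
    $reader->close();

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($errors) {
        throw new RuntimeException('XML error: ' . trim($errors[0]->message));
    }
    return $count;
}
```

Only the current node is held at any moment, which is why memory stays roughly flat no matter how large the file is.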

XML source

A huge sample XML file was downloaded from the Karaoke Version affiliate program; it generally has the following structure:

<artists>
	<artist id="2000">
		<name>The Solids</name>
		<name_sorted>Solids, The</name_sorted>
		<url>http://www.karaoke-version.com/mp3-backingtrack/the-solids/</url>
		<rank>5866</rank>
		<songs>
			<song id="5022">
				<name>Hey Beautiful</name>
				<url>http://www.karaoke-version.com/mp3-backingtrack/the-solids/hey-beautiful.html</url>
				<rank>24467</rank>
				<preview>http://www.karaoke-version.com/preview/57278/</preview>
				...

If we need to save that data into our database, we can format each data row (line) as: the artist's name, the song's name and the link for previewing the audio. Example:

The Solids, Hey Beautiful, http://www.karaoke-version.com/preview/57278/
...
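
One caveat with a comma-separated row is that an artist or song name may itself contain a comma. A minimal sketch of one way to handle that is to let PHP's fputcsv() do the quoting. The helper name formatRow is an assumption for illustration only.

```php
<?php
// Hypothetical helper: format one row as a CSV line, quoting fields where needed.
function formatRow(string $artist, string $song, string $preview): string
{
    $buffer = fopen('php://memory', 'w+');
    // Separator, enclosure and escape are passed explicitly rather than relying on defaults.
    fputcsv($buffer, [$artist, $song, $preview], ',', '"', '\\');
    rewind($buffer);
    $line = stream_get_contents($buffer);
    fclose($buffer);

    // fputcsv() appends a line ending; drop it so the caller decides how to join rows.
    return rtrim($line, "\n");
}
```

A row written this way can be read back with str_getcsv() or fgetcsv() without the embedded commas splitting a field.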

PHP script

<?php
	$t = time();
	$m = memory_get_usage();

	const XML_FILENAME = 'karaokeversion_catalog_en_GBP.xml';

	$liner = new XMLReader();
	if(!$liner->open(XML_FILENAME))
		exit('cannot open ' . XML_FILENAME . "\n");

	$artistCount = 0;	//number of artists
	$songCount = 0;		//number of songs (all artists)

	while($liner->read()){
	if($liner->nodeType === XMLReader::ELEMENT && $liner->name === 'artist'){

	//copy the current element and its whole subtree into a DOM node
	$node = $liner->expand();

	//for each artist node found, assume unknown artist name, initialize it
	$artistName = '';

	//walk through this artist node's child nodes to find artist name and songs
	for($j = 0; $j < $node->childNodes->length; $j++){
		$nodeChild = $node->childNodes->item($j);
		if($nodeChild->nodeType === XML_ELEMENT_NODE && $nodeChild->nodeName === 'name')
			$artistName = $nodeChild->nodeValue;
		elseif($nodeChild->nodeName === 'songs'){
			
			//walk through this songs node's child nodes
			for($k = 0; $k < $nodeChild->childNodes->length; $k++){
				$nodeGrandChild = $nodeChild->childNodes->item($k);
				if($nodeGrandChild->nodeType === XML_ELEMENT_NODE && $nodeGrandChild->nodeName === 'song'){

					//for each song node found, assume unknown song details, initialize them
					$songName = '';
					$songPreview = '';
			
					//walk through this song node's child nodes
					for($l = 0; $l < $nodeGrandChild->childNodes->length; $l++){
						$nodeGrandGrandChild = $nodeGrandChild->childNodes->item($l);
						if($nodeGrandGrandChild->nodeType === XML_ELEMENT_NODE){
							if($nodeGrandGrandChild->nodeName === 'name')
								$songName = $nodeGrandGrandChild->nodeValue;
							elseif($nodeGrandGrandChild->nodeName === 'preview')
								$songPreview = $nodeGrandGrandChild->nodeValue;
						}
					}
					//add validation first here then format a new entry line to be stored somewhere
					if(!empty($artistName) && !empty($songName) && filter_var($songPreview, FILTER_VALIDATE_URL)){
						echo "$artistName, $songName, $songPreview\n";
						$songCount++;
					}
				}
			}
		}
	}
	$artistCount++;
	}
	}//end while

	$liner->close();

	//report
	$t = time() - $t;
	$m = memory_get_peak_usage() - $m;	//peak usage shows the real cost; the final usage is close to the start

	echo "time spent: $t seconds.\n",
	"memory usage: $m bytes.\n",
	"artist count: $artistCount.\n",
	"song count: $songCount.\n";
?>
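
The nested loops above can also be written more compactly. The following is only a sketch of an alternative, not the script used for the demo: it assumes all artist elements are siblings under the root, uses XMLReader::next() to jump from one artist to the next without visiting the nodes in between, and hands each artist's markup to SimpleXML. The generator name songRows is hypothetical.

```php
<?php
// Hypothetical generator: yields [artist, song, preview] rows from the catalog file.
function songRows(string $file): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($file)) {
        throw new RuntimeException("cannot open $file");
    }

    // Move the cursor to the first <artist> element.
    $found = false;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'artist') {
            $found = true;
            break;
        }
    }

    while ($found) {
        // Parse only this artist's subtree, so memory is bounded by one artist.
        $artist = new SimpleXMLElement($reader->readOuterXml());
        $artistName = trim((string) $artist->name);

        foreach ($artist->xpath('songs/song') ?: [] as $song) {
            $songName = trim((string) $song->name);
            $preview = trim((string) $song->preview);
            // Same validation idea as the script above: all three fields must be usable.
            if ($artistName !== '' && $songName !== '' && filter_var($preview, FILTER_VALIDATE_URL)) {
                yield [$artistName, $songName, $preview];
            }
        }

        // Jump to the next <artist>, skipping the subtree just processed.
        $found = $reader->next('artist');
    }
    $reader->close();
}
```

It could be used as: foreach (songRows('karaokeversion_catalog_en_GBP.xml') as $row) { ... }. The trade-off is that each artist's subtree is parsed a second time by SimpleXML, in exchange for much shorter extraction code.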

Demo

Click the following link to test: xmlreader (on a sister site).
