Reading huge XML with XMLReader

sep 22 2019, 5:04am feb 8, 2022

The DOMDocument class works well for small XML files, but it loads the whole document into memory, so on a large or huge file the script may stall and give you no error at all. For large XML, use XMLReader instead: it moves through the file one node at a time, which keeps your server's memory usage low.
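As a rough illustration of the streaming approach, here is a minimal sketch that counts elements without ever building the full tree. The inline XML string and the `$reader` variable name are assumptions made for the example; with a real file you would call `open()` with a path instead of `XML()`.

```php
<?php
// Tiny inline document standing in for a large file (illustrative data only).
$xml = '<artists>'
     . '<artist id="1"><name>A</name></artist>'
     . '<artist id="2"><name>B</name></artist>'
     . '</artists>';

$reader = new XMLReader();
$reader->XML($xml); // for a real file: $reader->open($path)

$artistCount = 0;
while ($reader->read()) {
    // The reader exposes one node at a time; nothing is kept once we move on.
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'artist') {
        $artistCount++;
    }
}
$reader->close();

echo "artists: $artistCount\n";
```

Because only the current node is inspected, memory use should stay roughly flat no matter how many artist elements the file contains.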

XML source

A huge sample XML file was downloaded from this page of the Karaoke Version affiliate program. It generally has the following structure:

<artists>
	<artist id="2000">
		<name>The Solids</name>
		<name_sorted>Solids, The</name_sorted>
		<url>http://www.karaoke-version.com/mp3-backingtrack/the-solids/</url>
		<rank>5866</rank>
		<songs>
			<song id="5022">
				<name>Hey Beautiful</name>
				<url>http://www.karaoke-version.com/mp3-backingtrack/the-solids/hey-beautiful.html</url>
				<rank>24467</rank>
				<preview>http://www.karaoke-version.com/preview/57278/</preview>
				...

If we need to save that data into our database, we can format each song as one data row (line) made of the artist's name, the song name and the link for previewing the audio. Example:

The Solids, Hey Beautiful, http://www.karaoke-version.com/preview/57278/
...
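One thing to watch with comma-separated rows is that a value may itself contain a comma (the sorted artist name "Solids, The" does). A minimal sketch, assuming you want CSV-style lines, is to let `fputcsv()` handle the quoting. The `$row` array and the in-memory stream are illustrative assumptions, not part of the original script.

```php
<?php
// Hypothetical row; the first field contains a comma on purpose.
$row = [
    'Solids, The',
    'Hey Beautiful',
    'http://www.karaoke-version.com/preview/57278/',
];

// Write to an in-memory stream so the example needs no file on disk.
$out = fopen('php://memory', 'w+');
fputcsv($out, $row, ',', '"', '\\'); // fields are quoted where needed
rewind($out);
$line = stream_get_contents($out);
fclose($out);

echo $line;

// Reading the line back yields the original three fields.
$parsed = str_getcsv(rtrim($line, "\n"), ',', '"', '\\');
```

For a real export you would pass a file handle opened with `fopen($path, 'w')` instead of the in-memory stream.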

PHP script

<?php
	$t = time();
	$m = memory_get_usage();

	const XML_FILENAME = 'karaokeversion_catalog_en_GBP.xml';
	
	$liner = new XMLReader();
	
	if($liner->open(XML_FILENAME)){
		$artistCount = 0;	//number of artists
		$songCount = 0;		//number of songs of all artists

		while($liner->read()){
			if($liner->nodeType === XMLReader::ELEMENT && $liner->name === 'artist'){
				
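				//expand the current artist element (with its whole subtree) into a DOM node
				$node = $liner->expand();
				if($node === false)
					continue;	//expansion failed (e.g. malformed XML), skip this element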
				
				//for each artist node found, assume unknown artist name, initialize it
				$artistName = '';
				
				//walk through this artist node's child nodes to find artist name and songs
				for($j = 0; $j < $node->childNodes->length; $j++){
					$nodeChild = $node->childNodes->item($j);
					if($nodeChild->nodeType === XML_ELEMENT_NODE && $nodeChild->nodeName === 'name')
						$artistName = $nodeChild->nodeValue;
					elseif($nodeChild->nodeName === 'songs'){
						
						//walk through this songs node's child nodes
						for($k = 0; $k < $nodeChild->childNodes->length; $k++){
							$nodeGrandChild = $nodeChild->childNodes->item($k);
							if($nodeGrandChild->nodeType === XML_ELEMENT_NODE && $nodeGrandChild->nodeName === 'song'){

								//for each song node found, assume unknown song details, initialize them
								$songName = '';
								$songPreview = '';
						
								//walk through this song node's child nodes
								for($l = 0; $l < $nodeGrandChild->childNodes->length; $l++){
									$nodeGrandGrandChild = $nodeGrandChild->childNodes->item($l);
									if($nodeGrandGrandChild->nodeType === XML_ELEMENT_NODE){
										if($nodeGrandGrandChild->nodeName === 'name')
											$songName = $nodeGrandGrandChild->nodeValue;
										elseif($nodeGrandGrandChild->nodeName === 'preview')
											$songPreview = $nodeGrandGrandChild->nodeValue;
									}
								}
								//add validation first here then format a new entry line to be stored somewhere
								if(!empty($artistName) && !empty($songName) && filter_var($songPreview, FILTER_VALIDATE_URL)){
									echo "$artistName, $songName, $songPreview\n";
									$songCount++;
								}
							}
						}
					}
				}
				$artistCount++;
			}
		}//end while
		
		$liner->close();
		
		//report: elapsed time and peak memory above the starting baseline
		$t = time() - $t;
		$m = memory_get_peak_usage() - $m;
		
		echo "\ntime spent: $t seconds.",
			"\nmemory usage: $m bytes.",
			"\nartist count: $artistCount.",
			"\nsong count: $songCount.";
	}
	else //if $liner->open fails
		echo 'error: cannot open XML file.';
?>
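The nested index loops above can be shortened. The following is a sketch of one alternative, not a replacement for the script: it jumps from one artist element to the next with `XMLReader::next()` and reads each expanded subtree through SimpleXML. The inline sample data, the `$rows` array and the variable names are assumptions for illustration, and it assumes every artist has a songs element.

```php
<?php
// Small inline sample in the same shape as the catalog (illustrative data only).
$xml = <<<XML
<artists>
  <artist id="2000">
    <name>The Solids</name>
    <songs>
      <song id="5022">
        <name>Hey Beautiful</name>
        <preview>http://www.karaoke-version.com/preview/57278/</preview>
      </song>
    </songs>
  </artist>
</artists>
XML;

$reader = new XMLReader();
$reader->XML($xml);       // for a real file: $reader->open($path)
$doc = new DOMDocument(); // owner document for the expanded nodes

$rows = [];

// Advance to the first <artist> element.
while ($reader->read() && $reader->name !== 'artist');

while ($reader->name === 'artist') {
    // expand() builds a DOM subtree for this one artist only.
    $artist = simplexml_import_dom($doc->importNode($reader->expand(), true));

    foreach ($artist->songs->song as $song) {
        $rows[] = [
            (string) $artist->name,
            (string) $song->name,
            (string) $song->preview,
        ];
    }

    // Jump to the next sibling <artist>, skipping the subtree just handled.
    $reader->next('artist');
}
$reader->close();

foreach ($rows as $r) {
    echo implode(', ', $r), "\n";
}
```

Only one artist subtree is held in memory at a time, as in the original script, while `next('artist')` avoids walking through child nodes that were already processed through the expanded copy.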

Demo

Click the following link to test: huge-xml-read (on a sister site).
