Tidying HTML source from PHP

In June 2011 I decided that leedsfans.org.uk, a site well known to Leeds United fans, had been down for far too long. It had been an invaluable resource on the history of the club, but the site’s host and admin, Jabba (Jon), had given up on the project for whatever reason, with no immediate intention of reviving it.
I found the last copy of the site on archive.org and promptly wrote a simple script to scrape all of the archived content into a folder so that I could at least re-host the static content. I did try to contact Jabba to see if I could take over the original domain, but unfortunately I’ve had no response. I then registered the most similar yet cheap domain name I could find (leeds-fans.org.uk) and put all the content I’d obtained back on the web so people could find it again.

Anyway, after looking through the source I decided to at least make the HTML valid until I get a chance to re-design and re-launch the site. This is relatively easy if libtidy is installed on your system and PHP has been compiled with tidy support.
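
To check that PHP really does have tidy support before running anything, you can ask it whether the extension is loaded. This is only a sketch: `tidyAvailable` is a name made up for this example and is not part of the script.

```php
<?php
// Hypothetical helper (not part of the original script): reports whether
// PHP's tidy extension is loaded and the tidy class can be instantiated.
function tidyAvailable() {
	return extension_loaded('tidy') && class_exists('tidy');
}
```

Calling it before `new tidy()` lets a script stop with a readable message instead of a fatal error about a missing class. From a shell, `php -m` lists the loaded modules as well.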

I wrote a PHP script to go through a folder of static X/HTML pages (and its sub-folders), run each page through libtidy, keep the doctype and save the result back to the file.

If such a script would be useful to you, here’s the source (also available as a gist on GitHub):

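<?php
	// Extension of the files to tidy; define TIDYDIR_EXTENSION beforehand to override.
	defined('TIDYDIR_EXTENSION') || define('TIDYDIR_EXTENSION', 'html');

	// Tidies every matching file in $directory, then recurses into sub-directories.
	// Requires PHP's tidy extension (libtidy).
	function tidyDir($directory) {
		$htmlFiles = glob($directory.DIRECTORY_SEPARATOR.'*.'.TIDYDIR_EXTENSION);
		if (false === $htmlFiles) {
			$htmlFiles = array();
		}
		$htmlTidy = new tidy();
		foreach ($htmlFiles as $entry) {
			$htmlContents = file_get_contents($entry);
			if (false === $htmlContents) {
				echo 'failed reading ',$entry,"\n";
				continue;
			}
			// Keep the leading doctype (and XML declaration, if any) so it can be
			// re-attached; anything else at the top of the file is left to tidy.
			$doctype = (preg_match('#\A\s*((?:<\?xml[^>]*\?>\s*)?<!DOCTYPE[^>]*>)#i', $htmlContents, $matches))
						? $matches[1]."\n"
						: '';
			$htmlTidy->parseString($htmlContents);
			// getStatus(): 0 = no problems, 1 = warnings, 2 = errors
			if (0 < $htmlTidy->getStatus()) {
				if ($htmlTidy->cleanRepair()) {
					// html() gives the <html> node only, so the doctype goes back in front
					$correctedHTML = $doctype.$htmlTidy->html()->value;
					echo 'saving ',$entry,"\n";
					// file_put_contents() returns false on failure (0 just means an empty file)
					if (false === file_put_contents($entry, $correctedHTML)) {
						echo 'failed saving ',$entry,"\n";
					}
				} else {
					echo 'FAILED TO CLEAN UP ',$entry,"\n";
					die;
				}
			} else {
				echo 'Goody, ',$entry,' is valid html ',"\n";
			}
		}
		// Recurse into sub-directories, skipping ".", ".." and hidden ones
		$d = dir($directory);
		if (false === $d) {
			return;
		}
		while (false !== ($entry = $d->read())) {
			if (0 !== strpos($entry, '.') && is_dir($directory.DIRECTORY_SEPARATOR.$entry)) {
				echo 'calling tidyDir on ',$directory,DIRECTORY_SEPARATOR,$entry,"\n";
				tidyDir($directory.DIRECTORY_SEPARATOR.$entry);
			}
		}
		$d->close();
	}
	tidyDir(__DIR__);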
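
The doctype step is the fiddly bit, since the script has to lift it out of the original file and put it back in front of the tidied markup. Pulled out on its own, a version that only matches a real doctype declaration might look like the sketch below. `extractDoctype` is a hypothetical name, and the pattern assumes the doctype (optionally preceded by an XML declaration) is the first thing in the file and has no internal subset containing `>`.

```php
<?php
// Hypothetical helper for illustration: returns the leading doctype
// (plus an XML declaration, if one precedes it), or '' when there is none.
function extractDoctype($html) {
	$pattern = '#\A\s*((?:<\?xml[^>]*\?>\s*)?<!DOCTYPE[^>]*>)#i';
	return preg_match($pattern, $html, $matches) ? $matches[1] : '';
}
```

A page that starts straight at `<html>` yields an empty string, so nothing gets duplicated when the result is prepended to tidy's output.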
