RawDev.net - Just another Zabreznik.si Sites site

Size of XKCD

Saturday, March 15th, 2008 by Marko Zabreznik

The last post was about the size of bash.org; this one is about xkcd, the famous comic site. A simple set of scripts gets you the whole archive and a few stats.
Use the script wisely: it puts a strain on the servers.

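#!/bin/bash
# Fetch xkcd pages 1..395 and the comic image each page links to.
mkdir -p xkcd comics
echo "Downloading 395 pages."
for i in $(seq 1 395);
do
	# Skip pages that were already saved (non-empty file).
	if [ -s "xkcd/$i" ]; then
		continue
	else
		echo -n "$(date +%H:%M:%S): Trying $i ..."
		lynx --source "http://xkcd.com/$i" > "xkcd/$i"
		echo -n " Done. Image:.. "
		# Pull the image file name out of the saved page.
		img=$(awk 'BEGIN{FS="<img src=\"http://imgs.xkcd.com/comics/";RS="\" title="}/<img/{print $2}' "xkcd/$i")
		# -P sets the download directory (-p would mean --page-requisites).
		wget -q -P "comics" -nH "http://imgs.xkcd.com/comics/$img"
		echo " Done."
		sleep 2s
	fi
done
echo "All done."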

This piece of code does something special: it uses awk to pull the file name of the comic image out of the saved page, then hands that name to wget to download the image.
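To see the awk trick on its own, here is a minimal sketch that runs the same awk program against a made-up snippet of markup. The `extract_image_name` function, the sample HTML and the file name `example_comic.png` are all hypothetical, just there for illustration. It also assumes an awk that treats a multi-character `RS` as a regular expression (gawk and mawk do).

```shell
#!/bin/bash
# Print the comic image file name found in a page read from stdin.
# RS cuts the input at '" title=', FS cuts the record at the image URL prefix,
# so field 2 of any record containing "<img" is the bare file name.
extract_image_name() {
	awk 'BEGIN{FS="<img src=\"http://imgs.xkcd.com/comics/";RS="\" title="}/<img/{print $2}'
}

# Made-up sample markup in the shape the script expects (not a real page).
sample='<div id="middleContent">
<img src="http://imgs.xkcd.com/comics/example_comic.png" title="Hypothetical hover text" alt="Example" />
</div>'

printf '%s\n' "$sample" | extract_image_name
```

With the sample above this should print just the file name, which is exactly what gets appended to the image URL for wget. If the page layout changes, nothing is printed, so it is worth checking for an empty result before downloading.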

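<?php
// $fajl holds one saved page; $vse counts every match across all pages.
$n=1;
$vse=0;
while ($n < 410) {
	// Skip pages that are missing from original/ instead of emitting a warning.
	$fajl=@file_get_contents("original/".$n);
	if ($fajl === false) {
		$n++;
		continue;
	}

	preg_match_all("|
<p class=\"quote\">(.*)<b>#(.*)</b>(.*)
<p class=\"qt\">(.*)

|Us", $fajl, $out);
	// $out[2] is the number after "#", $out[4] the text of the "qt" paragraph.
	$i=0;
	while (isset($out[0][$i])) {
		echo '('.$out[2][$i].")\n".$out[4][$i]."\n";
		echo $out[2][$i]."\n".$out[4][$i]."\n";
		$i++;
		$vse++;
	}
	$n++;
}
echo "\n(".$vse.")";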

That is the parser: it builds the final big file of everything and, as a side effect, makes the comments easy to read.
The comics make up most of the download, at ~22 MB.
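If you want to check the numbers on your own copy, something like the following would do. It is only a rough sketch: `count_files` is a helper name I made up, and it assumes the saved pages live in `xkcd/` and the images in `comics/`, as in the download script.

```shell
#!/bin/bash
# Count regular files under a directory (prints 0 if there are none).
count_files() {
	find "$1" -type f | wc -l | tr -d ' '
}

# Assumed layout: "xkcd" holds the saved pages, "comics" the images.
for dir in xkcd comics; do
	if [ -d "$dir" ]; then
		echo "$dir: $(count_files "$dir") files, $(du -sh "$dir" | cut -f1) on disk"
	fi
done
```

Directories that do not exist are silently skipped, so the snippet is safe to run before the download has finished.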

As usual, here is the download link: LINK (22 MB), or email me for the data.


This entry was posted on Saturday, March 15th, 2008 at 10:04 pm under Hacking, Scripting.
