Recently I decided to open-source one of the scripts I have been using.
Features:
- No MySQL needed,
- Admin page included, but limited to one user/password: adding and deleting files, adding folders,
- Lightbox,
- Automagic thumbnailing,
- You can also turn off administration and use FTP to upload images; thumbnails are created the first time a page with new images is accessed.
The last post was about the size of bash.org; this one is about xkcd, the famous comic site. A simple set of scripts gets you the whole set of comics and a few stats:
Use the script wisely; it is a strain on the servers.
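#!/bin/bash
# Fetch every xkcd page, then download the comic image each page references.
mkdir -p xkcd comics
echo "Downloading 395 pages."
for i in $(seq 1 395); do
    # Skip pages that were already fetched (file exists and is non-empty)
    if [ -s "xkcd/$i" ]; then
        continue
    fi
    echo -n "$(date +%H:%M:%S): Trying $i ..."
    lynx --source "http://xkcd.com/$i" > "xkcd/$i"
    echo -n " Done. Image:.. "
    # Pull the image file name out of the saved page
    img=$(awk 'BEGIN{FS="<img src=\"http://imgs.xkcd.com/comics/";RS="\" title="}/<img/{print $2}' "xkcd/$i")
    # -P sets the target directory (lowercase -p would mean --page-requisites)
    wget -q -P comics -nH "http://imgs.xkcd.com/comics/$img"
    echo " Done."
    sleep 2s
done
echo "All done."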
The awk/wget step does something special: awk takes the name of the image from the saved page, and wget downloads it.
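As a rough illustration of that extraction step, here is a minimal sketch that uses grep and sed instead of awk. The function name `extract_image_name` and the sample HTML line are made up for this example; the real page markup may differ, so treat it as an assumption rather than the exact method used above.

```shell
#!/bin/bash
# Hypothetical helper: read an xkcd page on stdin and print the file name
# of the first image hosted under imgs.xkcd.com/comics/.
extract_image_name() {
    grep -o '<img src="http://imgs.xkcd.com/comics/[^"]*"' \
        | head -n 1 \
        | sed 's|.*/comics/||; s|"$||'
}

# Made-up sample line standing in for a saved page
sample='<div><img src="http://imgs.xkcd.com/comics/example_comic.png" title="hover text" alt="Example" /></div>'
name=$(printf '%s\n' "$sample" | extract_image_name)
echo "$name"
```

The printed name could then be appended to the image base URL and handed to wget, as the script does.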
<?php
$n=1;
$vse=0;
while ($n < 410) {
unset ($fajl);
$fajl=file_get_contents("original/".$n);
preg_match_all("|
<p class=\"quote\">(.*)<b>#(.*)</b>(.*)
<p class=\"qt\">(.*)
|Us", $fajl, $out);
$i=0;
while (isset($out[0][$i])) {
echo '('.$out[2][$i].")\n".$out[4][$i]."\n";
echo $out[2][$i]."\n".$out[4][$i]."\n";
$i++;
$vse++;
}
$n++;
}
echo "\n(".$vse.")";
There is also a parser that builds one big final file of everything, which conveniently also makes the comments easy to read.
Comics make up most of the download, at ~22 MB.
And as usual, the download link: LINK (22 MB), or email me for the data.
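For anyone who wants to reproduce such numbers, a small sketch like the following could count the files in a directory and add up their sizes. The function name `dir_stats` and the throwaway demo directory are assumptions for illustration; on a real download you would point it at the `comics` directory.

```shell
#!/bin/bash
# Hypothetical helper: print "<file count> <total bytes>" for the regular
# files directly inside the given directory.
dir_stats() {
    local dir=$1 count=0 total=0 f size
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        size=$(wc -c < "$f")
        count=$((count + 1))
        total=$((total + size))
    done
    echo "$count $total"
}

# Demo on a temporary directory with two small files (3 + 5 bytes)
demo=$(mktemp -d)
printf 'abc' > "$demo/1"
printf 'hello' > "$demo/2"
stats=$(dir_stats "$demo")
echo "$stats"
rm -rf "$demo"
```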
I spent the last few hours on a simple question: how large is the world's largest IRC quote database (bash.org)?
I mean specifically the quotes themselves.
So first I had to get them all; a simple bash script was sufficient.
#!/bin/bash
echo "Downloading 409 pages."
for i in `seq 1 409`;
do
if [ -s "original/$i" ]; then
continue
else
echo -n "`date +%H:%M:%S`: Trying $i ..."
lynx --source "http://www.bash.org/?browse=$i" > "original/$i"
echo "Done."
sleep 10s
fi
done
echo "All done."
Please do not use that script; it is a strain on the bash.org servers. Instead, you can grab the original files at the end of the article.
After a couple of hours that was done, and I had my next script ready as well:
<?php
$n=1;
$vse=0;
while ($n < 410) {
unset ($fajl);
$fajl=file_get_contents("original/".$n);
preg_match_all("|
<p class=\"quote\">(.*)<b>#(.*)</b>(.*)
<p class=\"qt\">(.*)
|Us", $fajl, $out);
$i=0;
while (isset($out[0][$i])) {
echo '('.$out[2][$i].")\n".$out[4][$i]."\n";
echo $out[2][$i]."\n".$out[4][$i]."\n";
$i++;
$vse++;
}
$n++;
}
echo "\n(".$vse.")";
The last line is there to make sure I got all of them: 20440 at the time.
I ran it from the shell and redirected the output to “final”: php parser.php > final
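The same sanity check can be repeated on the output file itself. The sketch below assumes the format the parser prints, an "(id)" line before each quote and a final "(total)" line; the function names and the tiny demo file are made up for illustration. A quote that consists only of a number in parentheses would throw the count off.

```shell
#!/bin/bash
# Hypothetical helper: count the "(id)" header lines in the parser output,
# minus one for the trailing "(total)" line.
count_quotes() {
    local headers
    headers=$(grep -c '^([0-9][0-9]*)$' "$1")
    echo $((headers - 1))
}

# Hypothetical helper: read the total printed on the last line, without parentheses.
reported_total() {
    tail -n 1 "$1" | tr -d '()'
}

# Demo file mimicking the parser output for two quotes
demo=$(mktemp)
printf '(1)\nfirst quote\n(2)\nsecond quote\n\n(2)' > "$demo"
counted=$(count_quotes "$demo")
reported=$(reported_total "$demo")
echo "counted=$counted reported=$reported"
rm -f "$demo"
```

If the two numbers differ, some page was probably parsed incompletely.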
So, the conclusion: the size of bash.org is ~5 MB.
These are the files if you want them: link (or email me).