RawDev.net - Just another Zabreznik.si Sites site

Size of Bash.org

Saturday, March 15th, 2008 by Marko Zabreznik

I spent the last few hours on a simple question: how large is the world's largest IRC quote database (bash.org)?
I mean specifically the quotes themselves, not the rest of the site.

So first I had to get them all; a simple bash script was sufficient.

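#!/bin/bash
# Fetch every browse page of bash.org into original/, skipping pages already saved.
mkdir -p original
echo "Downloading 409 pages."
for i in $(seq 1 409); do
    # -s: the file exists and is not empty, so this page was already fetched
    if [ -s "original/$i" ]; then
        continue
    fi
    echo -n "$(date +%H:%M:%S): Trying $i ..."
    lynx --source "http://www.bash.org/?browse=$i" > "original/$i"
    echo "Done."
    # pause between requests to go easy on the server
    sleep 10s
done
echo "All done."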

Please do not use that script; it is a strain on the bash.org servers. Instead, you can grab the original files at the end of the article.
After a couple of hours that was done, and I had my next script ready as well:
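
The skip-if-present test in the download script also works as a dry run: it can report which pages are still missing without sending a single request. A minimal sketch, assuming the same original/ directory layout; missing_pages is a hypothetical helper name, not something from the script above.

```shell
# Print every page number from 1 to LAST that has no non-empty file in DIR.
# missing_pages is a hypothetical helper, not part of the download script.
# The body runs in a subshell ( ... ) so its variables do not leak out.
missing_pages() (
    dir=$1
    last=$2
    i=1
    while [ "$i" -le "$last" ]; do
        # -s is true only when the file exists and is larger than zero bytes,
        # the same test the download loop uses to decide whether to skip.
        [ -s "$dir/$i" ] || echo "$i"
        i=$((i + 1))
    done
)
```

Calling `missing_pages original 409` would list the pages a rerun still has to fetch; no output means every page is already on disk.
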

<?php
$n = 1;
$vse = 0; // running total of quotes found ("vse" is Slovenian for "all")
while ($n < 410) {
    unset($fajl);
    $fajl = file_get_contents("original/" . $n); // "fajl" = file

    preg_match_all("|

(.*)#(.*)(.*)

(.*)|Us", $fajl, $out);
    $i = 0;
    while (isset($out[0][$i])) {
        echo '(' . $out[2][$i] . ")\n" . $out[4][$i] . "\n";
        echo $out[2][$i] . "\n" . $out[4][$i] . "\n";
        $i++;
        $vse++;
    }
    $n++;
}
echo "\n(" . $vse . ")";

The last line is there to make sure I got all of them: 20440 at the time.
I ran it from the shell and redirected the output to “final”: php parser.php > final
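
As a rough cross-check on that count, the number of quotes can be compared with the number of pages downloaded. The sketch below is only arithmetic: it assumes each browse page holds 50 quotes (an assumption, not a figure stated in this post), and pages_needed is a hypothetical helper name.

```shell
# Ceiling division: how many pages are needed to hold TOTAL quotes
# when each page shows PER_PAGE quotes.
# pages_needed is a hypothetical helper; the 50-per-page figure used
# with it is an assumption about the site, not something measured here.
pages_needed() (
    total=$1
    per_page=$2
    echo $(( (total + per_page - 1) / per_page ))
)
```

Under that assumption, `pages_needed 20440 50` gives 409, which agrees with the 409 pages fetched by the download script.
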

So the conclusion is that the size of bash.org is ~5 MB.
These are the files if you want them: link. (or email me)
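
One way to arrive at a figure like this is to measure the output file directly. A minimal sketch, assuming the parser output was saved as "final" and taking 1 MB to mean 1024 * 1024 bytes; size_in_mb is a hypothetical helper name.

```shell
# Print the size of FILE in megabytes (1 MB = 1024 * 1024 bytes here),
# rounded to one decimal place.
# size_in_mb is a hypothetical helper, not part of the scripts in this post.
size_in_mb() (
    # wc -c reads from stdin so that it prints only the byte count
    bytes=$(wc -c < "$1")
    awk -v b="$bytes" 'BEGIN { printf "%.1f\n", b / (1024 * 1024) }'
)
```

It would be called as `size_in_mb final`.
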


This entry was posted on Saturday, March 15th, 2008 at 7:20 am under Hacking, Scripting.
2 Responses.
Mr Speaker
January 20th, 2009 at 3:49 am

Ha ha! 20440! That’s awesome… Now you need to get data mining and try to dig up some interesting insights!


vertjaars
April 21st, 2009 at 8:52 pm

Thanks for this, it’s nice to have something to do when my internet is down.

