The Web Crawler project is not dead. I’m fumbling computer hardware and software right now and making my best attempt at keeping up with school. Oh, and I have a job.
I have started hosting the project on Google’s servers so that other programmers may join me in this. I haven’t allowed access by anyone yet as I may not want help. The code, however, is available there (I think). Try looking at the rehbergwebindexer project if you wish. I don’t think there is even any code in the source file yet; just commenting.
I got my courage up Saturday and ordered the books from O’Reilly. This press has long been highly regarded by technologists, whether they are programmers, IT professionals, or just geeks. Go ahead – ask a geek if he/she has a camel book, and chances are they’ll know what you’re talking about (and it will be within reach). Don’t tell them what it is if they don’t know.
I’m posting this to chronicle my efforts to build a web crawler and eventually a search engine. I expect to make further posts about how this project develops, and perhaps what I’ve found in these books that helped.
I have ordered three books. I went there for one, but there’s always a deal to get three for the price of two, plus free shipping. And I can always find another book to get. So:
Perl & LWP. This one I’ve borrowed before, and it opened my eyes to the possibilities of automated web surfing using Perl. I built a small script one time that looked up my SMTP server’s IP at spamcop, then e-mailed me if my mail server was ever blacklisted. It was fun and quite easy, but since I can’t find that script right now I’ll have to post it later.
Spidering Hacks. I ordered this one for obvious reasons. This book’s excerpts is where I found that little bit on needing my spider registered. I expect to learn a lot and become very frustrated with what I find here.
Perl Cookbook. This was the third choice because I needed three. Also because it’s $50 and I could use the discount. There apparently is a series of “cookbooks” that have really cool stuff (recipes) in them. There is also the PHP Cookbook, the C# 3.0 Cookbook, and more. I expect to find shortcuts and things I’d never thought of in this book.
I’m taking a class right now on software requirements engineering (does one actually engineer the requirements, or did they just want to make this class sound hard?) and I came across something I might use with the web crawler project.
In the chapter about “The Software Process” which talks about the processes necessary for an individual or team to succeed at building a quality piece of software or system, I came across the Personal Software Process, or PSP. The book simply states that every developer has a process, whether anyone can see it or not. Either way, there is a proper way to go about producing software at a personal level, and here is the gist (Pressman, 2005, p.37):
Planning. This activity isolates requirements and, based on these, develops both size and resource estimates. In addition, a defect estimate (the number of defects projected for the work) is made. All metrics are recorded on worksheets or templates. Finally, development tasks are identified and a project schedule is created.
High-level design. External specifications for each component to be constructed are developed and a component design is created. Prototypes are build when uncertainty exists. All issures are recorded and tracked.
High-level design review. Formal verification methods… are applied to uncover errors in the design. Metrics are maintained for all important tasks and work results.
Development. The component level design is refined and reviewed. Code is generated, reviewed, compiled, and tested. Metrics are maintained for all important tasks and work results.
Postmortem. Using the measures and metrics collected (a substantial amount of data that shoul be analyzed statistically), the effectiveness of the process is determined. Measures and metrics should provide guidance for modifying the process to improve its effectiveness.
I’m not sure if what I’m doing will fit into this personal model of development, but it’s thought provoking. Even if I don’t collect data about what my problems might be and then analyze the data about what actually went wrong, I can still hold myself to some kind of process. Even though I don’t have a deadline or an antsy customer to deliver this to, I can possibly eliminate shortfalls if I just think it out before delving into code.
But then what fun would that be?
Reference (in our favorite APA format):
Pressman, R.S. (2005). Software engineering: A practitioner’s approach. New York: McGraw-Hill.
After toying with C# today, I’ve decided that it is way to process-intensive to write the application on a runtime environment like .NET or Java. What I need is a simple language that can download a page, rip through text like a bandit, write the necessary fields to the database, and move on. I can organize the data when the search engine extracts that data.
I can’t commit to anything yet, but my spidey-sense is telling me that the crawler will be written in Perl with LWP. I suppose I could look at Ruby, too, but I already have my Camel book and have worked with LWP before. I haven’t tied Perl to a RDBMS, but I have done it with PHP and it must be similar. Perl can also do some limited recursion from what I understand, and if it can’t I may can use a database back-end to save the stacks of URLs.
I was ready to buy books at O’Reilly today (I chickened out of spending the money) and found a book on writing spiders. From the preview I surmised my crawler/spider must be registered. That means I have to go mainstream, doesn’t it?
And now after some more reading, I have discovered that this crawler can be used to build an index for special purposes. I can build my own search engine for this site, for example, and get much better results than I can searching the Google index for benrehberg.com. I have searched for things I know I wrote about, but never found them with Google. Building my own search engine and maintaining my own index of the site can prove useful if I keep writing about programming.
Update: I have created a new label “Web Crawler” for all posts related to this project.
It seems a bit strange using the world’s best search engine to find out how to build your own. Google is my first resource in this project, though Google itself provides nothing but the idea. There is a paper at Stanford by Larry and Sergey, and that basically is the starting point. That is Google’s only contribution so far aside from the many searches I will perform.
There are three main parts to the search engine: the crawler, which tirelessly captures data from the web, the database to hold everything, and the actual search engine – the queries that put the data together in a meaningful format for you.
I could write a search engine that actually crawls the web looking for my search criteria, but that is very VERY inefficient. Google (and many others) have solved this inefficiency by effectively downloading the Web (that’s right – as much of it as they can) to their computers so it can search it much faster and have it available in one place. They’ve done a whole lot more to increase efficiency and effectiveness of searches, but downloading the web was the first thing they did. It turns out they needed a lot of computers.
I’m going to start with two. I have three desktops that no one wants to buy, and I am really tired of looking at them. I will probably need more if I get this index working soon, but there will be software considerations to make too. You can’t fit the web on one computer, no matter how big. I will learn a lot.
I have always had an interest in distributed systems and cluster computing, so this will be fun. I have a lot to learn about distributed databases and algorithm analysis. But all that is later – I haven’t even really finished thinking out the preliminaries yet. So one development/crawling machine, and one database machine. After I figure out how to crawl the web, I will begin work on performing searches. If this project holds my interest long enough, I might publish statistics at 49times.com, so keep looking. I will be posting here if I come up with anything worth publishing. I’m going to try to journal my progress and decisions without publishing code, but I realize that I very well could lose interest in this. If I get started, I will likely enjoy it and keep going, but no one can say. If you have some confidence that I will continue, you can subscribe to this blog and get the updates. Beware, though, that you’ll get everything else I write too.