How to Write a Search Engine

It seems a bit strange using the world’s best search engine to find out how to build your own. Google is my first resource in this project, though Google itself provides nothing but the idea. There is a paper at Stanford by Larry and Sergey, and that basically is the starting point. That is Google’s only contribution so far aside from the many searches I will perform.

There are three main parts to the search engine: the crawler, which tirelessly captures data from the web, the database to hold everything, and the actual search engine – the queries that put the data together in a meaningful format for you.
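To make those three parts concrete, here is a minimal sketch in Python. Nothing in it is my actual design: the `FAKE_WEB` dictionary is a made-up stand-in for real pages so the example runs without a network, the "database" is just an in-memory dictionary, and the `crawl` and `search` names are hypothetical.

```python
from collections import deque

# Hypothetical stand-in for the web: URL -> (page text, outgoing links).
FAKE_WEB = {
    "a.example/": ("search engines crawl the web", ["a.example/about", "b.example/"]),
    "a.example/about": ("about this crawler project", ["a.example/"]),
    "b.example/": ("a database holds the crawled pages", []),
}


def crawl(seed, fetch):
    """Crawler: follow links breadth-first from seed and return {url: text}."""
    database = {}          # the "database" part, here just a dict
    seen = {seed}          # URLs already queued, so no page is fetched twice
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:   # dead link: nothing to store
            continue
        text, links = page
        database[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return database


def search(database, term):
    """Search: return the stored URLs whose text contains the term as a word."""
    term = term.lower()
    return sorted(url for url, text in database.items()
                  if term in text.lower().split())


database = crawl("a.example/", FAKE_WEB.get)
print(search(database, "crawler"))
```

A real crawler would fetch over HTTP and respect robots.txt, and a real database would live on disk, but the division of labor is the same.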

I could write a search engine that actually crawls the web looking for my search criteria every time I ask, but that is very VERY inefficient. Google (and many others) have solved this inefficiency by effectively downloading the Web (that’s right – as much of it as they can) to their own computers, so they can search it much faster and have it all available in one place. They’ve done a whole lot more to increase the efficiency and effectiveness of searches, but downloading the web was the first thing they did. It turns out they needed a lot of computers.
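Once the pages are stored locally, the usual trick is to build an index once and answer every query from it. The sketch below shows a simple inverted index (word to set of URLs) in Python. It is only an illustration under my own assumptions: the `PAGES` data is invented, and `build_index` and `query` are hypothetical names, not anything from the Stanford paper.

```python
from collections import defaultdict

# Hypothetical pages that have already been downloaded: URL -> text.
PAGES = {
    "a.example/": "How to write a search engine",
    "b.example/": "A search for cheap desktops",
    "c.example/": "Write Ruby on Rails apps",
}


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def query(index, terms):
    """Return the URLs containing every word in terms (a simple AND query)."""
    words = terms.lower().split()
    if not words:
        return []
    # index.get avoids adding empty entries to the defaultdict on a miss.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)


INDEX = build_index(PAGES)
print(query(INDEX, "search engine"))
```

The point of the design is that a query only touches the index, never the web itself, so the expensive crawling happens ahead of time.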

I’m going to start with two. I have three desktops that no one wants to buy, and I am really tired of looking at them. I will probably need more if I get this index working soon, but there will be software considerations to make too. You can’t fit the web on one computer, no matter how big. I will learn a lot.

I have always had an interest in distributed systems and cluster computing, so this will be fun. I have a lot to learn about distributed databases and algorithm analysis. But all that is later – I haven’t even really finished thinking out the preliminaries yet. So one development/crawling machine, and one database machine. After I figure out how to crawl the web, I will begin work on performing searches. If this project holds my interest long enough, I might publish statistics at 49times.com, so keep looking. I will be posting here if I come up with anything worth publishing. I’m going to try to journal my progress and decisions without publishing code, but I realize that I very well could lose interest in this. If I get started, I will likely enjoy it and keep going, but no one can say. If you have some confidence that I will continue, you can subscribe to this blog and get the updates. Beware, though, that you’ll get everything else I write too.

Uh-Oh.

49times.com has been down since yesterday. You know it’s on that powerful box, and I think it took the grid down for a few seconds yesterday when traffic was high. I can’t tell yet, but I think there might have been as many as two simultaneous users, overloading the system.

Actually, the power blinked and I’m not home to restart it. Friday night is the soonest; I know you guys can’t wait, but we all need to suffer a little bit for the cause.

And I’m Spent…

It is working. After a long battle all day yesterday (and giving up on Apache), Ruby on Rails is running. The rest of my configuration is yet to be done (no database yet), but all in good time. Take nothing for granted: this is a very powerful server. Here are the specs (and yes, it is 2008):

Fedora Core 8
450MHz Pentium II
512MB RAM
10GB HDD

Should serve very well for the amount of traffic I expect at 49times.com.

Microsoft Takes a Swing

Microsoft Bids $44.6 Billion for Yahoo – Washington Post

This actually came as a surprise today. Microsoft buying Yahoo!? Preposterous!

That could change a lot about the web, but something smells. Microsoft is a huge company, and little old Google seems to have them grasping at straws to figure out how to catch up. I can look at any place on the web that Google produced and tell you exactly why their advertising makes so much sense. Why their e-mail works so well, and no user has to pay for it. Why they have so much available for free.

It’s simply because they have but one reason: “Google’s mission is to organize the world’s information and make it universally accessible and useful.” That’s it. There are no software revenue projections, nothing to give the shareholders a hard-on, nothing about providing a service at a premium. Just to organize the world’s information. This has been their goal since Sergey tried to download the Internet (or was it Larry? …not important). Sure, they sell stuff. Useful stuff (except the lava lamp) even, like the Google search appliance and SketchUp Pro. But that is not their main business driver by any means. They’re spending millions every year to scan books from libraries and make them searchable. They have a vast collection of scholarly journals and papers from around the world, and one may also search inside any of these.

This is just a taste of what Google is doing. They are successful because of something I don’t have a name for, but it is something along the lines of know what you’re doing and do no evil.

Most of you know that this blog is driven by the Blogger engine. I pay nothing for it, and it works much better than some of the other tools I’ve tried. Google bought it back in 2003, and it has improved a lot over the years without ever costing me anything. The funny thing is that there aren’t ads all over it. Sure, there are ads on this site, but I put them there. They aren’t some mandated advertisement to justify making my blogging free; it’s just free.

No one has mentioned so far that Gmail doesn’t inject ads into every message sent. It’s just regular e-mail. The advertisements shown on the side of the Gmail window are only text-based, and they’re usually relevant to what my e-mail is about. That way the advertising works. And if it works, people pay for it. So Google is king.

Microsoft and Yahoo! (and AOL, for that matter) are not, because their interfaces are so heavy and their ads are very annoying. Their infrastructures are likely not as efficient as Google’s (who else uses a hundred thousand servers?) and their business models are not in tune with what people want.

Okay, I changed my mind. Yahoo! and AOL are in the entertainment business. Microsoft is in the software business (I put that together all by myself!). Google, however, is in the search business. Does anyone remember when a Yahoo! search looked eerily familiar, as if the results of that search were the same as a Google search for the same string? It’s because they were. Yahoo! used to use the Google engine for its search until just a few years ago.

The business of the web is changing, but there’s no way I can predict anything. I only get feelings, and I’m usually wrong because business is definitely not one of my strong interests. If the major search engines are reduced to two, though, it could get pretty damn nasty by 2010. I’ll likely not keep up with this subject, but it may be interesting nonetheless.

Windows Home Server

I wrote earlier about my Dell PowerEdge SC440 server, which I couldn’t find a use for.

I would like to announce that my quest for a task is over. May I introduce Windows Home Server.

You can add one of these to your home network with ease, and it takes care of file duplication and backups automatically. In my opinion this is the best product for consumers that Microsoft has come up with in a very long time.

The architecture is based on Windows Server 2003 (the startup splash screen says so), a proven operating system that is widely used in the enterprise market. If you use multiple Windows systems and/or have an Xbox 360 that you want to access videos and pictures from, this is your answer. HP has them for about $599, but you’ll need a second hard drive for the duplication capability. The two-disk model is $749, and that’s two 500GB hard drives, plus room for two more. If that somehow is not enough, it accepts external drives as well.

If a disk fails, you’ll get a message. Simply replace the drive and start the server again. It automatically rebuilds the disk and balances the storage.

You don’t need a monitor or keyboard after it’s set up. Just put it in a closet and plug it into the network. Install the client software on each computer in the house, and go. Backups run at night, and everything placed on the file server is duplicated, if you so choose and the server has a second drive to duplicate onto.

You can even access your Home Server from afar. With a little bit of configuration, you can access your files on the server and even upload new ones from anywhere you might find yourself. I have already found this useful.

Rehberg Technology can also provide, configure, and install the Home Server.