Why Spiders Bite

Search engines don't work anymore, so these engineers are developing the next generation of crawlers

Web search engines like Lycos and AltaVista are great for indexing static HTML pages. Unfortunately, the idea of searching the Web with a "spider" and then building up a massive database is beginning to break down.

Search-engine companies are having technical problems keeping their massive databases current. Consider news-oriented sites like CNN, MSNBC, and Wired News. How can a Web spider find individual stories on these sites, when the pages are changing every hour?


Although search engines have been useful until now, there's a lot of information on the Web that simply can't be found with the traditional "spider" approach. The bad news: there are more index-resistant sites all the time.

Consider the NASDAQ Stock Market, which now offers nearly real-time stock quotes for every company listed on its exchange. You can look up the market value of Sun Microsystems. But you'd never find that page by searching for the words "Sun Microsystems," "NASDAQ," and "stock price" on any Web search engine, because the Sun quote page is dynamically generated.






AltaVista's director of engineering, Barry Rubinson, isn't interested in solving this problem - there are just too many monstrous databases out there to worry about. AltaVista has indexed 32 million pages of text, with links to 50 million pages, says Rubinson; Dow Jones News Retrieval has 400 million. "I know of 10 other databases out there" that are just as big, he says. "We would be 100 times as large as we are now if we tried to index every page out there."

Right now, being 100 times larger is a physical impossibility. AltaVista already has four terabytes of online storage, and it's just not possible with today's computer technology to build a database with 400 terabytes of spinning storage, says Rubinson.

Besides, Scooter (AltaVista's Web spider) is having a hard time just keeping up with Internet growth. I scanned my Web server's log files and found that Scooter visits each page on our site less than once a month. The logs show that Scooter generally doesn't even follow links; all it's doing is fetching pages that have been individually registered with AltaVista's register-a-URL feature. Perhaps this explains why AltaVista returns so many dead links, and why it does such a poor job of finding new stuff.
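
For the curious, that kind of log check is easy to reproduce. Below is a minimal sketch in Python, assuming an Apache-style access log and assuming that Scooter announces itself with the string "Scooter" somewhere in each line (in the hostname or user-agent field); neither assumption comes from the column, and your server's log format may differ.

    import re
    import sys
    from collections import Counter

    # Pull the requested path out of an Apache-style log line.
    REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+)')

    def spider_hits(logfile, marker="Scooter"):
        """Count hits per URL from any log line mentioning the spider."""
        hits = Counter()
        with open(logfile) as fh:
            for line in fh:
                if marker not in line:          # skip everything but spider visits
                    continue
                match = REQUEST_RE.search(line)
                if match:
                    hits[match.group(1)] += 1
        return hits

    if __name__ == "__main__":
        # Print the 20 pages the spider has fetched most often.
        for path, count in spider_hits(sys.argv[1]).most_common(20):
            print(f"{count:5d}  {path}")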

InfoSeek lead developer Chris Lindblad and chairman Steven T. Kirsch are looking for something better: a way to quickly figure out what's new on a Web site, as well as ways to discover what information might be hidden inside a database that resides on a remote server. Rather than develop their own proprietary solution, they're working with Hector Garcia-Molina at Stanford's digital libraries project to develop a standard that's based on the Web's HTTP protocol. The standard will both let spiders scan Web sites more intelligently and allow for distributed searches throughout the Net.

Kirsch is advocating a special file in the root directory of Web servers. This file, called sitelist.txt, would list all of the files on a Web server and give the times they were last modified. Such a file would make it easy for a spider to keep tabs on even the most complicated sites. It would eliminate the need for spiders to follow links (because the file would name every page on the site), and it would eliminate the need to pull down pages that hadn't changed (because the spider could check the modification dates and fetch only the pages that had).
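
The column doesn't spell out what sitelist.txt would actually look like, so here's a minimal sketch of one way a server might generate it: walk the document root and write one line per file with a URL path and a last-modified timestamp. The line format and the /var/www/html document root are my own guesses, not part of Kirsch's proposal.

    import os
    import time

    def write_sitelist(docroot, outfile="sitelist.txt"):
        """List every file under the document root with its last-modified time."""
        with open(os.path.join(docroot, outfile), "w") as out:
            for dirpath, _dirs, filenames in os.walk(docroot):
                for name in filenames:
                    if name == outfile:          # don't list the sitelist itself
                        continue
                    full = os.path.join(dirpath, name)
                    url_path = "/" + os.path.relpath(full, docroot).replace(os.sep, "/")
                    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                          time.gmtime(os.path.getmtime(full)))
                    out.write(f"{url_path} {stamp}\n")

    write_sitelist("/var/www/html")              # hypothetical document root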

Distributed searching is more complicated. Basically, InfoSeek is working on a way to hand off a search from one search engine to another. This way, when you searched for Sun Microsystems on InfoSeek, it could simply hand off the search to the NASDAQ Web site, and you'd get back a link to the database-generated HTML page. Unfortunately, distributed searching would solve only half the problem. The other half is distributing meta-information about the distributed search engines. Otherwise, every Web search on InfoSeek would have to be handed off to every single database on the Internet. Not only would that be terribly inefficient, it would be stupid: There's clearly no reason to search the Library of Congress card catalog when you're looking for a current stock quote.
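
To make the meta-information half concrete, here's a small sketch of one way it could work: keep a local registry describing what each remote database covers, and hand a query off only to the engines whose descriptions overlap it. The engine names, topic keywords, and matching rule are all hypothetical; nothing here is taken from InfoSeek's or Stanford's actual design.

    # A local registry describing what each remote database covers.
    # Engine names and topic words are made up for illustration.
    ENGINE_TOPICS = {
        "nasdaq.com": {"stock", "quote", "market", "nasdaq", "ticker"},
        "loc.gov":    {"book", "author", "library", "catalog", "isbn"},
        "cnn.com":    {"news", "story", "headline", "politics"},
    }

    def route_query(query, registry=ENGINE_TOPICS):
        """Return only the engines whose topic description overlaps the query."""
        words = set(query.lower().split())
        return [engine for engine, topics in registry.items() if words & topics]

    print(route_query("Sun Microsystems stock quote"))   # -> ['nasdaq.com']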

Lindblad is building support for this narrowed distributed search into the InfoSeek enterprise search engine. Consider, says Lindblad, "a big company spread out across the world: Lots of groups have set up Web servers. The company has a private network. They don't want this network all used up by a [spider] trying to index every Web site in the company, and they don't want it used up by people doing searches all over the country. The happy medium is to have people doing indexing on each one of the local Web servers and to build meta-indexes."

When you run a search, the meta-search engine first figures out which search engines around the company should get the request. It sends the query to each of them. And finally it assembles the answers and gives them to you in a digestible form.
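
Put together, the whole flow looks something like the sketch below: pick the relevant engines, fan the query out to them, and merge what comes back into one ranked list. The search_one() hand-off is just a stub, since the article doesn't describe the protocol InfoSeek would actually use on the wire.

    from concurrent.futures import ThreadPoolExecutor
    from urllib.parse import quote_plus

    def route_query(query):
        """Stand-in for the topic-routing step sketched earlier."""
        return ["nasdaq.com"]                    # hypothetical choice of engine

    def search_one(engine, query):
        """Placeholder for a real hand-off to a remote search engine."""
        return [{"engine": engine, "score": 1.0,
                 "url": f"http://{engine}/search?q={quote_plus(query)}"}]

    def meta_search(query):
        engines = route_query(query)             # step 1: pick the engines
        with ThreadPoolExecutor() as pool:       # step 2: fan the query out
            result_lists = list(pool.map(lambda e: search_one(e, query), engines))
        merged = [hit for hits in result_lists for hit in hits]
        return sorted(merged, key=lambda h: h["score"], reverse=True)   # step 3: merge

    print(meta_search("Sun Microsystems stock quote"))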


Few people realize how important search engines have become to any sense of order on the Web. Today they're all but indispensable. That's why solving these problems is critical to the future of the Web.

Simson Garfinkel


Illustration by Dave Plunkert

