Old 12-13-2004, 06:09 PM   #1 (permalink)
Newb Techie
 
Join Date: Nov 2004
Posts: 26
Google Searchengine Programming

Howdy,

I want to know how Google, WebCrawler and other search engines really work.
I have read around 10 websites, found on Google, about "how search engines work", and not a single one of them makes it clear whether it is the spider, the index or the search software that does the ranking according to the engine's ranking algorithm.
All they ever say is that a search engine has 3 pieces of software:
a) the spider
b) the index
c) the search system (search box, template, etc.)
The spiders crawl the web collecting webpages and forward them to the index, and then the search software searches the index for the sought keywords/phrases.
Also, some say that the spiders copy the whole website into the index. So, in other words, there are 2 copies of a website: one residing on the website owner's webserver and the other residing in the index of the search engine.
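
To make the three parts concrete, here is a minimal Python sketch of a spider, an index and a search system working on a tiny in-memory "web". Everything in it is an assumption for illustration: the FAKE_WEB data, the function names, and the choice of an inverted index (a word-to-pages map, a common textbook design). It is not a description of how Google or any real engine is built, and it does no ranking at all.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "a.example": ("cheap laptops and laptop bags", ["b.example", "c.example"]),
    "b.example": ("laptop reviews", ["a.example"]),
    "c.example": ("garden tools", []),
}

def crawl(start):
    """Spider: follow links from `start` and return {url: text} for each page reached."""
    seen, queue, pages = set(), [start], {}
    while queue:
        url = queue.pop(0)
        if url in seen or url not in FAKE_WEB:
            continue
        seen.add(url)
        text, links = FAKE_WEB[url]
        pages[url] = text
        queue.extend(links)
    return pages

def build_index(pages):
    """Index: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Search system: URLs containing every query word (unranked, sorted by URL)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

pages = crawl("a.example")
index = build_index(pages)
print(search(index, "laptop"))
```

In this sketch the index is a lookup structure built from the crawled text rather than one big file of raw HTML, which is what lets the search step find matching pages without rescanning every page.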
So now, from all this, I can only assume 3 possibilities for how a search engine works:

1.
The spider does not do any ranking.
All it does is visit a website, grab all its HTML code (copy the website) and then dump the HTML code into the index.
The index is nothing but a big text file (.txt, .html) on the search engine's webserver that keeps a full copy (HTML code) of each website.
The search system, when searching and finding links in the index, ranks them according to the search engine's ranking algorithm.
This means neither the spider nor the index is responsible for the ranking, because these 2 parts of the search engine are not taught the ranking algorithm.

OR

2.
The spider does the ranking according to the search engine's ranking algorithm.
It visits a website, grabs all its HTML code (copies the website) and finally dumps the HTML code into the index. When it dumps the copies of websites, it ranks them according to the search engine's algorithm.
The index is nothing but a big text file (.txt, .html) on the search engine's webserver that keeps a full copy (HTML code) of each website.
The search system, when searching and finding links in the index, does not do any ranking, because that has already been done by the spider when dumping the data into the index.
This means the spider is responsible for the ranking, and neither the index nor the search system is, because these 2 parts of the search engine are not taught the ranking algorithm.

OR

3.
The spider does not do any ranking.
All it does is visit a website, grab all its HTML code (copy the website) and then dump the HTML code into the index.
The index is not only a big text file (.txt, .html) on the search engine's webserver that keeps a full copy (HTML code) of each website, but also the system that does the ranking.
When it receives data from the spider, it ranks the links in its database according to the search engine's ranking algorithm.
The search system, when searching and finding links in the index, does not do any ranking.
Frankly, all it does is output a copy of certain parts of the index onto a searcher's screen.
This means neither the spider nor the search system is responsible for the ranking, because these 2 parts of the search engine are not taught the ranking algorithm.


So, which of the 3 assumptions above is correct?
onauc is offline   Reply With Quote
Old 12-14-2004, 02:50 AM   #2 (permalink)
Master Techie
 
Join Date: Mar 2004
Posts: 2,007

Well, it looks like you already know more than most about how search engines work... I wish I knew more so I could help you.
jaksback is offline   Reply With Quote
Old 12-14-2004, 04:53 AM   #3 (permalink)
Ultra Techie
 
Join Date: Jun 2004
Posts: 973

onauc,

AFAIK

I think option 1 is the closest, but not exact. The spider crawls the website and only fetches a snapshot of some of the pages [not the full website or its HTML code, as you say]. The Google servers receive this snapshot and index it. There are various algorithms, not just a search algorithm. I guess there will be a page-ranking algorithm that ranks the webpages based on various factors [backlinks, number of clicks, etc.] and re-indexes the pages.
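
As a rough illustration of ranking by backlinks, here is a simplified PageRank-style iteration in Python. It is only a sketch: the link graph, the function name link_scores, the damping value 0.85 and the iteration count are all assumptions for the example, and real engines combine many more signals than links.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Simplified PageRank-style scores for a graph {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # Ignore links to unknown pages; a page with no usable links
            # spreads its score evenly over all pages.
            targets = [t for t in outgoing if t in links] or pages
            share = damping * score[page] / len(targets)
            for t in targets:
                new[t] += share
        score = new
    return score

# Hypothetical link graph: a and b link to each other, c links to a.
links = {
    "a.example": ["b.example"],
    "b.example": ["a.example"],
    "c.example": ["a.example"],
}
scores = link_scores(links)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A score like this depends only on the link graph, not on what the searcher typed, so it can be computed ahead of time after a crawl and stored next to the index.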
intercodes is offline   Reply With Quote
Old 12-15-2004, 03:19 PM   #4 (permalink)
Newb Techie
 
Join Date: Nov 2004
Posts: 26

Quote:
Originally posted by intercodes
onauc,

AFAIK

I think option 1 is the closest, but not exact. The spider crawls the website and only fetches a snapshot of some of the pages [not the full website or its HTML code, as you say]. The Google servers receive this snapshot and index it. There are various algorithms, not just a search algorithm. I guess there will be a page-ranking algorithm that ranks the webpages based on various factors [backlinks, number of clicks, etc.] and re-indexes the pages.

Ah! But still, which part of the search engine does the ranking?
Certainly not the spider, as it only grabs links and their descriptions; certainly not the index, as it is a plain, simple cache; and certainly not the query interface, because if the Perl/PHP (ranking algorithm code) resided in the HTML of the search page, then people would right-click and view the source code of the ranking algorithm.
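
One point worth checking here: server-side scripts such as Perl or PHP run on the webserver, and the browser only receives the HTML they output, so "view source" would not show the ranking code. The Python sketch below illustrates that arrangement under stated assumptions: the INDEX and LINK_SCORE data, the function names and the idea of sorting matches by a precomputed score are all hypothetical and not taken from any real engine.

```python
import html

# Hypothetical data kept on the server; the browser never receives it.
INDEX = {
    "laptop": {"a.example", "b.example"},
    "garden": {"c.example"},
}
LINK_SCORE = {"a.example": 0.49, "b.example": 0.46, "c.example": 0.05}

def rank(query):
    """Runs on the server: find pages matching every word, best score first."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(INDEX.get(w, set()) for w in words))
    return sorted(hits, key=lambda url: LINK_SCORE.get(url, 0.0), reverse=True)

def render_results(query):
    """Builds the only thing sent to the browser: a plain HTML list of results."""
    items = "".join(f"<li>{html.escape(url)}</li>" for url in rank(query))
    return f"<ol>{items}</ol>"

page = render_results("laptop")
print(page)
```

In this arrangement the ranking happens when the query arrives, using data prepared earlier, and the searcher sees the ordered results but none of the code or scores behind them.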
onauc is offline   Reply With Quote
Old 12-15-2004, 04:25 PM   #5 (permalink)
Monster Techie
 
Join Date: Jul 2004
Posts: 1,848

What do you plan on doing with this extra knowledge of how search engines work?

-Dan The Man
OIDanTheManIO is offline   Reply With Quote
Old 12-16-2004, 07:36 PM   #6 (permalink)
Newb Techie
 
Join Date: Nov 2004
Posts: 26

Quote:
Originally posted by OIDanTheManIO
What do you plan on doing with this extra knowledge of how search engines work?

-Dan The Man
I intend to create a spider that will crawl my sponsor's links, show their products and compare their prices for searchers like you.
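
A price-comparison spider along those lines might be sketched as below. This is only an illustration: the shop names, the page markup (an li element with class "product" and data-name/data-price attributes) and the helper names are invented for the example, and real sponsor pages would need their own parsing rules.

```python
from html.parser import HTMLParser

class ProductParser(HTMLParser):
    """Collects (name, price) pairs from the hypothetical markup
    <li class="product" data-name="..." data-price="..."></li>."""

    def __init__(self):
        super().__init__()
        self.products = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "li" and attr.get("class") == "product":
            self.products.append((attr["data-name"], float(attr["data-price"])))

# Hypothetical pages a spider might have fetched from two sponsors.
SPONSOR_PAGES = {
    "shop-one.example": '<ul><li class="product" data-name="USB cable" data-price="4.99"></li></ul>',
    "shop-two.example": '<ul><li class="product" data-name="USB cable" data-price="3.49"></li>'
                        '<li class="product" data-name="Mouse" data-price="12.00"></li></ul>',
}

def compare_prices(pages):
    """Return {product name: [(price, shop), ...]} with the cheapest offer first."""
    offers = {}
    for shop, markup in pages.items():
        parser = ProductParser()
        parser.feed(markup)
        for name, price in parser.products:
            offers.setdefault(name, []).append((price, shop))
    return {name: sorted(found) for name, found in offers.items()}

comparison = compare_prices(SPONSOR_PAGES)
for name, found in comparison.items():
    print(name, found)
```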
onauc is offline   Reply With Quote
All times are GMT -5.

