Douglas Bunger: Seeding and Spamming Of Internet Search Engines

Douglas Bunger http://dbunger.tripod.com bunger@home.com

How To Increase Internet Search Engine Hits
By Optimizing HTML Coding And Webpage Design

********************
Please visit my updated page at http://www.dougbunger.com.
********************

Everyone wants their website to be at the top of the list when a user executes an Internet search. In pursuit of this same goal, I have been researching how search engines function and the most effective methods of seeding them. Presented here (in no particular order) are some of my findings.

These links will lead you to the ADD URL page for the specified search engines.
The engines' sister sites are listed in parentheses.

Of the sites referred to as search engines, only half actually are. The other half are hierarchical directories. The most familiar directories are Yahoo, Excite, Lycos, and Infoseek.

The difference between a search engine and a directory is that the directory allows the user to 'climb the abstraction ladder.' This is to say that the user is provided with a set of general categories to choose from. After a category is selected, the sub-categories of that category are presented for selection. This process continues until the user has reached the top of the ladder. Consider this example:
Computers, Software, Operating System, OS/2, Configuration, How to
When you climb the ladder (left to right), you are getting more specific. When you go right to left, you are becoming more general.
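
The ladder can be pictured as a nested lookup. The following is only a minimal sketch: the tree holds nothing but the example path above, and the names DIRECTORY and subcategories are hypothetical, not taken from any real directory.

```python
# Hypothetical category tree; it contains only the example path from the text.
DIRECTORY = {
    "Computers": {
        "Software": {
            "Operating System": {
                "OS/2": {
                    "Configuration": {
                        "How to": {},
                    },
                },
            },
        },
    },
}


def subcategories(tree, path):
    """Return the sorted sub-categories found after following `path`."""
    node = tree
    for name in path:
        node = node[name]  # raises KeyError if the category does not exist
    return sorted(node)


# Each extra step in the path is one rung more specific.
print(subcategories(DIRECTORY, []))
print(subcategories(DIRECTORY, ["Computers", "Software"]))
```

An empty result means the user has reached the last rung and is looking at documents rather than further categories.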

A search engine, on the other hand, takes a user query and slams through its database until it finds matches for the terms. This requires that the user have a good idea of what they are looking for. The most popular search engines are AltaVista, Webcrawler, and Hotbot.

In this discussion I will use the term search engine to refer to both types of systems. This is appropriate, as all the directory systems employ search engine technology to toss a user directly into relevant categories. The primary difference at this point is that the real search engines are searching their mirrors of documents, whereas the directories are searching their categories.

Most search engines retrieve documents via a text-only spider. When the HTML code is downloaded, only the ALT parameter of an embedded image tag is relevant. The spider will not fetch graphics, as most documents are not human reviewed.

I attempted searches to determine whether this alternate text was indexed by inputting "hey dude", which I knew to be included under an image on my welcome.htm page. This search on AltaVista yielded over 62,000 documents. By adding "nerd" (randomly selected from the body) to the search, I found a reference to my page dated 04 May, 1997.

This test leads me to feel that use of the ALT parameter would be wise for all images.
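
What a text-only spider might keep from a page can be sketched with Python's standard html.parser module. This is an illustration under assumptions, not how any particular engine is built; the class name TextOnlySpider, the helper indexable_words, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser


class TextOnlySpider(HTMLParser):
    """Collects the words a text-only indexer might see: page text plus ALT text."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_starttag(self, tag, attrs):
        # The image itself is never fetched; only its ALT text is kept.
        if tag == "img":
            for name, value in attrs:
                if name == "alt" and value:
                    self.words.extend(value.lower().split())

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def indexable_words(html):
    """Parse `html` and return the lower-cased words that were collected."""
    spider = TextOnlySpider()
    spider.feed(html)
    spider.close()
    return spider.words


# Hypothetical page: one paragraph and one image with alternate text.
page = '<P>Welcome, nerd.</P><IMG SRC="me.gif" ALT="Hey dude">'
print(indexable_words(page))
```

An image with no ALT parameter contributes nothing to the word list, which is the practical argument for always supplying one.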

As stated above, most search engines operate in a text-only mode. This means that the engine cannot resolve frames documents. It seems the engines will read and catalogue the first frame loaded, but will not reference the frames after that. Not being a 'fan of frames', I have no interest in establishing a frames document to test with. For those who must use frames (per guidelines from higher authorities), I would recommend that you load the main page frame first, not the index frame, and pack it with as much raw data as possible.

When an engine is executing a search, it is not only looking for the number of occurrences of a particular word, but also for its relationship to the other search arguments. If searching for a single term, the engine has no option other than to count occurrences.

Using the term Cannon Launched Guided Projectile and its acronym, CLGP, as a control value, I have discovered several interesting aspects of search engine operation.

On AltaVista, a search on CLGP (July 7, 1997) yielded 39 hits. Several of these hits led to foreign language documents where clgp is a Spanish or Czech diphthong. My article, Lasers on the Modern Battlefield, contains this term three times in the text, and appears as item number eight.

By inputting "cannon launched guided projectile" (note use of quotation marks), the search yields four hits, with my article at number four. By loading all documents, counting the occurrences of the data and viewing the HTML source, I found that the search term only appeared once in each doc. Furthermore, none of the articles used the META keywords function.

As I researched these documents, I began to wonder why my main page, welcome.htm, did not show in the search. I tried the search using a comma in the argument ("cannon launched, guided projectile") and by capitalizing the first letter of each word. The search yields the same four hits, indicating input parameters are not case sensitive or punctuation dependent.
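
One way an engine could arrive at this behaviour is to normalize every query before matching. The sketch below is an assumption about the general technique, not AltaVista's actual code; normalize_query is a hypothetical name.

```python
import string


def normalize_query(query):
    """Lower-case the query and drop punctuation, so variants compare equal."""
    table = str.maketrans("", "", string.punctuation)
    return query.lower().translate(table).split()


plain = normalize_query("cannon launched guided projectile")
comma = normalize_query("cannon launched, guided projectile")
capitals = normalize_query("Cannon Launched Guided Projectile")
print(plain == comma == capitals)
```

All three variants reduce to the same list of words, so they would retrieve the same documents.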

Each of the docs should have carried the same weight, but mine was the last item. My article appeared as the last hit because of the way AltaVista executes its sort: it lists articles with equal weight in alphabetic order based on the first line of text in the document.
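
The tie-break described here amounts to a two-part sort key. In this sketch only the title of my own article comes from the search; the other three first lines, the weights, and the function name rank are made up for illustration.

```python
def rank(hits):
    """Sort hits by weight (highest first), then alphabetically by first line."""
    return sorted(hits, key=lambda hit: (-hit["weight"], hit["first_line"].lower()))


# Four hits of equal weight; three of the first lines are hypothetical.
hits = [
    {"first_line": "Lasers on the Modern Battlefield", "weight": 1},
    {"first_line": "Artillery Notes", "weight": 1},
    {"first_line": "Guided Munitions Primer", "weight": 1},
    {"first_line": "Field Manual Excerpt", "weight": 1},
]

for position, hit in enumerate(rank(hits), start=1):
    print(position, hit["first_line"])
```

With every weight equal, a first line beginning with "L" can only come first if nothing earlier in the alphabet is competing with it.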

The reason my main page is not on the list is that both the acronym and the term appear within a tag. I placed the information on my top page in the form of an alphabetic index, but used the terms in the text of the link. If the search engine were to catalogue data within tags, a huge percentage of its information would be millions of URLs.

To increase my odds, the HTML line that reads:
<A HREF="docs/laser.htm"> CLGP- Cannon Launched, Guided Projectile</A>
needs to be changed to move the text outside of the tag:
<A HREF="docs/laser.htm"></A> CLGP- Cannon Launched, Guided Projectile
Unfortunately, this does not provide a target for the user to select. It is considered bad form to use 'click here', yet this would solve the problem. The link could contain an embedded GIF in the form of a bullet, but displaying the graphics would delay page loading. This presents us with a classic trade-off situation. To test the ability to manipulate the search engine, I will recode the HTML with a text bullet:
<A HREF="docs/laser.htm">[_]</A> CLGP- Cannon Launched, Guided Projectile
This will provide the user with a target and load quickly, but will not be as 'pretty' as a graphical bullet. This is acceptable when considering that the majority of personal webpage visitors are using low bandwidth modems. A search for CLGP on Webcrawler, however, finds the top document (welcome.htm) and not the article (laser.htm). This would indicate that Webcrawler can look inside tags, but raises the question as to why the actual article did not display.
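
The behaviour suspected of AltaVista, where text between the opening and closing link tags is not catalogued, can be sketched as follows. This is a guess at the effect rather than a description of any engine's internals; SkipLinkText and words_outside_links are hypothetical names.

```python
from html.parser import HTMLParser


class SkipLinkText(HTMLParser):
    """Collects text, ignoring anything between <A ...> and </A>."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of anchor tags currently open
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Only text seen while no anchor is open is catalogued.
        if self.depth == 0:
            self.words.extend(data.split())


def words_outside_links(html):
    """Return the words of `html` that are not part of any link text."""
    parser = SkipLinkText()
    parser.feed(html)
    parser.close()
    return parser.words


inside = '<A HREF="docs/laser.htm"> CLGP- Cannon Launched, Guided Projectile</A>'
outside = '<A HREF="docs/laser.htm"></A> CLGP- Cannon Launched, Guided Projectile'
print(words_outside_links(inside))
print(words_outside_links(outside))
```

Under this assumption the original line contributes no words at all, while the recoded line contributes the full term. An engine such as Webcrawler, which appears to read link text, would simply skip the depth check.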

AltaVista uses a method known as word weighting. A word in a document's title is assigned more importance than the same word in the body. The keywords META function and the first line of the body also carry additional importance. The actual text of the document has the lowest rank.
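
Word weighting of this kind can be sketched as a weighted count. Only the ordering (title highest, body lowest) comes from the description above; the numeric weights, the field names, and the function name score are assumptions made for the example.

```python
# Hypothetical weights: the ordering follows the text, the numbers are made up.
WEIGHTS = {"title": 4, "keywords": 3, "first_line": 3, "body": 1}


def score(document, term):
    """Sum weight * occurrences of `term` over each field of `document`."""
    term = term.lower()
    total = 0
    for field, weight in WEIGHTS.items():
        words = document.get(field, "").lower().split()
        total += weight * words.count(term)
    return total


# Two hypothetical documents: one names the term in its title, one in its body.
in_title = {"title": "CLGP overview", "body": "general notes"}
in_body = {"title": "Lasers", "body": "the clgp round and the clgp seeker"}

print(score(in_title, "clgp"), score(in_body, "clgp"))
```

With these made-up numbers, a single use in the title outranks two uses in the body.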

Some engines do not actually catalogue subordinate documents, but resolve links to their root document. This means that adding a page to a website does not guarantee it will appear in a search. Even if the document is entered directly into the engine via a submit feature, the engine may still check its database for matching URL prefixes. The only way to truly defeat this feature is to have multiple Internet accounts and only store one document on each account.
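
A prefix check of this sort might look like the sketch below, which uses urlsplit from Python's standard urllib.parse module. Treating the host name as the "root" is an assumption, as are the example.com addresses and the names root_of and is_already_covered.

```python
from urllib.parse import urlsplit


def root_of(url):
    """Reduce a URL to scheme and host, the assumed 'root document'."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}/"


def is_already_covered(url, catalogue):
    """True if some catalogued URL shares the same root as `url`."""
    return root_of(url) in {root_of(known) for known in catalogue}


# Hypothetical catalogue holding only the top page of one account.
catalogue = ["http://www.example.com/welcome.htm"]

print(is_already_covered("http://www.example.com/docs/laser.htm", catalogue))
print(is_already_covered("http://other.example.com/laser.htm", catalogue))
```

The first submission shares a root with a catalogued page and might be folded into it, while the second, stored on a different account, would be treated as new.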

Along these same lines, the engine may also reference the document back to its parent document by way of links that point to a document in a higher directory. This could easily be avoided by not including links that loop back to the parent document, and instead having the user execute a 'back' on their browser. In the case of a dedicated server, the user could be pushed back. A Java applet could also handle this task.

Engines are beginning to use anti-spam techniques to filter unsavory documents. The simplest of these is the repetitive word search. There are no occasions in proper English for the same word to appear three times in a row. Therefore, any time a document uses repeat word spamming, it is completely ignored. Reportedly, if the META keywords function contains the same word seven times, the entire line is ignored, although this is engine behavior rather than part of the HTML 3.2 specification, which places no such limit on META content.
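
The repetitive word check is simple enough to sketch. The threshold of three in a row comes from the paragraph above; the function name has_repeat_spam and the way words are split are assumptions.

```python
import re


def has_repeat_spam(text, run_length=3):
    """True if any word appears `run_length` or more times in a row."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    streak = 1
    for previous, current in zip(words, words[1:]):
        # Extend the streak on a repeat, otherwise start counting again.
        streak = streak + 1 if current == previous else 1
        if streak >= run_length:
            return True
    return False


print(has_repeat_spam("cars cars cars for sale"))
print(has_repeat_spam("I said that that was fine"))
```

A doubled word, which does occur in ordinary English, passes the check. Only a run of three or more trips it.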

This raises an interesting question: if a document contains more than one keyword META statement, are the subsequent statements ignored?

Webcrawler's documentation indicates that it executes its search on a "breadth first" basis. Most students of information retrieval science or artificial intelligence are familiar with the many proofs that show this is not as efficient as a "depth first" search. It is important to realize that Internet search engines are not in existence for the benefit of mankind, but as a business venture. Each would prefer you use their service, so they optimize their engines for speed, not accuracy. It is more important to them that you execute multiple queries, thus reloading their advertising. It is fair, however, that the burden of an effective search falls on the user. Someone who executes a search on "car" deserves the 1.75 million hits they get, when they can't possibly wish to read every article in existence.
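
For readers unfamiliar with the two terms, the sketch below shows only the order in which each strategy would visit a small set of linked pages. The link graph and page names are hypothetical, and nothing here is taken from Webcrawler's documentation.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "welcome": ["docs", "links"],
    "docs": ["laser", "radar"],
    "links": ["friends"],
    "laser": [],
    "radar": [],
    "friends": [],
}


def breadth_first(start, links):
    """Visit every page one link away before any page two links away."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


def depth_first(start, links):
    """Follow the first link as far as it goes before trying the next one."""
    seen, order, stack = set(), [], [start]
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Push in reverse so the first link on the page is followed first.
        stack.extend(reversed(links.get(page, [])))
    return order


print(breadth_first("welcome", LINKS))
print(depth_first("welcome", LINKS))
```

Both strategies reach the same pages. They differ only in which pages are reached early, which matters when a crawl is cut short.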

Certain words are immediately filtered from a search query because of the sheer volume of their occurrence. My research has indicated that in English, there are two hundred words that comprise 40% of all words used in a document. Words such as "the" or "is" should never be used as search arguments. Words classified as pronouns, articles, interrogatives, and conjugations of "be" are on the list. Conjunctions should be used only as boolean search parameters.
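
Filtering of this kind can be sketched in a few lines. The word list below is a small hypothetical sample standing in for the two hundred words mentioned above, and strip_stop_words is an illustrative name.

```python
# A small illustrative sample; a full list would hold about two hundred words.
STOP_WORDS = {
    "the", "a", "an",               # articles
    "he", "she", "it", "they",      # pronouns
    "who", "what", "where", "why",  # interrogatives
    "is", "are", "was", "be",       # conjugations of "be"
    "of", "to", "in",
}


def strip_stop_words(query):
    """Drop the high-volume words from a query before searching."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


print(strip_stop_words("What is the range of a CLGP"))
```

Only the words that actually narrow the search survive, so nothing is gained by typing the others.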

The usefulness of adverbs is debatable. Some engines include what they call "fuzzy logic" and allow the use of words like "very." This is actually a natural language search that places more weight on the modified word than on the other arguments. A search on "very red apples" would weight the word red, and may yield an article about the sunset in Borneo. Even executing a relative position search would be of no help, as the author may have used the term "extremely red apples." This requires the engine be programmed to search not simply for "very," but for all its synonyms. This would grind most servers to a halt.

Another search billed as fuzzy logic is actually a phonetic search. In this search, "stashun" would successfully yield "station." Again, this places additional demands on the server, demands that engine operators are not prepared to accept.

A few of the search engines provide technical information about themselves for people like myself, who (in the words of one of the engines) are 'terminally curious.'

Excite's Information Page
Excite's Technical Page (highly recommended reading)
AltaVista's Technology Page (a lot of bragging about hardware)
AltaVista's 'Why We're Better' Page (a nice comparative chart)
Webcrawler's 'How It Works' Page
