Ever come across a listing in Google or Yahoo and think, “Uhhh, I didn’t want that page indexed”? I have.
At the college where I work, we had some files left over on a www4 server from our old setup, and many of them were linked to from our internal portal. Many of the files (if not all of them) were not meant for general consumption and in fact contained sensitive information. We are planning to conduct a thorough content inventory, but that can be a topic for another blog post.
So how was I going to get my information out of the search engines? Well, my first step was to modify my robots.txt file. My revised robots.txt file looked like this:
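```
User-agent: *
Disallow: /
```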
The first line, User-agent: *, indicates that the rule applies to all bots/spiders. The second line, Disallow: /, tells them not to index anything from the web root on down.
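If you want a quick sanity check that those two lines really do block everything, here's a minimal sketch using Python's standard urllib.robotparser module (the example URL is made up):

```python
from urllib.robotparser import RobotFileParser

# Feed the parser the same two-line robots.txt described above
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# Disallow: / under User-agent: * blocks every path for every crawler,
# so can_fetch() returns False for any URL on the site
print(rp.can_fetch("*", "http://www.example.edu/secret.html"))  # False
```

Keep in mind that robots.txt is only a polite request; well-behaved crawlers honor it, but it is not an access control.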
That’s all fine and dandy for preventing the search engines from indexing the site in the future, but how was I going to remove the listings that were already in the search engines? Luckily Google, Yahoo, MSN (and maybe others) have mechanisms in place for removing pages that are already indexed. In my experience, Google had the best system for URL removal. The links for Google, Yahoo and MSN are below.