Bypass robots.txt

Gateux · **Posted:** March 6th, 2012, 7:40 am

I was trying to view cached pages of a particular website using the Wayback Machine at archive.org. I was able to view the cached homepage, but I could not open any of the internal links on it, because I get an error message saying

"Page cannot be crawled or displayed due to robots.txt. See domains.googlesyndication.com robots.txt page. Learn more about robots.txt."

I googled robots.txt and saw that webmasters can use it to disallow web crawlers from crawling particular pages. However, the website whose cached pages I'm trying to view is no longer alive, meaning it is no longer in service. How do I bypass the robots.txt thingy on the Wayback Machine, since the website is already dead??

Funny thing is, I tried googling the website (e.g. domain.com), but after browsing through page after page of results I could not find one that links directly to the website itself. I was hoping to try my luck with Google Cache to see if I could get something.

How do I bypass the annoying robots.txt?? Or is there a direct link to Google Cache where we could enter the URL and see Google's cached copy of the website??

An example Google Cache link looks like this: http://webcache.googleusercontent.com/s ... clnk&gl=sg

I was playing around with Google Cache by replacing runevillage.com with domain.com and runevillage with websitename in that link, but I still get redirected to the Runevillage Google Cache page.

Magicana Drofulcus · **Posted:** March 6th, 2012, 5:38 pm

I'm guessing that if the website had a robots.txt, it prevented the pages from being crawled in the past, so there is no cached copy?

Ventrue · **Posted:** March 8th, 2012, 8:13 pm

I'm pretty certain that you cannot bypass this. Cam is right on this one - there is no file *to* bypass. If you don't want spider bots, such as the Wayback Machine's crawler, to index your page and cache it, you specify so in robots.txt. A cache-based website that ignored it would be intrusive.
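
To illustrate how a well-behaved crawler reads those rules, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `is_allowed` helper, and the bot name `SomeOtherBot` are made up for illustration, and `ia_archiver` is assumed here to be the user-agent name the archive's crawler answers to. It only shows how the rules are evaluated; it is not how archive.org itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the archive crawler everywhere,
# and block every other bot from /private/ only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ia_archiver", "http://domain.com/page.html"),
        ("SomeOtherBot", "http://domain.com/page.html"),
        ("SomeOtherBot", "http://domain.com/private/x.html"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

A crawler that honours the file simply skips any URL for which this check fails, which is why there is nothing on the visitor's side to bypass.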

Gateux · **Posted:** March 11th, 2012, 8:40 am

I guess I've got no choice then, thanks peeps
