All times are UTC - 6 hours





 Post subject: Bypass robots.txt
PostPosted: March 6th, 2012, 7:40 am 
Wizard
I was trying to view cached pages of a particular website using the Wayback Machine at archive.org. Although I was able to view the cached copy of the homepage, I could not open any of the internal links from that cached page, because I get an error message saying:

"Page cannot be crawled or displayed due to robots.txt. See domains.googlesyndication.com robots.txt page. Learn more about robots.txt."

I googled robots.txt and saw that webmasters can use it to disallow web crawlers from crawling particular pages. However, the website whose cached pages I'm trying to view is no longer alive, meaning it is no longer in service. How do I bypass the robots.txt restriction on the Wayback Machine, since the website is already dead?

Funny thing is, I tried googling the website (e.g. domain.com), but after browsing through page after page of results I could not find a single result that links directly to the site itself. I was hoping to try my luck with Google Cache to see if I could get something.

How do I bypass the annoying robots.txt? Or is there a direct link to Google Cache where we could enter a URL and see Google's cached copy of the website?

An example Google Cache link looks like this: http://webcache.googleusercontent.com/s ... clnk&gl=sg

I played around with that Google Cache link, replacing runevillage.com with domain.com and runevillage with websitename, but I still get redirected to the Runevillage Google Cache page.


 Post subject: Re: Bypass robots.txt
PostPosted: March 6th, 2012, 5:38 pm 
the 'teflon' coated
Clan Chat Moderator
I'm guessing that if the website had a robots.txt, it prevented the pages from being crawled in the past, so there is no cached copy?


 Post subject: Re: Bypass robots.txt
PostPosted: March 8th, 2012, 8:13 pm 
O_o
Clan Chat Moderator
I'm pretty certain that you cannot bypass this. Cam is right on this one: there is no file *to* bypass. If you don't want spider bots, such as the Wayback Machine's crawler, to index your page and cache it, you specify so in robots.txt. A cache-based website that ignored it would be intrusive.
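
To make the robots.txt part concrete, here is a rough sketch of how a well-behaved crawler decides whether it may fetch a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, the `SomeOtherBot` name and the `is_allowed` helper are all made up for illustration. `ia_archiver` is assumed here to be the user agent name the archive crawler answers to; this is not the Wayback Machine's actual code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the archive crawler everywhere,
# and block every other bot from /private/ only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("ia_archiver", "http://example.com/page.html"),
        ("SomeOtherBot", "http://example.com/page.html"),
        ("SomeOtherBot", "http://example.com/private/x.html"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

The check happens on the crawler's side, which is why there is nothing for a visitor to bypass: the crawler either honours the rule or it doesn't.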


 Post subject: Re: Bypass robots.txt
PostPosted: March 11th, 2012, 8:40 am 
Wizard
Then I guess I've got no choice. Thanks, peeps :)
