Quote:
Originally Posted by AlexanderHanff
Re: Robots.txt
Phorm claimed at the PIA Public Meeting that, before they push a request through for a GET from a user to a website, they will visit the document root for the domain to see if there is a robots.txt which allows Google access; if there is, they will profile the pages the user requests. There is no indication (in fact they refused to tell us) what the user-agent will be for this robots.txt request, but the user-agent for the user's GET requests will (I expect, although this has not been clarified either) be unchanged from the user's normal user-agent.
They've said that robots.txt will be cached* and not fetched for every phormed user. So it seems unlikely to me that they would pick a random user and forge her user agent string.
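The caching policy Clayton describes (source * below) boils down to a few lines. This is just my own sketch of the stated rules in Python; the function name and the max-age parsing are mine, not anything Phorm has published:

import re

# Limits as stated in Clayton's report, para 40 (source * below):
# default to one month, never cache for less than two hours.
DEFAULT_TTL = 30 * 24 * 3600   # one month, in seconds
MINIMUM_TTL = 2 * 3600         # two hours, in seconds

def robots_cache_ttl(cache_control):
    """How long to keep a fetched robots.txt, in seconds."""
    if cache_control:
        m = re.search(r"max-age=(\d+)", cache_control)
        if m:
            return max(int(m.group(1)), MINIMUM_TTL)
    return DEFAULT_TTL  # no period specified: cache for a month

robots_cache_ttl("public, max-age=600")  # -> 7200, clamped up to two hours
robots_cache_ttl(None)                   # -> 2592000, one month

So even a site that asks for a short cache period gets re-checked at most every two hours, which is why per-user fetches make no sense.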
Also: if they, by some miracle, actually followed the robots.txt standard, then the user-agent token they match against and the one they send in the HTTP headers must match:
"The name token a robot chooses for itself should be sent
as part of the HTTP User-agent header, and must be well documented."**
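For concreteness, here is a minimal sketch (Python, with a made-up token "phorm-profiler", since they refused to say what they would actually use) of what following that rule looks like: the token checked against robots.txt is the same one sent on the wire.

import urllib.robotparser
import urllib.request

ROBOT_TOKEN = "phorm-profiler"  # hypothetical; Phorm have not disclosed a token

def fetch_if_allowed(page_url, robots_url):
    # Check our own token against the site's robots.txt ...
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    if not rp.can_fetch(ROBOT_TOKEN, page_url):
        return None  # disallowed for our token, so no request at all
    # ... and send that same token as the HTTP User-agent header.
    req = urllib.request.Request(page_url, headers={"User-Agent": ROBOT_TOKEN})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

fetch_if_allowed("http://example.com/page", "http://example.com/robots.txt")

Matching against Google's entry while the requests go out under the user's normal User-Agent, as described in the quote above, would satisfy neither half of that requirement.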
Sources:
* From http://www.cl.cam.ac.uk/~rnc1/080404phorm.pdf: "40. Once the robots.txt file (if any) has been fetched, it will be cached. The cache retention period will be value set by the website using standard HTTP cache-control mechanisms, or for one month if no period is specified. The minimum period that the file will be cached for is two hours."
** From http://www.robotstxt.org/norobots-rfc.txt