Quote:
Originally Posted by AlexanderHanff
Re: Robots.txt
Phorm claimed at the PIA Public Meeting that, before they push a request through for a GET from a user to a website, they will visit the document root for the domain to see if there is a robots.txt which allows Google access; if there is, they will profile the pages the user requests. There is no indication (in fact they refused to tell us) what the user-agent will be for this robots.txt request, but the user-agent for the user's GET requests will (I expect, although this has not been clarified either) be unchanged from the user's normal user-agent.
They've said that robots.txt will be cached* and not fetched for every phormed user. So it seems unlikely to me that they would pick a random user and forge her user agent string.
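The caching policy Clayton describes (source * below) boils down to a few lines. This is just my own sketch of the stated rules in Python; the function name and the max-age parsing are mine, not anything Phorm has published:

import re

# Limits as stated in Clayton's report, para 40 (source * below):
# default to one month, never cache for less than two hours.
DEFAULT_TTL = 30 * 24 * 3600   # one month, in seconds
MINIMUM_TTL = 2 * 3600         # two hours, in seconds

def robots_cache_ttl(cache_control):
    """How long to keep a fetched robots.txt, in seconds."""
    if cache_control:
        m = re.search(r"max-age=(\d+)", cache_control)
        if m:
            return max(int(m.group(1)), MINIMUM_TTL)
    return DEFAULT_TTL  # no period specified: cache for a month

robots_cache_ttl("public, max-age=600")  # -> 7200, clamped up to two hours
robots_cache_ttl(None)                   # -> 2592000, one month

So even a site that asks for a short cache period gets re-checked at most every two hours, which is why per-user fetches make no sense.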
Also: if they, by some miracle, actually followed the robots.txt standard, then the user-agent token they match against and the one they send in the HTTP headers must match:
"The name token a robot chooses for itself should be sent
as part of the HTTP User-agent header, and must be well documented."**
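For concreteness, here is a minimal sketch (Python, with a made-up token "phorm-profiler", since they refused to say what they would actually use) of what following that rule looks like: the token checked against robots.txt is the same one sent on the wire.

import urllib.robotparser
import urllib.request

ROBOT_TOKEN = "phorm-profiler"  # hypothetical; Phorm have not disclosed a token

def fetch_if_allowed(page_url, robots_url):
    # Check our own token against the site's robots.txt ...
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    if not rp.can_fetch(ROBOT_TOKEN, page_url):
        return None  # disallowed for our token, so no request at all
    # ... and send that same token as the HTTP User-agent header.
    req = urllib.request.Request(page_url, headers={"User-Agent": ROBOT_TOKEN})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

fetch_if_allowed("http://example.com/page", "http://example.com/robots.txt")

Matching against Google's entry while the requests go out under the user's normal User-Agent, as described in the quote above, would satisfy neither half of that requirement.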
Sources:
* From http://www.cl.cam.ac.uk/~rnc1/080404phorm.pdf: "40. Once the robots.txt file (if any) has been fetched, it will be cached. The cache retention period will be value set by the website using standard HTTP cache-control mechanisms, or for one month if no period is specified. The minimum period that the file will be cached for is two hours."
** From http://www.robotstxt.org/norobots-rfc.txt