Quote:
Originally Posted by AlexanderHanff
You misunderstood me I think. I was trying to explain that the system will consist of 2 stages. When you send out a web request to a web site, Phorm (not you) will go off and look for robots.txt (providing it is not already cached) to check if search engines are allowed to spider. This stage is the one where they refuse to tell us what user-agent they will use.
Then the second stage is them actually forwarding your original request (yes there are some redirects and stuff going on in between but lets try and keep it simple) where we can only assume your real user-agent will be used. Certainly there has been no indication from Phorm that they will be using a different user-agent for these requests (and realistically they wouldn't want to as they could then be easily identified and blocked).
Alexander Hanff
Yes, a slight misunderstanding. I thought we were just talking about the fetching of robots.txt, as the paragraph I quoted was titled "Re: Robots.txt". Back to the point, anyway.
I think you're probably right about them re-using your real user-agent for the "second stage" - otherwise they couldn't be sure they were being served the same content, since many sites tailor what they return based on the user-agent string.
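It's easy enough to check that for yourself: fetch the same URL with two different User-Agent headers and compare what comes back. A rough sketch in Python (standard library only; example.com and both UA strings are just placeholders):

Code:
# Fetch the same URL twice with different User-Agent headers and compare.
# example.com and the two UA strings below are just stand-ins.
from urllib.request import Request, urlopen

URL = "http://example.com/"

def fetch(user_agent):
    # Send the request with an explicit User-Agent header.
    req = Request(URL, headers={"User-Agent": user_agent})
    with urlopen(req) as resp:
        return resp.read()

browser_page = fetch("Mozilla/5.0 (Windows; U; en-GB) Firefox/2.0")
bot_page = fetch("SomeBot/1.0 (+http://example.org/bot.html)")
print("same content served:", browser_page == bot_page)

Against any site that sniffs user-agents you'll see the two responses differ, which is why they'd have to pass your real UA through.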
As for the robots.txt fetch, they have a dilemma. Either:
They completely emulate Googlebot's behaviour, which may risk litigation from Google.
or:
They do something that differentiates them from Googlebot and allows them to be denied (e.g. the user-agent string they send in the HTTP headers is different, so we can serve them a different robots.txt - see the sketch below). Not the easiest thing for a webmaster to implement, and impossible unless you've got a proper hosting solution.
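To make that second option concrete, here's a rough sketch of serving a deny-all robots.txt only to the suspect fetcher, keyed on the User-Agent header. "PhormFetcher" is a made-up token, which is exactly the problem - Phorm won't tell us what string their fetcher will actually send:

Code:
# Minimal sketch: serve a different robots.txt depending on User-Agent.
# "PhormFetcher" is hypothetical; the real token is undisclosed.
from http.server import BaseHTTPRequestHandler, HTTPServer

DENY_ALL = b"User-agent: *\nDisallow: /\n"          # for the suspect fetcher
NORMAL = b"User-agent: *\nDisallow: /private/\n"    # for everyone else

class RobotsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/robots.txt":
            self.send_error(404)
            return
        ua = self.headers.get("User-Agent", "")
        # Pick the robots.txt body based on the User-Agent header.
        body = DENY_ALL if "PhormFetcher" in ua else NORMAL
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), RobotsHandler).serve_forever()

Trivial if you run your own server; not an option on a basic shared-hosting package where you can't run code or touch the server config - hence the "proper hosting solution" caveat.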
If they really wanted to create a new "gold standard" for user privacy, they would be a lot more open about these details.
"The name token a robot chooses for itself should be sent as part of the HTTP User-agent header, and
must be well documented."