Thank you for quoting your source. I've pasted below the text from that source, Clayton's technical report, which I have already read. It gives no details on how Phorm will be using robots.txt; in fact it records their refusal to give further details, even to Clayton. It also makes the clear statement that their use of robots.txt simply assumes that if search engines have permission, then Phorm has permission - it does not clarify how they will establish that. We might think it is obvious, but nothing is obvious when we are dealing with Phorm.
The quote from Phorm in section 44 below is not a blanket statement of respect for robots.txt - it is a conditional statement, given without explanatory detail, that "if the site has disallowed spidering and indexing by search engines, we respect those restrictions in robots.txt".
"39. When a website is first visited (by any ISP customer) the pages are not inspected. Instead, a request is queued to fetch the site’s “robots.txt†file; viz: a file maintained by the website owner which tells web crawlers and other automated systems which parts of the website should not be indexed or processed.
40. Once the robots.txt file (if any) has been fetched, it will be cached. The cache retention period will be the value set by the website using standard HTTP cache-control mechanisms, or one month if no period is specified. The minimum period that the file will be cached for is two hours.
41. The robots.txt file will be inspected and URLs that fall within forbidden areas of the website will not be processed by the Phorm system.
42. This mechanism, which will permit website owners to opt their pages out of the Phorm system, does not seem to have been previously described in any of Phorm’s documentation. They were unable to provide an explanation as to why this had not previously been disclosed.
43. In the meeting, Phorm were unable to tell us the User-Agent string they match against in the robots.txt file, knowledge of which would be required if a website owner wished to set particular rules for Phorm which differed from, for example, those for the GoogleBot.
44. I asked for further clarification and was told "we work on the basis that if a site allows spidering of its contents by search engines, then its material is being openly published. Conversely, if the site has disallowed spidering and indexing by search engines, we respect those restrictions in robots.txt".
45. It therefore still remains unclear to me what the Phorm system does if the robots.txt file does not use a User-Agent: * construction, and whether this will be in line with what the website owner intended."
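To make the caching rule in paragraph 40 concrete, here is a minimal sketch of that retention logic as I read Clayton's description - it is not Phorm's actual code. I've assumed "one month" means 30 days, and handled only the max-age directive of Cache-Control:

import re
from typing import Optional

# Retention rule per paragraph 40 (my reading, not Phorm's code):
# honour the site's Cache-Control max-age, fall back to one month
# when no period is given, never cache for less than two hours.
ONE_MONTH = 30 * 24 * 3600   # "one month" assumed to mean 30 days
TWO_HOURS = 2 * 3600

def robots_cache_ttl(cache_control: Optional[str]) -> int:
    """Seconds to retain a fetched robots.txt in the cache."""
    ttl = ONE_MONTH
    if cache_control:
        match = re.search(r"max-age=(\d+)", cache_control)
        if match:
            ttl = int(match.group(1))
    return max(ttl, TWO_HOURS)

print(robots_cache_ttl(None))             # 2592000 - the one-month default
print(robots_cache_ttl("max-age=86400"))  # 86400   - the site's own value
print(robots_cache_ttl("max-age=60"))     # 7200    - clamped up to two hours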
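And to see why the User-Agent question in paragraphs 43 and 45 matters, here is a short sketch using Python's standard urllib.robotparser against a hypothetical robots.txt that permits the GoogleBot but disallows everyone else. The agent name "PhormCrawler" is made up for illustration - the real token is precisely what Phorm declined to disclose:

import urllib.robotparser

# Hypothetical robots.txt: the site welcomes Google's crawler but bars
# every other automated agent - the case Clayton raises in paragraph 45.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

url = "http://example.com/some/page"
print(rp.can_fetch("Googlebot", url))      # True  - spidering is allowed
print(rp.can_fetch("PhormCrawler", url))   # False - falls under User-Agent: *

Under standard matching a token unknown to the site owner falls under the User-Agent: * rule and is excluded. But if Phorm instead takes "Googlebot is allowed" as blanket permission to process the site, as the quote in paragraph 44 suggests, then the owner's intent in a file like this one is inverted.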
On the question of dictionary attacks for email addresses
http://en.wikipedia.org/wiki/E-mail_address_harvesting
http://en.wikipedia.org/wiki/Directory_harvest_attack
http://www.sophos.com/security/spam-...yharvestattack
http://geek.focalcurve.com/archive/2...ary-attack%20/
Obviously I can't comment on what caused the particular spam in question in the original post, and I did not do so.
Best wishes.