Friday, January 4. 2008
I'm not going to get into the moral issues about whether he breached the privacy of his friends by attempting to view their email addresses, which they had granted him permission to view. My work on an MSN Messenger contact grabber, and my posts highlighting Gmail and Yahoo contact grabbers, probably speak for themselves.
Instead let's see if we can make sense of the terms of service Robert Scoble decided didn't apply to him.
I've taken relevant extracts from the terms of service at 3 websites. Try and guess which belongs to facebook.
In addition, you agree not to use the Service or the Site to:
- harvest or collect email addresses or other contact information of other users from the Service or the Site by electronic or other means for the purposes of sending unsolicited emails or other unsolicited communications;
In using the service, you may not:
* use any automated process or service to access and/or use the service (such as a BOT, a spider, periodic caching of information stored by Microsoft, or "meta-searching");
You agree not to access the Service by any means other than through the interface that is provided by Yahoo! for use in accessing the Service.
Social networks are built on scraping content
The first site is facebook. Robert Scoble clearly breached the terms of service. But here is where it gets interesting.
Site 2 is hotmail, for which facebook provides an automated service, which could also be called a BOT, to grab contacts.
Site 3 is yahoo, for which facebook provides an automated service to grab contacts.
Terms of service be damned! Mark Zuckerberg has a fortune to make and if that means being an accessory to millions of users breaching their contracts with other companies then so be it.
Many of the companies which facebook provides bots for are moving to support porting users' data between services. On Thursday facebook was invited to do the same. I would ask that they give it serious thought.
Sunday, December 2. 2007
I feel the improvements I'm going to include are really important so, to ensure that everyone can make use of this tutorial, I'm going to take a step back and develop it in PHP4. I suspect there may still be some people using PHP4. This will likely be the last time I worry about compatibility with PHP4. In the new year it will be PHP 5 all the way.
For those who haven't seen the original article, it:
1) Requested login details for gmail or msn messenger (not the same as hotmail)
2) Logged in to the service and fetched the contact details
3) Listed all contacts and enabled the user to choose which should be contacted
4) Sent an email to all requested contacts.
This updated tutorial will show how to do the above and also the following:
1) Defend against malicious attacks
2) Prevent duplicate messages from being sent
3) Allow recipients to opt out of future messages
If you have read the follow up post to the original tutorial you'll see that the improvements are focused around minimising the problems surrounding unsolicited email rather than improving the efficiency of the process. Those potential improvements are really beyond what can sensibly be included in a generic tutorial like this one.
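The three improvements listed above can be sketched in a few lines. This is only an illustration, not the code from the tutorial itself: the function name `filter_recipients` and the idea that the "already sent" and "opted out" lists arrive as plain arrays (rather than from a database) are assumptions made for the example. It also uses `filter_var`, which needs PHP 5.2 or later, so it is not PHP4 compatible.

```php
<?php
// Hypothetical helper: decide which of the selected contacts may be emailed.
// $selected    - addresses the user ticked in the contact list
// $alreadySent - addresses mailed previously (assumed to be loaded from storage)
// $optedOut    - addresses that asked not to be contacted again
function filter_recipients($selected, $alreadySent, $optedOut)
{
    // Build a lookup of every address that must not be mailed.
    $blocked = array();
    foreach (array_merge($alreadySent, $optedOut) as $address) {
        $blocked[strtolower(trim($address))] = true;
    }

    $recipients = array();
    foreach ($selected as $address) {
        $key = strtolower(trim($address));

        // Reject anything that is not a single well-formed address.
        // Embedded newlines fail validation, which blocks header injection.
        if (filter_var($key, FILTER_VALIDATE_EMAIL) === false) {
            continue;
        }
        if (isset($blocked[$key])) {
            continue;
        }

        $blocked[$key] = true; // prevents duplicates within this batch
        $recipients[] = $key;
    }

    return $recipients;
}
```

Addresses are compared in lower case, so the same contact typed with different capitalisation is only mailed once.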
So, without further introduction, let's get started. Continue reading "Contacting a contact list: A tutorial - revisited"
Wednesday, November 28. 2007
From various places on the web I've found three books which may be helpful and would love to get some feedback from the PHP community.
For background, programming is not a full time activity for me and although I can easily knock up a custom cms I wouldn't describe myself as a seasoned pro. Also, although it has not always been the case, I work almost exclusively in PHP and expect this to be the case for the foreseeable future.
PHP|architect's Guide to PHP Design Patterns by Jason E. Sweat.
From the reviews it looks like this one may go somewhat off-topic.
PHP 5 Objects, Patterns, and Practice by Matt Zandstra
The reviews seem positive. Probably the most likely candidate at this point.
The Object Oriented Thought Process (Developer's Library) by Matt Weisfeld
From the reviews it may be too brief an introduction.
Have you bought any of the three books above? How useful was it to you? Any other suggestions?
p.s. A cookie for anyone who can name the film I've paraphrased in the post title.
Sunday, November 4. 2007
The first thing I did was stop visiting the site for 15 minutes or so and then increase the time between requests. It briefly worked again but quickly stopped. Next, I opened up php.ini and checked what useragent PHP was using. It turned out to be 'PHP'. I changed that and for the past 3 hours (almost) the script has been working perfectly.
Moral of the story: When scraping content from the web don't make it obvious
It's worth noting that I can't say for sure that it was changing the user agent which fixed the problem, it could have just been coincidence, but it's an easy fix and why make it obvious that you're scraping content?
In this instance I was just working from my development server so I had access to php.ini but I had several options.
I could have added a line to my .htaccess file
or used ini_set.
ini_set('user_agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.8) Gecko/20071025 Firefox/2.0.0.8');
Curl also allows you to specify the useragent.
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.8) Gecko/20071025 Firefox/2.0.0.8');
If you want to take cloaking the useragent further with curl this comment in the PHP manual may be useful.
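A further option, if you would rather not change the setting for the whole script, is to set the user agent for a single request through a stream context. The sketch below is an illustration only: the helper name `build_context`, the user agent string and the example URL are all made up for the example.

```php
<?php
// Hypothetical helper: build a stream context carrying a custom user agent
// and a timeout, for use with file_get_contents() or fopen().
function build_context($userAgent, $timeoutSeconds)
{
    return stream_context_create(array(
        'http' => array(
            'user_agent' => $userAgent,
            'timeout'    => $timeoutSeconds,
        ),
    ));
}

// Placeholder user agent; substitute whatever string suits your script.
$context = build_context('Mozilla/5.0 (compatible; ExampleFetcher/1.0)', 10);

// The context only affects calls that are passed it explicitly, e.g.:
// $html = file_get_contents('http://example.com/', false, $context);
```

Unlike `ini_set`, this leaves every other request made by the script using the default user agent.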
Wednesday, October 17. 2007
Matt Biddulph discussed interacting with 3rd party sites and services and the portable social graph. This session was particularly interesting to me given my interest in handling contacts from MSN messenger, Gmail, AOL messenger (AIM) and Yahoo!.
Matt's talk focused on how we can move beyond the use of such scripts, with their multitude of risks, to a situation where a user can join a new site, release the minimum data necessary and quickly identify their friends already using the service.
For those wondering what risks I am talking about: in the current situation, a site that wants to view your contacts in a service needs total access. This means they could view your mail and send mail through your account. Where a username/password combination is used to access other services on a site these would also be compromised, so to take Google as an example you also grant access to Google Checkout, Adsense, Analytics etc. This is the situation today. Even Dopplr, the contact importing functionality of which the conference chairs spoke about with nothing but praise, requires your username and password to import your Gmail contacts.
Luckily, this situation is on the cusp of improving. For services where you are happy to make your list of friends publicly viewable marking them up with microformats is a simple way to allow other services to make sense of your friends list. Matt mentioned that twitter was also thinking of taking this a step further and supporting openid so that a user could prove that a friends list really was their friends list. I'm not terribly familiar with openid but I assume this wouldn't lead to once again proliferating login details and you simply delegate to your actual provider. Correct me if I'm wrong.
This works well when you're happy for your contacts to be public but you're probably going to want to keep at least some of the contacts in your email address book private so another solution needs to be found. Thankfully a standard way of achieving this is being developed. In fact, Matt talked about five.
I've also talked previously about Windows Live Contacts Control which, although more limited in scope, does much the same thing for the purposes of this discussion.
Hopefully it won't be too long before most of the code on this site is nothing more than an historical curiosity. I don't think we are there yet but soon. . .
The slides from Matt's talk are also now available.
Sunday, October 7. 2007
Luckily he has previously recorded a video of a very similar (though not identical) presentation and I've embedded it below.
It's 37 minutes so for those in a hurry here are my notes from his presentation at FOWA.
The key points have also been discussed in a series of blog posts so where possible I'll link out to the relevant posts.
Importance of the backend
The first point raised is that the user perception of load time is more important than the actual load time. This means that the relevant metric is not how fast can the html document be returned to the browser but how quickly that html is rendered in the browser. In measuring how quickly a page renders it was quickly realised that the backend performance, returning the html for the page, accounted for only about 5% of the overall time it took to render the page. Even with a full cache the backend was still only 13% of overall time.
Of the top 10 sites in the US only the backend for Google accounted for more than 20% of the load time with a full cache. The Google homepage is so spartan that with a primed cache only two HTTP requests need to be made.
The importance of the cache was then discussed, with data presented on how many visitors to a site had a primed cache. They inserted a one pixel image on the Yahoo! homepage and then monitored the number of HTTP requests answered with a 200 status (empty cache) and a 304 status (primed cache).
It was found that 50% of daily users have an empty cache which accounts for 20% of daily pageviews. This varies depending on the type of site, for example an empty cache accounts for fewer pageviews in a webmail site where each user will view multiple pages, but is broadly accurate. The data highlights the importance of catering for those users without a primed cache. Excessive use of images can't be justified by the assumption that once they are loaded the cached versions can be used. 50% of your users every day will be arriving at your site with an empty cache.
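To make the two percentages concrete, here is a rough sketch of how such figures could be derived from a log of requests for the tracking image. It is not Yahoo!'s method: the function name `cache_stats`, the log format (one user id and status code per request) and the rule that a user counts as "empty cache" if any of their requests that day returned 200 are all assumptions made for the example.

```php
<?php
// Hypothetical analysis of one day's requests for the one pixel image.
// Each entry is array(user id, HTTP status): 200 = empty cache, 304 = primed.
function cache_stats($requests)
{
    $views = 0;
    $emptyViews = 0;
    $users = array(); // user id => true if any request arrived with an empty cache

    foreach ($requests as $request) {
        list($user, $status) = $request;
        $views++;
        if (!isset($users[$user])) {
            $users[$user] = false;
        }
        if ($status === 200) {
            $emptyViews++;
            $users[$user] = true;
        }
    }

    if ($views === 0) {
        return array('users_pct' => 0.0, 'views_pct' => 0.0);
    }

    $emptyUsers = count(array_filter($users));

    return array(
        'users_pct' => 100.0 * $emptyUsers / count($users),
        'views_pct' => 100.0 * $emptyViews / $views,
    );
}

// Two users, five page views: one user arrives with an empty cache once.
$log = array(
    array('a', 200), array('a', 304),
    array('b', 304), array('b', 304), array('b', 304),
);
$stats = cache_stats($log);
printf("%.0f%% of users, %.0f%% of page views\n", $stats['users_pct'], $stats['views_pct']);
```

The toy log reproduces the shape of the finding: half the users show up with an empty cache, but because each user views several pages only a fifth of page views are affected.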
Next he talked briefly about iFrames and how they can cause a 40-50 ms delay. onLoad doesn't fire until the iFrame source responds, which can cause a problem with 3rd party content.
Next he discussed YSlow, which grades a website based on the 14 rules developed through their research. YSlow is an extension for Firebug, the popular development extension for the Firefox browser. It looks at how the page was built. Despite looking at the content rather than the response time, its score correlates well with the rendering time. As such it could be a valuable tool during development to predict the speed of a site prior to its launch.
Another issue which YSlow apparently solves is a bug in how Firebug charts HTTP requests. Apparently Firebug will show queries to the cache as HTTP requests and YSlow patches this.
That's all I made notes on. I've got vague memories of stepping HTTP requests to increase download speed and cookies are always worth considering but I picked up a nasty cold in London and it's all a bit fuzzy.
Saturday, October 6. 2007
The conference kicked off with a keynote from Om Malik discussing 'What is the Future of Web Apps?' Mike Arrington from Techcrunch decided to gatecrash 15 minutes or so into the keynote. The conversation that followed was interesting though, with the pessimism from Om working well with Mike's optimism. I've been following Techcrunch for a while but have now added GigaOm for the potentially balancing effect.
Ben Forsaith then demoed '10 Real-world apps' in 10 minutes. Surprisingly 9 of the 10 worked without a problem. The most interesting one was probably Buzzword, which is a word processing app. Online office products have been getting a lot of attention recently, with available-anywhere functionality playing against the more basic options. Buzzword really grabbed my attention because during the, admittedly short, demo it looked like it could wipe the floor with Microsoft Word when it came to handling images and altering the layout of the page. I frequently have to break reports into 5 or more sections to maintain the layout so if Buzzword performs as well with large files as it did during the demo then it may be goodbye Word. I haven't tried it yet but I've bookmarked it to try later.
Site Speed and User Experience
As I mentioned in the Benchmarks, Site Speed and User Experience post the first speaker of the day following the keynotes was Steve Souders discussing 'High Performance Websites'. Watch for a post discussing this in more detail later.
The quality of speakers stayed high throughout. I think on the first day the most informative/interesting speakers came at the end with Heidi Pollock discussing mobile applications which is an area I haven't previously looked at and John Resig who talked about some of the really interesting things coming up in Firefox.
On the evening of the first day Diggnation was filmed on the keynote stage in front of a packed audience. I've not watched diggnation before but it was absolutely hilarious live. I think it is only available for premium members at the moment but if you know different let me know as I would like to see what the video version was like.
From reading the schedule I wasn't as excited by the second day as I was by the first but there was no need for worry. Simon Wardley got through 300 slides in 30 minutes with a highly engaging talk about commoditisation and utility computing. John Aizen and Eran Shir discussed the semantic web from their work at dapper. Matt Biddulph from dopplr discussed smart integration with third party sites. I'll be going into more detail on this later as well. The final session I went to was with Dick Costolo from feedburner and focused more on the business side but was interesting all the same.
Unfortunately I had to leave before the final keynotes to catch my flight but overall I felt it was a very good conference.
In addition to the conference there was also the expo hall with some interesting exhibitors.
Fav.or.it may just have what it takes to lure me away from google reader. It hasn't been officially launched yet but from what I saw during a demo it's a very interesting product. It is also built on the Zend framework which makes it worthy of note from a PHP viewpoint.
Widr sounded very promising. It's a geolocation service for the internet. It's going to potentially be more accurate than relying solely on the IP. If I understand the product correctly though I suspect it will always be a niche product as the user needs to install software for it to work. I suspect they also made a mistake in going for a .co.uk domain name rather than .com. The product has global appeal so to me a .com makes more sense.
Xcalibre launched their new flexiscale product which is probably best described as a competitor for Amazon S3 and EC2. It looks like a very interesting product and from a technical perspective I suspect the better of the two, but I worry that the strength of the British pound will make it less competitive on pricing.
Finally I'll highlight soup.io which is a blogging platform for less serious content. Probably best described as occupying the market between wordpress.com et al and twitter et al. It's not something I plan on using myself but it looked like a nice product which I could recommend to less web savvy family and friends.
Saturday, September 29. 2007
Firstly, let's lay the benchmarking issue to rest. Ben Ramsey, who after his initial outrage at my 7 tips post felt it "actually is really humorous" (probably unjustified praise but thanks anyway!), has a nice post highlighting the code in the PHP source confirming the lack of any difference I demonstrated in my follow up post. Wez Furlong commented on my 7 tips post and highlighted a post he made on benchmarking back in 2005. For anyone feeling my method was excessive his approach gives speedier results. Personally I'd like to see it run in triplicate though.
Next, as far as the minute differences that 'lightning fast PHP'-style posts are too often built around go, Ilia Alshanetsky probably has the best write-up.
Please keep in mind that these are not the 1st optimization you should perform. There are some far easier and more performance advantageous tricks, however once those are exhausted and you don't feel like turning to C, these maybe tricks you would want to consider.
Getting to articles with tips for that 1st round of optimisations you may want to make there are 13 tips for high performance websites on the Yahoo! developer network. These were written by Steve Souders who, in addition to writing the book 'High Performance Web Sites,' is speaking at the FOWA conference next week. That's one session I definitely want to catch. Hasin Hayder has a follow up post which is definitely worth reading.
Hasin goes into more detail than the Yahoo! article and provides some sample code. A three part series of posts at the IBM developerWorks site takes a PHP focused look at high performance websites and provides some useful instructions on setting up your sites to use the XCache opcode cache, Xdebug and memcache.
Three rules for high performance web sites
For those wanting the abridged version here are my 3 tips for high performance.
1) Fast environment - Start from a position of strength. I didn't post the average speeds in the better benchmarks post because I was looking at the difference rather than the absolute values but the benchmarks were running ten times faster on my web host than on my desktop. There are various reasons why this may be the case, Linux vs Windows XP, system specs, PHP 5.2.3 vs 'evil' PHP 5.2.1, but it doesn't really matter beyond illustrating the need for a good server and host. Other things to consider include an optimizer/opcode cache and gz compression.
2) Cache everything - Database and web service queries, blocks of content and even your entire page are all fair game.
3) Test everything - Time your code. Profile your code. Test your assumptions (including tips 1 & 2).
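As a small illustration of tips 2 and 3 together, the sketch below wraps an expensive call in a file cache and times it before and after. It is only an example: the helper name `cached`, the cache key and the use of the system temp directory are assumptions, and a real site would more likely use memcache or an opcode cache as mentioned above.

```php
<?php
// Hypothetical file cache: return a stored value if it is fresh enough,
// otherwise call $producer, store its result and return it.
function cached($key, $ttlSeconds, $producer, $cacheDir)
{
    $file = $cacheDir . '/' . md5($key) . '.cache';

    if (is_file($file) && (time() - filemtime($file)) < $ttlSeconds) {
        return unserialize(file_get_contents($file));
    }

    $value = call_user_func($producer);
    file_put_contents($file, serialize($value), LOCK_EX);

    return $value;
}

// Stand-in for a slow database or web service query.
function slow_query()
{
    usleep(200000); // pretend this takes 0.2 seconds
    return array('rank' => 42);
}

$dir = sys_get_temp_dir() . '/cache_sketch_' . uniqid();
mkdir($dir);

// Time the same call twice: the second run should be served from the cache.
foreach (array('cold', 'warm') as $label) {
    $start = microtime(true);
    cached('example-query', 300, 'slow_query', $dir);
    printf("%s: %.4f seconds\n", $label, microtime(true) - $start);
}
```

Timing both runs is the point of tip 3: the cache only earns its place if the measured difference is worth the extra code.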
Speed doesn't matter
Finally an alternative take, because playing devil's advocate is fun. Download speed is not how users determine the speed of a site. To the user a site is fast if they can quickly achieve their goal. Stephen O'Grady at RedMonk also raises some interesting points contrasting the perspective of the developer and the user.
As always further suggestions, alternative viewpoints and discussion are welcome in the comments below.
Sunday, September 23. 2007
I had thought that with comparing aliases of functions seven times over people would realise what I was doing but apparently my post was just too close to the sad reality and lacking in sufficient humour for people to catch on.
Here I again present benchmarks for the seven pairs of functions I compared in my last post. The difference being that this time the benchmark I use is my best attempt. If you think you can do better I would like to hear from you.
For this more rigorous test I switched from my local development server to a remote shared hosting account running PHP 5.2.3 on a Linux system.
As before the code used to run the test is available for download. Instead of just running each function one million times and timing it, multiple rounds of replication are now used. Each function is run one thousand times and then its partner is run one thousand times. This process is repeated one thousand times during the execution of a single script, giving the one million runs performed in the previous post. This is considered to be a single test. Tests are run in blocks of ten with two second intervals between each request, and a block of ten is run every five minutes via a cron job. This allows 120 'tests', and 120 million function executions, to be run an hour without any supervision.
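The interleaving described above can be sketched roughly as follows. This is not the downloadable benchmark script: the function name `benchmark_pair` is invented for the example, and wrapping each call in a closure adds overhead that the original code may well have avoided. The round and call counts are cut down so it finishes quickly.

```php
<?php
// Hypothetical interleaved benchmark: run $first for a block of calls, then
// $second for a block of calls, and repeat for the requested number of rounds
// so that background load on the server is shared between both functions.
function benchmark_pair($first, $second, $rounds, $callsPerRound)
{
    $totals = array(0.0, 0.0);

    for ($round = 0; $round < $rounds; $round++) {
        foreach (array($first, $second) as $index => $function) {
            $start = microtime(true);
            for ($i = 0; $i < $callsPerRound; $i++) {
                call_user_func($function);
            }
            $totals[$index] += microtime(true) - $start;
        }
    }

    return $totals; // seconds spent in $first and $second respectively
}

$data = range(1, 1000);

// 100 rounds of 100 calls here; the post used 1000 rounds of 1000 calls.
$totals = benchmark_pair(
    function () use ($data) { return sizeof($data); },
    function () use ($data) { return count($data); },
    100,
    100
);

printf("sizeof: %.6f seconds\ncount:  %.6f seconds\n", $totals[0], $totals[1]);
```

A single run like this still says little on its own; the repeated blocks scheduled by cron are what give enough samples to compare.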
After leaving it to run for an hour or so I got to work processing the stats. Continue reading "Better Benchmarks"
Saturday, September 22. 2007
I was recently creating a small tool in PHP and found myself hitting the max execution time and getting a fatal error. As it was only for my personal use I just bumped up the max execution time but it made me stop and think about how I could improve the speed of those scripts I do put up for public use. Most people aren't going to wait for 60 seconds for a page to load.
Naturally I hit the internet looking for tips. Blog posts, apparently entire blogs, forum posts and even dynamic web pages are devoted to speeding up our PHP based websites. Although these provide some information on small improvements I wanted more.
I set out to investigate whether there are any functions in PHP which perform similar tasks but where one function is faster than another. str_replace and preg_replace are fairly well known examples but I wanted to find others.
All tests were run on a VIA Nehemiah 999 MHz with 480 MB RAM (my gaming rig) running Windows XP Professional SP2 and PHP 5.2.1. The code used to run these tests can be downloaded. In summary all tests were run 1 million times so that any potential errors averaged out.
1. sizeof vs count
First up are the sizeof and count functions. They can both be used to count the number of items in an array but does one do it better?
sizeof vs count
sizeof: 3.75928902626 seconds
count: 3.33035206795 seconds
Time saved: 0.428936958313 seconds; 12.8796280262%
The evidence says yes. The count function was over 12% faster in this test. Both functions are fast though taking 3-4 microseconds to count an array with 100,000 items. You might think it isn't worth it but remember count is also a character shorter. Not only is it faster to run but it is also faster to type!
Continue reading "7 tips for lightning fast PHP sites"
Saturday, August 11. 2007
Future of Web Apps . . . I'll be there
Well the Future of web apps conference is being held in London on the 3rd and 4th October and this one I can attend. I've already booked my place and I'm not too busy this time around.
I'm flying down late on Tuesday and then leaving as soon as the conference ends but if you're in the area, or attending the conference, and want to meet up during the conference get in touch.
The team behind the conference is also organising a roadtrip where they plan to go out and visit 12 European cities. They are visiting Edinburgh so I'll catch up with them then.
Thursday, August 2. 2007
New PEAR package for the Compete API
After writing about the compete API Hiroki Akimoto contacted me mentioning a proposal he had made to PEAR. Our scripts each had strengths and we decided to combine our attempts and hopefully make something better. After a week or so we updated the proposal with the new code and after a little time for comments and a week for voting the proposal was accepted and on 30th July a new package, Services_Compete, was available on PEAR.
The developer page at Compete is already updated.
The PEAR package is in my opinion superior to the wrapper I released but I'll keep it up in case anyone wants something more lightweight than PEAR.
Youmoz is the section of the seomoz site for 'user generated content' and over the previous weekend I decided to give it a try. Lessons Learned from Webmaster Central discusses some of the problems I discovered when I registered the site with Webmaster Central at Google. The fixes involved a little PHP, a small change to my database and a handful of new RewriteRules in my .htaccess file.
It was more technical than most of the posts there but it seemed to go down well.
Saturday, July 21. 2007
It was initially flagged by techcrunch about a week ago. At the time it was struggling under the onslaught of being highlighted both on techcrunch and elsewhere. A few days later seomoz brought to my attention that the load had got so bad that the creator felt unable to cope and had released the source code to the community. There are now 13 mirrors you can use including Italian, German, Bulgarian, French and Dutch versions.
Despite this there are some real insights to be had for anyone interested in scraping these sorts of stats. For instance a stunningly simple method to grab the alexa rank is used which I hadn't come across before. It doesn't involve paying to access the API and you don't need to wrestle a css file into submission to extract the rank from the alexa site.
The alexa data is returned in file number 7. Just in case you're not overly thrilled by the notion of opening all 46 files to find the one that accesses the service you're particularly interested in there is a useful key in the following file - js/general_without_encryption.js
Sunday, July 8. 2007
I'm setting up a tool at the moment that is going to use a fair amount of web scraping. Sadly not all sites are kind enough to provide an API like Compete and some actively work to make your life more difficult. The three key points that I feel make web scraping a real pain are:
- Undocumented interface
- We're dealing with essentially unstructured HTML here. Web scraping also highlights just how little many companies care about validation and writing readable code. I'm assuming their code is a pain to read because they compact their files to save on bandwidth. If not then maintaining these sites could be an even bigger pain than scraping them.
- No warning of changes
- You're essentially at the mercy of the designer. If they change their layout your code breaks.
- Timeouts and blank pages
- Even with all the code in place you still may get no result. The site may be down or, if you're being too aggressive, your requests may be throttled. Even if you do get to a page it may not be the one you want. Does a blank page mean the site has no info for the query you made or does it mean there was some sort of error? If the site is using 'soft' error messages it may be difficult to know.
With my recent work on web scraping I found the latest 'Whiteboard Friday' video from seomoz to be exceptionally timely. I generally read the seomoz blog for the discussion on marketing issues but they also have a couple of tools which are heavily reliant on web scraping and on Friday the issues surrounding this were discussed. Apparently they've been having trouble providing data for all their visitors due to the ever increasing demand. As well as the video, Matt, who is their CTO and web developer, goes into a fair amount of detail on their process and some of it really depressed and/or surprised me:
One of the most unreliable APIs I've had to deal with is the Alexa / Amazon API, which is funny because it's the only one that costs money.
Sadly this isn't the first time I've heard this. The costs aren't unreasonable but if you're paying for something I think it is fair to expect a higher standard of service than if it was free.
This entire [data fetching] process is repeated between 2-7 times with varying timeout lengths and user-agents until some kind of data is fetched.
7 attempts strikes me as amazingly high. It would be interesting to know the delay between each request.
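For anyone wanting to try something similar, a retry loop of this kind might look roughly like the sketch below. It is a guess at the general shape, not the seomoz code: the function names `fetch_with_retries` and `http_fetch`, the list of user agents and timeouts, and the pause between attempts are all assumptions for the example.

```php
<?php
// Hypothetical retry loop: try each (user agent, timeout) pair in turn until
// one returns a non-blank body. $fetch is any callable taking
// ($url, $userAgent, $timeout) and returning a string, or false on failure.
function fetch_with_retries($url, $fetch, $attempts, $pauseSeconds = 0)
{
    foreach ($attempts as $attempt) {
        list($userAgent, $timeout) = $attempt;

        $body = call_user_func($fetch, $url, $userAgent, $timeout);
        if (is_string($body) && trim($body) !== '') {
            return $body;
        }

        // Pause after each failed attempt so retries are less aggressive.
        if ($pauseSeconds > 0) {
            sleep($pauseSeconds);
        }
    }

    return false; // every attempt failed or came back blank
}

// One possible fetcher, built on a stream context.
function http_fetch($url, $userAgent, $timeout)
{
    $context = stream_context_create(array(
        'http' => array('user_agent' => $userAgent, 'timeout' => $timeout),
    ));
    return @file_get_contents($url, false, $context);
}

// Example use (placeholder URL and user agents):
// $html = fetch_with_retries('http://example.com/', 'http_fetch', array(
//     array('ExampleFetcher/1.0', 5),
//     array('ExampleFetcher/1.0', 15),
//     array('Mozilla/5.0 (compatible; ExampleFetcher/1.0)', 30),
// ), 2);
```

Passing the fetcher in as a callable keeps the retry logic separate from the HTTP code, which also makes it easy to test without touching the network. Note that a blank page is treated as a failure here, which will not suit sites where an empty result is a legitimate answer.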
Experience so far
I've got two scripts/tools running at the moment that require data from external sources.
The first is the msn contact grabber script which can fail after half a dozen or fewer requests on a single IP address. To make matters worse the interface is complex and poorly documented. Thankfully it is stable.
The second is the dnsbl checker which I haven't had fail on me yet. It utilises the dns system and is designed for high use. Even if the tool did become insanely popular the demand placed on external services could be limited by zone transfers. The interface is also so simple that documentation really isn't needed.
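To show just how simple the interface is, here is a rough sketch of a DNSBL lookup. It is not the code behind my checker: the function names and the zone `dnsbl.example.org` are placeholders. It relies on the usual DNSBL convention of reversing the octets of the IPv4 address, appending the list's zone and asking for an A record.

```php
<?php
// Hypothetical helper: build the hostname to query for an IPv4 address,
// e.g. 192.0.2.1 checked against dnsbl.example.org becomes
// 1.2.0.192.dnsbl.example.org. Returns false for an invalid address.
function dnsbl_query_name($ipv4, $zone)
{
    if (filter_var($ipv4, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return false;
    }

    return implode('.', array_reverse(explode('.', $ipv4))) . '.' . $zone;
}

// An address is treated as listed if the query name resolves to an A record.
function dnsbl_is_listed($ipv4, $zone)
{
    $name = dnsbl_query_name($ipv4, $zone);

    // The trailing dot makes the name fully qualified.
    return $name !== false && checkdnsrr($name . '.', 'A');
}

echo dnsbl_query_name('192.0.2.1', 'dnsbl.example.org'), "\n";
```

Because every check is an ordinary DNS query, answers are cached by resolvers along the way, which is a large part of why this kind of service copes so well with heavy use.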
My experience seems to be at the two extremes of scraping. I'm hoping my current work will be more dnsbl checker than msn contact grabber. Maybe I should just drop Alexa data?
Monday, July 2. 2007
My post from last night discussing the Compete API received two comments. The first highlighted that there was already a proposal in PEAR. Great to hear but as I did much of the work for a wrapper last night you'll get another version now. Actually there is some potential for a superior hybrid to emerge - my wrapper provides better methods for getting at the data while the PEAR proposal could be more robust.
Given that I wrote the post on a Sunday evening I was more than a little surprised, albeit delighted, to receive a response from Jay Meattle at Compete before I woke on Monday morning. You would expect a company which gives a named contact to be fairly responsive but I still found the time frame here to be impressive.
With the exception of a link from the homepage, which is apparently in the works, Compete has made good on the points I raised so it's time for me to fulfil my part of the deal and release a PHP5 wrapper for the compete API.
Hopefully it should save anyone else interested in using the Compete API within a PHP project from starting from scratch.