A few weeks ago, I posted a note about a project I’m carrying out as part of my ongoing interest in memetics, contagious messaging and so on.
Geek alert: The rest of this story may not be wildly exciting to normal people.
It’s taken me a while to get everything together: the data, the analysis tools and so on. Partly because it’s a hobby (which means that I enjoy doing the tinkering myself when I have the time, rather than letting someone better qualified do it), and partly because I didn’t know whether there was any value in the exercise, it’s taken me longer to pull together, and is in a clunkier format, than I might have wished. Something that I had imagined as being fairly simple and elegant is in fact tied together with string and masking tape at this stage (well, it is a prototype).
What went wrong?
- Bad data: I used a brilliantly versatile piece of software called Anthracite (Mac only) to write the web scraper. I’m not really using it in the optimal manner, and that led to some anomalies and artefacts that needed to be edited out by hand. Dull, repetitive work.
- Bad data: The machine that was doing the scraping was set up to run automatically at the same time every night. While I was away one weekend, a power cut did for that. I lost two days of data, and had to throw away a third day to maintain accuracy during the analysis phase (otherwise the increment for that day would cover the whole gap and be artificially high). Again, I had to do this by hand for 130 records.
- Insufficient skill and power: I thought I’d do all the analysis in MS Access, but I couldn’t find a way to do what I needed. In the end, Access just became a way of collating information.
In the end, it’s taken Anthracite, a text editor, Access, Perl and Excel (and an eight-stage process) to do what I thought I could do in Anthracite and Access alone. But I’ve done it. If I were to do it again, I’d probably just use Perl (or more to the point, get someone else to do it).
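For anyone wondering what the "throw away a third day" step amounts to, here is a rough sketch in Python (not the Perl I actually used). It assumes each video's data is a set of cumulative view counts keyed by date; the function name and the sample figures are made up for illustration. A day's increment is only kept when the previous calendar day was also recorded.

```python
from datetime import date, timedelta


def daily_increments(cumulative):
    """Turn {date: cumulative_views} into {date: views_gained_that_day}.

    An increment is kept only when the previous calendar day was also
    recorded. Otherwise the difference would span several days and look
    artificially high, so that day is dropped.
    """
    increments = {}
    for day in sorted(cumulative):
        previous = day - timedelta(days=1)
        if previous in cumulative:
            increments[day] = cumulative[day] - cumulative[previous]
    return increments


# Hypothetical cumulative counts with a two-day gap (the 3rd and 4th are missing).
sample = {
    date(2006, 8, 1): 100,
    date(2006, 8, 2): 350,
    date(2006, 8, 5): 900,
    date(2006, 8, 6): 950,
}

result = daily_increments(sample)
for day, gained in result.items():
    print(day.isoformat(), gained)
```

With the sample above, the 5th is discarded as well as the two missing days, because its increment would otherwise cover three days' worth of views.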
What have I got? Well, the interesting thing is, I don’t know yet. I’ve got raw data on 130 YouTube videos. I’ve tracked the views they received daily for up to 20 days of their lifecycle. That’s about it at this stage.

Some rough and ready charting suggests that most videos’ peak views occur during the second 24 hours of their life, and drop off during the third. But I’m not interested in most videos; I want to identify those that achieve some kind of epidemic success.
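The "which day do views peak?" question can be sketched along these lines. This is only an illustration in Python, assuming each video has a list of daily view counts starting from its first day; the video names and numbers are invented and are not taken from my data.

```python
from collections import Counter


def peak_day(daily_views):
    """Return the 1-based day of life with the most views (earliest on ties)."""
    best_index = max(range(len(daily_views)), key=lambda i: daily_views[i])
    return best_index + 1


# Hypothetical daily view counts for three videos, day 1 first.
videos = {
    "video_a": [120, 480, 200, 90],
    "video_b": [50, 300, 310, 40],
    "video_c": [10, 75, 30, 20],
}

peaks = {name: peak_day(views) for name, views in videos.items()}

# Count how many videos peak on each day of their life.
histogram = Counter(peaks.values())
print(peaks)
print(dict(histogram))
```

Videos whose peak falls well outside the most common day would be the ones worth a closer look for non-standard lifecycles.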
I’ll continue to look at this data. If it looks as though there’s anything interesting emerging (particularly about non-standard lifecycles), I’ll post it here. Meanwhile, I make the raw (edited) data available here: YouTube Lifecycle data. If anyone wants to look at the tools and scripts I pulled together, please just ask, and I’ll send them to you.
Hi Mat,
interesting project! I found a similar tool that you might find useful: http://www.vidstats.com/
(I found it a bit patchy with pulling up data on any username)
Yep, Ben: I think that rather eclipses my puny efforts in most ways. *sigh*
[...] (and criticised) study by Tubemogul on the short shelf life of online video reminded me of some research into views on YouTube videos I did back in 2006. I only looked at about 130 random YouTube videos for the first 20 days of their [...]
I thought the post made some good points on web scrapers. For web scrapers I use Python for simple things, but for larger projects I have used the extractingdata.com web page scraper, which builds custom web scrapers and data extraction programs simply and quickly.
Hi Mat,
Tell me, I’ve tried using Anthracite (1.7.4) and just when things are going to get interesting – I’m looking through vBulletin forums to see how certain phrases appear and are responded to over time – the app crashes. Did you have this problem, and how did you overcome it?
Johnny