Sampling YouTube

You can download the most recent (hopefully*) unbiased sample of YouTube videos here or you can read how it was made and why.

Right now, the sample includes 359,812 different YouTube videos, most of them uploaded in April and May 2009. The file is in CSV format with three columns: the URL (http://www.youtube.com/watch?v=EHlH0yWVKR4), the time of upload (2009-04-28T19:47:42.000Z) and the length in seconds (256). Rows are sorted alphabetically by URL. The current file's size is 25.3 MB.
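If you want to load the file in Python, a minimal sketch could look like the one below. It assumes the file has no header row and exactly the three columns described above; the function name parse_sample is made up for this illustration, and the single example row is built from the values quoted in the description.

```python
import csv
import io
from datetime import datetime, timezone

# Timestamp layout as quoted above, e.g. 2009-04-28T19:47:42.000Z
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_sample(lines):
    """Yield (url, uploaded, seconds) tuples from CSV lines.

    Assumes no header row and three columns per row.
    """
    for url, uploaded, seconds in csv.reader(lines):
        # The trailing "Z" means UTC, so attach that timezone explicitly.
        when = datetime.strptime(uploaded, TIME_FORMAT).replace(tzinfo=timezone.utc)
        yield url, when, int(seconds)


# One example row in the described format; a real run would pass an open file.
example = io.StringIO(
    "http://www.youtube.com/watch?v=EHlH0yWVKR4,2009-04-28T19:47:42.000Z,256\n"
)
rows = list(parse_sample(example))
```

For the real file you would pass an open file object instead of the StringIO, for example open(path, newline="") where path is wherever you saved the download.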

* Why "hopefully"? Some undetermined kind of sampling is used on YouTube's side. See below.

How the sample was made

While researching for my thesis, I found out that there is no simple way to get a random sample of YouTube videos. The Recent Videos page shows only 1,000 of them, which is just a small portion of what gets uploaded in a day. Research using such a sample would be biased.

I found out that at least one other researcher had the same problem. Michael Wesch (whose video you might know) got his nearly random sample with the following method: "Every 2 hours, one of 8 researchers loaded the 'Most Recent' videos on YouTube and analyzed the first 20 videos (the most recently added at that moment). This was done to eliminate the sampling bias of different times." This, of course, is pretty hard to do over long periods of time. Michael Wesch chose to do this for only 1 day. Still, his YouTube Statistics research corresponds quite accurately with what I found out, so it couldn't have been that biased.

For my thesis, I chose almost the same method as Michael Wesch, but automated it.

    • I wrote a simple Python script that accessed the YouTube API every 2 hours and fetched the 1,000 most recent videos.
    • I found out that even at this interval, two consecutive lists of videos contained duplicates. That meant that fewer than 1,000 videos were added to the Most Recent list every 2 hours.
      • If YouTube's official numbers are right, then there should be over 30 thousand new videos uploaded to YouTube every two hours.
      • That means that some kind of sampling occurs on YouTube's side. I wasn't able to determine how the sampling is done or if it's random or not. I am investigating further, though.
      • This is why I can't confidently say that the sample is really random.
    • Anyway, after I got and deduped all these videos, I had a list of over 160 thousand videos uploaded throughout April 2009. I randomly chose 385 of them, analyzed them and wrote my thesis.
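
The dedupe-and-sample step from the list above can be sketched roughly like this. It is not the original script: the functions dedupe and draw_sample are hypothetical names, the records are assumed to be (url, uploaded, seconds) tuples, and the three URLs in the example are placeholders, not real videos. The API polling itself is left out.

```python
import random


def dedupe(snapshots):
    """Merge several polling snapshots, keeping one record per URL.

    Each snapshot is a list of (url, uploaded, seconds) tuples, as it
    might come back from one poll. The first record seen for a URL wins.
    """
    seen = {}
    for snapshot in snapshots:
        for record in snapshot:
            seen.setdefault(record[0], record)
    return list(seen.values())


def draw_sample(videos, k, seed=None):
    """Pick k videos uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(videos, min(k, len(videos)))


# Two overlapping snapshots with placeholder URLs: "b" appears in both.
first = [("a", "2009-04-01T10:00:00.000Z", 60),
         ("b", "2009-04-01T10:05:00.000Z", 120)]
second = [("b", "2009-04-01T10:05:00.000Z", 120),
          ("c", "2009-04-01T11:30:00.000Z", 30)]

unique = dedupe([first, second])
chosen = draw_sample(unique, k=2, seed=1)
```

With the real data, k would be 385 and unique would hold the full deduplicated list. Note that this only makes the final pick random; it cannot remove any bias introduced by the sampling on YouTube's side.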