Short Description: This is a data scraping project that displays a week’s worth of tweets with the hashtag #storyofmylife, which were scraped using Tweepy and presented in two forms with TimelineJS and CartoDB. Done individually.
This project took me an incredibly long time to complete. I’m fairly satisfied, though; I think it turned out well. My original idea was to create some kind of clock showing turning points in people’s lives because I’m really interested in personal stories; however, I felt that the clock idea was too complicated, though I wanted to stick with the general idea. When we were given the assignment to try data mining social media websites, I really wanted to be able to scrape Facebook to see if I could get my friends’ ‘life events’. While Facebook does display these when you type ‘life events’ into the search bar and I’m sure there must have been a way to scrape them, the problem was that the Facebook SDK wasn’t working (I tried valiantly to figure the issue out, but to no avail), so I had to give that up.
My next goal was to use Twitter, but now the issue was, what was I going to do? I tested scraping many different things, from the hashtag #finalsweek (seeing as that’s on everybody’s minds) to babies (seeing as a lot of people our age seem to have babies on their minds… E.g. my roommate dreamed about having triplets, Megan Graham actually had her twins, Sarabi loves babies, and so on). It took what seemed like forever, but then I remembered what I was going to try scraping from Facebook: life events. So I tried to scrape that term, then changed it to life stories, and then finally settled on the hashtag #storyofmylife. For visual representation, I decided I would use Matt’s suggestion of Knight Lab’s TimelineJS to create a timeline with the tweets, as well as CartoDB to create a map of them.
I again modified my code so that it would scrape that hashtag, as shown below. And soon, once I finally set about sorting through all of the tweets I’d scraped, I realized I had too. much. data. I couldn’t go through that many tweets – there were far too many. My code was set to scrape 1,500 tweets because I wanted a timeline that went through a week, forgetting that this was obviously too many to go through. The thing is… I did go through them all. So I went a little overboard. But I worked with it! Below is a description of how my process went.
STEP 1: SCRAPING DATA FROM TWITTER
Because I was making so many changes and modifying my code depending on what kind of scrape I thought I’d test out, I exceeded my rate limit several times and got the error 429 from the Twitter API because I was making too many requests too fast. Life is hard. It took me a long time to get my code together despite it looking so straightforward, because I tried many different things before settling on the final topic of my scrapes/stories/project, so all of those iterations added up quickly and I guess Twitter didn’t appreciate me spamming requests at such a rate. Whatever, man. Here’s my code:
Reasoning for the things I scraped:
- [Date] Needed the dates and times for putting tweets on the timeline. No problems acquiring these.
- [Location] Needed for putting tweets on the map. But not everybody had a location, so I couldn’t use all of the tweets I scraped.
- [Location, geo-tagging] Didn’t get much out of this. Not a lot of people have this option turned on.
- [Time zones] Helpful as a backup when the location was vague: I got a tweet that said the person’s location was in the ‘UK, for now’, but their time zone stated ‘London’, which was more useful. A time zone on its own, like ‘Eastern Time, US & Canada’, was too broad to put on the map.
- [Names and screen names] To give some kind of credit to the people whose tweets I used.
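To make the list above concrete, here is a minimal sketch of pulling those fields out of one tweet. It is not the script used for this project. It assumes each tweet is available as a plain dict shaped like Twitter’s v1.1 JSON (with `created_at`, `coordinates`, and a nested `user` object); `tweet_to_row` and the sample tweet are hypothetical names and data made up for illustration.

```python
from datetime import datetime

# Timestamp layout of the `created_at` field in Twitter's v1.1 JSON,
# e.g. "Wed Dec 10 20:19:24 +0000 2014".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweet_to_row(tweet):
    """Flatten one tweet dict (hypothetical helper) into the fields listed above."""
    user = tweet.get("user", {})
    coords = tweet.get("coordinates")  # None unless the user has geo-tagging on
    return {
        "created_at": datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT),
        "text": tweet.get("text", ""),
        "location": user.get("location") or None,  # empty string becomes None
        "coordinates": coords["coordinates"] if coords else None,  # [lon, lat]
        "time_zone": user.get("time_zone"),
        "name": user.get("name"),
        "screen_name": user.get("screen_name"),
    }


# Made-up sample tweet, shaped like the v1.1 JSON.
sample = {
    "created_at": "Wed Dec 10 20:19:24 +0000 2014",
    "text": "Missed the bus again #storyofmylife",
    "coordinates": None,
    "user": {
        "name": "Example User",
        "screen_name": "example_user",
        "location": "UK, for now",
        "time_zone": "London",
    },
}
row = tweet_to_row(sample)
```

Keeping the time zone next to the location makes it easy to fall back on one when the other is too vague to map.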
And here’s what my sorting process kind of looked like (showing all tweets, map selections, and timeline selections, from top to bottom):
STEP 2: SORTING THROUGH THE TWEETS I SCRAPED
I had to sort through all of the data I collected and select the tweets I was going to use. For example, I got tweets with the locations ‘hillbilly hell’ and ‘coleworld’, which, amusing as they were, weren’t feasible to put on the map, and neither were the tweets that had no location specified (which was sad if they were perfectly good tweets). I had to go through all the tweets to make sure things came out correctly, because, for example, ampersands tend to show up like: &amp;. And although I set the language to English, I got a location that was written in Urdu for someone in Quetta, Pakistan, and somebody’s name in Korean. Sometimes I had to choose between a location and a time zone, either because I only got one of them and had nothing to confirm it with, or because I got both and they disagreed. I ended up filtering out a lot of tweets because of all this. I also removed tweets that were about the song Story of My Life by One Direction; not what I was looking for.
While I did this, I realized I had scraped way too much. It wasn’t really possible for me to go through all of them, so because I’d already started sorting through tweets, I thought I’d use a number of those and go with that. This was my first, silly, idea. But then it occurred to me that I could just go through the days and select, say, 5 tweets per day, making a total of 35 each for the map and timeline. I didn’t use the same tweets for both; for the timeline, I chose those that didn’t provide a location. So, I decided to choose tweets that I found interesting, trying to get a variety of places for the map and a variety of times throughout the day for the timeline. You would think this was easy. But, no. That sentence about it not being possible for me to go through 1,500 tweets? I take it back. I did that. For hours and hours on end. Just to make sure I was choosing a good 5 out of an average of 200+ tweets per day over those 7 days. Random fact: Going through 1,500 tweets for hours and hours on end makes you dizzy.
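Some of this cleanup can be done in code before reading anything by hand. The sketch below is an illustration rather than what was done for this project, where the sorting was manual. It assumes each tweet has been reduced to a dict with `text` and `location` keys; the helper names and the blocked phrase list are hypothetical. It decodes HTML entities such as `&amp;`, flags tweets with no location, and flags tweets that mention the song.

```python
import html

# Phrases that suggest a tweet is about the song rather than a personal story.
# This list is an assumption for illustration; adjust it to the data.
SONG_TERMS = ("one direction", "1d")


def clean_text(text):
    """Turn HTML entities like '&amp;' back into plain characters."""
    return html.unescape(text)


def has_location(row):
    """True when the tweet carries a non-empty location string."""
    return bool((row.get("location") or "").strip())


def mentions_song(row, terms=SONG_TERMS):
    """True when the tweet text contains any of the blocked phrases."""
    words = clean_text(row.get("text", "")).lower()
    return any(term in words for term in terms)


# Made-up rows for illustration.
rows = [
    {"text": "Coffee &amp; deadlines #storyofmylife", "location": "Quetta, Pakistan"},
    {"text": "One Direction on repeat #storyofmylife", "location": "London"},
    {"text": "Locked out again #storyofmylife", "location": ""},
]

cleaned = [dict(row, text=clean_text(row["text"])) for row in rows]
map_candidates = [r for r in cleaned if has_location(r) and not mentions_song(r)]
timeline_candidates = [r for r in cleaned if not mentions_song(r)]
```

A check like `has_location` only catches empty values; joke locations such as ‘hillbilly hell’ still need a human eye.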
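Grouping the tweets by day first makes the five-per-day idea easier to manage. This is a minimal sketch under the assumption that each tweet is a dict with a `created_at` datetime; `group_by_day` and `cap_per_day` are hypothetical helpers, and the dates are made up. Taking the earliest few tweets per day is only a stand-in, since choosing the interesting ones still has to be done by hand.

```python
from collections import defaultdict
from datetime import datetime


def group_by_day(rows):
    """Bucket tweets by calendar date."""
    days = defaultdict(list)
    for row in rows:
        days[row["created_at"].date()].append(row)
    return dict(days)


def cap_per_day(rows, n=5):
    """Keep at most n tweets per day, earliest first, in date order."""
    picked = []
    for _day, day_rows in sorted(group_by_day(rows).items()):
        day_rows = sorted(day_rows, key=lambda r: r["created_at"])
        picked.extend(day_rows[:n])
    return picked


# Made-up data: 7 tweets on each of 2 days.
rows = [
    {"created_at": datetime(2014, 12, day, hour), "text": f"tweet {day}-{hour}"}
    for day in (8, 9)
    for hour in range(7)
]

per_day_counts = {day: len(items) for day, items in group_by_day(rows).items()}
picked = cap_per_day(rows, n=5)
```

Printing `per_day_counts` before selecting anything also shows how many tweets each day holds, which would have flagged the 200+ per day early on.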
STEP 3: CREATING A MAP USING CARTO
After I’d finished selecting tweets (finally. I should have cried with happiness), I went about adding the 35 chosen for the map to a CartoDB map. I have some prior experience with it, but I watched the video tutorial anyway, and the process was fairly straightforward. CartoDB supports Twitter, but I wasn’t sure how to go about using that function because I think I would have needed to scrape data through Carto for this, and I was just not going to do that. So, I decided to add the points manually; no real problems here, it was just a bit time consuming because I was extremely tired (having been working all day). I just really wanted to be able to finish this project because I have other finals to focus on, too.
For some reason my map isn’t embedding, so here’s a link to see it by itself.
STEP 4: CREATING A TIMELINE USING TIMELINEJS
I watched the video tutorial for this as well, because I’d never used TimelineJS before. It wasn’t so bad; again, just a bit time consuming. Because TimelineJS also allows for Twitter content to be retrieved and I had the names and screen names of the users, I found their tweets on Twitter and decided to link them to the timeline and have them displayed that way. I customized the timeline some to make it less boring, and that’s about it, I guess.
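TimelineJS reads its entries from a spreadsheet, so the selected tweets can be turned into rows in one pass. This is a rough sketch, not the steps actually followed here. The column names are an assumption based on a TimelineJS spreadsheet template and may not match the version used at the time, so check them against the template. `timeline_row`, `rows_to_csv`, and the sample tweet (including its `id`) are hypothetical.

```python
import csv
import io
from datetime import datetime

# Assumed column names; verify against the TimelineJS spreadsheet template.
COLUMNS = ["Year", "Month", "Day", "Time", "Headline", "Text", "Media"]


def tweet_url(screen_name, tweet_id):
    """Build the public link to a tweet from its author and id."""
    return f"https://twitter.com/{screen_name}/status/{tweet_id}"


def timeline_row(row):
    """Map one tweet dict onto the assumed spreadsheet columns."""
    when = row["created_at"]
    return {
        "Year": when.year,
        "Month": when.month,
        "Day": when.day,
        "Time": when.strftime("%H:%M:%S"),
        "Headline": f"@{row['screen_name']}",
        "Text": row["text"],
        "Media": tweet_url(row["screen_name"], row["id"]),
    }


def rows_to_csv(rows):
    """Return CSV text that can be pasted or imported into a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(timeline_row(row))
    return buffer.getvalue()


# Made-up tweet for illustration.
sample_row = {
    "id": 1,
    "created_at": datetime(2014, 12, 10, 20, 19, 24),
    "screen_name": "example_user",
    "text": "Story of my life",
}
csv_text = rows_to_csv([sample_row])
```

Putting the tweet link in the media column is what lets the timeline display the tweet itself instead of retyped text.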
And here’s a link to see my timeline, because that won’t embed here, as explained in step 5.
STEP 5: PUTTING IT TOGETHER ON A WEBSITE
I went through CSS Zen Garden as Marianne suggested, but I didn’t really find a style that attracted me. I then decided instead to use WordPress; so, I created a website and then customized/edited it a fair bit because there were a lot of extraneous things that I didn’t need. Essentially, I wanted a simple website to put together the visualizations I’d created. This took a fair bit of time as well. I didn’t have a problem embedding my map into my website, but things were not so smooth for the timeline. According to TimelineJS’s website, it doesn’t work with WordPress; I looked things up, and there is a plugin – but that doesn’t work for WordPress websites. So I had to just leave a link on my website instead.
Overall, my project isn’t perfect, but I think it turned out well enough. It was a long journey, I guess, but also kind of fun. I think there’s potential to make a larger scale version of my project, which would be cool.