My Edmonton’s Many Data Sources
As I mentioned in my last post, there are over 30 different data sources used in My Edmonton, and a lot of different methods and screen-scraping were required. Again, I thought it would be interesting to developers out there to find out just how this was done, as screen-scraping (although legally grey) is a great way to produce your own API.
First, some background on screen-scraping. Many different apps out there do this, as it's often the only way to get data for an application you need. Some of my favorite apps that are based on screen-scraping are:
- Mint.com: You know, the startup who Intuit (my former employer) acquired last year for untold millions?
- Pageonce.com : Another financial website/iPhone app that I personally love for managing my finances. If only CIBC would stop blocking them...
- AppSales : an iPhone app only available for iPhone developers on jailbroken phones, that screen-scrapes Apple's sales reports website for data and presents it in a nice, easy-to-use format. As one twitter poster wrote this morning (aptly timed) : "Anyone who thinks iTunes sales reports work well should be forced to visit a website where they download their email one message at a time." This little app makes iTunes sales reports actually consumable. This has recently been supplanted by Apple's own ITC mobile app, but AppSales is still far better and was 2 years ahead of Apple's. Why Apple doesn't provide an API for this service is beyond me.
- Virtually every "cheap flight" website like Expedia uses screen-scraping behind-the-scenes to get flight schedule and price information.
So, sometimes you need to do a little screen-scraping to get by.
As I mentioned, My Edmonton uses over 30 data sources. These include:
- 23 datasets from the city's Open Data Initiative. These include amenity locations, neighborhood and ward information, garbage collection schedules, transit data, and event and construction calendars.
- 7 sets were scraped off the city's own website, including the assessment data.
- Geocoding APIs such as Google, Yahoo, and geocoder.ca
- Canada Post (for postal code lookups)
All of these datasets combined provide the immensely useful set of features that My Edmonton provides. For example, because My Edmonton knows the location of a certain home (thanks to the City's assessment data and Google Maps' Geocoding API), we can then look at various information:
- We can discover which Garbage collection zone the user is in without them having to manually look at it on a map.
- We can look at construction zones near their home.
- We can tell them which ward and neighborhood they are in, and provide details about how to call their city councilor.
- We can tell them their nearest bus stops, and the bus route schedules
- We can tell them the locations of various things like schools, playgrounds, etc. near them
To my knowledge, this is a novel use of the city's data and very useful.
Screen-scraping
This article was about screen-scraping, right?
We "only" had to screen-scrape about 8 different data sets - the city's website, and Canada Post. Three different methods were used for these:
- When possible, the city's website was scraped with Needlebase, a recent Google acquisition. Needlebase is one of the coolest tools I've used in a while. You can point it to a web page, and give it "examples" of what to look for and how to parse data. After about 3 examples, it usually gets everything right. I used this to scrape City councilor data, as well as amenities not provided in the City's Open Data Catalogue, like playgrounds, tennis courts, etc. I would have loved to gather facility schedule information, like Kinsmen's pool hours, but that information is all in disparate formats and not scraper-friendly; some are PDFs, some are web pages. Given enough time, I could gather this, but it wasn't worth it for this competition.
- Canada Post was scraped using a custom Ruby script using a package called Mechanize. This site is NOT scrape-friendly, as they have a kind of interview where you have to fill out multiple forms to get the data you want, and it looks like their site is Microsoft-based and all their form names were very long and convoluted.
- The city's assessment website was scraped with a custom Ruby script using nokogiri to parse the HTML. Originally, my teammate Terry wrote it in C#, but it wasn't performant enough and didn't interface with a database, so I translated it to Ruby. The script makes heavy use of caching to avoid any duplicate calls to the website. As I mentioned, we've sent over 3.5 million requests to the city's servers in 2 months, giving us now 200,000 assessments.
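To give a flavour of the caching idea without posting the real scripts, here is a minimal sketch in Python (not the Ruby that was actually used). The class name `CachedFetcher`, the `pages` table and the injected `fetch` callable are all hypothetical; the only point is that a page already stored locally is never requested from the website again.

```python
import hashlib
import sqlite3


class CachedFetcher:
    """Fetch pages through a local SQLite cache so no URL is requested twice."""

    def __init__(self, fetch, db_path=":memory:"):
        # `fetch` is any callable that takes a URL and returns the page body
        # as text (hypothetical; plug in whatever HTTP client you use).
        self.fetch = fetch
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pages (key TEXT PRIMARY KEY, body TEXT)"
        )

    def get(self, url):
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()
        row = self.db.execute(
            "SELECT body FROM pages WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: the website is not contacted
        body = self.fetch(url)  # cache miss: one real request
        self.db.execute("INSERT INTO pages (key, body) VALUES (?, ?)", (key, body))
        self.db.commit()
        return body
```

Passing a file path as `db_path` instead of `":memory:"` would let the cache survive restarts, which matters for a scrape that runs for weeks.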
I prefer not to post code, as there is a lot of it, but if anybody is interested in seeing code, feel free to leave a comment.
Building My Edmonton
With over 400 hours of effort and countless hours of computer crunching put into My Edmonton, this app is a big app. I thought it might be interesting to other developers out there to hear the story of how this app evolved and some of the technical challenges behind it.
The beginnings of My Edmonton
The first time I'd ever heard of the Apps4Edmonton competition, I was at Startup Weekend on June 25th. For those who don't know, this is a balls-to-the-wall weekend of business development; Friday night you come up with an idea and pitch it to your peers; the best are chosen and teams are formed around an idea. Saturday and Sunday are full days of hacking together the idea and product, and by Sunday night, your fully-formed (or half-baked) product is born. It's totally my style and was a great event.
A few days before Startup Weekend began, Reg Cheramy and I were brainstorming over what kinds of apps we could build with the City's Open Data. I saw that they had locations of various amenities - bus stops, parks, schools, and this immediately made me think of a real estate app. I've always been a huge fan of real estate, and had actually prototyped a site like Comfree in 2006, which I abandoned in favor of the company I was involved with until 2009, Zigtag. So naturally, I was excited at the possibility of creating a new real estate app in 2 days of serious hacking.
What came out of Startup Weekend was a pretty cool iPhone app called HomeCricket: it was much like the iPhone google maps interface, and you could see the value of almost any property in the city (as long as we had the assessment data for that address - I'll get to that in a bit) as well as all the amenities nearby to that property. It seemed like a really cool app, one that would allow newcomers to the city to explore their new city, or real estate buyers to get a decent picture of their property values and nearby amenities, or ordinary citizens to explore their neighborhoods.
The key piece to the entire app was assessment data - the city's vast dataset of the value of every single property in the city. This key piece of data was actually not available through the Open Data initiative due to several administrative issues at the city, but it was available through a web portal on a one-by-one basis. Over the course of the weekend, my teammate, Terry O'Neil wrote a scraper that would run through a list of addresses and obtain the assessment for each address through the city's web portal. In two days of babysitting the scraper, we obtained something like 10,000 assessments - enough to make a decent demo for Sunday night.
Over the course of the next several months (and yes, it is still ongoing), I took Terry's script and modified it to be slightly more efficient, and kept hammering the city's servers to obtain all the assessment data (with the permission of the city's IT guys, of course). Currently, after trying over 3.5 million different addresses, we're at about 190,000 records, out of a reported 210,000.
HomeCricket's morph into My Edmonton
For the next 2 months, I refined HomeCricket into a really nice, usable iPhone app. I had learned from my Fringe submission experience, in which I completely underestimated the competition (congrats, @skabenga) and didn't get the services of a designer or gather much user feedback; this time would be different. I sent copies of the app to a bunch of my friends to get their feedback. The response was less-than-overwhelming; most people really didn't "get it". Apparently, I was the only one who cared about how cool this real estate app was.
So, back to the drawing board. One of the comments we received at Startup Weekend was from Anne Matthews, who encouraged us to change this app into an Edmontonian-centric app, featuring local events and news. So, with that in mind, My Edmonton was born, incorporating news, events, and a plethora of information about your neighborhood. No longer was it centered around real estate values, but more around things ordinary Edmontonians would care about:
- When is my garbage picked up? Can I be notified the night before?
- What construction is going on in my neighborhood and how will it affect my commute?
- How do I call my city councilor or report a problem to the city?
- How do I know if my son/daughter is playing soccer tonight if it's raining? Are the fields closed? (This one is near-and-dear to me, as when I ref'd soccer many years ago, I'd spend hours trying to get through the busy signal on the city's sports field information phone line to see if I was reffing).
- Where is my nearest school, playground, police station, etc?
In addition, we mapped all the amenities that we could find, through the Open Data catalog, or by scraping the city's site for other amenities that weren't included, like pools, playgrounds, tennis courts, hockey arenas, etc. After another round of feedback from users, I felt that this app had what it took to appeal to regular Edmontonians. With a little help from the HomeCricket team, Christine Panter and Sebastien L'Homme, we prettied it up with some nice graphics and I spent a Friday night bug-finding with Christine.
Technical stuff
The technology behind My Edmonton is actually pretty impressive. Having never really played with GIS data (i.e. map coordinates) before, I found this a learning experience. In particular, trying to figure out if a point exists within an odd polygon was an interesting challenge. MySQL supports GIS data, but operates exclusively on minimum-bounding rectangles (MBRs): if you took a circle and drew the smallest square around it that encompasses it, that square is the MBR. With things like garbage collection zones and neighborhoods, many MBRs can overlap, so a point can fall inside several MBRs at once, making it a chore to figure out which actual zone it is in. Luckily, there are established algorithms out there that made the task a bit easier. (Yes, I do know that Postgres has full GIS support).
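One of those established algorithms is ray casting (the even-odd rule): count how many polygon edges a horizontal ray from the point crosses, and an odd count means the point is inside. The sketch below, in Python, is not necessarily what My Edmonton runs. It uses the MBR as a cheap first filter and then does the exact test. It treats coordinates as flat x/y pairs, which is an assumption that is reasonable at city scale, and the function names and zone shapes are made up for illustration.

```python
def bounding_box(polygon):
    """Return the minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_bounding_box(point, box):
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def point_in_polygon(point, polygon):
    """Ray casting: toggle `inside` each time a ray going right crosses an edge."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def find_zone(point, zones):
    """Return the name of the first zone whose polygon contains the point."""
    for name, polygon in zones.items():
        # Cheap MBR test first, exact polygon test only if it passes.
        if in_bounding_box(point, bounding_box(polygon)) and point_in_polygon(
            point, polygon
        ):
            return name
    return None


# An L-shaped zone and a square zone tucked into its notch: their MBRs overlap.
zones = {
    "A": [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)],
    "B": [(2, 2), (4, 2), (4, 4), (2, 4)],
}
print(find_zone((3, 3), zones))  # inside A's MBR, but actually in zone B
```

Points that sit exactly on a boundary can land on either side with this test, so a real implementation would need a rule for those.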
Then there was the assessment data. As I mentioned, I've been scraping the city's site for over two months, sending one request per second to not overload the city's servers and be banned (I actually did crash their servers a few times - sorry!). You'd think this would be fairly simple, except that, to my knowledge, there is no publicly available complete list of addresses in Edmonton (I'm sure you could buy one, as that's important to marketers and census people, but I couldn't find one). So, then it became necessary to come up with a way of guessing the addresses. Initially, we used a naive approach, i.e. 101, 1 Ave NW; 102, 1 AVE NW; you get the picture. Then, WAY too late, I discovered that Canada Post actually had a search feature, where if you inputted a street name, it would give you a list of house numbers. Luckily, the city does have a list of street names available, so I began scraping Canada Post to get a list of addresses, which were then fed into the City's assessment site. Suddenly, my hit rate on a given address went from 1/100 to 1/4 - much better!
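The guess-and-check loop can be sketched roughly as follows, in Python rather than the Ruby that was actually used. The names `candidate_addresses` and `scrape`, the address format and the injected `lookup` callable are hypothetical; the one-second delay mirrors the rate described above. Swapping the naive number range for house numbers taken from Canada Post is what raises the hit rate.

```python
import time


def candidate_addresses(street, house_numbers):
    """Yield one candidate address per house number on the given street."""
    for number in house_numbers:
        yield f"{number} {street}"


def scrape(addresses, lookup, delay_seconds=1.0, sleep=time.sleep):
    """Try each address, pausing between requests; return hits and hit rate.

    `lookup` is any callable that returns an assessment, or None for a miss.
    """
    found = {}
    attempts = 0
    for address in addresses:
        if attempts:
            sleep(delay_seconds)  # throttle so the remote server is not flooded
        attempts += 1
        result = lookup(address)
        if result is not None:
            found[address] = result
    hit_rate = len(found) / attempts if attempts else 0.0
    return found, hit_rate
```

The naive approach would be `candidate_addresses("1 AVE NW", range(101, 200))`, while the Canada Post approach passes only the house numbers that are known to exist.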
Address Parsing
Who knew that addresses could be represented in so many different ways? One of the cornerstones of the app is the ability to type in any address, and have it intelligently figure out the exact location that you mean (geocoding). This is very similar to Google Maps, except Google Maps doesn't do a very good job. My algorithm relies first on an exact or close match to an address I have in my database. If I can't find a match (which is often the case), then my backup is to ask Google Maps where the user might mean, and then look in my database again for the lat/long coordinates Google brings back. However, this isn't foolproof. For example, if you type in 10830 38 A Ave, Google Maps does not recognize that "38 A" is actually "38A", and will ask you about 38 Avenue instead, and not even suggest 38A Ave. This was just one of the problems I encountered. People input addresses in many weird ways - for example "5-58.ave.nw" or 304-10611-111st. The variability in addresses coming in is actually a nightmare, and I've resorted to keeping a table of all address lookups and whether they were successful, so that I can improve the parsing algorithm accordingly. By the end of this, I should have a really nice entity resolution algorithm.
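To illustrate the kind of clean-up involved, here is a small Python sketch that normalizes the three troublesome inputs mentioned above before they are matched against a database. It is an assumption-laden illustration, not the parser used in My Edmonton: the rules, the `STREET_TYPES` table and the upper-case output format are all made up, and real input needs many more cases.

```python
import re

# Hypothetical mapping of street-type spellings to one canonical form.
STREET_TYPES = {"AVE": "AVE", "AVENUE": "AVE", "ST": "ST", "STREET": "ST"}


def normalize_address(raw):
    """Reduce a free-form Edmonton-style address to a canonical string."""
    text = raw.upper()
    # Treat dots, commas and dashes as separators: "5-58.AVE.NW" -> "5 58 AVE NW".
    text = re.sub(r"[.,\-]+", " ", text)
    text = " ".join(text.split())
    # Split a street type glued onto a number: "111ST" -> "111 ST".
    text = re.sub(r"(\d)(ST|AVE)\b", r"\1 \2", text)
    # Join a lone suffix letter to its number: "38 A AVE" -> "38A AVE".
    text = re.sub(
        r"\b(\d+) ([A-Z]) (?=(?:AVENUE|AVE|STREET|ST)\b)", r"\1\2 ", text
    )
    tokens = [STREET_TYPES.get(token, token) for token in text.split()]
    return " ".join(tokens)


for example in ("10830 38 A Ave", "5-58.ave.nw", "304-10611-111st"):
    print(example, "->", normalize_address(example))
```

Normalizing only gets the tokens into a predictable shape; deciding whether the leading "304" or "5" is a unit number still needs to be checked against known addresses.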
Entity Geocoding
The next challenge was getting lat/long coordinates for a bunch of amenities. The City's open data sets did provide GIS coordinates, but those that I scraped from the city's website, like tennis courts, playgrounds and pools, did not; they just provided addresses. I thought that these would be quite easy to geocode using Google's and Yahoo's APIs - and I was wrong. Neither API was able to resolve about 50% of the addresses, and I resorted to other means, like looking for schools or community leagues with the same name and approximating their locations. Even so, about 10% of these amenities had to be manually geocoded.
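The fallback order described above can be sketched as a simple chain in Python. Everything here is hypothetical: the geocoders are passed in as plain callables (the real Google and Yahoo APIs are not shown), and the name matching is a naive substring test against places whose coordinates are already known.

```python
def geocode_with_fallbacks(name, address, geocoders, known_places):
    """Return (position, source) using the first strategy that succeeds.

    geocoders: callables taking an address and returning (lat, lon) or None.
    known_places: dict mapping a place name (e.g. a school or community
    league) to a (lat, lon) that is already trusted.
    """
    # 1. Ask each geocoding service in turn.
    for geocode in geocoders:
        position = geocode(address)
        if position is not None:
            return position, "geocoder"
    # 2. Approximate from a known place whose name appears in the amenity name.
    lowered = name.lower()
    for place_name, position in known_places.items():
        if place_name.lower() in lowered:
            return position, "approximate"
    # 3. Nothing worked: leave it for manual geocoding.
    return None, "manual"
```

Returning the source alongside the position makes it easy to list the approximate and manual cases later and review them by hand.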
And then, yet again, there are the assessments. Every one of the 200,000 assessments I've collected had to be geocoded so they could be placed on a map. 95% of these were successful, but if you look closely, some of the locations aren't so good. In lots of regions, the assessments are placed evenly on a street, but every so often, you find a cluster of houses, when they should be evenly placed. And a small proportion of those addresses actually resolved to somewhere close to Hwy 16 west of Stony Plain... why Google would locate them there, I have no idea. I actually had to tweak the addresses to force Google to resolve them correctly. I only realized this problem when I had an error the other day, where a user's garbage zone was not returning any results, which should be impossible, as every assessment should be in a garbage collection zone. I ended up running a search for every assessment to check whether it was placed within an appropriate region and close to other similar addresses (i.e. if the street and avenue were similar, but more than a kilometer apart, then something was obviously wrong).
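A check along those lines might look like the Python sketch below. It is only an approximation of the idea: "similar addresses" is assumed here to mean the same street and the same hundred-block of house numbers, the reference point is the median position of that block, and the 1 km threshold comes from the rule of thumb above. The function names and record layout are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def suspicious_geocodes(records, max_km=1.0):
    """Flag addresses that sit far from the rest of their block.

    records: iterable of (house_number, street, (lat, lon)).
    Returns a list of (house_number, street) worth re-geocoding.
    """
    blocks = defaultdict(list)
    for house_number, street, position in records:
        blocks[(street, house_number // 100)].append((house_number, position))
    flagged = []
    for (street, _block), members in blocks.items():
        # The median is robust to a few badly placed points in the block.
        centre = (
            median(position[0] for _, position in members),
            median(position[1] for _, position in members),
        )
        for house_number, position in members:
            if haversine_km(position, centre) > max_km:
                flagged.append((house_number, street))
    return flagged
```

Blocks with only one or two geocoded addresses give the median little to work with, so those are better compared against neighbouring blocks or checked by hand.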
Lastly, the road construction data had to be geocoded. I had envisioned a really cool interface, where, using Google Maps directions, I could show a red line over the affected region. Again, this worked in 50% of the cases, but it turns out that Google is really terrible at recognizing intersections. Calgary Trail and 23 Ave is geocoded to South Park shopping center (Cgy Tr and 40 Ave - way off!). And Anthony Henday and 34 St is geocoded to somewhere south of Ellerslie and nowhere close to 34 St. So, some of my construction zones were over 10 km long! Again, through a LOT of massaging and some manual tweaking, I managed to get most construction zones coded with a route, but many are just placed with one marker that resolved correctly and don't show a red line. I'll try to fix this as time goes on.
Data inconsistency
Another problem I encountered was the lack of data consistency from the City's data. Garbage collection schedules, for example, worked fine when queried through their web portal, but the same data set was not returned when downloading the CSV. Incidentally, another lesson I learned from my Fringe experience was to always store a local copy of the data that I controlled. As a developer, you're responsible for the integrity of your data, even if it's from a 3rd party. During the Fringe, I had assumed that I could query the City's API directly. But 2 days before the Festival started, the Fringe released a new schedule and changed the date format, and my iPhone app could not parse the new format and failed. Given that the App Store has a turnaround time of about a week, an updated version could not be put on the store until halfway through the festival, and I pulled the app from the store to avoid people buying an app that didn't work.
Thus, My Edmonton relies solely on its own cached copy of data, except where real-time feeds are required (events, news, etc). The Garbage collection schedule problem is resolved now (thanks to Devin Serink at the City), but it's frustrating to have so much data inconsistency.
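One way to protect a cached copy from the kind of format change that broke the Fringe app is to validate a fresh download before letting it replace the stored data. The Python sketch below assumes rows of (name, date string) and a short list of accepted date formats; both are hypothetical, as are the function names.

```python
from datetime import datetime

# Hypothetical list of date formats the feed has been seen to use.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Parse a date in any known format, or raise ValueError."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def refresh_local_copy(current_rows, download):
    """Return freshly downloaded rows only if every one of them parses.

    `download` is any callable returning (name, date_string) pairs. On any
    download or parse failure the existing local copy is kept untouched.
    """
    try:
        fresh = [(name, parse_date(day)) for name, day in download()]
    except Exception:
        return current_rows
    # An empty download is treated as suspicious, not as "no data".
    return fresh or current_rows
```

With this shape, an upstream change means the app keeps serving slightly stale data instead of failing outright, and the rejected download can be logged for a fix.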
Future directions
Now that we have a rock-solid base of data, the question remains - what else can I do with this data? (I guess this is the same question the Apps4Edmonton contest is meant to answer as well.) I doubt the city will ever publicly release this data because of too many legal issues - so what kind of cool things can we do? A heat map of property values in the city is one idea. Any others?
