Portable data exploration with Docker: looking at mobility in Mexico City
Posted in data visualization
In many data projects, exploration is an important component of the analysis pipeline. Modern tools such as D3 allow developers to make exploration much more complete and flexible; the downside is that the added complexity can make the resulting program difficult to share or even reproduce. Fortunately, Docker containers can make our lives easier by encapsulating this complexity and letting team members and clients focus on the results.
Understanding and visualizing the data is an important step towards the success of any data project, either as a preamble to feature engineering or simply for presenting results. In my experience, interactive visualization can convey deeper insights and has the added bonus of letting others interact with the data directly. However, this can get impractical depending on factors like dataset size and presentation format: having to install R and a bunch of system dependencies on the client's laptop is not great; Shiny doesn't seem to cope well when the app or the data get relatively large; and paying for a full web server is not ideal in some projects.
Containerization technology has been on the rise for a few years now, especially in large-scale systems, because it provides standard objects (containers) in which each piece of the system can be encapsulated with all of its dependencies and from there interact with other parts of the system. It allows a sort of zoomed-out version of object-oriented programming principles, which simplifies development enormously. For data projects, the lightweight virtualization of Docker containers also allows greater flexibility. In the sections below I'll explain how this can be achieved for a non-trivial project involving different technologies; the use case is exploring how Mexico City's bike sharing system works on a day-to-day basis.
Visualizing bike mobility in Mexico City
Ecobici is Mexico City's public bike sharing system. It started in 2010 and to date comprises around 500 stations spread across some of the busiest commercial and corporate districts in the city, moving hundreds of thousands of users each month. The service has an online API where anyone can request real-time as well as historical anonymized data; putting this together with public geographic data from INEGI, we can get a good idea of the general mobility patterns in the system. Furthermore, even though the data is anonymized, using Google's geocode API we can roughly see where most of the transport flow takes place.
For the sake of demonstration, the data snapshot shown here contains only one month of data, which is around 70 MB, but depending on how much historical data we want to look at, the dataset can grow to a few GB. To make dynamic figures we need to query this data efficiently, so storing it in indexed MySQL tables can be a good idea. From there, we can use Python to do some preprocessing and D3 to generate the figures.
The full interface looks like this:
Using some simple algorithms on top of this gives us some interesting insights into the system: network methods such as Louvain modularity and eigenvector centrality calculations tell us that there are three main subnetworks in the system (though fancier algorithms would probably offer more granular insights). For those who know Mexico City, they are strongly associated with Polanco (pink), Paseo de la Reforma and downtown neighborhoods in general (green), and colonia Narvarte (red) respectively. Stations exactly on Paseo de la Reforma form a cluster of central nodes in the network. Another interesting feature is that there are stations spread over the network seemingly at random but closely related to each other (in a shade of pale green, maybe not too clear in this picture in particular). Those are probably stations in which bikes are resupplied or redistributed manually by system operators.
Clicking on stations in corporate neighborhoods, we can see that daily usage patterns are trimodal, which corresponds to the opening and closing times of businesses as well as lunch time; rush hours show up as activity spikes.
Traveled distance from particular stations is very uneven throughout the day, which means that users departing from those stations travel significantly farther at certain times; this could be related to congestion in other transportation systems at rush hours, but this is only a guess.
Besides general findings, this could be useful for fine-grained analysis across stations, dates, user demographics and so on.
Containerizing the analysis
By placing a MySQL dump file in a cloud storage bucket and uploading the containerized Python application's image to a registry like DockerHub, anyone else can reproduce this project on their local machine with minimal effort, provided they have Docker installed and are running Linux (or at least have access to a Linux VM).
To avoid wrapping everything up in a relatively large Linux image, I used two separate images, one for the MySQL instance and the other for Python. In this case I think the separation is not strictly necessary because the project is small enough, but it improves modularity anyway.
If you want to run this project, a run.sh script like the following would do:

    #!/bin/bash

    echo "Downloading latest data batch (temporary)"
    curl https://storage.googleapis.com/ecobici_app_data/ecobici.sql > ecobici.sql

    echo "setting up containerized MySQL instance..."
    sudo docker network create ecobicinetwork
    sudo docker run \
      --network ecobicinetwork \
      --name ecobicidb \
      -e MYSQL_USER=ecobici \
      -e MYSQL_PASSWORD=ecobici \
      -e MYSQL_ROOT_PASSWORD=ecobici \
      -e MYSQL_DATABASE=ecobici \
      --expose 3306 \
      -d \
      mysql:5.7.30

    sleep 15 # need to wait for mysql initialization to complete
    sudo docker exec -i ecobicidb sh -c 'exec mysql -u"$MYSQL_USER" -p"$MYSQL_PASSWORD" ecobici' < ecobici.sql

    echo "setting up local server..."
    sudo docker run \
      -d \
      --network ecobicinetwork \
      --name ecobiciapp \
      --expose 3000 \
      -p 3000:3000 \
      nestorsag/ecobici_app

    echo "service available at localhost:3000"
    rm ecobici.sql
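The fixed sleep 15 in the script is a guess at how long MySQL takes to initialize, so on a slower machine the import step could run before the database is ready. One possible alternative is to poll until a check command succeeds. The sketch below is only an illustration: wait_for is a hypothetical helper that is not part of the original project, and the attempt count and delay in the usage comment are arbitrary values.

```shell
# wait_for MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper (not in the original run.sh): reruns COMMAND until it
# succeeds or MAX_ATTEMPTS is reached, sleeping DELAY_SECONDS between tries.
wait_for() {
  local max_attempts=$1
  local delay=$2
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Run the check quietly; a zero exit status means "ready".
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    # Only sleep if another attempt is coming.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1  # the command never succeeded within the allowed attempts
}

# Possible use in place of `sleep 15` (assumed check: a trivial query over
# TCP with the same credentials that the import step uses):
#
#   wait_for 30 2 sudo docker exec ecobicidb \
#     sh -c 'mysql --protocol=tcp -u"$MYSQL_USER" -p"$MYSQL_PASSWORD" -e "SELECT 1" ecobici' \
#     || { echo "MySQL did not become ready in time" >&2; exit 1; }
```

The check in the comment connects over TCP rather than the default socket. As far as I know, the official MySQL image runs a temporary server without networking while it initializes, so a socket-based check might pass too early; treat that detail as an assumption worth verifying against the image's documentation.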