This tutorial will introduce PostgreSQL, a Relational Database Management System (RDBMS), and PostGIS, a spatial extension for PostgreSQL that allows for storage of geographic data and for geographic queries. In this tutorial, we will install PostgreSQL on our local machine, install and enable the PostGIS extension, then load a CSV and a Shapefile into the database. Finally, we'll run a few queries that show how the database is used and to sample some of its functionality.
The datasets we'll be working with today are a CSV of coffee shops in the City of Cambridge, and a shapefile of Cambridge Neighborhoods.
Two items for preparation:
The rest we will handle in the following steps.
Create a database containing Coffee Shops and Neighborhoods in the city of Cambridge. Once created, we'll run some queries that will tell us the number of coffee shops in each neighborhood and the closest coffee shops to a specific location.
The first step is to get PostgreSQL and PostGIS up and running on your machine or your database web server. Once PostgreSQL and PostGIS are installed, the process for running analysis is largely the same, all you have to do is adjust the address and credentials you use to access the database. The steps below will show you how to install PostgreSQL and PostGIS on your localhost machine, which is great the development, testing, and learning.
Start by downloading and installing PostgreSQL, the main RDBMS. You can find it at the following link. Download the most current, non-beta version for your operating system. For example, PostgreSQL version 9.5 on Mac.
On your machine. Download for your appropriate OS, open the installer and install PostgreSQL.
Once in StackBuilder, install PostGIS. Find it under Spatial Extensions.
In general, you can accept the defaults, it will put things in the right place and enable you to work with PostGIS in your databases.
We should be ready to go! Next we'll connect to our PostgreSQL instance using pgAdmin, create a database we can work within, and load data into that database.
If you want to access your data over the web, which will often be the case for building web applications and online maps, you can easily set up PostGIS on Amazon Web Services. Follow the tutorial on the AWS documentation below to set up your database, then install PostGIS once your database is setup.Setting up a PostgreSQL Database on Amazon RDS
Important Note: Make sure you make a big enough RDS instance for your data. Default is 5 gigabytes, you might want more.
Next, connect to your database. Do this using pgAdmin (download it if you don't have it). Using pgAdmin, you can following the instructions at the following AWS documentation link.Connecting to a DB Instance Running the PostgreSQL Database Engine
Last, you need to enable the PostGIS extension, which is included with the PostgreSQL RDBMS install. Find the instructions at the following documentation site.Setting up PostGIS on Amazon RDS
Voila, you have PostGIS running on Amazon! Connect to it as you would any other database!
For development purposes, this will work great. Optionally, for further security or to move towards a production environment, set up a proxy running through a EC2 VM, and only accept connections from this proxy. View the following tutorial for a nice summary on how to do this. Creating a Geospatial Database on Amazon RDS by Felix Palmer.
Now, let's get into our database.
Connect to your database using pgAdmin to confirm proper setup. Find pgAdmin in your Applications folder and launch it. Navigate to File -> Add Server. Here, input your database credentials. This requires the default username (postgres) and the password we created for it in the installation step. We are connecting to our localhost, but if you wanted to connect to another location, replace localhost with your desired IP address. Fill out the credentials.
After you successfully connect, your database connection will appear in the pgAdmin window. Click on it to view the contents of your database. We don't have any data in there yet, but we have a structure of the database and all of the templates and defaults.
Create a new database to hold our datasets. Right click on Databases, and select New Database.
In the New Database dialog, setup the following.
Click OK. Your new database will be created and appear in pgAdmin. Click on cambridge to highlight it and select it. We can now add a table.
There are a number of different ways you can get data into your database. We will look at two of them in the following steps. The first we will look at is loading a CSV saved on your machine, then we will look at using the Database Manager in QGIS.
With cambridge highlighted, navigate to Tools -> Query Tool. This opens up our SQL query window in pgAdmin that we can use to send statements and queries to and from this database. We would like to load the cambridge_coffee.csv
In the Query Tool, create a new empty table called cambridge_coffee_shops using the following SQL statement. This is an empty table based on the schema of our CSV. For the PostgreSQL data types available, view the documentation pages on Data Types.
CREATE TABLE cambridge_coffee_shops ( id serial NOT NULL, name character varying(50), address character varying(50), city character varying(50), state character varying(50), zip character varying(10), lat numeric, lon numeric );
Now we have an empty table we can load our data into.
View your new table. It is nested within our cambridge database, in Schemas -> public -> Tables -> cambridge_coffee_shops. Once you find it, right click on it and select Import. See the figure below.
In the Import dialog, go through the tabs to set up the settings for our table import based on our CSV.
Click Import, it will import, pretty quickly, then click Done
View the table by right clicking on it and selecting View Data -> View All Rows. Here is our dataset.
PostGIS uses a special geometry field in the dataset that encodes our geometry into a large variable character string. These are unreadable to the human eye, but are interpreted by the database and GIS software as the geometry. Before we try to view this dataset in QGIS, we need to create this field. It is by default called geom, so we will alter our table to include in this column, then update it using a PostGIS statement that takes our latitude and longitude columns and interprets them into an encoded string in the geom column. Make sense? If not, we'll demonstrate in these next steps.
Create the geom field using the following statement in the Query Tool for our table. Use the ALTER statement to add a column. Name it geom. For the data type, give it a type of geometry(POINT,4326). The arguments are type of geometry (POINT) and coordinate system (EPSG code, which for WGS84 is 4326).
ALTER TABLE cambridge_coffee_shops ADD COLUMN geom geometry(POINT,4326)
Now we want to populuate the geom field. This can be done using a simple UPDATE statement on our geom column, and then using two simple PostGIS functions that set the coordinate system (ST_SetSRID) on a point made from our latitude and longitude fields (ST_MakePoint. Note, all PostGIS functions start with ST. Run the following in the Query Tool.
UPDATE cambridge_coffee_shops SET geom = ST_SetSRID(ST_MakePoint(lon,lat),4326);
Double check our table.
CREATE INDEX cambridge_coffee_shops_gist ON cambridge_coffee_shops USING gist (geom);
We have data! And it is in our database!
Next, open QGIS and connect to our database to view our data. In QGIS, we can use the Database Manager to import a Shapefile into our PostGIS database.
Open QGIS. To view our data in QGIS, we need to connect to our database. Create a new blank document.
In the blank document, create a connection to our PostGIS database. Open the Browser Panel if it is not open, and the find and right click on PostGIS in the list under the Browser Panel. Select New Connection.... Enter your credentials for your database and select Also list tables with no geometry. This will let us see tables in our database that don't have geometry as well as those that do.
Test your Connection, and Click OK.
Click your connection in the Browser Tab, and then drag it into your map. You should get your coffee shops on the map.
Add the cambridge_neighborhoods shapefile to your map. Note that it is also in WGS84 projection system. If the data you are working with is different, reproject it now. Then, click on Database -> DB Manager. In the DB Manager, select the cambridge database, public schema, and then click import at the top of the window. This will let us select a layer in our map to import to the database.
In the dialog, navigate to cambridge_neighborhoods, and setup the import by stating the schema, table, keys, SRID, and other options.
Click OK. Your dataset will import.
Right click on your database in Browser Panel and Refresh your database. Your layer will appear in the browser, and our dataset is in PostGIS.
Close QGIS, we'll run some PostGIS queries in pgAdmin to demonstrate the usage. Going back to pgAdmin, we can now get into our cambridge database and run some spatial queries. View the database in pgAdmin, you should see both of our tables. (If not, right-click the database to refresh the connection). View the cambridge_neighborhoods dataset by right-clicking and selecting View Data -> View All Rows. Your dataset should look as follows.
Go to the Query Tool. Let's run some queries!
When running queries, you will become very intimate with the PostGIS documentation and functions. Read all about these here. Put in simple terms, you essentially write SQL statements, such as SELECT, ALTER, CREATE, etc. and use PostGIS functions as geometric operators on the tables you are including in your statement. For our first task, let's find how many coffee shops are in each neighborhood. To do this, we want to run a selection by intersection. The PostGIS equivalent for us will be ST_Intersects. Your query, as such, will look like the following.
SELECT cambridge_neighborhoods.name as name, count(*) FROM cambridge_coffee_shops, cambridge_neighborhoods WHERE ST_Intersects(cambridge_coffee_shops.geom, cambridge_neighborhoods.geom) GROUP BY cambridge_neighborhoods.name;
Entering this statement will yield the following selection, giving a count of coffee shops by neighborhood.
For our second task, order the coffee shops by distance to 77 Massachusetts Ave and return the name and address of the shop. The latitude and longitude at 77 Massachusetts Ave is 42.359055, -71.093500. To do this, we want to ORDER BY distance. We can use the distance operator (<->) in PostGIS to get us these values, and we need to make a point from our input latitude and longitude. Your query, as such, will look like the following.
SELECT name, address FROM cambridge_coffee_shops ORDER BY geom <-> ST_SetSRID(ST_MakePoint(-71.093500,42.359055),4326);
Entering this statement will yield the following selection, coffee shops ordered by distance (as the crow flies) to MIT.
For this example, unit doesn't really matter. But if we want to do more proximity analysis, we should probably project the dataset.
For our next task, we want to select coffee shops based on a buffer around Harvard Square. To do this, we need to project our coffee shop dataset to a local coordinate system that will have more accurate distances. To reproject your data, we need to ALTER our table so that the geom field is in the right coordinate system. Let's tranform our data to UTM Zone 19N, where Cambridge is located. Because we are already using the WGS84 datum, we'll use the UTM projection based on WGS84. The projection unit is meters. The EPSG code for this is 32619. We accomplish this using ST_Transform in PostGIS.
a. Create a New Table
Create a new table, called cambridge_coffee_shops_utm, using the following statement.
CREATE TABLE cambridge_coffee_shops_utm AS SELECT * FROM cambridge_coffee_shops;
b. Reproject the New Table
Reproject the cambridge_coffee_shops_utm dataset so that the geom column is based on UTM Zone 19N. Use the following statement to ALTER our table with the ST_Transform function.
ALTER TABLE cambridge_coffee_shops_utm ALTER COLUMN geom TYPE geometry(POINT,32619) USING ST_Transform(geom,32619);
A nice summary of projection transformation can be found on the PostGIS tips page.
c. Run the Query
Run the query on cambridge_coffee_shops_utm. We are going to use the ST_DWithin function and create a distance in meters , as that is the unit of our projection. The coordinates of the center of Harvard Square are 42.373570, -71.118961. The ST_DWithin function takes two geometries and distance. The first geometry is going to be the geom field from our coffee shops dataset. The second is going to be the geometry of a point we decide, the center of Harvard Square. We can get this using ST_SetSRID and ST_MakePoint. Because we are providing a latitude and longitude in WGS84 (4326), we need to transform (ST_Transform) the point to UTM 19N (32619). The last argument, distance, is in meters because it is based on our projection which we just set to be UTM 19N for the dataset. This number can be anything in meters, or in our case 500. All said and done, our query will look like the following.
SELECT name, address FROM cambridge_coffee_shops_utm WHERE ST_DWithin(geom,ST_Transform(ST_SetSRID(ST_MakePoint(-71.118961,42.373570),4326),32619),500);
Entering this statement will yield the following selection, coffee shops within 500 meters (as the crow flies) of Harvard Square.
The results of our query are not ordered. We could use an ORDER BY to get them in an order from the station.
As you can, see from this point on, the opportunities are almost limitless as to what you can do. Datasets can get much larger, and you can integrate these technologies within your applications. Explore the PostGIS Documentation to learn more of the functions available and capabilities of the system, add a few more datasets to your database that could enable some big data mining, and integrate this with Leaflet or OpenLayers to create nice online web maps. Happy spatial databasing.
Go to Main DUSPviz Tutorials Page