Working with PostgreSQL and PostGIS

This tutorial will introduce PostgreSQL, a Relational Database Management System (RDBMS), and PostGIS, a spatial extension for PostgreSQL that allows for storage of geographic data and for geographic queries. In this tutorial, we will install PostgreSQL on our local machine, install and enable the PostGIS extension, then load a CSV and a Shapefile into the database. Finally, we'll run a few queries that show how the database is used and to sample some of its functionality.

Cambridge Coffee Shops and Neighborhoods

The datasets we'll be working with today are a CSV of coffee shops in the City of Cambridge, and a shapefile of Cambridge Neighborhoods.


Preparation

Two items for preparation:

  1. If you don't have QGIS on your machine, install QGIS
  2. Download the coffee shop and neighborhood datasets

The rest we will handle in the following steps.

Goal

Create a database containing Coffee Shops and Neighborhoods in the city of Cambridge. Once created, we'll run some queries that will tell us the number of coffee shops in each neighborhood and the closest coffee shops to a specific location.


1. Install PostgreSQL and PostGIS

The first step is to get PostgreSQL and PostGIS up and running on your machine or your database web server. Once PostgreSQL and PostGIS are installed, the process for running analysis is largely the same, all you have to do is adjust the address and credentials you use to access the database. The steps below will show you how to install PostgreSQL and PostGIS on your localhost machine, which is great the development, testing, and learning.

a. Download PostgreSQL

Start by downloading and installing PostgreSQL, the main RDBMS. You can find it at the following link. Download the most current, non-beta version for your operating system. For example, PostgreSQL version 9.5 on Mac.

On your machine. Download for your appropriate OS, open the installer and install PostgreSQL.

  1. Use the default Installation Directory
  2. Also, use the default Data Directory (We can always change this later)
  3. Input a password for the superuser (default database adminstrator account with name postgres). IMPORTANT: DO NOT FORGET THIS PASSWORD. WITHOUT IT YOU CANNOT LOG IN AND WILL NEED TO START OVER.
  4. Enter the port number the server will be listening on. The default is 5432, use it.
  5. Set the Locale to be the default.
  6. Click OK to begin the install.
  7. After install, click Allow to let the application accept incoming network requests
  8. Launch StackBuilder to install additional applications.

b. Install PostGIS

Once in StackBuilder, install PostGIS. Find it under Spatial Extensions.

PostGIS is in Spatial Extensions

In general, you can accept the defaults, it will put things in the right place and enable you to work with PostGIS in your databases.

We should be ready to go! Next we'll connect to our PostgreSQL instance using pgAdmin, create a database we can work within, and load data into that database.

Setting up PostGIS on Amazon Web Services


Use Amazon RDS

If you want to access your data over the web, which will often be the case for building web applications and online maps, you can easily set up PostGIS on Amazon Web Services. Follow the tutorial on the AWS documentation below to set up your database, then install PostGIS once your database is setup.

Setting up a PostgreSQL Database on Amazon RDS

Important Note: Make sure you make a big enough RDS instance for your data. Default is 5 gigabytes, you might want more.

Next, connect to your database. Do this using pgAdmin (download it if you don't have it). Using pgAdmin, you can following the instructions at the following AWS documentation link.

Connecting to a DB Instance Running the PostgreSQL Database Engine

Last, you need to enable the PostGIS extension, which is included with the PostgreSQL RDBMS install. Find the instructions at the following documentation site.

Setting up PostGIS on Amazon RDS

Voila, you have PostGIS running on Amazon! Connect to it as you would any other database!

For development purposes, this will work great. Optionally, for further security or to move towards a production environment, set up a proxy running through a EC2 VM, and only accept connections from this proxy. View the following tutorial for a nice summary on how to do this. Creating a Geospatial Database on Amazon RDS by Felix Palmer.

2. Connect to your Database

Now, let's get into our database.

i. Open and Connect with pgAdmin

Connect to your database using pgAdmin to confirm proper setup. Find pgAdmin in your Applications folder and launch it. Navigate to File -> Add Server. Here, input your database credentials. This requires the default username (postgres) and the password we created for it in the installation step. We are connecting to our localhost, but if you wanted to connect to another location, replace localhost with your desired IP address. Fill out the credentials.

Adding our Data Server to pgAdmin

After you successfully connect, your database connection will appear in the pgAdmin window. Click on it to view the contents of your database. We don't have any data in there yet, but we have a structure of the database and all of the templates and defaults.

ii. Create a New Database

Create a new database to hold our datasets. Right click on Databases, and select New Database.

Creating a New Database

In the New Database dialog, setup the following.

  1. In the Properties tab, name it cambridge. We are going to store Cambridge coffee shop and neighborhood datasets. Set the Owner to be the postgres superuser.
  2. In the Definition tab, set the Template to be template_postgis.

Click OK. Your new database will be created and appear in pgAdmin. Click on cambridge to highlight it and select it. We can now add a table.

3. Loading Data

There are a number of different ways you can get data into your database. We will look at two of them in the following steps. The first we will look at is loading a CSV saved on your machine, then we will look at using the Database Manager in QGIS.

i. Load a CSV into the Database

With cambridge highlighted, navigate to Tools -> Query Tool. This opens up our SQL query window in pgAdmin that we can use to send statements and queries to and from this database. We would like to load the cambridge_coffee.csv

a. Create a New Empty Table

In the Query Tool, create a new empty table called cambridge_coffee_shops using the following SQL statement. This is an empty table based on the schema of our CSV. For the PostgreSQL data types available, view the documentation pages on Data Types.

CREATE TABLE cambridge_coffee_shops
(
  id serial NOT NULL,
  name character varying(50),
  address character varying(50),
  city character varying(50),
  state character varying(50),
  zip character varying(10),
  lat numeric,
  lon numeric
);

Now we have an empty table we can load our data into.

b. Load the CSV

View your new table. It is nested within our cambridge database, in Schemas -> public -> Tables -> cambridge_coffee_shops. Once you find it, right click on it and select Import. See the figure below.

Navigate to our Table and Import

In the Import dialog, go through the tabs to set up the settings for our table import based on our CSV.

  • On the File Options tab, browse to the cambridge_coffee_shops.csv file, set the format to be csv, and select UTF8 for the encoding.
  • In the Columns tab, make sure each of the columns you want to import are checked.
  • In Misc. Options, we can tell pgAdmin that the first column of our CSV is a header and that delimiters are commas. Check header, and put a comma in the Delimiter field.
  • Leave Quote Options and NULL Options as default for this table.

Click Import, it will import, pretty quickly, then click Done

c. View the Table

View the table by right clicking on it and selecting View Data -> View All Rows. Here is our dataset.

Cambridge Coffee Shops
d. Create the Geom field

PostGIS uses a special geometry field in the dataset that encodes our geometry into a large variable character string. These are unreadable to the human eye, but are interpreted by the database and GIS software as the geometry. Before we try to view this dataset in QGIS, we need to create this field. It is by default called geom, so we will alter our table to include in this column, then update it using a PostGIS statement that takes our latitude and longitude columns and interprets them into an encoded string in the geom column. Make sense? If not, we'll demonstrate in these next steps.

Create the geom field using the following statement in the Query Tool for our table. Use the ALTER statement to add a column. Name it geom. For the data type, give it a type of geometry(POINT,4326). The arguments are type of geometry (POINT) and coordinate system (EPSG code, which for WGS84 is 4326).

ALTER TABLE cambridge_coffee_shops
	ADD COLUMN geom geometry(POINT,4326)
e. Populate the Geom field

Now we want to populuate the geom field. This can be done using a simple UPDATE statement on our geom column, and then using two simple PostGIS functions that set the coordinate system (ST_SetSRID) on a point made from our latitude and longitude fields (ST_MakePoint. Note, all PostGIS functions start with ST. Run the following in the Query Tool.

UPDATE cambridge_coffee_shops SET geom = ST_SetSRID(ST_MakePoint(lon,lat),4326);

Double check our table.

The GEOM Field

Lastly, let's create an index for our data to make it more easily searchable using the following statement. For more on indexes, check out the PostgreSQL docs, or read this nice article from Heroku.

CREATE INDEX cambridge_coffee_shops_gist
  ON cambridge_coffee_shops
  USING gist (geom);

We have data! And it is in our database!


ii. Loading Data Using QGIS

Next, open QGIS and connect to our database to view our data. In QGIS, we can use the Database Manager to import a Shapefile into our PostGIS database.

a. View our Data in QGIS

Open QGIS. To view our data in QGIS, we need to connect to our database. Create a new blank document.

In the blank document, create a connection to our PostGIS database. Open the Browser Panel if it is not open, and the find and right click on PostGIS in the list under the Browser Panel. Select New Connection.... Enter your credentials for your database and select Also list tables with no geometry. This will let us see tables in our database that don't have geometry as well as those that do.

Connect to Your DB

Test your Connection, and Click OK.

Click your connection in the Browser Tab, and then drag it into your map. You should get your coffee shops on the map.

QGIS with our Coffee Shops layer

b. Use the DB Manager to Add a Shapefile

Add the cambridge_neighborhoods shapefile to your map. Note that it is also in WGS84 projection system. If the data you are working with is different, reproject it now. Then, click on Database -> DB Manager. In the DB Manager, select the cambridge database, public schema, and then click import at the top of the window. This will let us select a layer in our map to import to the database.

QGIS DB Manager

In the dialog, navigate to cambridge_neighborhoods, and setup the import by stating the schema, table, keys, SRID, and other options.

Import Vector Layer to PostGIS

Click OK. Your dataset will import.

Right click on your database in Browser Panel and Refresh your database. Your layer will appear in the browser, and our dataset is in PostGIS.


4. Dive into PostGIS Queries

Close QGIS, we'll run some PostGIS queries in pgAdmin to demonstrate the usage. Going back to pgAdmin, we can now get into our cambridge database and run some spatial queries. View the database in pgAdmin, you should see both of our tables. (If not, right-click the database to refresh the connection). View the cambridge_neighborhoods dataset by right-clicking and selecting View Data -> View All Rows. Your dataset should look as follows.

Cambridge Neighborhoods

Go to the Query Tool. Let's run some queries!

i. Number of Coffee Shops in a Neighborhood

When running queries, you will become very intimate with the PostGIS documentation and functions. Read all about these here. Put in simple terms, you essentially write SQL statements, such as SELECT, ALTER, CREATE, etc. and use PostGIS functions as geometric operators on the tables you are including in your statement. For our first task, let's find how many coffee shops are in each neighborhood. To do this, we want to run a selection by intersection. The PostGIS equivalent for us will be ST_Intersects. Your query, as such, will look like the following.

SELECT cambridge_neighborhoods.name as name, count(*)
	FROM cambridge_coffee_shops, cambridge_neighborhoods
	WHERE ST_Intersects(cambridge_coffee_shops.geom, cambridge_neighborhoods.geom)
	GROUP BY cambridge_neighborhoods.name;

Entering this statement will yield the following selection, giving a count of coffee shops by neighborhood.

Coffee Shop Count by Neighborhood

ii. Order Coffee Shops by Distance to MIT

For our second task, order the coffee shops by distance to 77 Massachusetts Ave and return the name and address of the shop. The latitude and longitude at 77 Massachusetts Ave is 42.359055, -71.093500. To do this, we want to ORDER BY distance. We can use the distance operator (<->) in PostGIS to get us these values, and we need to make a point from our input latitude and longitude. Your query, as such, will look like the following.

SELECT name, address
	FROM cambridge_coffee_shops
	ORDER BY geom <-> ST_SetSRID(ST_MakePoint(-71.093500,42.359055),4326);

Entering this statement will yield the following selection, coffee shops ordered by distance (as the crow flies) to MIT.

Coffee Shops by Distance to MIT

For this example, unit doesn't really matter. But if we want to do more proximity analysis, we should probably project the dataset.

iii. Reproject a Dataset and find Coffee Shops within 500m of Harvard Square

For our next task, we want to select coffee shops based on a buffer around Harvard Square. To do this, we need to project our coffee shop dataset to a local coordinate system that will have more accurate distances. To reproject your data, we need to ALTER our table so that the geom field is in the right coordinate system. Let's tranform our data to UTM Zone 19N, where Cambridge is located. Because we are already using the WGS84 datum, we'll use the UTM projection based on WGS84. The projection unit is meters. The EPSG code for this is 32619. We accomplish this using ST_Transform in PostGIS.

a. Create a New Table

Create a new table, called cambridge_coffee_shops_utm, using the following statement.

CREATE TABLE cambridge_coffee_shops_utm AS SELECT * FROM cambridge_coffee_shops;

b. Reproject the New Table

Reproject the cambridge_coffee_shops_utm dataset so that the geom column is based on UTM Zone 19N. Use the following statement to ALTER our table with the ST_Transform function.

ALTER TABLE cambridge_coffee_shops_utm
	ALTER COLUMN geom TYPE geometry(POINT,32619) 
	USING ST_Transform(geom,32619);

A nice summary of projection transformation can be found on the PostGIS tips page.

c. Run the Query

Run the query on cambridge_coffee_shops_utm. We are going to use the ST_DWithin function and create a distance in meters , as that is the unit of our projection. The coordinates of the center of Harvard Square are 42.373570, -71.118961. The ST_DWithin function takes two geometries and distance. The first geometry is going to be the geom field from our coffee shops dataset. The second is going to be the geometry of a point we decide, the center of Harvard Square. We can get this using ST_SetSRID and ST_MakePoint. Because we are providing a latitude and longitude in WGS84 (4326), we need to transform (ST_Transform) the point to UTM 19N (32619). The last argument, distance, is in meters because it is based on our projection which we just set to be UTM 19N for the dataset. This number can be anything in meters, or in our case 500. All said and done, our query will look like the following.

SELECT name, address
    FROM cambridge_coffee_shops_utm
    WHERE ST_DWithin(geom,ST_Transform(ST_SetSRID(ST_MakePoint(-71.118961,42.373570),4326),32619),500);

Entering this statement will yield the following selection, coffee shops within 500 meters (as the crow flies) of Harvard Square.

Coffee Shops within 500 meters of Harvard Square

The results of our query are not ordered. We could use an ORDER BY to get them in an order from the station.


Conclusion

As you can, see from this point on, the opportunities are almost limitless as to what you can do. Datasets can get much larger, and you can integrate these technologies within your applications. Explore the PostGIS Documentation to learn more of the functions available and capabilities of the system, add a few more datasets to your database that could enable some big data mining, and integrate this with Leaflet or OpenLayers to create nice online web maps. Happy spatial databasing.


Next Steps...


Go to Main DUSPviz Tutorials Page