This tutorial will introduce PostgreSQL, a Relational Database Management System (RDBMS), and PostGIS, a spatial extension for PostgreSQL that allows the user to store and spatially query geographic data. Both PostgreSQL and PostGIS are free and open-source software. Remember CARTO? CARTO is built on top of a PostGIS-enabled PostgreSQL database; we've already begun to use PostGIS!

The difference here is that we may not always want to host our data on CARTO. In the future, you may not have access to CARTO's services. You may have data that we may not want to store on CARTO for reasons of privacy or propriety. You may be averse to becoming dependent on cloud platforms. You may be preparing for a not-inconceivable world in which CARTO goes belly-up (or changes its terms of service in a way that give you pause). Anyway, the point here is to take some of the power back and learn to deploy our own hosted spatial databases.

In this tutorial, we will install PostgreSQL and the PostGIS extension, then load spatial data stored as a CSV and as a Shapefile into the database. Finally, we'll run a few spatial queries that demonstrate some of its functionality. Create a database containing Coffee Shops and Neighborhoods in the city of Cambridge. Once created, we'll run some queries that will tell us the number of coffee shops in each neighborhood and the closest coffee shops to a specific location.

The datasets we'll be working with today are a CSV of coffee shops in the City of Cambridge, and a shapefile of Cambridge Neighborhoods.

Prerequisites

  1. Download the coffee shop and neighborhood datasets
  2. If you don't have QGIS on your machine, install QGIS.
Cambridge Coffee Shops and Neighborhoods

Install PostgreSQL and PostGIS

The first step is to get PostgreSQL and PostGIS up and running on your machine or your web server. The steps below will walk through an installation of PostgreSQL and PostGIS on your local hard drive (as opposed to a server), which will be very useful for development and learning purposes.

Install PostgreSQL

Start by downloading and installing PostgreSQL, the main RDBMS. You can find downloads listed here. Download the most current, supported version for your operating system (version 10 as of this writing).

The defaults will serve you just fine; do make sure that you remember your password for the postgres superuser account. There is no 'forgot my password' feature and without your password, you won't be able to access your database - you do the math!

After install, you'll likely see a dialog box asking if the application should be allowed to accept incoming network requests - click allow. Finally, launch StackBuilder, which will allow you to install extensions for PostgreSQL (like PostGIS).

Install PostGIS

Once in StackBuilder, install PostGIS. Find it under Spatial Extensions.

PostGIS is in Spatial Extensions

You can accept the defaults, it will put things in the right place and enable you to work with PostGIS in your databases.

We should be ready to go! Next we'll connect to our PostgreSQL instance using pgAdmin, create a database we can work within, and load data into that database.

Connect to and Modify Your Database

Open and Connect with pgAdmin

Find pgAdmin in your Applications folder and launch it; this will give us a graphical user interface for interacting with our PostgreSQL databases. pgAdmin will likely automatically locate your database, which will appear as 'PostgreSQL 10' under the 'Servers' list. If not, right click the 'Servers' list and choose Create > Server. Here, input your database information and credentials. This requires the default superuser name (postgres), the the password we created for it in the installation step, and the port (5432 by default). We are connecting to our localhost, but if you wanted to connect to another location, you would simply replace localhost with your desired IP address. You'll have to modify the 'General' and 'Connection' tabs like below.

After you successfully connect, your database connection will appear in the pgAdmin window. Click on it to view the contents of your localhost server connection. We haven't included any data yet, but we have a structure to hold our databases which is populated by defaults.

Create a New Database

We now need to create a new database to hold our datasets. Right click on Databases, and select Create > Database.

Creating a New Database

In the New Database dialog, setup the following.

  1. In the General tab, name it cambridge. We are going to store Cambridge coffee shop and neighborhood datasets. Set the Owner to be the postgres superuser.
  2. In the Definition tab, set the Template to be template_postgis.

Click OK. Your new database will be created and appear in pgAdmin. Click on cambridge to highlight it and select it. We can now add a table.

Loading Data

There are a number of different ways you can get data into your database. The first we will look at is loading a CSV saved on your machine; a bit later, we will look at using the Database Manager in QGIS.

Load a CSV into the Database

Right-click the 'cambridge' database, navigate to Tools -> Query Tool. This opens up our SQL query window in pgAdmin that we can use to send statements and queries to and from this database. We would like to load the cambridge_coffee.csv

Create a New Empty Table

In the Query Tool, create a new empty table called cambridge_coffee_shops using the following SQL statement. This is an empty table based on the schema of our CSV. For the PostgreSQL data types available, view the documentation pages on Data Types.

CREATE TABLE coffee_shops
	(
	  id serial NOT NULL,
	  name character varying(50),
	  address character varying(50),
	  city character varying(50),
	  state character varying(50),
	  zip character varying(10),
	  lat numeric,
	  lon numeric
	);

Now we have an empty table we can load our data into.

Load the CSV

View your new table. It is nested within our cambridge database, in Schemas -> public -> Tables -> coffee_shops. Once you find it, right click on it and select Import/Export . See the figure below.

Navigate to our Table and Import

In the Import dialog, go through the tabs to set up the settings for our table import based on our CSV.

  • In the Options tab, browse to the cambridge_coffee_shops.csv file, set the format to be csv, and select UTF8 for the encoding.
  • In Misc. Options, we can tell pgAdmin that the first column of our CSV is a header and that delimiters are commas. Check header, and put a comma in the Delimiter field.
  • Leave Quote Options and NULL Options as default for this table.
  • In the Columns tab, make sure each of the columns you want to import are checked.
Import Options
Import columns

Click Import, it will import, pretty quickly, then click Done

View the Table

View the table by right clicking on it and selecting View Data -> View All Rows. Here is our dataset.

Cambridge Coffee Shops

Create the Geom field

PostGIS uses a special geometry field in the dataset that encodes our geometry into a large variable character string. These are unreadable to the human eye, but are interpreted by the database and GIS software as the geometry. Before we try to view this dataset in QGIS, we need to create this field. It is by default called geom, so we will alter our table to include in this column, then update it using a PostGIS statement that takes our latitude and longitude columns and interprets them into an encoded string in the geom column. Make sense? If not, we'll demonstrate in these next steps.

Create the geom field using the following statement in the Query Tool for our table. Use the ALTER statement to add a column. Name it geom. For the data type, give it a type of geometry(POINT,4326). The arguments are type of geometry (POINT) and coordinate system (EPSG code, which for WGS84 is 4326).

ALTER TABLE coffee_shops
	ADD COLUMN geom geometry(POINT,4326)
Populate the Geom field

Now we want to populuate the geom field. This can be done using a simple UPDATE statement on our geom column, and then using two simple PostGIS functions that set the coordinate system (ST_SetSRID) on a point made from our latitude and longitude fields (ST_MakePoint. Note, all PostGIS functions start with ST. Run the following in the Query Tool.

UPDATE coffee_shops SET geom = ST_SetSRID(ST_MakePoint(lon,lat),4326);

Double check our table.

The GEOM Field

Lastly, let's create an index for our data to make it more easily searchable using the following statement. For more on indexes, check out the PostgreSQL docs, or read this nice article from Heroku.

CREATE INDEX coffee_shops_gist
  ON coffee_shops
  USING gist (geom);

We have data! And it is in our database!


Loading Data Using QGIS

Next, open QGIS and connect to our database to view our data. In QGIS, we can use the Database Manager to import a Shapefile into our PostGIS database.

View our Data in QGIS

Open QGIS. To view our data in QGIS, we need to connect to our database. Create a new blank document.

In the blank document, create a connection to our PostGIS database. Open the Browser Panel if it is not open, and the find and right click on PostGIS in the list under the Browser Panel. Select New Connection.... Enter your credentials for your database and select Also list tables with no geometry. This will let us see tables in our database that don't have geometry as well as those that do.

Connect to Your PostGIS Database

Test your Connection, and Click OK.

Click your connection in the Browser Tab, and then drag it into your map. You should get your coffee shops on the map.

QGIS with our Coffee Shops layer

Use the DB Manager to Add a Shapefile

Add the cambridge_neighborhoods shapefile to your map. Note that it is also in WGS84 projection system. If the data you are working with is different, reproject it now. Then, click on Database -> DB Manager. In the DB Manager, select the cambridge database, public schema, and then click import at the top of the window. This will let us select a layer in our map to import to the database.

QGIS DB Manager

In the dialog, navigate to cambridge_neighborhoods, and setup the import by stating the schema, table, keys, SRID, and other options.

Import Vector Layer to PostGIS

Click OK. Your dataset will import.

Right click on your database in Browser Panel and Refresh your database. Your layer will appear in the browser, and our dataset is in PostGIS.


Dive into PostGIS Queries

Close QGIS, we'll run some PostGIS queries in pgAdmin to demonstrate its functionality. Going back to pgAdmin, we can now get into our cambridge database and run some spatial queries. View the database in pgAdmin, you should see both of our tables. (If not, right-click the database to refresh the connection). View the cambridge_neighborhoods dataset by right-clicking and selecting View Data -> View All Rows. Your dataset should look as follows.

Cambridge Neighborhoods

Go to the Query Tool. Let's run some queries!

Number of Coffee Shops in a Neighborhood

When running queries, you will become very intimate with the PostGIS documentation and functions. Read all about these here. In simple terms: you write SQL statements, such as SELECT, ALTER, CREATE, etc. and use PostGIS functions as geometric operators on the tables you are including in your statement. For our first task, let's find how many coffee shops are in each neighborhood. To do this, we want to run a selection by intersection. The PostGIS equivalent for us will be ST_Intersects. Your query, as such, will look like the following.

SELECT cambridge_neighborhoods.name as name, count(*)
	FROM coffee_shops, cambridge_neighborhoods
	WHERE ST_Intersects(coffee_shops.geom, cambridge_neighborhoods.geom)
	GROUP BY cambridge_neighborhoods.name;

Entering this statement will yield the following selection, giving a count of coffee shops by neighborhood.

Coffee Shop Count by Neighborhood

Order Coffee Shops by Distance to MIT

For our second task, order the coffee shops by distance to 77 Massachusetts Ave and return the name and address of the shop. The latitude and longitude at 77 Massachusetts Ave is 42.359055, -71.093500. To do this, we want to ORDER BY distance. We can use the distance operator (<->) in PostGIS to get us these values, and we need to make a point from our input latitude and longitude. Your query, as such, will look like the following.

SELECT name, address
	FROM coffee_shops
	ORDER BY geom <-> ST_SetSRID(ST_MakePoint(-71.093500,42.359055),4326);

Entering this statement will yield the following selection, coffee shops ordered by distance (as the crow flies) to MIT.

Coffee Shops by Distance to MIT

For this example, unit doesn't really matter. But if we want to do more proximity analysis, we should probably project the dataset.

Reproject a Dataset and find Coffee Shops within 500m of Harvard Square

For our next task, we want to select coffee shops based on a buffer around Harvard Square. To do this, we need to project our coffee shop dataset to a local coordinate system that will have more accurate distances. To reproject your data, we need to ALTER our table so that the geom field is in the right coordinate system. Let's tranform our data to UTM Zone 19N, where Cambridge is located. Because we are already using the WGS84 datum, we'll use the UTM projection based on WGS84. The projection unit is meters. The EPSG code for this is 32619. We accomplish this using ST_Transform in PostGIS.

Create a New Table

Create a new table, called coffee_shops_utm, using the following statement.

CREATE TABLE coffee_shops_utm AS SELECT * FROM coffee_shops;

Reproject the New Table

Reproject the coffee_shops_utm dataset so that the geom column is based on UTM Zone 19N. Use the following statement to ALTER our table with the ST_Transform function.

ALTER TABLE coffee_shops_utm
	ALTER COLUMN geom TYPE geometry(POINT,32619)
	USING ST_Transform(geom,32619);

A nice summary of projection transformation can be found on the PostGIS tips page.

Run the Query

Run the query on cambridge_coffee_shops_utm. We are going to use the ST_DWithin function and create a distance in meters , as that is the unit of our projection. The coordinates of the center of Harvard Square are 42.373570, -71.118961. The ST_DWithin function takes two geometries and distance. The first geometry is going to be the geom field from our coffee shops dataset. The second is going to be the geometry of a point we decide, the center of Harvard Square. We can get this using ST_SetSRID and ST_MakePoint. Because we are providing a latitude and longitude in WGS84 (4326), we need to transform (ST_Transform) the point to UTM 19N (32619). The last argument, distance, is in meters because it is based on our projection which we just set to be UTM 19N for the dataset. This number can be anything in meters, or in our case 500. All said and done, our query will look like the following.

SELECT name, address
    FROM coffee_shops_utm
    WHERE ST_DWithin(geom,ST_Transform(ST_SetSRID(ST_MakePoint(-71.118961,42.373570),4326),32619),500);

Entering this statement will yield the following selection, coffee shops within 500 meters (as the crow flies) of Harvard Square.

Coffee Shops within 500 meters of Harvard Square

The results of our query are not ordered. We could use an ORDER BY to get them in an order from the station.


Conclusion

As you can, see from this point on, the opportunities are almost limitless as to what you can do. Datasets can get much larger, and you can integrate these technologies within your applications. Explore the PostGIS Documentation to learn more of the functions available and capabilities of the system, add a few more datasets to your database that could enable some big data mining, and integrate this with Leaflet or OpenLayers to create nice online web maps. Happy spatial databasing.


Next Steps...

Setting up PostGIS on Amazon Web Services


Use Amazon RDS

If you want to access your data over the web, which will often be the case for building web applications and online maps, you can easily set up PostGIS on Amazon Web Services. Follow the tutorial on the AWS documentation below to set up your database, then install PostGIS once your database is setup.

Setting up a PostgreSQL Database on Amazon RDS

Important Note: Make sure you make a big enough RDS instance for your data. Default is 5 gigabytes, you might want more.

Next, connect to your database. Do this using pgAdmin (download it if you don't have it). Using pgAdmin, you can following the instructions at the following AWS documentation link.

Connecting to a DB Instance Running the PostgreSQL Database Engine

Last, you need to enable the PostGIS extension, which is included with the PostgreSQL RDBMS install. Find the instructions at the following documentation site.

Setting up PostGIS on Amazon RDS

Voila, you have PostGIS running on Amazon! Connect to it as you would any other database!

For development purposes, this will work great. Optionally, for further security or to move towards a production environment, set up a proxy running through a EC2 VM, and only accept connections from this proxy. View the following tutorial for a nice summary on how to do this. Creating a Geospatial Database on Amazon RDS by Felix Palmer.


Go to Main DUSPviz Tutorials Page