What is Geospatial Data?
Geospatial data, or geodata, is any data that contains a spatial dimension; in other words, it’s data that has a “where”. Examples of geodata include demographics, street networks, administrative boundaries, vehicle locations, and much, much more.
Additionally, geospatial data commonly has attribute data associated with it such as names, descriptions, and classifications. It is also common for these features to include temporal information such as dates and time periods. The table below provides an earthquake data point that features a spatial, attribute, and temporal component.
Spatial | Attribute | Temporal |
---|---|---|
60.669, -150.843 | Magnitude 9.2 | March 27, 1964 |
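As a minimal sketch, the same data point could be modeled in Python as follows. The class and field names are illustrative assumptions, not part of any standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Earthquake:
    """One record with a spatial, an attribute, and a temporal component."""
    latitude: float    # spatial component (decimal degrees)
    longitude: float   # spatial component (decimal degrees)
    magnitude: float   # attribute component
    occurred_on: date  # temporal component


# The values come from the table above.
quake = Earthquake(latitude=60.669, longitude=-150.843,
                   magnitude=9.2, occurred_on=date(1964, 3, 27))
print(quake)
```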
On its own, this data point isn’t much different from non-spatial data. The true power of geodata is unlocked when it is plotted on a map. The following figure depicts every earthquake recorded by the USGS between 1900 and 2017.
Types of Geodata
There are three major types of geospatial data. The specific type of data impacts which visualizations are possible and which spatial operations can be performed.
Vectors
Vector data is used to represent discrete entities. Vectors are represented in three distinct forms.
- Points: to show locations like cities, events, and establishments.
- Lines: to represent linear features like rivers, roads, and trajectories.
- Polygons: to depict areas like lakes, parks, and building footprints.
Vectors offer a high degree of precision in depicting things with well-defined boundaries or locations like administrative zones or human structures. Vector data is the most common type of data you will encounter in traditional relational databases since it is the easiest to store.
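To make the three forms concrete, here is a rough sketch that stores each one as plain coordinate pairs and runs a simple measurement on the line and the polygon. The coordinates are invented and treated as planar (x, y) values; real latitude and longitude data would need a projection or a geodesic calculation before measuring.

```python
import math

# Invented planar coordinates, as (x, y) pairs.
point = (2.0, 3.0)                                    # a single location
line = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]           # an ordered path
polygon = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0),
           (0.0, 3.0), (0.0, 0.0)]                    # a closed ring


def line_length(coords):
    """Sum the straight-line distance between consecutive vertices."""
    return sum(math.dist(a, b) for a, b in zip(coords, coords[1:]))


def polygon_area(ring):
    """Area of a simple closed ring using the shoelace formula."""
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(ring, ring[1:]))
    return abs(twice_area) / 2


print(line_length(line))
print(polygon_area(polygon))
```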
Rasters
Rasters use grid cells like pixels to represent continuous data. Each pixel is assigned a value that is representative of some data value. Elevation data, often called digital elevation models (DEM), are provided in raster format. The value of the pixel corresponds to the average elevation within the grid cell.
Rasters are ideal for data that is continuous as opposed to discrete. Elevation, foliage, pollution, and sunlight are a few examples of continuous phenomena. Unlike with vectors, resolution is a significant consideration with rasters. The finer the resolution, the more detail the raster can capture, but the larger and more cumbersome the data becomes.
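The sketch below illustrates the idea with a tiny, invented elevation grid. The origin, cell size, and values are assumptions chosen for the example. It looks up the cell that contains a coordinate and shows how quickly the number of cells grows as the resolution gets finer.

```python
import math

# Invented 3 x 4 elevation raster (metres). Row 0 is the northern edge.
elevation = [
    [120, 125, 130, 128],
    [118, 122, 127, 126],
    [115, 119, 123, 121],
]
origin_x, origin_y = 500000.0, 4200030.0  # assumed upper-left corner of the grid
cell_size = 10.0                          # assumed cell width and height


def sample(x, y):
    """Return the value of the cell that contains the point (x, y)."""
    col = int((x - origin_x) // cell_size)
    row = int((origin_y - y) // cell_size)
    if not (0 <= row < len(elevation) and 0 <= col < len(elevation[0])):
        raise ValueError("point falls outside the raster")
    return elevation[row][col]


def cell_count(width, height, size):
    """Number of cells needed to cover an extent at a given cell size."""
    return math.ceil(width / size) * math.ceil(height / size)


print(sample(500025.0, 4200015.0))
print(cell_count(40, 30, 10), cell_count(40, 30, 5))
```

Halving the cell size quadruples the number of cells, which is why finer rasters become cumbersome so quickly.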
Rasters are common in remote sensing, photogrammetry, and environmental modeling data. Some grids do not use square cells, but rather triangles, hexagons, or other shapes that tile the surface without gaps. Uber’s H3 indexing library offers an example of a hexagonal grid.
Topologies
Topological data focuses on spatial relationships between features rather than exact coordinates and distances. Topology is generally used to model networked data such as transportation, electricity, and communications.
Topological relationships use concepts like adjacency, connectivity, and containment to model how different elements relate in a space. For example, in road networks, recognizing how roads connect and interact is vital for effective navigation and traffic flow management.
Working with topological data requires more specific tools and skills. Creating and managing these data sets is usually harder than with vector or raster types. However, topological data is essential for analyzing and modeling networks.
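As a rough illustration of connectivity, a road network can be sketched as an adjacency list, with no coordinates at all. The network below is hypothetical, and the breadth-first search counts road segments rather than distance.

```python
from collections import deque

# Hypothetical road network: keys are intersections, values are the
# intersections directly connected to them by a road segment.
roads = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}


def fewest_segments(graph, start, goal):
    """Breadth-first search for a route that uses the fewest road segments."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # the two intersections are not connected


print(fewest_segments(roads, "A", "E"))
```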
Geodata Formats
When searching for the above data types, it can be helpful to know what kind of file formats you are looking for. Each file format is unique and choosing the right format for your purposes is important.
Shapefiles (.shp)
Shapefiles are one of the most popular formats for vector data. Because the data is encoded in binary, shapefiles are compact. Each component file is limited to 2 GB, and shapefiles are unable to store topological information.
The name shapefile is a bit of a misnomer since the data is stored in a series of sidecar files rather than just one file. Every shapefile must have three files: the geometry (.shp), a shape index (.shx), and a dBASE attribute table (.dbf). Other optional sidecar files may include projection and coordinate information (.prj), and metadata (.shp.xml). A comprehensive list can be found at this site.
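Because a shapefile is really several files, a quick check for the mandatory sidecars can save some confusion. The helper below is a minimal sketch using only the standard library. The function name and the example path are hypothetical, and it assumes lowercase file extensions.

```python
from pathlib import Path

REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf")


def missing_sidecars(shp_path):
    """Return the mandatory shapefile extensions that are absent on disk."""
    base = Path(shp_path).with_suffix("")  # e.g. data/roads.shp -> data/roads
    return [ext for ext in REQUIRED_EXTENSIONS
            if not base.with_name(base.name + ext).exists()]


# Hypothetical path; an empty list means all mandatory files are present.
print(missing_sidecars("data/roads.shp"))
```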
GeoJSON (.geojson)
GeoJSON is a vector data format based on JavaScript Object Notation (JSON). The JSON format is extended with a field that encodes geometry. Since GeoJSON is a text-based format, it is human-readable when opened in an editor. On the other hand, large GeoJSON files can become unwieldy and slow to process.
Geojson.io provides a web-based playground to create and edit GeoJSON files and see their internal data structures.
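For a sense of what the format looks like, the sketch below builds a small GeoJSON FeatureCollection with Python's json module, reusing the earthquake data point from earlier. The property names are arbitrary choices for this example. Note that GeoJSON lists coordinates as longitude first, then latitude.

```python
import json

feature = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        "coordinates": [-150.843, 60.669],  # [longitude, latitude]
    },
    # Property names are free-form; these are illustrative.
    "properties": {"magnitude": 9.2, "date": "1964-03-27"},
}
collection = {"type": "FeatureCollection", "features": [feature]}

# Serialize to text, then read it back as any GeoJSON consumer would.
geojson_text = json.dumps(collection, indent=2)
parsed = json.loads(geojson_text)
lon, lat = parsed["features"][0]["geometry"]["coordinates"]
print(geojson_text)
```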
TopoJSON (.topojson)
TopoJSON is an extension of GeoJSON that encodes both topology and geometry. The emphasis on topology reduces redundancy because boundaries shared between features only need to be encoded once.
Users may find TopoJSON unnecessarily complex when all they need is a simple geometry format. TopoJSON is preferred over GeoJSON when working with large data sets that have many shared boundaries, such as the administrative borders on a world map, or with highly interconnected network data. Luckily, most modern GIS tools offer support for TopoJSON and conversion to or from GeoJSON where necessary.
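The following sketch shows how this plays out for two adjacent squares that share one edge. It is a simplified, unquantized topology (no transform), the object name "parcels" is invented, and the decoder handles only single-ring polygons. The shared edge is stored once as arc 0, and the second square refers to it with a negative index, meaning the arc is traversed in reverse.

```python
topology = {
    "type": "Topology",
    "objects": {
        "parcels": {  # hypothetical object name
            "type": "GeometryCollection",
            "geometries": [
                {"type": "Polygon", "arcs": [[0, 1]], "properties": {"name": "A"}},
                {"type": "Polygon", "arcs": [[-1, 2]], "properties": {"name": "B"}},
            ],
        }
    },
    "arcs": [
        [[1, 0], [1, 1]],                  # arc 0: the shared edge, stored once
        [[1, 1], [0, 1], [0, 0], [1, 0]],  # arc 1: the rest of square A
        [[1, 0], [2, 0], [2, 1], [1, 1]],  # arc 2: the rest of square B
    ],
}


def decode_ring(arc_indexes, arcs):
    """Stitch a list of arc references into one closed coordinate ring."""
    ring = []
    for index in arc_indexes:
        # A negative index ~i refers to arc i traversed in reverse.
        coords = arcs[index] if index >= 0 else arcs[~index][::-1]
        # Later arcs start where the previous one ended, so skip that point.
        ring.extend(coords if not ring else coords[1:])
    return ring


rings = {
    geometry["properties"]["name"]: decode_ring(geometry["arcs"][0], topology["arcs"])
    for geometry in topology["objects"]["parcels"]["geometries"]
}
print(rings)
```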
Well-Known Text (WKT)
Well-known text representation of geometry (WKT) and its cousin Well-known Binary (WKB) are not file formats like the previous examples; they could be better described as data representations.
WKT represents vector geometries by using a standard notation that is human-readable. It is common to store WKT fields in relational databases where spatial extensions like PostGIS can natively support the reading and handling of WKT for spatial operations and visualizations. CSV files frequently use WKT to encode their geometries as well.
WKB is analogous to WKT but instead of encoding geometry in a human-readable text format, the information is stored in binary. WKB is useful for maintaining full numeric precision by avoiding the loss that occurs in textual representations.
You may also encounter Extended WKT (EWKT) and Extended WKB (EWKB). These extensions, which originate from PostGIS, add a spatial reference identifier (SRID) to the geometry encoding. The SRID identifies the coordinate reference system that the coordinates are expressed in; for example, SRID 4326 refers to WGS 84 longitude and latitude.
WKTmap provides a web-based playground to create and edit WKT geometries.
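As a minimal sketch of how the two representations relate, the snippet below writes the earthquake location as WKT and EWKT and then packs the same point as little-endian WKB using only the standard library. The SRID of 4326 is an assumption for this example. WKT lists x before y, so longitude comes first.

```python
import struct

lon, lat = -150.843, 60.669

wkt = f"POINT ({lon} {lat})"   # x (longitude) first, then y (latitude)
ewkt = f"SRID=4326;{wkt}"      # assumes the coordinates are WGS 84

# WKB for a 2D point: 1 byte for byte order (1 = little-endian),
# a 4-byte geometry type (1 = Point), then x and y as 8-byte doubles.
wkb = struct.pack("<BIdd", 1, 1, lon, lat)

# Decoding reverses the process without any loss of precision.
byte_order, geometry_type, x, y = struct.unpack("<BIdd", wkb)

print(wkt)
print(ewkt)
print(wkb.hex())
```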
Comma Separated Values (.csv)
CSV files are a common file format for sharing all types of data. They are analogous to tables like one would find in a relational database. In the context of geodata, CSV includes one or more columns to encode vector geometry. This could be in the form of latitude and longitude, a geospatial index, or even WKT geometry. The rest of the columns provide either attribute or temporal information.
Similar to GeoJSON, CSV files can be unwieldy when working with large volumes of data since they have no inherent spatial indexing. Without an index, performance can be severely affected when querying the data.
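The sketch below shows a small CSV with both latitude/longitude columns and a WKT column, followed by a bounding-box filter. The column names are arbitrary and the city coordinates are approximate. Because there is no spatial index, the filter has to read and test every row.

```python
import csv
import io

# Approximate city coordinates, used only for illustration.
raw_csv = """name,latitude,longitude,wkt
Anchorage,61.2181,-149.9003,POINT (-149.9003 61.2181)
Fairbanks,64.8378,-147.7164,POINT (-147.7164 64.8378)
Juneau,58.3019,-134.4197,POINT (-134.4197 58.3019)
"""


def names_in_bbox(text, min_lon, min_lat, max_lon, max_lat):
    """Full scan: every row is parsed and tested against the bounding box."""
    matches = []
    for row in csv.DictReader(io.StringIO(text)):
        lon, lat = float(row["longitude"]), float(row["latitude"])
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            matches.append(row["name"])
    return matches


print(names_in_bbox(raw_csv, -152.0, 60.0, -145.0, 66.0))
```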
GeoParquet (.parquet)
GeoParquet is a spatial extension built on top of Apache Parquet. Parquet is a highly efficient, columnar storage file format optimized for fast retrieval of data. Parquet enhances performance by compressing data, allowing for more effective handling of large data sets. Like CSV, GeoParquet is used to encode vector data. GeoParquet files generally use the standard .parquet extension.
GeoParquet is a relatively new file format but can already be utilized in many popular software libraries and tools like QGIS, GeoPandas, and cloud data warehouses.
GeoTIFF (.tiff or .tif)
GeoTIFF is the first format on this list that is used to store raster data. GeoTIFF is based on the TIFF file format with extra metadata that includes information like coordinate systems and map projections. This metadata is generally self-contained in the GeoTIFF itself.
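A plain TIFF can also come with a sidecar world file (commonly .tfw, sometimes .wld) that stores the pixel size, rotation, and the coordinates of the upper-left pixel. A world file does not record the coordinate system itself.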
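A world file is just six numbers, one per line, that define an affine transform from pixel positions to map coordinates. The sketch below parses one and converts a pixel position. The sample values are invented, and the function names are hypothetical.

```python
# Invented world file contents. The six lines are, in order:
# A (pixel width), D (row rotation), B (column rotation),
# E (pixel height, normally negative), C and F (x and y of the
# centre of the upper-left pixel).
world_file_text = """10.0
0.0
0.0
-10.0
500005.0
4200025.0
"""


def parse_world_file(text):
    """Read the six world file parameters in file order: A, D, B, E, C, F."""
    a, d, b, e, c, f = (float(value) for value in text.split())
    return a, d, b, e, c, f


def pixel_to_world(params, col, row):
    """Map a pixel (col, row) to the map coordinates of its centre."""
    a, d, b, e, c, f = params
    return a * col + b * row + c, d * col + e * row + f


params = parse_world_file(world_file_text)
print(pixel_to_world(params, 2, 1))
```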
GPS Exchange Format (.gpx)
GPS Exchange Format (GPX) is an XML schema designed to share GPS data. It is typical to find GPX used in the context of cycling, hiking, wayfinding, and other outdoor recreational activities.
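Since GPX is plain XML, track points can be read with Python's standard library. The sketch below assumes a GPX 1.1 document; the track itself is invented.

```python
import xml.etree.ElementTree as ET

# Invented two-point track in GPX 1.1.
gpx_text = """<gpx version="1.1" creator="example"
     xmlns="http://www.topografix.com/GPX/1/1">
  <trk>
    <name>Morning ride</name>
    <trkseg>
      <trkpt lat="47.6062" lon="-122.3321">
        <ele>56.0</ele>
        <time>2024-01-01T08:00:00Z</time>
      </trkpt>
      <trkpt lat="47.6070" lon="-122.3330">
        <ele>58.5</ele>
        <time>2024-01-01T08:00:10Z</time>
      </trkpt>
    </trkseg>
  </trk>
</gpx>"""

# GPX 1.1 elements live in this XML namespace.
NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

root = ET.fromstring(gpx_text)
track_points = [
    (float(pt.get("lat")), float(pt.get("lon")), float(pt.find("gpx:ele", NS).text))
    for pt in root.findall(".//gpx:trkpt", NS)
]
print(track_points)
```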
Geodatabase (.gdb)
Geodatabase is a proprietary data structure created by Esri. It comes in both file and server-based options. Geodatabases can store a variety of spatial and non-spatial data types, including feature classes, raster data, and tables, in a single database structure. Geodatabases can be queried using SQL much like traditional relational databases, although the level of SQL support depends on the type of geodatabase.
Sources of Geospatial Data
One of the advantages of working with geodata is that there are many free sources available. Open data can often be combined with enterprise data to enhance analyses. Below are just a few popular sources of geospatial data, but the list is by no means exhaustive.
OpenStreetMap
OpenStreetMap (OSM) is a global collection of geographic data sourced from and maintained by volunteers. OSM also includes a mapping platform for editing and downloading their data. OSM can be thought of as Wikipedia for maps and geodata; anyone can contribute and edit the data. Their database holds a tremendous amount of data and their license encourages its use and redistribution. Just be sure to follow their rules on data attribution.
OSM’s unique data structure can be queried to access a veritable treasure trove of geodata. You can also contribute to the project in the true spirit of open data.
NASA
NASA provides free earth observation data. The datasets cover topics like the atmosphere, pollution, the biosphere, and much more. This includes imagery and derived products from instruments and missions such as MODIS, Landsat, and ASTER, as well as digital elevation models such as SRTM and the ASTER GDEM. There are different resolutions to choose from and at least some of their data sets have complete global coverage.
Natural Earth
Natural Earth provides public-domain datasets at various scales (1:10m, 1:50m, and 1:110m) for large to small-scale projects. The two main types of data provided are:
- Cultural Data - administrative boundaries, borders, and urban areas.
- Physical Data - lakes, rivers, coastlines, and other natural features.
Administrative Open Data
Many governments provide open data that can be downloaded and used for free. The exact nature and quality of this data will vary greatly based on region and provider. Government-provided geodata is often served through a dedicated portal, as in the case of the United States’ Geoplatform.
Geodata from these sources is often interspersed with other types of data. The upshot is that geospatial data can often be easily enriched by non-geospatial data from the same source. For example, if a government provides postal code areas they may also provide statistics grouped by postal code like demographic or economic data.
Conclusion
Geodata is unique. It comes in its own file formats and is available from its own specialized sources. Although there are many overlapping concepts with non-spatial data, geodata requires special attention and consideration.
The best way to understand geospatial data is to work with it. If you need help with a specific geospatial problem or need similar content written for your own blog, please reach out.
Happy mapping!