## Introduction to Statistics and Geographic Data

## INTRODUCTION

This chapter introduces the text and helps students differentiate between various types of data. Students will learn why geographic statistics matter and will meet the common terms and graphs that recur throughout the text. The chapter also introduces the computer software used for geographic analyses.

Why take this course?

The most important thing you can learn from this text is how to teach yourself about geographic information systems (GIS) and analysis, because most of the skills you will use in your professional career are ones you will have cared enough to learn on your own. This course will point you to several resources that will help you do that. Don’t worry, we also intend to teach you a few things. GIS makes it easy to run a wide variety of powerful analysis tools. However, these tools are worthless, or worse, detrimental, if you do not know how and when to apply them and how to interpret the results they produce. This course will prepare you for more in-depth analysis courses, such as remote sensing. Technology changes quickly, and some of the information in this course will change or even become obsolete soon after it becomes available. A passion for learning how to apply software is key to keeping up with changing technology.

Statistics in Geospatial Science

Statistics provides geographers with a toolbox to better visualize and understand data. Data in space and time involve several variables at once (e.g., x, y, and z coordinates, plus time), and analyzing them together requires more advanced statistics than simple summaries. The spatial statistics description provided by Wikipedia shows the various aspects of this type of analysis.

Application of statistics

For this class, we will focus on applying statistical techniques. For the sake of time, there will be minimal proofs and equations, but a general description of how the “tools” work, along with their limitations, will be provided. If you are curious about a specific technique, I urge you to use the resources at your disposal to investigate its inner workings more thoroughly. We want to prepare you for the world of GIS analysis and describe how GIS plays into that analysis.

Role of Software

To prepare you for direct application of statistics to GIS, we will incorporate software tools into this class. Due to the complex nature of the analyses we will be conducting, use of software is necessary for many types of GIS analysis.

- **ArcGIS 10.2**: ArcGIS is an industry standard used by most GIS professionals. It offers a wide variety of tools applicable to spatial statistics and geostatistics; the most important toolboxes are Spatial Analyst, Spatial Statistics, and Geostatistical Analyst. ArcGIS is proprietary software, but it has many user contributors. For example, Geospatial Modeling Environment, developed by ecologist Hawthorne Beyer, has incorporated many important statistical tools into ArcGIS and allows for rudimentary integration of the R programming language (see below) into GIS. ESRI, the producer of ArcGIS, makes many resources available for learning the software, such as blogs, training, and video tutorials. The help file associated with the software is very well maintained and complete, which makes it an excellent learning tool.

- **R**: R is a free statistical programming language that contributors have extended to process spatial data. The software and many valuable resources on how to use it can be found at the R website. Learning R should be considered essential for any GIS analyst. However, R requires some knowledge of scripting and can therefore have a steep learning curve for beginners. In this text, we will cover the basics of R so that you can start integrating it into your analyses. We will use the R Commander package (Rcmdr) as a graphical user interface to reduce the need to type commands and understand R syntax. The text by Natasha Karp titled "R commander an Introduction" is an excellent primer on R Commander. Although R does require scripting, it is excellent analysis software because of the huge user base that contributes to it. Almost any type of statistical analysis you can imagine is available as an optional add-on package for R, and there are many blogs and online learning tools for R users.

- **Octave**: Octave is GNU software that is very similar to Matlab, a popular and very powerful proprietary package created by Mathworks. Instructions to download and install Octave can be found at the Octave website. Octave also requires knowledge of its own scripting language, whose syntax is largely compatible with Matlab scripts. The language used by Matlab and Octave is an array programming language (also known as a vector programming language), which makes Octave well suited to processing raster data. Octave also has statistical packages created and maintained by its users.

- **Python**: Python is an open-source, high-level, object-oriented programming language. It requires some knowledge of programming, but it is relatively easy to use and is a good language for learning how to program. Python is an excellent tool for numerical analysis and is getting better every year. A number of libraries help with numerical analysis, including NumPy, SciPy, pandas, and rpy2. The learning curve can feel steep at first, but it is worth the time, and several resources are available. I recommend WinPython, especially for scientists and people familiar with Matlab and Octave. One excellent advantage of learning Python is that it is supported by ArcGIS via the ArcPy Python library.

What is the role of statistics (the subject)?

Statistics provides methods for organizing and simplifying data so that we can understand their significance. You can use statistics to distill information into comprehensible and intuitive displays, including graphs and tables, and to summarize data in a manner that is numerically sound and useful. In the time of “big data,” this ability is very important. Statistics also provides methods for drawing inferences from samples, which allows you to answer questions and compare data sets with a stated level of confidence and quantified error. We can make informed decisions when collecting samples, so that we can properly describe the population in which we are interested. Statistics allows us to account for uncertainty and define error. In short, with statistics we can use limited information to describe a larger population.

An understanding of statistics is important to avoid push-button (aka black box) calculations. With the widespread availability of modern software tools, it is very easy to put data into the software, push a button, and get results. If you do not understand the tool you are using, you may misinterpret those results or draw conclusions the data do not support.
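
To make the idea of describing a larger population from limited information more concrete, here is a minimal Python sketch using only the standard library. Everything in it is an illustrative assumption: the simulated "elevation" population, the sample size of 100, and the function name `mean_with_interval`. It uses a normal approximation for the interval, which is a simplification rather than the only valid choice.

```python
import math
import random
import statistics


def mean_with_interval(sample, z=1.96):
    """Return the sample mean and an approximate 95% confidence interval.

    Uses the normal approximation (z = 1.96), a simplifying assumption;
    small samples would normally call for a t-distribution instead.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error: sample standard deviation divided by sqrt(n).
    std_error = statistics.stdev(sample) / math.sqrt(n)
    return mean, (mean - z * std_error, mean + z * std_error)


# Hypothetical population: 10,000 simulated elevation readings in meters.
random.seed(42)
population = [random.gauss(250.0, 30.0) for _ in range(10_000)]

# Measure only 100 of them, as a field survey might.
sample = random.sample(population, 100)

mean, (low, high) = mean_with_interval(sample)
print(f"sample mean: {mean:.1f} m, approx. 95% interval: {low:.1f} to {high:.1f} m")
print(f"population mean: {statistics.mean(population):.1f} m")
```

The interval is what keeps this from being a black box: it reports not just an estimate but also how far that estimate could plausibly be from the population value.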

What does this have to do with spatial data?

Spatial statistics can condense several confusing maps of data into one clear, intuitive map. With statistics, we can summarize several thousand points covering an area. If we know spatial statistics, we can interpolate data properly, making informed decisions based on how the data are distributed. Spatial statistics allow us to predict what to expect at a specific location. We can use them to explain different phenomena at different locations and to describe how data are related in space and time. They can also help optimize the placement or movement of spatially oriented items.
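
Two of these ideas, summarizing many points and predicting a value at a specific location, can be sketched in a few lines of Python. This is a rough illustration, not the method used by any particular GIS package. The sample coordinates and values are made up, the function names are hypothetical, and inverse distance weighting (IDW) is just one simple interpolation technique among many.

```python
import math


def mean_center(points):
    """Average x and average y of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def standard_distance(points):
    """Root-mean-square distance of the points from their mean center."""
    cx, cy = mean_center(points)
    total = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)
    return math.sqrt(total / len(points))


def idw_predict(samples, target, power=2.0):
    """Inverse distance weighted estimate at target from (x, y, value) samples."""
    tx, ty = target
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in samples:
        distance = math.hypot(x - tx, y - ty)
        if distance == 0.0:
            return value  # target coincides with a sample location
        weight = 1.0 / distance ** power  # nearer samples count for more
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical samples: (x, y, measured value)
samples = [(0.0, 0.0, 10.0), (4.0, 0.0, 20.0), (0.0, 4.0, 30.0), (4.0, 4.0, 40.0)]
locations = [(x, y) for x, y, _ in samples]

print("mean center:", mean_center(locations))
print("standard distance:", standard_distance(locations))
print("IDW estimate at (2, 2):", idw_predict(samples, (2.0, 2.0)))
```

The mean center and standard distance reduce a whole point pattern to a location and a spread, while the IDW estimate fills in a value where nothing was measured. The `power` parameter controls how quickly a sample's influence falls off with distance, which is exactly the kind of choice that requires understanding the data.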

Example of Spatial Data

One of the first well-documented cases of applied spatial statistics was the work of John Snow in London, England. In 1854, Snow used maps to outline outbreaks of cholera in the city. Based on his data, he was able to pinpoint the source of the outbreak: a specific community water well. Once he convinced authorities to remove the handle from the pump on that well, the cholera cases in the area dwindled. A copy of these historic data is available at Robin's blog.
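
The kind of reasoning behind this example can be sketched in Python by assigning each case to its nearest pump and counting. The coordinates below are made up purely for illustration and are not Snow's data. The pump names are hypothetical, and straight-line distance is a simplifying assumption.

```python
import math
from collections import Counter


def nearest_pump(case, pumps):
    """Return the name of the pump closest to a case location."""
    return min(pumps, key=lambda name: math.dist(case, pumps[name]))


# Made-up coordinates for illustration only; these are NOT Snow's data.
pumps = {"pump_a": (0.0, 0.0), "pump_b": (10.0, 0.0), "pump_c": (5.0, 8.0)}
cases = [(1.0, 1.0), (0.5, 2.0), (2.0, 0.0), (1.5, 1.5), (9.0, 1.0), (5.0, 7.0)]

# Count how many cases fall closest to each pump.
counts = Counter(nearest_pump(case, pumps) for case in cases)
for name, count in counts.most_common():
    print(name, count)
```

A pump that attracts a disproportionate share of nearby cases becomes a candidate source worth investigating, which is the spatial pattern that a map makes visible.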