# Data Grouping Options for NM-IBIS Maps

An analysis of maps prepared by authors in various academic disciplines fails to
show any rational or standardized procedures for the selection of class intervals.
Evidently intuition, inspiration, revelation, mystical hunches, prejudices, legerdemain,
and predetermined ideas of what the class intervals should be have characterized the
work of most map-makers… Apparently many authors believe that maps are an art-form
which allow liberties not admissible in verbal or tabular presentation (Jenks and
Coulson 1963:120).

NM-IBIS query results and indicator reports that provide data by geographic area (e.g., counties, small areas, etc.) display a map. These NM-IBIS maps are "choropleth" maps. This page describes choropleth maps and the grouping options available for them on NM-IBIS.

# 1. About NM-IBIS Maps

Choropleth maps display data for predefined geographic areas. The areas on a choropleth map are shaded or patterned to reflect values of a variable such as population density or birth rate. Choropleth maps are an easy way to visualize differences and patterns across geographic areas.

One challenge presented by choropleth maps is that, by forcing the data into discrete geographic zones, the underlying data distribution can be obscured or misrepresented (either purposefully or accidentally). It therefore helps to understand the methods used to group the map data. For the most part, data classification involves two basic issues: 1) identifying the number of groups and 2) identifying how to assign geographic areas to each group. If too few groups are used, a choropleth map may obscure subtle gradations in a spatial distribution. Too many groups are also unlikely to reveal existing spatial patterns, because the viewer can be visually overwhelmed. Most map readers have difficulty distinguishing among more than seven classes (Kraak and Ormeling 2003).

Different types of classification can be used to assign geographic areas to groups. Some grouping methods are better suited than others for different data types. When selecting a grouping method, the underlying data distribution should first be explored. Common classification types that are available on IBIS include equal groups (quantiles), equal intervals, mean standard deviation, Jenks natural breaks, arithmetic progression, and geometric progression. These classification methods and their applications are further described below.

Choropleth maps have an inherent weakness, in that they require the aggregation of data into geographic areas (e.g., counties) that do not necessarily correspond exactly with the data's underlying spatial distribution. To maximize the effectiveness of such a map, the data grouping method should balance several goals. After classification, each group should contain an appropriately apportioned share of the observed data values. The resulting map should faithfully represent spatial patterns without excluding extreme high or low values. It should also approach the data's statistical surface (a three-dimensional data representation in which the z-coordinate is proportional to the data value) as closely as possible (Kraak and Ormeling 2003).

Finally, choropleth maps have other limitations. Small geographic areas that contain a large number of cases (e.g., cities) have less visual impact and attract less of the viewer's attention than large geographic areas (e.g., rural areas), which may be sparsely populated. Another common error is mapping raw data counts, which represent magnitude. A choropleth map is better suited to normalized values that yield a rate, density, concentration, or the like for each geographic unit (Monmonier 1991:22-23).
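
The difference between raw counts and normalized values can be illustrated with a minimal sketch. The area names, case counts, and populations below are hypothetical, and `rate_per_100k` is an illustrative helper rather than an NM-IBIS function.

```python
def rate_per_100k(count, population):
    """Convert a raw count to a rate per 100,000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 100_000 / population


# Hypothetical areas: (name, cases, population)
areas = [("Urban A", 500, 250_000), ("Rural B", 40, 8_000)]

counts = {name: cases for name, cases, _ in areas}
rates = {name: rate_per_100k(cases, pop) for name, cases, pop in areas}

# By raw count, "Urban A" looks far worse; by rate, "Rural B" is higher.
highest_by_count = max(counts, key=counts.get)
highest_by_rate = max(rates, key=rates.get)
```

In this made-up example the ranking of the two areas reverses once the counts are divided by population, which is why a map of counts and a map of rates can tell different stories.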

See also: http://en.wikipedia.org/wiki/Choropleth_map

# 2. Data-Grouping Methods

## a. Equal Groups (Quantiles)

This grouping method distributes all the values into some number of groups, with each group having the same number of observations. Data that are evenly distributed (i.e., showing a rectangular or flat shape in a histogram) are well suited to quantile classification. Also, "quantiles seem to be one of the best methods for facilitating comparison [among a series of maps] as well as aiding general map reading" (Brewer and Pickle 2002:679), and this method is also useful for conducting experimental data analysis. One possible disadvantage of quantile classification may arise when large gaps occur between attribute values; such gaps may lead to an over-weighting of an outlier in that class. A two-class quantile identifies the median, while three-class quantiles are called tertiles or terciles, four-class quantiles are called quartiles, and five-class quantiles are called quintiles.
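
As a rough illustration of the idea, the sketch below sorts the values and cuts them into groups of (nearly) equal size. It is an assumption-laden sketch, not the NM-IBIS implementation: the function name `quantile_groups` is hypothetical, and tied values are not given any special treatment.

```python
def quantile_groups(values, k):
    """Split values into k groups with (nearly) equal numbers of observations."""
    ordered = sorted(values)
    n = len(ordered)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")
    groups = []
    start = 0
    for i in range(1, k + 1):
        end = i * n // k  # integer cut point after the i-th group
        groups.append(ordered[start:end])
        start = end
    return groups


# Hypothetical rates for 12 areas, with one extreme value.
example_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 100]
quartiles = quantile_groups(example_values, 4)
```

Each of the four groups holds three areas, but the top group spans 10 to 100. This is the gap problem described above: the outlier shares a class, and a map color, with values that are far from it.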

See also: http://wiki.gis.com/wiki/index.php/Quantile

## b. Equal Intervals

This classification method splits the entire data span (from lowest to highest value) into intervals that are the same size, each containing the same proportion of the range of values. Data that are evenly distributed (i.e., showing a rectangular or flat shape in a histogram) are well suited to equal interval classification. Choropleth maps created with this classification are good for revealing values that are over- or underrepresented, but when one interval holds most of the observations, the resulting map will be shaded mostly the same color.
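
A minimal sketch of equal interval classification follows. The helper names and the sample values are hypothetical, and the choice to place a value that falls exactly on a breakpoint into the higher class is an assumption, not documented NM-IBIS behavior.

```python
def equal_interval_breaks(values, k):
    """Return k + 1 class edges that split the data range into equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


def assign_class(value, breaks):
    """Return the 0-based class index; the maximum falls in the last class."""
    k = len(breaks) - 1
    for i in range(1, k):
        if value < breaks[i]:
            return i - 1
    return k - 1


# Hypothetical skewed data: most values are small, one is very large.
sample = [0, 2, 3, 5, 8, 10, 12, 100]
breaks = equal_interval_breaks(sample, 4)

class_counts = [0, 0, 0, 0]
for v in sample:
    class_counts[assign_class(v, breaks)] += 1
```

With these values, seven of the eight areas land in the lowest class and two classes are empty, so the map would be shaded almost entirely in one color.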

See also: http://wiki.gis.com/wiki/index.php/Equal_Interval_classification

## c. Mean, Standard Deviation

The standard deviation classification method forms classes by adding and subtracting a defined portion of the standard deviation from the mean of the dataset. This method is most appropriately suited for use with data that conforms to a normal (bell-shaped) distribution in a histogram, but this method can provide valuable visual breaks even when used to map highly skewed data. Note that the use of standard deviation classification is not appropriate for data ranges defined by percentages, unless weighted averaging can be implemented (which is not presently available in NM-IBIS).

As implemented on NM-IBIS, the proportion of the standard deviation that is captured in each class is dependent upon the number of classes that are selected.

- If two classes are displayed, the breakpoint between the classes is the mean, and the low and the high class then include all standard deviations below and above the mean to the limits of the data values.
- If three classes are displayed, the middle class contains the mean and extends to +/- 0.5 standard deviations. The highest and lowest classes then include all other ranges from +/- 0.5 standard deviations out to the limits of the data values.
- If four classes are displayed, again the breakpoint between the two middle classes is the mean. The next highest and lowest classes extend +/- 0 to 1 standard deviation from the mean, and the maximally highest and lowest classes then include all other ranges from +/- 1 standard deviation to the limits of the data values.
- If five classes are displayed, again the middle class contains the mean and extends to +/- 0.5 standard deviations. The next highest and lowest classes then extend +/- 0.5 to 1.5 standard deviations, and the maximally highest and lowest classes then include all other ranges from +/- 1.5 standard deviations out to the limits of the data values.
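
The breakpoint rules listed above can be sketched as follows. This is an illustration under stated assumptions: the population standard deviation (`statistics.pstdev`) is used, values that fall exactly on a breakpoint go to the higher class, and the names `SD_OFFSETS`, `std_dev_breaks`, and `classify` are hypothetical. The source does not say which conventions NM-IBIS follows.

```python
from bisect import bisect_right
from statistics import mean, pstdev

# Interior breakpoints, in standard deviations from the mean,
# for each number of classes described in the list above.
SD_OFFSETS = {
    2: [0.0],
    3: [-0.5, 0.5],
    4: [-1.0, 0.0, 1.0],
    5: [-1.5, -0.5, 0.5, 1.5],
}


def std_dev_breaks(values, k):
    """Return the interior breakpoints for k classes (k in 2..5)."""
    if k not in SD_OFFSETS:
        raise ValueError("k must be 2, 3, 4, or 5")
    m = mean(values)
    s = pstdev(values)  # assumption: population standard deviation
    return [m + offset * s for offset in SD_OFFSETS[k]]


def classify(value, breaks):
    """Return the 0-based class index (assumption: ties go to the higher class)."""
    return bisect_right(breaks, value)


# Hypothetical values with mean 5 and population standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]
breaks_5 = std_dev_breaks(data, 5)
classes_5 = [classify(v, breaks_5) for v in data]
```

The lowest and highest classes are open-ended: anything below the first breakpoint or at or above the last one is assigned to them, which matches the "out to the limits of the data values" wording above.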

See also: http://wiki.gis.com/wiki/index.php/Probability_distribution, http://en.wikipedia.org/wiki/Standard_deviation

## d. Jenks Natural Breaks

The Jenks Natural Breaks method, also referred to as the Jenks Optimization method, is a data-classification method designed to determine the best way to classify features using natural breaks in data values. It is often evaluated with the goodness of variance fit (GVF) measure. The method was developed with the intention of dividing data into relatively few data classes (seven or fewer) for mapping purposes. Jenks Natural Breaks iteratively compares the sums of the squared differences between the observed values within each class and the class means. The best resulting classification identifies breaks in the ordered distribution of values that minimize the variance within classes and maximize the variance between classes (Jenks 1967).

The Jenks Natural Breaks method is well suited to the creation of choropleth maps because it identifies real classes within the data, resulting in maps that can accurately portray data trends. It is a good choice for datasets that are multi-modal, but it is not recommended for data that have a low variance. Also, this classification is data-specific and is not useful for comparing multiple maps built from different datasets.
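
The optimization goal can be illustrated with a small sketch. Rather than the iterative procedure described above, it uses an exhaustive dynamic-programming search over all contiguous partitions of the sorted values, which reaches the same objective (minimum within-class sum of squared deviations) for small inputs. The function names and sample data are hypothetical, and this is not claimed to be the NM-IBIS implementation.

```python
def natural_break_groups(values, k):
    """Partition sorted values into k contiguous classes that minimize
    the total within-class sum of squared deviations (SSD)."""
    x = sorted(values)
    n = len(x)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")

    # Prefix sums let us compute the SSD of any slice x[i:j] in O(1).
    s1, s2 = [0.0], [0.0]
    for v in x:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def ssd(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[c][j]: best total SSD when x[:j] is split into c classes.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = cost[c - 1][i] + ssd(i, j)
                if candidate < cost[c][j]:
                    cost[c][j] = candidate
                    cut[c][j] = i

    # Walk the stored cut points backwards to recover the classes.
    groups, j = [], n
    for c in range(k, 0, -1):
        i = cut[c][j]
        groups.append(x[i:j])
        j = i
    groups.reverse()
    return groups


def goodness_of_variance_fit(values, groups):
    """GVF = 1 - (within-class SSD / total SSD); undefined if all values are equal."""
    m = sum(values) / len(values)
    total = sum((v - m) ** 2 for v in values)
    within = sum(
        sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups
    )
    return 1 - within / total


# Hypothetical multi-modal data with three obvious clusters.
clustered = [1, 2, 3, 10, 11, 12, 30, 31, 32]
found = natural_break_groups(clustered, 3)
fit = goodness_of_variance_fit(clustered, found)
```

A GVF close to 1 indicates that nearly all of the variance lies between classes rather than within them. When the data have very low variance, the total in the denominator approaches zero and the measure becomes unstable, which is consistent with the caution above.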

See also: http://wiki.gis.com/wiki/index.php/Jenks_Natural_Breaks_Classification

## e. Geometric Progression

For data with heavy-tailed (skewed) distributions, classes generally cannot be imposed in a linear manner (e.g., as equal steps); instead, a nonlinear method can be used. Using the geometric progression method, the widths of the category intervals increase at a geometric (i.e., multiplicative) rate. Starting from the lowest value, each following class breakpoint is derived from the previous one by multiplying by a constant, C, the ratio of the series. The logarithm of C is found by taking the difference of the logarithms of the highest and lowest values and dividing by the number of classes (Kraak and Ormeling 2003).
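
The calculation can be sketched as below, assuming base-10 logarithms and strictly positive data (the logarithm of zero or a negative value is undefined). The function name `geometric_breaks` and the example range are hypothetical.

```python
import math


def geometric_breaks(lowest, highest, k):
    """Return k + 1 class edges whose successive ratios are constant."""
    if lowest <= 0 or highest <= lowest:
        raise ValueError("need 0 < lowest < highest")
    # log C = (log highest - log lowest) / number of classes
    log_c = (math.log10(highest) - math.log10(lowest)) / k
    c = 10 ** log_c
    return [lowest * c ** i for i in range(k + 1)]


# Hypothetical data range from 1 to 10,000 split into four classes.
edges = geometric_breaks(1, 10_000, 4)
widths = [b - a for a, b in zip(edges, edges[1:])]
```

For this range the ratio works out to 10, so the class edges fall at roughly 1, 10, 100, 1,000, and 10,000, and each class is ten times wider than the one before it.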

This method is best applied to IBIS data that are positively (right) skewed (producing a J-shaped distribution curve with a peak at the low end of a histogram), particularly when there is a long "stretch" between low and high values. For datasets that are normally distributed or that are rectangular or flat, the classification results of geometric progression may not provide useful discriminatory classes; in fact, the resulting classes may resemble an equal interval or arithmetic progression classification instead. Further, even when appropriately applied to a skewed dataset, the class intervals imposed by geometric progression may not capture the underlying data hierarchy (Jiang 2013). Finally, for data that are heavily skewed toward the left, an inverse geometric progression could be implemented, but this functionality is not presently available in NM-IBIS.

See also: http://en.wikipedia.org/wiki/Geometric_progression

## f. Arithmetic Progression

Similar to geometric progression in its applicability to skewed distributions, this classification method increases the widths of the category intervals at an arithmetic (i.e., additive) rate. As implemented on IBIS, if the first category is one unit wide, for example, the next categories are incremented one additional unit at a time, resulting in a second category that is two units wide, a third category three units wide, and so forth to the end of the distribution. This method has many of the same strengths and shortcomings as the geometric method, but can provide a nonlinear classification at a different scale, which may be appropriate to different data sets.
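
One way to sketch this scheme is shown below. It assumes the "unit" is scaled so that the classes, with widths of 1, 2, 3, ... units, exactly span the data range; that scaling rule and the name `arithmetic_breaks` are assumptions made for illustration, not documented NM-IBIS behavior.

```python
def arithmetic_breaks(lowest, highest, k):
    """Return k + 1 class edges whose widths grow as 1, 2, ..., k units."""
    if highest <= lowest or k < 1:
        raise ValueError("need lowest < highest and k >= 1")
    # The widths sum to unit * (1 + 2 + ... + k) = unit * k * (k + 1) / 2.
    unit = (highest - lowest) / (k * (k + 1) / 2)
    edges = [float(lowest)]
    for i in range(1, k + 1):
        edges.append(edges[-1] + i * unit)
    return edges


# Hypothetical data range from 0 to 15 split into five classes.
steps = arithmetic_breaks(0, 15, 5)
```

Here the unit is 1, so the class widths are 1, 2, 3, 4, and 5. The widths grow by a constant amount rather than by a constant ratio, giving a gentler nonlinear scale than the geometric method.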

See also: http://en.wikipedia.org/wiki/Arithmetic_progression

Visit the NM-IBIS Help Page for Downloading Map Layers for use in other programs.

## References

1. Jenks, George, and Michael Coulson. 1963. Class Intervals for Statistical Maps. International Yearbook of Cartography, 3:119-134.

2. Kraak, Menno-Jan, and Ferjan Ormeling. 2003. Cartography: Visualization of Geospatial Data. Longman Group, United Kingdom.

3. Monmonier, Mark. 1991. How to Lie with Maps. University of Chicago Press, Chicago.

4. Brewer, Cynthia A., and Linda Pickle. 2002. Evaluation of Methods for Classifying Epidemiological Data on Choropleth Maps in Series. Annals of the Association of American Geographers, 92(4):662-681.

5. Jenks, George F. 1967. The Data Model Concept in Statistical Mapping. International Yearbook of Cartography, 7:186-190.

6. Jiang, Bin. 2013. Head/tail Breaks: A New Classification Scheme for Data with a Heavy-tailed Distribution. The Professional Geographer, 65(3):482-494.