STING: A Statistical Information Grid Approach to Spatial Data Mining
Wei Wang
Department of Computer Science
University of California, Los Angeles
CA 90095, U.S.A.
weiwang@cs.ucla.edu

Jiong Yang
Department of Computer Science
University of California, Los Angeles
CA 90095, U.S.A.
jyang@cs.ucla.edu

Richard Muntz
Department of Computer Science
University of California, Los Angeles
CA 90095, U.S.A.
muntz@cs.ucla.edu
Abstract
Spatial data mining, i.e., the discovery of interesting characteristics and patterns that may implicitly exist in spatial databases, is a challenging task due to the huge amounts of spatial data and to the new conceptual nature of the problems, which must account for spatial distance. Clustering and region oriented queries are common problems in this domain. Several approaches have been presented in recent years, all of which require at least one scan of all individual objects (points); consequently, the computational complexity is at least linearly proportional to the number of objects to answer each query. In this paper, we propose a hierarchical statistical information grid based approach for spatial data mining to reduce the cost further. The idea is to capture statistical information associated with spatial cells in such a manner that whole classes of queries and clustering problems can be answered without recourse to the individual objects. In theory, and confirmed by empirical studies, this approach outperforms the best previous method by at least an order of magnitude, especially when the data set is very large.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997
1 Introduction

In general, spatial data mining, or knowledge discovery in spatial databases, is the extraction of implicit knowledge, spatial relations, and interesting characteristics and patterns that are not explicitly represented in the databases. These techniques can play an important role in understanding spatial data and in capturing intrinsic relationships between spatial and nonspatial data. Moreover, such discovered relationships can be used to present data in a concise manner and to reorganize spatial databases to accommodate data semantics and achieve high performance. Spatial data mining has wide applications in many fields, including GIS systems, image database exploration, medical imaging, etc. [Che97, Fay96a, Fay96b, Kop96a, Kop96b]

The amount of spatial data obtained from satellite, medical imagery, and other sources has been growing tremendously in recent years. A crucial challenge in spatial data mining is the efficiency of spatial data mining algorithms, due to the often huge amount of spatial data and the complexity of spatial data types and spatial access methods. In this paper, we introduce a new STatistical INformation Grid-based method (STING) to efficiently process many common "region oriented" queries on a set of points. Region oriented queries are defined more precisely later; informally, they ask for the selection of regions satisfying certain conditions on density, total area, etc. This paper is organized as follows. We first discuss related work in Section 2. We propose our statistical information grid hierarchical structure and discuss the query types it can support in Sections 3 and 4, respectively. The general algorithm as well as a detailed example of processing a query are given in Section 5. We analyze the complexity of our algorithm in Section 6. In Section 7, we analyze the quality of STING's result and propose a sufficient condition under which STING is guaranteed to return the correct result. The limiting behavior of STING is discussed in Section 8 and, in Section 9, we analyze the
performance of our method. Finally, we offer our conclusions in Section 10.

2 Related Work

Many studies have been conducted in spatial data mining, such as generalization-based knowledge discovery [Kno96, Lu93], clustering-based methods [Est96, Ng94, Zha96], and so on. Those most relevant to our work are discussed briefly in this section, and we emphasize what we believe are limitations that are addressed by our approach.

2.1 Generalization-based Approach

[Lu93] proposed two generalization-based algorithms: spatial-data-dominant and non-spatial-data-dominant algorithms. Both of these require that a generalization hierarchy is given explicitly by experts or is somehow generated automatically. (However, such a hierarchy may not exist, or the hierarchy given by the experts may not be entirely appropriate in some cases.) The quality of mined characteristics is highly dependent on the structure of the hierarchy. Moreover, the computational complexity is O(N log N), where N is the number of spatial objects. Given the above disadvantages, there have been efforts to find algorithms that do not require a generalization hierarchy, that is, algorithms that can discover characteristics directly from the data. This is the motivation for applying cluster analysis in spatial data mining, which is used to identify regions occupied by points satisfying specified conditions.

2.2 Clustering-based Approach

2.2.1 CLARANS

[Ng94] presents a spatial data mining algorithm based on a clustering algorithm called CLARANS (Clustering Large Applications based upon RANdomized Search) on spatial data. This is the first paper that introduces clustering techniques into spatial data mining problems, and it represents a significant improvement on large data sets over traditional clustering methods. However, the computational complexity of CLARANS is still high. In [Ng94] it is claimed that CLARANS is linearly proportional to the number of points, but the algorithm is actually inherently at least quadratic. The reason is that CLARANS applies a random-search-based method to find an "optimal" clustering. The time taken to calculate the cost differential between the current clustering and one of its neighbors (in which only one cluster medoid is different) is linear, and the number of neighbors that need to be examined for the current clustering is controlled by a parameter called maxneighbor, defined as max(250, 1.25% × K(N − K)), where K is the number of

clusters. This means that the time consumed at each step of searching is O(KN²). It is very difficult to estimate how many steps need to be taken to reach the local optimum, but we can certainly say that the computational complexity of CLARANS is Ω(KN²). This observation is consistent with the results of our experiments and those mentioned in [Est96], which show that the performance of CLARANS is close to quadratic in the number of points. Moreover, the quality of the results cannot be guaranteed when N is large, since randomized search is used in the algorithm. In addition, CLARANS assumes that all objects are stored in main memory. This clearly limits the size of the database to which CLARANS can be applied.

2.2.2 BIRCH

Another clustering algorithm for large data sets, called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), is introduced in [Zha96]. The authors employ the concepts of Clustering Feature and CF tree. A clustering feature summarizes the information about a cluster; a CF tree is a balanced tree used to store the clustering features. This algorithm makes full use of the available memory and requires a single scan of the data set. This is done by combining close clusters and rebuilding the CF tree, which guarantees that the computational complexity of BIRCH is linearly proportional to the number of objects. We believe BIRCH still has one other drawback: this algorithm may not work well when clusters are not "spherical" because it uses the concept of radius or diameter to control the boundary of a cluster¹.

2.2.3 DBSCAN

Recently, [Est96] proposed a density-based clustering algorithm (DBSCAN) for large spatial databases. Two parameters, Eps and MinPts, are used in the algorithm to control the density of normal clusters. DBSCAN is able to separate "noise" from clusters of points, where "noise" consists of points in low-density regions. DBSCAN makes use of an R*-tree to achieve good performance. The authors illustrate that DBSCAN can be used to detect clusters of any shape and can outperform CLARANS by a large margin (up to several orders of magnitude). However, the complexity of DBSCAN is O(N log N). Moreover, DBSCAN requires a human participant to determine the global parameter Eps. (The parameter MinPts is fixed to 4 in their algorithm to reduce the computational complexity.) Before determining Eps, DBSCAN has to calculate the distance between a point and its kth (k = 4) nearest neighbors for all points. Then it

¹ We could not verify this since we do not have the BIRCH source code.

determine a "kernel" calculation in the generic algorithm, as will be discussed in detail shortly.

3.2 Parameter Generation

We generate the hierarchy of cells with their associated parameters when the data is loaded into the database. Parameters n, m, s, min, and max of bottom level cells are calculated directly from the data. The value of dist (the distribution type) can either be assigned by the user, if it is known beforehand, or obtained by hypothesis tests such as the χ²-test. Parameters of higher level cells can be easily calculated from the parameters of lower level cells. Let n, m, s, min, max, and dist be the parameters of the current cell, and ni, mi, si, mini, maxi, and disti be the parameters of the corresponding lower level cells, respectively. Then n, m, s, min, and max can be calculated as follows.

n = Σi ni

m = (Σi mi · ni) / n

s = sqrt( (Σi (si² + mi²) · ni) / n − m² )

min = min_i(mini)

max = max_i(maxi)
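These combination rules are exact for counts, means, and population standard deviations, which a short sketch can confirm (the code and names below are mine, not the paper's):

```python
import math
from statistics import mean, pstdev

def aggregate(children):
    """Combine the (n, m, s, min, max) statistics of child cells into the
    parent cell's statistics, following the update rules above."""
    n = sum(c["n"] for c in children)
    m = sum(c["m"] * c["n"] for c in children) / n
    s = math.sqrt(sum((c["s"] ** 2 + c["m"] ** 2) * c["n"] for c in children) / n - m ** 2)
    return {"n": n, "m": m, "s": s,
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}

# Check against a direct computation over the pooled points.
a = [4.0, 5.0, 9.0]
b = [1.0, 7.0, 7.0, 8.0]
child = lambda pts: {"n": len(pts), "m": mean(pts), "s": pstdev(pts),
                     "min": min(pts), "max": max(pts)}
parent = aggregate([child(a), child(b)])
assert abs(parent["m"] - mean(a + b)) < 1e-9    # pooled mean matches
assert abs(parent["s"] - pstdev(a + b)) < 1e-9  # pooled std matches
```

The s rule works because si² + mi² is exactly the mean of the squared values in child i, so the weighted average recovers the pooled second moment.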

The determination of dist for a parent cell is a bit more complicated. First, we set dist to the distribution type followed by most points in this cell. This can be done by examining disti and ni. Then, we estimate the number of points, say confl, that conflict with the distribution determined by dist, m, and s according to the following rules:

1. If disti ≠ dist, mi ≈ m, and si ≈ s, then confl is increased by an amount of ni;
2. If disti ≠ dist, but either mi ≈ m or si ≈ s is not satisfied, then set confl to n (this enforces that dist will be set to NONE later);
3. If disti = dist, mi ≈ m, and si ≈ s, then confl is not changed;
4. If disti = dist, but either mi ≈ m or si ≈ s is not satisfied, then confl is set to n.

Finally, if confl/n is greater than a threshold t (this threshold is a small constant, say 0.05, which is set before the hierarchical structure is built), then we set dist to NONE; otherwise, we keep the original type. For example, suppose the parameters of the lower level cells are as follows.

Table 1: Parameters of Children Cells

i      |   1    |   2    |   3    |   4
-------|--------|--------|--------|------
ni     |  100   |   50   |   60   |  10
mi     |  20.1  |  19.7  |  21.0  |  20.
si     |  2.3   |  2.2   |  2.4   |  2.
mini   |  4.5   |  5.5   |  3.8   |  7
maxi   |  36    |  34    |  37    |  40
disti  | NORMAL | NORMAL | NORMAL | NONE

Then the parameters of the current cell will be

n = 220
m = 20.
s = 2.
min = 3.8
max = 40
dist = NORMAL

The distribution type is still NORMAL based on the following: since there are 210 points whose distribution type is NORMAL, dist is first set to NORMAL. After examining disti, mi, and si of each lower level cell, we find that confl = 10. So, dist is kept as NORMAL (confl/n = 0.045 < 0.05).
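The rules above can be sketched as follows. Cell 4's mi and si are illegible in this scan, so the 20.5 and 2.4 used for it below are placeholders, as are the parent's m and s and the tolerance behind the ≈ test:

```python
def parent_dist(children, m, s, t=0.05, approx=lambda x, y: abs(x - y) < 1.0):
    """Determine the parent cell's distribution type from its children,
    following rules 1-4 above. `approx` stands in for the paper's
    'approximately equal' test; the tolerance here is an assumption."""
    n = sum(c["n"] for c in children)
    # dist: the distribution type followed by most points in this cell
    votes = {}
    for c in children:
        votes[c["dist"]] = votes.get(c["dist"], 0) + c["n"]
    dist = max(votes, key=votes.get)
    confl = 0
    for c in children:
        if c["dist"] != dist:
            if approx(c["m"], m) and approx(c["s"], s):
                confl += c["n"]          # rule 1
            else:
                confl = n                # rule 2
                break
        elif not (approx(c["m"], m) and approx(c["s"], s)):
            confl = n                    # rule 4
            break                        # rule 3: no change
    return ("NONE" if confl / n > t else dist), confl

# Table 1's children; cell 4's m and s are placeholders, not the paper's values.
children = [
    {"n": 100, "m": 20.1, "s": 2.3, "dist": "NORMAL"},
    {"n": 50,  "m": 19.7, "s": 2.2, "dist": "NORMAL"},
    {"n": 60,  "m": 21.0, "s": 2.4, "dist": "NORMAL"},
    {"n": 10,  "m": 20.5, "s": 2.4, "dist": "NONE"},
]
dist, confl = parent_dist(children, m=20.3, s=2.3)
assert (dist, confl) == ("NORMAL", 10)   # confl/n = 10/220 ≈ 0.045 < 0.05
```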

Since we only need to go through the data set once in order to calculate the parameters associated with the grid cells at the bottom level, the overall computation time is linearly proportional to the number of objects, with a small constant factor. (And this only has to be done once, not for each query.) With this structure in place, the response time for a query is much faster, since it is O(K) instead of O(N). We will analyze performance in more detail in later sections.

4 Query Types

If the statistical information stored in the STING hierarchical structure is not sufficient to answer a query, then we have recourse to the underlying database. Therefore, we can support any query that can be expressed by the SQL-like language described later in this section. However, the statistical information in the STING structure can answer many commonly asked queries very efficiently, and we often do not need to access the full database. Even when the statistical information is not enough to answer a query, we can still narrow the set of possible choices.

STING can be used to facilitate several kinds of spatial queries. The most commonly asked query is the region query, which is to select regions that satisfy certain conditions (Ex1). Another type of query selects regions and returns some function of the region, e.g., the range of some attribute within the region (Ex2). We extend SQL so that it can be used to describe such queries; the formal definition is in [Wan97]. The following are several query examples.

Ex1. Select the maximal regions that have at least 100 houses per unit area and at least 70% of the house prices above $400K, with total area at least 100 units, with 90% confidence.

SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND WITH CONFIDENCE 0.9

Ex2. Select the range of age of houses in those maximal regions where there are at least 100 houses per unit area and at least 70% of the houses have price between $150K and $300K, with area at least 100 units, in California.

SELECT RANGE(age)
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (150000, 300000) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND LOCATION California

5 Algorithm

With the hierarchical structure of grid cells on hand, we can use a top-down approach to answer spatial data mining queries. For each query, we begin by examining cells on a high level layer. Note that it is not necessary to start with the root; we may begin from an intermediate layer (but we do not pursue this minor variation further due to lack of space). Starting with the root, we calculate the likelihood that this cell is relevant to the query at some confidence level, using the parameters of this cell (exactly how this is computed is described later). This likelihood can be defined as the proportion of objects in this cell that satisfy the query conditions. (If the distribution type is NONE, we estimate the likelihood using distribution-free techniques instead.) After we obtain the confidence interval, we label this cell as relevant or not relevant at the specified confidence level. When we finish examining the current layer, we proceed to the next lower level of cells and repeat the same process. The only difference is that instead of going through all cells, we only look at those cells that are children of the relevant cells of the previous layer. This procedure continues until we finish examining the lowest level layer (the bottom layer). In most cases, these relevant cells and their associated statistical information are enough to give a satisfactory answer to the query; we then find all the regions formed by relevant cells and return them. However, in rare cases (e.g., when a very accurate result is required for special purposes such as military applications), this information is not enough to answer the query, and we need to retrieve the data that fall into the relevant cells from the database and do some further processing.

After we have labeled all cells as relevant or not relevant, we can easily find all regions that satisfy the specified density by a breadth-first search. For each relevant cell, we examine the cells within a certain distance (how to choose this distance is discussed below) from the center of the current cell to see if the average density within this small area is greater than the specified density. If so, this area is marked, and all relevant cells we just examined are put into a queue. Each time, we take one cell from the queue and repeat the same procedure, except that only those relevant cells that have not been examined before are enqueued. When the queue is empty, we have identified one region. The distance used above is calculated from the specified density and the granularity of the bottom

level cell. The distance d = max(l, sqrt(f/(c·π))), where l, c, and f are the side length of a bottom layer cell, the specified density, and a small constant set by STING (it does not vary from query to query), respectively. Usually, l is the dominant term in max(l, sqrt(f/(c·π))); as a result, this distance can only reach the neighboring cells. In this case, we just need to examine neighboring cells and find regions that are formed by connected cells. Only when the granularity is very small could this distance cover a number of cells; in that case, we need to examine every cell within this distance instead of only neighboring cells. For example, if the objects in our database are houses and price is one of the attributes, then one kind of query could be "Find those regions with area at least A where the number of houses per unit area is at least c and at least p% of the houses have price between a and b with (1 − α) confidence", where a < b. Here, a could be −∞ and b could be +∞. This query can be written as

SELECT REGION
FROM house-map
WHERE DENSITY IN [c, ∞)
AND price RANGE [a, b] WITH PERCENT [p%, 1]
AND AREA [A, ∞)
AND WITH CONFIDENCE 1 − α
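The top-down labeling and breadth-first region formation described in this section can be sketched as below. This is a toy reduction, not the authors' code: the per-cell statistics and confidence intervals are replaced by a plain density test on point counts, and the examining distance is fixed to immediate neighbors.

```python
from collections import deque

def sting_regions(points, levels=3, density=2.0, size=4.0):
    """Toy STING-style query over 2D points in a size x size square:
    label cells top-down by a density test, then group relevant bottom
    cells into connected regions by breadth-first search."""
    side = 2 ** (levels - 1)          # bottom layer is side x side cells
    cell = size / side
    counts = {}
    for x, y in points:               # one scan of the data set
        key = (int(x // cell), int(y // cell))
        counts[key] = counts.get(key, 0) + 1

    # Examine layers top-down; only children of relevant cells are visited.
    frontier, relevant = [(0, 0)], []
    for level in range(levels):
        shrink = side // 2 ** level   # bottom cells per cell edge
        area = (cell * shrink) ** 2
        relevant = [
            (i, j) for i, j in frontier
            if sum(v for (bi, bj), v in counts.items()
                   if bi // shrink == i and bj // shrink == j) / area >= density
        ]
        frontier = [(2 * i + di, 2 * j + dj)
                    for i, j in relevant for di in (0, 1) for dj in (0, 1)]
    bottom = set(relevant)            # relevant cells of the bottom layer

    # Group relevant bottom cells into connected regions by BFS.
    regions, seen = [], set()
    for start in bottom:
        if start in seen:
            continue
        seen.add(start)
        region, queue = [], deque([start])
        while queue:
            i, j = queue.popleft()
            region.append((i, j))
            for nb in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if nb in bottom and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        regions.append(region)
    return regions

# Four dense bottom cells in one corner form a single connected region.
pts = [(i + 0.5, j + 0.5) for i in (0, 1) for j in (0, 1) for _ in range(10)]
regions = sting_regions(pts)
assert len(regions) == 1 and len(regions[0]) == 4
```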

hierarchical structure. Notice that the total number of cells is 1.33K, where K is the number of cells at the bottom layer. We obtain the factor 1.33 because the number of cells of a layer is always one fourth of the number of cells of the layer one level lower. So the overall computational complexity on the grid hierarchy structure is O(K). Usually, the number of cells that need to be examined is much smaller, especially when many cells at high layers are not relevant. In Step 8, the time it takes to form the regions is linearly proportional to the number of cells: for a given cell, the number of cells that need to be examined is constant, because both the specified density and the granularity can be regarded as constants during the execution of a query, and in turn the distance is also a constant since it is determined by the specified density. Since we assume each cell at the bottom layer usually has several dozen to several thousand objects, K << N; so the total complexity is still O(K). Usually, we do not need to do Step 7, and the overall computational complexity is O(K). In the extreme case that we need to go to Step 7, we still do not need to retrieve all data from the database; therefore, the time required in this step is still less than linear. So, this algorithm greatly outperforms other approaches.
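The 1.33 factor is the geometric series over layers, since each layer has one fourth as many cells as the layer below it; a quick check (the number of levels is illustrative):

```python
# Total cells for a hierarchy whose bottom layer has K cells: each higher
# layer has a quarter as many, so the sum converges to K / (1 - 1/4) = 4K/3.
K = 4 ** 10                            # e.g., 10 levels below the root
total = sum(K // 4 ** i for i in range(11))
assert abs(total / K - 4 / 3) < 0.01   # ≈ 1.33 K
```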

7 Quality of STING

STING makes use of statistical information to approximate the expected result of a query. Therefore, it could be imprecise, since data points can be arbitrarily located. However, under the following sufficient condition, STING can guarantee that if a region satisfies the specification of the query, then it is returned.

Definition 1. Let F be a region. The width of F is defined as the side length of the maximum square that can fit in F.

Sufficient Condition: Let A and c be the minimum area and density specified by the query, respectively. Let R and W be a region satisfying the conditions specified by the query and its width, respectively. If W² − 4(⌈W/l⌉ + 1)l² ≥ A, where l is the side length of a bottom level cell, then R must be returned by STING.

Let S be a maximum square in R with side length W. Let I be the set of bottom level cells that intersect S. I can be divided into two disjoint subsets I1 and I2: I1 is the set of cells that cross the boundary of S, while I2 is the set of cells that are within S. It is obvious that all cells in I are connected. A line segment of length W can cross at most ⌈W/l⌉ + 1 bottom level cells; in turn, the cardinality of I1 is at most 4(⌈W/l⌉ + 1). The total area of cells in I1 is at most 4(⌈W/l⌉ + 1)l², and the total area of S is W². As a result, the total area of cells in I2 is at least W² − 4(⌈W/l⌉ + 1)l². STING can detect all the cells in I2 as relevant. Since W² − 4(⌈W/l⌉ + 1)l² ≥ A, the total area of cells in I2 is at least A. Therefore, STING can guarantee to return R. However, the boundary of the returned region could be slightly different from the expected one.
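The sufficient condition is straightforward to evaluate numerically; the function name and the numbers below are illustrative, not from the paper:

```python
import math

def sting_guarantee(W, l, A):
    """True if STING must return a region of width W, given bottom level
    cell side l and query minimum area A (sufficient condition above)."""
    inner = W ** 2 - 4 * (math.ceil(W / l) + 1) * l ** 2
    return inner >= A

# A region 20 units wide over cells of side 1: its interior cells cover at
# least 20^2 - 4*(20 + 1)*1^2 = 316 units, so any A up to 316 is guaranteed.
assert sting_guarantee(20, 1, 300) is True
assert sting_guarantee(20, 1, 400) is False
```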

8 Limiting Behavior of STING is Equivalent to DBSCAN

The regions returned by STING are an approximation of the result of DBSCAN. As the granularity approaches zero, the regions returned by STING approach the result of DBSCAN. In order to compare to DBSCAN, we only use the number of points here, since DBSCAN can only cluster points according to their spatial location (i.e., we do not consider conditions on other attributes). DBSCAN has two parameters: Eps and MinPts. (Usually, MinPts is fixed to k.) In our case, STING has only one parameter: the density c. We set

c = (MinPts + 1) / (Eps² · π) = (k + 1) / (Eps² · π)

in order to approximate the result of DBSCAN. The reason is that the density of any area inside the clusters detected by DBSCAN is at least (MinPts + 1) / (Eps² · π), since for each core point there are at least MinPts points (excluding itself) within distance Eps. In STING, for each cell, if n < S × c, then we label it as not relevant; otherwise, we label it as relevant, where n and S are the number of points in the cell and the area of a bottom layer cell, respectively. When we form the regions from relevant cells, the examining distance is set to d = max(l, sqrt((k + 1)/(c · π))). When the granularity is very small, sqrt((k + 1)/(c · π)) becomes the dominant term. As the granularity approaches zero, the area of each cell at the bottom layer goes to zero, so if there is at least one point in a cell, that cell will be labeled as relevant. Now what we need to do is form the region to be returned according to distance d and density c. We can see that

d = sqrt((k + 1)/(c · π)) = sqrt((k + 1) · Eps² · π / ((k + 1) · π)) = Eps.

For each relevant cell, we examine the area around it (within distance d) to see if the density is greater than c. This is equivalent to checking whether the number of points (including itself) within this area is greater than c × πd² = k + 1. As a result, the result of STING approaches that of DBSCAN as the granularity approaches zero.
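The algebra above can be spot-checked numerically (the Eps value is arbitrary):

```python
import math

k = 4                      # MinPts, as fixed in [Est96]
Eps = 0.7                  # arbitrary example value
c = (k + 1) / (Eps ** 2 * math.pi)          # STING density parameter
d = math.sqrt((k + 1) / (c * math.pi))      # dominant term of the distance
assert abs(d - Eps) < 1e-12                 # d collapses to Eps
assert abs(c * math.pi * d ** 2 - (k + 1)) < 1e-12  # count threshold = k + 1
```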

9 Performance

We ran several tests to evaluate the performance of STING. The following tests were run on a SPARC 10 machine with the Solaris 5.5 operating system and 192 MB of memory.

9.1 Performance Comparison of Two Distributions

To obtain performance metrics for STING, we implemented the house-price example discussed in Section 5; Ex1 is the query that we posed. We generated two data sets, both of which have 100,000 data points (houses). The hierarchical structure has seven layers in this test. First, we generated a data set (DS1) such that the price is normally distributed in each cell (with similar mean). The hierarchical structure generation time is 9.8 seconds. (Generation needs to be done once for each data set; all queries for the same data set can use the same structure, so we do not need to regenerate it for each query.) It takes STING 0. second to answer the query, given that the STING structure exists. The expected result and the result returned by STING are shown in Figures 3a and 3b, respectively; from them, we can see that STING's result is very close to the expected one. In the second data set (DS2), the prices in each bottom layer cell follow a normal distribution (with different means), but they do not follow any known distribution at higher levels. The hierarchical structure generation time is 9.7 seconds. It takes STING 0.22 second to answer the query. The expected result and the result returned by STING are shown in Figures 4a and 4b, respectively.

Figure 3a: Expected Result with DS1

Figure 3b: STING's Result on DS1

Figure 4a: Expected Result with DS2

Figure 4b: STING's Result on DS2

Brooks/Cole Publishing Company, Pacific Grove, California, 1991.

[Est95] M. Ester, H. P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. Proc. 4th Int. Symp. on Large Spatial Databases (SSD'95), pp. 67-82, Portland, Maine, August 1995.

[Est96] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD-96), pp. 226-231, Portland, OR, USA, August 1996.

[Fay96a] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery in databases. AI Magazine, 17(3):37-54, Fall 1996.

[Fay96b] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, Menlo Park, CA, 1996.

[Fot94] S. Fotheringham and P. Rogerson. Spatial Analysis and GIS. Taylor and Francis, 1994.

[Kno96] E. M. Knorr and R. Ng. Extraction of spatial proximity patterns by concept generalization. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD-96), pp. 347-350, Portland, OR, USA, August 1996.

[Kop96a] K. Koperski, J. Adhikary, and J. Han. Spatial data mining: Progress and challenges. SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'96), Montreal, Canada, June 1996.

[Kop96b] K. Koperski and J. Han. Data mining methods for the analysis of large geographic databases. Proc. 10th Annual Conf. on GIS, Vancouver, Canada, March 1996.

[Lu93] W. Lu, J. Han, and B. C. Ooi. Discovery of general knowledge in large spatial databases. Proc. Far East Workshop on Geographic Information Systems, pp. 275-289, Singapore, June 1993.

[Ng94] R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. Proc. 1994 Int. Conf. on Very Large Data Bases, pp. 144-155, Santiago, Chile, September 1994.

[Sam90] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1990.

[Sto93] M. Stonebraker, J. Frew, K. Gardels, and J. Meredith. The SEQUOIA 2000 storage benchmark. Proc. 1993 ACM-SIGMOD Int. Conf. on Management of Data, pp. 2-11, Washington, DC, 1993.

[Wan97] W. Wang, J. Yang, and R. R. Muntz. STING: A statistical information grid approach to spatial data mining. Technical Report No. 970006, Computer Science Department, UCLA, February 1997.

[Zha96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. Proc. 1996 ACM-SIGMOD Int. Conf. on Management of Data, pp. 103-114, Montreal, Canada, June 1996.