Abstract

1 Introduction
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 23rd VLDB Conference Athens, Greece, 1997
Section 9 evaluates the performance of our method. Finally, we offer our conclusions in Section 10.
2 Related Work

Many studies have been conducted in spatial data mining, such as generalization-based knowledge discovery [Kno96, Lu93], clustering-based methods [Est96, Ng94, Zha96], and so on. Those most relevant to our work are discussed briefly in this section, and we emphasize what we believe are limitations that are addressed by our approach.
2.1 Generalization-based Approach

[Lu93] proposed two generalization-based algorithms: spatial-data-dominant and non-spatial-data-dominant. Both require that a generalization hierarchy be given explicitly by experts or be generated automatically in some way. However, such a hierarchy may not exist, or the hierarchy given by the experts may not be entirely appropriate in some cases. The quality of the mined characteristics is highly dependent on the structure of the hierarchy. Moreover, the computational complexity is O(N log N), where N is the number of spatial objects. Given these disadvantages, there have been efforts to find algorithms that do not require a generalization hierarchy, that is, algorithms that can discover characteristics directly from data. This is the motivation for applying clustering analysis in spatial data mining, which is used to identify regions occupied by points satisfying specified conditions.
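As a deliberately simplified illustration of what a generalization hierarchy provides, and why its structure matters, the following Python sketch rolls point data up a hypothetical city-to-county-to-state hierarchy. The hierarchy table, place names, and attribute are illustrative assumptions, not material from [Lu93]:

    from collections import defaultdict

    # Hypothetical concept hierarchy: city -> county -> state (assumed, not from [Lu93]).
    hierarchy = {
        "Goleta": "Santa Barbara County", "Isla Vista": "Santa Barbara County",
        "Pasadena": "Los Angeles County",
        "Santa Barbara County": "California", "Los Angeles County": "California",
    }

    def generalize(location, levels=1):
        """Climb `levels` steps up the concept hierarchy."""
        for _ in range(levels):
            location = hierarchy.get(location, location)
        return location

    # Points with a non-spatial attribute (price); generalize, then summarize.
    points = [("Goleta", 250000), ("Isla Vista", 180000), ("Pasadena", 320000)]
    summary = defaultdict(list)
    for city, price in points:
        summary[generalize(city)].append(price)
    for region, prices in summary.items():
        print(region, sum(prices) / len(prices))  # mean price per generalized region

Note how the output depends entirely on how the hierarchy table was drawn: a different partition of cities into counties yields different mined characteristics, which is the dependence criticized above.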
2.2 Clustering-based Approach
2.2.1 CLARANS

[Ng94] presents a spatial data mining algorithm based on a clustering algorithm called CLARANS (Clustering Large Applications based upon RANdomized Search). This is the first paper to introduce clustering techniques into spatial data mining problems, and it represents a significant improvement over traditional clustering methods on large data sets. However, the computational complexity of CLARANS is still high. [Ng94] claims that CLARANS is linearly proportional to the number of points, but the algorithm is inherently at least quadratic. The reason is that CLARANS applies a random search-based method to find an "optimal" clustering. The time taken to calculate the cost differential between the current clustering and one of its neighbors (in which only one cluster medoid is different) is linear in N, and the number of neighbors that need to be examined for the current clustering is controlled by a parameter called maxneighbor, defined as max(250, 1.25% x K(N - K)), where K is the number of clusters. This means that the time consumed at each step of the search is O(KN^2). It is very difficult to estimate how many steps must be taken to reach the local optimum, but we can certainly say that the computational complexity of CLARANS is Omega(KN^2). This observation is consistent with the results of our experiments and those reported in [Est96], which show that the performance of CLARANS is close to quadratic in the number of points. Moreover, the quality of the results cannot be guaranteed when N is large, since randomized search is used in the algorithm. In addition, CLARANS assumes that all objects are stored in main memory, which clearly limits the size of the database to which it can be applied.
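As a rough illustration of why the per-step cost grows quadratically, the following back-of-the-envelope sketch (our construction, not code from [Ng94]) evaluates maxneighbor and the resulting per-step work for growing N:

    # Illustrative arithmetic only: per-step work of CLARANS under the
    # maxneighbor formula quoted above (K medoids, N points).
    def per_step_cost(n, k):
        maxneighbor = max(250, 0.0125 * k * (n - k))  # 1.25% x K(N - K)
        return maxneighbor * (n - k)  # each neighbor costs a linear cost-differential scan

    for n in (1_000, 10_000, 100_000):
        # Once past the 250 floor, cost grows ~100x per 10x in N, i.e., quadratically.
        print(n, per_step_cost(n, k=5))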
2.2.2 BIRCH

Another clustering algorithm for large data sets, called BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), is introduced in [Zha96]. The authors employ the concepts of Clustering Feature and CF tree. A clustering feature summarizes the information about a cluster; the CF tree is a balanced tree used to store the clustering features. The algorithm makes full use of the available memory and requires a single scan of the data set. This is done by combining close clusters and rebuilding the CF tree, which guarantees that the computational complexity of BIRCH is linearly proportional to the number of objects. We believe BIRCH still has one other drawback: the algorithm may not work well when clusters are not "spherical", because it uses the concept of radius or diameter to control the boundary of a cluster.(1)
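To make the clustering-feature idea concrete, here is our simplified one-dimensional rendering (BIRCH's actual CF-tree maintenance is considerably more involved): a CF can be kept as the triple (N, LS, SS), that is, count, linear sum, and square sum, which is additive under cluster merging and sufficient to derive the centroid and the radius used to bound a cluster:

    import math

    # Simplified Clustering Feature for 1-D points: (count, linear sum, square sum).
    # Real BIRCH stores vector-valued LS/SS inside a balanced CF tree.
    def cf(points):
        return (len(points), sum(points), sum(p * p for p in points))

    def merge(cf1, cf2):
        # CFs are additive: merging two clusters just adds the triples.
        return tuple(a + b for a, b in zip(cf1, cf2))

    def centroid_and_radius(cf_):
        n, ls, ss = cf_
        centroid = ls / n
        radius = math.sqrt(max(ss / n - centroid ** 2, 0.0))  # RMS distance to centroid
        return centroid, radius

    a, b = cf([1.0, 2.0, 3.0]), cf([10.0, 12.0])
    print(centroid_and_radius(merge(a, b)))

The radius here is a single scalar, which is exactly why a boundary controlled this way implicitly favors spherical clusters.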
2.2.3 DBSCAN

Recently, [Est96] proposed a density-based clustering algorithm (DBSCAN) for large spatial databases. Two parameters, Eps and MinPts, are used in the algorithm to control the density of normal clusters. DBSCAN is able to separate "noise" from clusters of points, where "noise" consists of points in low-density regions. DBSCAN makes use of an R*-tree to achieve good performance. The authors illustrate that DBSCAN can be used to detect clusters of any shape and can outperform CLARANS by a large margin (up to several orders of magnitude). However, the complexity of DBSCAN is O(N log N). Moreover, DBSCAN requires a human participant to determine the global parameter Eps. (The parameter MinPts is fixed to 4 in their algorithm to reduce the computational complexity.) Before determining Eps, DBSCAN has to calculate the distance between a point and its kth (k = 4) nearest neighbor for all points. It then sorts all points according to these distances and plots the sorted k-dist graph, from which the user chooses Eps.
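The Eps-selection step just described can be sketched as follows; this is a minimal brute-force illustration of the k-dist computation, whereas [Est96] uses an R*-tree rather than this O(N^2) scan:

    import math

    def sorted_k_dist(points, k=4):
        """For each point, the distance to its k-th nearest neighbor (brute force),
        returned in descending order: the values of the sorted 'k-dist graph'."""
        dists = []
        for i, p in enumerate(points):
            d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
            dists.append(d[k - 1])
        return sorted(dists, reverse=True)

    # A user would look for the first "valley" in this plot to choose Eps.
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(sorted_k_dist(pts, k=4))

Even in this toy form, the cost of the pre-computation and the need for a human to read the plot are visible, which is the objection raised above.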
(1) We could not verify this since we do not have the BIRCH source code.
4 Query Types
An extension of SQL is used to describe such queries; the formal definition is given in [Wan97]. The following are several query examples.
Ex1. Select the maximal regions that have at least 100 houses per unit area and at least 70% of the house prices above $400K, with total area at least 100 units, at 90% confidence.

SELECT REGION
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND WITH CONFIDENCE 0.9
Ex2. Select the range of age of the houses in those maximal regions where there are at least 100 houses per unit area and at least 70% of the houses have price between $150K and $300K, with area at least 100 units, in California.

SELECT RANGE(age)
FROM house-map
WHERE DENSITY IN (100, ∞)
AND price RANGE (150000, 300000) WITH PERCENT (0.7, 1)
AND AREA (100, ∞)
AND LOCATION California
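To ground the semantics of these predicates, here is a small Python sketch of how the Ex2 conditions might be checked against the contents of one grid cell. STING actually evaluates such conditions against per-cell summary statistics and confidence intervals rather than raw points, so the raw-point scan and all field names here are illustrative assumptions:

    def cell_matches(cell, min_density=100, price_range=(150_000, 300_000),
                     min_fraction=0.7):
        """Hypothetical predicate check for one grid cell against Ex2-style
        conditions: DENSITY, price RANGE ... WITH PERCENT."""
        lo, hi = price_range
        in_range = sum(lo <= p <= hi for p in cell["prices"])
        return (cell["count"] / cell["area"] >= min_density
                and in_range / cell["count"] >= min_fraction)

    print(cell_matches({"count": 120, "area": 1.0,
                        "prices": [200_000] * 90 + [500_000] * 30}))  # True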
5 Algorithm

With the hierarchical structure of grid cells on hand, we can use a top-down approach to answer spatial data mining queries. For each query, we begin by examining cells on a high-level layer. Note that it is not necessary to start with the root; we may begin from an intermediate layer (but we do not pursue this minor variation further due to lack of space). Starting with the root, we calculate the likelihood that this cell is relevant to the query at some confidence level using the parameters of this cell (exactly how this is computed is described later). This likelihood can be defined as the proportion of objects in this cell that satisfy the query conditions. (If the distribution type is NONE, we estimate the likelihood using distribution-free techniques instead.) After we obtain the confidence interval, we label this cell as relevant or not relevant at the specified confidence level. When we finish examining the current layer, we proceed to the next lower level of cells and repeat the same process. The only difference is that instead of going through all cells, we only look at those cells that are children of the relevant cells of the previous layer. This procedure continues until we finish examining the lowest layer (the bottom layer).

In most cases, these relevant cells and their associated statistical information are enough to give a satisfactory answer to the query; we then find all the regions formed by relevant cells and return them. However, in rare cases (people may want a very accurate result for special purposes, e.g., military applications), this information is not enough, and we need to retrieve from the database the data that fall into the relevant cells and do some further processing.

After we have labeled all cells as relevant or not relevant, we can easily find all regions that satisfy the specified density by a breadth-first search. For each relevant cell, we examine the cells within a certain distance (how to choose this distance is discussed below) from the center of the current cell to see if the average density within this small area is greater than the density specified. If so, this area is marked and all relevant cells we just examined are put into a queue. Each time we take one cell from the queue, we repeat the same procedure, except that only those relevant cells that were not examined before are enqueued. When the queue is empty, we have identified one region.
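The two phases just described (top-down relevance labeling, then breadth-first region formation) can be condensed as follows. This is our illustrative sketch: a caller-supplied is_relevant predicate stands in for the paper's confidence-interval test, and a neighbors function stands in for "cells within distance d":

    from collections import deque
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Cell:
        # Minimal stand-in for a STING grid cell; real cells also carry
        # statistical parameters (count, mean, std, min, max, distribution).
        id: int
        children: tuple = ()

    def label_relevant(root, is_relevant):
        """Top-down pass: descend only into children of relevant cells and
        return the relevant bottom-layer cells."""
        frontier, leaves = [root], []
        while frontier:
            nxt = []
            for cell in frontier:
                if not is_relevant(cell):
                    continue            # irrelevant: prune the whole subtree
                if cell.children:
                    nxt.extend(cell.children)
                else:
                    leaves.append(cell)
            frontier = nxt
        return leaves

    def form_regions(relevant, neighbors):
        """Breadth-first grouping of relevant bottom cells into regions."""
        relevant, seen, regions = set(relevant), set(), []
        for start in relevant:
            if start in seen:
                continue
            seen.add(start)
            region, queue = [], deque([start])
            while queue:
                c = queue.popleft()
                region.append(c)
                for n in neighbors(c):  # cells within distance d of c's center
                    if n in relevant and n not in seen:
                        seen.add(n)
                        queue.append(n)
            regions.append(region)
        return regions

    leaf = [Cell(i) for i in range(4)]
    root = Cell(99, children=tuple(leaf))
    rel = label_relevant(root, is_relevant=lambda c: c.id != 3)
    print(form_regions(rel, neighbors=lambda c: [x for x in leaf
                                                 if abs(x.id - c.id) == 1]))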
The distance d used in this search is calculated from the specified density and the granularity of the bottom-layer cell: d = max(l, sqrt(f / (pi * c))), where l, c, and f are the side length of a bottom-layer cell, the specified density, and a small constant set by STING (it does not vary from one query to another), respectively. (Intuitively, a circle of radius sqrt(f / (pi * c)) is the smallest area that can hold f points at density c.) Usually, l is the dominant term in max(l, sqrt(f / (pi * c))). As a result, this distance reaches only the neighboring cells; in this case, we just need to examine neighboring cells and find regions formed by connected cells. Only when the granularity is very small can this distance cover a number of cells; in that case, we need to examine every cell within this distance instead of only the neighboring cells.

For example, if the objects in our database are houses and price is one of the attributes, then one kind of query could be: "Find those regions with area at least A where the number of houses per unit area is at least c and at least p% of the houses have price between a and b, with (1 - α) confidence", where a < b. Here, a could be -∞ and b could be +∞. This query can be written as
SELECT REGION
FROM house-map
WHERE DENSITY IN [c, ∞)
AND price RANGE [a, b] WITH PERCENT [p%, 1]
AND AREA [A, ∞)
AND WITH CONFIDENCE 1 - α
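For completeness, the distance rule itself is a one-liner. In this sketch the value of the internal constant f is an arbitrary assumption, since the text only says it is a small constant fixed by STING:

    import math

    def search_distance(l, c, f):
        """d = max(l, sqrt(f / (pi * c))): the radius examined around each
        relevant cell. l: bottom-cell side length, c: specified density,
        f: small STING-internal constant (value assumed here)."""
        return max(l, math.sqrt(f / (math.pi * c)))

    print(search_distance(l=1.0, c=100, f=20))   # coarse granularity: l dominates
    print(search_distance(l=0.01, c=100, f=20))  # fine granularity: d spans ~25 cells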
7 Quality of STING
Sufficient Condition:
8 Limiting Behavior of STING is Equivalent to DBSCAN
For each point, we examine the area around it (within distance d) to see if the number of points (including itself) within this area is at least the count implied by the specified density.
9 Performance
We ran several tests to evaluate the performance of STING. The following tests were run on a SPARC 10 machine running Solaris 5.5 with 192 MB of memory.
9.1 Performance Comparison of Two Distributions
To obtain performance metrics for STING, we implemented the house-price example discussed in Section 5; Ex1 is the query that we posed. We generated two data sets, each with 100,000 data points (houses). The hierarchical structure has seven layers in this test.

First, we generated a data set (DS1) such that the price is normally distributed in each cell (with similar means). The hierarchical structure generation time is 9.8 seconds. (Generation needs to be done only once per data set; all queries on the same data set can use the same structure, so we do not need to regenerate it for each query.) It takes STING 0. second to answer the query, given that the STING structure exists. The expected result and the result returned by STING are shown in Figures 3a and 3b, respectively. From these figures, we can see that STING's result is very close to the expected one.

In the second data set (DS2), the prices in each bottom-layer cell follow a normal distribution (with different means), but they do not follow any known distribution at higher levels. The hierarchical structure generation time is 9.7 seconds. It takes STING 0.22 second to answer the query. The expected result and the result returned by STING are shown in Figures 4a and 4b, respectively.
Figure 3a: Expected result with DS1
Figure 3b: STING's result on DS1
Figure 4a: Expected result with DS2
Figure 4b: STING's result on DS2
References

[Dev91] J. L. Devore. Probability and Statistics for Engineering and the Sciences. Brooks/Cole Publishing Company, Pacific Grove, California, 1991.
[Est95] M. Ester, H. P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. Proc. 4th Int. Symp. on Large Spatial Databases (SSD'95), pp. 67-82, Portland, Maine, USA, August 1995.

[Est96] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD-96), pp. 226-231, Portland, OR, USA, August 1996.

[Fay96a] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery in databases. AI Magazine, 17(3):37-54, Fall 1996.

[Fay96b] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, Menlo Park, CA, 1996.

[Fot94] S. Fotheringham and P. Rogerson. Spatial Analysis and GIS. Taylor and Francis, 1994.

[Kno96] E. M. Knorr and R. Ng. Extraction of spatial proximity patterns by concept generalization. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD-96), pp. 347-350, Portland, OR, USA, August 1996.

[Kop96a] K. Koperski, J. Adhikary, and J. Han. Spatial data mining: progress and challenges. SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'96), Montreal, Canada, June 1996.

[Kop96b] K. Koperski and J. Han. Data mining methods for the analysis of large geographic databases. Proc. 10th Annual Conf. on GIS, Vancouver, Canada, March 1996.

[Lu93] W. Lu, J. Han, and B. C. Ooi. Discovery of general knowledge in large spatial databases. Proc. Far East Workshop on Geographic Information Systems, pp. 275-289, Singapore, June 1993.

[Ng94] R. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. Proc. 1994 Int. Conf. on Very Large Data Bases, pp. 144-155, Santiago, Chile, September 1994.

[Sam90] H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1990.

[Sto93] M. Stonebraker, J. Frew, K. Gardels, and J. Meredith. The SEQUOIA 2000 storage benchmark. Proc. 1993 ACM-SIGMOD Int. Conf. on Management of Data, pp. 2-11, Washington, DC, 1993.

[Wan97] W. Wang, J. Yang, and R. R. Muntz. STING: A statistical information grid approach to spatial data mining. Technical Report No. 970006, Computer Science Department, UCLA, February 1997.

[Zha96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. Proc. 1996 ACM-SIGMOD Int. Conf. on Management of Data, pp. 103-114, Montreal, Canada, June 1996.