Any suggestions about building an index using Apache Spark? #9
Comments
Sorry for the delay, I just noticed this. I am actually working on building multiple types of indexes from HDFS data using Spark. Currently, I have a simple implementation of a Grid index. I am looking to build an R-tree as well, but I have not figured it all out yet. What dataset are you working with?
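For reference, here is a minimal sketch of how such a Grid index build over HDFS point data might look as a standalone Spark app in Scala. The input path, CSV layout (`id,lon,lat`), and cell size are assumptions for illustration, not the actual implementation:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GridIndexSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GridIndexSketch"))

    val cellSize = 1.0 // hypothetical: degrees per grid cell

    // Input lines assumed to be "id,lon,lat" (hypothetical layout)
    val points = sc.textFile("hdfs:///data/points.csv").map { line =>
      val Array(id, lon, lat) = line.split(",")
      (id, lon.toDouble, lat.toDouble)
    }

    // Key each point by the grid cell that contains it
    val cells = points.map { case (id, lon, lat) =>
      val cell = (math.floor(lon / cellSize).toInt, math.floor(lat / cellSize).toInt)
      (cell, id)
    }

    // Group points per cell and persist the cell -> ids mapping to HDFS
    cells.groupByKey()
      .map { case ((cx, cy), ids) => s"$cx,$cy\t${ids.mkString(",")}" }
      .saveAsTextFile("hdfs:///index/grid")

    sc.stop()
  }
}
```

Because each point's cell is computed independently, the whole build costs a single shuffle (the `groupByKey`), which is what makes a grid the easiest index to build this way.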
I have sampled data from HDFS and built a quadtree from the sample. Meanwhile, what is the relationship to building SP-GIST via MapReduce?
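In case it helps to compare notes, here is a rough sketch of that sampling approach: collect a small sample to the driver and recursively split the space into quadrants to derive quadtree leaf boundaries. The world extent, sample fraction, capacity, and depth limit are made-up values:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object QuadtreeSampleSketch {
  case class Box(xmin: Double, ymin: Double, xmax: Double, ymax: Double)

  // Recursively split a box into quadrants until each leaf holds few
  // enough sample points; depth limit guards against duplicate-heavy data.
  def split(box: Box, pts: Seq[(Double, Double)], capacity: Int, depth: Int = 0): Seq[Box] =
    if (pts.length <= capacity || depth >= 16) Seq(box)
    else {
      val cx = (box.xmin + box.xmax) / 2
      val cy = (box.ymin + box.ymax) / 2
      val quads = Seq(
        Box(box.xmin, box.ymin, cx, cy), Box(cx, box.ymin, box.xmax, cy),
        Box(box.xmin, cy, cx, box.ymax), Box(cx, cy, box.xmax, box.ymax))
      quads.flatMap { q =>
        // Simplified boundary handling: points on a max edge fall outside
        val contained = pts.filter { case (x, y) =>
          x >= q.xmin && x < q.xmax && y >= q.ymin && y < q.ymax }
        split(q, contained, capacity, depth + 1)
      }
    }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("QuadtreeSampleSketch"))
    val points = sc.textFile("hdfs:///data/points.csv").map { line =>
      val Array(_, lon, lat) = line.split(","); (lon.toDouble, lat.toDouble)
    }
    // Sample ~1% of the data to approximate the spatial distribution
    val sample = points.sample(withReplacement = false, fraction = 0.01).collect()
    val leaves = split(Box(-180, -90, 180, 90), sample, capacity = 1000)
    println(s"quadtree leaves: ${leaves.length}")
    sc.stop()
  }
}
```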
Hi Mingjie, SP-GIST is a framework for building space-partitioning trees, and I believe the other students are trying to realize the framework and make it extensible. To my understanding, the main difference is that they are focusing on the framework and how it can be used to realize multiple space-partitioning search trees, while I focus on standalone Spark applications that will build spatial indexes for my datasets. I need to build an index for 60GB to ~500GB of data, and maybe more. For example, you cannot use SP-GIST to realize an R-tree index or its variants, because they are not space-partitioning search trees. The same limitation applies to the Grid index, because a grid is not actually a search tree. I will definitely seek your advice when I get to the quadtree implementation if I need to.
I know that there is a project for spatial data based on Hadoop's MapReduce, http://spatialhadoop.cs.umn.edu/, and I would like to move it to Apache Spark. Supposing the input file in HDFS is a point or rectangle dataset, I now want to build an R-tree index over it.
Have you tried using the current implementation of the map/reduce functions that exist in SpatialHadoop, but calling them from Spark? It depends on how it is written (I have not looked at it myself), but you may need to modify the code a little bit. If you just need to build an R-tree index, this is one way to go. Otherwise, you will need to come up with your own approach (a Spark app) for building the R-tree index using Spark. I am also using datasets from the SpatialHadoop project, but I will probably have my own approach. I will let you know if I get something up. BTW, @ChenZhongPu, what is your affiliation? Regards,
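As a sketch of that first option: Spark can read through any Hadoop `InputFormat` via `newAPIHadoopFile`, so SpatialHadoop's input formats could in principle be dropped in where plain `TextInputFormat` is used below. The path is a placeholder and I have not tried this against SpatialHadoop's actual classes:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

object SpatialHadoopFromSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SpatialHadoopFromSpark"))

    // Spark reads through any Hadoop InputFormat; a SpatialHadoop record
    // reader could be plugged in here instead of TextInputFormat.
    val records = sc.newAPIHadoopFile(
      "hdfs:///data/points",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    // From here, the Hadoop job's map/reduce logic can be re-expressed
    // as map/reduceByKey calls on the RDD.
    println(records.map(_._2.toString).count())
    sc.stop()
  }
}
```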
I have found a nice Java library, http://www.vividsolutions.com/jts/javadoc/index.html, for building an R-tree index using STR (Sort-Tile-Recursive bulk loading). For a very big dataset, is it recommended to parallelize it using Apache Spark? And I also want to save the index into a file in HDFS for later usage, since I am not very familiar with it. See more at http://stackoverflow.com/questions/29113702/strtree-in-jts-topology-suite-bulk-load-data-and-build-index. Last, I am just a CS college student in China, @qadahtm.
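One way that could look, sketched under assumptions: build one JTS `STRtree` per Spark partition and serialize each tree to HDFS (`STRtree` is `java.io.Serializable`). The paths and CSV layout are placeholders, and it assumes executors reach HDFS through the default Hadoop configuration:

```scala
import com.vividsolutions.jts.geom.{Coordinate, GeometryFactory}
import com.vividsolutions.jts.index.strtree.STRtree
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.io.ObjectOutputStream

object STRtreeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("STRtreeSketch"))

    // One STRtree per partition: each partition bulk-loads its own points,
    // so the trees are built in parallel across the cluster.
    sc.textFile("hdfs:///data/points.csv")
      .mapPartitionsWithIndex { (part, lines) =>
        val factory = new GeometryFactory()
        val tree = new STRtree()
        lines.foreach { line =>
          val Array(id, lon, lat) = line.split(",")
          val p = factory.createPoint(new Coordinate(lon.toDouble, lat.toDouble))
          tree.insert(p.getEnvelopeInternal, id)
        }
        tree.build() // finalize the STR packing; the tree is read-only afterwards

        // STRtree is Serializable, so write it out to a per-partition HDFS file
        val fs = FileSystem.get(new Configuration())
        val out = new ObjectOutputStream(fs.create(new Path(s"hdfs:///index/strtree-$part")))
        out.writeObject(tree)
        out.close()
        Iterator.single(part)
      }
      .count() // force evaluation of the lazy RDD
    sc.stop()
  }
}
```

Querying then means deserializing each partition's tree and probing it with a query envelope; whether one tree per partition or a single merged tree on the driver works better would depend on the query workload.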
@ChenZhongPu You may want to check my other repository on indexing using Spark, https://github.com/qadahtm/SpatialSparkIndexer/blob/master/src/main/scala/qadahtm/OSMGridIndex.scala, for some hints. I only have the code for the Grid-based index for now. This is still a work in progress, but I hope it can help you. Regards,
@qadahtm, as the question says, I want to build an R-tree index for spatial data using Apache Spark, with both the input and the output in HDFS.