Introduction to Geospatial Analytics
Geospatial analytics deals with data used to locate anything on the globe. From an Uber driver to a person finding their way around a new neighbourhood, everybody uses this data in some way or other. The technology behind it involves GPS (Global Positioning System), GIS (geographic information systems), and RS (remote sensing). In this blog, we explore the topic in depth. We start with the basics and then dive into the details.
Why is it important?
Geospatial data is used daily for a wide range of purposes. From an ordinary person's commute to the guidance data in the missiles of a country's defence organization, everything depends on it. The data comes from many sources: every phone with an active internet connection contributes to it in some way, and satellites collect it daily. Because it is so useful in everyday life, it deserves significant attention. It helps with responding to natural hazards and disasters, and with studying global climate change, wildlife, natural resources and more. It also underpins satellite imagery, whether for tactical purposes or for weather forecasting. Many tech giants, such as Uber, use it on a daily basis to make everyday life easier. To stand out in the market, a company has to extract this data and use it efficiently.
How to retrieve Geospatial Data?
Various methods can do this, but Presto and Hive are mainly used to extract and reshape data that runs to hundreds of petabytes, so that it can be used efficiently to make the lives of billions easier. This data is vital because it touches the vast majority of people and is used every second. GIS is the part of geospatial technology that helps collect, store, manipulate, analyse and present spatial data. Whatever the situation at the local, regional or national level, GIS comes into play whenever the question is "where". It would not be effective without visualization.
Geospatial Analytics Using Presto: Presto is an open-source distributed SQL query engine, used to run queries over data of any size or type. It runs on Hadoop. It supports many non-relational sources as well as Teradata. It can query data where it lives, without moving the data to a separate system. Query execution runs in parallel over a purely memory-based architecture, with most results returning within seconds. Many tech giants use it. It is a popular choice for interactive queries over data ranging into hundreds of petabytes.
Geospatial Analytics Using Hive: Hive is a data warehouse infrastructure tool for processing structured data, developed on top of the Hadoop Distributed File System (HDFS). It sits on top of Hadoop to summarize Big Data and makes querying and analyzing that data accessible.
What is the architecture of Hive?
Hive is an ETL and data warehousing tool built on top of Hadoop. It helps perform operations such as:
- Analysis of large data sets
- Data encapsulation
- Ad-hoc queries
What are its major components?
- Client
- Services
- Processing & Resource Management
- Distributed Storage
Hive Clients: Hive supports applications written in languages such as Java, Python and C++ through its Thrift, JDBC and ODBC drivers, so it is easy to write a Hive client application in the desired language. Its clients are categorized into three types:
- Thrift Clients: Apache Hive's server is based on Thrift, so it can serve requests from any language that supports Thrift.
- JDBC Clients: Java applications connect to Hive by using its JDBC driver.
- ODBC Clients: The ODBC driver enables applications that support the ODBC protocol to connect to Hive. It uses Thrift to communicate with the Hive server.
Hive Services: Hive provides various services, such as:
- CLI (Command Line Interface): The default shell provided by Hive, used to execute Hive queries and commands directly.
- Web Interface: A web-based GUI provided by Hive for executing queries and commands.
- Server: It is built on Apache Thrift and is also known as the Thrift Server. It allows different clients to submit requests and retrieve the final result.
- Driver: It is responsible for receiving the queries submitted by clients. It compiles, optimizes and executes the queries.
What is the architecture of Presto?
Presto has two central parts: the Coordinator and the Worker. It is an open-source distributed system that can run on multiple machines, and its distributed SQL query engine was built for fast analytic queries. A deployment includes one Coordinator and any number of Workers.
- Coordinator: Accepts submitted queries and manages parsing, planning and scheduling of query processing.
- Worker: Processes the queries. Adding more Workers gives faster query processing.
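To make the Coordinator and Worker roles concrete, here is a minimal sketch in plain Python. It is not Presto code: the function names and the sample distances are made up for illustration, and local threads stand in for Worker machines. The coordinator function divides the input into splits, hands each split to a worker, and merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_splits(rows, num_workers):
    """Coordinator step: divide the input into one split per worker."""
    return [rows[i::num_workers] for i in range(num_workers)]


def worker_task(split):
    """Worker step: compute a partial aggregate (sum and count) over one split."""
    return sum(split), len(split)


def run_query(rows, num_workers=4):
    """Coordinator: schedule the splits on workers, then merge the partial results."""
    splits = plan_splits(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(worker_task, splits))
    total = sum(partial[0] for partial in partials)
    count = sum(partial[1] for partial in partials)
    return total / count if count else None


# Hypothetical trip distances in kilometres (illustrative data only)
trip_distances_km = [2.5, 7.0, 1.2, 4.3, 9.9, 3.1]
average_km = run_query(trip_distances_km, num_workers=3)
```

Each worker only sees its own split, which is why adding workers can shorten the wall-clock time of a query even though the total amount of work stays the same.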
What are its key components?
The key components of Presto are:
- Coordinator: It is the brain of any installation. It manages all the Worker nodes for all query-related work, collects results from the Workers and returns the final output to the client. It communicates with Workers and clients via REST.
- Worker: It executes tasks and processes the data. Worker nodes exchange data with each other and receive their tasks from the Coordinator.
- Catalogue: It contains information about the data, such as where the data and its schema are located and which data source they belong to.
- Tables and Schemas: These mean the same as in a relational database. A table is a set of rows organized into named columns, and a schema is what you use to hold your tables.
- Connector: It is used to help Presto integrate with external data sources.
- Stage: To execute a query, Presto breaks it up into steps called stages.
- Tasks: Stages are implemented as a series of tasks that may be distributed across Workers.
- Drivers and Operators: Each task contains one or more parallel drivers, and each driver is a sequence of operators held in memory. An operator consumes, transforms and produces data.
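The idea that an operator consumes, transforms and produces data, and that a driver chains several operators together, can be sketched with Python generators. This is only an illustration of the concept, not how Presto is implemented. The operator names, the `run_driver` function and the sample trips are all assumptions made for the example.

```python
def scan(rows):
    """Source operator: produces rows one at a time."""
    for row in rows:
        yield row


def filter_op(rows, predicate):
    """Filter operator: consumes rows and passes on only those matching the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def project(rows, columns):
    """Projection operator: keeps only the requested columns of each row."""
    for row in rows:
        yield {column: row[column] for column in columns}


def run_driver(rows):
    """A driver chains operators; each one consumes the previous one's output."""
    pipeline = scan(rows)
    pipeline = filter_op(pipeline, lambda row: row["city"] == "Delhi")
    pipeline = project(pipeline, ["trip_id", "distance_km"])
    return list(pipeline)


# Hypothetical trip records (illustrative data only)
trips = [
    {"trip_id": 1, "city": "Delhi", "distance_km": 2.5},
    {"trip_id": 2, "city": "Mumbai", "distance_km": 7.0},
    {"trip_id": 3, "city": "Delhi", "distance_km": 4.3},
]
result = run_driver(trips)
```

Because every operator is lazy, rows flow through the whole chain one at a time and nothing is written to disk between steps, which mirrors the in-memory pipeline described above.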
What are the deployment strategies?
The deployment strategies for Hive are listed below:
- AWS: Amazon EMR is used to deploy the Hive metastore. Users can choose from three configurations that Amazon offers, namely Embedded, Local or Remote. There are two options for creating an external Hive metastore for EMR: using the AWS Glue Data Catalog, or using Amazon RDS / Amazon Aurora.
- Cloud Dataproc: Apache Hive on Cloud Dataproc provides an efficient and flexible setup by storing Hive data in Cloud Storage and hosting the Hive metastore in a MySQL database on Cloud SQL. It offers advantages such as flexibility and agility, letting users tailor the cluster configuration to specific workloads and scale the cluster as needed. It also helps save cost.
The deployment strategies for Presto are listed below:
- AWS: Amazon EMR lets you quickly spin up a managed EMR cluster with the Presto query engine and run interactive analysis on data stored in Amazon S3. A Presto deployment can thus be built in the cloud on Amazon Web Services, and both Amazon EMR and Amazon Athena offer ways to build and run it.
- Cloud Dataproc: A Dataproc cluster that includes the Presto component can be set up easily.
What are the various ways to optimise?
The various ways to optimise are described below:
Hive
- Tez Execution Engine: Apache Tez is an application framework built on Hadoop YARN that Hive can use as its execution engine.
- Usage of Suitable File Format: Choosing a file format that suits the data can drastically improve query performance. The ORC file format is well suited for this.
- Partitioning: By partitioning the entries into different datasets, only the required data is read when the query executes, which makes execution more efficient.
- Bucketing: Bucketing divides datasets into more manageable parts. Users can also set the number of these parts, or buckets.
- Vectorization: Vectorized query execution improves performance by processing batches of 1024 rows at once instead of a single row at a time.
- Cost-Based Optimization (CBO): It performs optimization based on query cost. To use CBO, the relevant parameters must be set at the beginning of the query.
- Indexing: Indexing speeds up query execution by reducing the amount of data that has to be scanned.
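The ideas behind partitioning, bucketing and vectorization from the list above can be sketched in plain Python. This is a conceptual illustration only, not Hive code: the helper names and sample records are hypothetical, and CRC32 is used as a stand-in for whatever hash function the engine actually applies when bucketing.

```python
import zlib
from collections import defaultdict


def partition_by(rows, key):
    """Group rows by a partition column, like one directory per column value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def bucket_of(value, num_buckets):
    """Assign a value to a bucket with a stable hash (CRC32 is only a stand-in)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def batches(rows, batch_size=1024):
    """Yield rows in fixed-size batches, the way vectorized execution groups them."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# Hypothetical trip records (illustrative data only)
trips = [
    {"trip_id": 1, "city": "Delhi", "distance_km": 2.5},
    {"trip_id": 2, "city": "Mumbai", "distance_km": 7.0},
    {"trip_id": 3, "city": "Delhi", "distance_km": 4.3},
]

partitions = partition_by(trips, "city")
# Partition pruning: a query on Delhi never touches the Mumbai partition.
delhi_trips = partitions.get("Delhi", [])
# Aggregate batch by batch rather than through one big row-at-a-time loop.
delhi_total_km = sum(
    sum(row["distance_km"] for row in batch) for batch in batches(delhi_trips)
)
# Bucketing: each trip_id always lands in the same one of 4 buckets.
trip_buckets = {row["trip_id"]: bucket_of(row["trip_id"], 4) for row in trips}
```

Partitioning pays off when queries filter on the partition column, while bucketing helps most when rows need to be spread evenly or joined on the bucketed column.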
Presto
- File format: The ORC file format is well suited for optimizing query execution in Presto.
- Presto can handle joins automatically if the feature is enabled.
- The dynamic filtering feature optimizes JOIN queries.
- Presto has added a new connector configuration to skip corrupt records in input formats other than ORC, Parquet and RCFile.
- Set task.max-worker-threads in config.properties to the number of CPU cores multiplied by the hyper-threads per core on a worker node.
- Splits can be used to execute queries in Presto efficiently.
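The arithmetic behind the task.max-worker-threads setting from the list above is simple enough to sketch. The helper names below are made up for this example, and the core counts are assumed values for a hypothetical worker node. Check your own hardware and the Presto documentation before changing a real config.properties.

```python
def max_worker_threads(cpu_cores, threads_per_core):
    """Worker thread count: CPU cores multiplied by hyper-threads per core."""
    return cpu_cores * threads_per_core


def render_config_line(cpu_cores, threads_per_core):
    """Build the line that would go into config.properties on a worker node."""
    threads = max_worker_threads(cpu_cores, threads_per_core)
    return f"task.max-worker-threads={threads}"


# Assumed hardware: 16 physical cores with 2 hyper-threads each
line = render_config_line(cpu_cores=16, threads_per_core=2)
```

For the assumed node, `line` holds `task.max-worker-threads=32`.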
What are the advantages?
The advantages of Hive and Presto are:
Hive
- It is a stable query engine and has a large and active community.
- Its queries are similar to SQL, which makes them easy for RDBMS professionals to understand.
- It supports the ORC, TextFile, RCFile, Avro and Parquet file formats.
Presto
- It supports file formats such as ORC, Parquet and RCFile, eliminating the need for data transformation.
- It works well with Amazon S3 storage and can query data in mere seconds, even when the data runs to petabytes.
- It also has an active community.
Geospatial Analytics Using Presto and Hive
Modelling geospatial data involves quite a few complexities. Well-Known Text (WKT) is used to model different locations on the map, with types such as points and polygons. In Hive, a spatial library provides spatial processing through User-Defined Functions and SerDes. With this library enabled, queries can be written in Hive's query language (HQL), which is close to SQL. You can therefore avoid writing complex MapReduce algorithms and stick to a more familiar workflow. Presto's geospatial plugin is running in production at Uber, where more than 90% of all geospatial traffic completes within 5 minutes. Compared with brute-force Hive MapReduce execution, Uber's geospatial plugin is more than 50X faster, leading to greater efficiency.
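To show what modelling locations with WKT points and polygons means in practice, here is a small self-contained Python sketch. It parses a minimal subset of WKT (only `POINT` and single-ring `POLYGON`) and then checks whether a point falls inside a polygon with the classic ray-casting test. This is a brute-force illustration, not the code used by Hive, Presto or Uber, and the coordinates are invented. In a SQL engine the same question would usually be asked through a spatial function such as `ST_Contains`.

```python
import re


def parse_wkt(wkt):
    """Parse a minimal WKT subset: POINT (x y) and single-ring POLYGON ((x y, ...))."""
    match = re.fullmatch(r"\s*(POINT|POLYGON)\s*\((.*)\)\s*", wkt, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unsupported WKT: {wkt!r}")
    kind, body = match.group(1).upper(), match.group(2).strip()
    if kind == "POINT":
        x, y = body.split()
        return kind, (float(x), float(y))
    ring = body.strip("()")
    coords = [tuple(float(v) for v in pair.split()) for pair in ring.split(",")]
    return kind, coords


def point_in_polygon(point, ring):
    """Ray casting: count how many polygon edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):  # this edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical service area and pickup locations (illustrative coordinates only)
_, service_area = parse_wkt("POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0))")
_, pickup_inside = parse_wkt("POINT (2 2)")
_, pickup_outside = parse_wkt("POINT (5 2)")

is_inside = point_in_polygon(pickup_inside, service_area)
is_outside = point_in_polygon(pickup_outside, service_area)
```

Running this check for every point against every polygon is exactly the kind of brute-force work that becomes expensive at scale, which is why dedicated geospatial functions and plugins matter.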
Summing up
Presto has the edge over Hive because it can be used to process unstructured data too, and query processing in Presto is faster than in Hive. Data is collected in huge amounts every day, and it needs to be extracted efficiently and judiciously so that the software depending on it works well.