Cloud-native Apache Hadoop & Apache Spark
Cloud Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way. Operations that used to take hours or days now take seconds or minutes, and you pay only for the resources you use, with per-second billing. Cloud Dataproc also integrates easily with other Google Cloud Platform (GCP) services, giving you a powerful, complete platform for data processing, analytics, and machine learning.
Create Cloud Dataproc clusters quickly and resize them at any time—from three to hundreds of nodes—so you don’t have to worry about your data pipelines outgrowing your clusters. With each cluster action taking less than 90 seconds on average, you have more time to focus on insights, with less time lost to infrastructure.
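Creating and resizing a cluster comes down to two gcloud commands. A minimal sketch follows; the cluster name, region, and worker counts are illustrative, and your project must already have the Dataproc API enabled:

```shell
# Create a small cluster (name and region are illustrative).
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --num-workers=2

# Resize the same cluster to five workers without recreating it.
gcloud dataproc clusters update example-cluster \
    --region=us-central1 \
    --num-workers=5
```

Because resizing is an in-place update, running pipelines keep their cluster endpoint while capacity changes underneath them.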
Following Google Cloud Platform pricing principles, Cloud Dataproc has a low cost and an easy-to-understand pricing structure based on actual use, measured by the second. Cloud Dataproc clusters can also include lower-cost preemptible instances, giving you powerful clusters at an even lower total cost.
The Spark and Hadoop ecosystem provides tools, libraries, and documentation that you can leverage with Cloud Dataproc. Because Cloud Dataproc offers frequently updated, native versions of Spark, Hadoop, Pig, and Hive, you can get started without needing to learn new tools or APIs, and you can move existing projects or ETL pipelines without redevelopment.
CLOUD DATAPROC FEATURES
Google Cloud Dataproc is a managed Apache Spark and Apache Hadoop service that is fast, easy to use, and low cost.
Managed deployment, logging, and monitoring let you focus on your data, not on your cluster. Your clusters will be stable, scalable, and speedy.
Clusters can be created and scaled quickly with a variety of virtual machine types, disk sizes, number of nodes, and networking options.
Built-in integration with Cloud Storage, BigQuery, Bigtable, Stackdriver Logging, and Stackdriver Monitoring gives you a complete and robust data platform.
Image versioning allows you to switch between different versions of Apache Spark, Apache Hadoop, and other tools.
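Image versions are selected at cluster creation time with the `--image-version` flag. A sketch, assuming a hypothetical cluster name and an image version that is available in your region:

```shell
# Pin the cluster to a specific Cloud Dataproc image version,
# which fixes the Spark and Hadoop versions installed on it.
gcloud dataproc clusters create versioned-cluster \
    --region=us-central1 \
    --image-version=1.3
```

Pinning a version keeps development and production clusters on identical software even as newer images are released.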
Run clusters with multiple master nodes and set jobs to restart on failure to ensure your clusters and jobs are highly available.
Multiple ways to manage a cluster, including an easy-to-use Web UI, the Google Cloud SDK, RESTful APIs, and SSH access.
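With the Cloud SDK, submitting work to a cluster is a single command. The sketch below uses the SparkPi example that ships with the cluster's Spark installation; the cluster name and region are assumptions:

```shell
# Submit the bundled SparkPi example to an existing cluster.
# Arguments after "--" are passed to the Spark application itself.
gcloud dataproc jobs submit spark \
    --cluster=example-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
```

The same job submission can be made through the Web UI or the REST API; the CLI form is convenient for scripting and CI pipelines.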
Run initialization actions to install or customize the settings and libraries you need when your cluster is created.
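Initialization actions are scripts, typically staged in Cloud Storage, that Cloud Dataproc runs on each node as the cluster starts. A minimal sketch; the bucket path and script name are hypothetical:

```shell
# Run a setup script from Cloud Storage on every node at creation time.
# gs://my-bucket/install-libs.sh is an illustrative path, not a real one.
gcloud dataproc clusters create init-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-bucket/install-libs.sh
```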
Cloud Dataproc automatically configures hardware and software on clusters for you while also allowing for manual control.
Clusters can use custom machine types and preemptible virtual machines so they are the perfect size for your needs.
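Machine types can be chosen per role, and preemptible workers are added alongside the standard ones. A sketch, assuming era-appropriate n1 machine types and an illustrative cluster name:

```shell
# Size each role independently and add lower-cost preemptible workers.
gcloud dataproc clusters create sized-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-highmem-8 \
    --num-workers=2 \
    --num-preemptible-workers=4
```

Preemptible workers process data but do not store HDFS blocks, so losing one to preemption costs capacity, not data.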
Cloud Dataflow vs. Cloud Dataproc: Which should you use?
Cloud Dataproc and Cloud Dataflow can both be used for data processing, and there’s overlap in their batch and streaming capabilities. How do you decide which product is a better fit for your environment?