Question-27: TerramEarth's 20 million vehicles are dispersed around the globe. Each vehicle's telemetry data is stored in a Google Cloud Storage (GCS) regional bucket determined by the vehicle's location (US, Europe, or Asia). The Chief Technology Officer has asked you to run a report on the raw telemetry data to understand why vehicles are breaking down after 100,000 miles. You need to run this job against all of the data. What is the most efficient way to do this?
A. Move all of the data into one location, then launch a Cloud Dataproc cluster to run the job.
B. Move all of the data into one region, then run the job on a Google Cloud Dataproc cluster.
C. Launch a cluster in each region to preprocess and compress the raw data, then move the data into a multi-regional bucket and use a Dataproc cluster to run the job.
D. Launch a cluster in each region to preprocess and compress the raw data, then move the data into a regional bucket and use a Cloud Dataproc cluster to run the job.
Correct Answer: D
Explanation: Options A and B move all of the raw data, but the analysis targets breakdowns after 100,000 miles, so there is little value in transferring data for vehicles well below that mileage; moving everything is a waste of time and money. One thing is certain: copying data between continents costs money, so compressing the data before copying it to another region or continent makes sense. Preprocessing also makes sense, because we want to work with smaller, relevant chunks of data first (again, the 100,000-mile threshold).

What kind of target bucket should hold the consolidated data, multi-regional or regional? Multi-regional storage offers higher availability and lower latency at a slightly higher cost, but the question requires neither of those features, and a multi-regional bucket would add cost when transferring the data to a centralized location. A regional bucket is therefore the better choice, since lower cost is what matters here.

So the answer is D: launch a cluster in each region to preprocess and compress the raw data, then move the data into a regional bucket and finish the job with a Cloud Dataproc cluster.

Egress rates are the deciding factor. Traffic within a region is free, so it makes sense to consolidate the reduced data from all continents into one region for processing and performance. Cross-region egress within a continent costs about $0.01 per GB, while inter-continent egress costs about $0.12 per GB. If we took Option B and moved all of the raw data into one region, the monthly transfer alone would cost roughly 900 TB per day (for all 20 million vehicles) x 30 days x $0.12 per GB = $3.24 million, just for data transfer. It therefore clearly pays to preprocess and compress the data in each region first and then move the results into one region for the final analysis, which can cut egress costs by a factor of 10 to 100. Processing time matters as well: running the preprocessing in parallel across all regions accelerates the overall analysis, and faster results mean faster in-field improvements.
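As a rough sanity check on the figures above, here is a minimal Python sketch of the egress-cost arithmetic. The 900 TB/day volume and the $0.12/GB inter-continent rate come from the explanation; the `REDUCTION_FACTOR` achieved by per-region preprocessing and compression is a hypothetical stand-in for the "10 to 100 times" savings mentioned, not a measured value.

```python
# Rough monthly egress-cost estimate for the two approaches discussed above.
# Volumes and rates are taken from the explanation; the reduction factor
# from per-region preprocessing/compression is an assumed illustrative value.

GB_PER_TB = 1_000
DAILY_RAW_TB = 900            # raw telemetry produced per day across all regions
DAYS_PER_MONTH = 30
INTER_CONTINENT_RATE = 0.12   # $ per GB moved between continents


def monthly_egress_cost(daily_tb: float, rate_per_gb: float,
                        days: int = DAYS_PER_MONTH) -> float:
    """Cost of moving `daily_tb` TB per day for `days` days at `rate_per_gb` $/GB."""
    return daily_tb * GB_PER_TB * days * rate_per_gb


# Option B: ship all raw data to one region before processing.
raw_cost = monthly_egress_cost(DAILY_RAW_TB, INTER_CONTINENT_RATE)

# Option D: preprocess and compress in each region first, then ship the output.
# A 10x reduction is a hypothetical, conservative end of the 10-100x range.
REDUCTION_FACTOR = 10
reduced_cost = monthly_egress_cost(DAILY_RAW_TB / REDUCTION_FACTOR,
                                   INTER_CONTINENT_RATE)

print(f"Move raw data (Option B):         ${raw_cost:,.0f} / month")      # ~$3,240,000
print(f"Preprocess per region (Option D): ${reduced_cost:,.0f} / month")  # ~$324,000
```

With these assumptions, the raw-transfer estimate reproduces the $3.24M/month figure, and even a modest 10x reduction brings the transfer bill down by an order of magnitude before any processing-time benefits are counted.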