Question-1: Data that is flawed in some way is referred to as dirty data. For example, dirty data may contain duplicates, or it may be out of date, insecure, incomplete, incorrect, or inconsistent. Misspelled addresses, fields without values, obsolete phone numbers, and duplicate customer records are typical examples. Please refer to the TerramEarth case study for more background on this scenario. A new architecture has been implemented, and with it comes the ability to write all incoming data to BigQuery. You have noticed that the data is dirty and want to maintain its quality on a regular basis using automated processes while also controlling costs. What should you do?
A. Create a streaming Cloud Dataflow job that receives the data via the ingestion process, and clean the data in the Cloud Dataflow pipeline.
B. Create a Cloud Function that reads data from BigQuery and cleans it, and trigger the Cloud Function from a Compute Engine instance.
C. Create a SQL statement on the data in BigQuery and save it as a view. Run the view every day and save the results to a new table.
D. Use Cloud Dataprep and configure the BigQuery tables as the source. Schedule a daily job to clean the data.
Correct Answer: D (Option 4)

Explanation: The data needs to be cleaned, and Cloud Dataprep has the capabilities to clean dirty data: use Cloud Dataprep, configure the BigQuery tables as the source, and schedule daily jobs to clean the data (Option D).

Cloud Dataprep is designed for fast exploration and anomaly detection, and it supports scheduling, which is exactly what the question asks for: "Schedule the execution of recipes in your flows on a recurring or as-needed basis. When the scheduled job successfully executes, you can collect the wrangled output in the specified output location, where it is available in the published form you specify." Dataprep also integrates naturally with BigQuery (as well as Cloud Storage and direct file upload), and under the hood it runs Dataflow jobs; a minimal sketch of that kind of pipeline is shown below.

A common counter-argument favours Option A: Dataprep is a GUI-driven tool often used to analyse ad-hoc data dumped on Cloud Storage, and Option D first loads the dirty data into BigQuery and only then runs Dataprep jobs to clean it into a separate target, so the data is effectively scanned and loaded twice; identifying which rows are already clean and which are dirty also becomes harder on a daily basis once the data volume grows significantly, whereas a streaming Dataflow job could cleanse the data while loading it. Both Option A and Option D would solve the problem, but Option A is more expensive, and the requirement is a daily clean-up, not real-time processing.

Options A and B are about real-time processing, which is not needed (the requirement is a daily job). In addition, detecting dirty data may require analysing several adjacent rows, so real-time processing may not even be able to solve the problem. Option C is just SQL processing, which is likely not enough to fix data problems algorithmically; it is also not automated and requires a new table for the cleaned data.

Pricing-wise, Options A and D should be broadly comparable, since Dataprep uses Dataflow workers underneath: Dataprep batch workers cost about $0.056 per vCPU-hour, while Dataflow streaming workers cost about $0.069 per vCPU-hour (typically on 4-vCPU streaming workers). In general, batch processing is more cost-effective because it eliminates data-waiting cycles: batch vCPUs are fully busy with your work, whereas a streaming worker can sit idle and you still pay for its time. A rough worked comparison follows the pipeline sketch below.
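For illustration, here is a minimal sketch of the kind of batch cleaning pipeline that Dataprep generates on Dataflow under the hood. The project ID, bucket, table names, the "phone" field, and the cleaning rules are hypothetical placeholders; the real transformation is whatever recipe you build in the Dataprep UI.

```python
# Minimal Apache Beam (Python) batch job: read a BigQuery table, apply simple
# hygiene rules, and write the cleaned rows to another table. All resource
# names below are hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def clean_record(row):
    """Example hygiene rules: trim string fields and null out empty phone numbers."""
    cleaned = {k: (v.strip() if isinstance(v, str) else v) for k, v in row.items()}
    if not cleaned.get("phone"):          # "phone" is an assumed column name
        cleaned["phone"] = None
    return cleaned


def run():
    options = PipelineOptions(
        runner="DataflowRunner",           # batch mode by default
        project="my-project",              # hypothetical project ID
        region="us-central1",
        temp_location="gs://my-bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromBigQuery(table="my-project:telemetry.raw_events")
            | "Clean" >> beam.Map(clean_record)
            | "Write" >> beam.io.WriteToBigQuery(
                "my-project:telemetry.clean_events",
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,  # target table assumed to exist
            )
        )


if __name__ == "__main__":
    run()
```

With Dataprep, you would not write this code yourself; the scheduled recipe launches an equivalent Dataflow batch job once a day.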
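And here is a rough back-of-the-envelope comparison of the worker costs quoted above. The half-hour daily batch runtime and the 4-vCPU worker size are assumptions for illustration only; actual prices and runtimes vary with region and data volume.

```python
# Back-of-the-envelope cost comparison using the per-vCPU-hour prices cited above.
STREAMING_VCPU_PRICE = 0.069   # $ per vCPU-hour, Dataflow streaming worker
BATCH_VCPU_PRICE = 0.056       # $ per vCPU-hour, Dataprep/Dataflow batch worker
VCPUS = 4                      # assumed worker size

# Option A: a streaming worker runs around the clock, even when idle.
streaming_monthly = STREAMING_VCPU_PRICE * VCPUS * 24 * 30

# Option D: a scheduled batch job runs once a day (assumed ~0.5 h per run).
batch_monthly = BATCH_VCPU_PRICE * VCPUS * 0.5 * 30

print(f"Streaming (Option A): ~${streaming_monthly:.0f}/month")   # ~ $199/month
print(f"Scheduled batch (Option D): ~${batch_monthly:.2f}/month") # ~ $3.36/month
```

Even allowing for generous batch runtimes, the always-on streaming worker dominates the cost, which is why the daily scheduled Dataprep job is the cost-controlled choice the question is looking for.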