Question-41: Can graph analytics be used for recommendations?

Answer: Yes, it can. Suppose you are developing a search engine for your organization's internal website and you want to rank the pages by their importance. You can build a graph of the links between pages; the page that receives the most links from other pages can be considered the most important and recommended at the top of the search results. A minimal sketch of this idea is shown below.
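For illustration only, here is a minimal plain-Python sketch of that ranking idea: a simple PageRank computed over a made-up internal-site link graph (the page names and links are hypothetical). In a Spark project you would more likely use Spark's graph libraries, but the underlying idea is the same.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Iteratively score each page; pages that receive links from many
    (and from highly ranked) pages end up with higher scores."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical internal-site link graph: each key links to the pages in its list.
links = {
    "home":      ["hr-portal", "wiki", "timesheet"],
    "wiki":      ["hr-portal", "timesheet"],
    "timesheet": ["hr-portal"],
    "hr-portal": ["home"],
}

# Pages with the most (and best) incoming links rank highest.
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(f"{page:10s} {score:.3f}")
```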

 

Question-42: In general, what steps do you have to perform as part of Machine Learning advanced analytics?

Answer: When you do advanced analytics using Machine Learning, you have to perform at least the tasks below. Later questions cover each of them in more detail, and a minimal end-to-end sketch follows this list.

  • Preparing your data (data collection and formatting).
  • Selecting a Machine Learning model.
  • Testing the Machine Learning model in various ways.
  • Model tuning.
  • Productionizing the model for new or future data.
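Below is a minimal PySpark sketch of how these tasks fit together, using a tiny made-up text dataset with `text` and `label` columns. It is only an outline under those assumptions, not a complete project.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-workflow-sketch").getOrCreate()

# 1. Prepare your data (a tiny in-line sample instead of a real source).
data = spark.createDataFrame(
    [("spark makes big data processing easy", 1.0),
     ("the team meeting is at noon today", 0.0),
     ("spark sql and mllib are useful", 1.0),
     ("the lunch menu for friday is out", 0.0)],
    ["text", "label"])
train, test = data.randomSplit([0.8, 0.2], seed=42)

# 2. Select a model and wire the feature steps plus the model into a Pipeline.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(maxIter=10),
])
model = pipeline.fit(train)

# 3. Test the model on data it has not seen (model tuning, for example with
#    CrossValidator, is omitted from this sketch).
model.transform(test).select("text", "prediction").show(truncate=False)

# 4. Productionize: persist the fitted pipeline so it can score future data.
model.write().overwrite().save("/tmp/ml-workflow-sketch-model")
```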

 

Question-43: Can you describe in detail the steps you have taken in your previous advanced analytics projects?

Answer: For any new project, I take the following steps to complete Machine Learning advanced analytics.

  1. Data collection: Collect and locate all the available historical data. You may already have it in your organization's data lake.
  2. Data inspection: Inspect the data to check whether it fulfils your needs, and clean it as required.
  3. Feature engineering: Convert your data into a format that can be used by your algorithm or model, for example converting text data into numerical vectors.
  4. Divide data: Divide your available data into at least two parts. The first part is used to train your model, and the rest is used for testing it; based on the testing you can decide which candidate models to consider. For example, you might use 60% of your data to train the model and keep the remaining 40% aside for evaluation.
  5. Model evaluation: Having defined the success criteria for your model, evaluate each candidate model against those criteria, using the 40% of the data you kept aside in the previous step.
  6. Productionize model: Once you have selected the right model, put it to work on new, future data.

However, please note that you would not require all of these steps for every advanced analytics project; it all depends on what data you have and in what stage or format it is available. A short sketch of steps 4 to 6 is shown below.
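To make steps 4 to 6 concrete, here is a hedged PySpark sketch that splits a prepared dataset 60/40, evaluates two candidate models on the held-out 40% using area under ROC as the success criterion, and persists the winner. The `prepared` DataFrame is a made-up stand-in for the output of the feature engineering step.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("candidate-model-selection").getOrCreate()

# Stand-in for the output of feature engineering (label + numeric feature vector).
prepared = spark.createDataFrame(
    [(1.0, Vectors.dense([0.9, 0.1])), (0.0, Vectors.dense([0.1, 0.8])),
     (1.0, Vectors.dense([0.8, 0.2])), (0.0, Vectors.dense([0.2, 0.9])),
     (1.0, Vectors.dense([0.7, 0.3])), (0.0, Vectors.dense([0.3, 0.7])),
     (1.0, Vectors.dense([0.95, 0.05])), (0.0, Vectors.dense([0.05, 0.95]))],
    ["label", "features"])

# Step 4: divide the data, roughly 60% for training and 40% for evaluation.
train, holdout = prepared.randomSplit([0.6, 0.4], seed=7)

# Step 5: evaluate each candidate model against the same success criterion.
evaluator = BinaryClassificationEvaluator()  # success criterion: area under ROC
candidates = {
    "logistic_regression": LogisticRegression(maxIter=10),
    "decision_tree": DecisionTreeClassifier(),
}
scores = {}
for name, estimator in candidates.items():
    model = estimator.fit(train)
    scores[name] = (evaluator.evaluate(model.transform(holdout)), model)

best_name, (best_score, best_model) = max(scores.items(), key=lambda kv: kv[1][0])
print("Best candidate:", best_name, "AUC:", best_score)

# Step 6: productionize the chosen model so it can score new, future data.
best_model.write().overwrite().save("/tmp/best-candidate-model")
```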

 

Question-44: What do you mean by data collection in Advanced Analytics?

Answer: This is one of the first steps in your data analytics and also one of the hardest compared with the other steps, if you are looking at overall end-to-end project delivery. You need to gather data from various places, and sometimes you need to buy it from a third-party vendor if you do not already have it. This data is then used to test and evaluate your Machine Learning model. Popular tools used for data collection, especially in the Big Data world, are listed below, followed by a small collection sketch.

  • Apache Spark: It can help you process and collect data from various sources and store it in one central place such as HDFS.
  • Hadoop framework: Hadoop has various components that can help bring in the data and save it to HDFS or cloud buckets.
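As an illustration, here is a small PySpark sketch that collects data from two hypothetical sources (a CSV drop area and a relational table read over JDBC; all paths, URLs, and credentials are made up) and lands both in one central place on HDFS as Parquet.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-collection-sketch").getOrCreate()

# Source 1: CSV files dropped by another team (hypothetical path).
clicks = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///landing/clickstream/*.csv"))

# Source 2: a relational table pulled over JDBC (hypothetical connection details).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/sales")
          .option("dbtable", "public.orders")
          .option("user", "etl_user")
          .option("password", "***")
          .load())

# Store both datasets in one central location for later analytics.
clicks.write.mode("overwrite").parquet("hdfs:///datalake/raw/clickstream")
orders.write.mode("overwrite").parquet("hdfs:///datalake/raw/orders")
```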

 

Question-45: What do you mean by the data cleaning step?

Answer: Once you have purchased or gathered the data you want to use for model building, the next step is to clean it. In statistics this step is known as Exploratory Data Analysis, or EDA. You should be able to run ad hoc queries on your data and also use some visualization tools, and you try to understand the relationships in the data with the help of data distributions and correlations. You may also want to remove data you do not need, fill in missing values, and so on. This is one of the critical steps for understanding your data in detail, and it is worth spending extra time here to avoid mistakes in later steps or having to repeat this step again. If you are using Apache Spark, Spark SQL is well suited for this, as sketched below.
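For illustration, here is a small PySpark sketch of typical EDA and cleaning moves with Spark SQL; the input path and the column names (age, income, country) are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("eda-sketch").getOrCreate()

df = spark.read.parquet("hdfs:///datalake/raw/customers")  # hypothetical path
df.createOrReplaceTempView("customers")

# Ad hoc queries to understand the data distributions.
spark.sql("""
    SELECT country, COUNT(*) AS cnt, AVG(income) AS avg_income
    FROM customers
    GROUP BY country
    ORDER BY cnt DESC
""").show()

# Summary statistics and correlation between numeric columns.
df.describe("age", "income").show()
print("corr(age, income) =", df.stat.corr("age", "income"))

# Drop rows we do not need and fill in missing values.
cleaned = (df.filter("age IS NOT NULL AND age BETWEEN 18 AND 99")
             .fillna({"income": 0.0, "country": "unknown"}))

cleaned.write.mode("overwrite").parquet("hdfs:///datalake/clean/customers")
```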