
[2024] Pass Snowflake DSA-C02 Exam in First Attempt Easily
The Most Efficient DSA-C02 Pdf Dumps For Assured Success
NEW QUESTION # 34
Which one of the following is not a key component when designing external functions in Snowflake?
- A. Remote Service
- B. API Integration
- C. Proxy Service
- D. UDF Service
Answer: D
Explanation:
What is an External Function?
An external function calls code that is executed outside Snowflake.
The remotely executed code is known as a remote service.
Information sent to a remote service is usually relayed through a proxy service.
Snowflake stores security-related external function information in an API integration.
External Function:
An external function is a type of UDF. Unlike other UDFs, an external function does not contain its own code; instead, the external function calls code that is stored and executed outside Snowflake.
Inside Snowflake, the external function is stored as a database object that contains information that Snowflake uses to call the remote service. This stored information includes the URL of the proxy service that relays information to and from the remote service.
Remote Service:
The remotely executed code is known as a remote service.
The remote service must act like a function. For example, it must return a value.
Snowflake supports scalar external functions; the remote service must return exactly one row for each row received.
Proxy Service:
Snowflake does not call a remote service directly. Instead, Snowflake calls a proxy service, which relays the data to the remote service.
The proxy service can increase security by authenticating requests to the remote service.
The proxy service can support subscription-based billing for a remote service. For example, the proxy service can verify that a caller to the remote service is a paid subscriber.
The proxy service also relays the response from the remote service back to Snowflake.
Examples of proxy services include:
Amazon API Gateway.
Microsoft Azure API Management service.
API Integration:
An integration is a Snowflake object that provides an interface between Snowflake and third-party services.
An API integration stores information, such as security information, that is needed to work with a proxy service or remote service.
An API integration is created with the CREATE API INTEGRATION command.
Users can write and call their own remote services, or call remote services written by third parties. These remote services can be written using any HTTP server stack, including cloud serverless compute services such as AWS Lambda.
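To make the "one row in, one row out" rule concrete, here is a minimal sketch of the logic a remote service might run. It assumes the batch format Snowflake uses for external functions, a JSON body of the form {"data": [[row_number, arg1, ...], ...]}. The function name handle_batch and the upper-casing logic are hypothetical, chosen only for illustration.

```python
import json


def handle_batch(body: str) -> str:
    """Process one batch of rows sent to a remote service.

    Each input row is [row_number, arg1, ...]; the reply must contain
    exactly one [row_number, value] entry for every row received.
    """
    rows = json.loads(body)["data"]
    results = []
    for row in rows:
        row_number, text = row[0], row[1]
        # Hypothetical scalar logic: return the argument in upper case.
        results.append([row_number, str(text).upper()])
    return json.dumps({"data": results})


# Simulate the request a proxy service would relay to the remote service.
request = json.dumps({"data": [[0, "alpha"], [1, "beta"]]})
response = json.loads(handle_batch(request))
print(response)
```

In a real deployment this function would sit behind a proxy service such as Amazon API Gateway; the wrapper that the serverless platform requires is omitted here.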
NEW QUESTION # 35
Which learning methodology applies the conditional probability of all the variables with respect to the dependent variable?
- A. Supervised learning
- B. Unsupervised learning
- C. Reinforcement learning
- D. Artificial learning
Answer: A
Explanation:
Supervised learning applies the conditional probability of all the variables with respect to the dependent variable: the model estimates how likely each outcome is given the observed input values. Estimating such conditional probabilities is a basic method of estimating statistics from a small number of random experiments.
Conditional probability is the likelihood of an event or outcome occurring given that some other event or prior outcome has occurred. Two events are said to be independent if one event occurring does not affect the probability that the other event will occur.
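As a small illustration, conditional probability can be estimated from labelled data by counting. The weather observations below are made up for the example, and conditional_probability is a hypothetical helper, not part of any library.

```python
# Hypothetical labelled observations: (weather, played_outside)
observations = [
    ("sunny", True), ("sunny", True), ("sunny", False),
    ("rainy", False), ("rainy", False), ("rainy", True),
    ("sunny", True), ("rainy", False),
]


def conditional_probability(data, feature_value, label):
    """Estimate P(label | feature_value) as count(both) / count(feature_value)."""
    feature_count = sum(1 for f, _ in data if f == feature_value)
    joint_count = sum(1 for f, y in data if f == feature_value and y == label)
    return joint_count / feature_count


p_play_given_sunny = conditional_probability(observations, "sunny", True)
p_play_given_rainy = conditional_probability(observations, "rainy", True)
print(p_play_given_sunny, p_play_given_rainy)  # 0.75 0.25
```

Because the two estimates differ, the label is not independent of the weather feature in this toy dataset.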
NEW QUESTION # 36
Which of the following are key actions in the data collection phase of machine learning?
- A. Ingest and Aggregate
- B. Label
- C. Measure
- D. Probability
Answer: A,B
Explanation:
The key actions in the data collection phase include:
Label: Labeled data is raw data that has been processed by adding one or more meaningful tags so that a model can learn from it. If such labels are missing, the data has to be labeled manually or automatically, which takes some work.
Ingest and Aggregate: Incorporating and combining data from many data sources is part of data collection in AI.
Data collection
Collecting data for training the ML model is the basic step in the machine learning pipeline. The predictions made by ML systems can only be as good as the data on which they have been trained. Following are some of the problems that can arise in data collection:
Inaccurate data. The collected data could be unrelated to the problem statement.
Missing data. Sub-data could be missing. That could take the form of empty values in columns or missing images for some class of prediction.
Data imbalance. Some classes or categories in the data may have a disproportionately high or low number of corresponding samples. As a result, they risk being under-represented in the model.
Data bias. Depending on how the data, subjects and labels themselves are chosen, the model could propagate inherent biases on gender, politics, age or region, for example. Data bias is difficult to detect and remove.
Several techniques can be applied to address those problems:
Pre-cleaned, freely available datasets. If the problem statement (for example, image classification, object recognition) aligns with a clean, pre-existing, properly formulated dataset, then take advantage of existing, open-source expertise.
Web crawling and scraping. Automated tools, bots and headless browsers can crawl and scrape websites for data.
Private data. ML engineers can create their own data. This is helpful when the amount of data required to train the model is small and the problem statement is too specific to generalize over an open-source dataset.
Custom data. Agencies can create or crowdsource the data for a fee.
NEW QUESTION # 37
Which of the following is a common evaluation metric for binary classification?
- A. Area under the ROC curve (AUC)
- B. F1 score
- C. Accuracy
- D. Mean squared error (MSE)
Answer: A
Explanation:
The area under the ROC curve (AUC) is a common evaluation metric for binary classification, which measures the performance of a classifier at different threshold values for the predicted probabilities. Other common metrics include accuracy, precision, recall, and F1 score, which are based on the confusion matrix of true positives, false positives, true negatives, and false negatives.
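One way to see what AUC measures: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by direct pair counting. The function name roc_auc and the sample data are illustrative assumptions, and the quadratic loop is meant for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, which gives 0.75. Unlike accuracy, this value does not depend on picking a single decision threshold.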
NEW QUESTION # 38
Which one is not a type of feature scaling?
- A. Economy Scaling
- B. Standard Scaling
- C. Robust Scaling
- D. Min-Max Scaling
Answer: A
Explanation:
Feature Scaling
Feature Scaling is the process of transforming the features so that they have a similar scale. This is important in machine learning because the scale of the features can affect the performance of the model.
Types of Feature Scaling:
Min-Max Scaling: Rescaling the features to a specific range, such as between 0 and 1, by subtracting the minimum value and dividing by the range.
Standard Scaling: Rescaling the features to have a mean of 0 and a standard deviation of 1 by subtracting the mean and dividing by the standard deviation.
Robust Scaling: Rescaling the features to be robust to outliers by subtracting the median and dividing by the interquartile range.
Benefits of Feature Scaling:
Improves Model Performance: By transforming the features to have a similar scale, the model can learn from all features equally and avoid being dominated by a few large features.
Increases Model Robustness: By transforming the features to be robust to outliers, the model can become more robust to anomalies.
Improves Computational Efficiency: Many machine learning algorithms, such as k-nearest neighbors, are sensitive to the scale of the features and perform better with scaled features.
Improves Model Interpretability: By transforming the features to have a similar scale, it can be easier to understand the model's predictions.
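The three scaling types described above can be sketched with the Python standard library alone. The helper names and the sample values are assumptions for illustration; libraries such as scikit-learn provide production versions of these transforms.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # note the outlier at 100


def min_max_scale(xs):
    """Rescale to [0, 1]: subtract the minimum, divide by the range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standard_scale(xs):
    """Rescale to mean 0 and standard deviation 1."""
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]


def robust_scale(xs):
    """Subtract the median and divide by the interquartile range."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    median = statistics.median(xs)
    return [(x - median) / (q3 - q1) for x in xs]


print(min_max_scale(values))
print(standard_scale(values))
print(robust_scale(values))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

With robust scaling the four ordinary values keep a sensible spread around zero, while min-max scaling squeezes them close to 0 because the single outlier defines the range.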
NEW QUESTION # 39
A data scientist uses streams in ELT (extract, load, transform) processes where new data inserted into a staging table is tracked by a stream. A set of SQL statements transforms and inserts the stream contents into a set of production tables. Raw data arrives in JSON format, but for analysis it must be transformed into relational columns in the production tables. Which of the following data transformation SQL functions can be used to achieve this?
- A. METADATA$ACTION ()
- B. He could not apply Transformation on Stream table data.
- C. Transpose()
- D. lateral flatten()
Answer: D
Explanation:
For details on using LATERAL with FLATTEN, see:
https://docs.snowflake.com/en/sql-reference/constructs/join-lateral#example-of-using-lateral-with-flatten
NEW QUESTION # 40
What can a Snowflake data scientist do in the Snowflake Marketplace as a provider?
- A. Eliminate the costs of building and maintaining APIs and data pipelines to deliver data to customers.
- B. Share live datasets securely and in real-time without creating copies of the data or imposing data integration tasks on the consumer.
- C. Publish listings for datasets that can be customized for the consumer.
- D. Publish listings for free-to-use datasets to generate interest and new opportunities among the Snowflake customer base.
Answer: A,B,C,D
Explanation:
All are correct!
About the Snowflake Marketplace
You can use the Snowflake Marketplace to discover and access third-party data and services, as well as market your own data products across the Snowflake Data Cloud.
As a data provider, you can use listings on the Snowflake Marketplace to share curated data offerings with many consumers simultaneously, rather than maintain sharing relationships with each individual consumer.
With Paid Listings, you can also charge for your data products.
As a consumer, you might use the data provided on the Snowflake Marketplace to explore and access the following:
Historical data for research, forecasting, and machine learning.
Up-to-date streaming data, such as current weather and traffic conditions.
Specialized identity data for understanding subscribers and audience targets.
New insights from unexpected sources of data.
The Snowflake Marketplace is available globally to all non-VPS Snowflake accounts hosted on Amazon Web Services, Google Cloud Platform, and Microsoft Azure, with the exception of Microsoft Azure Government.
Support for Microsoft Azure Government is planned.
NEW QUESTION # 41
Which one is not a feature engineering technique used in ML and data science?
- A. Imputation
- B. Binning
- C. One hot encoding
- D. Statistical
Answer: D
Explanation:
Feature engineering is the pre-processing step of machine learning, which is used to transform raw data into features that can be used for creating a predictive model using Machine learning or statistical Modelling.
What is a feature?
Generally, all machine learning algorithms take input data to generate the output. The input data remains in a tabular form consisting of rows (instances or observations) and columns (variables or attributes), and these attributes are often known as features. For example, an image is an instance in computer vision, but a line in the image could be the feature. Similarly, in NLP, a document can be an observation, and the word count could be the feature. So, we can say a feature is an attribute that impacts a problem or is useful for the problem.
What is Feature Engineering?
Feature engineering is the pre-processing step of machine learning, which extracts features from raw data. It helps to represent the underlying problem to predictive models in a better way, which as a result improves the accuracy of the model on unseen data. The predictive model contains predictor variables and an outcome variable, and the feature engineering process selects the most useful predictor variables for the model.
Some of the popular feature engineering techniques include:
1. Imputation
Feature engineering deals with inappropriate data, missing values, human interruption, general errors, insufficient data sources, etc. Missing values within the dataset strongly affect the performance of the algorithm, and the "imputation" technique is used to deal with them. Imputation is responsible for handling irregularities within the dataset.
One option is to drop an entire row or column that has a large percentage of missing values. To preserve the size of the dataset, however, the missing data can instead be imputed as follows:
For numerical data imputation, a default value can be imputed in a column, and missing values can be filled with means or medians of the columns.
For categorical data imputation, missing values can be replaced with the most frequent value in a column.
2. Handling Outliers
Outliers are data points that lie so far from the other data points that they badly affect the performance of the model. This feature engineering technique first identifies the outliers and then removes them.
Standard deviation can be used to identify outliers. For example, each value lies at some distance from the mean, and a value whose distance exceeds a chosen threshold can be considered an outlier.
Z-score can also be used to detect outliers.
3. Log transform
Logarithm transformation, or log transform, is one of the commonly used mathematical techniques in machine learning. It helps in handling skewed data, making the distribution closer to normal after transformation. It also reduces the effect of outliers: because magnitude differences are normalized, the model becomes more robust.
4. Binning
In machine learning, overfitting is one of the main issues that degrade model performance; it occurs due to a large number of parameters and noisy data. One of the popular feature engineering techniques, "binning", can be used to normalize noisy data. This process involves segmenting the values of a feature into bins.
5. Feature Split
As the name suggests, feature split is the process of splitting a feature into two or more parts to make new features. This technique helps the algorithms to better understand and learn the patterns in the dataset.
The feature splitting process enables the new features to be clustered and binned, which results in extracting useful information and improving the performance of the data models.
6. One hot encoding
One hot encoding is a popular encoding technique in machine learning. It converts categorical data into a form that machine learning algorithms can easily understand and therefore use to make good predictions. It enables grouping of categorical data without losing any information.
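Several of the techniques listed above (imputation, log transform, binning and one hot encoding) can be sketched in a few lines of standard-library Python. The sample columns, the bin boundaries and the helper age_bin are assumptions made up for this example.

```python
import math
import statistics

ages = [22, None, 35, 58, None, 41]      # numeric column with missing values
cities = ["Paris", "Oslo", "Paris", "Rome"]  # categorical column

# Imputation: fill missing numeric values with the median of observed values.
observed = [a for a in ages if a is not None]
median_age = statistics.median(observed)
imputed = [median_age if a is None else a for a in ages]

# Log transform: log(1 + x) compresses large magnitudes.
log_ages = [math.log1p(a) for a in imputed]


# Binning: map each value to a coarse bucket (boundaries are hypothetical).
def age_bin(age):
    if age < 30:
        return "young"
    if age < 50:
        return "middle"
    return "senior"


bins = [age_bin(a) for a in imputed]

# One hot encoding: one 0/1 column per distinct category.
categories = sorted(set(cities))
one_hot = [[1 if c == cat else 0 for cat in categories] for c in cities]

print(imputed)     # [22, 38.0, 35, 58, 38.0, 41]
print(bins)
print(categories)  # ['Oslo', 'Paris', 'Rome']
print(one_hot)
```

In practice these steps are usually done with pandas or scikit-learn; the plain-Python version is only meant to show what each transformation does to the data.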
NEW QUESTION # 42
Which of the following methods is used for multiclass classification?
- A. all vs one
- B. one vs another
- C. one vs rest
- D. loocv
Answer: C
Explanation:
Binary vs. Multi-Class Classification
Classification problems are common in machine learning. In most cases, developers prefer using a supervised machine-learning approach to predict class labels for a given dataset. Unlike regression, classification involves designing the classifier model and training it to take in and categorize the test dataset. For that, you can divide the dataset into either binary or multi-class modules.
As the name suggests, binary classification involves solving a problem with only two class labels. This makes it easy to filter the data, apply classification algorithms, and train the model to predict outcomes. On the other hand, multi-class classification is applicable when there are more than two class labels in the input train data.
The technique enables developers to categorize the test data into multiple binary class labels.
That said, while binary classification requires only one classifier model, the one used in the multi-class approach depends on the classification technique. Below are the two models of the multi-class classification algorithm.
One-Vs-Rest Classification Model for Multi-Class Classification
Also known as one-vs-all, the one-vs-rest model is a heuristic method that leverages a binary classification algorithm for multi-class classification. The technique involves splitting a multi-class dataset into multiple binary problems. A binary classifier is then trained on each binary problem, and predictions are made using the most confident classifier.
For instance, with a multi-class classification problem with red, green, and blue datasets, binary classification can be categorized as follows:
Problem one: red vs. green/blue
Problem two: blue vs. green/red
Problem three: green vs. blue/red
The main challenge of using this model is that you must create a model for every class. The three classes above require three models, which can be challenging for large datasets with millions of rows, slow models such as neural networks, and datasets with a significant number of classes.
The one-vs-rest approach requires each individual model to predict a probability-like score. The class index with the largest score is then used to predict a class. As such, it is commonly used for classification algorithms that can naturally predict scores or numerical class membership, such as perceptron and logistic regression.
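The one-vs-rest procedure can be sketched as follows. To keep the example short and deterministic, the binary model here is a simple nearest-centroid scorer rather than a perceptron or logistic regression; that substitution, the function names, and the toy colour data are all assumptions made for illustration.

```python
import math


def centroid(points):
    """Mean point of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def fit_one_vs_rest(X, y):
    """Train one binary scorer per class: that class vs. all the others."""
    models = {}
    for cls in sorted(set(y)):
        positives = [x for x, label in zip(X, y) if label == cls]
        rest = [x for x, label in zip(X, y) if label != cls]
        models[cls] = (centroid(positives), centroid(rest))
    return models


def predict(models, x):
    """Score x with every binary model and return the most confident class."""
    scores = {
        # Higher when x is close to the class centroid and far from the rest.
        cls: math.dist(x, neg) - math.dist(x, pos)
        for cls, (pos, neg) in models.items()
    }
    return max(scores, key=scores.get)


X = [(0, 0), (1, 0), (10, 0), (9, 1), (0, 10), (1, 9)]
y = ["red", "red", "blue", "blue", "green", "green"]

models = fit_one_vs_rest(X, y)   # three classes, so three binary models
prediction = predict(models, (8, 1))
print(len(models), prediction)   # 3 blue
```

Three classes produce three binary models, matching the red/green/blue breakdown above, and the final label comes from the model with the largest score.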
NEW QUESTION # 43
Which Python (pandas) method can a data scientist use to remove duplicates?
- A. remove_duplicates()
- B. clean_duplicates()
- C. drop_duplicates()
- D. duplicates()
Answer: C
Explanation:
The drop_duplicates() method removes duplicate rows.
dataframe.drop_duplicates(subset, keep, inplace, ignore_index)
Remove duplicate rows from the DataFrame:
    import pandas as pd

    data = {
        "name": ["Peter", "Mary", "John", "Mary"],
        "age": [50, 40, 30, 40],
        "qualified": [True, False, False, False]
    }

    df = pd.DataFrame(data)
    # Rows 1 and 3 ("Mary", 40, False) are identical, so the second copy is dropped.
    newdf = df.drop_duplicates()
NEW QUESTION # 44
Which of the following is a Python-based web application framework for visualizing data and analyzing results in a more efficient and flexible way?
- A. Streamsets
- B. Rapter
- C. StreamBI
- D. Streamlit
Answer: D
Explanation:
Streamlit is a Python-based web application framework for visualizing data and analyzing results in a more efficient and flexible way. It is an open source library that assists data scientists and academics to develop Machine Learning (ML) visualization dashboards in a short period of time. We can build and deploy powerful data applications with just a few lines of code.
Why Streamlit?
Currently, real-world applications are in high demand and developers are developing new libraries and frameworks to make on-the-go dashboards easier to build and deploy. Streamlit is a library that reduces your dashboard development time from days to hours. Following are some reasons to choose the Streamlit:
It is a free and open-source library.
Installing Streamlit is as simple as installing any other Python package.
It is easy to learn because you won't need any web development experience; a basic understanding of Python is enough to build a data application.
It is compatible with almost all machine learning frameworks, including TensorFlow, PyTorch, and scikit-learn, and with visualization libraries such as Seaborn, Altair, Plotly, and many others.
NEW QUESTION # 45
Which command is used to install Jupyter Notebook?
- A. pip install notebook
- B. pip install jupyter
- C. pip install jupyter-notebook
- D. pip install nbconvert
Answer: B
Explanation:
Jupyter Notebook is a web-based interactive computational environment.
The command used to install Jupyter Notebook is pip install jupyter.
The command used to start Jupyter Notebook is jupyter notebook.
NEW QUESTION # 46
You previously trained a model using a training dataset. You want to detect any data drift in the new data collected since the model was trained.
What should you do?
- A. Create a new version of the dataset using only the new data and retrain the model.
- B. Add the new data to the existing dataset and enable Application Insights for the service where the model is deployed.
- C. Create a new dataset using the new data and a timestamp column and create a data drift monitor that uses the training dataset as a baseline and the new dataset as a target.
- D. Retrain the model on the training dataset after correcting data outliers; there is no need to introduce new data.
Answer: C
Explanation:
To track changing data trends, create a data drift monitor that uses the training data as a baseline and the new data as a target.
Model drift and decay are concepts that describe the process during which the performance of a model deployed to production degrades on new, unseen data or the underlying assumptions about the data change.
These are important metrics to track once models are deployed to production. Models must be regularly re-trained on new data. This is referred to as refitting the model. This can be done either on a periodic basis, or, in an ideal scenario, retraining can be triggered when the performance of the model degrades below a certain pre-defined threshold.
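The baseline-versus-target idea behind a data drift monitor can be illustrated with a very simple check. This is not how any particular monitoring service computes drift; it is a hand-rolled sketch that measures how far the mean of one feature has moved, in units of the baseline standard deviation. The function name, the sample values and the threshold are all hypothetical.

```python
import statistics


def mean_shift_score(baseline, target):
    """How many baseline standard deviations the target mean has moved."""
    base_mean = statistics.fmean(baseline)
    base_std = statistics.pstdev(baseline)
    return abs(statistics.fmean(target) - base_mean) / base_std


baseline = [10.0, 12.0, 11.0, 9.0, 13.0]   # feature values seen at training time
target = [15.0, 16.0, 14.0, 17.0, 15.0]    # same feature in newly collected data
DRIFT_THRESHOLD = 2.0                      # hypothetical cut-off

score = mean_shift_score(baseline, target)
drift_detected = score > DRIFT_THRESHOLD
print(round(score, 2), drift_detected)
```

A check like this could be run on a schedule for each feature, with retraining triggered once the score crosses the chosen threshold.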
NEW QUESTION # 47
A data scientist can query, process, and transform data in which of the following ways using Snowpark Python? [Select 2]
- A. Query and process data with a DataFrame object.
- B. Write a user-defined tabular function (UDTF) that processes data and returns data in a set of rows with one or more columns.
- C. Transform Data using DataIKY tool with SnowPark API.
- D. Snowpark currently does not support writing UDTFs.
Answer: A,B
Explanation:
Query and process data with a DataFrame object. Refer to Working with DataFrames in Snowpark Python.
Convert custom lambdas and functions to user-defined functions (UDFs) that you can call to process data.
Write a user-defined tabular function (UDTF) that processes data and returns data in a set of rows with one or more columns.
Write a stored procedure that you can call to process data, or automate with a task to build a data pipeline.
NEW QUESTION # 48
Which type of machine learning do data scientists generally use for solving classification and regression problems?
- A. Regression Learning
- B. Supervised
- C. Instructor Learning
- D. Reinforcement Learning
- E. Unsupervised
Answer: B
Explanation:
Supervised Learning
Overview:
Supervised learning is a type of machine learning that uses labeled data to train machine learning models. In labeled data, the output is already known. The model just needs to map the inputs to the respective outputs.
Algorithms:
Some of the most popularly used supervised learning algorithms are:
Linear Regression
Logistic Regression
Support Vector Machine
K Nearest Neighbor
Decision Tree
Random Forest
Naive Bayes
Working:
Supervised learning algorithms take labelled inputs and map them to the known outputs, which means you already know the target variable.
Supervised Learning methods need external supervision to train machine learning models. Hence, the name supervised. They need guidance and additional information to return the desired result.
Applications:
Supervised learning algorithms are generally used for solving classification and regression problems.
A few of the top supervised learning applications are weather prediction, sales forecasting, and stock price analysis.
NEW QUESTION # 49
Which metric is not used for evaluating classification models?
- A. Mean absolute error
- B. Recall
- C. Accuracy
- D. Precision
Answer: A
Explanation:
The four commonly used metrics for evaluating classifier performance are:
1. Accuracy: The proportion of correct predictions out of the total predictions.
2. Precision: The proportion of true positive predictions out of the total positive predictions (precision = true positives / (true positives + false positives)).
3. Recall (Sensitivity or True Positive Rate): The proportion of true positive predictions out of the total actual positive instances (recall = true positives / (true positives + false negatives)).
4. F1 Score: The harmonic mean of precision and recall, providing a balance between the two metrics (F1 score = 2 * ((precision * recall) / (precision + recall))).
Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) are metrics used to evaluate a regression model. They tell us how accurate the predictions are and how much they deviate from the actual values.
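The four classification formulas above translate directly into code. The sketch below counts the confusion-matrix cells for a made-up set of binary predictions; the function name and the sample labels are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so all four metrics come out to 0.75. The sketch does not guard against division by zero when a class is never predicted.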
NEW QUESTION # 50
Which method is used for detecting data outliers in Machine learning?
- A. BOXI
- B. Scaler
- C. CMIYC
- D. Z-Score
Answer: D
Explanation:
What are outliers?
Outliers are the values that look different from the other values in the data. When the data is plotted, outliers can appear at both extremes of the data.
Reasons for outliers in data
Errors during data entry or a faulty measuring device (a faulty sensor may result in extreme readings).
Natural occurrence (salaries of junior level employees vs C-level employees).
Problems caused by outliers
Outliers in the data may cause problems during model fitting (esp. linear models).
Outliers may inflate the error metrics which give higher weights to large errors (example, mean squared error, RMSE).
The Z-score method is one of the methods for detecting outliers. It is generally used when a variable's distribution looks close to Gaussian. The Z-score is the number of standard deviations a value of a variable is away from the variable's mean.
Z-Score = (X - mean) / standard deviation
The IQR method and box plots are further examples of methods used to detect data outliers in data science.
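A minimal sketch of Z-score outlier detection, using only the standard library. The readings and the threshold of 3 standard deviations are assumptions for the example; 3 is a common convention rather than a fixed rule.

```python
import statistics


def z_score_outliers(values, threshold=3.0):
    """Return the values whose |Z-score| exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [x for x in values if abs((x - mean) / std) > threshold]


# Hypothetical sensor readings with one faulty measurement.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 95]
outliers = z_score_outliers(readings)
print(outliers)  # [95]
```

Note that an extreme value also inflates the mean and standard deviation it is compared against, so on very small samples a Z-score above 3 may be impossible to reach; the IQR method is less affected by this.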
NEW QUESTION # 51
How do you handle missing or corrupted data in a dataset?
- A. Drop missing rows or columns
- B. All of the above
- C. Assign a unique category to missing values
- D. Replace missing values with mean/median/mode
Answer: B
NEW QUESTION # 52
Consider a data frame df with columns ['A', 'B', 'C', 'D'] and rows ['r1', 'r2', 'r3']. What does the expression df[lambda x : x.index.str.endswith('3')] do?
- A. Filters the row labelled r3
- B. Results in Error
- C. Returns the row name r3
- D. Returns the third column
Answer: A
Explanation:
The callable is evaluated on df and returns a boolean mask over the index labels, so the expression filters the row labelled r3.
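The behaviour can be checked with a small data frame. The cell values below are made up; only the row and column labels come from the question.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    index=["r1", "r2", "r3"],
    columns=["A", "B", "C", "D"],
)

# The lambda receives df and returns one boolean per row label.
mask = df.index.str.endswith("3")
print(list(mask))  # [False, False, True]

# Indexing with a callable applies it to df and then filters rows by the mask.
filtered = df[lambda x: x.index.str.endswith("3")]
print(filtered)
```

The result is still a DataFrame (one row, four columns), which is why "filters the row" is the right description rather than "returns the row name".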
NEW QUESTION # 53
Which tools help data scientists manage the ML lifecycle and model versioning?
- A. Albert
- B. MLFlow
- C. Pachyderm
- D. CRUX
Answer: B,C
Explanation:
Model versioning involves tracking the changes made to an ML model that has been previously built.
Put differently, it is the process of making changes to the configurations of an ML Model. From another perspective, we can see model versioning as a feature that helps Machine Learning Engineers, Data Scientists, and related personnel create and keep multiple versions of the same model.
Think of it as a way of taking notes of the changes you make to the model through tweaking hyperparameters, retraining the model with more data, and so on.
In model versioning, a number of things need to be versioned, to help us keep track of important changes. I'll list and explain them below:
Implementation code: From the early days of model building to optimization stages, code or in this case source code of the model plays an important role. This code experiences significant changes during optimization stages which can easily be lost if not tracked properly. Because of this, code is one of the things that are taken into consideration during the model versioning process.
Data: In some cases, training data does improve significantly from its initial state during model optimization phases. This can be as a result of engineering new features from existing ones to train our model on. Also there is metadata (data about your training data and model) to consider versioning. Metadata can change different times over without the training data actually changing. We need to be able to track these changes through versioning.
Model: The model is a product of the two previous entities and, as stated in their explanations, an ML model changes at different points of the optimization phases through hyperparameter setting, model artifacts and learning coefficients. Versioning helps take record of the different versions of a Machine Learning model.
MLFlow & Pachyderm are the tools used to manage ML lifecycle & Model versioning.
NEW QUESTION # 54
Consider a data frame df with 10 rows and index [ 'r1', 'r2', 'r3', 'row4', 'row5', 'row6', 'r7', 'r8', 'r9', 'row10'].
What does the aggregate method shown in the code below do?
g = df.groupby(df.index.str.len())
g.aggregate({'A':len, 'B':np.sum})
- A. Computes length of column A
- B. Computes Sum of column A values
- C. Computes length of column A and Sum of Column B values
- D. Computes length of column A and Sum of Column B values of each group
Answer: D
Explanation:
Computes length of column A and Sum of Column B values of each group
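A runnable version of the scenario makes the grouping visible. The column values are invented for the example, and the string "sum" is used in place of np.sum, which is an assumption made to keep the sketch free of a NumPy import; the grouping logic is the same.

```python
import pandas as pd

index = ["r1", "r2", "r3", "row4", "row5", "row6", "r7", "r8", "r9", "row10"]
df = pd.DataFrame({"A": range(10), "B": [1] * 10}, index=index)

# Group rows by the length of their index label: 2, 4 or 5 characters.
g = df.groupby(df.index.str.len())

# Per group: number of entries in column A and sum of column B.
result = g.aggregate({"A": len, "B": "sum"})
print(result)
```

Labels such as 'r1' fall into the group with key 2 (six rows), 'row4' to 'row6' into the group with key 4 (three rows), and 'row10' into the group with key 5 (one row), so the aggregation returns one line per group rather than a single total.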
NEW QUESTION # 55
......
We offer the latest free online DSA-C02 dumps to practice: https://www.trainingquiz.com/DSA-C02-practice-quiz.html
Snowflake DSA-C02 Real Exam Questions Guaranteed Updated Dump: https://drive.google.com/open?id=1BKM9Y09KcSqC-0mJ2ivxZ6gCfM-wvAEh

