Spark practice questions



Set of questions & answers to test your introductory knowledge on Apache Spark concepts

1. What is the USP for Apache Spark?

    a. It runs programs in-memory up to 100x faster than MapReduce
    b. It offers over 80 high level operators
    c. Can be used from Scala and Python shells
    d. Product is already 5 years old

Answer: a.
It runs programs in-memory up to 100x faster than MapReduce


Explanation: Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. It runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
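
As a concrete illustration of the in-memory point, here is a minimal PySpark sketch; the Spark setup and the input file name are assumptions for illustration, not part of the question set.

    # Minimal PySpark sketch of in-memory reuse. Assumes a local Spark
    # installation; "data.txt" is a hypothetical input file.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "InMemoryDemo")

    lines = sc.textFile("data.txt")
    errors = lines.filter(lambda line: "ERROR" in line).cache()  # keep in memory

    # Both actions below reuse the cached RDD instead of re-reading the file
    # from disk -- the kind of reuse behind Spark's speedup over MapReduce.
    print(errors.count())
    print(errors.take(5))

    sc.stop()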

2. Which of the following is not an associated component of Spark?

    a. Shark for SQL
    b. MLlib for machine learning
    c. Spark Streaming
    d. Giraph

Answer: d.
Giraph


Explanation: Spark uses GraphX for graph processing. It enables users to easily and interactively build, transform, and reason about graph structured data at scale.

3. What is the status of Apache Spark as an Apache Software Foundation project?

    a. Incubator
    b. Sub project
    c. Top level project
    d. Proposal

Answer: c.
Top level project


Explanation: Spark has graduated from the Apache Incubator to become a top-level Apache project, signifying that the project’s community and products have been well-governed under the ASF’s meritocratic process and principles.

4. Who among the following offers commercial distribution of Apache Spark?

    a. Databricks
    b. Cloudera
    c. MapR
    d. All of the above

Answer: d.
All of the above


Explanation: Spark is available as open-source Apache Spark or as a commercial distribution from Databricks, Cloudera, and MapR.

Hadoop security practice questions


Set of questions & answers to test your introductory knowledge on Hadoop security concepts

1. If one has to define the 4 key pillars of Hadoop security, which one would you pick?

    a. Authentication, Authorization, Accountability and Data protection
    b. Authentication, Authorization, Accountability and Sensitivity
    c. Real time Access, Token Delegation, Data masking and Access Control
    d. Accountability, Token Delegation, Data masking and Access Control

Answer: a.
Authentication, Authorization, Accountability and Data protection


Explanation: The four security pillars and rings of defense considered in today's Hadoop systems are authentication, authorization, accountability (audit), and data protection.

2. Which is the most common form of authentication used in Hadoop?

    a. ACL
    b. Kerberos
    c. LDAP
    d. Proxy

Answer: b.
Kerberos


Explanation: Kerberos in Hadoop: (a) establishes identity for clients, hosts, and services; (b) prevents impersonation, since passwords are never sent over the wire; (c) integrates with enterprise identity-management tools such as LDAP/AD; and (d) enables more granular auditing of data access and job execution.
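
A rough sketch of the typical workflow on a Kerberos-secured cluster (the principal is hypothetical, and this assumes the MIT Kerberos client and the hadoop CLI are installed):

    # Obtain a Kerberos ticket, then issue an authenticated Hadoop command.
    # kinit prompts for the user's password; the password itself is never
    # sent over the wire.
    import subprocess

    subprocess.run(["kinit", "alice@EXAMPLE.COM"], check=True)  # get a ticket
    subprocess.run(["hadoop", "fs", "-ls", "/"], check=True)    # authenticated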

3. Which product among the following is not used for Hadoop Security?

    a. Sentry
    b. Knox
    c. Rhino
    d. Tez

Answer: d.
Tez


Explanation: Sentry, Knox and Rhino are Hadoop security solutions. Tez is an execution framework for data processing (used, for example, by Hive to run SQL on Hadoop), not a security product.

4. What is a delegation token?

    a. Data masking token
    b. Two-party authentication protocol
    c. Firewall entry code
    d. All of the above

Answer: b.
Two-party authentication protocol


Explanation: A delegation token implements a two-party authentication protocol: the user first authenticates with the NameNode using Kerberos and receives a delegation token, which can then be presented to the JobTracker for subsequent access.

HDFS practice questions



Set of questions & answers to test your introductory knowledge on HDFS concepts

1. HDFS stands for ____________.

    a. Hadoop Distributed Folder System
    b. Hadoop Distributed Feature System
    c. Hadoop Distributed File System
    d. Hadoop Direct File System

Answer: c.
Hadoop Distributed File System

Explanation: Hadoop uses a distributed file system inspired by the Google File System (GFS). It is called the Hadoop Distributed File System (HDFS).

2. Which command is used to upload files into the Hadoop environment?

    a. hadoop fs -ls
    b. hadoop fs
    c. hadoop fs -copyFromLocal source destination
    d. cp source destination

Answer: c.
hadoop fs -copyFromLocal source destination

Explanation: Invoke HDFS file system operations using hadoop fs -<command>. To copy from the local file system to HDFS, use hadoop fs -copyFromLocal source destination.
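
For example, the same upload driven from Python (a minimal sketch; assumes hadoop is on the PATH, and both file paths are hypothetical):

    # Upload a local file to HDFS via the hadoop CLI.
    import subprocess

    subprocess.run(
        ["hadoop", "fs", "-copyFromLocal", "sales.csv", "/user/hadoop/sales.csv"],
        check=True,  # raise CalledProcessError if the copy fails
    )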

3. Which of the following HDFS command parameters is used for downloading data from HDFS to the Linux file system?

    a. copyToLocal
    b. cp
    c. mv
    d. copyAtLocal

Answer: a.
copyToLocal

Explanation: The command for downloading data from HDFS to the local file system is hadoop fs -copyToLocal source destination.
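
And the reverse direction from Python (again a minimal sketch with hypothetical paths, assuming hadoop is on the PATH):

    # Download an HDFS file to the local file system via the hadoop CLI.
    import subprocess

    subprocess.run(
        ["hadoop", "fs", "-copyToLocal", "/user/hadoop/sales.csv", "sales.csv"],
        check=True,
    )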

4. What is the recommended data replication topology? (R=Rack, N=Node)

    a. R1N1, R2N1, R2N2
    b. R1N1, R1N2, R2N1
    c. R1N1, R1N1, R1N3
    d. R1N1, R2N1, R1N2

Answer: a.
R1N1, R2N1, R2N2

Explanation: It is recommended that a block B1 is first written to node N1 on rack 1. A copy is then written to a node on a different rack (rack 2). The third and final copy is written to the same rack as the second copy (rack 2), but on a different node (N2, in this example).
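
One way to verify where HDFS actually placed a file's replicas is the fsck tool; a minimal sketch (the file path is hypothetical, and hdfs is assumed to be on the PATH):

    # Print the block report for a file, including the DataNodes that hold
    # each replica.
    import subprocess

    report = subprocess.run(
        ["hdfs", "fsck", "/user/hadoop/sales.csv",
         "-files", "-blocks", "-locations"],
        capture_output=True, text=True, check=True,
    )
    print(report.stdout)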

MapReduce practice questions



Set of questions & answers to test your introductory knowledge on MapReduce concepts

1. Which of the following could be an analogy for MapReduce?

    a. Feeding the pigeons
    b. People standing in queue for a bus
    c. People going out to collect donations
    d. Athletes running in a relay-race

Answer: c.
People going out to collect donations


Explanation: People going out to collect donations can collect money in parallel, like map tasks, and finally aggregate the collections in one place, like the reduce phase.
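
The analogy in toy Python form (the donation amounts are made up for illustration):

    # Each "volunteer" collects donations independently (map phase); the
    # partial totals are then aggregated in one place (reduce phase).
    from functools import reduce

    collections = [[5, 10, 2], [20, 1], [7, 7, 7]]  # hypothetical amounts

    partial_totals = [sum(amounts) for amounts in collections]  # map
    grand_total = reduce(lambda a, b: a + b, partial_totals)    # reduce

    print(grand_total)  # 59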

2. What is the sequence of a MapReduce Job?

    a. Map, input split, reduce
    b. Input split, map, reduce
    c. Map and then reduce
    d. Map, split, map

Answer: b.
Input split, map, reduce


Explanation: In MapReduce, the input data first needs to be split across multiple nodes for parallel processing, so 'b' is the correct answer.
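
A toy, pure-Python word count that mirrors the same sequence (the input lines stand in for input splits, and the sort stands in for the shuffle):

    # Input split -> map -> shuffle/sort -> reduce, in miniature.
    from itertools import groupby

    splits = ["the quick brown fox", "the lazy dog", "the end"]

    # Map phase: each split independently emits (word, 1) pairs.
    mapped = [(word, 1) for split in splits for word in split.split()]

    # Shuffle/sort: bring identical keys together.
    mapped.sort(key=lambda kv: kv[0])

    # Reduce phase: sum the counts for each word.
    counts = {key: sum(n for _, n in group)
              for key, group in groupby(mapped, key=lambda kv: kv[0])}

    print(counts["the"])  # 3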

3. Where is the Mapper Output (intermediate key-value data) stored?

    a. Local File System
    b. HDFS
    c. It is not stored
    d. Developer can specify location

Answer: a.
Local File System


Explanation: The mapper output (intermediate data) is stored on the local file system of each individual mapper node, not on HDFS. The Hadoop administrator can configure this location (for example, via the mapreduce.cluster.local.dir property).

4. What is speculative execution?

    a. Delaying execution of jobs by a specific interval
    b. Running multiple copies of the same task on different nodes
    c. Running MapReduce with zero reducers
    d. When combiners are required

Answer: b.
Running multiple copies of the same task on different nodes


Explanation: In large clusters, some machines may not be performing well. To mitigate this, speculative execution runs multiple copies of the same map or reduce task on different worker nodes; the results from the first copy to finish are used, and the remaining copies are killed.
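
Speculative execution can be toggled per job; a hedged sketch using Hadoop Streaming (the streaming jar path and the input/output paths are hypothetical, and hadoop is assumed to be on the PATH):

    # Enable speculative map tasks but not speculative reduce tasks for one
    # job. The -D generic options must precede the other arguments.
    import subprocess

    subprocess.run([
        "hadoop", "jar", "hadoop-streaming.jar",    # hypothetical jar path
        "-D", "mapreduce.map.speculative=true",
        "-D", "mapreduce.reduce.speculative=false",
        "-input", "/user/hadoop/in",
        "-output", "/user/hadoop/out",
        "-mapper", "cat",
        "-reducer", "wc",
    ], check=True)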