The Future of QA Engineering: Harnessing Spark Hadoop in Cloud Environments

Posts

Showing posts from July, 2024

Apache Airflow Tutorial

July 13, 2024

Apache Airflow is a tool for authoring, scheduling, and monitoring pipelines, which makes it a good fit for ETL and MLOps workflows.

Example use cases:
- Extracting data from many sources, aggregating and transforming it, and storing it in a data warehouse.
- Extracting insights from data and displaying them in an analytics dashboard.
- Training, validating, and deploying machine learning models.

Key components:
1. Webserver: Airflow’s user interface (UI), which lets you interact with Airflow without the need for a CLI or an API. From there you can trigger and monitor pipelines, create connections to external systems, inspect their datasets, and more.
2. Scheduler: The scheduler is responsible for executing tasks at the correct time, re-running pipelines, backfilling data, ensuring task completion, etc.
3. Executors: Executors are the mechanism by which pipelines run. There are many different types that run pipelines locally, in a sin...

Spark Read Modes & Details on Spark Session

July 13, 2024

Spark reads files in one of three modes:

Permissive: This is the default read mode in Spark. If Spark encounters a datatype mismatch while creating the DataFrame, it converts the mismatched value to NULL without affecting the rest of the results.
Drop Malformed: In this mode, malformed records are dropped and the remaining well-formed records are processed.
Fail Fast: The read errors out as soon as it encounters a malformed record.

It is therefore important to choose the mode that matches the business requirement.

Creation of Spark Session:
- Spark Session acts as the entry point to the Spark cluster. To run code on a Spark cluster, a Spark Session has to be created.
- In order to work with higher-level APIs like DataFrames and Spark SQL, ...
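To make the difference between the three read modes concrete, here is a minimal plain-Python sketch of their behavior on a tiny "name,age" input. It is not Spark code and does not reflect how Spark implements parsing: the `read_rows` helper and the sample data are hypothetical, and only the mode names `PERMISSIVE`, `DROPMALFORMED`, and `FAILFAST` are taken from Spark, where they are typically passed as `spark.read.option("mode", ...)`.

```python
def read_rows(lines, mode="PERMISSIVE"):
    """Parse 'name,age' lines; a non-integer age counts as malformed.

    Hypothetical helper that only imitates the three Spark read modes.
    """
    if mode not in ("PERMISSIVE", "DROPMALFORMED", "FAILFAST"):
        raise ValueError(f"unknown mode: {mode!r}")

    rows = []
    for line in lines:
        name, age_text = line.split(",", 1)
        try:
            rows.append((name, int(age_text)))
        except ValueError:
            if mode == "FAILFAST":
                # Abort the whole read on the first bad record.
                raise ValueError(f"malformed record: {line!r}") from None
            if mode == "PERMISSIVE":
                # Keep the record, but null out the value that failed to parse.
                rows.append((name, None))
            # DROPMALFORMED: skip the record entirely.
    return rows


sample = ["alice,30", "bob,abc", "carol,41"]  # "bob,abc" has a bad age

permissive = read_rows(sample, "PERMISSIVE")
dropped = read_rows(sample, "DROPMALFORMED")

print(permissive)
print(dropped)
```

Calling `read_rows(sample, "FAILFAST")` on the same input raises a `ValueError`, which mirrors the case where a single bad record should stop the job rather than be silently nulled or dropped.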

SQL questions asked in top product-based companies:

July 09, 2024

1. Imagine a table named “Movies” with columns MovieID, Title, ReleaseDate, and GenreID, and another table “Genres” with columns GenreID and GenreName. Write a SQL query to fetch the genres that don’t have any movies associated with them.
SELECT g.GenreName FROM Genres g LEFT JOIN Movies m ON g.GenreID = m.GenreID WHERE m.MovieID IS NULL;
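The left-join-and-filter-on-NULL pattern for this question can be tried out with Python's built-in `sqlite3` module. The sketch below uses the table and column names from the question; the sample rows (genre names, movie titles, dates) are made up purely for illustration.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Genres (GenreID INTEGER PRIMARY KEY, GenreName TEXT);
    CREATE TABLE Movies (
        MovieID INTEGER PRIMARY KEY,
        Title TEXT,
        ReleaseDate TEXT,
        GenreID INTEGER
    );
    -- Illustrative sample data: 'Western' has no movies.
    INSERT INTO Genres VALUES (1, 'Drama'), (2, 'Comedy'), (3, 'Western');
    INSERT INTO Movies VALUES
        (10, 'Movie A', '2020-01-01', 1),
        (11, 'Movie B', '2021-06-15', 1),
        (12, 'Movie C', '2022-03-09', 2);
""")

# LEFT JOIN keeps every genre; genres without a matching movie get NULLs
# in the Movies columns, so filtering on m.MovieID IS NULL isolates them.
query = """
    SELECT g.GenreName
    FROM Genres g
    LEFT JOIN Movies m ON g.GenreID = m.GenreID
    WHERE m.MovieID IS NULL
"""
empty_genres = [row[0] for row in conn.execute(query)]
print(empty_genres)

conn.close()
```

The filter is applied to a column from the right-hand table (`m.MovieID`), since that is the side that comes back as NULL when no match exists.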