Posts

Showing posts from July, 2024

Apache Airflow Tutorial

Apache Airflow is a tool for authoring, scheduling, and monitoring pipelines, which makes it a good fit for ETL and MLOps pipelines.

Example use cases:
- Extracting data from many sources, aggregating and transforming it, and storing it in a data warehouse.
- Extracting insights from data and displaying them in an analytics dashboard.
- Training, validating, and deploying machine learning models.

Key components:
1. Webserver: The webserver is Airflow's user interface (UI), which lets you interact with Airflow without needing a CLI or an API. From there you can execute and monitor pipelines, create connections to external systems, inspect their datasets, and more.
2. Scheduler: The scheduler is responsible for executing tasks at the correct time, re-running pipelines, backfilling data, ensuring task completion, etc.
3. Executors: Executors are the mechanism by which pipelines run. There are many different types that run pipelines locally, in a sin...

Spark Read Modes & Details on Spark Session

Spark reads files in one of three modes:
- Permissive: This is the default read mode in Spark. If a value does not match the column's datatype while the dataframe is being created, Spark converts that value to NULL without affecting the rest of the results.
- Drop Malformed: Any malformed records are dropped, and the remaining well-formed records are processed.
- Fail Fast: The read errors out as soon as it encounters a malformed record.

It is therefore important to choose the read mode based on the business requirement.

Creation of Spark Session:
- Spark Session acts as an entry point to the Spark Cluster. To run code on a Spark Cluster, a Spark Session has to be created.
- In order to work with Higher Level APIs like Dataframes and Spark SQL,     ...
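
Returning to the three read modes described above, their differences can be illustrated with a small plain-Python sketch. This is not Spark code and not how Spark implements parsing. The function `read_rows`, the sample schema, and the sample rows are hypothetical names and data made up for illustration. The mode strings mirror the values Spark's reader accepts for its `mode` option (`PERMISSIVE`, `DROPMALFORMED`, `FAILFAST`).

```python
def read_rows(raw_rows, schema, mode="PERMISSIVE"):
    """Parse raw string rows against a schema of (column_name, cast) pairs.

    Hypothetical helper that only mimics the behaviour of the three read modes.
    """
    if mode not in ("PERMISSIVE", "DROPMALFORMED", "FAILFAST"):
        raise ValueError(f"Unknown mode: {mode}")

    parsed = []
    for raw in raw_rows:
        row = {}
        malformed = False
        for (name, cast), value in zip(schema, raw):
            try:
                row[name] = cast(value)
            except ValueError:
                # Datatype mismatch: the value becomes None (NULL)
                malformed = True
                row[name] = None
        if malformed and mode == "FAILFAST":
            # Fail Fast: stop at the first malformed record
            raise ValueError(f"Malformed record: {raw!r}")
        if malformed and mode == "DROPMALFORMED":
            # Drop Malformed: skip the record, keep processing the rest
            continue
        # Permissive (or a well-formed row): keep the row, NULLs included
        parsed.append(row)
    return parsed


# Made-up sample data: the second row has a non-numeric amount
SCHEMA = [("id", int), ("amount", float)]
ROWS = [("1", "10.5"), ("2", "abc"), ("3", "7")]

if __name__ == "__main__":
    print(read_rows(ROWS, SCHEMA, "PERMISSIVE"))
    print(read_rows(ROWS, SCHEMA, "DROPMALFORMED"))
```

In PySpark itself, the mode is chosen on the reader, for example with `spark.read.option("mode", "DROPMALFORMED")`, so no hand-written parsing like the above is needed.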

SQL questions asked in top product-based companies:

1. Imagine a table named “Movies” with columns: MovieID, Title, ReleaseDate, GenreID. There’s another table “Genres” with columns: GenreID, GenreName. Write a SQL query to fetch the genres that don’t have any movies associated with them.

SELECT g.GenreID, g.GenreName
FROM Genres g
LEFT JOIN Movies m ON g.GenreID = m.GenreID
WHERE m.MovieID IS NULL;
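
To try the left-join-and-filter-on-NULL pattern on sample data, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names follow the question. The helper names `build_sample_db` and `genres_without_movies`, and all sample rows, are made up for illustration.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory database with made-up Genres and Movies rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Genres (GenreID INTEGER PRIMARY KEY, GenreName TEXT);
        CREATE TABLE Movies (
            MovieID INTEGER PRIMARY KEY,
            Title TEXT,
            ReleaseDate TEXT,
            GenreID INTEGER
        );
        INSERT INTO Genres VALUES (1, 'Drama'), (2, 'Comedy'), (3, 'Horror');
        INSERT INTO Movies VALUES
            (10, 'Sample Movie A', '2020-01-01', 1),
            (11, 'Sample Movie B', '2021-06-15', 1),
            (12, 'Sample Movie C', '2022-03-09', 2);
        """
    )
    return conn


def genres_without_movies(conn):
    """Return (GenreID, GenreName) for genres that no movie references."""
    query = """
        SELECT g.GenreID, g.GenreName
        FROM Genres g
        LEFT JOIN Movies m ON g.GenreID = m.GenreID
        WHERE m.MovieID IS NULL   -- no matching movie row was found
        ORDER BY g.GenreID
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(genres_without_movies(build_sample_db()))  # [(3, 'Horror')]
```

The LEFT JOIN keeps every genre, and genres with no matching movie get NULL in all of the Movies columns, so filtering on `m.MovieID IS NULL` leaves only the unused genres.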