The Future of QA Engineering: Harnessing Spark Hadoop in Cloud Environments


Showing posts from June, 2024

SQL Questions - Ready for Interview

June 26, 2024

1. Use case for each of the functions RANK, DENSE_RANK and ROW_NUMBER

ROW_NUMBER: the ROW_NUMBER() function assigns a unique sequential integer to every row returned by a query. When PARTITION BY is used, the numbering restarts at 1 in each partition.

Syntax: ROW_NUMBER() OVER( [PARTITION BY column_1, column_2, …] [ORDER BY column_3, column_4, …] )

Let’s analyze the above syntax: The set of rows on which the ROW_NUMBER() function operates is called a window. The PARTITION BY clause divides the result set into partitions. The ORDER BY clause inside the OVER clause sets the order in which row numbers are assigned within each partition; it does not by itself control the order in which the final result is displayed.

Query Format: SELECT mammal_id, mammal_name, animal_id, ROW_NUMBER() OVER (ORDER BY mammal_name) FROM Mammals;

RANK: The RANK() function assigns a rank to every row within a partition of a result set. For each partition, the rank of the first row is 1. The RANK() ...
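To make the difference between the three functions concrete, here is a minimal sketch that runs them side by side using Python's built-in sqlite3 module. It assumes an SQLite build with window-function support (3.25 or later). The weight_kg column and the sample rows are hypothetical, added only to create a tie; they are not part of the Mammals table from the post.

```python
import sqlite3

# In-memory database; the table layout and the rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Mammals (mammal_id INTEGER, mammal_name TEXT, weight_kg INTEGER)"
)
conn.executemany(
    "INSERT INTO Mammals VALUES (?, ?, ?)",
    [(1, "Bat", 1), (2, "Cat", 4), (3, "Fox", 4), (4, "Wolf", 40)],
)

# Cat and Fox tie on weight_kg, which is where the three functions diverge.
# ROW_NUMBER gets a tie-breaker (mammal_name) so its output is deterministic.
rows = conn.execute(
    """
    SELECT mammal_name,
           ROW_NUMBER() OVER (ORDER BY weight_kg, mammal_name) AS row_num,
           RANK()       OVER (ORDER BY weight_kg)              AS rnk,
           DENSE_RANK() OVER (ORDER BY weight_kg)              AS dense_rnk
    FROM Mammals
    ORDER BY weight_kg, mammal_name
    """
).fetchall()

for row in rows:
    print(row)
```

With this data, ROW_NUMBER gives the tied rows different numbers, RANK gives them the same rank and then skips a value, and DENSE_RANK gives them the same rank without leaving a gap.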

Spark Technique

June 25, 2024

The number of stages depends on the wide transformations: as a rule of thumb, the number of stages in a job = number of wide transformations + 1. The number of tasks in a stage = the number of partitions.

Repartition vs Coalesce: repartition can both increase and decrease the number of partitions in the RDD, while coalesce can only decrease the number of partitions. A full shuffle is involved in the case of repartition, while in the case of coalesce a full shuffle is not performed. Instead, coalesce reduces the partitions in a much more efficient manner: for example, two partitions on one node are combined to form a single partition. When you have to increase the number of partitions you should use repartition. This will increase the parallelism.
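The rules above can be sketched as a toy model in plain Python. This is not the Spark API: count_stages, coalesce, and repartition here are hypothetical helper functions operating on lists of lists, and the set of wide transformations is a small illustrative sample. The sketch only mirrors the behaviour described in the text: coalesce merges whole neighbouring partitions, repartition redistributes every record.

```python
from typing import List, Set

# Illustrative sample of wide (shuffle-producing) transformations; not exhaustive.
WIDE: Set[str] = {"groupByKey", "reduceByKey", "join", "repartition"}


def count_stages(transformations: List[str]) -> int:
    """Rule of thumb: stages = number of wide transformations + 1."""
    return sum(1 for t in transformations if t in WIDE) + 1


def coalesce(partitions: List[list], n: int) -> List[list]:
    """Toy coalesce: merge whole neighbouring partitions, never split one."""
    if n >= len(partitions):
        return partitions  # coalesce cannot increase the partition count
    merged: List[list] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions: List[list], n: int) -> List[list]:
    """Toy repartition: full shuffle, every record is redistributed round-robin."""
    records = [r for part in partitions for r in part]
    out: List[list] = [[] for _ in range(n)]
    for i, record in enumerate(records):
        out[i % n].append(record)
    return out


parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(count_stages(["map", "filter", "groupByKey", "map", "reduceByKey"]))
print(coalesce(parts, 2))          # neighbouring partitions merged intact
print(len(coalesce(parts, 8)))     # count stays the same, no increase
print(len(repartition(parts, 8)))  # more partitions, hence more tasks per stage
```

In real Spark code the corresponding calls are rdd.repartition(n) and rdd.coalesce(n); the model only illustrates why the first costs a shuffle and the second does not.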