How to - Data Analysis With SQL

A continuation on our earlier SQL tutorial while linking it with Data Analysis

·

7 min read

Are you a Data Enthusiast? Do you want to learn more about Data Analysis & SQL?

Then you have come to the right place! In this blog you will be introduced to practical Data Analysis using SQL, and we will get to dive deeper into SQL and continue our previous SQL tutorial, If you haven't already then please go check How I Met Your Database -& Mainly Relational Using SQL-


Northwind Database

In this blog, we will be working with Northwind Database, which is a sample database created by Microsoft to practice fictional data in a company environment.

The database consists of the following tables:

  • Suppliers: Suppliers and vendors of Northwind

  • Customers: Customers who buy products from Northwind

  • Employees: Employee details of Northwind traders

  • Products: Product information

  • Shippers: The names and contact information for the shippers who transport the goods from the traders to the end-users.

  • Orders and Order_Details: Customers' and the companies' sales order interactions.

To have a broader idea of the database structure, we can check the below Entity Relationship Diagram -ERD-:

Northwind ERD.png *Image credits goes to: yugabyte *


The Power Of SQL Queries

You now have a database that is completely ready for you to explore, so let's start by utilizing SQL and asking ourselves, what insights can we gain from our Database ?

Highest Performing Employees

Let's say that we measure our performance goals solely on the volume of orders, regardless of their monetary value.

We can use our SQL knowledge to know who the highest performing employees based on number of orders.

Based on the ERD above, the reference that links - Foreign Key - between the Orders & the Employees Tables is the Employee ID.

Foreign Keys

A FOREIGN KEY is a field (or set of fields) in one table that points to a PRIMARY KEY in another table. The child table is the table with the foreign key, and the referred or parent table is the table with the primary key.

SQL Query

The Query to get the volume of orders and the employee responsible looks like this:

SELECT E.EmployeeID As Employee_ID, 
COUNT(O.EmployeeID) AS Order_Count, 
CONCAT(E.FirstName, ' ' ,E.LastName) As Full_Name
FROM Orders as O
inner join Employees as E on O.EmployeeID = E.EmployeeID
group by E.EmployeeID, E.FirstName, E.LastName;

Lets dissect the query and understand what each part does, starting with GROUP BY

GROUP BY

The GROUP BY statement is used to groups rows with the same values into summary rows.

To group the result set by one or more columns, the GROUP BY statement is frequently used with aggregate functions (COUNT(), MAX(), MIN(), SUM(), AVG()).

In the above query however we can see that we grouped by The Employee's ID, First Name & Last Name.

Aliases

SQL aliases are used to give a table, or a column in a table, a temporary name, they are often used to make column names more readable.

An alias only exists for the duration of that query and is created with the AS keyword.

Inner Join

The INNER JOIN keyword picks records in both tables that have the same value. You can think about Inner Join in the term of the intersection between two tables

Inner Join Two Tables

If the columns in both tables match, the INNER JOIN keyword selects all rows from both tables. These orders will not be displayed if there are records in the "Orders" table that do not have matches in the "Employees" table!

There are a lot of table join types, we will discuss them as the need arises but you can check the below Venn Diagram as it will provide you with a brief explanation for the use-case of each join type:

SQL_Joins.jpg

Query Output

Our executed query will yield the following result -as shown below- and this provided us with insight on which employees created the most orders for the customers.

Employee_ID    Order_Count     Full_Name
    1              123        Nancy Davolio
    2               96        Andrew Fuller
    3              127        Janet Leverling
    4              156        Margaret Peacock
    5               42        Steven Buchanan
    6               67        Michael Suyama
    7               72        Robert King
    8              104        Laura Callahan
    9               43        Anne Dodsworth

But this is still missing something fundamental, did you notice it?

If your answer was their order then you are absolutely correct!, our query didn't order the elements based on volume of orders, it is still missing the ORDER BY clause.

ORDER BY

As the name suggests, if we want to sort the result set in ascending or descending order, use the ORDER BY keyword.

By default, the ORDER BY keyword organizes the records in ascending order. Use the DESC keyword to sort the records in descending order.

For example, In our query, we should add the order by clause after the group by like this:

ORDER BY COUNT(O.EmployeeID) DESC;

OR

-- Since we defined Order_Count as an Alias for COUNT(O.EmployeeID)
-- we can use them interchangeably
order by Order_Count desc;

And that yeilds the result set organized as below:

Employee_ID    Order_Count     Full_Name
    4              156        Margaret Peacock
    3              127        Janet Leverling
    1              123        Nancy Davolio
    8              104        Laura Callahan
    2               96        Andrew Fuller
    7               72        Robert King
    6               67        Michael Suyama
    9               43        Anne Dodsworth
    5               42        Steven Buchanan

How have discounts affected the number of items sold?

It is a known fact that we all LOVE sales, no matter what the sale is on, we lways manage to find something that we like and buy something new or an even larger quantity of something that we use on a regular basis.

But, this is not a scientific way to prove it, so how can we ?

We can start by getting the number of items sold with a sale and those sold without while comparing them with the number of orders made.

Let's see how we can achieve that with SQL.

SQL Queries

We will be keeping the Queries simple and using only the basic syntax, but I strongly advice you to start reading about Left Join and joins in general and to start experimenting by yourself on various databases and datasets like The Famous Titanic Dataset on Kaggle

The below queries check for the total number of items purchased and the number of orders made.

-- For Items With Discounts:

SELECT SUM(OD.Quantity) AS Total_Items_Sold, Count(OD.OrderID)
FROM [Order Details] as OD
WHERE OD.Discount > 0;



-- For Items Without Discounts:

SELECT SUM(OD.Quantity) AS Total_Items_Sold, Count(OD.OrderID)
FROM [Order Details] as OD
WHERE OD.Discount = 0;

For the Results:

  • Items With Discounts: A total of 22718 items were sold in 838 orders.

    Total_Items_Sold    Number_Of_Orders
          22718               838
    
  • Items Without Discounts: A total of 28599 items in 1317 orders

    Total_Items_Sold    Number_Of_Orders
          28599               1317
    

If we calculate the average number of items per order, We will find that clearly the number of items purchased increases when a sale is present.

What discount percentage should the company offer?

Now that we have seen that discounts affect the volume of our sales, we have to determine the best percentage that will help us maximize our profit & market share.

Let's check the increments of 5% discounts in our database and see what our data is trying to tell us.

SQL QUERY

SELECT SUM(OD.Quantity) AS Total_Items_Sold, Count(OD.OrderID)AS Number_Of_Orders, OD.Discount AS Discount_Percentage
FROM [Order Details] as OD
WHERE OD.Discount = 0.05 or OD.Discount =0.1 or OD.Discount =0.2 or OD.Discount =0.25
GROUP BY OD.Discount
ORDER BY OD.Discount DESC;

This shows the following results, which surprisingly show a very little change in the order sizes and numbers is an indicator for the Northwind company that if the offer a discount as low as 5% , they will notice an increase in sales with no need to decrease their revenue any further.

Total_Items_Sold    Number_Of_Orders    Discount_Percentage
      4349                154                  0.25
      4351                161                  0.2
      4366                173                  0.1
      5182                185                  0.05

We can take this even a step further and see the effect of discounts on different product categories or If discounts affect certain regions more than others, but I will leave that for you to answer and test your skills.


This brings us to the end of today's blog about using SQL for Data Analysis, In the upcoming blog we will be getting to know Python and later on we will create a mini project with full code examples to handle the Northwind Database and other various datasets.

Hope you enjoyed reading this blog as much as I enjoyed writing it. If you have any comments or suggestions, please feel free to reach out ❤