A Beginner’s Guide to EDA: Country Sales Data

Oct 12, 2021

Without data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway. – Geoffrey Moore

The amount of data generated in real time is immense. This has created oceans of data from which companies can derive real business value and make better business decisions. In today’s world, data analysis is central to any business, and in this blog I will show you how data analysis works.

Introduction:

EDA (Exploratory Data Analysis) is an approach to analyzing datasets to summarize their main characteristics. It is used to understand the data, get some context for it, and understand the variables and the relationships between them.

In EDA, data visualization techniques are used to draw meaningful patterns and insights. Based on the results of EDA, companies make business decisions. For example, an e-commerce company might be interested in analyzing customer attributes in order to display targeted ads for improving sales. Data analysis can be applied to almost any aspect of a business if one understands the tools available to process information.

In this blog, I will analyse a dataset from online retail. It is a transnational dataset that contains all the transactions occurring between 01-DEC-2010 and 09-DEC-2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers. The attributes of the dataset are InvoiceNo, StockCode, Description (item name), Quantity, InvoiceDate, UnitPrice, CustomerID, and Country.

The main goal of analysing this dataset is to find insights such as: In which month are sales highest? Which product sells the most? Which country buys the most? Does the unit price of a product affect the quantity sold? The answers can help increase item sales and revenue. Here, I will walk through the data analysis step by step. So, let’s start…

1. Load the dataset: In this step, we load the dataset and find the basic statistical details for each attribute, as you can see in the following figures.

**Fig-2 Find out the info and description of the dataset**
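A minimal sketch of this step with pandas might look like the following. The rows are toy values made up for illustration (only the column names follow the dataset), and they are read from an in-memory string so the example runs on its own; with the real download you would point a pandas reader at the file instead.

```python
import io
import pandas as pd

# Toy rows standing in for the real file; the values are made up for illustration.
csv_text = """InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
A0001,S1,GIFT BOX,6,2010-12-01 08:26:00,2.55,17850,United Kingdom
A0002,S2,CANDLE,-2,2010-12-01 09:10:00,1.25,13047,United Kingdom
A0003,S3,MUG,12,2010-12-02 10:00:00,3.39,,France
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["InvoiceDate"])

# Column dtypes and non-null counts.
df.info()

# Count, mean, std, min, quartiles, and max for the numeric columns.
summary = df[["Quantity", "UnitPrice", "CustomerID"]].describe()
print(summary)
```

The `min` row of the summary is where negative quantities or prices would show up first.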

Before going further, one question that may come to mind is which attributes help the analysis, and how. It depends on our goal and our approach. If the goal is clear, we can easily choose the important attributes. In this problem, I want to find the total revenue by day, week, and month, along with details of the best-selling products, so the attributes that help are Description (product name), Quantity, UnitPrice, InvoiceDate, and Country.

As you can see in the description, this dataset has negative values for Quantity and UnitPrice, which represent items that were returned. We do not need these negative values for our analysis; if we kept them, they would act as outliers. So, I have removed the negative values from the dataset. The attribute CustomerID is a unique ID for each customer; because it is stored as a number, the pandas describe() function includes it and shows basic statistics for it, even though those statistics carry no meaning for an identifier.

2. Data Cleaning: After loading the data, the second step is cleaning the dataset. The main aim of data cleaning is to identify and remove errors and duplicate data in order to create a reliable dataset. This improves the quality of the data used for analysis and enables accurate decision-making. In this step, we find the missing values and then remove the rows that contain them. There is no single best way to deal with missing data in EDA; different imputation methods are available depending on the kind of problem.

**Fig-3 Find null values and remove them**
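As a rough sketch of this step, assuming pandas and a few toy rows made up for illustration, the missing values can be counted per column and the affected rows dropped like this:

```python
import numpy as np
import pandas as pd

# Toy rows, made up for illustration; column names follow the dataset.
df = pd.DataFrame({
    "Description": ["GIFT BOX", None, "MUG", "CANDLE"],
    "Quantity": [6, 3, 12, 4],
    "CustomerID": [17850.0, 13047.0, np.nan, 12583.0],
})

# Number of missing values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Drop every row that has at least one missing value.
clean = df.dropna()
print(len(df), "rows before,", len(clean), "rows after")
```

Dropping rows is the simplest choice; whether it is acceptable depends on how many rows are lost.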

Now, we have to derive some more useful attributes: the date from InvoiceDate (a timestamp) and the total amount from Quantity and UnitPrice. These attributes help us understand the dataset better and also help with visualization. Along with this, I keep only the rows with positive values, as shown in Fig-4.

**Fig-4 Adding new attributes Date and TotalAmount**
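The derivation can be sketched as follows, assuming pandas. The rows are toy values made up for illustration; `Date` and `TotalAmount` are the new attribute names from the figure caption.

```python
import pandas as pd

# Toy rows, made up for illustration; column names follow the dataset.
df = pd.DataFrame({
    "Quantity": [6, -2, 12],
    "UnitPrice": [2.55, 1.25, 3.39],
    "InvoiceDate": pd.to_datetime(
        ["2010-12-01 08:26:00", "2010-12-01 09:10:00", "2010-12-02 10:00:00"]
    ),
})

# Keep only rows where both quantity and unit price are positive.
df = df[(df["Quantity"] > 0) & (df["UnitPrice"] > 0)].copy()

# Calendar date without the time part.
df["Date"] = df["InvoiceDate"].dt.date

# Revenue of each line item.
df["TotalAmount"] = df["Quantity"] * df["UnitPrice"]
print(df)
```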

3. EDA (Exploratory Data Analysis): Now, the last step is to perform EDA and find the insights in the dataset. For data visualization, I am using Plotly because it has several advantages over matplotlib. One of the main advantages is that only a few lines of code are needed to create interactive and more elaborate plots. It can also handle geographical, scientific, statistical, and financial data. It saves time when initially exploring a dataset and makes it easy to modify and export plots.

First, I plotted a histogram to find out how many customers belong to each country and found that 89% of customers belong to the UK, since this data is UK-oriented.
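The numbers behind such a plot can be computed with pandas before handing them to a plotting library. This is only a sketch on toy rows made up for illustration, and it assumes that "customers" means unique CustomerID values; counting transaction rows per country instead would give a different percentage.

```python
import pandas as pd

# Toy rows, made up for illustration; column names follow the dataset.
df = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 4],
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom",
                "France", "United Kingdom"],
})

# Number of distinct customers in each country, largest first.
customers_per_country = (
    df.groupby("Country")["CustomerID"].nunique().sort_values(ascending=False)
)

# Share of all customers, in percent.
share_pct = 100 * customers_per_country / customers_per_country.sum()
print(share_pct.round(1))
```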

Then, I wanted to find out which item sold the most. For that, I plotted a bar graph of the top 20 products by quantity sold and found that the best-selling item is Paper Craft, Little Birdie. Little Birdie gives a 3D look to craft projects and can be used with cards, scrapbooks, altered art, mixed media, key rings, jewelry, miniature setups, hair accessories, mobile accessories, collage clay, decorations, or any kind of paper craft.
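A possible way to get the ranking behind that bar graph, assuming pandas and toy rows made up for illustration, is to sum the quantity per product and keep the 20 largest totals:

```python
import pandas as pd

# Toy rows, made up for illustration; column names follow the dataset.
df = pd.DataFrame({
    "Description": ["GIFT BOX", "MUG", "GIFT BOX", "CANDLE", "MUG", "GIFT BOX"],
    "Quantity": [6, 12, 4, 3, 1, 10],
})

# Total quantity sold per product, keeping at most the 20 largest totals.
top_products = df.groupby("Description")["Quantity"].sum().nlargest(20)
print(top_products)
```

The index of `top_products` holds the product names and its values hold the totals, which is the shape a bar chart needs.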

Then, I was very excited to find out in which month the retailer had maximum sales. To find out, I plotted a bar graph of total amount by month and found that sales were highest in October and November. Why these two months? The dataset starts in December 2010, so both fall in 2011, and a likely reason is festival shopping: Halloween and Diwali in October, and Bonfire Night, Black Friday, and Cyber Monday in November. Why not December? The dataset contains sales only up to the 9th of December, so we cannot get a clear picture of December sales.

Since revenue was higher in October and November, it suggests that the number of customers and the quantity sold were also higher in these months.
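The monthly totals can be sketched with pandas as below. The rows are toy values made up for illustration, and `Month` is a helper column introduced only for this example.

```python
import pandas as pd

# Toy rows, made up for illustration.
df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2011-10-03", "2011-10-20", "2011-11-05", "2011-12-01"]
    ),
    "TotalAmount": [100.0, 50.0, 200.0, 30.0],
})

# Month name of each invoice (helper column for this example).
df["Month"] = df["InvoiceDate"].dt.month_name()

# Total revenue per month, largest first.
monthly_revenue = (
    df.groupby("Month")["TotalAmount"].sum().sort_values(ascending=False)
)
print(monthly_revenue)
```

Because the dataset covers part of December in both 2010 and 2011, grouping by month name merges the two Decembers; grouping by `df["InvoiceDate"].dt.to_period("M")` instead would keep them apart.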

After this, I checked whether the price of a product has any impact on the quantity sold. I plotted a histogram of quantity against unit price and found that lower-priced items sold more than higher-priced items.

**Fig-8 Maximum Quantity sold w.r.t unit prices**
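One way to check this relationship numerically, assuming pandas, is to group unit prices into bands and sum the quantity sold in each band. The rows and the band edges below are hypothetical, chosen only for this example.

```python
import pandas as pd

# Toy rows, made up for illustration.
df = pd.DataFrame({
    "UnitPrice": [0.5, 1.2, 2.0, 4.5, 8.0, 15.0],
    "Quantity": [100, 80, 60, 10, 5, 1],
})

# Hypothetical price bands: (0, 2], (2, 5], (5, 10], (10, 20].
bins = [0, 2, 5, 10, 20]
df["PriceBand"] = pd.cut(df["UnitPrice"], bins=bins)

# Total quantity sold within each price band.
qty_by_band = df.groupby("PriceBand", observed=False)["Quantity"].sum()
print(qty_by_band)
```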

Next, I found out which specific week had maximum sales: it was the week from 5-Dec-2011 to 11-Dec-2011 (the data for that week ends on 9-Dec-2011).
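A weekly total like this can be sketched with a pandas resample. The rows are toy values made up for illustration, and the Monday-to-Sunday week definition is an assumption of this example.

```python
import pandas as pd

# Toy rows, made up for illustration.
df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2011-11-28", "2011-12-01", "2011-12-05", "2011-12-09"]
    ),
    "TotalAmount": [40.0, 60.0, 150.0, 90.0],
})

# Weekly bins labelled by their Sunday, so each bin covers Monday to Sunday.
weekly = df.set_index("InvoiceDate")["TotalAmount"].resample("W-SUN").sum()

# The best week is identified by its closing Sunday; step back six days for Monday.
best_week_end = weekly.idxmax()
best_week_start = best_week_end - pd.Timedelta(days=6)
print(best_week_start.date(), "to", best_week_end.date(), weekly.max())
```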

After this, I found the weekday on which sales were highest. It was Thursday, possibly because many festivals fell on a Friday and customers bought their gifts the day before.
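The weekday comparison follows the same pattern. This is a sketch on toy rows made up for illustration, with `Weekday` as a helper column introduced for the example.

```python
import pandas as pd

# Toy rows, made up for illustration.
df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2011-12-05", "2011-12-08", "2011-12-01", "2011-12-09"]
    ),
    "TotalAmount": [50.0, 120.0, 80.0, 60.0],
})

# Day name of each invoice (helper column for this example).
df["Weekday"] = df["InvoiceDate"].dt.day_name()

# Total revenue per weekday, largest first.
by_weekday = (
    df.groupby("Weekday")["TotalAmount"].sum().sort_values(ascending=False)
)
print(by_weekday)
```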

Then, I found the top 10 countries by sales. Since this is a UK-oriented dataset, it is no surprise that the UK had the highest sales.

**Fig-11 Top-10 Best countries by sales**

Then, the 10 worst countries by sales include countries that are very far away from the UK, such as Saudi Arabia.

**Fig-12 Top-10 Worst countries by sales**
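Both country rankings can come from the same aggregation. The sketch below assumes pandas and uses toy rows made up for illustration; with so few countries, the "top 10" and "bottom 10" simply contain all of them.

```python
import pandas as pd

# Toy rows, made up for illustration.
df = pd.DataFrame({
    "Country": ["United Kingdom", "France", "United Kingdom",
                "Germany", "Saudi Arabia"],
    "TotalAmount": [500.0, 120.0, 300.0, 90.0, 10.0],
})

# Total sales per country.
sales_by_country = df.groupby("Country")["TotalAmount"].sum()

# Best and worst countries by sales (at most 10 each).
top10 = sales_by_country.nlargest(10)
bottom10 = sales_by_country.nsmallest(10)
print(top10)
print(bottom10)
```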

So, what do we get from this EDA?
1. The retailer should increase the supply of Paper Craft, Little Birdie, since it is the best-selling product.
2. The retailer should increase the supply of items in October and November.
3. Lower-priced items sold more than higher-priced items, so the company should focus more on providing affordable gift items.
4. Sales of gift items are higher on the day before a festival.

You can see my work through the links below:

GitHub Repository Link: https://github.com/yash2arma/EDA

Dataset Link: https://archive.ics.uci.edu/ml/datasets/online+retail#

Author: Yash Sharma is a member of the HEXABERRY DATA SCIENCE COMMUNITY.