Linear regression with machine learning (scikit-learn), Part 1

Testing a machine learning algorithm for regression on a dataset populated with fake data.

Task:

You are the owner of an imaginary shop that sells goods online through two channels: a mobile app and a website. You are given a dataset with customer information. Based on this information, we should decide whether it is better to invest more in improving the website or the mobile app in order to drive sales.

In this section we inspect the data to get a feel for it; the next section deals with machine learning.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# load fake customer data
customers = pd.read_csv('Ecommerce Customers')
customers.head()

customers.describe()

customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
Email                   500 non-null object
Address                 500 non-null object
Avatar                  500 non-null object
Avg. Session Length     500 non-null float64
Time on App             500 non-null float64
Time on Website         500 non-null float64
Length of Membership    500 non-null float64
Yearly Amount Spent     500 non-null float64
dtypes: float64(5), object(3)
memory usage: 31.3+ KB

Time spent on the website and the amount of money spent yearly do not correlate much.

sns.jointplot(data=customers, x="Time on Website", y="Yearly Amount Spent", kind="hex")

<seaborn.axisgrid.JointGrid at 0x7f9e63fc7ba8>

On the other hand, the time spent on the app correlates more strongly with the yearly amount spent: the more time customers spend in the app, the higher the revenue of our e-shop tends to be.

sns.jointplot(data=customers, x="Time on App", y="Yearly Amount Spent", kind="hex")

<seaborn.axisgrid.JointGrid at 0x7f9e63e6e390>
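
How strongly two columns "correlate" can be expressed as a single number, the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it in plain Python to show what is being measured. The function name pearson_r and the toy numbers are made up for illustration and are not taken from the customer file.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (hypothetical, not from the dataset)
app_minutes = [10.0, 11.0, 12.0, 13.0, 14.0]
spent = [400.0, 450.0, 480.0, 540.0, 600.0]
print(round(pearson_r(app_minutes, spent), 3))
```

On the real data, customers["Time on App"].corr(customers["Yearly Amount Spent"]) computes the same quantity, since Pearson is the default method of pandas Series.corr.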

Let's see if there are any other interesting correlations. The pair plot shows that the length of membership is strongly correlated with the amount spent yearly.

sns.pairplot(customers)

<seaborn.axisgrid.PairGrid at 0x7f9e63a93a58>
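
A pair plot is read by eye; a numeric ranking of the same relationships can serve as a cross-check. The sketch below is one possible way to do that with pandas. The helper name rank_correlations and the small stand-in frame are assumptions for illustration; the values are invented and only the column names mirror the dataset.

```python
import pandas as pd

def rank_correlations(df, target):
    """Pearson correlation of every numeric column with `target`, strongest first."""
    # Drop text columns such as Email or Address before correlating
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute strength so strong negative correlations are not buried
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Tiny stand-in frame (hypothetical values)
demo = pd.DataFrame({
    "Email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com", "e@x.com"],
    "Time on Website": [37.0, 36.5, 38.0, 36.0, 37.5],
    "Length of Membership": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Yearly Amount Spent": [300.0, 380.0, 450.0, 540.0, 610.0],
})
print(rank_correlations(demo, "Yearly Amount Spent"))
```

With the real frame loaded, the call would be rank_correlations(customers, "Yearly Amount Spent").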

sns.lmplot(data=customers, x="Length of Membership", y="Yearly Amount Spent")

<seaborn.axisgrid.FacetGrid at 0x7f9e623c6fd0>

The linear fit has a rather narrow confidence band (the shaded area around the line), which suggests that a linear model describes this relationship well.
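
To put a number on how well a straight line fits, one option is the coefficient of determination R^2, the share of variation in y that the line explains. The sketch below fits a least-squares line with numpy.polyfit. The helper name fit_line and the data are assumptions for illustration; the numbers are invented and do not come from the customer file.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares line y ~ slope * x + intercept, plus R^2 of the fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # polyfit returns coefficients from highest degree to lowest
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)        # variation left unexplained by the line
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in y
    return slope, intercept, 1.0 - ss_res / ss_tot

# Hypothetical values for illustration
membership_years = [1.0, 2.0, 3.0, 4.0, 5.0]
spent = [300.0, 380.0, 450.0, 540.0, 610.0]
slope, intercept, r2 = fit_line(membership_years, spent)
print(f"slope={slope:.1f}, intercept={intercept:.1f}, R^2={r2:.3f}")
```

An R^2 close to 1 means the line explains most of the variation. For a single predictor it equals the square of the Pearson correlation coefficient.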

Source:

Code snippets are from Jose Portilla's Udemy course "Python for Data Science and Machine Learning Bootcamp": https://www.udemy.com/python-for-data-science-and-machine-learning-bootcamp/

Output of customers.head():

	Email	Address	Avatar	Avg. Session Length	Time on App	Time on Website	Length of Membership	Yearly Amount Spent
0	mstephenson@fernandez.com	835 Frank Tunnel\nWrightmouth, MI 82180-9605	Violet	34.497268	12.655651	39.577668	4.082621	587.951054
1	hduke@hotmail.com	4547 Archer Common\nDiazchester, CA 06566-8576	DarkGreen	31.926272	11.109461	37.268959	2.664034	392.204933
2	pallen@yahoo.com	24645 Valerie Unions Suite 582\nCobbborough, D...	Bisque	33.000915	11.330278	37.110597	4.104543	487.547505
3	riverarebecca@gmail.com	1414 David Throughway\nPort Jason, OH 22070-1220	SaddleBrown	34.305557	13.717514	36.721283	3.120179	581.852344
4	mstephens@davidson-herman.com	14023 Rodriguez Passage\nPort Jacobville, PR 3...	MediumAquaMarine	33.330673	12.795189	37.536653	4.446308	599.406092

Output of customers.describe():

	Avg. Session Length	Time on App	Time on Website	Length of Membership	Yearly Amount Spent
count	500.000000	500.000000	500.000000	500.000000	500.000000
mean	33.053194	12.052488	37.060445	3.533462	499.314038
std	0.992563	0.994216	1.010489	0.999278	79.314782
min	29.532429	8.508152	33.913847	0.269901	256.670582
25%	32.341822	11.388153	36.349257	2.930450	445.038277
50%	33.082008	11.983231	37.069367	3.533975	498.887875
75%	33.711985	12.753850	37.716432	4.126502	549.313828
max	36.139662	15.126994	40.005182	6.922689	765.518462