NEMISA DATA SCIENCE HACKATHON 2021

TEAM: DJ PATIL'S APPRENTICE

HACKER(S): TINASHE NDEMERA


How can resources best be allocated to reduce unemployment in South Africa in the post-Covid era?

- EDUCATION

- ENTREPRENEURSHIP

- INDUSTRY

- CORRUPTION

In [29]:
from __future__ import print_function # For Python 2 / 3 compatibility
%matplotlib inline 
import numpy as np
import pandas as pd
import seaborn as sns
import chardet
import math
import matplotlib.pyplot as plt
#np.random.seed(0)

#%pip install cchardet
#import cchardet as chardet

saved excel spreadsheet as a CSV

file = open(r'C:\Users\user\Downloads\ethekwini_skills.xls',encoding='utf-8')
file

In [30]:
data=pd.read_csv(file)
data.head()
---------------------------------------------------------------------------
EmptyDataError                            Traceback (most recent call last)
<ipython-input-30-e6445f6374f3> in <module>
----> 1 data=pd.read_csv(file)
      2 data.head()

~\anaconda3\lib\site-packages\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    674         )
    675 
--> 676         return _read(filepath_or_buffer, kwds)
    677 
    678     parser_f.__name__ = name

~\anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    446 
    447     # Create the parser.
--> 448     parser = TextFileReader(fp_or_buf, **kwds)
    449 
    450     if chunksize or iterator:

~\anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, f, engine, **kwds)
    878             self.options["has_index_names"] = kwds["has_index_names"]
    879 
--> 880         self._make_engine(self.engine)
    881 
    882     def close(self):

~\anaconda3\lib\site-packages\pandas\io\parsers.py in _make_engine(self, engine)
   1112     def _make_engine(self, engine="c"):
   1113         if engine == "c":
-> 1114             self._engine = CParserWrapper(self.f, **self.options)
   1115         else:
   1116             if engine == "python":

~\anaconda3\lib\site-packages\pandas\io\parsers.py in __init__(self, src, **kwds)
   1889         kwds["usecols"] = self.usecols
   1890 
-> 1891         self._reader = parsers.TextReader(src, **kwds)
   1892         self.unnamed_cols = self._reader.unnamed_cols
   1893 

pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

EmptyDataError: No columns to parse from file
In [31]:
# saved excel spreadsheet as a CSV
file = open(r'C:\Users\user\Downloads\ethekwini_skill.csv')
file
Out[31]:
<_io.TextIOWrapper name='C:\\Users\\user\\Downloads\\ethekwini_skill.csv' mode='r' encoding='cp1252'>
In [32]:
data=pd.read_csv(file)
data.head()
Out[32]:
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13
0 Gender Disability Drivers Licence Ward No Suburb Skill Sector Skill Description Skill Experience Duration Age Highest Grade Passed Informal Training Sector Informal Training Description Informal Experience Duration Qualification
1 FEMALE NO NO 888 NOT SPECIFIED No Skill Sector NaN NaN 29 Grade 12/std 10 No Informal Training Sector NaN No Qualification
2 MALE NO NO 72 Risecliff, Demat No Skill Sector NaN NaN 57 Grade 10/std 8 Other Sectors and Industries Security 1 - 3 years No Qualification
3 FEMALE NO NO 72 Risecliff, Demat No Skill Sector NaN NaN 28 Grade 11/std 9 Other Sectors and Industries Home Base Care NOT SPECIFIED No Qualification
4 FEMALE NO NO 76 Umlazi V Other Sectors and Industries Security 1 - 3 years 34 Grade 10/std 8 No Informal Training Sector NaN No Qualification
In [33]:
data.tail()
Out[33]:
Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13
22935 FEMALE NO NO 63 Malven, Escombe Food & Beverages Cooking. NOT SPECIFIED 24 Grade 11/std 9 No Informal Training Sector NaN No Qualification
22936 FEMALE NO NO 103 Drummond, Peacevale, Cliffdale Art, Culture, Entertainment & Sport Art 5 years and above 23 Grade 12/std 10 No Informal Training Sector NaN Arts and Design
22937 MALE NO NO 103 Drummond, Peacevale, Cliffdale Distribution, Transport & Logistics Driving 5 years and above 25 Grade 12/std 10 No Informal Training Sector NaN No Qualification
22938 FEMALE NO NO 63 Malven, Escombe Other Business Activities & Services Hair Stylist. 5 years and above 27 Grade 10/std 8 No Informal Training Sector NaN No Qualification
22939 MALE NO NO 63 Malven, Escombe Electricity, Gas & Water Supply Electrical. NOT SPECIFIED 29 Grade 10/std 8 No Informal Training Sector NaN No Qualification
In [34]:
new_header = data.iloc[0]  # grab the first row for the header
data = data[1:]            # keep every row below the header row
data.columns = new_header  # set the header row as the DataFrame header
data.head()
Out[34]:
Gender Disability Drivers Licence Ward No Suburb Skill Sector Skill Description Skill Experience Duration Age Highest Grade Passed Informal Training Sector Informal Training Description Informal Experience Duration Qualification
1 FEMALE NO NO 888 NOT SPECIFIED No Skill Sector NaN NaN 29 Grade 12/std 10 No Informal Training Sector NaN No Qualification
2 MALE NO NO 72 Risecliff, Demat No Skill Sector NaN NaN 57 Grade 10/std 8 Other Sectors and Industries Security 1 - 3 years No Qualification
3 FEMALE NO NO 72 Risecliff, Demat No Skill Sector NaN NaN 28 Grade 11/std 9 Other Sectors and Industries Home Base Care NOT SPECIFIED No Qualification
4 FEMALE NO NO 76 Umlazi V Other Sectors and Industries Security 1 - 3 years 34 Grade 10/std 8 No Informal Training Sector NaN No Qualification
5 FEMALE NO NO 72 Risecliff, Demat No Skill Sector NaN NaN 28 Grade 11/std 9 Hotels, Restaurants & Catering Catering 1 - 3 years No Qualification
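The header promotion above can also be done while reading the file. This is a minimal sketch, assuming the exported CSV starts with an empty row of commas followed by the real column names; the inline sample text is hypothetical and stands in for ethekwini_skill.csv.

```python
import io
import pandas as pd

# Hypothetical three-column sample mimicking the layout of the exported CSV:
# an empty first line, the real header on the second line, then the records.
sample_csv = io.StringIO(
    ",,\n"
    "Gender,Ward No,Age\n"
    "FEMALE,888,29\n"
    "MALE,72,57\n"
)

# header=1 uses the second line (index 1) as the column names
# and discards the line above it.
skills = pd.read_csv(sample_csv, header=1)
print(skills.columns.tolist())
print(skills.dtypes)
```

Reading the header this way lets pandas infer column types from the data rows only, so a numeric column such as Age does not have to be converted from object afterwards.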
In [35]:
data.shape
Out[35]:
(22939, 14)
In [36]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22939 entries, 1 to 22939
Data columns (total 14 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   Gender                         22939 non-null  object
 1   Disability                     22939 non-null  object
 2   Drivers Licence                22939 non-null  object
 3   Ward No                        22926 non-null  object
 4   Suburb                         22939 non-null  object
 5   Skill Sector                   22939 non-null  object
 6   Skill Description              10538 non-null  object
 7   Skill Experience Duration      10540 non-null  object
 8   Age                            22939 non-null  object
 9   Highest Grade Passed           22939 non-null  object
 10  Informal Training Sector       22939 non-null  object
 11  Informal Training Description  5291 non-null   object
 12  Informal Experience Duration   22939 non-null  object
 13  Qualification                  22939 non-null  object
dtypes: object(14)
memory usage: 2.5+ MB
In [37]:
sns.heatmap(data.isnull(),yticklabels=False,cmap='viridis')
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x254e2fcfc48>
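The heatmap shows where values are missing; a per-column percentage makes the same information easier to compare. From the info() output above, roughly 54% of Skill Description and 77% of Informal Training Description are missing. A small sketch of the calculation on a hypothetical mini-frame (in the notebook the frame would be `data`):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame used only to illustrate the calculation.
frame = pd.DataFrame({
    "Gender": ["FEMALE", "MALE", "FEMALE", "MALE"],
    "Skill Description": [np.nan, np.nan, "Security", np.nan],
    "Informal Training Description": [np.nan, "Catering", np.nan, np.nan],
})

# isnull() gives a boolean frame; the column mean is the fraction missing.
missing_share = frame.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_share)
```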
In [38]:
data.columns
Out[38]:
Index(['Gender', 'Disability', 'Drivers Licence', 'Ward No', 'Suburb',
       'Skill Sector', 'Skill Description', 'Skill Experience Duration', 'Age',
       'Highest Grade Passed', 'Informal Training Sector',
       'Informal Training Description', 'Informal Experience Duration',
       'Qualification'],
      dtype='object', name=0)
In [39]:
data.nunique()
Out[39]:
0
Gender                              3
Disability                          3
Drivers Licence                     3
Ward No                           104
Suburb                            104
Skill Sector                       47
Skill Description                4060
Skill Experience Duration           6
Age                                74
Highest Grade Passed               10
Informal Training Sector           45
Informal Training Description    2421
Informal Experience Duration        7
Qualification                      11
dtype: int64
In [40]:
data['Gender'].unique()
Out[40]:
array(['FEMALE', 'MALE', 'NOT SPECIFIED'], dtype=object)
In [41]:
data['Disability'].unique()
Out[41]:
array(['NO', 'YES', 'NOT SPECIFIED'], dtype=object)
In [42]:
data['Skill Sector'].unique()
Out[42]:
array(['No Skill Sector', 'Other Sectors and Industries', 'Handcraft',
       'Other Business Activities & Services', 'Building & Construction',
       'Electricity, Gas & Water Supply',
       'Art, Culture, Entertainment & Sport', 'Banking',
       'Hotels, Restaurants & Catering',
       'Textiles, Clothing & Leather, Fashion',
       'Agriculture and Forestry', 'Wholesale, Retail Trade',
       'IT & Internet', 'Steel Industry',
       'Distribution, Transport & Logistics', 'Auditing/Accounting',
       'HR Services, Recruitment & Selection',
       'Medical, Health & Social Care', 'Food & Beverages',
       'Non-profit & voluntary sector', 'Health & Fitness', 'Consulting',
       'Advertising, Communication & PR',
       'Recreational, Cultural & Sporting Activities',
       'Conservation & Environment', 'Financial Services',
       'Public Services', 'Leisure & Tourism',
       'Media, Audiovisual & Publishing', 'Telecommunication Services',
       'Public Administration', 'Call & Contact Centers Industry',
       'Education & Training', 'Armed Forces', 'Manufacturing', 'Legal',
       'Public Administration & Defence', 'Mining & Quarrying',
       'Chemical & Petrochemical Industry', 'Automotive Sector',
       'Fast Moving Consumer Goods/ Durables', 'Medical Technology',
       'Fishing', 'Research & Development', 'Insurances',
       'Publishing, Printing & Reproduction', 'Real Estate'], dtype=object)
In [43]:
for col_name in data.columns:
    if data[col_name].dtypes == 'object':
        unique_cat = len(data[col_name].unique())
        print("Feature '{col_name}' has {unique_cat} unique categories".format(col_name=col_name,unique_cat=unique_cat))
Feature 'Gender' has 3 unique categories
Feature 'Disability' has 3 unique categories
Feature 'Drivers Licence' has 3 unique categories
Feature 'Ward No' has 105 unique categories
Feature 'Suburb' has 104 unique categories
Feature 'Skill Sector' has 47 unique categories
Feature 'Skill Description' has 4061 unique categories
Feature 'Skill Experience Duration' has 7 unique categories
Feature 'Age' has 74 unique categories
Feature 'Highest Grade Passed' has 10 unique categories
Feature 'Informal Training Sector' has 45 unique categories
Feature 'Informal Training Description' has 2422 unique categories
Feature 'Informal Experience Duration' has 7 unique categories
Feature 'Qualification' has 11 unique categories
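The counts printed by this loop are one higher than the `nunique()` output for the columns that contain missing values (for example Ward No: 104 versus 105). `nunique()` ignores NaN by default, while `unique()` keeps it as a value. A small illustration on a made-up Series:

```python
import numpy as np
import pandas as pd

# Made-up values; one entry is missing.
ward = pd.Series(["888", "72", "72", np.nan], name="Ward No")

print(ward.nunique())              # NaN is ignored
print(ward.nunique(dropna=False))  # NaN is counted as its own value
print(len(ward.unique()))          # unique() keeps NaN in the result
```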
In [44]:
# first three for incomplete data and the next two for irrelevance/redundancy
ndata = data.drop(['Skill Description','Skill Experience Duration','Informal Training Description','Suburb','Informal Experience Duration'], axis = 1) # Policy can be made with regards to wards and not suburbs
ndata.head()
Out[44]:
Gender Disability Drivers Licence Ward No Skill Sector Age Highest Grade Passed Informal Training Sector Qualification
1 FEMALE NO NO 888 No Skill Sector 29 Grade 12/std 10 No Informal Training Sector No Qualification
2 MALE NO NO 72 No Skill Sector 57 Grade 10/std 8 Other Sectors and Industries No Qualification
3 FEMALE NO NO 72 No Skill Sector 28 Grade 11/std 9 Other Sectors and Industries No Qualification
4 FEMALE NO NO 76 Other Sectors and Industries 34 Grade 10/std 8 No Informal Training Sector No Qualification
5 FEMALE NO NO 72 No Skill Sector 28 Grade 11/std 9 Hotels, Restaurants & Catering No Qualification
In [45]:
sns.heatmap(ndata.isnull(),yticklabels=False,cmap='viridis')
Out[45]:
<matplotlib.axes._subplots.AxesSubplot at 0x254e2bf9f48>
In [46]:
dataTypeSeries = ndata.dtypes
print('Data type of each column of Dataframe :')

print(dataTypeSeries)
Data type of each column of Dataframe :
0
Gender                      object
Disability                  object
Drivers Licence             object
Ward No                     object
Skill Sector                object
Age                         object
Highest Grade Passed        object
Informal Training Sector    object
Qualification               object
dtype: object
In [47]:
ndata['Age'] = ndata['Age'].astype(int)
print (ndata['Age'].dtypes)
int32
In [48]:
ndata['Age'].plot.hist()
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x254e0d8b748>
In [49]:
sns.distplot(ndata['Age'])
Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x254e2ac0908>
In [50]:
ndata['Age'].describe()
Out[50]:
count    22939.000000
mean        30.133833
std          8.790107
min          0.000000
25%         24.000000
50%         28.000000
75%         34.000000
max        116.000000
Name: Age, dtype: float64
In [51]:
ndata['Age'] = pd.to_numeric(ndata['Age'])
# Note: this labels every age above 15 as '15-20' and converts the column to strings
ndata['Age'] = np.where(ndata['Age'] > 15, '15-20', ndata['Age'])
ndata.head()
Out[51]:
Gender Disability Drivers Licence Ward No Skill Sector Age Highest Grade Passed Informal Training Sector Qualification
1 FEMALE NO NO 888 No Skill Sector 15-20 Grade 12/std 10 No Informal Training Sector No Qualification
2 MALE NO NO 72 No Skill Sector 15-20 Grade 10/std 8 Other Sectors and Industries No Qualification
3 FEMALE NO NO 72 No Skill Sector 15-20 Grade 11/std 9 Other Sectors and Industries No Qualification
4 FEMALE NO NO 76 Other Sectors and Industries 15-20 Grade 10/std 8 No Informal Training Sector No Qualification
5 FEMALE NO NO 72 No Skill Sector 15-20 Grade 11/std 9 Hotels, Restaurants & Catering No Qualification
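Every row in the output above ends up with the same '15-20' label. If the aim is to group ages into several bands, `pd.cut` is one option. The sketch below uses made-up ages, and the bin edges and labels are illustrative assumptions rather than bands chosen in this notebook.

```python
import pandas as pd

# Made-up ages covering the range reported by describe() (0 to 116).
ages = pd.Series([0, 17, 24, 29, 34, 57, 116], name="Age")

# Bin edges and labels are illustrative assumptions.
edges = [0, 15, 20, 25, 35, 50, 121]
labels = ["0-14", "15-19", "20-24", "25-34", "35-49", "50+"]

# right=False makes each bin closed on the left: [0, 15), [15, 20), ...
age_band = pd.cut(ages, bins=edges, labels=labels, right=False)
print(age_band.value_counts(sort=False))
```

The result is a categorical column, so the bands keep their order in plots such as `sns.countplot`.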
In [52]:
nndata = ndata.loc[ndata['Ward No']=='28']
nndata.head(5)
Out[52]:
Gender Disability Drivers Licence Ward No Skill Sector Age Highest Grade Passed Informal Training Sector Qualification
1122 FEMALE NO NO 28 Other Business Activities & Services 15-20 Grade 11/std 9 Other Business Activities & Services No Qualification
1133 FEMALE NO NO 28 Handcraft 15-20 Grade 10/std 8 Public Services No Qualification
1189 FEMALE NO NO 28 Art, Culture, Entertainment & Sport 15-20 Grade 12/std 10 No Informal Training Sector No Qualification
1203 FEMALE NO NO 28 No Skill Sector 15-20 Grade 11/std 9 Other Business Activities & Services No Qualification
1221 FEMALE NO NO 28 No Skill Sector 15-20 Grade 12/std 10 Other Business Activities & Services No Qualification
In [53]:
nndata.shape
Out[53]:
(114, 9)
In [54]:
print (ndata['Age'].dtypes)
object
In [55]:
plt.figure(figsize=(16, 16))
ax1 = sns.countplot(x="Skill Sector",data=nndata)

ax1.set_xticklabels(ax1.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()
In [56]:
plt.figure(figsize=(16, 16))
ax2 = sns.countplot(x="Skill Sector",data=ndata)

ax2.set_xticklabels(ax2.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()

The registrants are primarily in the no-skill, unspecified and construction categories.

Perhaps there needs to be more investment in upskilling.

In [58]:
plt.figure(figsize=(8, 8))
ax4 = sns.countplot(x='Qualification', data=ndata)

ax4.set_xticklabels(ax4.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()
In [59]:
sns.countplot(x="Disability",data=ndata)
Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x254e340cd88>
In [60]:
ndata['Disability'].value_counts()
Out[60]:
NO               22742
NOT SPECIFIED      101
YES                 96
Name: Disability, dtype: int64
In [61]:
ndata['Gender'].value_counts() #less than 1% unspecified and 44% male
Out[61]:
FEMALE           12754
MALE             10128
NOT SPECIFIED       57
Name: Gender, dtype: int64
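The percentages quoted in the comment can be checked from the counts. A short sketch using the three counts shown above; on the full frame, `ndata['Gender'].value_counts(normalize=True)` gives the same shares as fractions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
gender_counts = pd.Series({"FEMALE": 12754, "MALE": 10128, "NOT SPECIFIED": 57})

# Dividing by the total turns counts into percentage shares.
gender_share = gender_counts / gender_counts.sum() * 100
print(gender_share.round(2))
```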
In [62]:
genders = pd.get_dummies(ndata['Gender'],drop_first=True)
license = pd.get_dummies(ndata['Drivers Licence'],drop_first=True)
disability = pd.get_dummies(ndata['Disability'],drop_first=True)
In [63]:
genders.head(5)
Out[63]:
MALE NOT SPECIFIED
1 0 0
2 1 0
3 0 0
4 0 0
5 0 0
In [64]:
ndata.drop(['Gender'],axis=1,inplace=True)
In [65]:
ndata.drop(['Disability'],axis=1,inplace=True)
In [66]:
ndata.drop(['Drivers Licence'],axis=1,inplace=True)
In [67]:
ndata.head()
Out[67]:
Ward No Skill Sector Age Highest Grade Passed Informal Training Sector Qualification
1 888 No Skill Sector 15-20 Grade 12/std 10 No Informal Training Sector No Qualification
2 72 No Skill Sector 15-20 Grade 10/std 8 Other Sectors and Industries No Qualification
3 72 No Skill Sector 15-20 Grade 11/std 9 Other Sectors and Industries No Qualification
4 76 Other Sectors and Industries 15-20 Grade 10/std 8 No Informal Training Sector No Qualification
5 72 No Skill Sector 15-20 Grade 11/std 9 Hotels, Restaurants & Catering No Qualification
In [68]:
ndata=pd.concat([ndata,genders,license,disability],axis=1)
In [69]:
ndata
Out[69]:
Ward No Skill Sector Age Highest Grade Passed Informal Training Sector Qualification MALE NOT SPECIFIED NOT SPECIFIED YES NOT SPECIFIED YES
1 888 No Skill Sector 15-20 Grade 12/std 10 No Informal Training Sector No Qualification 0 0 0 0 0 0
2 72 No Skill Sector 15-20 Grade 10/std 8 Other Sectors and Industries No Qualification 1 0 0 0 0 0
3 72 No Skill Sector 15-20 Grade 11/std 9 Other Sectors and Industries No Qualification 0 0 0 0 0 0
4 76 Other Sectors and Industries 15-20 Grade 10/std 8 No Informal Training Sector No Qualification 0 0 0 0 0 0
5 72 No Skill Sector 15-20 Grade 11/std 9 Hotels, Restaurants & Catering No Qualification 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
22935 63 Food & Beverages 15-20 Grade 11/std 9 No Informal Training Sector No Qualification 0 0 0 0 0 0
22936 103 Art, Culture, Entertainment & Sport 15-20 Grade 12/std 10 No Informal Training Sector Arts and Design 0 0 0 0 0 0
22937 103 Distribution, Transport & Logistics 15-20 Grade 12/std 10 No Informal Training Sector No Qualification 1 0 0 0 0 0
22938 63 Other Business Activities & Services 15-20 Grade 10/std 8 No Informal Training Sector No Qualification 0 0 0 0 0 0
22939 63 Electricity, Gas & Water Supply 15-20 Grade 10/std 8 No Informal Training Sector No Qualification 1 0 0 0 0 0

22939 rows × 12 columns
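After the concatenation, the frame has repeated column names (NOT SPECIFIED and YES each appear more than once), so it is no longer clear which column belongs to which source variable. One way to avoid this is to let `pd.get_dummies` prefix each dummy with its source column. A sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical rows using the category values seen in the notebook.
people = pd.DataFrame({
    "Age": [29, 57, 28],
    "Gender": ["FEMALE", "MALE", "NOT SPECIFIED"],
    "Drivers Licence": ["NO", "YES", "NOT SPECIFIED"],
    "Disability": ["NO", "NO", "YES"],
})

# Passing `columns=` encodes those columns in place and prefixes every
# dummy with the name of the column it came from.
encoded = pd.get_dummies(
    people,
    columns=["Gender", "Drivers Licence", "Disability"],
    drop_first=True,
)
print(encoded.columns.tolist())
```

This also replaces the separate drop and concat steps, since the original columns are removed by `get_dummies` itself.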

In [70]:
# Qualitative study by Moodley,S 2017 GIBS for MBA 
gibs_study = {"Formal,Informal & Non-formal": 5, "Formal & Non-formal": 4, "Formal & Informal": 2, "Formal": 1, "Informal": 1, "Non-formal": 0}
plt.pie([float(v) for v in gibs_study.values()], labels=[k for k in gibs_study.keys()],
           autopct='%1.0f%%', pctdistance=0.8, labeldistance=1.2)
Out[70]:
([<matplotlib.patches.Wedge at 0x254e3497e48>,
  <matplotlib.patches.Wedge at 0x254e34a3a88>,
  <matplotlib.patches.Wedge at 0x254e34a7908>,
  <matplotlib.patches.Wedge at 0x254e34ae7c8>,
  <matplotlib.patches.Wedge at 0x254e34b3a88>,
  <matplotlib.patches.Wedge at 0x254e34baa88>],
 [Text(0.42552584020849354, 1.1220195004164837, 'Formal,Informal & Non-formal'),
  Text(-1.165130158160196, -0.28717888944975795, 'Formal & Non-formal'),
  Text(0.14464417502633278, -1.1912506296455634, 'Formal & Informal'),
  Text(0.8982130253204513, -0.7957470459541035, 'Formal'),
  Text(1.1651302321011348, -0.2871785894595135, 'Informal'),
  Text(1.1999999999999789, 2.2470421653959904e-07, 'Non-formal')],
 [Text(0.2836838934723291, 0.7480130002776558, '38%'),
  Text(-0.7767534387734641, -0.1914525929665053, '31%'),
  Text(0.09642945001755521, -0.7941670864303757, '15%'),
  Text(0.5988086835469676, -0.530498030636069, '8%'),
  Text(0.7767534880674232, -0.19145239297300903, '8%'),
  Text(0.7999999999999861, 1.4980281102639938e-07, '0%')])
In [71]:
dataz = pd.read_csv(r'C:\Users\user\Downloads\ethekwini-business-license-data-in-all-regions.csv')
dataz.head(10)
Out[71]:
RefNo BusinessName PhysicalAddress Proprietor Employer Telephone PostalAddress LicenseType LicenseSubType CurrentDate AnnualNotification LicenseIssueDate NotificationUpdateDate Unnamed: 13 Unnamed: 14
0 B00009 ELSIE MAGDALENA MARIA DU PREEZ N/A, N/A ELSIE MAGDALENA MARIA DU PREEZ NaN NaN 5 Mellissa Crescent, New Germany, 3610 Hawking in meals or perishable foodstuffs J 2006-03-07 0 1994-09-05 __/__/____ NaN NaN
1 B00013 TONY'S PLACE N/A, N/A ANTHONY BARNETT LEUW NaN NaN P O Box 2896, Durban, 4000 Hawking in meals or perishable foodstuffs J 2006-05-07 0 1994-03-10 __/__/____ NaN NaN
2 B00014 SAMEERA'S HALAAL FOODS UKZN HOWARD COLLEGE CAMPUS, JUBILEE GARDENS, D... SHER BANU MAHOMED SHER BANU MAHOMED 823968696 10 Evergreen Terrace, Havenside, Chatsworth 4092 Sale/Supply of perishable foodstuffs / 2020-01-27 0 1994-06-14 27/01/2020 NaN NaN
3 B00019 LOURICH DUSCHA N/a, N/a LOURICH DUSHCA NaN NaN 15 Grand Birches, Entabeni Road, Paradise Vall... Hawking in meals or perishable foodstuffs P 1998-05-13 0 1994-05-08 NaN NaN NaN
4 B00031 AFRIKAN CONNECTION ENTERPRISE 59 Clayside Crescent, Caneside, Phoenix NUNDLALL GUNGADEEN NaN 5056076 59 Clayside Crescent, Caneside, 4068 Phoenix Sale/Supply of perishable foodstuffs 1 2017-10-01 0 1996-02-28 1/10/2017 NaN NaN
5 B00033 ANDREW ANDERSON STANLEY N/a, N/A ANDREW ANDERSON STANLEY NaN 2090008 9 Saunders Road, 4091 Sparks Estate, N/A Hawking in meals or perishable foodstuffs J 2013-07-01 0 1994-08-05 1/7/2013 NaN NaN
6 B00037 NEL'S KIOSK N/a, N/a GEORGE PHILIPPES NEL NaN 031-2056748 15 Ormiston Place, Glenwood, 4001 Hawking in meals or perishable foodstuffs J 2008-02-20 0 1994-06-14 20/02/2008 NaN NaN
7 B00047 DRY DOCK KIOSK Bayhead Road (graving Dock), Bayhead KUMRESEN BONNY NAICKER NaN 2059103/ 2058884/2033720 145 Silvermount Circle, Moorton, 4092 Chatsworth Sale/Supply of perishable foodstuffs 1 2017-04-04 0 1994-04-25 4/4/2017 NaN NaN
8 B00073 RAMS TUCK SHOP 8 Crestone Road, Whetstone, Phoenix RAMPHAL SAMLALL NaN 0315079718//0315006671 8 Crestone Road, Whetstone, Phoenix 4068 Sale/Supply of perishable foodstuffs 1 2018-04-24 0 1995-09-26 24/04/2018 NaN NaN
9 B00076 ZIPPY'S CONES N/a, N/a AYOB KHAN NaN 031-2693326 34 Cartmel Road, Clare Estate, 4091 Hawking in meals or perishable foodstuffs J 2005-01-18 0 1994-08-29 18/01/2005 NaN NaN
In [72]:
dataz.drop(['Proprietor', 'BusinessName','RefNo','PhysicalAddress', 'Telephone','PostalAddress','LicenseSubType','CurrentDate'],axis=1,inplace=True)
dataz.head()
Out[72]:
Employer LicenseType AnnualNotification LicenseIssueDate NotificationUpdateDate Unnamed: 13 Unnamed: 14
0 NaN Hawking in meals or perishable foodstuffs 0 1994-09-05 __/__/____ NaN NaN
1 NaN Hawking in meals or perishable foodstuffs 0 1994-03-10 __/__/____ NaN NaN
2 SHER BANU MAHOMED Sale/Supply of perishable foodstuffs 0 1994-06-14 27/01/2020 NaN NaN
3 NaN Hawking in meals or perishable foodstuffs 0 1994-05-08 NaN NaN NaN
4 NaN Sale/Supply of perishable foodstuffs 0 1996-02-28 1/10/2017 NaN NaN
In [73]:
dataz.drop(['Employer', 'AnnualNotification','LicenseIssueDate','NotificationUpdateDate', 'Unnamed: 13','Unnamed: 14'],axis=1,inplace=True)
In [74]:
dataz.head()
Out[74]:
LicenseType
0 Hawking in meals or perishable foodstuffs
1 Hawking in meals or perishable foodstuffs
2 Sale/Supply of perishable foodstuffs
3 Hawking in meals or perishable foodstuffs
4 Sale/Supply of perishable foodstuffs
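Since only LicenseType is kept, the two drop calls could be replaced by loading that single column. A sketch using `usecols`; the inline sample is a shortened, hypothetical stand-in for the business licence CSV, with values taken from the rows shown above.

```python
import io
import pandas as pd

# Shortened stand-in for the licence file (only three of its columns).
sample_csv = io.StringIO(
    "RefNo,BusinessName,LicenseType\n"
    "B00009,ELSIE MAGDALENA MARIA DU PREEZ,Hawking in meals or perishable foodstuffs\n"
    "B00014,SAMEERA'S HALAAL FOODS,Sale/Supply of perishable foodstuffs\n"
)

# usecols restricts parsing to the listed columns.
licenses = pd.read_csv(sample_csv, usecols=["LicenseType"])
print(licenses["LicenseType"].value_counts(dropna=False))
```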
In [75]:
dataz.tail()
Out[75]:
LicenseType
9582 Sale/Supply of perishable foodstuffs
9583 Sale/Supply of perishable foodstuffs
9584 Sale/Supply of perishable foodstuffs
9585 Sale/Supply of perishable foodstuffs
9586 Sale/Supply of perishable foodstuffs
In [76]:
dataz.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9587 entries, 0 to 9586
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   LicenseType  9585 non-null   object
dtypes: object(1)
memory usage: 75.0+ KB
In [77]:
dataz.nunique()
Out[77]:
LicenseType    10
dtype: int64
In [79]:
plt.figure(figsize=(8, 8))
ax3 = sns.countplot(x='LicenseType',data=dataz)

ax3.set_xticklabels(ax3.get_xticklabels(), rotation=40, ha="right")
plt.tight_layout()
plt.show()
In [80]:
dataz.head(50)
Out[80]:
LicenseType
0 Hawking in meals or perishable foodstuffs
1 Hawking in meals or perishable foodstuffs
2 Sale/Supply of perishable foodstuffs
3 Hawking in meals or perishable foodstuffs
4 Sale/Supply of perishable foodstuffs
5 Hawking in meals or perishable foodstuffs
6 Hawking in meals or perishable foodstuffs
7 Sale/Supply of perishable foodstuffs
8 Sale/Supply of perishable foodstuffs
9 Hawking in meals or perishable foodstuffs
10 Sale/Supply of perishable foodstuffs
11 Hawking in meals or perishable foodstuffs
12 Sale/Supply of perishable foodstuffs
13 Sale/Supply of perishable foodstuffs
14 Sale/Supply of perishable foodstuffs
15 Sale/Supply of perishable foodstuffs
16 Sale/Supply of perishable foodstuffs
17 Hawking in meals or perishable foodstuffs
18 Sale/Supply of perishable foodstuffs
19 Sale/Supply of perishable foodstuffs
20 Sale/Supply of perishable foodstuffs
21 Sale/Supply of perishable foodstuffs
22 Hawking in meals or perishable foodstuffs
23 Provision of health facilities or entertainment
24 Provision of health facilities or entertainment
25 Sale/Supply of perishable foodstuffs
26 Hawking in meals or perishable foodstuffs
27 Hawking in meals or perishable foodstuffs
28 Sale/Supply of perishable foodstuffs
29 Sale/Supply of perishable foodstuffs
30 Hawking in meals or perishable foodstuffs
31 Sale/Supply of perishable foodstuffs
32 Sale/Supply of perishable foodstuffs
33 Sale/Supply of perishable foodstuffs
34 Sale/Supply of perishable foodstuffs
35 Sale/Supply of perishable foodstuffs
36 Sale/Supply of perishable foodstuffs
37 Sale/Supply of perishable foodstuffs
38 Sale/Supply of perishable foodstuffs
39 Sale/Supply of perishable foodstuffs
40 Provision of health facilities or entertainment
41 Sale/Supply of perishable foodstuffs
42 Sale/Supply of perishable foodstuffs
43 Provision of health facilities or entertainment
44 Sale/Supply of perishable foodstuffs
45 Sale/Supply of perishable foodstuffs
46 Hawking in meals or perishable foodstuffs
47 Sale/Supply of perishable foodstuffs
48 Sale/Supply of perishable foodstuffs
49 Sale/Supply of perishable foodstuffs
In [81]:
datas = pd.read_csv(r'C:\Users\user\Downloads\unemploymentrates.csv')
datas.head(10)
Out[81]:
Province Corruption reports Unemployment (%)
0 Western Cape 6.5 23.716012
1 Eastern Cape 6.5 57.484663
2 Northern Cape 2.0 47.723577
3 Free State 4.6 45.084746
4 KwaZulu-Natal 9.6 53.815789
5 North West 1.7 50.628141
6 Gauteng 39.8 30.394558
7 Mpumalanga 4.7 45.973451
8 Limpopo 4.5 49.689655
9 Unspecified 20.1 0.000000
In [82]:
datas = datas.iloc[:-1] # drop the last row ('Unspecified'), which has no unemployment figure
datas.head(10)
Out[82]:
Province Corruption reports Unemployment (%)
0 Western Cape 6.5 23.716012
1 Eastern Cape 6.5 57.484663
2 Northern Cape 2.0 47.723577
3 Free State 4.6 45.084746
4 KwaZulu-Natal 9.6 53.815789
5 North West 1.7 50.628141
6 Gauteng 39.8 30.394558
7 Mpumalanga 4.7 45.973451
8 Limpopo 4.5 49.689655
In [83]:
corelation = datas.corr()
sns.heatmap(corelation, xticklabels=corelation.columns, yticklabels=corelation.columns, annot=True)
Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0x254e7182bc8>
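The heatmap summarises a single pair of variables, so the coefficient can also be computed directly. The sketch below rebuilds the nine provincial rows from the table above and computes the Pearson coefficient together with a rank-based version (Pearson on ranks, which is Spearman's rho). The rank-based figure is an addition for comparison and is not part of the original analysis.

```python
import pandas as pd

# Values copied from the provincial table above ('Unspecified' row excluded).
provinces = pd.DataFrame({
    "Province": ["Western Cape", "Eastern Cape", "Northern Cape", "Free State",
                 "KwaZulu-Natal", "North West", "Gauteng", "Mpumalanga", "Limpopo"],
    "Corruption reports": [6.5, 6.5, 2.0, 4.6, 9.6, 1.7, 39.8, 4.7, 4.5],
    "Unemployment (%)": [23.716012, 57.484663, 47.723577, 45.084746, 53.815789,
                         50.628141, 30.394558, 45.973451, 49.689655],
})

x = provinces["Corruption reports"]
y = provinces["Unemployment (%)"]

pearson = x.corr(y)                    # linear association
rank_based = x.rank().corr(y.rank())   # Pearson on ranks (Spearman's rho)

print(f"Pearson r:      {pearson:.2f}")
print(f"Rank-based rho: {rank_based:.2f}")
```

With only nine provinces, and Gauteng reporting far more corruption cases than any other province, either coefficient is best read as a description of this table rather than evidence of a causal link.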