
Tuesday, December 6, 2022

Topic Modeling: Working of LDA (Latent Dirichlet Allocation) in simple terms

Topic modeling is an unsupervised method used to perform text analysis. When we are given a large set of unlabeled documents, it is very difficult to get an insight into the discussions the documents are based upon. This is where topic modeling comes in: it helps to identify a number of hidden topics within a set of documents. Based on those identified topics, the entire set of documents can be classified. It can also be used to predict the topic of upcoming documents. It has various applications, such as analyzing customer reviews, predicting story genres, classifying tweets, and so on.

LDA (Latent Dirichlet Allocation) is one of the most used and well-known techniques to perform topic modeling.

 


The word latent means hidden, because we don't know in advance which topics a set of documents contains. Based on the probability distribution of the occurrence of words in each document, the words are allocated to the defined topics.

Working of LDA:

  • What we want from LDA is to learn the topic mix in each document and also learn the word mix in each topic.

  • We choose the number of topics we want to find in the given dataset.

  • Randomly assign each word in each given document to one of those topics.

  • Now, we go through every word in each document and the topic it is currently assigned to. We analyze how often that topic occurs in the document and how often the word occurs in that topic as a whole. Based on this analysis, a new topic is assigned to the word.

  • After a number of such iterations, the topics finally start making sense. We can then analyze the discovered topics and assign each one a suitable name that best describes it.



Fig: Schematic diagram of LDA algorithm
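To make this concrete, here is a minimal sketch of that workflow using scikit-learn's LatentDirichletAllocation (not from the original post; the toy documents and the choice of two topics are assumptions):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus mixing two themes (sports and cooking)
docs = [
    "the team won the football match",
    "bake the bread and season the soup",
    "the coach praised the players after the game",
    "add salt and pepper while cooking the soup",
]

# Bag-of-words counts; stop words are removed and word order is ignored
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit LDA with a chosen number of topics (K = 2 here)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # per-document topic mix

# Inspect the top words of each topic so we can name the topics ourselves
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[::-1][:4]]
    print(f"Topic {k}:", top)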


Topic Modeling

Topic modeling is an unsupervised technique, as it is used to analyze text data that has no labels attached to it. As the name suggests, it is used to discover a number of topics within a given set of texts, such as tweets, books, articles, and so on. Each topic consists of words, and the order of the words does not matter. Topic modeling performs automatic clustering of the words that best describe a set of documents, giving us insight into the issues or features users are talking about as a group.


For example, let us say a company has launched a software product in the market and receives a large amount of feedback about various product features within a specified time period. Rather than going through each review one by one, applying topic modeling lets us know very quickly how users have perceived the various features of the product. It is one of the essential techniques for performing text analysis on unstructured data. After performing topic modeling, we can even perform topic classification to predict which topic upcoming reviews fall under. There are various techniques for topic modeling, among which LDA is considered to be one of the most effective.
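Continuing the sketch from the LDA section above, an unseen review can be given a topic mixture, which is the basis for this kind of topic classification:

new_X = vectorizer.transform(["the players enjoyed the soup after the match"])
print(lda.transform(new_X))   # topic proportions for the unseen document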


Challenges of Sentiment analysis

 

  • Lack of sufficiently large domain-specific datasets

  • Multi-class classification of the data

  • Difficulty in extracting the context of the posts made by users

  • Handling of neutral posts

  • Analyzing posts written in multiple languages

Saturday, November 28, 2020

Visualizing Decision tree in Python (with codes included)

List of libraries required to be installed (if not already installed). Here, installation is done through a Jupyter Notebook; for the terminal, use just "pip install library-name".

#import sys

#!{sys.executable} -m pip install numpy

#!{sys.executable} -m pip install pandas

#!{sys.executable} -m pip install scikit-learn

#!{sys.executable} -m pip install hvplot

#!{sys.executable} -m pip install six

#!{sys.executable} -m pip install pydotplus

#!{sys.executable} -m pip install graphviz

Importing libraries:

import numpy as np 

import pandas as pd

from sklearn.tree import DecisionTreeClassifier

from sklearn import tree

import hvplot.pandas

About the dataset

Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications: Drug A, Drug B, Drug C, Drug X, and Drug Y.

Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The feature sets of this dataset are Age, Sex, Blood Pressure, and Cholesterol of patients, and the target is the drug that each patient responded to.

It is a sample of a multiclass classifier, and you can use the training part of the dataset to build a decision tree, and then use it to predict the class of an unknown patient or to prescribe a drug to a new patient.

Let's see the sample of dataset:
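A minimal sketch of loading the data (the file name Drug.csv follows the pre-processing text below; adjust it to your copy of the dataset):

my_data = pd.read_csv("Drug.csv")
my_data.head()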

The shape of the dataset is (200, 6).

Pre-processing

Using my_data as the Drug.csv data read by pandas, declare the following variables:

  • X as the Feature Matrix (the data of my_data)
  • y as the response vector (the target)
  • Remove the column containing the target name, since it doesn't contain numeric values.

As you may figure out, some features in this dataset are categorical, such as Sex or BP. Unfortunately, sklearn decision trees do not handle categorical variables directly, but we can still convert these features to numerical values: pandas.get_dummies() converts a categorical variable into dummy/indicator variables.

Now we can fill the target variable.
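A sketch of this pre-processing (the column names, including Na_to_K, follow the common layout of this drug dataset and are assumptions):

# Feature matrix: every column except the target
X = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']]
X = pd.get_dummies(X, columns=['Sex', 'BP', 'Cholesterol'])   # encode categoricals
y = my_data['Drug']   # target (response vector)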

Setting up the Decision Tree


We will be using a train/test split on our decision tree. Let's import train_test_split from sklearn.model_selection.


from sklearn.model_selection import train_test_split


Now, train_test_split will return 4 different values. We will name them:
X_trainset, X_testset, y_trainset, y_testset

The train_test_split will need the parameters:
X, y, test_size=0.3, and random_state=3.

X and y are the arrays required before the split, test_size represents the ratio of the testing dataset, and random_state ensures that we obtain the same splits every time.
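The call, exactly as described:

X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)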




Modeling


We will first create an instance of the DecisionTreeClassifier called drugTree.
Inside the classifier, specify criterion="entropy" so we can see the information gain of each node.

We can use the "gini" criterion as well; for this dataset the resulting tree is essentially the same. Gini is actually the default criterion for DecisionTreeClassifier.
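A sketch of this step (max_depth=4 anticipates the depth chosen below; the metrics import is an addition):

drugTree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
drugTree.fit(X_trainset, y_trainset)

predTree = drugTree.predict(X_testset)

from sklearn import metrics
print("DecisionTree accuracy:", metrics.accuracy_score(y_testset, predTree))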

From the graph below, we can see that the accuracy score is highest at max_depth=4 and remains constant thereafter, so we have used max_depth=4 in this case.
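A sketch of how such a graph can be produced (the depth range is an arbitrary choice; metrics comes from the step above, and hvplot was imported at the top):

depths = range(1, 11)
scores = []
for d in depths:
    dt = DecisionTreeClassifier(criterion="entropy", max_depth=d)
    dt.fit(X_trainset, y_trainset)
    scores.append({'max_depth': d,
                   'train_accuracy': metrics.accuracy_score(y_trainset, dt.predict(X_trainset)),
                   'test_accuracy': metrics.accuracy_score(y_testset, dt.predict(X_testset))})
scores_df = pd.DataFrame(scores)

scores_df.hvplot.line(x='max_depth', y='test_accuracy')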

We can also plot max-depth vs accuracy score for both the trainset and the testset together:
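Reusing scores_df from the sketch above:

scores_df.hvplot.line(x='max_depth', y=['train_accuracy', 'test_accuracy'])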

Visualization
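A sketch of one way to render the tree with the libraries installed above (six, pydotplus, graphviz); matplotlib and the feature/class-name handling are assumptions:

from six import StringIO
import pydotplus
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from sklearn.tree import export_graphviz

dot_data = StringIO()
export_graphviz(drugTree, out_file=dot_data,
                feature_names=list(X.columns),
                class_names=sorted(y.unique()),
                filled=True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png("drugtree.png")   # renders the tree to a PNG file

img = mpimg.imread("drugtree.png")
plt.figure(figsize=(20, 10))
plt.imshow(img)
plt.axis("off")
plt.show()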

The final image of the decision tree is the PNG rendered by the visualization code above.
Another easy method to generate decision tree:
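A sketch using sklearn's built-in tree.plot_tree (the tree module was imported at the top; the figure size is arbitrary):

import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
tree.plot_tree(drugTree, feature_names=list(X.columns),
               class_names=sorted(y.unique()), filled=True)
plt.show()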

Drawing the decision path (a more readable, text-based form of the decision tree):
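One way to produce such a decision path is sklearn's export_text, which prints the tree as indented if/else-style rules (using it here is an assumption):

from sklearn.tree import export_text

print(export_text(drugTree, feature_names=list(X.columns)))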

Important features for classification

Listing the features by their rank of importance for classifying the data.
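A sketch using the classifier's feature_importances_ attribute:

importances = pd.Series(drugTree.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))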

The github link to the program can be found here.

Sunday, August 16, 2020

Build a colorful Word Cloud in python using mask image

A word cloud is a data visualization tool in data science. It is a very efficient way to visualize the words in a text according to how often they are repeated within it. Stopwords are ignored during visualization. A text file called "skill.txt" has been used for visualization, and a mask image of the map of Nepal has been used to shape the word cloud.

The libraries required are:
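A sketch of the likely imports (assumed from the steps described below):

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator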

Reading the text file "alice.txt", from which the word cloud will be formed. After reading the text file, we set the stopwords.
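A sketch of this step:

text = open('alice.txt', 'r', encoding='utf-8').read()
stopwords = set(STOPWORDS)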


Generating a word cloud and storing it into "skillwc" variable.
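A sketch, with the background color as an arbitrary choice:

skillwc = WordCloud(background_color='white', stopwords=stopwords)
skillwc.generate(text)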

Importing libraries and creating a simple image of word cloud (without using mask image).
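A sketch of the plain rendering:

plt.imshow(skillwc, interpolation='bilinear')
plt.axis('off')
plt.show()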



Now, using the mask image of the map of Nepal to create the word cloud. First of all, we will open the image and save it in a variable "mask_image", and then view the mask image without superimposing the text onto it.
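A sketch of this step (the mask file name is an assumption):

mask_image = np.array(Image.open('nepal_map.png'))   # file name assumed

plt.imshow(mask_image, cmap=plt.cm.gray)
plt.axis('off')
plt.show()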






Click here to download the collection of mask images.

Finally, we will superimpose the text of 'alice.txt' onto the image shown above, applying the original colors of the image to the word cloud instead of the default colors.
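A sketch of the final step, assuming the mask is a color image so that ImageColorGenerator can pick up its colors:

wc = WordCloud(background_color='white', mask=mask_image, stopwords=stopwords)
wc.generate(text)

image_colors = ImageColorGenerator(mask_image)
plt.figure(figsize=(10, 8))
plt.imshow(wc.recolor(color_func=image_colors), interpolation='bilinear')
plt.axis('off')
plt.show()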




Get the Github link here.

Friday, August 14, 2020

Covid-19 Data Visualization across the World using Choropleth map

Introduction

This project visualizes Covid-19 data (i.e. total cases, deaths, and recoveries) across the countries of the world as of 12th August, 2020. A GeoJSON file of the world's countries has been used, and the Python library Folium has been used to generate a choropleth map whose geo_data value is that GeoJSON.

The libraries imported are:
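Presumably along these lines (assumed from the steps below):

import pandas as pd
import folium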

Data description:

Covid-19 data of countries across the world were scraped from Wikipedia.

Click here to go to the wikipedia page.

A simple one-line command can be used to scrape a table from Wikipedia. We will store the scraped data in a dataframe called 'df'.

df = pd.read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory')[1]

View of original data:

Data Wrangling/ Cleaning:

Step 1: 

Selecting only the required columns from the data above into a new dataframe df1.



Step 2:

Converting the multi-index columns into single-index columns.
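One way to flatten them, with the column labels assumed to match the names used in the later steps:

df1.columns = ['Countries', 'Cases', 'Deaths', 'Recovered']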

Step 3:

Removing the footnote reference (e.g. '[a]') attached to the name of each country in the dataframe above.

df1['Countries'] = df1['Countries'].str.replace(r'\[.*?\]$', '', regex=True)

Step 4:

Changing the country name 'United States' to 'United States of America' to match the name in the GeoJSON file.

df1['Countries'].replace('United States', 'United States of America', inplace=True)

Step 5:

We can see that the last 3 rows of the dataframe are not required, so we drop them.

df1=df1[:-3]

Step 6:

Replacing the value 'No data' with '0' (zero) in each affected column.

df1['Recovered'].replace('No data', '0', inplace=True)


Step 7:

Changing the data type of the columns Cases, Recovered, and Deaths to integer.
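A sketch of the conversion:

df1[['Cases', 'Recovered', 'Deaths']] = df1[['Cases', 'Recovered', 'Deaths']].astype(int)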

After changing the datatypes:

Visualizing the data across the world:

For cases:
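A sketch of the choropleth for cases (the GeoJSON file name and the key_on path are assumptions that depend on the file used):

world_geo = 'world_countries.json'   # GeoJSON file name assumed

world_map = folium.Map(location=[0, 0], zoom_start=2)
folium.Choropleth(
    geo_data=world_geo,
    data=df1,
    columns=['Countries', 'Cases'],
    key_on='feature.properties.name',
    fill_color='YlOrRd',
    legend_name='Covid-19 cases'
).add_to(world_map)
world_map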


Similarly, it can be done for Recovered and Deaths.

For recovered:

For deaths:


Get the Github link here.

Covid-19 Data Visualization of Nepal using Choropleth map

Introduction

This project visualizes Covid-19 data (i.e. total cases, deaths, and recoveries) across the various provinces and districts of Nepal as of 9th August, 2020. GeoJSON files of Nepal's provinces and districts have been used, and the Python library Folium has been used to generate choropleth maps whose geo_data value is the GeoJSON of Nepal.

The libraries imported are:

Data description:

Covid-19 data of the various provinces and districts were scraped from Wikipedia.

Click here to go to the wikipedia page.

A simple one-line command can be used to scrape a table from Wikipedia. We will store the scraped data in a dataframe called 'df'.

df = pd.read_html('https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data/Nepal_medical_cases_by_province_and_district')[1]

Original view of data:

Data Wrangling/ Cleaning:

Step 1: 

We can see in the data above that the columns are multi-index. So, we convert them into single-index columns.
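One way to flatten them (keeping the last level of each column label is an assumption about the table's structure):

df.columns = [col[-1] for col in df.columns]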

Step 2:

Dropping the 'Index Case' column.

df.drop(columns=['Index Case'], axis=1, inplace=True)

Step 3:

We can see that the rows with index 84 (the grand total for Nepal) and 85 are not required, so we drop them.

df=df[:-2]   # Getting all rows except the last two rows 

Step 4:

We can see in the image above that the data types of the columns Cases, Recovered and Deaths are not in the desired form. So, we convert them to integer.
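As in the previous post, a sketch of the conversion:

df[['Cases', 'Recovered', 'Deaths']] = df[['Cases', 'Recovered', 'Deaths']].astype(int)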

Step 5:

Our dataframe (df) after cleaning looks like this:


We can see the data of Provinces and Districts are together in single dataframe. So, we need to separate them into different dataframes.

Creating a dataframe of Provinces only:

Step 1:

Extracting the data of only provinces into a new dataframe called df_prov.

We can use two methods to do so.

Method 1

#df_prov=df.iloc[[0, 15, 24, 38, 50, 63, 74],:]


Method 2 (a more robust method, since in the method above the row indices of the provinces may change at the source)

df_prov = df[df['Location'].str.contains('Province|Bagmati|Gandaki|Karnali|Sudurpashchim')]

Step 2:

Resetting the index of newly formed dataframe.

df_prov.reset_index(drop=True, inplace=True)

View of new dataframe:

Step 3:

Creating a copy of this dataframe to use while creating the dataframe of districts.

df_backup=df_prov.copy()

Step 4:

Renaming the provinces to match the province names mentioned in the GeoJSON file of Nepal.
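A sketch of the idea; the actual name pairs depend on the GeoJSON file, so the mapping below is hypothetical:

df_prov['Location'] = df_prov['Location'].replace({
    'Province No. 1': 'Province 1',   # hypothetical pair
    'Province No. 2': 'Province 2',   # hypothetical pair
})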

Step 5:

Renaming the column 'Location' to 'Province'.

df_prov.rename(columns={'Location':'Province'}, inplace=True)

Final view of dataframe of Provinces:

Visualizing the data of provinces:

For Cases

Reading the Geojson file and creating a plain map of Nepal:

Defining a Choropleth map in a plain map created above:
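A sketch of both steps (the GeoJSON file name, map center, and key_on path are assumptions):

nepal_geo = 'nepal_provinces.geojson'   # file name assumed

nepal_map = folium.Map(location=[28.4, 84.1], zoom_start=7)
folium.Choropleth(
    geo_data=nepal_geo,
    data=df_prov,
    columns=['Province', 'Cases'],
    key_on='feature.properties.name',
    fill_color='YlOrRd',
    legend_name='Covid-19 cases by province'
).add_to(nepal_map)
nepal_map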

Map of Nepal as seen for Cases in various Provinces:

Similarly, we can visualize maps for Recovered and Deaths in the various provinces of Nepal as follows:

Map for recovered:


Map for deaths:


Creating a dataframe of districts only:


1. Creating a new dataframe for districts called 'df_dist' by concatenating the dataframes df and df_backup, and removing the duplicate rows between them.

df_dist=pd.concat([df, df_backup]).drop_duplicates(keep=False)

2. Renaming the column 'Location' to 'District'. 

df_dist.rename(columns={"Location":"District"}, inplace=True)

3. Resetting the index of dataframe df_dist.

df_dist.reset_index(drop=True, inplace=True)

4. The final dataframe of districts, i.e. df_dist, looks like this:

Visualizing data of districts:

For cases:


Map for cases:

Map for recovered:

Map for deaths:

Get the Github link here.