Monday, July 27, 2020

Data Science Capstone project - The battle of neighborhoods in Dubai


1. Introduction/ Business understanding

1.1 Description of problem

Recently, the 13th edition of the IPL (Indian Premier League) was announced amid the coronavirus pandemic, with the UAE chosen as the host country. The league is slated to commence on 19 September 2020, and whether spectators will be allowed into the stadiums is still under discussion. In this project, a dataset of Dubai's communities is used to help visitors find suitable restaurants, hotels and other venues during the IPL season.

1.2 Background of problem

The Indian Premier League (IPL) is one of the most popular and highly valued leagues in the world, particularly in cricket-playing nations such as India, Australia and England. It is India's T20 cricket league tournament. It draws large crowds to the stadiums and has a huge viewership across cable TV and digital platforms. Since it is India's tournament, it is mostly played in India, but in extraordinary circumstances it is played in other countries. This time it is the UAE, which is also a cricket-playing nation and has a time zone close to India's. Many games are scheduled to be played in Dubai's stadium as well. Dubai is located in the eastern part of the Arabian Peninsula, on the coast of the Persian Gulf. It aims to be the business hub of Western Asia and is also a major global transport hub for passengers and cargo.

It is difficult for new travelers to find the places best suited to them. So, using the Foursquare API, I performed various analyses on the Dubai dataset to find the best places for restaurants, hotels, parks and so on. This could give new visitors to Dubai an overview of the city's neighborhoods.

2. Data description

The dataset used in this project is a list of the communities of Dubai, scraped from Wikipedia. It contains 131 communities.

Data source:

https://en.wikipedia.org/wiki/List_of_communities_in_Dubai

I scraped the data from the Wikipedia table using a Python library called ‘Beautiful Soup’. Only 3 columns of the dataset are used, i.e. Community Number, Community (English) and Community (Arabic).

Example of dataset:

I used the ‘geopy’ library to find the latitude and longitude of each community. Then, using the Foursquare API, I found the venues in each community and what each community is known for.

3. Methodology

3.1 Scraping the list of communities of Dubai from Wikipedia


I first read the table from Wikipedia and then, iterating through the data of each row, created a new data frame with 6 columns.

Only 3 columns were kept and the rest were dropped. The columns were then renamed. The new data frame looked like this:

3.2 Adding geospatial data

Using the geopy library, the location (i.e. latitude and longitude) of each community was retrieved. The communities whose location could not be found were left out. Hence, I was left with 65 communities out of 131.
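The filtering step above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the helper `add_coordinates`, the `fake_geocode` stand-in and its placeholder coordinates are all hypothetical, and the geocoder is passed in as a plain function so the example runs without network access.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A geocoder here is any callable mapping a place name to (lat, lon), or None if not found.
Geocoder = Callable[[str], Optional[Tuple[float, float]]]


def add_coordinates(communities: List[str], geocode: Geocoder) -> List[Dict]:
    """Return one row per community that could be located; skip the rest."""
    rows = []
    for name in communities:
        coords = geocode(f"{name}, Dubai")
        if coords is None:
            # Location not found: leave this community out.
            continue
        lat, lon = coords
        rows.append({"Community": name, "Latitude": lat, "Longitude": lon})
    return rows


# Hypothetical stand-in for a real geocoder; the coordinates are placeholders.
_FAKE_LOOKUP = {
    "Abu Hail, Dubai": (25.0, 55.0),
    "Al Baraha, Dubai": (25.1, 55.1),
}


def fake_geocode(query: str) -> Optional[Tuple[float, float]]:
    return _FAKE_LOOKUP.get(query)


located = add_coordinates(["Abu Hail", "Al Baraha", "Unknown Place"], fake_geocode)
print(len(located), "of 3 communities located")
```

With geopy, the `geocode` argument could wrap a call such as `Nominatim(user_agent="my-app").geocode(query)`, which returns `None` when nothing is found and otherwise an object with `latitude` and `longitude` attributes.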

The data after adding the locations:

3.3 Finding the venues of each neighborhood within a radius of 500 meters using the Foursquare API:

Defining the credentials to connect to Foursquare API:
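A minimal sketch of what such a setup might look like, assuming the legacy Foursquare v2 `venues/explore` endpoint (the post does not state which endpoint was used). The credential values are placeholders, and `build_explore_url` is a hypothetical helper that only builds the request URL without sending it.

```python
from urllib.parse import urlencode

# Placeholder credentials: replace with your own Foursquare developer keys.
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
VERSION = "20200701"  # API version date in YYYYMMDD form (assumed value)
RADIUS = 500          # search radius in meters, as used in this post
LIMIT = 100           # maximum number of venues per request (assumed value)


def build_explore_url(lat: float, lng: float, radius: int = RADIUS, limit: int = LIMIT) -> str:
    """Build (but do not send) a request URL for venues around a point."""
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": VERSION,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


url = build_explore_url(25.0, 55.0)  # placeholder coordinates
print(url)
```

The resulting URL could then be fetched with any HTTP client, and the JSON response parsed for venue names and categories.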

First, exploring the venues of the neighborhood ‘Abu Hail’:


We can see that a total of four venues were returned by Foursquare.

Exploring the venues of all 65 neighborhoods of Dubai:

I ran into problems when trying to explore the venues of all 65 neighborhoods in one go. So I divided the 65 neighborhoods into 4 groups and explored the venues of each group separately. Once the venues of all four groups had been returned, the results were concatenated.
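The batching workaround described above can be sketched as follows. This is only an illustration: `split_into_groups` is a hypothetical helper, `fetch_venues` is a stand-in for the real Foursquare request, and the neighborhood names are generated placeholders.

```python
def split_into_groups(items, n_groups):
    """Split a list into n_groups consecutive chunks of nearly equal size."""
    size, extra = divmod(len(items), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        # The first `extra` groups get one additional item.
        end = start + size + (1 if i < extra else 0)
        groups.append(items[start:end])
        start = end
    return groups


def fetch_venues(neighborhood):
    """Hypothetical stand-in for the API call; returns one fake venue row."""
    return [{"Neighborhood": neighborhood, "Venue": f"Venue near {neighborhood}"}]


neighborhoods = [f"Neighborhood {i}" for i in range(1, 66)]  # 65 placeholder names
groups = split_into_groups(neighborhoods, 4)

all_venues = []
for group in groups:
    # Explore one group at a time, then append its rows to the combined result.
    group_rows = []
    for name in group:
        group_rows.extend(fetch_venues(name))
    all_venues.extend(group_rows)

print([len(g) for g in groups], len(all_venues))
```

If each group's result is held in a pandas data frame instead of a list, `pandas.concat` performs the same final concatenation.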

The data frame after concatenation, along with the venues of each neighborhood, is as follows:

The number of venues in each neighborhood returned by foursquare can be viewed as:

 

From the above table, we can see that Abu Hail has 4 venues, while Al Baraha, Al Buteen and Al Garhoud have 40 venues, and so on.


Analyzing each neighborhood using one hot encoding:



Displaying each neighborhood with top 5 most common venues:

 

From the above figure, we can see that the top 5 venue categories of Al Baraha are Hotel, Middle Eastern Restaurant, Café, American Restaurant and Spa. The frequencies show that, of all the venues in Al Baraha, 20% are Hotels, 20% are Middle Eastern Restaurants, 10% are Cafés, 10% are American Restaurants, 10% are Spas and the remaining 30% fall into other categories.
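The frequencies discussed above can be reproduced on toy data as in the sketch below. The venue list is invented purely to match the stated percentages (it is not the real Foursquare result), and the helper names are hypothetical. Counting categories and dividing by the total gives the same numbers as taking the mean of one-hot encoded columns per neighborhood.

```python
from collections import Counter

# Toy data: 20 invented venues for one neighborhood, chosen to match the
# percentages quoted in the text. Not actual Foursquare output.
categories = (
    ["Hotel"] * 4
    + ["Middle Eastern Restaurant"] * 4
    + ["Café"] * 2
    + ["American Restaurant"] * 2
    + ["Spa"] * 2
    + ["Bakery", "Gym", "Pharmacy", "Bookstore", "Park", "Pizza Place"]
)
venues = [("Al Baraha", category) for category in categories]


def category_frequencies(rows, neighborhood):
    """Share of each venue category among a neighborhood's venues."""
    cats = [cat for name, cat in rows if name == neighborhood]
    counts = Counter(cats)
    return {cat: n / len(cats) for cat, n in counts.items()}


def top_n(freqs, n=5):
    """The n most frequent categories; ties are broken alphabetically."""
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


freqs = category_frequencies(venues, "Al Baraha")
for category, share in top_n(freqs):
    print(f"{category}: {share:.0%}")
```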

The top 10 venues of each neighborhood are displayed in the table below:



Clustering the neighborhoods, i.e. the communities of Dubai, based on the similarity of their venues using the K-Means algorithm:

The neighborhoods have been grouped into five clusters.
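To illustrate the idea behind this step, here is a small pure-Python sketch of the K-Means procedure (Lloyd's algorithm) on invented frequency vectors. It is not the code used in the project, which would more typically rely on a library implementation; the neighborhood names, feature values, the deterministic initialization and the choice of two clusters instead of five are all assumptions made to keep the example short and reproducible.

```python
import math


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
    return labels, centroids


# Invented venue-category frequencies: [hotel, restaurant, park]
names = ["Neighborhood A", "Neighborhood B", "Neighborhood C", "Neighborhood D"]
features = [
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
    [0.5, 0.4, 0.1],
    [0.0, 0.3, 0.7],
]

labels, centroids = kmeans(features, k=2)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

Neighborhoods with similar venue profiles (here the hotel-heavy and the park-heavy ones) end up with the same label, which is what the cluster labels in this section represent.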

The K-Means label for each neighborhood:

Now, plotting each neighborhood on a map using the folium library:

Folium is a useful library for visualizing locations on a map. It also allows zooming in and out of the map. With very few lines of code, it does an impressive job of visualizing data.

 

 

4. Results / Discussion

The study of the venues of each neighborhood produced some results. Let's discuss them here:

Finding 1:

Since the IPL is going to be held in the UAE and it is an Indian league tournament, more Indians are expected to visit. From the above data, we see that plenty of Indian restaurants are available. So, people from India will probably have no problem finding restaurants to their taste. Also, cricket is mainly considered an Asian game, so visitors from across Asia can also find plenty of Asian restaurants.

The places where one can easily find Indian restaurants are:

 

From the above data, we can see that Emirates Hill Third, Marsa Dubai, Al Raffa and Al Karama are better known for Indian restaurants.

Finding 2:

Places where hotels can be found easily are:

So, anyone in Dubai looking for a place with more hotel options can choose from the places above.

 

Finding 3:

Places with most parks are given below:


So, people fond of parks can choose to stay in the communities/neighborhoods mentioned above.

Finding 4:

Someone fond of beaches can choose to stay in the places given below:

Finding 5:

Many people love to have coffee frequently, and it becomes inconvenient for them when they cannot find a coffee shop easily. So, here is the list of places better known for their coffee shops.

These were the findings that I felt were most useful for people traveling to Dubai.

5. Conclusion

In today's digital world, data science plays a vital role. It increases the capabilities of businesses and medical instruments. It helps businesses analyze the behavior of their customers and compete with their counterparts in a fast-changing world. With the exponential increase in the use of digital instruments in various sectors, large amounts of data are generated and stored every day. Hence, it becomes essential to analyze that data to gain information that can help improve various sectors by supporting the right decision at the right time.

With this project, I have made an effort to help first-time travelers to Dubai, especially during the IPL season. I used common libraries such as geopy and folium to find the locations and to plot them on a map, respectively. I also used the Foursquare API to explore the venues of each neighborhood. Despite these efforts, there is still room for improvement that could help extract even more useful and realistic information from the data.

 


Link to Github


Thursday, July 2, 2020

How to open cmd and powershell in folder in windows

Create virtual environment in python and install django

How to uninstall jupyter notebook?

Install jupyter notebook in windows both with and without virtual enviro...

Install python package using jupyter notebook

Open a file in jupyter notebook

Monday, April 20, 2020

How to permanently disable and clear windows 10 activity history?


How to delete all the data collected by google/youtube ?


Solved: Fatal error: Uncaught Error: Call to undefined function mysqli_connect()


Cricket database project (with CRUD operations) developed using php and MYSQL database


Saturday, April 11, 2020

How to clear the search history of file explorer?


100% working: How to recover the permanently deleted files for free?


Thursday, April 2, 2020

How to get detailed battery report in windows 10?


Wednesday, April 1, 2020

How to add idm extension to tor browser?


Monday, March 30, 2020

Jobs portal web application using django framework

Solved! search.yahoo.com browser hijacker

Friday, September 27, 2019

Summary on “An integrated cost model for software reuse”

Today, complex, high-quality computer-based systems must be built in a very short time, which calls for an organized approach to reuse. Sometimes the potential gains from reusing specifications are greater than those from reusing code components, because code contains low-level details.

Several features distinguish the different cost models. The investment cycle has four distinct levels: corporate, domain engineering, application engineering and component engineering. The economic function can be one of five: Net Present Value, Payback Value, Average Return on Book Value, Internal Rate of Return and Profitability Index. The cost factors specify which aspects of the reuse decision we want to consider. The reuse organization matters because the organizational structure has some impact on how costs are determined, charged and accounted for. The scope differs, as some models consider a short-term decision whereas others consider a long-term investment cycle. Some cost models neglect integration costs, some fail to take into account the discount rate of resources, and so on.

A variety of viewpoints is involved in software reuse initiatives: corporate executives, the producer staff, the consumer staff, library managers and component providers. A generic software reuse cost model can be characterized by five kinds of variety: investment cycles, cost factors, economic functions, viewpoints and hypotheses.

The variety of investment cycles covers four levels of decision. At the corporate level, the decision is whether to introduce reuse into the practice of software development. In domain engineering, it is whether to initiate a domain analysis/domain engineering initiative. In application engineering, it is whether to introduce reuse practices for a development project. In component engineering, it is whether it is worthwhile to develop a specific component to serve a group of project teams.

The variety of cost factors includes the following. The investment cycle, denoted by Y, is measured in years and typically ranges between 3 and 5. The discount rate, denoted by d, is an abstract quantity that typically ranges between 0.10 and 0.20 and reflects the time value of money. Investment costs, denoted by IC, are measured in person-months, because most costs that arise can best be quantified as personnel effort. Episodic benefits at year y, for 1 <= y <= Y, are denoted by B(y) and measured in person-months. Episodic costs at year y, for 1 <= y <= Y, are denoted by C(y) and measured in person-months.

The variety of economic functions includes the following. Net Present Value, denoted by NPV, is measured in person-months. Return on Investment, denoted by ROI, recognizes that investments involve risks. The Profitability Index, denoted by PI, relates the potential profit to the investment cost; an investment is worthwhile whenever PI exceeds 1. The Internal Rate of Return, denoted by IRR, is defined as the value of d that makes the net present value zero. The Payback Value, denoted by PB, is defined as the shortest investment cycle that makes the net present value non-negative. The Average Rate of Return, denoted by ARR, prorates the profitability index by the number of years in the investment cycle.

The variety of viewpoints includes four perspectives. In the Component Engineering Viewpoint, the question is whether or not to develop a reusable asset to satisfy a tentatively specified set of requirements. In the Application Engineering Viewpoint, it is whether or not to adopt reuse in a given development project. In the Domain Engineering Viewpoint, it is whether or not to initiate a domain engineering effort in a tentatively specified application domain. In the Corporate Viewpoint, the investment decision is whether or not to initiate a corporate software reuse program.

The variety of hypotheses includes Non-Linear Cost Effects, Integration Costs, Quantifying Quality Gains and Code Inflation.
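
To make the cost factors and economic functions above more concrete, here is a small worked sketch. The formulas are the standard discounted-cash-flow definitions and are assumed, not confirmed, to match those of the summarized paper: NPV is taken as -IC plus the sum over years of (B(y) - C(y)) / (1 + d)^y, and PI as that discounted sum divided by IC. All numbers are invented for illustration.

```python
def discounted_net_benefits(benefits, costs, d):
    """Sum of (B(y) - C(y)) / (1 + d)**y for y = 1..Y."""
    return sum(
        (b - c) / (1 + d) ** y
        for y, (b, c) in enumerate(zip(benefits, costs), start=1)
    )


def net_present_value(ic, benefits, costs, d):
    """NPV in person-months: discounted net benefits minus the investment cost."""
    return -ic + discounted_net_benefits(benefits, costs, d)


def profitability_index(ic, benefits, costs, d):
    """PI: discounted net benefits per unit of investment (assumed definition)."""
    return discounted_net_benefits(benefits, costs, d) / ic


def payback_value(ic, benefits, costs, d):
    """Shortest number of years for which the NPV is non-negative, else None."""
    for years in range(1, len(benefits) + 1):
        if net_present_value(ic, benefits[:years], costs[:years], d) >= 0:
            return years
    return None


# Invented example: Y = 4 years, d = 0.15, IC = 100 person-months.
IC, d = 100.0, 0.15
B = [60.0, 60.0, 60.0, 60.0]  # episodic benefits B(y), person-months
C = [10.0, 10.0, 10.0, 10.0]  # episodic costs C(y), person-months

npv = net_present_value(IC, B, C, d)
pi = profitability_index(IC, B, C, d)
pb = payback_value(IC, B, C, d)
print(f"NPV = {npv:.2f} person-months, PI = {pi:.2f}, PB = {pb} years")
```

In this toy case the NPV is positive and PI exceeds 1, so under the criteria summarized above the investment would be considered worthwhile.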

The proposed automated support has two main functions. The purpose of the archival function is to keep track of costs and benefits as they arise. The purpose of the analytical function is to analyze investment cycles by producing any combination of economic functions.

There are various existing cost models. The model proposed by Gaffney and Cruickshank combines domain engineering costs and application engineering costs in a single equation; it does not take certain considerations into account and assumes that the number of applications that make up the domain engineering effort is predetermined. Kain proposes a return on investment model that is especially geared towards object-oriented programming (where reusable assets are objects at various levels of abstraction). There are various other models.

Summary of “No Silver Bullet”

There are two types of difficulty in the software development process: essential and accidental. Essential difficulties are inherent in the nature of software, whereas accidental difficulties are those that attend its production today but are not inherent. Accidental complexity arises from things like hardware constraints and awkward programming languages. Most programmers are still devoted to solving accidental problems, but the time has come to address essential complexity.

The essential difficulties include complexity, conformity, changeability and invisibility. In conformity, the challenge is to make new software fit a system where other software is already running. Changeability makes software dependent on the technological advancement of its surroundings. Because software is invisible, it becomes very difficult to form a clear picture of how the final product will look or perform.

The development of high-level languages like Fortran, C and C++ has solved the accidental difficulties to a great extent. There is no single silver bullet, but a series of innovations could lead to a significant improvement. The Ada language, object-oriented programming and artificial intelligence could play a vital role in achieving those improvements and are the hopes for the silver bullet. Building expert systems is not an easy task, but approaches like graphical programming, program verification and automatic programming play a vital role in software development.

As a large number of software engineers are involved in software development, a large amount of software is developed daily. This opens one attack on the conceptual essence: sometimes it is more reliable to buy already-built software than to develop new software. There is also a difference between good designers and great designers. There can be as much as a tenfold difference between an ordinary designer and a great one, so great designers should be treated well and given higher status.