Master's Thesis project of the Master's in Data Science at the University of Barcelona
jordisc97/MSc_Data_Science-Master_Thesis
Applying machine learning in the company to predict the quality of sales leads.
By Jordi Solé Casaramona.
This repository is part of the thesis of the Master's in Data Science at the University of Barcelona, 2019/2020.
This work was done in collaboration with the EMEA 3D sales team at HP. The dataset on which the project is based currently contains more than 40,500 leads, spanning fiscal year 2016 quarter 1 to fiscal year 2020 quarter 4. On average, 257 new leads enter the system every week.
The main objective of this project is to develop a data science pipeline capable of predicting the quality of every lead, where quality means the probability that the lead will become a possible sale and advance to the next sales stages. The goal is a transition from decisions based on the salesperson's intuition to more data-driven decision-making, supported by a score from 0 to 1 that indicates the quality of the lead.
The pipeline is explained in detail in the thesis file found in this repository; its main structure is outlined below.
The pipeline consists of the following processes:
- Joining: All the data sources are first joined into one table; this is done on a Z8 server inside HP.
- Scraping: Web scraping techniques were used to enhance the information received from the company CRM.
- Preprocessing: Some of the scraped pages were not in English and therefore needed to be translated. Along with translation, other tasks such as data cleaning, feature engineering, and encoding were needed to ensure the best possible dataset to feed the algorithm.
- Training: The model can be periodically retrained, controlled by a pipeline parameter. Training is performed on all the data, and scores are then predicted only for leads not yet assigned to any state. The algorithm selected for the predictions was Extreme Gradient Boosting. The output of training is a pickle file used to make predictions faster.
- Prediction: In this step, the score for each lead is output along with an explanation of the most important attributes behind the decision, generated with the LIME package.
- PowerBi: The file resulting from the prediction is retrieved from the server and loaded into a Power BI dashboard so the whole organization can use the data from the scraping, scoring, and explainability steps.
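The steps above can be sketched as a simple daily driver script. This is a minimal illustration only, not the actual HP implementation; all function names, the `retrain` flag, and the dummy scoring rule are assumptions standing in for the real joining, preprocessing, and XGBoost scoring code, which is not public.

```python
# Minimal sketch of the daily pipeline driver (hypothetical names,
# not the actual HP implementation).

def join_sources(sources):
    """Join all source tables into one list of lead records."""
    leads = []
    for table in sources:
        leads.extend(table)
    return leads

def preprocess(leads):
    """Stand-in for translation, cleaning, feature engineering, encoding."""
    return [{**lead, "clean": True} for lead in leads]

def score(leads, retrain=False):
    """Assign a 0-1 quality score to leads with no sales state yet.
    The real pipeline would (re)train or load an XGBoost model here."""
    for lead in leads:
        if lead.get("state") is None:
            # dummy rule standing in for model.predict_proba
            lead["score"] = 0.9 if lead.get("industry") == "automotive" else 0.2
    return leads

def run_pipeline(sources, retrain=False):
    leads = preprocess(join_sources(sources))
    return score(leads, retrain=retrain)

scored = run_pipeline([[{"id": 1, "state": None, "industry": "automotive"}],
                       [{"id": 2, "state": "won", "industry": "retail"}]])
print(scored[0]["score"])  # only the lead with no state gets a score
```

Note that, as in the real pipeline, leads that already have an assigned state pass through unscored.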
Many organizations are still driven by intuition and experience-based decision-making. This type of decision-making suffers from problems such as human bias, the loss of knowledge when experienced workers leave, and reluctance to adopt more sophisticated information systems. With the arrival of the era of data, companies have at their disposal more information than ever before, but few know how to use this resource to its full potential. In this work, we develop a data science pipeline to predict the quality of sales leads for the EMEA 3D sales department at HP, a project that aims to support the transition to a data-driven decision-making organization.
To solve this problem, the developed pipeline focused on two tasks. The first involved developing a web scraping tool to obtain information that was not previously available in the company database, or that was very time-consuming to acquire given the size of the database of more than 40,000 leads. The second was training a machine learning algorithm to predict a quality score, together with an explanation of the main features behind the decision for every lead.
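As an illustration of the scraping idea, a minimal sketch using only Python's standard library is shown below. The actual tool, the pages it targets, and the fields it extracts are not disclosed in this repository, so the example page and the extracted field here are invented.

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Extract the <title> text from a scraped company page."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Invented example page; a real run would fetch the lead company's website.
html = "<html><head><title>Acme 3D Printing Ltd.</title></head><body>...</body></html>"
parser = TitleExtractor()
parser.feed(html)
print(parser.title)  # Acme 3D Printing Ltd.
```

Fields extracted this way would then be joined back onto the lead table as extra features for the model.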
The result of this process greatly impacted the business: the knowledge is kept inside the company in the machine learning model, and the explanations of each decision help build confidence in the model. Furthermore, the sales team used the score to make more data-driven decisions and to save time by prioritizing the best-quality leads. The trained Extreme Gradient Boosting algorithm achieved a 13.45% improvement over the baseline model, with a total accuracy of 0.94282 on the test set.
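Reading the 13.45% figure as a relative improvement, the implied baseline accuracy can be recovered with a one-line calculation. This is an interpretation of the numbers above, not a figure stated in the repository:

```python
final_accuracy = 0.94282
relative_improvement = 0.1345  # 13.45% over the baseline

# final = baseline * (1 + improvement)  =>  baseline = final / (1 + improvement)
baseline_accuracy = final_accuracy / (1 + relative_improvement)
print(round(baseline_accuracy, 3))  # roughly 0.831
```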
Lastly, all these tasks were put together as a pipeline and uploaded to a server inside HP to execute the process automatically every day with minimal human intervention. The pipeline developed proved to give very positive results for the organization and further developments are being made to enhance the results.
Disclaimer: The data used in this project is highly sensitive, and as part of the confidentiality agreement signed with HP, only minimal amounts of data with no customer details can be taken out of the company or used externally. As a result, no data exploration can be shown publicly. Only the output file with the predictions and explanations, stripped of customer data, is exposed in this public repository. The other files have been reduced to one or two lines, enough to give a sense of the fields in the dataset.