
Data Analyst Project for Beginners: Analysis of Lung Cancer


Lung cancer remains one of the leading causes of cancer-related deaths worldwide. Early detection and accurate prediction can significantly improve patient outcomes and survival rates. The Lung Cancer Prediction dataset, available on Kaggle, records demographic, lifestyle, and symptom information for each patient alongside a lung cancer label. This article walks through analyzing the dataset to uncover patterns, identify key risk factors, and build predictive models, using Google Sheets, Python, SQL, and Power BI.

Overview of the Lung Cancer Prediction Dataset

The Lung Cancer Prediction dataset encompasses detailed information about patients, capturing essential parameters such as:

  • Age: Age of the patient.
  • Gender: Gender of the patient.
  • Smoking: Smoking history (0 = non-smoker, 1 = smoker).
  • Yellow Fingers: Presence of yellow fingers (0 = no, 1 = yes).
  • Anxiety: Presence of anxiety (0 = no, 1 = yes).
  • Peer Pressure: Exposure to peer pressure (0 = no, 1 = yes).
  • Chronic Disease: Presence of chronic disease (0 = no, 1 = yes).
  • Fatigue: Experience of fatigue (0 = no, 1 = yes).
  • Allergy: Presence of allergy (0 = no, 1 = yes).
  • Wheezing: Experience of wheezing (0 = no, 1 = yes).
  • Alcohol Consumption: History of alcohol consumption (0 = no, 1 = yes).
  • Coughing: Experience of coughing (0 = no, 1 = yes).
  • Shortness of Breath: Experience of shortness of breath (0 = no, 1 = yes).
  • Swallowing Difficulty: Experience of difficulty swallowing (0 = no, 1 = yes).
  • Chest Pain: Experience of chest pain (0 = no, 1 = yes).
  • Lung Cancer: Presence of lung cancer (0 = no, 1 = yes).
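
Before any analysis, it can help to confirm that a loaded table actually matches the fields listed above. The following is a minimal sketch, assuming the column names are spelled exactly as in this list and that binary fields use the 0/1 coding described here. The real CSV headers and codes on Kaggle may differ, so check them first. The helper `check_schema` and the toy rows are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Assumed column names, taken from the field list above. The actual
# CSV headers may be spelled or cased differently.
BINARY_COLUMNS = [
    "Smoking", "Yellow Fingers", "Anxiety", "Peer Pressure",
    "Chronic Disease", "Fatigue", "Allergy", "Wheezing",
    "Alcohol Consumption", "Coughing", "Shortness of Breath",
    "Swallowing Difficulty", "Chest Pain", "Lung Cancer",
]
EXPECTED_COLUMNS = ["Age", "Gender"] + BINARY_COLUMNS


def check_schema(df):
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col in BINARY_COLUMNS:
        if col in df.columns:
            # Anything other than 0 or 1 violates the assumed coding.
            bad = set(df[col].dropna().unique()) - {0, 1}
            if bad:
                problems.append(f"{col}: unexpected values {sorted(map(str, bad))}")
    return problems


# Tiny made-up frame for illustration only; not real patient data.
toy = pd.DataFrame({col: [0, 1, 1] for col in BINARY_COLUMNS})
toy["Age"] = [45, 62, 70]
toy["Gender"] = ["F", "M", "M"]
toy.loc[2, "Wheezing"] = 2  # deliberately invalid code

print(check_schema(toy))
```

With a real file, the same function could be called on the result of `pd.read_csv(...)` to flag unexpected codes before cleaning.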


The primary objectives of this analysis are:

  1. Understanding Risk Factors: Investigating how different factors correlate with the presence of lung cancer.
  2. Predicting Lung Cancer: Building predictive models to accurately identify patients at risk of lung cancer.
  3. Optimizing Prevention Strategies: Developing strategies for early detection and prevention of lung cancer.


The analysis tests the following hypotheses:

  • H1: Smoking Influence: Smoking is significantly associated with an increased risk of lung cancer.
  • H2: Age and Gender Impact: Older age and male gender are associated with higher lung cancer risk.
  • H3: Chronic Diseases Correlation: The presence of chronic diseases increases the likelihood of lung cancer.
  • H4: Symptom Indicators: Symptoms like coughing, wheezing, and chest pain are strong indicators of lung cancer.
  • H5: Combined Factors: A combination of multiple factors provides a better prediction of lung cancer risk.
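
A hypothesis such as H1 can be checked with a test of independence between two binary columns. The sketch below is one possible approach, not necessarily the method used in the original project: a Pearson chi-square test on a 2x2 table, written with pandas and the standard library. The function name `chi_square_2x2`, the column names, and the toy counts are assumptions made for illustration.

```python
import math

import pandas as pd


def chi_square_2x2(df, factor, outcome):
    """Pearson chi-square test of independence for two 0/1 columns.

    Returns (statistic, p_value). No continuity correction is applied.
    """
    observed = (
        pd.crosstab(df[factor], df[outcome])
        .reindex(index=[0, 1], columns=[0, 1], fill_value=0)
        .to_numpy(dtype=float)
    )
    total = observed.sum()
    row_sums = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
    col_sums = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
    expected = row_sums @ col_sums / total          # counts under independence
    stat = float(((observed - expected) ** 2 / expected).sum())
    # With 1 degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Made-up counts for illustration only: 20 smokers (15 with cancer)
# and 20 non-smokers (5 with cancer).
toy = pd.DataFrame({
    "Smoking": [1] * 20 + [0] * 20,
    "Lung Cancer": [1] * 15 + [0] * 5 + [1] * 5 + [0] * 15,
})

stat, p_value = chi_square_2x2(toy, "Smoking", "Lung Cancer")
print(f"chi2 = {stat:.2f}, p = {p_value:.4f}")
```

A small p-value indicates association, not causation, and on a small sample an exact test may be the safer choice.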

Analytical Process

1. Preliminary Exploration using Google Sheets

The initial step involves importing the Lung Cancer Prediction dataset into Google Sheets for a high-level overview. This phase focuses on:

  • Data Structuring: Understanding the dataset’s structure and dimensions.
  • Basic Statistics: Calculating summary statistics such as average age, gender distribution, and prevalence of various symptoms.
  • Identifying Data Quality Issues: Flagging missing values, outliers, and inconsistencies that may require further cleaning.

2. Data Cleaning and Analysis with Python

In Python, the dataset is cleaned and transformed with pandas and numpy, then explored visually with matplotlib and seaborn:

  • Cleaning Data: Handling missing values, duplicates, and correcting data types for accurate analysis.
  • Feature Engineering: Creating new features such as age groups and symptom scores.
  • Exploratory Data Analysis (EDA): Visualizing distributions, trends, and relationships between variables using seaborn and matplotlib to uncover insights.
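
The cleaning and feature engineering steps above can be sketched as follows. This is an illustrative outline, not the project's actual code. The age bands, the choice of symptoms in the score, the function name `clean_and_engineer`, and the toy rows are all assumptions.

```python
import pandas as pd

# Assumed subset of symptom columns used for the score.
SYMPTOMS = ["Coughing", "Wheezing", "Chest Pain", "Shortness of Breath"]


def clean_and_engineer(df):
    """Drop duplicates and incomplete rows, fix types, add two features."""
    out = df.drop_duplicates().copy()
    # Coerce Age to numeric; unparseable entries become NaN.
    out["Age"] = pd.to_numeric(out["Age"], errors="coerce")
    # Drop rows that lack an age or any of the symptom flags.
    out = out.dropna(subset=["Age"] + SYMPTOMS)
    out["Age"] = out["Age"].astype(int)
    out[SYMPTOMS] = out[SYMPTOMS].astype(int)
    # Age bands are an illustrative choice, not a clinical standard.
    out["Age Group"] = pd.cut(
        out["Age"], bins=[0, 39, 59, 120], labels=["<40", "40-59", "60+"]
    )
    # Symptom score: number of reported symptoms per patient.
    out["Symptom Score"] = out[SYMPTOMS].sum(axis=1)
    return out.reset_index(drop=True)


# Made-up rows for illustration only: one duplicate, one bad age,
# one missing symptom value.
raw = pd.DataFrame({
    "Age": ["34", "61", "61", "n/a", "52"],
    "Coughing": [0, 1, 1, 1, 1],
    "Wheezing": [0, 1, 1, 0, None],
    "Chest Pain": [1, 1, 1, 0, 0],
    "Shortness of Breath": [0, 1, 1, 1, 1],
})

cleaned = clean_and_engineer(raw)
print(cleaned)
```

Dropping incomplete rows is the simplest policy. Imputing missing values is an alternative when too many rows would otherwise be lost.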

3. Machine Learning Modeling

This step builds and evaluates machine learning models that predict lung cancer risk from patient data:

  • Model Selection: Evaluating different algorithms such as logistic regression, decision trees, random forests, and gradient boosting.
  • Training and Testing: Splitting the dataset into training and testing sets, and using cross-validation to ensure model robustness.
  • Performance Metrics: Assessing model performance using metrics such as accuracy, precision, recall, and AUC-ROC.
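
The modeling steps above might look like the following with scikit-learn. This is a hedged sketch that uses logistic regression only, on synthetic data generated in the script, because the real dataset is not loaded here. The three features, the coefficients used to simulate labels, and the split sizes are assumptions, and the resulting scores say nothing about the real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the cleaned dataset (illustration only).
rng = np.random.default_rng(0)
n = 400
age = rng.integers(30, 85, size=n)
smoking = rng.integers(0, 2, size=n)
coughing = rng.integers(0, 2, size=n)
X = np.column_stack([age, smoking, coughing])

# Simulated labels: risk rises with age, smoking, and coughing.
logit = 0.04 * (age - 55) + 1.2 * smoking + 0.8 * coughing - 1.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Hold out a stratified test set so class balance is preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validated AUC on the training portion only.
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")

model.fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]

metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
}
print("CV AUC per fold:", cv_auc.round(3))
print({name: round(value, 3) for name, value in metrics.items()})
```

For a screening use case, recall usually deserves the most attention, since a missed case is costlier than a false alarm. The other algorithms listed above can be swapped in by replacing the `model` line.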

4. Visualization and Reporting with Power BI

For comprehensive visualization and reporting, the cleaned dataset is imported into an SQL database and connected to Power BI:

  • Interactive Dashboards: Creating dynamic dashboards in Power BI to visualize:
    • Distribution of risk factors such as age, smoking, and symptoms.
    • Correlations between various factors and lung cancer presence.
    • Gender and age distributions among lung cancer patients.
    • Patterns in lung cancer risk associated with lifestyle factors and symptoms.
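
Loading the cleaned data into an SQL database, as described above, can be sketched with pandas and the standard-library `sqlite3` module. An in-memory SQLite database stands in here for whichever database server Power BI would actually connect to. The table name, the snake_case column names, and the toy rows are assumptions.

```python
import sqlite3

import pandas as pd

# Made-up rows for illustration only; not real patient data.
cleaned = pd.DataFrame({
    "age": [45, 62, 70, 38],
    "smoking": [0, 1, 1, 0],
    "lung_cancer": [0, 1, 1, 0],
})

# In-memory SQLite database as a stand-in for a real SQL server.
conn = sqlite3.connect(":memory:")
cleaned.to_sql("lung_cancer_patients", conn, index=False, if_exists="replace")

# The kind of aggregate a dashboard tile might display:
# patient count and cancer rate by smoking status.
summary = pd.read_sql_query(
    """
    SELECT smoking,
           COUNT(*) AS patients,
           AVG(lung_cancer) AS cancer_rate
    FROM lung_cancer_patients
    GROUP BY smoking
    ORDER BY smoking
    """,
    conn,
)
conn.close()
print(summary)
```

Pre-aggregating in SQL keeps the dashboard queries simple, although Power BI can also compute such measures itself from the raw table.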

Insights and Applications

The insights derived from this analysis can offer substantial benefits to lung cancer prediction, public health strategies, and individual health awareness:

  • Enhanced Screening Programs: Developing targeted screening programs to identify high-risk individuals.
  • Informed Health Strategies: Informing public health policies and initiatives to combat lung cancer.
  • Public Awareness: Raising awareness about the risk factors associated with lung cancer and promoting preventive measures.
  • Personalized Health Plans: Helping healthcare providers develop personalized health plans for patients at risk of lung cancer.


Analyzing the Lung Cancer Prediction dataset clarifies how the recorded risk factors and symptoms relate to lung cancer. The workflow, from initial exploration and cleaning through machine learning modeling and visualization, uncovers actionable insights and demonstrates how data-driven decision-making can improve lung cancer prediction and management.

Whether you’re a data analyst, healthcare provider, or public health official, exploring such datasets offers invaluable opportunities to understand and improve the way we manage and prevent lung cancer.

Frequently Asked Questions

1. What is the Lung Cancer Prediction dataset, and why is it significant?

The Lung Cancer Prediction dataset contains patient-level information on factors associated with lung cancer, such as age, gender, smoking, and symptoms. It is significant because analyzing it can reveal risk factors and support prediction, which in turn can help improve healthcare outcomes and inform public health policies.

2. What tools and technologies are used for analyzing the Lung Cancer Prediction dataset?

Tools commonly used include:

  • Python: For data cleaning, analysis (using libraries like pandas, numpy), and visualization (matplotlib, seaborn).
  • SQL: To manage and query data when working with large datasets or relational databases.
  • Power BI or Tableau: For creating interactive visualizations and dashboards to present insights.
  • Google Sheets: For preliminary data exploration and basic analysis.

3. How can insights from analyzing the Lung Cancer Prediction dataset benefit lung cancer care?

Insights derived can help:

  • Enhance Screening Programs: Develop targeted screening programs to identify high-risk individuals.
  • Inform Health Strategies: Provide data to inform public health policies and initiatives.
  • Raise Public Awareness: Educate the public on risk factors associated with lung cancer and promote preventive measures.
  • Develop Personalized Health Plans: Help healthcare providers develop personalized health plans for patients at risk of lung cancer.