Vehicle Fuel Economy
Transportation energy use affects operating cost, petroleum consumption, and greenhouse gas emissions. Vehicle design choices such as drivetrain, transmission, engine displacement, and fuel type influence efficiency in ways that are both physical and behavioral. Because the fuel economy data spans conventional gasoline cars, diesel vehicles, hybrids, and battery electric vehicles, it provides a useful problem statement that combines data engineering, machine learning, and AI deployment.
The fuel economy data set in vehicles.csv contains 49,846 records, 84 columns, and model years from 1984 to 2026. The data includes categorical fields such as make, model, drive, transmission, and vehicle class, along with continuous measurements such as combined MPG, annual fuel cost, and tailpipe CO2 emissions. Some fields are sparse, some are duplicated across trims, and several variables are derived from one another, so thoughtful data cleansing and leakage prevention are essential.
Objective: Develop an end-to-end machine learning solution using the fuel economy data. Cleanse the data set, create useful visualizations, engineer features, classify whether a vehicle is high efficiency, regress fuel efficiency and tailpipe CO2 emissions, verify a vehicle image with YOLO using COCO object ID 2 (car), and design a Streamlit app with local Retrieval-Augmented Generation (RAG) using Ollama through the Python ollama package. Randomly split the data into a train (80%) and test (20%) set. Discuss model performance on the training and test data, rank the most important features, and submit source code plus a short summary memo (max 2 pages).
Classification: Use at least 2 classification methods to predict whether a vehicle is high efficiency. A suggested target is a binary label where high efficiency means combined MPG >= 24. Evaluate the methods with accuracy, F1 score, and a confusion matrix. Use SelectKBest to rank the candidate predictors for classification.
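The classification workflow above can be sketched end to end. This is a minimal illustration on synthetic stand-in data (the column names `displ`, `cylinders`, and `year` mirror the real data set, but the values and the MPG relationship are fabricated for the example); swap in the cleansed `vehicles.csv` features for real results.

```python
# Sketch of the suggested classification workflow on synthetic stand-in data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({'displ': rng.uniform(1.0, 6.0, n),
                  'cylinders': rng.integers(3, 9, n),
                  'year': rng.integers(1984, 2027, n)})
# synthetic MPG: smaller engines tend to exceed the 24 MPG threshold
mpg = 45 - 4*X['displ'] - 1.5*X['cylinders'] + rng.normal(0, 2, n)
y = (mpg >= 24).astype(int)  # binary high-efficiency label

# 80% train / 20% test split as required
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)

# rank the candidate predictors with SelectKBest (ANOVA F-statistic)
skb = SelectKBest(f_classif, k='all').fit(Xtr, ytr)
print(dict(zip(X.columns, skb.scores_.round(1))))

# two classification methods, evaluated with accuracy, F1, confusion matrix
for model in (LogisticRegression(max_iter=5000),
              RandomForestClassifier(random_state=0)):
    model.fit(Xtr, ytr)
    yp = model.predict(Xte)
    print(type(model).__name__,
          'acc=%.2f' % accuracy_score(yte, yp),
          'F1=%.2f' % f1_score(yte, yp))
    print(confusion_matrix(yte, yp))
```

The same pattern transfers directly once the real categorical fields are encoded (for example with one-hot encoding) and leakage-prone columns are dropped.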
Regression: Use at least 2 regression methods to predict:
- Combined fuel economy (comb08)
- Tailpipe CO2 emissions (co2TailpipeGpm)
Recommended regression methods include Linear Regression, Random Forest, Gradient Boosting, or XGBoost. Use SelectKBest to rank the strongest MPG predictors for regression.
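A matching regression sketch, again on synthetic stand-in data with fabricated values (the real targets are `comb08` and `co2TailpipeGpm` from `vehicles.csv`):

```python
# Sketch of the regression workflow on synthetic stand-in data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({'displ': rng.uniform(1.0, 6.0, n),
                  'cylinders': rng.integers(3, 9, n),
                  'year': rng.integers(1984, 2027, n)})
# synthetic combined MPG target with noise
comb08 = 45 - 4*X['displ'] - 1.5*X['cylinders'] + rng.normal(0, 2, n)

Xtr, Xte, ytr, yte = train_test_split(X, comb08, test_size=0.2,
                                      random_state=0)

# rank MPG predictors by univariate F-statistic
skb = SelectKBest(f_regression, k='all').fit(Xtr, ytr)
print(dict(zip(X.columns, skb.scores_.round(1))))

# two regression methods, compared on held-out R^2
for model in (LinearRegression(),
              GradientBoostingRegressor(random_state=0)):
    model.fit(Xtr, ytr)
    print(type(model).__name__,
          'R2=%.3f' % r2_score(yte, model.predict(Xte)))
```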

Computer Vision / LLM / Deployment:
- Use a pretrained YOLO model and confirm at least one detection with COCO class ID 2 (car). Use the photo above to test.
- Build a Streamlit app that selects from year, make, and model in the data set. If multiple trims remain after the first three selections, add a final trim selector.
- Predict the selected vehicle's combined MPG and tailpipe CO2 emissions.
- Show the measured combined MPG relative to the predicted combined MPG on the dashboard for the selected vehicle.
- Build a local RAG workflow from the data set and use the Python ollama package with a local model to answer questions about the selected vehicle and its fuel efficiency.
- Show the retrieved context used by the LLM.
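The YOLO verification step can be sketched as below, assuming the `ultralytics` package is installed and a local test image exists (`car.jpg` is a placeholder filename, and `yolov8n.pt` is one choice of pretrained weights). The import is deferred inside the function so the module loads even without the package.

```python
# COCO class ID 2 corresponds to 'car'
CAR_CLASS_ID = 2

def detect_car(image_path, model_name='yolov8n.pt'):
    """Return True if a pretrained YOLO model detects at least one car."""
    from ultralytics import YOLO   # deferred: requires the ultralytics package
    model = YOLO(model_name)       # downloads pretrained weights if needed
    results = model(image_path)
    classes = results[0].boxes.cls.tolist()  # detected COCO class IDs
    return CAR_CLASS_ID in classes

# example usage (uncomment with a real image path):
# if detect_car('car.jpg'):
#     print('Car detected (COCO class 2)')
```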
Data: Fuel economy data from the U.S. Department of Energy and the Environmental Protection Agency is available from FuelEconomy.gov.
```python
import pandas as pd

# pandas reads the zipped CSV directly from the URL
url = 'http://apmonitor.com/pds/uploads/Main/'
file = 'vehicles.zip'
data = pd.read_csv(url+file)
data.head()
```
| Column | Description |
|---|---|
| year | Model year |
| make | Vehicle manufacturer |
| model | Vehicle model name |
| VClass | Vehicle class |
| drive | Drive configuration |
| trany | Transmission description |
| fuelType1 | Primary fuel type |
| cylinders | Number of cylinders |
| displ | Engine displacement in liters |
| comb08 | Combined fuel economy (MPG) |
| fuelCost08 | Annual fuel cost estimate (USD) |
| co2TailpipeGpm | Tailpipe CO2 emissions (g/mi) |
The data set requires cleansing and feature engineering before modeling. Some rows have missing engine or drivetrain information, multiple trims share the same year / make / model combination, and several response variables are directly related to one another. Carefully justify which fields are retained for classification and regression, and discuss potential data leakage when using derived targets such as fuel cost, petroleum barrels, and emissions.
Potential predictive features include model year or vehicle age, make, vehicle class, grouped fuel type, grouped drivetrain, grouped transmission type, gear count extracted from the transmission label, number of cylinders, engine displacement, and engineered binary indicators such as turbocharger, supercharger, start-stop, electrified, and AWD / 4WD flags. Rank these features with SelectKBest for both the classification target and the MPG regression target to highlight which variables are most informative.
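Extracting the gear count and a transmission-type flag from the `trany` text field can be sketched as follows; the label patterns shown (for example `'Automatic 6-spd'`) are typical of the data set, but verify them against the actual values before relying on the regex.

```python
# Sketch of feature engineering on the trany (transmission) text field
import re
import pandas as pd

trany = pd.Series(['Automatic 6-spd', 'Manual 5-spd',
                   'Automatic (S8)', 'Automatic (variable gear ratios)'])

def gear_count(label):
    """Extract a gear count from a transmission label, if present."""
    m = re.search(r'(\d+)', str(label))
    return int(m.group(1)) if m else None

gears = trany.map(gear_count)          # 6, 5, 8, None (CVT has no count)
is_auto = trany.str.startswith('Automatic').astype(int)
print(pd.DataFrame({'trany': trany, 'gears': gears, 'auto': is_auto}))
```

The same string-matching approach extends to the turbo, supercharger, start-stop, and AWD / 4WD indicators from other text fields.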
Suggestions
- Inspect the target distributions before fitting regressors. If your MPG model produces unrealistic values, first verify that the label range is reasonable and that bad or unsupported rows are removed from the regression training set.
- Be careful with electrified vehicles because they live on a different MPG / MPGe scale and often have structurally missing engine fields such as cylinders or displacement. If a single regression model performs poorly, separate electrified vehicles from internal combustion vehicles or engineer powertrain-aware features.
- Avoid leakage by excluding direct or derived target variables when appropriate. For example, fuel cost, petroleum barrels, and emissions can leak information when predicting MPG or related responses.
- Duplicate trims may share the same year, make, and model. Make sure the app handles this ambiguity explicitly.
- Use the measured-versus-predicted comparison on the dashboard as a debugging tool. If the delta is consistently biased for one vehicle type, revisit the feature engineering or segmentation strategy.
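Tying the retrieval requirement together, a minimal RAG loop might look like the sketch below. The keyword-overlap retrieval is a deliberately simple stand-in (an embedding-based retriever is a natural upgrade), and the `ollama.chat` call assumes a running local Ollama server with `llama3.2` as a placeholder model name.

```python
# Minimal keyword-retrieval RAG sketch over rows of vehicle text
def retrieve(rows, query, k=3):
    """Score rows by keyword overlap with the query; return the top k."""
    terms = set(query.lower().split())
    scored = sorted(rows, key=lambda r: -len(terms & set(r.lower().split())))
    return scored[:k]

def answer(rows, query):
    """Retrieve context, then ask a local model via the ollama package."""
    import ollama  # deferred: requires the ollama package and a local server
    context = '\n'.join(retrieve(rows, query))
    prompt = f'Context:\n{context}\n\nQuestion: {query}'
    reply = ollama.chat(model='llama3.2',
                        messages=[{'role': 'user', 'content': prompt}])
    # return the context too, so the app can display what the LLM saw
    return context, reply['message']['content']

rows = ['2021 Toyota Prius combined 56 MPG',
        '2021 Ford F150 combined 20 MPG',
        '1995 Honda Civic combined 39 MPG']
print(retrieve(rows, '2021 Toyota Prius MPG', k=1))
```

Returning the retrieved context alongside the LLM reply satisfies the requirement to show the context used by the model on the dashboard.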
Deliverables
- Completed Jupyter notebook with analysis, figures, and code
- Streamlit application
- Short summary memo (max 2 pages)
- Run instructions for the local Ollama model and Streamlit app
Starter Files
References
- U.S. Department of Energy / Environmental Protection Agency. FuelEconomy.gov.
- Ultralytics. YOLO Documentation.
- Ollama. Run local large language models.