What are the factors that define the price of a car?
Find out how the different features influence the price of a car
Introduction
When buying a car, every client faces the problem that prices differ strongly even within the same category of vehicles. So what is causing that variation?
Today’s market offers an enormous variety. Vehicles have become complex technical systems containing multiple subsystems that offer different features. Additionally, clients can customize some of those features. On top of that, we also need to consider advertising campaigns and the reputation that some manufacturers have.
Although a car has a huge variety of features, there are some basic ones that apply to both old and new models. Their effect on the price is analyzed in this post. We will try to answer the following three questions:
- Which factors or basic features mainly influence the price of a car?
- How much does the manufacturer’s name/brand affect the price of a car, and is it possible to group car makers according to price?
- What should the price of a conventional middle class family car be according to the model created?
To answer those questions, we will examine some car sales data visually and mathematically, and at the end we will build a machine learning regression model that uses this data to predict the price of a vehicle with specific features.
Data overview
The data was taken from Kaggle and originally stems from “Edmunds.com Inc.”. It comprises car prices from 1990 to 2017. Overall there are 16 columns and 11914 rows containing both numerical and categorical data. The target variable that we aim to analyze here is the price of a car, represented by the abbreviation “MSRP” (manufacturer’s suggested retail price).
As stated in the introduction, the data does not contain optional features but focuses on basic ones found in any conventional vehicle, such as:
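A first look at such a data set could be sketched as follows. This is only an illustration: the inline CSV is a tiny made-up stand-in for the Kaggle file, the file name `data.csv` is hypothetical, and the column names other than “MSRP” are assumptions about how the data set labels its columns.

```python
import io

import pandas as pd

# Tiny inline stand-in for the Kaggle CSV; all values are made up for illustration.
csv_text = """Make,Year,Engine Fuel Type,Engine HP,city mpg,MSRP
Volkswagen,2013,regular unleaded,170,24,20000
Chevrolet,2015,regular unleaded,138,28,19000
Lincoln,2016,premium unleaded (recommended),300,18,45000
"""

# With the real file this would be pd.read_csv("data.csv") (hypothetical file name).
cars = pd.read_csv(io.StringIO(csv_text))

print(cars.shape)               # (number of rows, number of columns)
print(cars.dtypes)              # separates numerical from categorical columns
print(cars["MSRP"].describe())  # summary statistics of the target variable
```

The `dtypes` output is a quick way to see which columns are numerical and which are categorical before any plotting.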
- Type of fuel used by the engine
- Engine power
- Transmission type
- Driven wheels
- Vehicle size and style
- Fuel consumption
This analysis focuses only on conventional vehicles driven by an internal combustion engine, as most of the basic car features described here are closely related to or influenced by the engine. The entries for electric vehicles in the data set will therefore be removed. Because of the different physics and mechanisms of those vehicles, the factors affecting their price differ strongly from those affecting conventional vehicles, and thus electric vehicles would only introduce distortion when modelling the price dependencies. A comparison between electric and conventional vehicles and their prices could itself be the topic of another analysis.
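Removing the electric vehicles could look roughly like this. The sketch assumes that the fuel type is stored in a column called `Engine Fuel Type` and that electric cars carry the label `electric`; both are assumptions, and the rows below are made up.

```python
import pandas as pd

# Made-up rows; the column name and the "electric" label are assumptions.
cars = pd.DataFrame({
    "Make": ["Volkswagen", "Tesla", "Chevrolet", "Nissan"],
    "Engine Fuel Type": ["regular unleaded", "electric", "diesel", "electric"],
    "MSRP": [20000, 70000, 25000, 30000],
})

# Mark the electric entries and keep everything else.
is_electric = cars["Engine Fuel Type"].str.lower().eq("electric")
conventional = cars[~is_electric].reset_index(drop=True)

print(conventional)
```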
Part I: Which car features influence its price?
In order to analyze the effect of the different features on the car price, we will first separate them into numerical and categorical ones. Then, using scatter plots and bar plots, we can get a rough notion of how significant the influence is and whether it needs to be taken into account further.
Let’s start with the numerical features. Their effect on the car price can be seen on the scatter plots below.
At first glance it is clear that there are some outliers which need to be taken care of. Additionally, we can see a monotonic relationship between the price and the engine power, the price and the number of engine cylinders, as well as between the price and the city mileage. The same also applies to the highway mileage, but an outlier distorts the scaling of that plot. It is also clear that “Number of Doors” cannot be used directly as a numerical feature in a regression model, as it shows no monotonic relationship with the price at all; it is better interpreted as an ordinal categorical feature. To get a better notion of the dependence between the price and the different numerical features, and to find possible interdependence among them, we can check the correlations.
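One possible way to handle both points is sketched below. The interquartile-range rule is just one common choice for outlier removal and is not necessarily what the notebook does; the helper `drop_iqr_outliers`, the column names and the rows are all assumptions made for illustration.

```python
import pandas as pd

# Made-up rows with one obviously implausible highway mileage value.
cars = pd.DataFrame({
    "highway MPG": [30, 28, 35, 32, 354, 27, 31],
    "Number of Doors": [4, 2, 4, 4, 4, 2, 3],
    "MSRP": [20000, 35000, 18000, 22000, 21000, 40000, 19000],
})


def drop_iqr_outliers(df, column, k=1.5):
    """Hypothetical helper: keep rows within k * IQR of the quartiles."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


cleaned = drop_iqr_outliers(cars, "highway MPG").copy()

# Treat the door count as an ordered category instead of a number.
door_dtype = pd.CategoricalDtype(categories=[2, 3, 4], ordered=True)
cleaned["Number of Doors"] = cleaned["Number of Doors"].astype(door_dtype)

print(cleaned)
print(cleaned.dtypes)
```

Once the outlier is gone, the scale of a scatter plot of the highway mileage is no longer dominated by a single point.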
As can be seen, ‘Engine Cylinders’ is highly correlated with many of the other numerical features, so it will be excluded. ‘highway MPG’ and ‘city mpg’ are also highly correlated with each other, so ‘highway MPG’ will be excluded (‘city mpg’ has the higher correlation coefficient with the target). Thus, among the numerical features, the engine power, model year and city mileage have the most significant influence on the price.
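The correlation check can be sketched as follows. The numbers are made up, so the coefficients printed here say nothing about the real data; the column names are assumptions, and the two dropped columns simply mirror the decision described above.

```python
import pandas as pd

# Made-up numerical features; only the mechanics of the check matter here.
cars = pd.DataFrame({
    "Engine HP": [150, 200, 300, 400, 120, 250],
    "Engine Cylinders": [4, 4, 6, 8, 4, 6],
    "highway MPG": [36, 32, 26, 22, 40, 28],
    "city mpg": [27, 24, 18, 15, 31, 20],
    "MSRP": [20000, 28000, 45000, 80000, 17000, 38000],
})

# Pairwise Pearson correlation between all numerical columns.
corr = cars.corr()
print(corr.round(2))

# Strength of the relationship between each feature and the target.
target_corr = corr["MSRP"].drop("MSRP").abs()
print(target_corr[["highway MPG", "city mpg"]])

# Drop the redundant features named in the text, plus the target itself.
features = cars.drop(columns=["Engine Cylinders", "highway MPG", "MSRP"])
print(list(features.columns))
```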
The next thing to check is the categorical features. We will use bar diagrams for that purpose.
Comparing the different plots, we can see significant variability in the price in all cases. We therefore need to consider all of these features, as they clearly have an effect.
The data analysis shows that, among all basic car features, the ones that mainly influence the price are:
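The numbers behind such bar diagrams are simply the mean price per category. A minimal sketch, assuming column names like `Transmission Type` and `Driven_Wheels` and using made-up rows:

```python
import pandas as pd

# Made-up rows; the column names and category labels are assumptions.
cars = pd.DataFrame({
    "Transmission Type": ["MANUAL", "AUTOMATIC", "AUTOMATIC", "MANUAL", "AUTOMATIC"],
    "Driven_Wheels": [
        "front wheel drive", "rear wheel drive", "all wheel drive",
        "front wheel drive", "rear wheel drive",
    ],
    "MSRP": [20000, 45000, 38000, 18000, 55000],
})

categorical_columns = ["Transmission Type", "Driven_Wheels"]

# Mean price per category, one Series per categorical feature.
mean_price = {
    col: cars.groupby(col)["MSRP"].mean().sort_values()
    for col in categorical_columns
}

for col, means in mean_price.items():
    print(means, end="\n\n")
    # means.plot.bar() would draw the corresponding bar diagram (needs matplotlib).
```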
- Engine horse power
- Year of the model
- Car efficiency — city mpg
- Fuel type used
- Transmission type
- Driven wheels
- Number of doors
- Vehicle size and style
Part II: How much does the manufacturer’s name determine the selling price?
In determining the influence of the car make we will exclude the different models. It would be impractical to use them, as there are literally thousands of different cases, and that would not give a valuable statistical insight. We will create a bar plot of the different car makers, depicting the average price.
From the diagram above it is clear that the mean prices of the different car makers deviate strongly from each other. It is also clear that prices vary enormously within the same manufacturer (e.g. ‘Genesis’, ‘Porsche’, ‘Alfa Romeo’ and ‘Lotus’). So the car make does have a significant influence on the price and should be taken into account when modelling.
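Both observations, the differing mean prices and the spread within one manufacturer, come from a simple aggregation per make. A sketch with made-up prices (the `Make` column name is an assumption):

```python
import pandas as pd

# Made-up prices; only the aggregation pattern is of interest.
cars = pd.DataFrame({
    "Make": ["Volkswagen", "Volkswagen", "Porsche", "Porsche", "Chevrolet", "Chevrolet"],
    "MSRP": [20000, 24000, 60000, 180000, 19000, 25000],
})

# Mean price, standard deviation and number of entries per car maker.
by_make = (
    cars.groupby("Make")["MSRP"]
    .agg(["mean", "std", "count"])
    .sort_values("mean")
)

print(by_make)
```

The `std` column corresponds to the deviation within one manufacturer, which could be drawn as error bars on the bar plot.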
The brand of the car maker plays a role in price formation. It must be noted, though, that the different producers can be grouped into several classes:
- Lower middle class producers
- Upper middle class producers
- High class luxury producers
As can be seen from the bar diagram of the price dependency on the car maker, the price comparison differs greatly between e.g. “Volkswagen” <-> “Chevrolet” and “Volkswagen” <-> “Lincoln”. In the first case the average price is nearly the same, whereas in the second case there is a substantial difference in the average price (more than 10000). That implies that within the same car maker class the price difference is not so big, but it becomes significant when moving to another car maker class.
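Such a grouping could be derived from the mean price per maker, for example by binning. In the sketch below both the mean prices and the class thresholds are hypothetical values chosen only to illustrate the idea; they are not taken from the data set.

```python
import pandas as pd

# Hypothetical mean prices per maker (not the values from the data set).
mean_price_by_make = pd.Series(
    {"Chevrolet": 28000, "Volkswagen": 28500, "Lincoln": 42000, "Porsche": 100000}
)

# Hypothetical price thresholds separating the three classes.
bins = [0, 35000, 60000, float("inf")]
labels = ["lower middle class", "upper middle class", "high class luxury"]

# Assign each maker to the class whose price interval contains its mean price.
make_class = pd.cut(mean_price_by_make, bins=bins, labels=labels)

print(make_class)
```

The resulting class label can then replace the individual maker name as a feature, which keeps the number of categories small.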
Part III: Can we predict the price of a middle class vehicle using the statistical data from the sample?
We can analyze the statistics of the sample and use the data to build a model. When entering the parameters of a typical middle class vehicle as input to that model, we should be able to predict its price. That predicted price would refer to the “system conditions” in which the statistical samples were gathered.
For the price prediction of a middle class family car we’ve chosen the Volkswagen Golf VII. It is one of the best-selling family cars of the lower middle class in the European market. It should be noted that the characteristics of the car are the same as those of the Skoda Octavia III, which won the “Best and most safe family car of 2013” prize. We’ve chosen the Volkswagen Golf VII simply because Skoda as a car maker is missing from the data set, but both belong to the same corporate group.
Specifications:
- Engine HP: 110 kW (147 HP)
- Vehicle size: middle
- Vehicle style: hatchback
- Make: Volkswagen
- Engine fuel: regular unleaded
- city mpg: 37
- Transmission: manual
- Number of doors: 4
- Model year: 2013
According to the model, such a car should cost around 20,000 dollars in the US market, which can also be seen at https://cars.usnews.com/cars-trucks/volkswagen/golf.
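The mechanics of fitting a model and querying it for one specific car can be sketched as follows. Plain least squares via NumPy stands in here for whatever regressor the notebook actually uses, only three of the features are included, and the training rows are synthetic (generated from a made-up linear rule), so the resulting number is not the prediction reported above. The helper `encode` and the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic training rows, generated from a made-up linear pricing rule.
train = pd.DataFrame({
    "Engine HP": [150, 200, 300, 120, 250, 180],
    "Year": [2013, 2015, 2016, 2012, 2017, 2014],
    "Make": ["Volkswagen", "Chevrolet", "Lincoln", "Chevrolet", "Lincoln", "Volkswagen"],
    "MSRP": [19500, 24500, 45000, 15000, 40500, 23000],
})


def encode(df, columns=None):
    """Hypothetical helper: one-hot encode 'Make' and add an intercept column."""
    # For training data the first dummy is dropped to avoid redundant columns.
    X = pd.get_dummies(df, columns=["Make"], drop_first=columns is None).astype(float)
    X.insert(0, "intercept", 1.0)
    # Align new rows with the training columns; missing dummies become 0.
    return X if columns is None else X.reindex(columns=columns, fill_value=0.0)


X_train = encode(train.drop(columns="MSRP"))
y_train = train["MSRP"].to_numpy(dtype=float)

# Ordinary least squares fit of the linear model.
coef, *_ = np.linalg.lstsq(X_train.to_numpy(), y_train, rcond=None)

# The car we want a price for, described with the same feature names.
new_car = pd.DataFrame([{"Engine HP": 147, "Year": 2013, "Make": "Volkswagen"}])
predicted = (encode(new_car, X_train.columns).to_numpy() @ coef)[0]

print(round(predicted))
```

Aligning the new row with the training columns matters: without it, the one-hot columns of a single car would not match the columns the coefficients were fitted on.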
Conclusion
There are some basic car features which determine a vehicle’s price. After analyzing the gathered data, we can conclude that, in addition to the raw technical specifications, the car brand class should also be taken into account when buying a vehicle. For the same characteristics, switching to a different car maker can save a couple of thousand dollars.
Under that GitHub link you can find the whole Python notebook with the analysis.