New York Property Sales Part 2 – Nauman Yawar Butt
This part covers the machine learning side of our project: how we built a model to predict the sale price of properties in New York City and evaluated it on a held-out test set.
Data set: NYC Property Sales
The goal is to build a model that an interested buyer or seller can use to find the expected price of a property by entering details such as its square footage and location.
We use columns from the cleaned data frame (from Part 1) as features for the training and test data.
The overall data set is split into training and test data in an 8:2 ratio. Three features hold continuous (non-categorical) data:
- LAND SQUARE FEET (with range from 33 to 4,252,327)
- GROSS SQUARE FEET (with range from 0 to 3,750,565)
- YEAR BUILT (with range from 1800 to 2017)
The target of the machine learning model is SALE PRICE. To train the model, we convert the three features above and the SALE PRICE target into categorical data. The new classes for the features and target are:
- LAND SQUARE FEET (classes 0 to 4,252; each class spans 1,000 square feet)
- GROSS SQUARE FEET (classes 0 to 3,750; each class spans 1,000 square feet)
- YEAR BUILT (classes 0 to 4; each class spans 50 years)
- SALE PRICE (classes 0 to 2,210; each class spans 1 million dollars)
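The class conversion above can be sketched in plain Python as integer division by the class width. This is a minimal illustration, not the project's actual code: the helper names (`to_class`, `encode_row`, `price_range`) are hypothetical, and counting the YEAR BUILT classes from 1800 is an assumption inferred from the stated ranges (1800 to 2017 mapping to classes 0 to 4).

```python
def to_class(value, width, origin=0):
    """Map a continuous value to an integer class of the given width."""
    return int((value - origin) // width)


def encode_row(land_sqft, gross_sqft, year_built, sale_price):
    """Convert one property's continuous values into class labels."""
    return {
        "LAND SQUARE FEET": to_class(land_sqft, 1_000),
        "GROSS SQUARE FEET": to_class(gross_sqft, 1_000),
        # Assumption: year classes are counted from 1800, the earliest year
        "YEAR BUILT": to_class(year_built, 50, origin=1800),
        "SALE PRICE": to_class(sale_price, 1_000_000),
    }


def price_range(price_class):
    """Dollar interval [low, high) covered by a SALE PRICE class."""
    low = price_class * 1_000_000
    return low, low + 1_000_000


if __name__ == "__main__":
    print(encode_row(2_500, 1_800, 1925, 1_350_000))
    print(price_range(1))
```

Because the target is a class rather than a dollar amount, a prediction is best read as a one-million-dollar range, which is what `price_range` returns for a given class.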
We used Apache Spark, a general-purpose distributed data processing engine, in the Databricks Python environment. We chose the tree depth by checking the accuracy of a decision tree classifier at each depth from 1 to 20. The optimal depth came out to be 15, with an accuracy of approximately 78%.
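The depth search described above might look roughly like the sketch below. It is an illustration under assumptions rather than the code that was actually run: `sweep_depths`, `make_spark_scorer`, and `pre_stages` are hypothetical names, and the Spark scorer assumes the preprocessing stages produce an "Indexed" label column and a "features" vector column.

```python
def sweep_depths(score_fn, depths=range(1, 21)):
    """Score every candidate depth and return (best_depth, scores)."""
    scores = {d: score_fn(d) for d in depths}
    # max() keeps the first maximum it meets, so ties go to the shallower tree
    best = max(scores, key=scores.get)
    return best, scores


def make_spark_scorer(train, test, pre_stages, label_col="Indexed"):
    """Build a score_fn that fits a Spark decision tree of a given depth.

    pre_stages (hypothetical name) is the list of preprocessing stages,
    e.g. the label indexer and the feature assembler.
    """
    # Imported lazily so sweep_depths can be used without Spark installed
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    evaluator = MulticlassClassificationEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="accuracy")

    def score(depth):
        dt = DecisionTreeClassifier(
            labelCol=label_col, featuresCol="features", maxDepth=depth)
        model = Pipeline(stages=list(pre_stages) + [dt]).fit(train)
        # evaluate() returns accuracy as a fraction between 0 and 1
        return evaluator.evaluate(model.transform(test))

    return score
```

In a notebook this could be called as `best, scores = sweep_depths(make_spark_scorer(train, test, [Indexer, vect]))`, reusing the same train/test split for every depth so the accuracies are comparable.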
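from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorIndexer

# Defined here but not added to the pipeline below; the tree reads the raw "features" column
featureIndexer = VectorIndexer(inputCol="features", outputCol="IndexedFeatures")
# 80/20 split into training and test data
(train, test) = df.randomSplit([0.8, 0.2])
dt = DecisionTreeClassifier(labelCol="Indexed", featuresCol="features", maxDepth=15)
# Indexer and vect are assumed to be defined earlier in the notebook
# (the stages that produce the "Indexed" label and the "features" vector)
pipeline = Pipeline(stages=[Indexer, vect, dt])
model = pipeline.fit(train)
predictions = model.transform(test)
predictions.select("prediction", "Indexed", "features").show(5)
evaluator = MulticlassClassificationEvaluator(
    labelCol="Indexed", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
# evaluate() returns a fraction between 0 and 1, so scale it to a percentage
print("Accuracy = %.2f%%" % (accuracy * 100))
print("Test Error = %.2f%%" % ((1.0 - accuracy) * 100))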
Accuracy = 78.37%
Test Error = 21.63%
The fitted decision tree has depth 15 with 6695 nodes.