NLP — Feedback on English Language Learning
Objective: To use machine learning to provide more accurate feedback on English language development and expedite the grading cycle for teachers.
Table of Contents
- Understanding of the problem
- Exploratory Data Analysis
- Data Cleaning and Processing
- Feature Engineering
- Modelling and Model Evaluation
- Make Predictions on Test Dataset and Make Submission to Kaggle
- Possible Future Improvements
1. Understanding of the Problem
Dataset: The dataset presented here (the ELLIPSE corpus) comprises argumentative essays written by 8th-12th grade English Language Learners (ELLs). The essays have been scored according to six analytic measures: cohesion, syntax, vocabulary, phraseology, grammar, and conventions. Each measure represents a component of proficiency in essay writing, with greater scores corresponding to greater proficiency in that measure. The scores range from 1.0 to 5.0 in increments of 0.5.
Analysis
Are we able to predict these six analytic measures based on this dataset?
- How are we going to define the target?
Treat each measure as independent, model each one as a separate target, and combine the results at the end (a short sketch of this approach follows the list below)
- What type of regression should we use for each category?
We will use the following regression methods and compare their scores for each:
- Linear Regression
- Ridge Regression
- Lasso Regression
- Random Forest
- XG Boost
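As a rough preview of the per-target approach mentioned above, here is a minimal sketch; the helper names (fit_per_target, predict_per_target, make_model) are placeholders, and the actual pipeline is built step by step in the sections below.
import pandas as pd

# Minimal sketch of the per-target strategy: each measure is treated as an
# independent regression problem and the six predictions are combined at the end.
# (Placeholder helpers; the real pipeline is developed step by step below.)
MEASURES = ['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions']

def fit_per_target(X_train_vec, train_df, make_model):
    """Fit one regressor per measure and return them keyed by measure name."""
    return {m: make_model().fit(X_train_vec, train_df[m]) for m in MEASURES}

def predict_per_target(models, X_test_vec):
    """Predict each measure separately and combine the columns into one DataFrame."""
    return pd.DataFrame({m: models[m].predict(X_test_vec) for m in MEASURES})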
First, we import all the relevant libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import sklearn
import nltk
import string, re
import matplotlib.pyplot as plt
import seaborn as sns
# nltk.download("all")
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
2. Exploratory Data Analysis
First, we read the data from the source using the read_csv function, and look at the sample data:
train = pd.read_csv('../input/feedback-prize-english-language-learning/train.csv')
test = pd.read_csv('../input/feedback-prize-english-language-learning/test.csv')
train.sample(5)
text_id full_text cohesion syntax vocabulary phraseology grammar conventions
2669 C537E278BAE6 No they shouldn't add an hour more because we ... 2.5 2.5 3.0 2.5 2.5 3.0
3658 F5B510BCCA2D Enjoyable activity is a good thing when it com... 3.5 3.5 3.0 3.0 3.5 3.0
2154 A0D47C0DD67F Individuality is the idea of freedom of though... 3.0 3.0 3.0 2.0 2.0 2.0
3139 DD91A86810BD March 2020/10\n\nGeneric_Name\n\nMs. Generic_N... 2.5 2.5 3.0 3.0 3.0 3.0
3387 E91F2E8986B2 If you master a subject you like and dont try ... 3.0 3.5 3.5 3.5 3.0
We then check the training dataset info to see whether there are any missing values:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3911 entries, 0 to 3910
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 text_id 3911 non-null object
1 full_text 3911 non-null object
2 cohesion 3911 non-null float64
3 syntax 3911 non-null float64
4 vocabulary 3911 non-null float64
5 phraseology 3911 non-null float64
6 grammar 3911 non-null float64
7 conventions 3911 non-null float64
dtypes: float64(6), object(2)
memory usage: 244.6+ KB
Based on the info above, there are no missing values in any column.
Now, we can split off the predictor column and a separate target column for each measure.
pred = train['full_text']
target1 = train['cohesion']
target2 = train['syntax']
target3 = train['vocabulary']
target4 = train['phraseology']
target5 = train['grammar']
target6 = train['conventions']
Next, we take a look at the distribution of each target:
# Plot a histogram and density estimate for each of the six target measures
measures = ['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions']
fig, ax = plt.subplots(3, 2)
fig.set_size_inches(16, 15)
for i, measure in enumerate(measures):
    axis = ax[i // 2][i % 2]
    axis.set_ylabel("Count")
    axis.set_xlabel(measure)
    axis.set_title(f"Histogram and Density Plot of Response Variable '{measure}'")
    sns.histplot(train[measure], kde=True, ax=axis)
print("Untransformed Skew:", train['cohesion'].skew())
print("Untransformed Skew:", train['syntax'].skew())
print("Untransformed Skew:", train['vocabulary'].skew())
print("Untransformed Skew:", train['phraseology'].skew())
print("Untransformed Skew:", train['grammar'].skew())
print("Untransformed Skew:", train['conventions'].skew())
Untransformed Skew: 0.03532450308440966
Untransformed Skew: 0.1256697860122495
Untransformed Skew: 0.2246989736265894
Untransformed Skew: 0.06700892334988967
Untransformed Skew: 0.20154509709382515
Untransformed Skew: 0.07694042278988233
Based on the histograms above and the skew calculations for each target variable, we can see that the variables are fairly normally distributed. Hence, we will not apply any transformation, which keeps the process simple.
3. Data Cleaning and Processing
Next, we define functions to process the words to a consistent and usable form.
We consider the following common preprocessing techniques for NLP:
- remove punctuation
- lower case
- tokenization
- remove stopwords
- lemmatization / stemming
The first 3 steps (removing punctuation, lowercasing and tokenization) should not affect any of the 6 measures. However, the last 2 (stopword removal and lemmatization / stemming) could affect targets such as cohesion, vocabulary and grammar. Hence, we will not remove stopwords or apply lemmatization / stemming to our predictors.
# Define the functions for various text preprocessing methods
def remove_punc(data_string):
    # Remove punctuation and newline characters (raw string avoids escape warnings)
    return re.sub(r"[^\s\w]|\n", "", data_string)

def lower_case(data_string):
    return data_string.lower()

def tokenizer(data_string):
    return word_tokenize(data_string)

def word_pro(data_string):
    # Chain the steps: remove punctuation, lowercase, tokenize, then rejoin
    return ' '.join(tokenizer(lower_case(remove_punc(data_string))))

# Stopword removal and lemmatization are kept for reference but not used (see above):
# def rm_stopwords(tokenized_list):
#     stop_words = stopwords.words('english')
#     res = []
#     for i in tokenized_list:
#         if i not in stop_words:
#             res.append(i)
#     return res

# def lemm(tokenized_list):
#     lemmer = WordNetLemmatizer()
#     lem_words = []
#     for word in tokenized_list:
#         lem_words.append(lemmer.lemmatize(word))
#     return lem_words

# def word_pro(data_string):
#     return ' '.join(lemm(rm_stopwords(tokenizer(lower_case(remove_punc(data_string))))))
Let's first take a look at the pred column before it has been preprocessed with the word_pro function:
pred
0 I think that students would benefit from learn...
1 When a problem is a change you have to let it ...
2 Dear, Principal\n\nIf u change the school poli...
3 The best time in life is when you become yours...
4 Small act of kindness can impact in other peop...
...
3906 I believe using cellphones in class for educat...
3907 Working alone, students do not have to argue w...
3908 "A problem is a chance for you to do your best...
3909 Many people disagree with Albert Schweitzer's ...
3910 Do you think that failure is the main thing fo...
Name: full_text, Length: 3911, dtype: object
Next, we process the essays in the predictors using our word_pro function:
def word_pro_trans(pred):
    # Apply word_pro to every essay in the Series (modifies the Series in place)
    i = 0
    while i < len(pred):
        pred[i] = word_pro(pred[i])
        i += 1
    return pred
Now, let's take a look at the pred column after it has been preprocessed with the word_pro function. With the preprocessing done, the data are ready for the next step.
word_pro_trans(pred)
pred
0 i think that students would benefit from learn...
1 when a problem is a change you have to let it ...
2 dear principalif u change the school policy of...
3 the best time in life is when you become yours...
4 small act of kindness can impact in other peop...
...
3906 i believe using cellphones in class for educat...
3907 working alone students do not have to argue wi...
3908 a problem is a chance for you to do your best ...
3909 many people disagree with albert schweitzers q...
3910 do you think that failure is the main thing fo...
Name: full_text, Length: 3911, dtype: object
4. Feature Engineering
We now define a vectorizer using the word_pro text processor as below:
def vector(text_preprocessor_fn):
    return TfidfVectorizer(max_df=0.7, min_df=5, preprocessor=text_preprocessor_fn)

vectorizer = vector(word_pro)
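To make the max_df and min_df settings concrete, here is a small toy illustration with made-up sentences (and a smaller min_df than above, only because the toy corpus is tiny): terms appearing in more than 70% of documents, or in too few documents, are dropped from the vocabulary.
# Toy illustration of max_df / min_df vocabulary pruning (made-up sentences)
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the log",
    "a fish swam under the log",
]
toy_vec = TfidfVectorizer(max_df=0.7, min_df=2)  # min_df=2 only because the corpus is tiny
toy_vec.fit(toy_docs)
print(sorted(toy_vec.vocabulary_))
# -> ['on', 'sat'] : 'the' and 'log' appear in >70% of docs, one-off words are too rare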
4.1. Train, check split for Target1 — Cohesion
Next we proceed to do a train, check split for the first target variable Cohesion
X_train, X_check, y_train1, y_check1 = train_test_split(pred, target1, test_size=.2, random_state = 1)
Check the shape of the train, check datasets
print(X_train.shape, y_train1.shape)
print(X_check.shape, y_check1.shape)
(3128,) (3128,)
(783,) (783,)
Fitting the vectorizer on the training dataset:
def vec_fitter(X_train, vectorizer):
    return vectorizer.fit_transform(X_train)
X_train_vec = vec_fitter(X_train, vectorizer)
X_train_vec
<3128x4633 sparse matrix of type '<class 'numpy.float64'>'
with 370564 stored elements in Compressed Sparse Row format>
Looking at the dataframe:
vec = vectorizer.fit_transform(X_train)
# Note: use get_feature_names() instead on older scikit-learn versions
df = pd.DataFrame.sparse.from_spmatrix(vec, columns=vectorizer.get_feature_names_out())
df.sample(10)
Similarly, we need to transform the checking dataset:
def vec_trans(X_test, vectorizer):
    return vectorizer.transform(X_test)
X_check_vec = vec_trans(X_check, vectorizer)
X_check_vec
<783x4633 sparse matrix of type '<class 'numpy.float64'>'
with 93933 stored elements in Compressed Sparse Row format>
4.2. Train, check split for Target2 — Syntax
Next we proceed to do a train, check split for the target variable Syntax
X_train, X_check, y_train2, y_check2 = train_test_split(pred, target2, test_size=.2,random_state = 1)
4.3. Train, check split for Target3 — Vocabulary
Then we proceed to do a train, check split for the target variable Vocabulary
X_train, X_check, y_train3, y_check3 = train_test_split(pred, target3, test_size=.2,random_state = 1)
4.4. Train, check split for Target4 — Phraseology
Then we proceed to do a train, check split for the target variable Phraseology
X_train, X_check, y_train4, y_check4 = train_test_split(pred, target4, test_size=.2,random_state = 1)
4.5. Train, check split for Target5 — Grammar
We then proceed to do a train, check split for the target variable Grammar
X_train, X_check, y_train5, y_check5 = train_test_split(pred, target5, test_size=.2,random_state = 1)
4.6. Train, check split for Target6 — Conventions
Finally, we proceed to do a train, check split for the last target variable Conventions
X_train, X_check, y_train6, y_check6 = train_test_split(pred, target6, test_size=.2,random_state = 1)
5. Modelling and Model Evaluation
5.1 Model Selection and Evaluation (Target1 — Cohesion)
We are evaluating the r2 scores of Linear and Ridge regression for Target1 and will perform similar steps for the other 5 targets.
(Note: We also tried Lasso, Random Forest and XG Boost regression separately, but the results were worse than those of Linear / Ridge regression. Hence, they have been excluded here; a rough sketch is shown below for reference.)
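For reference only, this is roughly how the excluded models could be fit on the same vectorized features (an illustrative sketch for the first target with guessed hyperparameters, not the exact runs referred to in the note; XG Boost would additionally need the xgboost package):
# Illustrative sketch of the excluded models on the same features (not the exact runs above)
lasso1 = Lasso(alpha=0.001, max_iter=10000)  # alpha here is a guess
lasso1.fit(X_train_vec, y_train1)
print('lasso r2 (check):', r2_score(y_check1, lasso1.predict(X_check_vec)))

rf1 = RandomForestRegressor(n_estimators=100, random_state=1, n_jobs=-1)
rf1.fit(X_train_vec, y_train1)
print('random forest r2 (check):', r2_score(y_check1, rf1.predict(X_check_vec)))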
print('---------------------Linear-----------------------')
lr_reg1 = LinearRegression()
lr_reg1.fit(X_train_vec, y_train1)
lr_r2_score_train1=r2_score(y_train1, lr_reg1.predict(X_train_vec))
lr_r2_score_check1=r2_score(y_check1, lr_reg1.predict(X_check_vec))
print('lr_r2_score_train=',lr_r2_score_train1)
print('lr_r2_score_check=',lr_r2_score_check1)
print('rmse=', mean_squared_error(y_check1, lr_reg1.predict(X_check_vec), squared=False))
print('---------------------Ridge-----------------------')
ridge_reg1=GridSearchCV(Ridge(), param_grid={'alpha':np.linspace(-2.0,50.0,20)}, cv=5)
ridge_model1=ridge_reg1.fit(X_train_vec, y_train1)
print('ridge best_para', ridge_reg1.best_params_)
print('ridge best_score', ridge_reg1.best_score_)
ridge_r2_score_train1=r2_score(y_train1,ridge_model1.predict(X_train_vec))
ridge_r2_score_check1=r2_score(y_check1,ridge_model1.predict(X_check_vec))
print('ridge_r2_score_train=',ridge_r2_score_train1)
print('ridge_r2_score_check=',ridge_r2_score_check1)
print('rmse=', mean_squared_error(y_check1, ridge_model1.predict(X_check_vec), squared=False))
---------------------Linear-----------------------
lr_r2_score_train= 0.9999999999883544
lr_r2_score_check= -1.6828017064903529
rmse= 1.1149797894257567
---------------------Ridge-----------------------
ridge best_para {'alpha': 3.473684210526316}
ridge best_score 0.24978137399859368
ridge_r2_score_train= 0.43927684324891325
ridge_r2_score_check= 0.23701998762855747
rmse= 0.5946060021324642
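Sections 5.2 to 5.6 repeat the same Linear vs Ridge comparison for the remaining targets. As a sketch, those steps could be wrapped in a helper like the one below (a hypothetical function, not used in the cells that follow; the alpha grid starts at a small positive value since Ridge expects non-negative alpha):
# Hypothetical helper wrapping the repeated Linear vs Ridge comparison (not used below)
def evaluate_target(X_train_vec, X_check_vec, y_train, y_check):
    lr = LinearRegression().fit(X_train_vec, y_train)
    ridge = GridSearchCV(Ridge(), param_grid={'alpha': np.linspace(0.01, 50.0, 20)}, cv=5)
    ridge.fit(X_train_vec, y_train)
    return {
        'lr_r2_check': r2_score(y_check, lr.predict(X_check_vec)),
        'ridge_best_alpha': ridge.best_params_['alpha'],
        'ridge_r2_check': r2_score(y_check, ridge.predict(X_check_vec)),
        'ridge_rmse_check': mean_squared_error(y_check, ridge.predict(X_check_vec), squared=False),
    }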
5.2 Model Selection and Evaluation (Target2 — Syntax)
print('---------------------Linear-----------------------')
lr_reg2 = LinearRegression()
lr_reg2.fit(X_train_vec, y_train2)
lr_r2_score_train2=r2_score(y_train2, lr_reg2.predict(X_train_vec))
lr_r2_score_check2=r2_score(y_check2, lr_reg2.predict(X_check_vec))
print('lr_r2_score_train=',lr_r2_score_train2)
print('lr_r2_score_check=',lr_r2_score_check2)
print('rmse=', mean_squared_error(y_check2, lr_reg2.predict(X_check_vec), squared=False))
print('---------------------Ridge-----------------------')
ridge_reg2=GridSearchCV(Ridge(), param_grid={'alpha':np.linspace(-2.0,50.0,20)}, cv=5)
ridge_model2=ridge_reg2.fit(X_train_vec, y_train2)
print('ridge best_para', ridge_reg2.best_params_)
print('ridge best_score', ridge_reg2.best_score_)
ridge_r2_score_train2=r2_score(y_train2,ridge_model2.predict(X_train_vec))
ridge_r2_score_check2=r2_score(y_check2,ridge_model2.predict(X_check_vec))
print('ridge_r2_score_train=',ridge_r2_score_train2)
print('ridge_r2_score_check=',ridge_r2_score_check2)
print('rmse=', mean_squared_error(y_check2, ridge_model2.predict(X_check_vec), squared=False))
---------------------Linear-----------------------
lr_r2_score_train= 0.9999999999894454
lr_r2_score_check= -1.6859968519501582
rmse= 1.0601550324794395
---------------------Ridge-----------------------
ridge best_para {'alpha': 3.473684210526316}
ridge best_score 0.27281984750820276
ridge_r2_score_train= 0.45261725204715597
ridge_r2_score_check= 0.2686178392321682
rmse= 0.5532084577134734
5.3 Model Selection and Evaluation (Target3 — Vocabulary)
print('---------------------Linear-----------------------')
lr_reg3 = LinearRegression()
lr_reg3.fit(X_train_vec, y_train3)
lr_r2_score_train3=r2_score(y_train3, lr_reg3.predict(X_train_vec))
lr_r2_score_check3=r2_score(y_check3, lr_reg3.predict(X_check_vec))
print('lr_r2_score_train=',lr_r2_score_train3)
print('lr_r2_score_check=',lr_r2_score_check3)
print('rmse=', mean_squared_error(y_check3, lr_reg3.predict(X_check_vec), squared=False))
print('---------------------Ridge-----------------------')
ridge_reg3=GridSearchCV(Ridge(), param_grid={'alpha':np.linspace(-2.0,50.0,20)}, cv=5)
ridge_model3=ridge_reg3.fit(X_train_vec, y_train3)
print('ridge best_para', ridge_reg3.best_params_)
print('ridge best_score', ridge_reg3.best_score_)
ridge_r2_score_train3=r2_score(y_train3,ridge_model3.predict(X_train_vec))
ridge_r2_score_check3=r2_score(y_check3,ridge_model3.predict(X_check_vec))
print('ridge_r2_score_train=',ridge_r2_score_train3)
print('ridge_r2_score_check=',ridge_r2_score_check3)
print('rmse=', mean_squared_error(y_check3, ridge_model3.predict(X_check_vec), squared=False))
---------------------Linear-----------------------
lr_r2_score_train= 0.9999999999900201
lr_r2_score_check= -1.5661362571273072
rmse= 0.9531889473526833
---------------------Ridge-----------------------
ridge best_para {'alpha': 0.736842105263158}
ridge best_score 0.298860313512476
ridge_r2_score_train= 0.6722306700603291
ridge_r2_score_check= 0.26660766881160647
rmse= 0.5095740736117805
5.4 Model Selection and Evaluation (Target4 — Phraseology)
print('---------------------Linear-----------------------')
lr_reg4 = LinearRegression()
lr_reg4.fit(X_train_vec, y_train4)
lr_r2_score_train4=r2_score(y_train4, lr_reg4.predict(X_train_vec))
lr_r2_score_check4=r2_score(y_check4, lr_reg4.predict(X_check_vec))
print('lr_r2_score_train=',lr_r2_score_train4)
print('lr_r2_score_check=',lr_r2_score_check4)
print('rmse=', mean_squared_error(y_check4, lr_reg4.predict(X_check_vec), squared=False))
print('---------------------Ridge-----------------------')
ridge_reg4=GridSearchCV(Ridge(), param_grid={'alpha':np.linspace(-2.0,50.0,20)}, cv=5)
ridge_model4=ridge_reg4.fit(X_train_vec, y_train4)
print('ridge best_para', ridge_reg4.best_params_)
print('ridge best_score', ridge_reg4.best_score_)
ridge_r2_score_train4=r2_score(y_train4,ridge_model4.predict(X_train_vec))
ridge_r2_score_check4=r2_score(y_check4,ridge_model4.predict(X_check_vec))
print('ridge_r2_score_train=',ridge_r2_score_train4)
print('ridge_r2_score_check=',ridge_r2_score_check4)
print('rmse=', mean_squared_error(y_check4, ridge_model4.predict(X_check_vec), squared=False))
---------------------Linear-----------------------
lr_r2_score_train= 0.9999999999899288
lr_r2_score_check= -1.3227679295300807
rmse= 1.0118130933983809
---------------------Ridge-----------------------
ridge best_para {'alpha': 3.473684210526316}
ridge best_score 0.295626308190107
ridge_r2_score_train= 0.47367657568116983
ridge_r2_score_check= 0.3115300187991693
rmse= 0.5508582891743112
5.5 Model Selection and Evaluation (Target5 — Grammar)
print('---------------------Linear-----------------------')
lr_reg5 = LinearRegression()
lr_reg5.fit(X_train_vec, y_train5)
lr_r2_score_train5=r2_score(y_train5, lr_reg5.predict(X_train_vec))
lr_r2_score_check5=r2_score(y_check5, lr_reg5.predict(X_check_vec))
print('lr_r2_score_train=',lr_r2_score_train5)
print('lr_r2_score_check=',lr_r2_score_check5)
print('rmse=', mean_squared_error(y_check5, lr_reg5.predict(X_check_vec), squared=False))
print('---------------------Ridge-----------------------')
ridge_reg5=GridSearchCV(Ridge(), param_grid={'alpha':np.linspace(-2.0,50.0,20)}, cv=5)
ridge_model5=ridge_reg5.fit(X_train_vec, y_train5)
print('ridge best_para', ridge_reg5.best_params_)
print('ridge best_score', ridge_reg5.best_score_)
ridge_r2_score_train5=r2_score(y_train5,ridge_model5.predict(X_train_vec))
ridge_r2_score_check5=r2_score(y_check5,ridge_model5.predict(X_check_vec))
print('ridge_r2_score_train=',ridge_r2_score_train5)
print('ridge_r2_score_check=',ridge_r2_score_check5)
print('rmse=', mean_squared_error(y_check5, ridge_model5.predict(X_check_vec), squared=False))
---------------------Linear-----------------------
lr_r2_score_train= 0.9999999999887267
lr_r2_score_check= -1.7380551146632888
rmse= 1.1661920963185597
---------------------Ridge-----------------------
ridge best_para {'alpha': 3.473684210526316}
ridge best_score 0.2679621559876142
ridge_r2_score_train= 0.4563506863931118
ridge_r2_score_check= 0.27918705088677453
rmse= 0.5983568824924924
5.6 Model Selection and Evaluation (Target6 — Conventions)
print('---------------------Linear-----------------------')
lr_reg6 = LinearRegression()
lr_reg6.fit(X_train_vec, y_train6)
lr_r2_score_train6=r2_score(y_train6, lr_reg6.predict(X_train_vec))
lr_r2_score_check6=r2_score(y_check6, lr_reg6.predict(X_check_vec))
print('lr_r2_score_train=',lr_r2_score_train6)
print('lr_r2_score_check=',lr_r2_score_check6)
print('rmse=', mean_squared_error(y_check6, lr_reg6.predict(X_check_vec), squared=False))
print('---------------------Ridge-----------------------')
ridge_reg6=GridSearchCV(Ridge(), param_grid={'alpha':np.linspace(-2.0,50.0,20)}, cv=5)
ridge_model6=ridge_reg6.fit(X_train_vec, y_train6)
print('ridge best_para', ridge_reg6.best_params_)
print('ridge best_score', ridge_reg6.best_score_)
ridge_r2_score_train6=r2_score(y_train6,ridge_model6.predict(X_train_vec))
ridge_r2_score_check6=r2_score(y_check6,ridge_model6.predict(X_check_vec))
print('ridge_r2_score_train=',ridge_r2_score_train6)
print('ridge_r2_score_check=',ridge_r2_score_check6)
print('rmse=', mean_squared_error(y_check6, ridge_model6.predict(X_check_vec), squared=False))
---------------------Linear-----------------------
lr_r2_score_train= 0.99999999998875
lr_r2_score_check= -1.6931481786988987
rmse= 1.1140337330992856
---------------------Ridge-----------------------
ridge best_para {'alpha': 0.736842105263158}
ridge best_score 0.2413117433011474
ridge_r2_score_train= 0.6404360906986227
ridge_r2_score_check= 0.2509944713620459
rmse= 0.5875038657732393
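Before moving on, the check-set r2 scores computed above can be gathered into a single comparison table, for example:
# Collect the check-set r2 scores computed above into one comparison table
summary = pd.DataFrame({
    'measure': ['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions'],
    'linear_r2_check': [lr_r2_score_check1, lr_r2_score_check2, lr_r2_score_check3,
                        lr_r2_score_check4, lr_r2_score_check5, lr_r2_score_check6],
    'ridge_r2_check': [ridge_r2_score_check1, ridge_r2_score_check2, ridge_r2_score_check3,
                       ridge_r2_score_check4, ridge_r2_score_check5, ridge_r2_score_check6],
})
print(summary)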
6. Make Predictions on Test Dataset and Make Submission to Kaggle
Based on the r2 score evaluations for each of the 6 target variables, the ridge regression model performs better than the linear regression model. Hence, we will predict our test targets using the ridge model for each target.
Let's take another look at the test dataset and its info:
test
text_id full_text
0 0000C359D63E when a person has no experience on a job their...
1 000BAD50D026 Do you think students would benefit from being...
2 00367BB2546B Thomas Jefferson once states that "it is wonde...
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 text_id 3 non-null object
1 full_text 3 non-null object
dtypes: object(2)
memory usage: 176.0+ bytes
Next, we apply the same preprocessing steps as for the training dataset:
- Selecting the predictor column
- Applying the word processing and vectorization
X_test = test['full_text']
X_test_vec = vec_trans(X_test, vectorizer)
6.1. Make predictions on target variable 1 — Cohesion using ridge regression model
y_pred1 = ridge_model1.predict(X_test_vec)
test['cohesion'] = y_pred1
6.2. Make predictions on target variable 2 — Syntax using ridge regression model
y_pred2 = ridge_model2.predict(X_test_vec)
test['syntax'] = y_pred2
6.3. Make predictions on target variable 3 — Vocabulary using ridge regression model
y_pred3 = ridge_model3.predict(X_test_vec)
test['vocabulary'] = y_pred3
6.4. Make predictions on target variable 4 — Phraseology using ridge regression model
y_pred4 = ridge_model4.predict(X_test_vec)
test['phraseology'] = y_pred4
6.5. Make predictions on target variable 5 — Grammar using ridge regression model
y_pred5 = ridge_model5.predict(X_test_vec)
test['grammar'] = y_pred5
6.6. Make predictions on target variable 6 — Conventions using ridge regression model
y_pred6 = ridge_model6.predict(X_test_vec)
test['conventions'] = y_pred6
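Equivalently, the six prediction cells above could be written as a single loop:
# Equivalent loop form of the six prediction cells above
ridge_models = {
    'cohesion': ridge_model1, 'syntax': ridge_model2, 'vocabulary': ridge_model3,
    'phraseology': ridge_model4, 'grammar': ridge_model5, 'conventions': ridge_model6,
}
for col, model in ridge_models.items():
    test[col] = model.predict(X_test_vec)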
Output the submission file:
submission = test[['text_id','cohesion','syntax', 'vocabulary','phraseology','grammar','conventions']]
submission.to_csv('submission.csv', index = False)
Preview the submission file:
submission
text_id cohesion syntax vocabulary phraseology grammar conventions
0 0000C359D63E 3.003285 2.862320 3.074138 3.018515 2.753672 3.026574
1 000BAD50D026 2.996941 2.856068 2.962797 2.833268 2.692806 3.190103
2 00367BB2546B 3.400035 3.343927 3.425858 3.415096 3.258788 3.090802
The final submitted score of 0.53 MCRMSE (mean column-wise root mean squared error), while not among the best, was only about 0.1 away from the top model, and was achieved with very simple regression methods.
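For reference, MCRMSE averages the per-column RMSEs across the six measures; a minimal sketch of how it could be computed locally (hypothetical helper, not the Kaggle scoring code):
# Hypothetical helper: mean column-wise RMSE over the six score columns
def mcrmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    col_rmse = np.sqrt(((y_true - y_pred) ** 2).mean(axis=0))  # RMSE per column
    return col_rmse.mean()                                      # average across columns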
7. Possible Future Improvements
- All the models used in this submission are simple baseline models. The results are not the best, but training is fast with minimal CPU requirements.
- More complex models, such as deep learning approaches built with frameworks like TensorFlow, could be used to improve the results significantly.