CS638

CS 638 Fall 2016 Data Science Project

Team

Karan Talreja talreja2@wisc.edu
Pavan Kemparaju pavank@cs.wisc.edu
Zachary Rottier zrottier@wisc.edu

Project Stages

Deliverable 1

Site 1 SPOJ

5,158 pages parsed

Generated CSV
HTML Files

Table structure

UserRanks, SolvedBy, Name, Tags, Url, TimeLimit, AddedBy, Languages,
MemoryLimit, Code, SourceLimit, Output, Accuracy, Input, Id, DateAdded, Description

Site 2 CodeChef

5395 pages parsed

Generated CSV
HTML Files

Table structure

time_limit, source_limit, description, index, input_text, title,
problem_code, success_count, submissions_page, link, date_added,
output_text, editorial_page, tags/0, tags/1, tags/2, tags/3, accuracy, tags/4, tags/5, tags/6, tags/7

Deliverable 2

Attributes Selected for Analysis

title, added_by, description, input, output, upper_time_limit, accuracy, solved_by, tags, difficulty
CSV used for analysis

Steps in Analysis

For each attribute, we preformed the following steps:

Find Missing Values (number, percentage)
Strategy to Fill Missing Values
Classify the Attribute (numeric, textual, categorical)
Summarize Textual Attributes (character mean, min and max lengths)
Find Outliers and Anomalies
Format Standardziation (ensure all values under a specific attribute follow the specified format)
Find Potential Synomyms
Find Potential Sprinkled Values
Note Other Data Quality Concerns

Attribute Analysis - PDF
Attribute Analysis - Python Notebook

Deliverable 3

Steps in Blocking

Pre-processing on both tables
Generated features "solve_rate_normalized" and "difficulty_normalized"
Attempted Blocking on "solve_rate_normalized"
Converted "description", "input" and "output" into bag of words using nltk
Generated an inverted index on both tables based on a similarity score on overlap
Blocked on the inverted index table to find out probable candidates for matching

Result of Blocking

Table A (spoj problems) : 5158 problems
Table B (code chef problems) : 5395 problems
Cartesian Product for matching : 27827410 pairs
Table C (Probable matching pairs) : 153929 pairs
Table C: Probable Candidates
Blocking Attempt#1 - Python Notebook
Blocking Final - Python Notebook
Blocking - Preprocessing Spoj - Python Notebook
Blocking - Preprocessing Codechef - Python Notebook
Blocking Explanation PDF

Deliverable 4

Steps in Matching

Labeled 400 instances of the blocked data in order to create "Golden Data"
Create features to compare instances across tables

Exact match of problem titles (binary 0/1)
Absolute differences in the word counts of the two instances’ descriptions (integer)
Jaccard similarity of the two instances’ descriptions
Edit distance for between the two instances’ titles

Train 5 Machine Learning Algorithms on 350 training instances and evaluate on precision, recall and F-1 scores

Logistic Regression Classifier
SVM Classifier
Decision Tree Classifier
Random Forest Classifier
Naïve Bayes Classifier

Tune model parameters and re-evaluate precision, recall and F-1 scores
Use remaining 50 labeled instances and evaluate each model's final precision, recall and F-1 scores

Final Results of Matching (on 50 testing instances)

Logistic Regression Classifier

Precision: 0.98
Recall: 0.98
F-1 Score: 0.98

SVM Classifier

Precision: 0.96
Recall: 0.96
F-1 Score: 0.96

Decision Tree Classifier

Precision: 0.96
Recall: 0.96
F-1 Score: 0.96

Random Forest Classifier

Precision: 0.95
Recall: 0.94
F-1 Score: 0.94

Naïve Bayes Classifier

Precision: 0.95
Recall: 0.94
F-1 Score: 0.94

Matching Python Notebook PDF
Matching Python Notebook
Labeled Dataset (i.e. "Golden Data")
Matching Report

Deliverable 5

Merging Tables and Analysis of Results

Used both Spoj and Codechef tables to load in required raw data
Applied data transformations to prepare data for trained matching algorithm. We needed to create the following columns:

exact_title_match - Exact match of problem titles (binary 0/1)
diff_in_words_len - Absolute differences in the word counts of the two instances’ descriptions (integer)
jaccard_sim - Jaccard similarity of the two instances’ descriptions
edit_dist - Edit distance for between the two instances’ titles

Applied our Machine Learning Algorithm (Logistic Regression Classifier) on 1200 pre-blocked instances to predict whether they were matches or not.
Once we had the matches, we were able to preform analyses on the results.

Calculate the correlation between matched problems' solved rates - found out that the solved rates (i.e. number of correct solutions over the number of days since the problem was created) were not strongly correlated. In doing this calculation, we decided to remove the TEST problem from the dataset since it was such an outlier and was swaying our results.
Create word cloud and histogram of word frequency for the least solved problems - we chose to preform this task in order to compare the similarities of the most difficult problems (top 25%). With 9/10 of the top words overlapping, it seems that these problems may be somewhat similar.

Table_E and Analysis Python Notebook PDF

Table_E and Analysis Python Notebook

Table_E - All found Matches

Stage 5 Report

Cs638

Data Science Project

CS638

Team

Project Stages

Deliverable 1

Site 1 SPOJ

5,158 pages parsed

Table structure

Site 2 CodeChef

5395 pages parsed

Table structure

Deliverable 2

Attributes Selected for Analysis

Steps in Analysis

For each attribute, we preformed the following steps:

Deliverable 3

Steps in Blocking

Result of Blocking

Deliverable 4

Steps in Matching

Final Results of Matching (on 50 testing instances)

Deliverable 5

Merging Tables and Analysis of Results