CS638
CS 638 Fall 2016 Data Science Project
Team
- Karan Talreja talreja2@wisc.edu
- Pavan Kemparaju pavank@cs.wisc.edu
- Zachary Rottier zrottier@wisc.edu
Project Stages
Deliverable 1
Site 1 SPOJ
5,158 pages parsed
Generated CSV HTML FilesTable structure
UserRanks, SolvedBy, Name, Tags, Url, TimeLimit, AddedBy, Languages, MemoryLimit, Code, SourceLimit, Output, Accuracy, Input, Id, DateAdded, DescriptionSite 2 CodeChef
5395 pages parsed
Generated CSV HTML FilesTable structure
time_limit, source_limit, description, index, input_text, title, problem_code, success_count, submissions_page, link, date_added, output_text, editorial_page, tags/0, tags/1, tags/2, tags/3, accuracy, tags/4, tags/5, tags/6, tags/7Deliverable 2
Attributes Selected for Analysis
title, added_by, description, input, output, upper_time_limit, accuracy, solved_by, tags, difficulty CSV used for analysisSteps in Analysis
For each attribute, we preformed the following steps:
- Find Missing Values (number, percentage)
- Strategy to Fill Missing Values
- Classify the Attribute (numeric, textual, categorical)
- Summarize Textual Attributes (character mean, min and max lengths)
- Find Outliers and Anomalies
- Format Standardziation (ensure all values under a specific attribute follow the specified format)
- Find Potential Synomyms
- Find Potential Sprinkled Values
- Note Other Data Quality Concerns
Deliverable 3
Steps in Blocking
- Pre-processing on both tables
- Generated features "solve_rate_normalized" and "difficulty_normalized"
- Attempted Blocking on "solve_rate_normalized"
- Converted "description", "input" and "output" into bag of words using nltk
- Generated an inverted index on both tables based on a similarity score on overlap
- Blocked on the inverted index table to find out probable candidates for matching
Result of Blocking
Table A (spoj problems) : 5158 problems Table B (code chef problems) : 5395 problems Cartesian Product for matching : 27827410 pairs Table C (Probable matching pairs) : 153929 pairs Table C: Probable Candidates Blocking Attempt#1 - Python Notebook Blocking Final - Python Notebook Blocking - Preprocessing Spoj - Python Notebook Blocking - Preprocessing Codechef - Python Notebook Blocking Explanation PDFDeliverable 4
Steps in Matching
- Labeled 400 instances of the blocked data in order to create "Golden Data"
- Create features to compare instances across tables
- Exact match of problem titles (binary 0/1)
- Absolute differences in the word counts of the two instances’ descriptions (integer)
- Jaccard similarity of the two instances’ descriptions
- Edit distance for between the two instances’ titles
- Train 5 Machine Learning Algorithms on 350 training instances and evaluate on precision, recall and F-1 scores
- Logistic Regression Classifier
- SVM Classifier
- Decision Tree Classifier
- Random Forest Classifier
- Naïve Bayes Classifier
- Tune model parameters and re-evaluate precision, recall and F-1 scores
- Use remaining 50 labeled instances and evaluate each model's final precision, recall and F-1 scores
Final Results of Matching (on 50 testing instances)
- Logistic Regression Classifier
- Precision: 0.98
- Recall: 0.98
- F-1 Score: 0.98
- SVM Classifier
- Precision: 0.96
- Recall: 0.96
- F-1 Score: 0.96
- Decision Tree Classifier
- Precision: 0.96
- Recall: 0.96
- F-1 Score: 0.96
- Random Forest Classifier
- Precision: 0.95
- Recall: 0.94
- F-1 Score: 0.94
- Naïve Bayes Classifier
- Precision: 0.95
- Recall: 0.94
- F-1 Score: 0.94
Deliverable 5
Merging Tables and Analysis of Results
- Used both Spoj and Codechef tables to load in required raw data
- Applied data transformations to prepare data for trained matching algorithm. We needed to create the following columns:
- exact_title_match - Exact match of problem titles (binary 0/1)
- diff_in_words_len - Absolute differences in the word counts of the two instances’ descriptions (integer)
- jaccard_sim - Jaccard similarity of the two instances’ descriptions
- edit_dist - Edit distance for between the two instances’ titles
- Applied our Machine Learning Algorithm (Logistic Regression Classifier) on 1200 pre-blocked instances to predict whether they were matches or not.
- Once we had the matches, we were able to preform analyses on the results.
- Calculate the correlation between matched problems' solved rates - found out that the solved rates (i.e. number of correct solutions over the number of days since the problem was created) were not strongly correlated. In doing this calculation, we decided to remove the TEST problem from the dataset since it was such an outlier and was swaying our results.
- Create word cloud and histogram of word frequency for the least solved problems - we chose to preform this task in order to compare the similarities of the most difficult problems (top 25%). With 9/10 of the top words overlapping, it seems that these problems may be somewhat similar.
-
Table_E and Analysis Python Notebook PDF
Table_E and Analysis Python Notebook
Table_E - All found Matches
Stage 5 Report