CS638

CS 638 Fall 2016 Data Science Project

Team

Project Stages

Deliverable 1

Site 1 SPOJ

5,158 pages parsed

Generated CSV
HTML Files

Table structure

UserRanks, SolvedBy, Name, Tags, Url, TimeLimit, AddedBy, Languages,
MemoryLimit, Code, SourceLimit, Output, Accuracy, Input, Id, DateAdded, Description

Site 2 CodeChef

5395 pages parsed

Generated CSV
HTML Files

Table structure

time_limit, source_limit, description, index, input_text, title,
problem_code, success_count, submissions_page, link, date_added,
output_text, editorial_page, tags/0, tags/1, tags/2, tags/3, accuracy, tags/4, tags/5, tags/6, tags/7

Deliverable 2

Attributes Selected for Analysis

title, added_by, description, input, output, upper_time_limit, accuracy, solved_by, tags, difficulty
CSV used for analysis

Steps in Analysis

For each attribute, we preformed the following steps:

  1. Find Missing Values (number, percentage)
  2. Strategy to Fill Missing Values
  3. Classify the Attribute (numeric, textual, categorical)
  4. Summarize Textual Attributes (character mean, min and max lengths)
  5. Find Outliers and Anomalies
  6. Format Standardziation (ensure all values under a specific attribute follow the specified format)
  7. Find Potential Synomyms
  8. Find Potential Sprinkled Values
  9. Note Other Data Quality Concerns
Attribute Analysis - PDF
Attribute Analysis - Python Notebook

Deliverable 3

Steps in Blocking

  1. Pre-processing on both tables
  2. Generated features "solve_rate_normalized" and "difficulty_normalized"
  3. Attempted Blocking on "solve_rate_normalized"
  4. Converted "description", "input" and "output" into bag of words using nltk
  5. Generated an inverted index on both tables based on a similarity score on overlap
  6. Blocked on the inverted index table to find out probable candidates for matching

Result of Blocking

Table A (spoj problems) : 5158 problems
Table B (code chef problems) : 5395 problems
Cartesian Product for matching : 27827410 pairs
Table C (Probable matching pairs) : 153929 pairs
Table C: Probable Candidates
Blocking Attempt#1 - Python Notebook
Blocking Final - Python Notebook
Blocking - Preprocessing Spoj - Python Notebook
Blocking - Preprocessing Codechef - Python Notebook
Blocking Explanation PDF

Deliverable 4

Steps in Matching

  1. Labeled 400 instances of the blocked data in order to create "Golden Data"
  2. Create features to compare instances across tables
  3. Train 5 Machine Learning Algorithms on 350 training instances and evaluate on precision, recall and F-1 scores
  4. Tune model parameters and re-evaluate precision, recall and F-1 scores
  5. Use remaining 50 labeled instances and evaluate each model's final precision, recall and F-1 scores

Final Results of Matching (on 50 testing instances)

Matching Python Notebook PDF
Matching Python Notebook
Labeled Dataset (i.e. "Golden Data")
Matching Report

Deliverable 5

Merging Tables and Analysis of Results

  1. Used both Spoj and Codechef tables to load in required raw data
  2. Applied data transformations to prepare data for trained matching algorithm. We needed to create the following columns:
  3. Applied our Machine Learning Algorithm (Logistic Regression Classifier) on 1200 pre-blocked instances to predict whether they were matches or not.
  4. Once we had the matches, we were able to preform analyses on the results.