Phase 1 - Borderlands Science Documentation

About Dataset

All data can be downloaded from https://dataset.cs.mcgill.ca/bls/data_release/

Context

This dataset contains information on the puzzles and solutions used in the Borderlands Science game from its release (April 7, 2020) until June 19, 2021 (BLS Phase 1). The puzzle datasets were produced by the Waldispuhl group, School of Computer Science, McGill University, in collaboration with Gearbox Software and MMOS. The solution datasets were collected from the Borderlands Science game (Gearbox).

Content

Puzzle and solution datasets are split into more manageable parts (that can be used both individually and collectively).

Puzzles

ATTRIBUTE	TYPE	DESCRIPTION
id	[str]	ID of the puzzle within BLS system.
source_puzzle_id	[str]	ID of the group of puzzles that consist of the same sequences but use a different max_gap_count (number of gaps allowed in a game).
assets	[list]	List of sequences that have been used to produce a puzzle. Generally, each is a literal sequence followed by a set of gaps. Alphabet is [A,C,G,T,-].
solution	[list]	List of aligned sequences that were produced using PASTA software. Alphabet is [A,C,G,T,-]. Used as a general reference.
difficulty	[int]	Difficulty level of the puzzle, manually curated to optimize user experience during the game. Puzzle difficulty varies in range [1, 9] from easiest to hardest.
par_score	[float]	Score of a solution (see puzzle.solution) using BLS scoring function.
max_gap_count	[int]	Maximum number of gaps a player may use in the puzzle (see solution.gaps).
accepted_pairs	[list]	Accepted pairs of nucleotides. Length of the list is equal to the maximum size (width) of solution. Used to find matches for scoring function. [Alternative name: guides]
skips	[int]	Number of times puzzle was skipped by players.
high_score	[float]	Highest score achieved by players (per scope).
classification_count	[int]	Number of player submissions received. Attention: both valid solutions and skips are counted towards classification_count. To find the number of valid solutions, subtract skips from classification_count.
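
The note on classification_count can be sketched as a one-line helper. This is only an illustration and not part of the dataset tooling: the function name valid_solution_count and the toy record are hypothetical.

```python
def valid_solution_count(record):
    """Return the number of valid (non-skipped) submissions.

    `record` is anything indexable by column name that holds the
    `classification_count` and `skips` attributes described above.
    """
    return record["classification_count"] - record["skips"]


# Hypothetical toy record, not taken from the dataset.
toy_puzzle = {"id": "example", "classification_count": 10, "skips": 3}
valid = valid_solution_count(toy_puzzle)  # 7
```

Because the helper relies only on indexing by column name and subtraction, it should also accept a whole pandas DataFrame of puzzles, in which case it returns one value per puzzle.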

Solutions

ATTRIBUTE	TYPE	DESCRIPTION
id	[str]	ID of the solution within BLS system
puzzle_id	[str]	ID of the puzzle within BLS system.
player_id	[str]	ID of the player within Borderlands and BLS system.
gaps	[int]	Number of gaps used in a solution. Limited by puzzle.max_gap_count
deshuffled_solution	[list]	List of sequences aligned by a player, deshuffled to the original state of the puzzle. Alphabet is [A,C,G,T,-]. Attention: trailing gaps (-) are removed.
solution_code	[str]	Short identifier of the (deshuffled) solution, for simple comparison between solutions. Fields are separated by commas, one field per sequence; the digits in a field give the unique positions of the gaps within that sequence. For example, the string "0,,34,," means that a gap was added at position 0 in sequence 0 (e.g. -ACGCT) and at positions 3 and 4 in sequence 2 (e.g. TAG--T). The other sequences are not modified.
score	[float]	Score of a solution (see solution.solution) using the BLS scoring function.
created_at	[Timestamp]	Date when solution was submitted by a player.
moves	[list]	List of the moves (in order) the player made (shuffled context). Every item of the list consists of: 1) the ID of the column/sequence, in alphabetical order; 2) the numeric position within the sequence where the move was performed; 3) + or - if a gap was added or removed; 4) the time since the last move in milliseconds. For example: ['A0+;2600', 'E2+;1200', 'E2-;1367', 'F3+;2015']
shuffle	[list]	Shuffle pattern applied to the original puzzle to mix columns/sequences and remove or reduce bias. The list consists of pairs defining (0) the source position in the unshuffled puzzle (see puzzle.assets) and (1) the target position in the shuffled state. Attention: pairs with the same source and target positions are omitted. Example: [['A', 'C'], ['B', 'E'], ['C', 'F'], ['D', 'A'], ['E', 'B'], ['F', 'D']]
solution	[list]	List of sequences aligned by a player (shuffled). Alphabet is [A,C,G,T,-]. Attention: trailing gaps (-) are removed.
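
The solution_code and moves formats described in the table can be decoded with a few lines of standard-library Python. The sketch below is an illustration under stated assumptions, not the official BLS parser: the names decode_solution_code, parse_move and Move are hypothetical, gap positions in solution_code are assumed to be single digits (as in the "0,,34,," example), and column IDs in moves are assumed to be single uppercase letters.

```python
import re
from typing import List, NamedTuple


def decode_solution_code(code: str) -> List[List[int]]:
    """Turn e.g. "0,,34,," into one list of gap positions per sequence.

    Assumption: every digit is one gap position (positions 0-9 only).
    """
    return [[int(ch) for ch in field] for field in code.split(",")]


class Move(NamedTuple):
    column: str      # column/sequence ID, e.g. "A"
    position: int    # position within the sequence
    added: bool      # True for "+", False for "-"
    elapsed_ms: int  # time since the last move, in milliseconds


# Assumed item layout: <letter><position><+ or ->;<milliseconds>
_MOVE_RE = re.compile(r"([A-Z])(\d+)([+-]);(\d+)")


def parse_move(item: str) -> Move:
    """Parse a single item of the moves list, e.g. 'E2-;1367'."""
    match = _MOVE_RE.fullmatch(item)
    if match is None:
        raise ValueError(f"unrecognised move: {item!r}")
    column, position, sign, elapsed = match.groups()
    return Move(column, int(position), sign == "+", int(elapsed))


gaps_per_sequence = decode_solution_code("0,,34,,")
# [[0], [], [3, 4], [], []]
first_move = parse_move("A0+;2600")
# Move(column='A', position=0, added=True, elapsed_ms=2600)
```

If a puzzle can have gap positions above 9, the single-digit assumption in decode_solution_code no longer holds and the decoding would have to be checked against deshuffled_solution.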

Code

Attention: the pandas library is required.

Load Puzzles file

As an example, “puzzles.0.csv” is loaded.

#python
import json
import pandas as pd

dtypes = {
    "id": str,
    "source_puzzle_id": str,
    "assets": str, # string containing list
    "solution": str, # string containing list
    "difficulty": int,
    "par_score": float,
    "max_gap_count": int, 
    "accepted_pairs": str, # string containing list
    "skips": int,
    "high_score": float,
    "classification_count": int
}

df = pd.read_csv("puzzles.0.csv", dtype=dtypes)

list_column_names = ["assets", "solution", "accepted_pairs"]

for col_name in list_column_names:
    df[col_name] = df[col_name].str.replace("'", "\"").apply(json.loads)
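
The quote replacement used above would fail on any cell that itself contains an apostrophe or a double quote. That is unlikely for the [A,C,G,T,-] alphabet, but as a hedged alternative the standard-library ast.literal_eval can parse such list strings directly, assuming the cells are Python-style list literals. The helper name parse_list_cell is hypothetical.

```python
import ast


def parse_list_cell(cell: str) -> list:
    """Parse a CSV cell holding a Python-style list literal into a list."""
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return value


# Toy cell in the same style as the list columns of the CSV files.
example = parse_list_cell("['ACG--', 'A-CGT']")

# With a loaded DataFrame, the loop body could then become:
#     df[col_name] = df[col_name].apply(parse_list_cell)
```

ast.literal_eval only evaluates literals (strings, numbers, lists and similar), so it does not execute arbitrary code found in a cell.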

Load Solutions File

As an example, “solutions.0.csv” is loaded.

#python
import json
import pandas as pd

dtypes = {
    "id": str,
    "puzzle_id": str,
    "player_id": str,
    "gaps_used": int,
    "deshuffled_solution": str,  # string containing list
    "solution_code": str,
    "score": float,
    "created_at": str,
    "moves": str,  # string containing list
    "shuffle": str,  # string containing list
    "solution": str  # string containing list
}

df = pd.read_csv("solutions.0.csv",
                 dtype=dtypes, parse_dates=['created_at'])

# every column stored as a string containing a list
list_column_names = ["deshuffled_solution", "solution", "moves", "shuffle"]

for col_name in list_column_names:
    df[col_name] = df[col_name].str.replace("'", "\"").apply(json.loads)
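
Once the shuffle column has been parsed into lists of pairs, the pattern can be applied to, or removed from, a list of sequences. The sketch below is one possible reading of the shuffle description, not the game's own implementation. It assumes that sequences are labelled 'A', 'B', 'C', ... by their index in the list and that the target label is the index in the shuffled list. The function names are hypothetical.

```python
from typing import Dict, List, Sequence


def _label(index: int) -> str:
    """Column label for a list index: 0 -> 'A', 1 -> 'B', ..."""
    return chr(ord("A") + index)


def shuffle_mapping(pairs: Sequence[Sequence[str]], n: int) -> Dict[str, str]:
    """Full source -> target map; omitted pairs stay in place."""
    mapping = {_label(i): _label(i) for i in range(n)}
    for source, target in pairs:
        mapping[source] = target
    return mapping


def apply_shuffle(sequences: List[str], pairs) -> List[str]:
    """Move each sequence from its source index to its target index."""
    mapping = shuffle_mapping(pairs, len(sequences))
    shuffled = [""] * len(sequences)
    for i, seq in enumerate(sequences):
        shuffled[ord(mapping[_label(i)]) - ord("A")] = seq
    return shuffled


def undo_shuffle(shuffled: List[str], pairs) -> List[str]:
    """Inverse of apply_shuffle: restore the unshuffled order."""
    mapping = shuffle_mapping(pairs, len(shuffled))
    return [shuffled[ord(mapping[_label(i)]) - ord("A")]
            for i in range(len(shuffled))]


# Shuffle pattern from the attribute table, applied to placeholder sequences.
pairs = [["A", "C"], ["B", "E"], ["C", "F"], ["D", "A"], ["E", "B"], ["F", "D"]]
original = ["s0", "s1", "s2", "s3", "s4", "s5"]
shuffled = apply_shuffle(original, pairs)
# ['s3', 's4', 's0', 's5', 's1', 's2']
restored = undo_shuffle(shuffled, pairs)
# ['s0', 's1', 's2', 's3', 's4', 's5']
```

A reasonable sanity check on real rows is to compare undo_shuffle applied to solution with deshuffled_solution. If they disagree, the labelling assumption above does not match the data.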