Phase 1 - Borderlands Science Documentation
About Dataset
All data can be downloaded from https://dataset.cs.mcgill.ca/bls/data_release/
Context
This dataset contains information on the puzzles and solutions used in the Borderlands Science game from its release (April 7, 2020) until June 19, 2021 (BLS Phase 1). The puzzle dataset(s) were produced by the Waldispuhl group, School of Computer Science, McGill University, in collaboration with Gearbox Software and MMOS. The solution dataset(s) were collected from the Borderlands Science game (Gearbox).
Content
Puzzle and solution datasets are split into more manageable parts (that can be used both individually and collectively).
Puzzles
ATTRIBUTE | TYPE | DESCRIPTION |
id | [str] | ID of the puzzle within the BLS system. |
source_puzzle_id | [str] | ID of the group of puzzles that consist of the same sequences but use a different max_gap_count (number of gaps allowed in a game). |
assets | [list] | List of sequences that were used to produce the puzzle. Generally, each is a literal sequence followed by a set of gaps. Alphabet is [A,C,G,T,-]. |
solution | [list] | List of aligned sequences that were produced using the PASTA software. Alphabet is [A,C,G,T,-]. Used as a general reference. |
difficulty | [int] | Difficulty level of the puzzle, manually curated to optimize user experience during the game. Puzzle difficulty ranges over [1, 9], from easiest to hardest. |
par_score | [float] | Score of the reference solution (see puzzle.solution) using the BLS scoring function. |
max_gap_count | [int] | Maximum number of moves allowed in the game. |
accepted_pairs | [list] | Accepted pairs of nucleotides. The length of the list is equal to the maximum size (width) of the solution. Used to find matches for the scoring function. [Alternative name: guides] |
skips | [int] | Number of times the puzzle was skipped by players. |
high_score | [float] | Highest score achieved by players (per scope). |
classification_count | [int] | Number of player submissions received. Attention: both valid solutions and skips are counted towards *classification_count*. To find the number of valid solutions, subtract skips. |
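The relationship between classification_count and skips described above can be sketched as a small helper. The function name and the sample record are hypothetical and used only for illustration; the numbers are not taken from the dataset.

```python
def valid_solution_count(classification_count: int, skips: int) -> int:
    """Return the number of valid solutions: all submissions minus skips."""
    if skips > classification_count:
        # Skips are counted inside classification_count, so this would be inconsistent.
        raise ValueError("skips cannot exceed classification_count")
    return classification_count - skips


# Hypothetical puzzle record, for illustration only.
puzzle = {"classification_count": 120, "skips": 15}
valid = valid_solution_count(puzzle["classification_count"], puzzle["skips"])
print(valid)  # 105
```

On a loaded puzzles DataFrame, the same subtraction can be applied column-wise, e.g. `df["classification_count"] - df["skips"]`.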
Solutions
ATTRIBUTE | TYPE | DESCRIPTION |
id | [str] | ID of the solution within the BLS system. |
puzzle_id | [str] | ID of the puzzle within the BLS system. |
player_id | [str] | ID of the player within the Borderlands and BLS systems. |
gaps | [int] | Number of gaps used in the solution. Limited by *puzzle.max_gap_count*. |
deshuffled_solution | [list] | List of sequences aligned by the player, deshuffled to the original state of the puzzle. Alphabet is [A,C,G,T,-]. Attention: trailing gaps (-) are removed. |
solution_code | [str] | String containing a shortened identifier of the (**deshuffled**) solution, for simple comparison between solutions. Fields are separated by commas, one field per sequence; the digits in a field give the unique positions of the gaps within that sequence. For example, the string "0,,34,," means gaps were added at position 0 in the 0th sequence (e.g. -ACGCT), as well as at positions 3 and 4 in the 2nd sequence (e.g. TAG--T). The other sequences are not modified. |
score | [float] | Score of the solution (see solution.solution) using the BLS scoring function. |
created_at | [Timestamp] | Date when the solution was submitted by the player. |
moves | [list] | List of the moves (in order) the player made, in the shuffled context. Every item of the list consists of:
1) ID of the column/sequence, in alphabetical order
2) Numeric position within the sequence where the move was performed
3) +/- depending on whether a gap was added or removed
4) Time since the last move, in milliseconds
For example: ['A0+;2600', 'E2+;1200', 'E2-;1367', 'F3+;2015'] |
shuffle | [list] | Shuffle pattern applied to the original puzzle to mix columns/sequences and remove/reduce bias. The list consists of pairs defining (0) the source position in the unshuffled puzzle (see puzzle.assets) and (1) the target position in the shuffled state. Attention: pairs with the same source and target positions are omitted. Example: [['A', 'C'], ['B', 'E'], ['C', 'F'], ['D', 'A'], ['E', 'B'], ['F', 'D']] |
solution | [list] | List of sequences aligned by the player (shuffled). Alphabet is [A,C,G,T,-]. Attention: trailing gaps (-) are removed. |
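The moves and shuffle formats described in the table can be unpacked with a few lines of code. The following is a minimal sketch, not part of the dataset tooling: the names `Move`, `parse_move`, and `full_shuffle_map` are hypothetical, and the regular expression assumes that every move looks like the documented examples (one uppercase column letter, a numeric position, a sign, then `;` and the elapsed milliseconds).

```python
import re
from typing import Dict, NamedTuple, Sequence


class Move(NamedTuple):
    column: str      # column/sequence ID, e.g. "A"
    position: int    # position within the sequence
    added: bool      # True for "+", False for "-"
    elapsed_ms: int  # time since the previous move, in milliseconds


# Assumed layout, based on the documented examples such as 'A0+;2600'.
_MOVE_RE = re.compile(r"^([A-Z])(\d+)([+-]);(\d+)$")


def parse_move(raw: str) -> Move:
    """Split one move string into its four documented components."""
    match = _MOVE_RE.match(raw)
    if match is None:
        raise ValueError(f"unrecognized move: {raw!r}")
    column, position, sign, elapsed = match.groups()
    return Move(column, int(position), sign == "+", int(elapsed))


def full_shuffle_map(pairs: Sequence[Sequence[str]], columns: str) -> Dict[str, str]:
    """Build a source -> target map, restoring the omitted identity pairs."""
    mapping = {column: column for column in columns}
    for source, target in pairs:
        mapping[source] = target
    return mapping


moves = [parse_move(m) for m in ["A0+;2600", "E2+;1200", "E2-;1367", "F3+;2015"]]
total_ms = sum(move.elapsed_ms for move in moves)  # total time across the listed moves

shuffle = [["A", "C"], ["B", "E"], ["C", "F"], ["D", "A"], ["E", "B"], ["F", "D"]]
source_to_target = full_shuffle_map(shuffle, "ABCDEF")
# Inverting the map gives, for each shuffled column, its unshuffled source.
target_to_source = {target: source for source, target in source_to_target.items()}
```

The `columns` argument is an assumption: it should list every column ID of the puzzle, so that columns left in place by the shuffle still appear in the map.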
Code
Attention: the pandas library is required.
Load Puzzles file
As an example, “puzzles.0.csv” is loaded.
#python
import json

import pandas as pd

# Column types; list-valued columns are read as strings and parsed below.
dtypes = {
    "id": str,
    "source_puzzle_id": str,
    "assets": str,  # string containing list
    "solution": str,  # string containing list
    "difficulty": int,
    "par_score": float,
    "max_gap_count": int,
    "accepted_pairs": str,  # string containing list
    "skips": int,
    "high_score": float,
    "classification_count": int,
}
df = pd.read_csv("puzzles.0.csv", dtype=dtypes)

# Convert the list-valued columns from strings to Python lists:
# swap single quotes for double quotes so that the text is valid JSON.
list_column_names = ["assets", "solution", "accepted_pairs"]
for col_name in list_column_names:
    df[col_name] = df[col_name].str.replace("'", "\"").apply(json.loads)
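The loader above turns each list cell into valid JSON by swapping quote characters. As a hedged alternative, and assuming the cells are Python-style list literals (which the quote replacement suggests), the standard-library `ast.literal_eval` can parse them directly without rewriting quotes. The helper name `parse_list_cell` and the sample cell are hypothetical.

```python
import ast


def parse_list_cell(cell: str) -> list:
    """Parse a cell such as "['ACGT--', 'AC-GT-']" into a Python list."""
    value = ast.literal_eval(cell)  # evaluates literals only, never arbitrary code
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got {type(value).__name__}")
    return value


# Hypothetical cell content, for illustration only.
parsed = parse_list_cell("['ACGT--', 'AC-GT-']")
print(parsed)  # ['ACGT--', 'AC-GT-']
```

With a loaded DataFrame this could be applied as `df[col_name] = df[col_name].apply(parse_list_cell)`.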
Load Solutions file
As an example, “solutions.0.csv” is loaded.
#python
import json

import pandas as pd

# Column types; list-valued columns are read as strings and parsed below.
dtypes = {
    "id": str,
    "puzzle_id": str,
    "player_id": str,
    "gaps_used": int,  # listed as "gaps" in the table above; match the CSV header
    "deshuffled_solution": str,  # string containing list
    "solution_code": str,
    "score": float,
    "created_at": str,
    "moves": str,  # string containing list
    "shuffle": str,  # string containing list
    "solution": str,  # string containing list
}
df = pd.read_csv("solutions.0.csv",
                 dtype=dtypes, parse_dates=["created_at"])

# Convert the list-valued columns from strings to Python lists.
# "deshuffled_solution" is stored the same way and can be added here if needed.
list_column_names = ["solution", "moves", "shuffle"]
for col_name in list_column_names:
    df[col_name] = df[col_name].str.replace("'", "\"").apply(json.loads)
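Once the solutions are loaded, the solution_code column can be expanded into explicit gap positions. The sketch below follows the documented example "0,,34,,": it assumes one comma-separated field per sequence and one digit per gap position. The function names are hypothetical, and `gap_positions` is included only to show how the decoded positions relate to the example sequences given in the attribute table.

```python
from typing import List


def decode_solution_code(code: str) -> List[List[int]]:
    """Turn e.g. "0,,34,," into one list of gap positions per sequence."""
    # Assumption: each digit is a separate position, as in the documented example.
    return [[int(ch) for ch in field.strip()] for field in code.split(",")]


def gap_positions(sequence: str) -> List[int]:
    """Return the indices of the gap characters in an aligned sequence."""
    return [index for index, ch in enumerate(sequence) if ch == "-"]


decoded = decode_solution_code("0,,34,,")
print(decoded)  # [[0], [], [3, 4], [], []]

# The example sequences from the attribute table match the decoded positions.
print(gap_positions("-ACGCT") == decoded[0])  # True
print(gap_positions("TAG--T") == decoded[2])  # True
```

Because the code is a plain string, two solutions to the same puzzle can also be compared directly with `==` without decoding.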