DATA - PHASE 1

Phase 1 - Borderlands Science Documentation

Processing output files

These files contain general information about the puzzles, their locations within the initial alignment, and their solutions. They also include part of the “pikdik” information for quick lookup.

Location: https://dataset.cs.mcgill.ca/bls/data_release/phase1/processing_output/

Files

Prerequisites

All the pickle files described in this section depend on dnapuzzle.py.
Make sure “dnapuzzle.py” is in the working directory.
The pandas library is used in the Code section; however, it is not required to open the file.

Code

To open the pickle file (replace filepath with the appropriate string):

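#python
import pandas as pd
from dnapuzzle import Puzzle  # needed so the pickled Puzzle objects can be restored
filepath = "extract_output_1_2021-06-19.pickle"
obj = pd.read_pickle(filepath)  # dictionary described in the Format section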
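
Because pandas is optional, the file can presumably also be opened with Python's standard pickle module. The following is a minimal sketch under that assumption. The helper name load_processing_output is hypothetical and not part of the release, and dnapuzzle.py still has to be importable (for example, present in the working directory) when real files are loaded.

```python
import pickle


def load_processing_output(filepath):
    """Load a processing-output pickle without pandas (hypothetical helper).

    For the real files, dnapuzzle.py must be importable, because the
    pickled data reference the Puzzle class defined there.
    """
    with open(filepath, "rb") as handle:
        return pickle.load(handle)
```

It would be called as load_processing_output("extract_output_1_2021-06-19.pickle"). As with any pickle, only load files from a source you trust.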

Format

The pickle file encodes a dictionary where:

ATTRIBUTE	TYPE	DESCRIPTION
keys	[str]	Stringified list of column indexes in the original alignment (e.g. '[99, 100, 101, 102, 103, 104, 105]')
values	[list]	List of dictionaries that describe distinct puzzles (see below)
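
Since the keys are stringified lists rather than real lists, they need to be parsed before the column indexes can be used as numbers. A minimal sketch of one way to do this with the standard library follows; the helper name parse_columns is hypothetical.

```python
import ast


def parse_columns(key):
    """Convert a stringified column list such as '[99, 100]' into a list of ints."""
    columns = ast.literal_eval(key)  # safely evaluates the list literal
    return [int(c) for c in columns]


key = "[99, 100, 101, 102, 103, 104, 105]"
columns = parse_columns(key)
print(columns[0], columns[-1], len(columns))  # prints: 99 105 7
```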

Puzzle description dictionary

Each element of a values list is a dictionary with the following attributes:

ATTRIBUTE	TYPE	DESCRIPTION
originalCode	[str]	Puzzle ID in the BLS system (e.g. "VeoWZMUzLmzgbU3e")
pikdik	[dnapuzzle.Puzzle]	Puzzle Object (see below)
nGaps	[int]	Number of gaps allowed in a puzzle (e.g. 6)
score	[float]	Expected score for a particular puzzle (e.g. 17.0)
pareto	[str]	Type of strategy used (e.g. "FalseSubopt")
playerSolutions	[list]	List of players' solutions (e.g. [['TTG-A', 'GT--AA', 'GCGA', '-TGCA', 'CT', 'CT-GA']])
playerIDs	[list]	List of IDs of the players who submitted the solutions in playerSolutions (e.g. ['1269979'])
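
As an illustration of how these attributes fit together, the sketch below pairs each player ID with the corresponding solution. It assumes that playerIDs and playerSolutions are parallel lists in the same order, which the sample values suggest but the documentation does not state. The helper name and the interpretation of '-' as a gap placed by the player are also assumptions.

```python
def solutions_with_players(puzzle_entry):
    """Return (player_id, solution) pairs for one puzzle description dictionary.

    Assumes playerIDs and playerSolutions are parallel lists.
    """
    ids = puzzle_entry["playerIDs"]
    solutions = puzzle_entry["playerSolutions"]
    if len(ids) != len(solutions):
        raise ValueError("playerIDs and playerSolutions differ in length")
    return list(zip(ids, solutions))


# Example entry built from the documented sample values (pikdik omitted).
entry = {
    "originalCode": "VeoWZMUzLmzgbU3e",
    "nGaps": 6,
    "score": 17.0,
    "pareto": "FalseSubopt",
    "playerSolutions": [["TTG-A", "GT--AA", "GCGA", "-TGCA", "CT", "CT-GA"]],
    "playerIDs": ["1269979"],
}

for player_id, solution in solutions_with_players(entry):
    # Count '-' symbols, assumed here to be the gaps the player inserted.
    gaps_used = sum(row.count("-") for row in solution)
    print(player_id, gaps_used, "of", entry["nGaps"], "gaps")
    # prints: 1269979 5 of 6 gaps
```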

pikdik: Puzzle Object

ATTRIBUTE	TYPE	DESCRIPTION
pikdik.puzzle	[list]	Collapsed symbol-wise split puzzle sequences (e.g. [['T', 'T', 'G', 'A', '-', '-', '-'], ['G', 'T', 'A', 'A', '-', '-', '-'], ['G', 'C', 'G', 'A', '-', '-', '-'], ['T', 'G', 'C', 'A', '-', '-', '-'], ['C', 'T', '-', '-', '-', '-', '-'], ['C', 'T', 'G', 'A', '-', '-', '-']] )
pikdik.par_puzzle	[list]	Symbol-wise split pareto solution of the puzzle (e.g. [['T', 'T', '-', 'G', 'A', '-', '-'], ['G', 'T', '-', 'A', 'A', '-', '-'], ['G', 'C', '-', 'G', 'A', '-', '-'], ['T', 'G', '-', 'C', 'A', '-', '-'], ['C', 'T', '-', '-', '-', '-', '-'], ['C', 'T', '-', 'G', 'A', '-', '-']] )
pikdik.columns	[str(list)]	Column indexes in original alignment (e.g. '[99, 100, 101, 102, 103, 104, 105]' )
pikdik.flanks	[list]	Column indexes used as flanks for a puzzle (e.g. [105, 106])
pikdik.consensus	[list]	Predefined guides for a puzzle (used in scoring) (e.g. [('C', 'T'), ('C', 'T'), ('-', 'C'), ('C', 'T'), ('-', 'A'), ('G', 'A'), ('-', 'C'), ('-', 'G')])
pikdik.cons_scores	[list]	Number of times each consensus nucleotide appears in the alignment (e.g. [(6287, 2552), (3416, 2688), (9666, 1), (2938, 2669), (9545, 55)])
pikdik.n	[int]	Number of sequences (e.g. 6)
pikdik.bonus	[float]	Multiplier applied when an aligned "line" perfectly matches the given consensus / guide (e.g. 1.15)
pikdik.level	[int]	Difficulty level of the puzzle in range [1, 9] (e.g. 2)
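
To show how the symbol-wise lists can be handled, the sketch below joins each row back into a string and strips the gap symbols. It uses a SimpleNamespace filled with the sample values above as a stand-in for a real Puzzle object, so it runs without dnapuzzle.py. The helper name ungapped is hypothetical, and the observation that puzzle and par_puzzle hold the same underlying sequences is based only on these sample values.

```python
from types import SimpleNamespace


def ungapped(rows):
    """Join each symbol-wise row into a string and drop gap symbols ('-')."""
    return ["".join(symbol for symbol in row if symbol != "-") for row in rows]


# Stand-in for a Puzzle object, filled with the documented sample values.
# A real object comes from the pickle file and requires dnapuzzle.py.
pikdik = SimpleNamespace(
    puzzle=[list("TTGA---"), list("GTAA---"), list("GCGA---"),
            list("TGCA---"), list("CT-----"), list("CTGA---")],
    par_puzzle=[list("TT-GA--"), list("GT-AA--"), list("GC-GA--"),
                list("TG-CA--"), list("CT-----"), list("CT-GA--")],
    n=6,
)

sequences = ungapped(pikdik.puzzle)
print(sequences)  # ['TTGA', 'GTAA', 'GCGA', 'TGCA', 'CT', 'CTGA']
print(sequences == ungapped(pikdik.par_puzzle))  # True for the sample values
print(len(pikdik.puzzle) == pikdik.n)  # True: one row per sequence
```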
