Genes and geography -- a bioinformatics project

This is a full walkthrough of a bioinformatics project: running PCA and t-SNE on population genotype data.

00:00 Intro
01:07 Hunting for data
04:55 Inspecting the VCF
06:02 Finding population labels for the samples
10:20 Parsing VCF with pysam
16:02 Going from alleles to numbers for a numpy array
21:47 When to work in colab versus python script
26:00 Saving data with pandas
28:42 Adding population labels from the panel file
33:33 To Colab!
36:54 PCA
40:17 First plot! Mission accomplished :)
42:03 Using Altair for plotting with labels
44:51 Second plot with population labels!
46:05 Merging with the igsr_population.tsv data
49:43 TSNE
53:36 Exercise: PCA on the SNPs
54:21 Conclusion and origin story for this project

* Download a VCF of population genotypes from the 1000 Genomes project.
* Use pysam to parse it, summarize the genotypes into a 2D numpy array for PCA, and save that array as a pandas dataframe.
* Run PCA and t-SNE on it and visualize the results with both matplotlib and Altair, coloring the points based on the ancestry labels.
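
To give a rough feel for the "alleles to numbers" and PCA steps listed above, here is a minimal sketch. It is not the code from the video: the sample names and genotypes are made-up toy data, `alt_allele_count` and `pca` are hypothetical helpers, and PCA is done with a plain numpy SVD so the example runs without a VCF file or any extra libraries. In the real project the genotype strings would come from parsing the VCF with pysam.

```python
import numpy as np


def alt_allele_count(genotype):
    """Count non-reference alleles in a genotype string such as '0|1' or '1/1'.

    Assumption for this sketch: missing calls ('.') are treated as reference.
    """
    alleles = genotype.replace("/", "|").split("|")
    return sum(1 for allele in alleles if allele not in ("0", "."))


# Toy data (hypothetical): one list of genotypes per sample, one entry per variant.
genotypes = {
    "sample_A": ["0|0", "0|1", "1|1", "0|0"],
    "sample_B": ["0|0", "1|0", "1|1", "0|1"],
    "sample_C": ["1|1", "0|0", "0|0", "1|1"],
    "sample_D": ["1|1", "0|0", "0|1", "1|1"],
}

samples = list(genotypes)

# 2D array: rows are samples, columns are variants, values are 0, 1, or 2.
matrix = np.array(
    [[alt_allele_count(g) for g in genotypes[s]] for s in samples],
    dtype=float,
)


def pca(data, n_components=2):
    """Project the rows of `data` onto the top principal components via SVD."""
    centered = data - data.mean(axis=0)  # center each variant column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


coords = pca(matrix)
for name, (pc1, pc2) in zip(samples, coords):
    print(f"{name}\tPC1={pc1:.2f}\tPC2={pc2:.2f}")
```

In this toy data, samples A and B share most genotypes and so do C and D, so the two pairs land on opposite sides of the first principal component. The sign of each component is arbitrary, so which pair ends up negative can differ.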

Here is the project ideas video where I first mentioned this project, in case you're interested in the origin story: Bioinformatics project ideas

All code including python script, download URLs for input files, and the Colab notebook: https://github.com/MariaNattestad/pca...
