At some point, I think shortly prior to things breaking, I had saved my output dataframe to a file to protect it, which let me see when I reran the unedited code that something had changed - one gene, OAS1, was missing. But the code didn't seem any different, so how could this gene have gone missing?
I checked OAS1 was indeed in the saved df - yep.
I tried to check in my other dataframe, which stores overlapping Neanderthal haplotypes with genes, for it, but OAS1 had never been in that dataframe because it has no overlapping haplotypes so that didn't help. This also meant I wouldn't be able to confirm what the right gene was by comparing the overlap lengths to the saved df (apart from checking that they're all zero/NaN in the relevant columns).
I checked that my inputs to the call to biomaRt were still asking for OAS1 as before, and indeed they were - yet it wasn't coming back from biomaRt.
Then, to check it wasn't a problem with my code or using biomaRt from R, I went to the browser and the Biomart website and manually typed in OAS1, OAS2 and OAS3 as my gene filters - that just returned results for OAS2 and OAS3. So the gene was missing from Biomart and thus presumably Ensembl.
I searched on the Ensembl website for OAS1 and just got random things like an antisense transcript over OAS1, 2 and 3, whereas when I searched OAS2 I got the OAS2 gene.
So then I thought maybe I had the wrong name for it (even though that wouldn't answer why it was there in the dataframe I made max a few days before) - but I looked at HGNC and OAS1 is apparently its approved symbol.
I also tried searching Biomart for OAS1 as symbol rather than gene name, but still no dice. HGNC listed OIASI and IFI-4 as alternative symbols for OAS1 so I tried those too, but still nothing.
Finally, just googling 'OAS1 ensembl' and clicking on the top link, a link to Ensembl with the title 'Gene: OAS1' and the stable ID, redirected me to a page with the gene called 'AC004551.1'.
I satisfied myself this was the right gene by:
- modifying my code to replace 'OAS1' with 'AC004551.1' in my call to Biomart and seeing that the output dataframe now had the right number of rows (4968 vs 4912 before) because the number of rows a gene gets depends on the gene (which GO term categories it's in).
- seeing it was not in the overlaps dataframe
- calling all.equal() on the new dataframe's categories for the gene AC004551.1 and the old df's categories for the gene OAS1.
- calling all.equal() on the new df's start position for that gene and the old df's start position for OAS1.
So that leaves me with the conclusion that Ensembl changed the gene name silently within a few days. So that's a thing that can happen.
- Now that I think about it, to make sure it actually exists in Ensembl and find the name Ensembl is calling it, I probably could've searched by its coordinates, which I had from my saved df.
- Turns out I need to use human genome build 37 instead of 38 because that's what my haplotype data is based on, and the gene is called OAS1 in 37 anyway, but there you go.
- Perhaps I should've used the stable IDs, but everything else I'm using uses gene names and I guess I preferred not to have to convert those.
- Ensembl seems to change gene names without warning. If you have another explanation, shoot me an email at loughrae/@/tcd/./ie!
Is this the most boring blog post of all time? Quite possibly. Oh well.