40 Comments

Oh that is very, very interesting. I guess Wu or anyone else involved in the 2020 paper is not providing any answers. Here's little old me thinking that one should be able to replicate the results in published papers.

To me, publishing three different sequences should end it all right there.

Thanks for doing this work and sharing. It's big stuff.

Expand full comment
Mar 27, 2023·edited Mar 27, 2023 · Liked by Ben

I've now figured out a less hacky way to show the positions where the sequences in an alignment don't all have an identical value. It also works with three or more sequences, so for example this shows all the differences between the three versions of Wuhan-Hu-1:

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.'{1,2,3}|mafft ->temp.aln

Rscript -e 'm=t(as.matrix(Biostrings::readDNAStringSet("temp.aln")));pick=which(rowSums(m!=m[,1])>0);write.table(`rownames<-`(m[pick,],pick),sep=";",quote=F,col.names=NA)'

Or the same in gawk:

awk 'NR>1{gsub("\t"," ")sub("\n","\t")gsub("\n","");print}' RS=\> temp.aln|awk -F\\t '{printf";%s",$1;l=length($2);for(i=1;i<=l;i++){x=substr($2,i,1);a[i][NR]=x;b[i][x]}}END{print"";for(i in b)if(length(b[i])>1){o=i;for(j=1;j<=NR;j++)o=o";"a[i][j];print o}}'

Expand full comment
author

BTW, another interesting thing: if you read the Wu et al. 2020 paper precisely, they say:

"Sequencing reads were first adaptor and quality trimmed using the Trimmomatic program32. The remaining 56,565,928 reads were assembled [..]" - the last number is exactly the reads they've uploaded, hence maybe they are already quality trimmed after all?

Expand full comment
author

Thanks, I tried the Rscript, but in the output there are breaks within the AAA tail, so probably that shouldn't be there?

Expand full comment
Mar 27, 2023·edited Mar 27, 2023 · Liked by Ben

The gaps are there because it attempted to align the poly(A) tail of the third version against the segment of human DNA that was accidentally included at the end of the first version.

BTW I hadn't noticed earlier that there's one base change in the middle of the alignment, where the first and second versions have C but the third version has G. The nucleotide change is at position 29277 in MN908947.3. So I guess the different start and end are not the only differences between the three versions after all.

Edit: If you run MAFFT with the `--globalpair` option, then it won't add gaps in the middle of the poly(A) tail of the third version. It uses the Needleman-Wunsch algorithm to perform pairwise alignments but it's a lot slower than the default algorithm.
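
For reference, the earlier one-liner with that option added would be something like:

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.'{1,2,3}|mafft --globalpair ->temp.aln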

Expand full comment
author

Thanks, that works well. With that I can align all five contigs and compare them well.

Expand full comment

I did this manually because it's near impossible to work out how these aligners are actually working, and with Bowtie it only came up with a 98% match,

so to keep this away from invisible algorithms I manually searched for AGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

then

GAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

then

AATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

then

ATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

This sequence, which forms the end of MN908947.3, only appears in the middle of the 150 bp reads; when it appears at the beginning of a read it's obviously followed by further nucleotides, so it couldn't possibly be the end of a sequence.

When the search strings get shorter it becomes possible, but that is down to pure algorithms and moves further from reality or probability.
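
A way to reproduce this kind of search without an aligner (a sketch; the file names assume the ENA download used elsewhere in this thread, and the pattern means AATGAC followed by at least 30 A's) is plain grep, printing a few bases of context on each side so you can see whether the match sits at the start, in the middle or at the end of a read:

gzip -dc SRR10971381_{1,2}.fastq.gz|awk 'NR%4==2'|grep -Eo '.{0,6}AATGACA{30,}.{0,6}'|sort|uniq -c|sort -rn|head

Any match that is followed by non-A bases is sitting in the middle of a read rather than at its end.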

When this sequence alignment was done in Bowtie, it actually can't create the tail and has a 2% mismatch rate overall. The read it does place on the end is 9235278:

From: 29,862 to 29,901

Length: 40 (6 mismatches)

Cigar: 40M

Read direction is REVERSE

>9235278

GGGGAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAA

rather than the official (MN908947.3):

AGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

This read only appears once in the SRA file and is chopped down to fit; the full read is actually:

>gnl|SRA|SRR10971381.27254830.227254830 Biological (Biological)

TTTG'GGGGAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAA'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

The first four nucleotides could be adapters that would be trimmed so that the beginning lines up, but it's still not capable of creating the official sequence: you have to ignore the adapters, the incorrect nucleotides and most of the remaining nucleotides from the read, and it still gets it wrong. Using shorter-read aligners it starts becoming possible, but this is also where you can create HIV, Ebola etc.

These papers need to publish their steps; from what I am seeing it's algorithms and artistic rendering.

Expand full comment
author

Very interesting. Yes, I remember one time also searching for the ends in the reads and not being able to find them. That you've found it, but only in the middle, is really another interesting finding. As you said, if it's only in the middle, it couldn't be the end of the genome :)

Expand full comment

I will run it with BWA, BBMap and MapPacBio and compare the results. I would guess that once the allowed insertions/deletions get higher and the alignment length starts dropping below 30-40 bp it will start becoming possible, but I will post up the results. If it's not there with Bowtie, I would say it wasn't actually there, as that one is slack and forgiving enough as it is. At least it will give me a rough idea of how to reverse engineer what rules they are following.

Expand full comment

I took a 20-base segment near the end of the third version of Wuhan-Hu-1 and I used `seqkit locate` to find all sequences which had an exact match to the segment in one of the two paired read files for SRR10971381:

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.3'>sars2.fa

seqkit subseq -r -60:-41 sars2.fa|seqkit locate -Ff- SRR10971381_1.fastq|tee locateout

There were a total of 156 matches even though I only searched in one of the two paired read files and I only included exact matches. All matches were on the minus strand, so I took the reverse complement of the matching reads and I aligned them against MN908947.3:

cut -f1 locateout|seqkit grep -f- SRR10971381_1.fastq|seqkit fq2fa|seqkit seq -rp|cat - sars2.fa|mafft ->temp.aln

seqkit subseq -r-100:-1 -w0 temp.aln|awk '{getline x;print x,$0}'

The number of gaps at the end was 28 for 1 read, 29 for 1 read, 30 for 2 reads, 31 for 1 read, 32 for 5 reads, 33 for 4 reads, 34 for 1 read, 36 for 14 reads, 37 for 29 reads, 38 for 57 reads, and 39 for 41 reads. So no read came even close to including the full 33-base poly(A) tail of MN908947.3. So maybe the full tail isn't even included in the reads and you need RACE to sequence it.
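
The trailing-gap counts can be tabulated straight from the alignment with something like this (a sketch, reusing the temp.aln produced by the previous commands; it counts the trailing gaps of every sequence and summarises them):

seqkit fx2tab temp.aln|awk -F'\t' '{match($2,/-*$/);print RLENGTH}'|sort -n|uniq -c

The reference itself should show up with 0 trailing gaps.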

I posted the full output of my shell commands here: https://pastebin.com/raw/cZ40waGn.

Here's one of the paired reads which includes 4 bases of the poly(A) tail: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR10971381&display=reads&search=20473872. You can try to copy the bottom read which ends with AAAA and paste it here and press BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn. It has 100% identity with one SARS 2 sequence published in 2021 where all 123 bases of the read are identical. There's 4^123 or about 1e74 possible sequences of 123 nucleotides.
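
If you'd rather pull that read from the command line instead of the run browser, something like this should work (a sketch; it assumes the number in that URL is the spot ID and that sra-tools is installed):

fastq-dump --split-files -N 20473872 -X 20473872 SRR10971381

That writes just the two mates of spot 20473872, which you can then paste into web BLAST as above.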

Expand full comment
Mar 30, 2023 · Liked by Ben

Way out of my depth here, but daring to comment ...

Given that sequencing this genome has had problems reproducing the originally posted work, that sequencing the entire genome itself has remained elusive, and that the versions all differ when others try to sequence the genome...

Is this as suspicious as it sounds? Or is it just that this genome is tricky by nature?

Is it a case of the following:

"the corona virus is an RNA virus. Because RNA is only a single strand of genetic material as opposed to the double strand of DNA, but also very long, a virus making copies of itself using RNA as a blueprint leads to many copy errors. Between 90-99+% of the copies it makes of itself are replication incompetent"? (copy paste from another substack - Occam's Razor)

Is this why there is such large variance in the sequencing?

What were we testing for in PCR tests of patients?

I find the article above intriguing, but I want to be more specific about the implications when discussing it with others, so I do not misinterpret or misrepresent your work.

Expand full comment

In my opinion, after diving into this data for over two years and being aware of Lanka's control experiments:

The virus never existed, but people do get ill for whatever reason.

The genome is a product of two rounds of PCR (way out of spec) with a huge number of primers which, late in the PCR process after so many heating/cooling cycles, will start attaching and creating sequences that were never present in the patient sample.

The data set contains a huge amount of bacterial DNA, human RNA/DNA, and many sequences with no match whatsoever. To claim this genome was viral in the first place, it should have been isolated (separated from everything else) with standard microbiology techniques like density gradient centrifugation, finding the right density band for particles of a given size/density, and then those 'viral' particles should have been sequenced; not sequence everything and then determine by process of elimination that the leftover must have been a virus because it was similar (89%) to another non-isolated virus that was created in the same way, which was in turn similar to a bat virus that was also created in the same way. None of the referenced foundational papers actually find a viral particle. So the read files should not be 99.8% other stuff.

The Fan Wu SARS-CoV-2 genome can only be recreated from their own dataset with short-read aligners: taking segments from reads and seeing if they match the reference genome, chopping off the first 30 base pairs that don't fit and the last 70, and selecting the 50 in the middle that do. This is like picking words from Shakespeare to write a poem and claiming it is a Shakespeare variant; you could write a lot of poems doing this, but it never came from Shakespeare himself. The problem with the short-read aligners is that if I use the same technique with HIV, Zika, Ebola or measles, I can also recreate those genomes from the same dataset (above the 89% which convinced them they had a coronavirus). There was also no step to go back and confirm with a gel that a sequence of 29.8-30.4k actually existed. There was nothing in the paper to suggest that their sequence had anything to do with making people ill or that it was the cause of any disease, just a misinterpretation of normal human sequences and claiming they are viral. There were also no control experiments; they did use water as a PCR negative control, but there was no positive control where they take a healthy patient's lung fluid and perform the same PCR and deep metagenomic sequencing.

Stefan Lanka repeated this procedure using yeast RNA as a source of genetic material, same PCR protocol, same software, and you can find any and every RNA viral genome there is at above 99%.

The PCR tests are testing for very short sequences within the 29.8k genome, usually 2-3 primers, and if they bind a lot there is higher fluorescence and the sample is declared positive; but with 45 cycles you can get positive results from water. The primers are also very short, some only 18 bp long. They also do not go back to check what sequence they amplified; when they do sequence it and find out it's different from what they expected, they say it's a variant and upload a new sequence, which is why we have 15,328,837 different versions, many of them incomplete and only partial sequences.

I can also find the SARS-CoV-2 genome in datasets going back to 2016 using the same software. The sequences only appear on a screen, not in the physical world. I am open to changing my mind about all of this, but we are not dealing with science here; it's pseudoscience, compartmentalizing molecular biology with technology where people are just using apps to deal with really large datasets.

Expand full comment

Don't you think that bacteria exist though, and don't you think that metagenomic assemblers like MEGAHIT produce valid contigs for bacteria even if the sample contains other organisms? Then why would you need to separate out viruses from the sample?

USMortality's second-longest contig was 16,024 bases long, and it had a 99.2% match with "Leptotrichia hongkongensis JMUB5056 DNA, complete genome" with 16,023 overlapping bases (so is that contig not valid either?):

brew install blast seqkit;curl https://raw.githubusercontent.com/USMortality/Megahit-SARS-CoV-2/master/out/final.contigs.fa|seqkit sort -lr|seqkit range -r 2:2|blastn -db nt -remote -outfmt '6 qseqid sseqid pident length qlen slen qstart qend sstart send stitle'

You wrote: "The Fan wu sars-cov2 genome from there own dataset can only be recreated with short read aligners , taking segments from reads and seeing if they match the reference genome chopping off the first 30 base pairs that dont fit and the last 70 and selecting the 50 in the middle that do". However in this post, didn't MEGAHIT also recreate the genome apart from the last 30 bases of the poly(A) tail and the two extra bases at the beginning? It didn't need any reference genome.
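
One way to check that directly (a sketch, using the contigs file from this post's repository and the sars2.fa reference fetched in my earlier comment) is to align the longest contig against MN908947.3:

curl https://raw.githubusercontent.com/USMortality/Megahit-SARS-CoV-2/master/out/final.contigs.fa|seqkit sort -lr|seqkit head -n1|cat - sars2.fa|mafft --globalpair ->longest_vs_ref.aln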

You also wrote: "I can also find the sars-cov2 genome in datasets going back to 2016 using the same software". Can you post your code, and which datasets are you talking about?

Expand full comment

If you post up your email, I'll send you what I have run through: some good apps, viewing software, commands etc. The more eyes on this the better.

Expand full comment

Yes, tomorrow I will get the datasets, but they were Ebola deep metagenomic SRA files with 50-75 bp reads. Yes, I think the bacteria contigs are legit; I don't see too much of a problem with the assembly, and the reason we know those are bacterial sequences is that the bacteria have been properly isolated and characterised, so we definitely have a reference to say this sequence belongs to this organism. The difference with viruses is that we have never found one to correctly isolate (in the real meaning of the word) so as to say this sequence 100% comes from this organism. They are bypassing all isolation steps and going straight to sequencing, assuming it's there, designing primers to find what they are expecting, and sort of working backwards to create the proof with really out-of-spec PCR, all with no control experiments. Every step and every variable should be controlled, otherwise it's not science. I mean, had they looked in this dataset for Ebola or HIV and designed primers for that, they would have had a better match than 89%. With the 30-40 SRA files for SARS-CoV-2 that I have run, the longest contigs I've seen are 3.5k; there may be better examples but I haven't seen them yet. Those were nanopore sequencing (1000-1400 bp reads), and it looked to me like some of those sequences did actually exist, but we have no idea what their origin is. I can claim they are human, because at this point they definitely did come from a human swab, but if we are going to claim they are bacterial, viral or extraterrestrial then we need to find that organism first and then make sure the sequence definitely does come from it, is repeatable and is controlled.

The biggest problem I can see is with the alignment steps; this is cherry picking to create something you never had, and then with the mpileup, sort of overlaying multiple incorrect sequences to fill in missing or incorrect nucleotides and fill in the blanks to match the reference. MEGAHIT did create a 29,802-base sequence, but if you put them together they aren't that similar, although if you do a basic local alignment search then they align pretty well. The question should be: how did they go from the 30.4k that isn't repeatable, to the actual 29,802 that was found, to MN908947.3? Where did that come from?

Expand full comment
Apr 3, 2023·edited Apr 3, 2023 · Liked by Ben

Maybe the 30,473-base sequence of MN908947.1 can't be reproduced because human reads were removed from the files uploaded to the SRA. Or maybe those reads that consist of only N letters were originally the human reads.
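
Counting the reads that consist only of N is a quick check of that idea (a sketch; it assumes the FASTQ files downloaded by the wget command below):

gzip -dc SRR10971381_{1,2}.fastq.gz|awk 'NR%4==2&&/^N+$/'|wc -l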

The publication date of the raw reads is listed as 2020-01-27 at the SRA, but at that time MN908947.2 had already been published so they had probably noticed the error they made in MN908947.1 (https://www.ncbi.nlm.nih.gov/sra/SRR10971381).

When I tried aligning the paired reads against MN908947.1, there was no read whose starting position was higher than 29,830:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR109/081/SRR10971381/SRR10971381_{1,2}.fastq.gz

curl -so MN908947.1.fa 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.1'

bwa index MN908947.1.fa;bwa mem -t 4 MN908947.1.fa SRR10971381_{1,2}.fastq.gz|awk '/^[^@]/{if($4>x)x=$4}END{print x}'

Expand full comment
author

Paul & Mongol, can we move this to a Slack or Telegram chat group?

We should discuss all this in a better format; these comments are kind of getting out of hand ;)

Please let me know your preference, and I'll set it up.

Expand full comment
Apr 5, 2023·edited Apr 5, 2023

I was thinking earlier that maybe someone could start a Discord for COVID conspiracytard coders, or people who are writing scripts about COVID statistics or bioinformatics.

I've been posting some of my scripts on two SARS Discords but I don't think anyone there has even tried to run my code.

Discord would be my preference, or oldschool forums like XenForo or Discourse. Discord is pretty good for posting code and images, and even if you have code that's longer than 2000 characters, you can post it as a text file that gets displayed with an expandable preview block, like in this example: https://i.ibb.co/M8gjF7F/20230405122308.png.

Expand full comment

Yes, agree. On Substack it's hard to paste up charts and commands and have them mean anything to the reader. Telegram would be my preference.

Expand full comment

You might think I am pulling your leg here, but I had 12 SRA files that I went through that were 98%+.

I installed LXD for clustering and it wiped my main PC; I tried TestDisk to recover the deleted files but I haven't found them yet, so I will run through similar searches again. The first one I tried was https://www.ncbi.nlm.nih.gov/sra/SRR3647349, which came out with a 2% mismatch overall.

here is the end including the first error (N) Ngagtgtacagtgaacaatgctagggagagctgcctatatggaagagccctaatgtgtaaaattaattttagtagtgctatccccatgtgattttaatagcttcttaggagaatgacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Most of the errors are in the first 10 bases; bases with quality lower than 20 are set to N (seqtk).

Mapped with BBMap, no reformatting, consensus sequence generated with bwa. I will keep looking; there were better ones. Reads are chopped into pieces as small as 15 bp and up to 127 bp, which is why I don't think much of the alignment.
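
For reference, the seqtk masking step described above would look something like this (a sketch; the threshold matches the description, but the file names are assumptions):

seqtk seq -q 20 -n N SRR3647349_1.fastq > SRR3647349_1.masked.fastq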

Expand full comment

The problem is that there are still millions of human reads in the SRA file. Here are the contigs by abundance (sorry, it doesn't paste very well; the columns appear to be contig ID, contig length, coverage, matched accession and description, then % identity, alignment length, e-value and bit score from BLAST, with decimal commas from a European locale):

k141_5448 913 78.694 CP024724.1 Prevotella intermedia strain KCOM 2837 chromosome 2, complete sequence 91,730 919 0,000E+00 1.264

k141_24026 850 134.614 CP085934.1 Prevotella copri DSM 18205 strain FDAARGOS_1573 plasmid unnamed2, complete sequence 97,291 406 0,000E+00 688

k141_4059 812 132.074 CP072363.1 Prevotella jejuni strain F0697 chromosome 1, complete sequence 98,684 532 0,000E+00 942

k141_525 754 217.291 CP072333.1 Porphyromonas sp, oral taxon 275 strain W7780 chromosome, complete genome 98,802 501 0,000E+00 891

k141_8586 656 762.405 CP016205.1 Prevotella scopos JCM 17725 strain W2052 chromosome 2 genome 98,176 658 0,000E+00 1.147

k141_19969 639 157.783 MT242596.1 Homo sapiens isolate DH001 mitochondrion, complete genome 100,000 477 0,000E+00 881

k141_19969 639 157.783 OL521838.1 Homo sapiens haplogroup I3 mitochondrion, complete genome 100,000 477 0,000E+00 881

k141_19969 639 157.783 OK104093.1 Homo sapiens haplogroup H4a1a1a mitochondrion, complete genome 100,000 477 0,000E+00 881

k141_19969 639 157.783 OK266950.1 Homo sapiens haplogroup H3i mitochondrion, complete genome 100,000 477 0,000E+00 881

k141_19969 639 157.783 OK239657.1 Homo sapiens haplogroup H1b1d mitochondrion, complete genome 100,000 477 0,000E+00 881

k141_534 620 199.625 CP072361.1 Prevotella melaninogenica strain F0054 chromosome 1, complete sequence 98,387 620 0,000E+00 1.090

k141_7697 569 360.511 CP023863.1 Prevotella jejuni strain CD3:33 chromosome I, complete sequence 96,473 567 0,000E+00 935

k141_11094 532 62.843 CP072331.1 Prevotella veroralis strain F0319 chromosome 2, complete sequence 97,543 529 0,000E+00 904

k141_17668 508 202.095 CP085941.1 Prevotella melaninogenica strain FDAARGOS_1567 chromosome 2, complete sequence 99,018 509 0,000E+00 911

k141_11371 427 198.847 CP072360.1 Prevotella melaninogenica strain F0091 chromosome 1, complete sequence 99,532 427 0,000E+00 778

k141_9606 408 63.287 CP072330.1 Prevotella veroralis strain F0319 chromosome 1, complete sequence 98,529 408 0,000E+00 721

k141_25754 384 133.274 MN297237.1 Homo sapiens LHRI_LNC32,2 lncRNA gene, complete sequence 100,000 256 4,700E-129 473

k141_25754 384 133.274 MN297236.1 Homo sapiens LHRI_LNC32,1 lncRNA gene, complete sequence 100,000 256 4,700E-129 473

k141_25754 384 133.274 CP034492.1 Eukaryotic synthetic construct chromosome 14 100,000 256 4,700E-129 473

k141_25754 384 133.274 NG_050638.2 Homo sapiens ribosomal protein S29 (RPS29), RefSeqGene (LRG_1147) on chromosome 14 100,000 256 4,700E-129 473

When I remove the human sequences, these definitely do get removed; none of these reads remain. So there is no way the published SRA file can be the one they used to create these genomes: none of the MEGAHIT results are repeatable, human reads remain, and the number of contigs they claim from both assemblers is 10-12 times lower, which points to filtering of the data; lots of N's are present, which also points to filtering of the data. Trinity 2.5.1 is so unrepeatable that it's possible they got a length of 11k; I have done around 12 runs and the shortest I've gotten is 17k, so this version of the assembler isn't very good, same split SRA files, different results every time. But the tail end of MN908947.3 cannot be made without short-read alignment.

If you look at the number of primers used, their positions and the Ct, it is possible that this was just attaching smaller human RNA sequences. The other problem is that if you look at the mapped reads in Tablet, the amplification wasn't specific or even, so it could well be that some of the sequences were created during the later stages of the PCR process. Also, why design primers for a genome of 29,802 bp if the longest contig was 30.4k? To me it looks like they knew what they were looking for and repeated an experiment to create it, adding the poly(A) tail to create a bat story. All very uncontrolled.
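
For anyone who wants to reproduce the human-read removal, the usual approach with Bowtie2 looks something like this (a sketch, not the exact command used above; it assumes a prebuilt human index named GRCh38):

bowtie2 -p 4 -x GRCh38 -1 SRR10971381_1.fastq.gz -2 SRR10971381_2.fastq.gz -S /dev/null --un-conc-gz nonhuman_%.fastq.gz

The --un-conc-gz output keeps only the read pairs that did not align concordantly to the human reference.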

Expand full comment

Thank you so much, great response (I'm saving this comment). I sometimes suspect the true disease is what they injected; the rest may have been theatre waiting for the injection, or even using PCR to diagnose everything as covid. And I'm not disputing that maybe there was something out there making people sick (plus the death protocol of hospital treatment and the lack of early treatment).

Given the replication issues with RNA viruses as mentioned, a coronavirus pandemic would be short-lived and localised, I'd think... The way to avoid replication issues is to give your body genetic instructions to make perfect spike proteins every time? My mind changes all the time on this subject, but it was all about a needle in the arm; beyond all reason and sense around experimental tech, you MUST have the needle... It's all so dodgy.

Expand full comment

Yes, it was mainly just normal seasonal illness rebranded. Was there something going on in localised areas? Possibly, but since the entire medical community is looking for viruses (which don't exist), they are never going to see what the real cause was; even 70 years after people knew polio was pesticide poisoning, we still have doctors hailing the polio vaccine, because they believe in the religion of infectious diseases. I also haven't seen any proper evidence that the injected mRNA actually does anything or creates any kind of protein, only that the lipids that carry the mRNA are extremely toxic and a self-adjuvant, and therefore will create new proteins through the body being poisoned. Your body is a miracle and already has all the information to create perfect proteins, but when you buy into the invisible enemy (which has never been shown to exist), then you need specific antibodies to fight these imaginary viruses, and that leads people to believe they are not perfect and need intervention. The key is to avoid being poisoned, knowingly or unknowingly, and not many people have the knowledge to do this (mainly allopathic doctors). There are lots of sources of poor-health-causing chemicals in our everyday environment.

People got ill pre-2019 and they also got ill in 2020; the overall mortality data shows they are very similar years globally, and for a lot of countries 2019 was actually worse. To me we were just dealing with the most intense propaganda campaign ever, and mass testing where even a perfectly well person was considered ill and a case. I was lucky enough to meet Kevin Corbett in June 2020, who explained the whole HIV and testing scam and mentioned the Perth Group's work on HIV, and I saw the exact same playbook for covid. So I never got involved with masks, cases, measures, jabs or alternative treatments; from day one my question has been "does this virus exist", and I still haven't seen anything that would suggest it does. The worst part is finding out that these arguments about SARS-CoV-2 apply to every virus. Then finding out that Stefan Lanka made a challenge around 2015 of 100k euros to anyone who could prove a measles virus exists; he won on appeal, and they admitted there is no scientific proof in any of the presented papers, and the papers that were rejected as proof were the foundational papers of virology, so this should have been over years ago.

A decent video covering the case:

https://odysee.com/@drsambailey:c/themeaslesmyth:0

Expand full comment

You hit a conundrum that I have harbored since the beginning of this pandemic. I am not even close to your level of understanding/experience, but I have the ability to understand and reason, and the absence of a complete genome always ate me up.

If the sequence is incomplete and the RT-PCR process is applied... is it possible that we could be testing the same thing over and over again but calling it a different variant because it doesn't mirror previous sequences? Would the inverse be true when short sequences are utilized that match previous short sequences, so those specimens are called alike?

Expand full comment
Apr 1, 2023·edited Apr 4, 2023

In a paper in 2020 titled "A Novel Bat Coronavirus Closely Related to SARS-CoV-2 Contains Natural Insertions at the S1/S2 Cleavage Site of the Spike Protein", they described the RmYN02 sequence which was similar to some of the BANAL sequences that were published later in 2022: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7211627/. However the whole genome sequence of RmYN02 is not available from GenBank but only GISAID, so I tried to assemble it myself from the raw reads.

My MEGAHIT run produced fairly long contigs of three SARS-like bat viruses: a 29,601-base contig which had 97.4% identity with "Bat coronavirus isolate BANAL20247/Laos/2020" (because RmYN02 is not included at GenBank), a 7,111-base contig which had 99.4% identity with "Betacoronavirus sp. RsYN03 strain bat/Yunnan/RsYN03/2019", and a 4,492-base contig which had 97.2% identity with "Betacoronavirus sp. RmYN07 strain bat/Yunnan/RmYN07/2020". RsYN03 and RmYN07 were originally sequenced from the same pool of samples as my sample, which is why they had such high identity with my contigs: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8188299/#sec2title.

I now did the trimming with trim_galore, which is a bit more convenient than Trimmomatic because it automatically detects which adapters are being used and removes them:

brew install --cask miniconda;conda init ${SHELL##*/};conda install -c bioconda trim-galore

brew install blast seqkit

brew install -s megahit # use `-s` to compile from source because the bottle is compiled against GCC 9

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR124/009/SRR12432009/SRR12432009_{1,2}.fastq.gz # download about 4 GB of raw reads sequenced from a sample of bat shit

seqkit head -n10000 SRR12432009_1.fastq.gz|seqkit locate -p AGATCGGAAGAG # find reads which feature the Illumina universal adapter sequence or its reverse complement

trim_galore --paired SRR12432009_[12].fastq.gz --length 70 # remove reads shorter than 70 bases, remove Illumina universal adapter sequences, and trim reads with quality under 20

megahit -1 SRR12432009_1_val_1.fq.gz -2 SRR12432009_2_val_2.fq.gz -o megabat

curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1eq9q_6jPbjvfvKnpAWDkbIyjgky7cjv5' # download an aligned FASTA file of 335 SARS-like viruses

seqkit seq -g sarslike.fa>sarslike.ungap;makeblastdb -in sarslike.ungap -dbtype nucl # make a BLAST database of SARS-like viruses

update_blastdb.pl --showall tsv # list available BLAST databases

update_blastdb.pl --decompress ref_viruses_rep_genomes # download a small database of viral genomes

<megabat/final.contigs.fa blastn -db 'ref_viruses_rep_genomes sarslike.ungap' -outfmt '6 qseqid sseqid pident length bitscore qlen slen stitle qstart qend sstart send'|tee blastout|awk '$3>=95&&$4>=500'|sort -rnk5|awk '!a[$1]++' # select matches with at least 95% identity and at least 500 bases overlap; show only best match for each read

In an earlier run where I didn't do any trimming, I actually got a bit longer contigs for RsYN03 and RmYN07, so maybe I did the trimming too aggressively. This plot shows the results of my earlier run: https://i.ibb.co/9Txbdgv/1.png. My plot only includes viruses where the match length was at least 500 bases and the identity percent was at least 95%. Most of the viruses I matched are bacteriophages that infect intestinal bacteria like E. coli. My plot only includes matches in two small BLAST databases for viruses, so it doesn't include the contigs which matched bacteria or bat DNA. In order to draw the colored clusters in my plot, I downloaded the sequences of all viruses from GenBank, I aligned them with MAFFT, I made a distance matrix of the alignment, I used the distance matrix as input for hierarchical clustering, and I cut the clustering tree at the height where it had 8 subtrees: `cutree(hclust(as.dist(1-bio3d::seqidentity(bio3d::read.fasta("input.fa")))),8)`.

The genome of RmYN02 might actually be fraudulent though. It was used to demonstrate that the furin cleavage site of SARS 2 has a natural origin, because RmYN02 contained a "PAA" sequence around the furin cleavage site which at that point had not been described in other SARS-like viruses, even though it was later featured in some of the BANAL viruses published in 2022. In my alignment of the spike proteins, the region around the furin cleavage site is "QTNSPRRARSVA" in Wuhan-Hu-1, "PAA-------RV" in RmYN02, BANAL-20-116, and BANAL-20-103, "QTNS----RSVA" in BANAL-20-52 and BANAL-20-236, and "ASIL----RSTS" in the Zhoushan bat virus ZC45. The point was that the "PAA" sequence would've evolved into the "PRRA" sequence, but it's weird that the "PAA" sequence is not featured in BANAL-20-52, which is much closer to SARS 2 than BANAL-20-116 or BANAL-20-103 is.

Here's how you can align the spike proteins:

brew install seqkit mafft brewsci/bio/snp-dists;conda install -c bioconda nextclade

nextclade dataset get --name sars-cov-2 --output-dir .

seqkit sort -lr megabat/final.contigs.fa|seqkit range -r 4:4|seqkit seq -rp|nextclade run --input-dataset sars-cov-2 --output-all cladeout

snp-dists sarslike.fa>sarslike.dist

awk -F\\t 'NR==1{for(i=2;i<=NF;i++)if($i=="NC_045512.2")break;next}$i<=5000{print$1}' sarslike.dist|curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta_cds_aa&id=$(paste -sd, -)"|seqkit grep -nrp spike\|surface|cat - cladeout/nextclade_gene_S.translation.fasta|mafft ->spike.aln

seqkit subseq -r 692:703 spike.aln|seqkit fx2tab|sed $'s/lcl|//;s/_prot_[^\t]*//'|awk -F\\t 'NR==FNR{a[$1]=$2;next}{print$2,$1" "a[$1]}' <(seqkit seq -n sarslike.fa|sed $'s/ /\t/;s/, complete genome//') -

Expand full comment

When I ran MEGAHIT without Trimmomatic for SRR10971381, my second-longest contig was 16,037 bases long and its best match at BLAST was "Leptotrichia hongkongensis JMUB5056 DNA, complete genome" with 99.18% identity and 99% query coverage. My third-longest contig was 13,657 bases long and its best match at BLAST was "LR778174.1 Veillonella parvula strain SKV38 genome assembly, chromosome: 1" with 99.82% identity. My fourth-longest contig was 11,777 bases long and its best match was "Leptotrichia sp. oral taxon 212 strain W10393, complete genome" with 99.24% identity. I didn't find any matches for my fifth-longest contig, but my sixth-longest contig was 8,062 bases long and its best match was "CP068263.2 Homo sapiens isolate CHM13 chromosome 15" with 99.89% identity.

From the "Analysis" tab of the NCBI's sequence read archive, you can see which organisms the reads are estimated to come from: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR10971381&display=analysis. For SRR10971381, 61% of reads are unidentified, 33% are classified under bacteria, 5% under eukaryotes, and only 0.2% under viruses. So the ratio of 33% bacteria to 5% eukaryotes is consistent with my BLAST searches, where three out of four matched contigs were bacterial and the fourth was human.

To get the second-longest contig from the new dataset in this Substack article, run: `curl https://raw.githubusercontent.com/USMortality/Megahit-SARS-CoV-2/master/out/final.contigs.fa|seqkit sort -rl|seqkit range -r2:2`. Then paste the result here and press BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn.

Expand full comment
Mar 30, 2023·edited Mar 30, 2023

This finds MEGAHIT contigs which match a BLAST database for selected RNA reference sequences:

brew install blast

update_blastdb.pl --showall tsv # show name and size in GB of available databases (the `tsv` modifier is needed to print metadata about each database and not just identifiers)

update_blastdb.pl --decompress refseq_select_rna # download a tar file for a small 66 MB database and decompress it

blastdbcmd -db refseq_select_rna -metadata # shows that the database has about 60 thousand sequences with about 200 million total bases

wget https://raw.githubusercontent.com/USMortality/Megahit-SARS-CoV-2/master/out/final.contigs.fa

alias blan='blastn -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen stitle"' # `-outfmt 6` uses TSV output; here I included the same fields that are included in `-outfmt 6` by default but I added fields for subject title, query length, and subject length (see `blastn -help` for list of available fields)

<final.contigs.fa blan -db refseq_select_rna|tee blastout

awk '$3>=95&&$4>=500' blastout|sort -rnk3|awk '!a[$1]++'|column -ts$'\t' # select matches that are at least 95% identical and at least 500 bases long, print only the highest-percentage match for each contig

The code above returned 14 matches which were all sequences of human mRNA.

Next I downloaded a BLAST database for viruses: `update_blastdb.pl --decompress ref_viruses_rep_genomes;<final.contigs.fa blan -db ref_viruses_rep_genomes>blastout2;awk '$3>=95&&$4>=200' blastout2|sort -rnk3|awk '!a[$1]++'|column -ts$'\t'`. I included only matches with at least 95% identity and at least 200 bases. There was one contig which matched "Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1", four contigs which matched "Human endogenous retrovirus K113", and two contigs which matched "Streptococcus phage PH10".

Expand full comment

I've got highly similar results. I've been running them with/without Trimmomatic and with fastp, ran the results through blastn and put it all in Excel. I have never had the same results twice from Trinity 2.5.1 without fastp; it just doesn't paste very well in here. I've also been running SPAdes; none of the results line up, but they're mostly within 1-3% of each other.

I like the command

$ blastn -db nt -remote -num_alignments=5 -query longestfirst.fa -out longest30.csv -outfmt "10 qseqid sacc stitle pident length evalue bitscore"

I don't have the space for databases.

It keeps the results shorter and runs a bit quicker. If any SARS-CoV-2 sequence is in there, though, I get errors from BLAST for CPU usage exceeded and it fails, so I have had to do that one manually from web BLAST and remove it from the input FASTA file. When I removed human sequences with Bowtie, they definitely did disappear, so I'm not sure they performed ribosomal removal as claimed, or why they are publishing reads of NNNNNNNN's alluding that these are there because they filtered out human reads. Even with every possible way of treating the data, I cannot repeat any of the results. Did you run Trinity 2.5.1? It would be interesting to see your results for contigs by length. I have two runs with fastp which are the same, three without fastp which are all different, and two with Trimmomatic which are the same, with the longest contigs ranging from 13k to 27k, never 11k as claimed.
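
For reference, a minimal fastp run on the paired files looks something like this (a sketch with default parameters, not necessarily the settings used above; output names are assumptions):

fastp -i SRR10971381_1.fastq.gz -I SRR10971381_2.fastq.gz -o trimmed_1.fastq.gz -O trimmed_2.fastq.gz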

Expand full comment
Mar 31, 2023·edited Mar 31, 2023

I didn't even know you could use a remote database :facepalm:.

I tried to run your command with Ben's longest contig which was the sequence of SARS 2, but I got this error: "Error: [blastn] internal_error: (Severe Error) Blast search error: Details: search failed. # Informational Message: [blastsrv4.REAL]: Error: CPU usage limit was exceeded, resulting in SIGXCPU (24)." When I try to search for sequences of SARS-like viruses on the web BLAST, I also get an error which says "CPU usage limit was exceeded", but I figured out that I can avoid the error if I exclude SARS 2 from the search results by setting the organism to "SARS-CoV-2 (taxid:2697049)" and clicking the "Exclude" checkbox next to it. Apparently the equivalent of that with the command-line BLAST is `-negative_taxids 2697049`, but it gave me this error: `Argument "remote". Incompatible with argument: 'negative_taxids'`. I couldn't get this to work either: `-entrez_query 'NOT 2697049[taxID]'`.

So therefore I omitted the SARS 2 sequence and I only searched for the 2nd to 10th longest contigs:

curl https://raw.githubusercontent.com/USMortality/Megahit-SARS-CoV-2/master/out/final.contigs.fa|seqkit sort -lr|seqkit range -r 2:10|blastn -db nt -remote -outfmt '6 qseqid sseqid pident length qlen slen qstart qend sstart send stitle'|tee out.tsv

It took forever to run though. In comparison, `<final.contigs.fa blastn -db refseq_select_rna -outfmt 6` took 12 seconds to run and returned 6529 results, but when I added the `-remote` flag, it had only returned 2 results after 18 minutes when I terminated it.

I now tried running Trinity without doing trimming: `trinity --seqType fq --left SRR10971381_1.fastq --right SRR10971381_2.fastq --max_memory 10G --output trinityout`. The first time I got the error `dyld: Library not loaded: /usr/ldyld: Library not loaded: /usr/local/opt/gcc/lib/gcc/10/libgomp.ocal/opt/gcc/lib/gcc/10/libgomp.1.dylib`, so I ran `brew reinstall trinity -s` to compile Trinity from source in order to use GCC 11. The second time it failed because it required more than 40 GB of disk space so I ran out of free space. The third time it failed again at the Jellyfish stage, which may have been because I only gave it 5 GB of memory because it said: `If it indicates bad_alloc(), then Inchworm ran out of memory. You'll need to either reduce the size of your data set or run Trinity on a server with more memory available.` Next I'll try to give it more memory.

Expand full comment
Mar 31, 2023·edited Mar 31, 2023

Yes, with remote commands it takes around an hour to pull the data; I think the results are stored in memory and then dumped into the text file at the end. If you don't specify the max alignments you will get 100-200 results for each sequence and this can take ages, and if the search goes on for longer than an hour BLAST terminates it. I think there are so many accessions for SARS-CoV-2 that it just won't run; as you say, you can even get the same errors on web BLAST, around 25% of the time for me. But for 30-50 contigs I just leave it running for an hour and it's got a good success rate if SARS-CoV-2 is omitted. You are reliant on an internet connection though, so local databases can be quicker and more reliable, even though the searches might be smaller.

The output of Trinity is around 70-90 GB, which is insane if you only want to look at a 70 MB FASTA file. I also needed to split the data from the SRA differently:

fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files SRR10971381.sra

I can't get it to finish any other way. If you want to change GCC to 11 without brew or any Docker-type images, you can edit the makefile in Chrysalis (around lines 300-310) and just manually change the GCC version there; it should build fine after that.

Just to add: reducing the output options on remote BLAST also made it run quicker and more reliably, so I keep those to a minimum.

Expand full comment
Mar 28, 2023·edited Mar 28, 2023

I tried aligning 194 sequences of SARS 1:

curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1eq9q_6jPbjvfvKnpAWDkbIyjgky7cjv5'

seqkit grep -nrp SARS\ coronavirus sarslike.fa|mafft --thread 4 ->sars1.aln

seqkit subseq -r-100:-1 -w0 sars1.aln|awk '{getline x;print x,$0}'

There was only one sequence which didn't have any gaps at the end, but the number of gaps at the end ranged from 8 to 242 in the other sequences, where in about half of the cases the only difference was the length of the poly(A) tail.

AY461660.1 (SARS coronavirus SoD) had a 24-base segment at the end which was not shared by any other sequence in my alignment (cacattagggctcttccatatagg). However when I ran `seqkit seq -g sars1.aln|seqkit locate -p cacattagggctcttccatatagg`, I found that the 24-base segment was the reverse complement of another 24-base segment near the end of the genome which started 63 bases earlier.

Another assembly error is that the beginning of AY559083.1 (SARS coronavirus Sin3408) includes two copies of a 29-base segment (ggaaaagccaaccaacctcgatctcttgt), where one copy starts at position 22 and another copy starts at position 57: `seqkit grep -nrp AY559083.1 sarslike.fa|seqkit seq -g|seqkit locate -p ggaaaagccaaccaacctcgatctcttgt`. `seqkit locate` doesn't ignore gaps so I used `seqkit seq -g` to remove gaps, but another way to remove gaps is `sed '/^>/!s/-//g'|grep .`.

Expand full comment

Hello Mongol, could you do me a favour: if you have these pasted somewhere, could you put each command line with a brief description of what it does and mail it to me at pdevine999@gmail.com?

Some of these nuggets can take hours of reading before you stumble across what you're looking for; even simple ones took me hours or days to find, like sorting a FASTA file by length or selecting the longest 50 contigs. Really simple when you know, but they are like needles in big haystacks 👍 I have documented nearly all of the ones I have been using, but I don't think you can ever get enough.

Expand full comment
Mar 28, 2023·edited Mar 28, 2023

I'm posting new code here as I learn it: https://anthrogenica.com/showthread.php?27891-Shell-and-R-scripts-for-genetic-analysis-of-SARS-like-viruses.

Seqkit is really convenient: https://bioinf.shenwei.me/seqkit/. For example this selects the 50 longest contigs: `seqkit sort -lr input.fa|seqkit head -n50`. Using basic shell utilities, you'd need to do something like `sed ':1;N;/>/!s/\n//;t1;P;D' input.fa|paste - -|awk -F\\t '{print length($2),$0}'|sort -rn|head -n50|cut -d' ' -f2-|tr \\t \\n`.

Expand full comment

Superb, I have been using seqkit, just not in this much depth 🤣

These will help me massively, especially when running lengthy blastn searches through the CLI, rather than manually fishing out the data I need (which is what I have been doing previously).

There were alternatives with samtools and faidx, but you can burn through days working out how to get these commands assembled (even with the --help).

Expand full comment