75 Comments

No virus has ever been proven to exist directly in the fluids of a human. Are you just learning this now?


www.VirusTruth.NET is the best simple site. I'm printing mini flyers called Lucky from the site and taping them to gas pumps and putting them on cars. It's easy to print 10 per color sheet. We need activists to get this info out into communities.


Thanks, Ben, I listened to the Twitter Space months ago. I'm not very familiar with the details, but what I remember is that there were ~30,000 contigs built out of ~350,000 possible. The computer generated many, and they took the longest version. Is that correct? Shame on the commenters in that space. Please keep going!


I've sent it over, my friend.

https://github.com/trinityrnaseq/trinityrnaseq/releases/tag/Trinity-v2.5.1

The makefiles in this version of Trinity point to an old compiler, so it often fails to build from source, but I've put up instructions on how to modify it so it will build just fine.

You will just need to check your

$g++ --version

and make sure you edit the correct one

This version of Trinity was the one used by Fan Wu et al.

If you download the current package it will probably be 2.13.2, which gives completely different results,

but I am running it with

./Trinity --seqType fq --max_memory 40G --left SRR10971381_1.fastq --right SRR10971381_2.fastq --CPU 8 --output Trinity2-5.1

from within the Trinity dir (hence the ./).

Without that it will look at $PATH and use the latest packaged version.

The older version of MEGAHIT also needs a few tweaks to get running, which I will add to the docs as well.


Also, I forgot to add: to get Trinity 2.5.1 to run, I had to split the files differently (when running with the command alone, it did actually error and tell me that the files have to be split with this command):

$fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files SRR10971381.sra

It would be nice to see what your output is from this version of Trinity, as I've run it 4-5 times with the same commands and the results are not consistent. All of the others I've run give the same results every time. The Trinity.fasta file also splits contigs into nodes (other possible sections that might belong to other contigs), so even if you view it through Bandage, this only shows nodes and not contigs, and you're forced into mapping the reads to the contigs in the Trinity.fasta file to view it properly (unless you like scrolling through a 70 MB text file). There is also no log file which gives you an easy summary of contigs/lengths like MEGAHIT or SPAdes does. I will add this to the doc, because it's not repeatable.
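For anyone who wants a quick substitute for that missing summary, a one-liner along these lines should work (a sketch; it assumes the assembly landed in Trinity2-5.1/Trinity.fasta):

awk '/^>/{if(n)print len,n;n=$1;len=0;next}{len+=length($0)}END{print len,n}' Trinity2-5.1/Trinity.fasta|sort -nr|head # contig lengths and names, longest first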

Jun 11, 2023·edited Jun 11, 2023

Paul, are you saying that you reproduced the result with the method used by the Fan Wu team?

Mar 26, 2023 · Liked by Ben

It seems to be common for viral genomes to be missing a short piece from the beginning or end of the genome. When you're calculating a distance matrix between a set of aligned sequences, one strategy to avoid overestimating the distance between one sequence that is missing a part from the end and another sequence that is not is to simply remove the last few bases from the aligned set of sequences. For example in a paper where they produced an aligned set of genomes of SARS 1, they wrote: "For network analysis, an 81-bp block at the 5'-end including gaps and a 77-bp block at the 3'-end including gaps and the poly-A tails were trimmed out of the alignment, and the final alignment contains 30,169 nucleotides." (https://www.sciencedirect.com/science/article/pii/S2590053621001075)

I tried aligning ten random genomes of SARS 1:

curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1eq9q_6jPbjvfvKnpAWDkbIyjgky7cjv5'

brew install mafft seqkit coreutils;seqkit seq -w0 sarslike.fa|paste - -|grep SARS\ coronavirus|gshuf --random-source=<(seq 1000)|head -n10|tr \\t \\n|mafft ->random.fa;mafft --clustalout random.fa # --clustalout uses a human-readable output format

Only three out of ten sequences included the poly(A) tail, but its length was really short in all of them: 4 bases in the first sequence and 2 bases in the second and third sequences. The sequence with the 4-base poly-A tail was the only sequence which didn't have any gaps at the end. Two sequences had 2 gaps at the end, three sequences had 4 gaps, two sequences had 24 gaps, one sequence had 45 gaps, and one sequence had 47 gaps. At the beginning of the aligned set of sequences, six sequences had no gaps, one sequence had 20 gaps, one sequence had 32 gaps, and two sequences had 40 gaps.
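If anyone wants to reproduce those gap counts, here is a minimal sketch (assuming random.fa is the mafft-aligned file produced above):

seqkit seq -w0 random.fa|paste - -|awk -F'\t' '{s=$2;L=length(s);a=0;while(substr(s,a+1,1)=="-")a++;b=0;while(substr(s,L-b,1)=="-")b++;print $1,a,b}' # name, leading gaps, trailing gaps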

When I tried calculating a Hamming distance matrix between the aligned set of ten sequences, the maximum distance was 36 when I removed the last 47 and the first 40 bases of the alignment, but the maximum distance was 120 when I didn't remove any bases:

seqkit subseq -r 41:-48 random.fa>random.trim;Rscript -e 'library(Biostrings);max(stringDist(readAAStringSet("random.trim"),"hamming"))'

In order to not overestimate the distance between one sequence that is missing a piece at the end and another one that is not, another method is to only count nucleotide changes but to not count insertions or deletions (i.e. positions where one sequence has a gap and another sequence doesn't), like what's done by the snp-dists utility: `brew install brewsci/bio/snp-dists;snp-dists random.fa`.

I made the sarslike.fa file by doing a BLAST search for several SARS-like viruses, downloading a FASTA file for the 100 best matches of each virus from BLAST, and then concatenating the FASTA files and aligning the concatenated file with MAFFT: https://anthrogenica.com/showthread.php?27891-Shell-and-R-scripts-for-genetic-analysis-of-SARS-like-viruses.

author

Thank you. Fantastic thread you put together there on anthrogenica.

What stands out though is that, after all, neither the fully claimed genome nor the longest contig described by Wu et al. as being 30,474 nt can be reproduced as of this date.

Mar 16, 2023 · edited Mar 17, 2023 · Liked by Ben

I asked Kevin McKernan why your MEGAHIT script didn't reproduce the genome of Wuhan-Hu-1, but he replied: "He didn’t trim the reads and as a result had a different length poly A tail." (https://anandamide.substack.com/p/failure-of-the-linearization-reaction/comment/13626862) However when I tried aligning the reverse complement of your longest contiguous sequence with the third version of Wuhan-Hu-1, your sequence also had 69 missing bases before the poly(A) tail:

wget https://raw.githubusercontent.com/USMortality/Megahit-SARS-CoV-2/master/out/final.contigs.fa

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=MN908947.3&rettype=fasta'>sars2.fa

cat sars2.fa <(echo \>longest_contig_reverse_complement;awk 'length>max{max=length;o=$0}END{print o}' final.contigs.fa|sed y/ATCG/TAGC/|rev)>temp.fa

brew install clustal-w;clustalw temp.fa

Anyway, I now tried reproducing your pipeline but I also trimmed the reads using Trimmomatic:

brew install megahit -s # -s compiles from source

brew install trimmomatic sratoolkit

vdb-config

fasterq-dump SRR10971381

trimmomatic PE SRR10971381_{1,2}.fastq {1,2}{,un}paired.fastq.gz AVGQUAL:20 HEADCROP:12 LEADING:3 TRAILING:3 MINLEN:75 -threads 4

megahit -1 1paired.fastq.gz -2 2paired.fastq.gz -o out

# megahit -1 SRR10971381_1.fastq -2 SRR10971381_2.fastq -o out2 # try running MEGAHIT with no trimming

The parameters I used with Trimmomatic are probably not optimal, but I just copied them from this article: https://research.csc.fi/metagenomics_quality.

At first I tried to use fastq-dump instead of fasterq-dump, but it only created a single file called SRR10971381.fastq instead of splitting the reads into two files called SRR10971381_1.fastq and SRR10971381_2.fastq.

I used `brew install megahit -s` to compile MEGAHIT from source, because without the `-s` option I got the error `dyld: Library not loaded: /usr/local/opt/gcc/lib/gcc/9/libgomp.1.dylib` (because I had GCC 11 and not 9).

When I didn't use Trimmomatic, my longest contiguous sequence was identical to your results, so it was 29802 bases long and it was missing the last 102 bases of the third version of Wuhan-Hu-1 and it had one extra base at the beginning. When I did use Trimmomatic, my longest contiguous sequence was now 29875 bases long, it was missing the last 30 bases from Wuhan-Hu-1 (so the poly(A) tail was only 3 instead of 33 bases long), and it still had one extra base at the beginning. I probably used the wrong parameters with Trimmomatic though, or I'm still missing some steps in the pipeline.

BTW you wrote: "I have run Megahit 1.1.3 and 1.2.9, with the latter producing a significant shorter contig with only 28,802bp". However based on the final.contigs.fa and output.txt files of your MEGAHIT 1.2.9 run, actually your longest contig was 29,802 bases long : https://github.com/USMortality/Megahit-SARS-CoV-2/tree/master/out.

author
Mar 25, 2023 · edited Mar 25, 2023

Interesting though how MEGAHIT 1.2.9 is able to produce 100% of Wuhan-Hu-1 except the tail, right? So they must've updated MEGAHIT to do the trimming..?

Mar 26, 2023 · Liked by Ben

Well the sequences I got with MEGAHIT 1.2.9 were not just missing the tail but they also had one extra base at the beginning.

BTW another paper where they did de-novo assembly for an early sample of SARS 2 was this paper Ren et al.: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7147275/. They wrote: "Quality control processes included removal of low-complexity reads by bbduk (entropy = 0.7, entropy-window = 50, entropy k = 5; version: January 25, 2018),[11] adapter trimming, low-quality reads removal, short reads removal by Trimmomatic (adapter: TruSeq3-SE.fa:2:30:6, LEADING: 3, TRAILING: 3, SLIDING WINDOW: 4:10, MINLEN: 70, version: 0.36),[12] host removal by bmtagger (using human genome GRCh38 and yh-specific sequences as reference),[13] and ribosomal reads removal by SortMeRNA (version: 2.1b).[14]" The reason why the first version of Wuhan-Hu-1 included the 618-base segment of human DNA at the end may have been that they didn't remove human reads before they ran MEGAHIT, or at least they didn't mention removing human reads in the methods section of the Wu et al. paper.

Compared to the third version of Wuhan-Hu-1 at GenBank, the EPI_ISL_402123/IPBCAMS-WH-01 sequence from Ren et al. is missing the last 4 bases of the poly(A) tail and it has 3 nucleotide changes in the region of the ORF1a protein:

$ curl https://download.cncb.ac.cn/gwh/Viruses/Severe_acute_respiratory_syndrome_coronavirus_2_IPBCAMS-WH-01_GWHABKF00000000/GWHABKF00000000.genome.fasta.gz|gzip -dc>liliren

$ curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.3'>wuhanhu1

$ brew install mafft;cat liliren wuhanhu1|mafft --clustalout - # show aligned sequences in a human-readable format

$ cat liliren wuhanhu1|mafft -|awk '{printf(/^>/?"\n%s\n":"%s"),$0}'|grep .>temp.aln;paste <(sed -n 2p temp.aln|grep -o .) <(sed -n 4p temp.aln|grep -o .)|awk '$1!=$2{print NR,$1,$2}'

3778 g a

8388 g a

8987 a t

29900 - a

29901 - a

29902 - a

29903 - a

author

In the anthrogenica article you mentioned the SARS1 assembly. Have you tried running megahit on the raw reads? `ERR1091914`

Mar 26, 2023 · edited Mar 26, 2023 · Liked by Ben

I have now. It only took about a minute to run. The longest contig was only 11889 bases long but its best match in my local BLAST database was "FJ882938.1 SARS coronavirus wtic-MB", with 99.7% identity:

fasterq-dump ERR1091914

megahit -r ERR1091914.fastq -o megafrench # -r reads a single-ended file

awk '{print length,$0}' megafrench/final.contigs.fa|sort -n|tail -n1

curl -Lso sarslike.fa 'https://drive.google.com/uc?export=download&id=1eq9q_6jPbjvfvKnpAWDkbIyjgky7cjv5'

makeblastdb -in sarslike.fa -dbtype nucl;awk 'length>max{max=length;o=$0}END{print">a\n"o}' megafrench/final.contigs.fa|blastn -db sarslike.fa

I got the identical longest contig when I tried using trim_galore: `trim_galore ERR1091914.fastq;megahit -r ERR1091914_trimmed.fastq`.

In the analysis tab of the NCBI's sequence read archive, there's a breakdown of what percentage of raw reads are estimated to be derived from different organisms. For the French influenza A sample ERR1091914, 0.32% of the reads were classified under Nidovirales which is the parent clade of Coronaviridae: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=ERR1091914&display=analysis. I wonder if there's a way to find other samples which are contaminated by coronaviruses, because maybe you could even find samples from before 2020 which were contaminated with SARS 2.

author
Mar 26, 2023 · edited Mar 26, 2023

Great, but this simply shows that there are huge issues with these assembly approaches, IMO. The SARS1 genome is 29,687 bp long, so why do you only get a match for 1/3 of it?

Second, if there was indeed SARS1 after 2004, then this would mean that SARS1 could have still been around, which would also mean that SARS2 could simply be a natural evolution/mutation of SARS1?!

Mar 27, 2023·edited Mar 27, 2023

SARS 1 was only a contaminant in the influenza A samples, so maybe there wasn't enough genetic material for MEGAHIT to assemble the complete sequence in one contiguous piece. The usual way to recover a genome in that situation is reference-based: use something like `bwa mem` to align the raw reads against a reference genome, like what McKernan did in this Twitter thread: https://twitter.com/Kevin_McKernan/status/1472265137167994883.
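For reference, a minimal mapping sketch of that kind (assumed file names; sars2.fa is the MN908947.3 FASTA fetched earlier, and the consensus step needs samtools 1.16 or newer):

bwa index sars2.fa

bwa mem -t 8 sars2.fa SRR10971381_1.fastq SRR10971381_2.fastq|samtools sort -o mapped.bam - # align read pairs to the reference and sort by position

samtools consensus -o consensus.fa mapped.bam # call a consensus sequence from the mapped reads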

From the tooltips at NextStrain, you can see that current strains of SARS 2 have around 80-100 nucleotide changes from Wuhan-Hu-1 plus around 50-60 gaps: https://nextstrain.org/ncov/gisaid/global/6m. So that translates to about 30 nucleotide changes per year if you don't count gaps.

However Wuhan-Hu-1 has 5941 nucleotide changes from the reference genome of SARS 1 (if you don't count positions where either sequence has a gap like what's done by snp-dists): `curl -s 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id='{NC_004718,NC_045512}|mafft -|snp-dists /dev/stdin`. At a rate of 30 changes per year, 6000 nucleotide changes would take 200 years.
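As a rough check of that arithmetic (assuming roughly 3 years between Wuhan-Hu-1 and the current strains shown on NextStrain):

awk 'BEGIN{rate=90/3;print rate" changes per year";print 5941/rate" years for 5941 changes"}' # ~30 changes/year implies roughly 200 years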

BTW I think it's weird that the SARS 1 samples that were claimed to be taken from civets have only around 50-100 nucleotide changes from Tor2 (which is the reference genome of SARS 1 which is similar to other early human samples of SARS 1). It was even claimed that SARS 1 jumped from the civets to humans and not vice versa (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1212604/).

However the closest relative of SARS 2 at GenBank is BANAL-52 which has 905 nucleotide changes from Wuhan-Hu-1, which is about 10-16 times more than the distance between Tor2 and the civet SARS 1 samples at GenBank (which have between 57 and 95 nucleotide changes from Tor2 if you don't count gaps). However sequencing technology is now a lot more advanced than 20 years ago, and people have spent a lot more resources searching for zoonotic sources of SARS 2 than they spent searching for sources of SARS 1. So why was it so easy to find SARS 1 samples in civets which were almost identical to the human strains of SARS 1?

Oct 4, 2022 · Liked by Ben

Interesting work. Many people have called out the lack of “existence” for a long time now. Seems this reverse engineering may be another proof point?

Feb 7, 2023·edited Feb 7, 2023

Zika/ I am curious as to the length of Zika or AIDS sequences found in the Wuhan dataset. Are we talking about a few reads with something like 20nt segments, or contiguous sequences of 100s or 1000s nt?


It's a pity you can't put pictures up here. I've got a few length distribution charts here; HIV LC312715.1 is 8819 bp long and is made up of reads 20-130 bp long, with most of them being between 40-50 bp. I think I will definitely knock something up and document this.


OK, an 8819nt Zika or HIV contig generated by an assembly program from this dataset is worrisome. Either the clinical sample was from somebody that was very, very sick, the sequencing lab (or hospital) was contaminated (e.g., aerosols or dripping pipets), the sequencing machine was running multiplexed samples whose barcodes weren't properly sorted, or the assembly program sucks or isn't run with a high overlap stringency.

Otherwise, if we are talking about 20nt segments, the human genome should contain multiple matches, so Zika wasn't really present, and I wouldn't care. There are 1E12 possible 20nt segments, the haploid human genome is 3E9, so the odds of finding a single specific Zika 20nt segment in the human genome are about 1 per 300. There are about 2E4 Zika 20nt segments, so I expect about 60 potential matches with the human genome. I am assuming exact matches, that each 20nt Zika segment (except the first one) is generated with a sliding window so that each overlaps 19nt with the previous one, and that the DNA is close enough to random, with each nucleotide at 25% abundance.
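The same back-of-the-envelope numbers as a one-liner, under the same assumptions (exact matches, random sequence, ~2E4 Zika 20nt segments):

awk 'BEGIN{p=3e9/4^20;print "chance a given 20nt segment appears in the genome: "p" (about 1 in "int(1/p)")";print "expected Zika matches: "2e4*p}'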


A few corrections: it was from alignment, not assembly. It won't assemble these complete sequences as they are not 100% complete, so this is just mapping reads as it's done in industry; the contigs will be much shorter and filled in with shorter pieces that fit. If the paper was correct the guy was definitely sick, yes, but for all of these reference genomes they never found a virus, and they were all made this way (in silico). DNA is one thing but RNA is another. The central dogma is that we have DNA, DNA transcribes to RNA, RNA translates to protein, and there is reverse transcription from RNA to DNA. If we have 20k-25k genes and 80k-400k proteins, that's an awful lot of RNA that's not been accounted for and won't be in the genes.

You are correct though that Zika, HIV, Ebola and Hep were not present, but there was enough genetic material present to create the genome had they looked for it, and then we could be talking about a Zika/HIV pandemic; these matches are certainly higher than the 89% match they got for SARS-1 when declaring it some kind of new SARS virus. So it really depends on what they want to find; it's like RNA fishing with a really fast computer. I would say these are probably human sequences; some may be associated with illness, they are probably in minute quantity, not fixed, always changing, and most probably never get translated, which is why it's easy to create the appearance of an RNA virus pandemic if you crank up the PCR. The majority of positive PCR tests have no symptoms whatsoever, so they can't be the cause of any disease. I knew there would be hints that these samples would have to be from someone who had all of these viruses, which is the exact reason I looked for MN908947.1/.2/.3 in older datasets (which haven't been subject to favourable primers with dirty PCR), and yes, you can find some.

If this RNA was declared to be the cause, the output of the PCR should have been enough to conduct causation studies, but they insist on injecting some of these sequences mixed with foreign cell culture material (Vero cells/antibiotics/antifungals/feces/brains/lung cancer) that can make you sick whether or not they add a human sample to it. If they want to show a respiratory virus, they inject a cup of it into an animal's lungs; if they want to show something like polio, they inject it into a newborn mouse's brain, etc., all with no valid control experiments.

I am about 40% through my document and I will try to detail as much as possible what you have looked at. You will probably read it and say this is a million miles away from science, that you can't just stitch sequences together on a computer and declare it to be a new virus, and that the methodology is wrong; I would definitely agree with this.

I have only seen one dataset actually assemble a sequence. Apparently SARS-CoV-2 SRR10971381 was the first in history where the genome was created in the assembly step; every other dataset needs alignment.

We also see claims of multiple illnesses when different antibody tests register for all kinds of things, and rather than admit they are not specific, we see claims that the person has simply had all of these conditions. If you have any of this software set up, I wouldn't mind a few people proofreading it. I have an editor at work for technical documentation and he always has quite a bit to do to make them presentable.

Send me an email on pdevine999@gmail.com


"....you can't just stitch sequences together.....". I'm sending you a private reply.


While I'm doing the math: this dataset has about 6E7 reads or 9E9 nucleotides. Contamination by a single human cell could yield 6E9 nucleotides (diploid genome size) even before PCR. Now that I think about it, this was perhaps a damn clean sample, and a single contaminating human cell could still swamp the dataset. There were not enough human reads to recover the human genome; there would be too many gaps.


Perhaps 200,000 virions per human contaminant, given a coronavirus genomic size of 3E4


Ballpark figures. I'm ignoring the bacterial contamination for now.


PacBio 1 of 4, Intro/ First, I want to explore the Pacific Biosciences platform (and their Covid kit) in detail and how it confirms the validity of the Wuhan assembly, and second, use the PacBio literature as an example for my assertion that a major problem with assembly reproducibility deals with the very beginning and terminus of the virus but not its center.

PacBio has a specific kit for Covid: https://www.pacb.com/research-focus/microbiology/public-health/covid-19-sequencing-tools-and-resources/

For details: https://www.pacb.com/wp-content/uploads/Application-Note-HiFiViral-SARS-CoV-2-kit-enables-genomic-surveillance-of-COVID-19.pdf

Three YouTube videos: https://www.youtube.com/watch?v=U5fd3l56dqE , https://www.youtube.com/watch?v=VdXxvOF8QrU, and an older version of the kit with longer reads https://www.youtube.com/watch?v=S7dt0i9z820

For an application to Omicron, which I will discuss in detail later: https://www.pacb.com/blog/omicron-and-beyond-hifiviral-kit-provides-labs-with-a-future-proof-solution-for-emerging-covid-19-variants/

A collection of raw Omicron reads if you want to compare assembly programs: https://downloads.pacbcloud.com/public/dataset/HiFiViral/Jan_2022/

The kit uses the Wuhan sequence (or one of its revisions) as the reference (i.e., template) for the very design of the kit and for the assembly program (they normally don't assemble from scratch, but they can substitute another assembly template besides Wuhan). I assert that if the assembly of Wuhan were way off, the design of the kit would have failed, and generation of consensus assemblies for new variants would also fail. The very success of this kit attests to the tolerable accuracy of the Wuhan assembly. The kit and software are mismatch (mutation) tolerant.


PacBio 2 of 4, Details/ Using the Wuhan sequence, they design 1000 primer sets (they use the word "MIPs") that generate 1000 unique circular DNAs, each with a unique 675nt viral insert. These 1000 primer sets are designed to start every 30nt along the length of the virus. So a segment of viral DNA should usually be covered by 22 unique types of read, and they recommend at least 4 copies of each type of read (i.e., at least 4000 reads in total, but usually more). This scheme is resistant to mutations: if a mutation interferes with one primer set, that specific read would be lost, but a given segment of the virus would still be covered by 21 types of read, rather than the normal 22.
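A toy model of that tiling (parameters as described above: 675nt inserts starting every 30nt) shows per-position coverage ramping up from position 1 and levelling off at roughly 22-23 insert types in the interior:

awk 'BEGIN{ins=675;step=30;for(i=0;i<1000;i++)start[i]=1+i*step;for(p=1;p<=1501;p+=100){c=0;for(i=0;i<1000;i++){s=start[i];if(p>=s&&p<=s+ins-1)c++}print "position "p": covered by "c" insert types"}}'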


Edit: Clarification: "...mutation interferes with one primer set..." refers to a primer failing to anneal if the mutation is within the complementary (ie., target) sequence of that specific primer.


Edit: Clarification: "Unique inserts" means "Unique but Overlapping inserts"


PacBio 3 of 4, Assembly/ I previously asserted that the assembly program would have better luck with the center versus the very beginning and end of the virus. This can be seen in the diagrams in the PacBio literature. The very beginning, positions 1-675, would be covered by one primer set; a second primer set would cover positions 30-705, so that positions 30-675 are covered by two primer sets. Positions 60-735 are covered by the third primer set, so that positions 60-675 are now covered by three. With additional slides, the amount of redundancy quickly ramps up. This means that reads from the very beginning and terminus are relatively rare compared to the middle.

A real-world example is given on their Omicron page. 33 samples were taken. 4 were of poor quality. Of the good 29, only 11 recovered the entire length, while the other 18 were missing just a little from the beginning and/or terminus.

Part 4 will leave their Covid kit and explore their work on ultra long RNA reads, but I have to read a little more.


This is one of the other problems: most of the time when they upload a new SRA dataset they don't give any clue of which 'variant' they were trying to find or what primers they used for detection. I will take a look at this though.


Human reads may have been masked for ethical reasons, since the SRA's website says: "Human metagenomic studies may contain human sequences and require that the donor provide consent to archive their data in an unprotected database. If you would like to archive human metagenomic sequences in the public SRA database please contact the SRA and we will screen and remove human sequence contaminants from your submission." (https://www.ncbi.nlm.nih.gov/sra/docs/submit/)

The NCBI has a tool called Human Scrubber which is used to mask human reads with N bases in SRA submissions (https://ncbiinsights.ncbi.nlm.nih.gov/2023/02/02/scrubbing-human-sequences-sra-submissions/):

> The Human Read Removal Tool (HRRT; also known as the Human Scrubber) is available on GitHub and DockerHub. The HRRT is based on the SRA Taxonomy Analysis Tool (STAT) that will take as input a fastq file and produce as output a fastq.clean file in which all reads identified as potentially of human origin are masked with 'N'.

> You may also request that NCBI applies the HRRT to all SRA data linked to your submitted BioProject (more information below). When requested, all data previously submitted to the BioProject will be queued for scrubbing, and any future data submitted to the BioProject will be automatically scrubbed at load time.

> This tool can be particularly helpful when a submission could be contaminated with human reads not consented for public display. Clinical pathogen and human metagenome samples are common submission types that benefit from applying the Human Scrubber tool.

The usage message of Human Scrubber also says that human reads are masked with N bases by default: https://github.com/ncbi/sra-human-scrubber/blob/master/scripts/scrub.sh.


I tried to use ClustalW2 to align the genomes of the three different versions of Wuhan-Hu-1: `brew install clustal-w;curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id='MN908947.{1,2,3}>sars.fa;clustalw sars.fa;cat sars.aln`.

The first 27 bases were "CGGTGACGCATACAAAACATTCCCACC" in the first two versions but "-----------ATTAAAGGTTT-----" in the current version, where a dash indicates a deletion, and 6 bases out of the 12-base sequence in the middle were also changed.

Apart from that, the only difference between the three versions was that the last 598 bases of the first version were deleted in the second version, and in the third version a new 44-base sequence was added to the end: "AGGAGAATGACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA". The sequence of multiple A bases at the end is known as a poly(A) tail: https://bioinformatics.stackexchange.com/questions/11227/why-does-the-sars-cov2-coronavirus-genome-end-in-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. (The accepted answer at Stack Exchange said that "for coronaviruses in particular, we know that the poly(A) tail is required for replication", but I don't know if the tail is absolutely required in all coronaviruses, because it was missing from 9 out of the 55 coronavirus genomes listed on this page: https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=11118.)

For some reason the ORF10 protein was also missing from the first and second versions in GenBank. See this thread: https://twitter.com/BillyBostickson/status/1268514460349546497. Replies to the thread pointed out that the last 20 nucleotides of the second version (which are also included in the first and third versions but not at the very end) are "TGTGATTTTAATAGCTTCTT", which is identical to a segment of human chromosome 1, and the extra 598 nucleotides after it in the first version also match human chromosome 1:

- "...or it is simply a repetitive region and yes "automatic machines" (such as HiSeq 3000 / HiSeq 4000 systems) do make that kind of errors in particular near the 3'- (and also 5'-) ends w/o correction. it always helps taking a closer look" (https://twitter.com/ChrisDeZPhD/status/1290064988892274694)

- "No, not poly(A), rather a contamination from RP11-173E24 from human chromosome 1q31.1-31.3, but fraudulent? I don't think so." (https://twitter.com/ChrisDeZPhD/status/1290201009260654595)

- "Thanks I did read it. Still, looks like an obvious reading error at the 3' end by hybridisation with contaminated sample. can happen on Illumina. There is a 20 bp overlap btw CoV-2 3'-end and a complementary DNA-seq from h-chr 1q31.1-31.3. 99% match in 616 bp from nt 40414-41029." (https://twitter.com/ChrisDeZPhD/status/1290218272705531905)

If you do a BLAST search for the last 618 nucleotides of the first version, it's 99.68% identical to "Human DNA sequence from clone RP11-173E24 on chromosome 1q31.1-31.3, complete sequence". You can simply copy positions 29856 to 30473 from here: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.1. Then paste it here and press the BLAST button: https://blast.ncbi.nlm.nih.gov/Blast.cgi.
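The same check can be run from the command line, as a sketch (it assumes BLAST+ and seqkit are installed; -remote sends the query to NCBI's servers, so it can take a while):

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.1'>wuhanhu1v1.fa

seqkit subseq -r 29856:30473 wuhanhu1v1.fa>tail618.fa # the last 618 bases of the first version

blastn -remote -db nt -query tail618.fa -outfmt '6 stitle pident length'|head # top hits with percent identity and alignment length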

So I think that solves the question of why the first version was so much longer than the current version, but I still don't know why the beginning of the sequence was also changed, or why they added the sequence of 44 bases to the end of the third version that was missing from the second version.

BTW I think your MEGAHIT pipeline is missing steps like doing quality control with FastQC, merging reads with SeqPrep, and trimming reads with Trimmomatic: https://research.csc.fi/metagenomics_quality.


Software & Hardware, 1 of 2/ First, I am curious about whether you (Paul) are downloading the programs and datasets and running them locally, or running them on a remote server. Apparently USM is running them on the Amazon cloud.

It's just my old-fashioned mindset to think of local control. The last time I ran DNA programs, such as FASTA, was circa 1989, on a local 16-bit IBM computer running the "DNA-Star" package. Heck, DNA-Star (or parts of it) was even running on the 8-bit Commodore 64.

Second, I am just amazed at the rapidity of it all, and rapidity (i.e., "acceleration") is the topic of the rest of this thread. I never used BLAST, but I remember reading about it (circa 1990) during the transition from the FASTA era to the current BLAST era. I got the impression that BLAST was precomputed: a "user" didn't start from scratch. BLAST was constantly doing alignments and searches in the background each time someone uploaded a submission. This was certainly true for proteins: alignments of proteins (and their domains) and assignments into families were constantly updated.


I work at a place where we have lots of machines not doing a great deal, so I have lots of decently specced machines running individually, but they are remote to me; the datasets are being run locally and not through any cloud, apart from remote access. I am painstakingly trying to put 3-5 of these machines into a cluster so they can share resources and hopefully get quicker. The BLAST search must have the most insanely specced machine on earth, as every search is doing an alignment to pretty much everything in the database, or it has catalogued every kmer from every dataset, which would still be insane. I was hoping to have gone down the blastp route but haven't managed yet; if the nucleotide is wrong, I don't want to move on with misconceptions.


Software & Hardware, 2 of 2/ In another comment, I suggest that a rigorous assembly program would, for 6E7 reads, require 18E14 read comparisons, or 152100E14 comparisons of individual nucleotides (and this is just for the first step). So I assume that there must be software shortcuts, and a tradeoff between speed and rigor. Some rigor may be retained if there is "acceleration". Considering software acceleration, each nucleotide needs 2 bits, so that two 32nt segments can be aligned by testing the equality of two 64-bit numbers. 32 times faster! But at the expense of losing mismatch tolerance: a single mismatch ruins the equality and a potential alignment is lost. Considering hardware acceleration, perhaps we can have acceleration (without loss of rigor) by using graphics chips instead of CPUs? What is a sequence but an array, and what is an image but an array of arrays? The following wiki article specifically mentions sequence alignments on graphics cards:

https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units

If an assembly program accelerates via the use of graphics cards, is the program excessively hardware dependent? Is it even possible to download and test locally on your own computer? And so I come full circle, and now you know one reason why I bothered to ask if you downloaded the program.


Most of the machines I have access to are headless servers, so they are CPU/RAM oriented, but I do have a few gaming machines. The bit that put me off GPUs is that I have had a lot of these machines crypto mining and it has cooked a few graphics cards and power supplies, but it was far quicker and I would imagine it would bring down processing times a lot. The other pain in the arse is that the main machines I am using are Cisco UCS / Dell PowerEdge, so graphics card compatibility is pretty much non-existent, but it's good to know there is GPU software out there that can do this; I just need to pick up a cheap Nvidia K20.


I think that I may have figured it out. Part 2 of 2. If reads near the ends are less abundant, how does this interact with the software? Rounding off to 6E6 reads, a rigorous program may require 36E12 comparisons. This would take forever, so the program may take shortcuts; if the first comparison is started at random, the result may differ with each run. If reads near the edges are rarer, the shortcuts may miss them.

In the construction of evolutionary trees from protein or DNA sequences, the programs don't do all possible comparisons, but take shortcuts. So multiple trees are generated by running the program several times (maybe hundreds), and the one that gets published may not be "the one true tree", but the tree with the best score, or a consensus of the better scoring trees.


Edit: 6E7 reads, 36E14 comparisons. "Comparison" means "detection of overlap between two reads". Imagine a table of 6E7 rows and columns (each in the same order), and 36E14 cells with an "overlap score". A diagonal from upper left to lower right is self comparison and can be tossed out. The lower left triangle mirrors the upper right triangle, so we actually only need about 18E14 comparisons.
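Worked through as a one-liner (n = 6E7 reads):

awk 'BEGIN{n=6e7;print "full table: "n*n" cells";print "upper triangle only: "n*(n-1)/2" comparisons"}' # ~3.6E15 cells, ~1.8E15 unique read pairs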


I think that I may have figured it out. Part 1 of 2. If you repeat the assembly several times (apparently you did it twice), do you have: 1) agreement in the center? 2) missing sequence at the ends (start and finish)? The PCR reaction probably uses a set of "random" primers, and to capture the ends you have to be lucky that your set of primers contains a match for each end. So it is likely that you will have missing sequence at the very ends. Furthermore, the reads that are closer to the ends (but still captured) are likely to be less abundant than reads closer to the center.


Two people that would be good to verify this:

1. Kevin McKernan on Twitter or https://kevinmckernan.substack.com

2. Flavinkins (= Dayou15, formerly on Twitter) https://gab.com/Flavinkins


McKernan uses cell cultures. How is that proof?

Jan 31, 2023·edited Jan 31, 2023

I am saying that McKernan, given his background, should have greater expertise with the assembly program and should diagnose exactly why it gives different results when run multiple times with the exact same dataset. Your question focuses on the acquisition of a dataset, my comments focus on how the assembly program deals with a dataset after it is acquired. By the way, my comments are out of chronological order, so that I originally hoped that McKernan could verify that the program is not repeatable, not verify my theory ("rare ends" plus "taking shortcuts") as to why. I came up with my theory after posting that McKernan could verify that the program was not repeatable. If McKernan can also verify (or refute) my theory as to why, that would be great.


Luc Montagnier and Jean-Claude Perez, virologist analysts, July 2020. They sequenced isolate="Wuhan-Hu-1".

For quoting from this paper I was permanently banned from Twitter. Perez is on Twitter if you want to talk to him (he speaks French): @JCPEREZCODEX

https://www.researchgate.net/publication/342926066_COVID-19_SARS_and_Bats_Coronaviruses_Genomes_Peculiar_Homologous_RNA_Sequences_Jean_Claude_perez_Luc_Montagnier


I don't think that they independently sequenced it or double-checked the assembly, but just analyzed the published sequence.
