Databanks setup
How to prepare Databanks for phenyx (Phenyx version up to 2.5)
NCBI databanks
To prepare databanks nr, est_others, est_human etc. from NCBI, you must first transform them into fasta (with enriched headers) then call the prepare scripts.
nr
- Remove the /var/phenyx/databases/NCBInr directory, or copy it elsewhere for your archives
- Mirror the latest fasta & gi_ taxonomy file in a temporary directory
cd /tmp
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz
gunzip nr.gz
gunzip gi_taxid_prot.dmp.gz
- Convert the original NCBI fasta into a phenyx fasta with the CPAN InSilicoSpectro::Databanks scripts (installed in /usr/bin or /usr/share/phenyx/perl)
ncbinr2phenyxfasta.pl --in=/path/to/nr --taxo=gi_taxid_prot.dmp --out=/tmp/NCBInr.fasta
- Check that the fasta headers are correctly composed
head NCBInr.fasta; tail NCBInr.fasta
- Launch the preparation process
- Note: variable yyyymmdd is the date you want to see in the GUI as version name/number
prepareDb.pl --dbname=NCBInr --src=NCBInr.fasta --seqtype=AA --dbrelease=yyyymmdd --verbose
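If the release should simply carry today's date, the yyyymmdd value can be generated instead of typed. A minimal sketch, assuming a POSIX shell with `date`; the variable name `dbrelease` is only illustrative, and the command is printed rather than executed so it can be reviewed first:

```shell
# Build today's date in yyyymmdd form for --dbrelease.
dbrelease=$(date +%Y%m%d)

# Print the prepareDb.pl call from the step above with the date filled in.
echo "prepareDb.pl --dbname=NCBInr --src=NCBInr.fasta --seqtype=AA --dbrelease=$dbrelease --verbose"
```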
nr without taxonomy
In case you cannot retrieve the taxonomy file, you can transform your NCBI-derived fasta file directly into a phenyx-ready fasta (to be passed to prepareDb.pl). Note that the -i switch makes perl edit yourfile.fasta in place:
perl -pi -e 's/^>((?:\w+)\|(?:(\d+)\w*))\|(\S+)\|\s*(.*)/>$1 \\ID=$3 \\DE=$4/' yourfile.fasta
EST
- mirror the latest fasta & gi_ taxonomy file in a temporary directory
wget -O est_human.gz ftp://ftp.ncbi.nih.gov/blast/db/FASTA/est_human.gz
wget -O gi_taxid_nucl.dmp.gz ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
- uncompress the files
gunzip est_human.gz
gunzip gi_taxid_nucl.dmp.gz
- convert the NCBI fasta into a phenyx fasta with the CPAN InSilicoSpectro::Databanks scripts (installed in /usr/bin or /usr/share/phenyx/perl)
ncbinr2phenyxfasta.pl --in=/path/to/est_human --taxo=gi_taxid_nucl.dmp --out=/tmp/est_human.fasta
- check that the fasta headers are correctly composed
head est_human.fasta; tail est_human.fasta
- launch the preparation process
prepareDb.pl --dbname=est_human --src=est_human.fasta --seqtype=AA --dbrelease=yyyymmdd
Decoyed sequences data banks
The goal is to produce decoyed data banks based on an original bank (.fasta or .dat). The decoy can be reverse, hmm, shuffled, shuffled without reproducing any peptide that was in the original bank, etc.
You should install the perl module InSilicoSpectro::Databanks from CPAN.
The script to decoy data banks is fasta-decoy.pl, documentation can be found here
Examples
Reversing a fasta file
A fasta file of the same size is created, each sequence is simply reversed. AC is prepended with REV_
fasta-decoy.pl --in=/path/to/orig.fasta --out=/tmp/a.fasta --method=reverse
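To make the effect of the reverse method concrete, the transformation can be imitated with awk. This is an illustration under assumptions, not the fasta-decoy.pl implementation: the function name `reverse_fasta` is hypothetical, REV_ is simply put in front of the whole header text, and each reversed sequence is written on one unwrapped line.

```shell
# Reverse every sequence of a fasta stream and prefix each header with REV_.
reverse_fasta() {
  awk '
    function emit(   i, rev) {
      if (header == "") return
      rev = ""
      for (i = length(seq); i >= 1; i--) rev = rev substr(seq, i, 1)
      print header
      print rev
    }
    /^>/ { emit(); header = ">REV_" substr($0, 2); seq = ""; next }
    { seq = seq $0 }          # join wrapped sequence lines before reversing
    END { emit() }
  '
}

# Made-up two-line entry: the sequence MKTAR is printed as RATKM under >REV_P1.
printf '>P1 example\nMKT\nAR\n' | reverse_fasta
```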
Shuffling a fasta file
A fasta file of the same size is created, each sequence is shuffled, AC is prepended with SHFL_
fasta-decoy.pl --in=/path/to/orig.fasta --out=/tmp/a.fasta --method=shuffle
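What shuffling means for a single sequence can be sketched as a Fisher-Yates permutation of its residues: length and amino acid composition are kept, only the order changes. The function name `shuffle_seq` and its optional seed argument are hypothetical, and the random generator actually used by fasta-decoy.pl is not specified here.

```shell
# Shuffle the residues of one sequence; an optional second argument seeds the generator.
shuffle_seq() {
  printf '%s\n' "$1" | awk -v seed="${2:-$(date +%s)}" '
    BEGIN { srand(seed) }
    {
      n = length($0)
      for (i = 1; i <= n; i++) res[i] = substr($0, i, 1)
      for (i = n; i > 1; i--) {        # swap position i with a random position j <= i
        j = int(rand() * i) + 1
        tmp = res[i]; res[i] = res[j]; res[j] = tmp
      }
      out = ""
      for (i = 1; i <= n; i++) out = out res[i]
      print out
    }'
}

# Prints a permutation of the ten residues; the exact order depends on the awk in use.
shuffle_seq MKTAYIAKQR 42
```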
Shuffling, but no enzymatic peptide (of size>=6) must be found in the decoyed sequences
All the sequences are first read to build a dictionary of all enzymatic peptides (default is trypsin, but a regular expression can be passed to specify any enzyme). Then each sequence is shuffled, taking care not to reproduce any of these peptides
fasta-decoy.pl --method=shuffle --shuffle-reshufflecleavedpeptides-crc=32 --shuffle-reshufflecleavedpeptides --in=/path/to/orig.fasta --out=/tmp/a-shuffled.fasta
NB typical time to revert uniprot_trembl is approx 1.5h on a decent computer.
NB to manage memory concerns with large banks, a parameter can be passed to give the dictionary a fixed memory size
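The dictionary idea can be sketched in a few lines of shell. This is only an illustration of the principle, not the script's implementation: the function names are hypothetical, the cleavage rule is assumed to be the usual trypsin one (cut after K or R unless followed by P), and the dictionary is a plain text file rather than a fixed-size structure.

```shell
# Print the tryptic peptides of a sequence that have at least min residues (default 6).
# Assumed rule: cleave after K or R unless the next residue is P.
tryptic_peptides() {
  printf '%s\n' "$1" | awk -v min="${2:-6}" '
    {
      pep = ""
      n = length($0)
      for (i = 1; i <= n; i++) {
        c = substr($0, i, 1)
        pep = pep c
        if ((c == "K" || c == "R") && substr($0, i + 1, 1) != "P") {
          if (length(pep) >= min) print pep
          pep = ""
        }
      }
      if (length(pep) >= min) print pep
    }'
}

# Succeed if a candidate decoy sequence ($2) yields a peptide listed in the dictionary file ($1).
has_known_peptide() {
  tryptic_peptides "$2" | grep -qxFf "$1"
}

# Dictionary built from one made-up original sequence: TAYIAK and QRPLDEVSGK.
dict=$(mktemp)
tryptic_peptides MKTAYIAKQRPLDEVSGK > "$dict"

# A candidate that still contains TAYIAK has to be shuffled again.
if has_known_peptide "$dict" GKTAYIAKGG; then
  echo "candidate reproduces an original peptide: reshuffle"
fi
```

In a full run the dictionary would be filled from every sequence of the original bank before any shuffling starts.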
Decoying a .dat file
A uniprot .dat file can be used as input: derived sequences (splice variants, chains, propeptides...) are written to a fasta file prior to decoying. Documentation can be found here
uniprotdat2fasta.pl --in=/path/to/orig.dat --out=/path/to/out.fasta
NB of course all these commands can be piped into one another; default input and output are STDIN and STDOUT
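Since both scripts default to STDIN and STDOUT, the conversion and the decoying could be chained without an intermediate file. A hedged sketch: the wrapper name `dat_to_reversed_fasta` and the two variables are hypothetical, and it assumes that leaving out --in and --out is enough to make each script read standard input and write standard output, as the note above states.

```shell
# Script names are overridable so the pipeline can point at another install path.
UNIPROTDAT2FASTA="${UNIPROTDAT2FASTA:-uniprotdat2fasta.pl}"
FASTA_DECOY="${FASTA_DECOY:-fasta-decoy.pl}"

# Convert a .dat file to fasta and reverse it in one pass.
# Usage: dat_to_reversed_fasta /path/to/orig.dat /tmp/decoy.fasta
dat_to_reversed_fasta() {
  "$UNIPROTDAT2FASTA" < "$1" | "$FASTA_DECOY" --method=reverse > "$2"
}
```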
Decoying a Phenyx installed databank
Suppose you have an installed databank ipi.RAT and want to build a decoyed version of it, shuffling all entries, and name it decoy.ipi.RAT
fasta-decoy.pl --in=/var/phenyx/databases/ipi.RAT/ipi.RAT.fasta --out=/tmp/a.fasta --method=shuffle
/usr/share/phenyx/perl/database/prepareDb.pl --src=/tmp/a.fasta --dbname=decoy.ipi.RAT
rm /tmp/a.fasta
