Split fasta file using csplit

I need to split a big fasta file into smaller ones. I am trying the following command:
csplit -z input.fasta '/>/' '{*}'
but it is generating lots of files (one for each ">"). Is there a way to ask it to create only two smaller files?
Thank you

According to its manpage, csplit splits a file at each occurrence of the given pattern, so it is generating one file for each sequence.
If you want to split the file by size, ignoring its content, you might have a look at split with the -C parameter.
Nevertheless, you might not get two valid fasta files, because a sequence block in the middle of your file might be split.
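If both halves must stay valid FASTA, a small script can cut at a record boundary instead. Here is a minimal sketch in Python (the .part1/.part2 output names are made up for the example); it switches output files at the first ">" header past the byte midpoint:

    import os

    def split_fasta_in_two(path):
        size = os.path.getsize(path)
        out = open(path + ".part1", "w")
        switched = False
        written = 0
        with open(path) as src:
            for line in src:
                # Switch files at the first ">" header past the midpoint,
                # so that no sequence record is cut in half.
                if not switched and line.startswith(">") and written >= size / 2:
                    out.close()
                    out = open(path + ".part2", "w")
                    switched = True
                out.write(line)
                written += len(line)
        out.close()

    split_fasta_in_two("input.fasta")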

Related

Find Special Character Sequence in file

I have a file that uses NUL's and SOH as markers. I need to look at the pattern of those special characters in order to parse out what I need. For example, when the file is viewed in Notepad++:
NULNULNUL Boom BoomSOHNULNULNULNULDLENULNULNULSIJohn Lee HookerSOH
I would like to extract the "Boom Boom" and the "John Lee Hooker". Those values will change (these are music files) with each file.
I was thinking of using the "NULNULNUL" pattern to find the first section and the "NULSI" pattern to find the second part.
I tried a FileStream to read in the bytes, but I don't know how to detect the special characters.
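One way to detect those markers is to read the raw bytes and match printable runs that end at the SOH byte (0x01); NUL is 0x00. A rough sketch in Python (the file name is a placeholder, and the layout is assumed to match the sample above; the same byte-scanning idea carries over to reading bytes from a FileStream):

    import re

    with open("track.bin", "rb") as f:  # placeholder file name
        data = f.read()

    # Printable ASCII runs terminated by SOH (0x01); NUL bytes act as fillers.
    fields = [m.decode("ascii").strip() for m in re.findall(rb"([\x20-\x7e]+)\x01", data)]
    print(fields)  # e.g. ['Boom Boom', 'John Lee Hooker']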

How to read a file and write to other file in tcl with replacing values

I have three files: Conf.txt, Temp1.txt and Temp2.txt. I have used a regex to fetch some values from the Conf.txt file. I want to place those values (which appear under the same names in Temp1.txt and Temp2.txt) and create another two files, say Temp1_new.txt and Temp2_new.txt.
For example: in Conf.txt I have a value, say IP1, and the same name appears in Temp1.txt and Temp2.txt. I want to create files Temp1_new.txt and Temp2_new.txt, replacing IP1 with, say, 192.X.X.X in Temp1.txt and Temp2.txt.
I would appreciate it if someone could help me with Tcl code to do the same.
Judging from the information provided, there basically are two ways to do what you want:
File-semantics-aware;
Brute-force.
The first way is to read the source file, parse it to produce a structured in-memory representation of its content, then serialize this representation to the new file after replacing the relevant value(s).
The brute-force method means treating the contents of the source file as plain text (or a series of text strings) and running something like regsub or string replace on this text to produce the new text, which you then save to the new file.
The first way should generally be favoured, especially for complex cases, as it removes any chance of replacing irrelevant bits of text. The brute-force way may be simpler to code (if there's no handy library, see below) and is therefore good for throw-away scripts.
Note that for certain file formats there are ready-made libraries which can be used to automate what you need. For instance, the XSLT facilities of the tdom package can be used to manipulate XML files, INI-style files can be modified using the appropriate library, and so on.
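For illustration, here is the brute-force method sketched in Python (in Tcl the same idea would be a string map or regsub over the file contents; the file names and the IP value are taken from the question):

    replacements = {"IP1": "192.X.X.X"}

    for name in ("Temp1.txt", "Temp2.txt"):
        with open(name) as src:
            text = src.read()
        # Plain text substitution, with no awareness of the file's structure.
        for old, new in replacements.items():
            text = text.replace(old, new)
        # Write the substituted text to Temp1_new.txt / Temp2_new.txt.
        with open(name.replace(".txt", "_new.txt"), "w") as dst:
            dst.write(text)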

Extracting data from a large file with regex

I have a close to 800 MB file which consists of several records (a header followed by content).
Header looks something like this M=013;X=rast;645.jpg while content is binary of the jpg file.
So the file looks something like this
M=013;X=rast;645.jpgNULœDüŠˆ.....M=217;X=rast;113.jpgNULÿñÿÿ&åbÿås....M=217;X=rast;1108.jpgNUL]_ÿ×ÉcË/...
The header can occur in one line or across two lines.
I need to parse this file and basically pop out the several jpg images.
Since this is such a big file, could you please suggest an efficient way to do this? I was hoping to use StreamReader, but I do not have much experience with regular expressions to use with it.
RegEx:
/(M=.+?;X=.+?;.+?\.jpg)(.+?(?=(?1)|$))/gs (uses (?1) recursion, which is not supported in .NET)
.NET RegEx workaround:
/(M=.+?;X=.+?;.+?\.jpg)(.+?(?=M=.+?;X=.+?;.+?\.jpg|$))/gs
This replaces the (?1) recursion group with the contents of the 1st capture group.
Live demo and explanation of the regex: http://regex101.com/r/nQ3pE0/1
You'll want to use the 2nd capture group for the binary contents; the 1st group matches the header, and the expression needs it to know where to stop.
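Outside of .NET, the workaround pattern can be run directly over the raw bytes. A rough Python sketch (it reads the whole file into memory for brevity; the input name is a placeholder, and the NUL byte after the header is assumed from the sample above):

    import re

    # Header, capturing the file name; then the binary body,
    # up to the next header or end of file.
    pattern = re.compile(
        rb"M=.+?;X=.+?;(.+?\.jpg)\x00(.*?)(?=M=.+?;X=.+?;.+?\.jpg|$)",
        re.DOTALL,
    )

    with open("archive.bin", "rb") as f:  # placeholder input name
        data = f.read()

    for name, body in pattern.findall(data):
        with open(name.decode("ascii"), "wb") as out:
            out.write(body)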

how to write the last line of file

I have a file data.txt. data.txt contains text line by line as:
one
two
three
six
Here I need to write data in file as:
one
two
three
four
five
six
I don't know how to write the file like this!
Generally, you have to re-write the file when inserting, because text files have variable-length rows.
There are optimizations you could employ, like extending the file and buffering the writes, but you may have to buffer an arbitrary amount, e.g. when inserting a row at the top.
If we knew more about your complete scenario, we would be better able to help.
Loop through your text file and read the lines into an array. Modify the array and save it back to the file. That's not a good idea for just any text file, but for this particular example it works, no problem.
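As a concrete example of that read-modify-write approach, a minimal Python sketch (assuming the whole file fits in memory, with the file name and values from the question):

    with open("data.txt") as f:
        lines = f.read().splitlines()

    # Insert the missing values before "six".
    pos = lines.index("six")
    lines[pos:pos] = ["four", "five"]

    with open("data.txt", "w") as f:
        f.write("\n".join(lines) + "\n")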

Split CSV files into exact 1gb files or little less? [closed]

Every month we receive an invoice file that is always bigger than 2 GB; our print house has a 1.1 GB limitation, and we currently do all of this processing by hand.
The first step in this application would be to split those huge 2 GB files into files limited to 1 GB, in a way that won't break any CSV entry, so that each file is readable from start to end without any broken data.
How could I split the file to meet the above requirements?
Are there any libraries for this sort of processing on CSV files?
How about just copying the first 1 GB of data from the source into a new file, then searching backward for the last newline, and truncating the new file after that. Then you know how large the first file is, and you repeat the process for a second new file from that point to 1 GB later. Seems straightforward to me in just about any language (you mentioned C#, which I haven't used recently, but certainly it can easily do the job).
You didn't make it clear whether you need to copy the header line (if any) to each of the resulting files. Again, should be straightforward--just do it prior to the copying of data into each of the files.
You could also take the approach of just generically splitting the files using tar on Unix or some Zip-like utility on Windows, then telling your large-file-challenged partner to reconstruct the file from that format. Or maybe simply compressing the CSV file would work, and get you under the limit in practice.
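A rough sketch of that copy-and-cut-at-newline idea, in Python for illustration (the file names and exact chunk size are assumptions, and the header line is left aside here):

    CHUNK = 1_000_000_000  # stay under the 1.1 GB limit

    with open("invoices.csv", "rb") as src:
        part = 1
        while True:
            data = src.read(CHUNK)
            if not data:
                break
            if len(data) == CHUNK:
                cut = data.rfind(b"\n") + 1  # end of the last complete line
                if cut > 0:
                    src.seek(cut - CHUNK, 1)  # push the partial line back
                    data = data[:cut]
            with open("invoices_part%d.csv" % part, "wb") as out:
                out.write(data)
            part += 1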
There are just a few things you need to take care of (a sketch follows the list):
Keep the line breaks: split the file on a line break, i.e. cut at the last complete line before the point where the 1 GB limit would be exceeded (minus the header line size).
Copy the header to the beginning of each new file and then append the rest.
Preserve the encoding.
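Along those lines, a minimal streaming sketch in Python that cuts only at line breaks and repeats the header in every part (file names and the exact limit are assumptions; working on raw bytes sidesteps any re-encoding):

    LIMIT = 1_000_000_000  # bytes per output part

    with open("invoices.csv", "rb") as src:
        header = src.readline()
        out, written, part = None, LIMIT, 1  # force opening the first part
        for line in src:
            # Start a new part when this line would push us over the limit.
            if written + len(line) > LIMIT:
                if out:
                    out.close()
                out = open("invoices_part%d.csv" % part, "wb")
                out.write(header)
                written = len(header)
                part += 1
            out.write(line)
            written += len(line)
        if out:
            out.close()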
In a bash/terminal prompt, write:
man split
.. then
man wc
.. then simply count the number of lines in the file, divide it by X, and feed that number to split -l; you end up with X files, each less than 1.1 GB (where X = filesize / 1.1).
