Saya memiliki tabel yang dipisahkan tab yang tiga baris pertamanya terlihat seperti ini - satu baris header dan dua entri pertama:

Geneid  Chr Start   End Strand  Length  Feature_count   contig_ID   MAG_id  RPKM
ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical protein   G1_719_cleanedcontig_v2_1580    346495  347049  +   555 68733   NODE_28_length_349332_cov_12.741083 ag0r3_bin.39    11455.58033225708
ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical protein  G1_719_cleanedcontig_v2_1582    147164  151051  -   3888    61026   NODE_113_length_189623_cov_11.186889    ag0r3_bin.39    1451.8890393965803

Saya ingin mengekstrak, untuk setiap baris, informasi antara "ID" dan titik koma pertama (misalnya untuk baris pertama, "G1_719_cleanedcontig_v2_1582_130", dan meletakkannya di kolom di sebelah kanan. Bagaimana saya melakukannya menggunakan Bash atau Python atau kombinasi keduanya?

0
kdickson 4 September 2020, 03:21

2 jawaban

Jawaban Terbaik

Diasumsikan kerangka data adalah

                                             text
0  Geneid  Chr Start   End Strand  Length  Featur...
1  ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=...
2  ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=...

Cukup ekstrak karakter antara ID= DAN ;

df['newcolumn']=df.text.str.extract('(?<=[ID]\=)(.*?)(?=\;)')




                                            text  \
0  Geneid  Chr Start   End Strand  Length  Featur...   
1  ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=...   
2  ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=...   

                          newcolumn  
0                               NaN  
1  G1_719_cleanedcontig_v2_1580_319  
2  G1_719_cleanedcontig_v2_1582_130  
0
wwnde 4 September 2020, 00:32
In [4]: df = pd.DataFrame({'Geneid': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'protein',
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'protein'},
   ...:  'Chr': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'G1_719_cleanedcontig_v2_1580',
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'G1_719_cleanedcontig_v2_1582'},
   ...:  'Start': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 346495,
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 147164},
   ...:  'End': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 347049,
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 151051},
   ...:  'Strand': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': '+',
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': '-'},
   ...:  'Length': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 555,
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 3888},
   ...:  'Feature_count': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 68733,
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 61026},
   ...:  'contig_ID': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'NODE_28_length_349332_cov_12.7410
   ...: 83',
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'NODE_113_length_189623_cov_11.186889'},
   ...:  'MAG_id': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'ag0r3_bin.39',
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 'ag0r3_bin.39'},
   ...:  'RPKM': {'ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G1_719_cleanedcontig_v2_1580_319;contig_length=349332;orf_length=554;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 11455.580332257081,
   ...:   'ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G1_719_cleanedcontig_v2_1582_130;contig_length=189623;orf_length=3887;partial=00;sourcedb=None;annotvalue=0;ec=;product=hypothetical': 1451.8890393965805}})

In [5]: 

In [5]: df
Out[5]: 
                                                     Geneid                           Chr   Start     End Strand  Length  Feature_count                             contig_ID        MAG_id          RPKM
ID=G1_719_cleanedcontig_v2_1580_319;locus_tag=G...  protein  G1_719_cleanedcontig_v2_1580  346495  347049      +     555          68733   NODE_28_length_349332_cov_12.741083  ag0r3_bin.39  11455.580332
ID=G1_719_cleanedcontig_v2_1582_130;locus_tag=G...  protein  G1_719_cleanedcontig_v2_1582  147164  151051      -    3888          61026  NODE_113_length_189623_cov_11.186889  ag0r3_bin.39   1451.889039

In [6]: pd.Series(df.index).apply(lambda x:x[x.index("ID=")+3:].split(";")[0])
Out[6]: 
0    G1_719_cleanedcontig_v2_1580_319
1    G1_719_cleanedcontig_v2_1582_130
dtype: object

0
Amir saleem 29 November 2020, 09:40