At work we all deal with XML files to some degree. The format is very regular, but the sheer number of tags (<>) makes it hard to read. For example, take the following configuration fragment:
<configuration>
    <artifactItems>
        <artifactItem>
            <groupId>zzz</groupId>
            <artifactId>aaa</artifactId>
        </artifactItem>
        <artifactItem>
            <groupId>xxx</groupId>
            <artifactId>yyy</artifactId>
        </artifactItem>
    </artifactItems>
The task in this case is to extract each groupId and artifactId from the XML text above and print them in the following format:
artifactItem:groupId:zzz
artifactItem:artifactId:aaa
artifactItem:groupId:xxx
artifactItem:artifactId:yyy
Knowledge point one: XML basics
XML (Extensible Markup Language) is, like HTML, a markup language. XML is mainly used to carry and transfer data rather than to display it, so it is somewhat harder to read.
Many services use XML for their configuration files and define their settings in XML text; the fragment in this case is one example. XML's main job is to store data. Because the data is kept as plain text, XML provides a way of storing it that is independent of software and hardware, which makes it easier for different applications to share data. And since the XML format is fixed, it can be read on Windows, Linux, macOS, and other operating systems alike, so compatibility is good.
One thing to keep in mind: although XML is a markup language, unlike HTML it is not parsed and rendered into a nicely formatted web page. It exists only to structure, store, and transport information.
Knowledge point two: extracting the lines between two keywords in a file
The requirement is to print the part of the text between the line containing abc and the line containing 123, assuming abc appears above 123. With sed, a single command does it:
# sed -n '/abc/,/123/p' 1.txt
But this still prints the abc and 123 lines themselves. Getting rid of them is simple:
# sed -n '/abc/,/123/p' 1.txt |sed '/abc/d;/123/d'
If the text contains several abc and 123 pairs, this prints all of the matching lines together with nothing to tell the segments apart. Below is a crude, low-tech way of handling that, which is also good exercise for logical thinking.
mysed.sh
#!/bin/bash
# Get the line numbers of the lines containing abc or 123
grep -En 'abc|123' 1.txt | awk -F ':' '{print $1}' > /tmp/line_number.txt
# Count the total number of lines containing abc or 123
n=$(wc -l /tmp/line_number.txt | awk '{print $1}')
# Number of abc/123 pairs
n2=$((n / 2))
for i in $(seq 1 "$n2")
do
    # Each iteration handles two entries: 1 and 2 the first time,
    # 3 and 4 the second time, and so on
    m1=$((i * 2 - 1))
    m2=$((i * 2))
    # Line numbers of the abc line and the 123 line for this iteration
    nu1=$(sed -n "${m1}p" /tmp/line_number.txt)
    nu2=$(sed -n "${m2}p" /tmp/line_number.txt)
    # Line number of the line after abc
    nu3=$((nu1 + 1))
    # Line number of the line before 123
    nu4=$((nu2 - 1))
    # Print the lines between abc and 123 with sed
    sed -n "${nu3},${nu4}p" 1.txt
    # Add a separator line so the segments are easy to tell apart
    echo "============="
done
Here is a test file, 1.txt, with the following contents:
alskdfkjlasldkjfabalskdjflkajsd
asldkfjjk232k3jlk2
alskk2lklkkabclaksdj
skjjfk23kjalf09wlkjlah
lkaswlekjl9
aksjdf
123asd232323
aaaaaaaaaa
222222222222222222
abcabc12121212
fa2klj
slkj32k3j
22233232123
bbbbbbb
ddddddddddd
Processing it with sed gives:
# sed -n '/abc/,/123/p' 1.txt | sed '/abc/d;/123/d'
skjjfk23kjalf09wlkjlah
lkaswlekjl9
aksjdf
fa2klj
slkj32k3j
Processing it with mysed.sh gives:
# sh mysed.sh
skjjfk23kjalf09wlkjlah
lkaswlekjl9
aksjdf
=============
fa2klj
slkj32k3j
=============
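For comparison, the same separation of segments can be sketched as a single awk pass that keeps a flag while it is between the two keywords. This swaps the line-number bookkeeping for a state flag, so it is a different technique from mysed.sh, not a rewrite of it. The function name print_between and the throwaway demo file are assumptions made for illustration.

```shell
#!/bin/bash
# print_between START END FILE  (hypothetical helper, not part of mysed.sh)
# Prints the lines strictly between each line matching START and the next
# line matching END, and prints a separator line after every segment.
print_between() {
    awk -v start="$1" -v end="$2" '
        inside && $0 ~ end { print "============="; inside = 0; next }
        inside             { print; next }
        $0 ~ start         { inside = 1 }
    ' "$3"
}

# Demo on a small throwaway file (assumed data, not the article's 1.txt)
demo=$(mktemp)
printf '%s\n' 'x' 'abc-1' 'first' '123-1' 'y' 'abc-2' 'second' 'third' '9123' > "$demo"
print_between abc 123 "$demo"
rm -f "$demo"
```

On the demo file this prints first, a separator, then second and third, then another separator. A segment that is opened by abc but never closed by 123 is printed without a trailing separator, which may or may not be what you want.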
Case analysis
1) First, find the data segment between <artifactItem> and </artifactItem>; that segment is what needs to be analyzed.
2) Find the line numbers of the lines in the XML document that contain <artifactItem> and </artifactItem>, then use sed to cut out that part.
3) Process the extracted segment with sed and awk to pull out each keyword and its corresponding value.
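Step 3 is easier to follow on a single line. The sketch below shows how splitting first on "<" and then on ">" turns one element into a keyword and a value. The sample line and the variable names step1 and step2 are assumptions made for illustration.

```shell
#!/bin/bash
# One line as it might appear between <artifactItem> and </artifactItem>
line='    <groupId>zzz</groupId>'

# Split on "<": field 1 is the indentation, field 2 is "groupId>zzz"
step1=$(echo "$line" | awk -F '<' '{print $2}')
echo "$step1"    # groupId>zzz

# Split on ">": field 1 is the keyword, field 2 is the value
step2=$(echo "$step1" | awk -F '>' '{print $1, $2}')
echo "$step2"    # groupId zzz
```

This only works when each element sits on its own line with its opening and closing tags, as in the sample configuration.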
Reference script for this case
#!/bin/bash
# Print the XML content in the required format; this script is highly
# customized and not general-purpose
# Author:
# Date:

# Assume the XML file to be processed is named test.xml
# Get the line numbers of the lines containing <artifactItem> and </artifactItem>
grep -n 'artifactItem>' test.xml | awk -F ':' '{print $1}' > /tmp/line_number.txt

# Count the total number of <artifactItem> and </artifactItem> lines
n=$(wc -l /tmp/line_number.txt | awk '{print $1}')

# Define a function that extracts the keywords and their values
get_value() {
    # $1 and $2 are the function's two parameters: the line number just
    # below <artifactItem> and the line number just above </artifactItem>
    # Cut out the part between the two tags, then write each keyword
    # (e.g. groupId) and its value to /tmp/value.txt
    sed -n "$1,$2p" test.xml | awk -F '<' '{print $2}' | awk -F '>' '{print $1, $2}' > /tmp/value.txt

    # Go through /tmp/value.txt line by line
    cat /tmp/value.txt | while read -r line
    do
        # x is the keyword, e.g. groupId
        # y is the value of that keyword
        x=$(echo "$line" | awk '{print $1}')
        y=$(echo "$line" | awk '{print $2}')
        echo "artifactItem:$x:$y"
    done
}

# The entries in /tmp/line_number.txt come in pairs; n2 is the number of pairs
n2=$((n / 2))

# For each pair, print the keywords and their values
for j in $(seq 1 "$n2")
do
    # Each iteration handles two entries: 1 and 2 the first time,
    # 3 and 4 the second time, and so on
    m1=$((j * 2 - 1))
    m2=$((j * 2))
    # Line numbers of <artifactItem> and </artifactItem> for this iteration
    nu1=$(sed -n "${m1}p" /tmp/line_number.txt)
    nu2=$(sed -n "${m2}p" /tmp/line_number.txt)
    # Line number of the line below <artifactItem>
    nu3=$((nu1 + 1))
    # Line number of the line above </artifactItem>
    nu4=$((nu2 - 1))
    get_value "$nu3" "$nu4"
done
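As a point of comparison, the same output can probably be produced in one awk pass, without the temporary files. This is a minimal sketch under the same assumption as the script above (one element per line), not a general XML parser and not the article's method. The function name list_items and the temporary sample file are assumptions made for illustration.

```shell
#!/bin/bash
# list_items FILE  (hypothetical helper, not the article's script)
# For every <tag>value</tag> line inside an <artifactItem> block, print
# artifactItem:tag:value. Assumes one element per line, as in the sample.
list_items() {
    awk -F '[<>]' '
        /<artifactItem>/   { inside = 1; next }
        /<\/artifactItem>/ { inside = 0; next }
        inside && NF >= 4  { print "artifactItem:" $2 ":" $3 }
    ' "$1"
}

# Demo on a throwaway copy of the sample configuration
xml=$(mktemp)
cat > "$xml" <<'EOF'
<configuration>
  <artifactItems>
    <artifactItem>
      <groupId>zzz</groupId>
      <artifactId>aaa</artifactId>
    </artifactItem>
    <artifactItem>
      <groupId>xxx</groupId>
      <artifactId>yyy</artifactId>
    </artifactItem>
  </artifactItems>
</configuration>
EOF
list_items "$xml"
rm -f "$xml"
```

With "<" and ">" both acting as field separators, field 2 is the tag name and field 3 is the value, so the two-stage split from the reference script collapses into one. The line-number version is still the better exercise if the goal is to practice sed and shell arithmetic.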