[Java 기초문법] 정규식 다루기 | Regex | transcript에서 element 추출해보기, greedy operator 사용시 주의할점, finding method

[Java 기초문법] by Professional Java Developer Career Starter: Java Foundations @ Udemy

Real Text Parsing

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TranscriptParser {
    public static void main(String[] args) {
        String transcript = """
                Student Number:    1234598872       Grade:    11
                Birthdate:    01/02/2000       Gender:    M
                State ID:     8923827123
                                
                Cumulative GPA (Weighted)     3.82
                Cumulative GPA (Unweighted)    3.46
                """;
        Pattern pat = Pattern.compile("");
        Matcher mat = pat.matcher(transcript);
        if (mat.matches()){
            
        }
    }
}

먼저 studentnumber를 보자. 10개의 number를 세팅하고 capture group으로 네이밍을 해준다.

String regex = "Student Number: (?<StudentNum>\\d{10})";

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TranscriptParser {
    public static void main(String[] args) {
        String transcript = """
                Student Number:    1234598872       Grade:    11
                Birthdate:    01/02/2000       Gender:    M
                State ID:     8923827123
                                
                Cumulative GPA (Weighted)     3.82
                Cumulative GPA (Unweighted)    3.46
                """;
        String regex = "Student Number: (?<StudentNum>\\d{10})";
        Pattern pat = Pattern.compile(regex);
        Matcher mat = pat.matcher(transcript);
        if (mat.matches()){
            System.out.println(mat.group("studentNum"));

        }
    }
}

이때 아무것도 리턴하지 못함 = 그 뒤로 다른 것들이 같이 계산이 되어서

start text가 carrot return을 갖고 있기 때문에 단순히 .*로는 인식하지 못함.

Pattern pat = Pattern.compile(regex, Pattern.DOTALL);

=> return all that characters that normally expected to but also carriage returnand new lines

여전히 리턴 nothing

"student number"에서 띄어진 스페이스를 명시적으로 하드코딩해주면 match

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TranscriptParser {
    public static void main(String[] args) {
        String transcript = """
                Student Number:    1234598872       Grade:    11
                Birthdate:    01/02/2000       Gender:    M
                State ID:     8923827123
                                
                Cumulative GPA (Weighted)     3.82
                Cumulative GPA (Unweighted)    3.46
                """;
        String regex = "Student\\sNumber:\\s(?<studentNum>\\d{10}).*";
        Pattern pat = Pattern.compile(regex, Pattern.DOTALL);
        Matcher mat = pat.matcher(transcript);
        if (mat.matches()){
            System.out.println(mat.group("studentNum"));

        }
    }
}

return 1234598872

grade를 가져와보자

one or more space를 표현하기 위해 +를 기입

두자리수를 표현하기 위해 대괄호에 1,2를 기입

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TranscriptParser {
    public static void main(String[] args) {
        String transcript = """
                Student Number:    1234598872       Grade:    11
                Birthdate:    01/02/2000       Gender:    M
                State ID:     8923827123
                                
                Cumulative GPA (Weighted)     3.82
                Cumulative GPA (Unweighted)    3.46
                """;
        String regex = """
            Student\\sNumber:\\s(?<studentNum>\\d{10}).*            # Grab Student Number
            Grade:\\s+(?<grade>\\d{1,2}).*                       # Grab Grab
            Birthdate:\\s+(?<grade>\\d{1,2}).*                       # Grab Grab
            """;

        Pattern pat = Pattern.compile(regex, Pattern.DOTALL | Pattern.COMMENTS);
        Matcher mat = pat.matcher(transcript);
        if (mat.matches()){
            System.out.println(mat.group("studentNum"));
            System.out.println(mat.group("grade"));

        }
    }
}

birthdate를 가져와보자.

leading zero가 있으므로 대괄호를 flexible하게 할 필요 없이 2로 통일

year도 항상 4이므로 4로 통일

안에 백슬래시만 넣어주면 동일한 형식

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TranscriptParser {
    public static void main(String[] args) {
        String transcript = """
                Student Number:    1234598872       Grade:    11
                Birthdate:    01/02/2000       Gender:    M
                State ID:     8923827123
                                
                Cumulative GPA (Weighted)     3.82
                Cumulative GPA (Unweighted)    3.46
                """;
        String regex = """
            Student\\sNumber:\\s(?<studentNum>\\d{10}).*            # Grab Student Number
            Grade:\\s+(?<grade>\\d{1,2}).*                       # Grab Grab
            Birthdate:\\s+(?<birthMonth>\\d{2})/(?<birthDay>\\d{2})/(?<birthYear>\\d{4}).*                       # Grab Grab
            """;

        Pattern pat = Pattern.compile(regex, Pattern.DOTALL | Pattern.COMMENTS);
        Matcher mat = pat.matcher(transcript);
        if (mat.matches()){
            System.out.println(mat.group("studentNum"));
            System.out.println(mat.group("grade"));
            System.out.println(mat.group("birthMonth"));
            System.out.println(mat.group("birthDay"));
            System.out.println(mat.group("birthYear"));

        }
    }
}

젠더와 State id도 같은 맥락으로 가져온다.

gender에서 특이한 점은 boundary를 한정해주어 추가 space가 겹치지 않도록 막아줌

GPA를 가져와보자.

그냥 DIGIT이 아니라 x.xx로 되어 있기 때문에 .을 escape로 함께 처리해서 기입해주어야 한다.

Weighted\\)\\s+(?<weightedGpa>\\d+\\.)\\b.*

하지만 보다 더 flexible하게 하는 방법이 있다.

Weighted\\)\\s+(?<weightedGpa>[\\d\\.]+)\\b.*                                   # Grab W GPA

=> works

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TranscriptParser {
    public static void main(String[] args) {
        String transcript = """
                Student Number:    1234598872       Grade:    11
                Birthdate:    01/02/2000       Gender:    M
                State ID:     8923827123
                                
                Cumulative GPA (Weighted)     3.82
                Cumulative GPA (Unweighted)    3.46
                """;
        String regex = """
            Student\\sNumber:\\s(?<studentNum>\\d{10}).*                                     # Grab Student Number
            Grade:\\s+(?<grade>\\d{1,2}).*                                                   # Grab Grade
            Birthdate:\\s+(?<birthMonth>\\d{2})/(?<birthDay>\\d{2})/(?<birthYear>\\d{4}).*   # Grab Birthdae
            Gender:\\s+(?<gender>\\w+)\\b.*                                                  # Grab Gender
            State\\sID:\\s+(?<stateId>\\d+)\\b.*                                             # Grab Student Number
            Weighted\\)\\s+(?<weightedGpa>[\\d\\.]+)\\b.*                                    # Grab W GPA
            Unweighted\\)\\s+(?<unweightedGpa>[\\d\\.]+)\\b.*                                # Grab UW GPA
            
            """;

        Pattern pat = Pattern.compile(regex, Pattern.DOTALL | Pattern.COMMENTS);
        Matcher mat = pat.matcher(transcript);
        if (mat.matches()){
            System.out.println(mat.group("studentNum"));
            System.out.println(mat.group("grade"));
            System.out.println(mat.group("birthMonth"));
            System.out.println(mat.group("birthDay"));
            System.out.println(mat.group("birthYear"));
            System.out.println(mat.group("gender"));
            System.out.println(mat.group("stateId"));
            System.out.println(mat.group("weightedGpa"));
            System.out.println(mat.group("unweightedGpa"));

        }
    }
}

greedy operator

그런데 *은 greedy한 operator로 너무 광범위하다.

위의 코드는 분명 working 하지만 너무 specific 하지 못하다는 단점이 있음.

public static void main(String[] args) {
    String transcript = """
            Student Number:    1234598872       Grade:    11
            Birthdate:    01/02/2000       Gender:    M
            State ID:     8923827123
                            
            Cumulative GPA (Weighted)     3.82
            Cumulative GPA (Unweighted)    3.46
            """;
    String regex = """
        Student\\sNumber:\\s(?<studentNum>\\d{10}).*                                     # Grab Student Number
        Grade:\\s+(?<grade>\\d{1,2}).*                                                   # Grab Grade
        Birthdate:\\s+(?<birthMonth>\\d{2})/(?<birthDay>\\d{2})/(?<birthYear>\\d{4}).*   # Grab Birthdae
        Gender:\\s+(?<gender>\\w+)\\b.*                                                  # Grab Gender
        State\\sID:\\s+(?<stateId>\\d+)\\b.*                                             # Grab Student Number
        Cumulative.*(?<weightedGPA>[\\d\\.]+)\\b.*                                             # Grab Student Number
        
        
        
        #Weighted\\)\\s+(?<weightedGpa>[\\d\\.]+)\\b.*                                    # Grab W GPA
        #Unweighted\\)\\s+(?<unweightedGpa>[\\d\\.]+)\\b.*                                # Grab UW GPA
        
        """;

예를 들어 .*의 명령은

State\\sID:\\s+(?<stateId>\\d+)\\b.* 
Cumulative.*(?<weightedGPA>[\d\.]+)\b.*

Stateid가 끝나는 지점부터 해서 Cumultative를 만나면 그만 멈추는 게 아니라

grab the entire rest of the string => all the way to the string > then go backward => until matches expressed expression

탐색하고 끝내는 영역

그 다음에 선택되는 영역이

결론적으로 6이 return 된다

how can we make it less greedy of regex engine ?

asterisk 뒤에 물음표를 넣어주기

State\\sID:\\s+(?<stateId>\\d+)\\b.*?                                             # Grab Student Number

이렇게 했을 때 탐색 영역 :

마찬가지로 Cumulative .* 뒤에 ?를 넣으면

Cumulative.*?(?<weightedGPA>[\\d\\.]+)\\b.*

dot starts from Cumulative, including this space, unti it meets digits or dot, and the grap all of that.

이 스크립트에서처럼 반복되는 표현 (Cumulative)이 나오면 greedy operator를 사용할 때 조심해야 한다.

Finding Multiple Matches

아래와 같은 상황에서 regex를 이용해 추출한다고 해보자.

public class PeopleMatching {
    private  String people = """
            Flinstone, Fred, 1/1/1990
            Rubble, Barney, 2/2/1905
            Flinstone, Wilma, 3/3/1910
            Rubble, Betty, 4/4/1915
            """;
}

string 추출하는 regex


    String regex = "(?<lastName>\\w+),\\s*(?<firstName>\\w+),\\s*(?<dob>\\d{1,2}/\\d{1,2}/\\d{4})\\n";
}

지금은 하나하나씩 match를 해주는 코드를 짰는데

데이터가 늘어날 경우 좀 더 압축적인 표현이 필요하지 않을까?

find()를 쓰면

if one regex matches of these lines properly => it should help us to zero in each of these four lines.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PeopleMatching {

    public static void main(String[] args) {
        String people = """
            Flinstone, Fred, 1/1/1990
            Rubble, Barney, 2/2/1905
            Flinstone, Wilma, 3/3/1910
            Rubble, Betty, 4/4/1915
            """;

        String regex = "(?<lastName>\\w+),\\s*(?<firstName>\\w+),\\s*(?<dob>\\d{1,2}/\\d{1,2}/\\d{4})\\n";
        Pattern pat = Pattern.compile(regex);
        Matcher mat = pat.matcher(people);

        mat.find();
        System.out.println(mat.group("lastName"));
        System.out.println(mat.group("firstName"));
        System.out.println(mat.group("dob"));
    }
}

match가 성공하면 그 다음것으로 넘어가기 때문에 한번 더 늘리면 그 다음것을 찾아준다.

we can tell a certain number to start from

리턴

인덱스 추출 기능

mat.find(35);
System.out.println(mat.group("lastName"));
System.out.println(mat.group("firstName"));
System.out.println(mat.group("dob"));

System.out.println(mat.start("firstName"));
System.out.println(mat.end("firstName"));

Flinstone
Wilma
3/3/1910
62
67

'Programming > Java, Spring' 카테고리의 다른 글

[Java 기초문법] 자바 수학 연산 및 math 관련 함수들, 랜덤함수, 원의 너비 구하기, 구심력 구하기, BigDecimal, Calculting Compound Interest, formattin (0)	2022.10.18
[Java 기초문법] 숫자 다루기 \| Data types , double and floats 사용시 주의할 점, 진법 표현들 (0)	2022.10.18
[Java 기초문법] 정규식 다루기 \| Regex \| 기초 표현, Capture Groups , Comments, 여러가지 표현들 (0)	2022.10.18
[Java 기초문법] 텍스트 다루기 \| concatenation, string length, indexOf, split, startsWith & endsWith, contentEquals (0)	2022.10.17
[Java 기초문법] 텍스트 다루기 \| creating strings, 대소문자 변환, isEmpty vs isBlank, replace, strip(), charAt(), contains() (0)	2022.10.17