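Data Augmentation and Transformation

Data Augmentation

Data augmentation is a technique that enlarges a dataset by transforming existing samples or adding noise to them while preserving their essential characteristics. It can reduce overfitting and improve a model's ability to generalize.

Applying too many transformations or too much noise can destroy the characteristics of the original data, so augmentation should be used with care.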

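Text Data

Insertion and Deletion

Insertion adds meaningless characters or words, or modifiers that do not affect the meaning of the sentence, to the existing text. Deletion is the opposite: it removes arbitrary words or characters while keeping the characteristics of the data intact.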

The ContextualWordEmbsAug class uses a BERT model to insert words into a sentence. Its action parameter accepts insert or substitute.

import nlpaug.augmenter.word as naw

texts = [
    "Those who can imagine anything, can create the impossible.",
    "We can only see a short distance ahead, but we can see plenty there that needs to be done.",
    "If a machine is expected to be infallible, it cannot also be intelligent.",
]

aug = naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="insert")
augmented_texts = aug.augment(texts)

for text, augmented in zip(texts, augmented_texts):
    print(f"src : {text}")
    print(f"dst : {augmented}")
    print("------------------")

src : Those who can imagine anything, can create the impossible.
dst : those scientists who can simply imagine seemingly anything, can create precisely the impossible.
------------------
src : We can only see a short distance ahead, but we can see plenty there that needs to be done.
dst : we probably can still only see a short distance ahead, but we can nonetheless see about plenty from there that just needs to be properly done.
------------------
src : If a machine is expected to be infallible, it cannot also be intelligent.
dst : if a logic machine is expected either to necessarily be infallible, subsequently it cannot also be highly intelligent.
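
The example above covers insertion. Deletion can be illustrated without any model. The following is a minimal sketch in plain Python, not nlpaug's implementation; the function name random_delete and its parameters p and seed are hypothetical names chosen for illustration.

```python
import random


def random_delete(text, p=0.2, seed=None):
    """Drop each whitespace-separated word with probability p."""
    rng = random.Random(seed)
    words = text.split()
    kept = [word for word in words if rng.random() >= p]
    # Keep at least one word so the sample never becomes empty.
    if not kept and words:
        kept = [rng.choice(words)]
    return " ".join(kept)


src = "Those who can imagine anything, can create the impossible."
print(f"src : {src}")
print(f"dst : {random_delete(src, p=0.2, seed=0)}")
```

A small p keeps most of the sentence intact, which matches the warning above that excessive noise destroys the characteristics of the data.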

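Swapping and Substitution

Swapping exchanges the positions of words or characters. For example, applying a swap to the sentence 'Do not look for problems, look for solutions' could produce 'look for solutions, do not look for problems'. Swapping can produce meaningless or semantically incorrect sentences, so it should be used with care depending on the nature of the data.

Substitution replaces a word or character with an arbitrary one or with a synonym. For example, 'apple' may be replaced with a similar word such as 'banana', or the Korean word '해' (sun) with its synonym '태양'. Because substitution disturbs the consistency of the data less than other augmentation methods, it augments data efficiently. However, for Korean text, the particles (조사) attached to the replaced word are not adjusted to match.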

The RandomWordAug class applies random word-level operations such as swapping. Its action parameter accepts substitute, swap, delete, or crop.

import nlpaug.augmenter.word as naw

texts = [
    "Those who can imagine anything, can create the impossible.",
    "We can only see a short distance ahead, but we can see plenty there that needs to be done.",
    "If a machine is expected to be infallible, it cannot also be intelligent.",
]

aug = naw.RandomWordAug(action="swap")
augmented_texts = aug.augment(texts)

for text, augmented in zip(texts, augmented_texts):
    print(f"src : {text}")
    print(f"dst : {augmented}")
    print("------------------")

src : Those who can imagine anything, can create the impossible.
dst : Those who can imagine can anything create, the. impossible
------------------
src : We can only see a short distance ahead, but we can see plenty there that needs to be done.
dst : We see can only a short distance but ahead, can we see plenty that there needs to done be.
------------------
src : If a machine is expected to be infallible, it cannot also be intelligent.
dst : A if is machine to expected be infallible, cannot also it be intelligent.
------------------

To substitute words with a model, use the ContextualWordEmbsAug class. Alternatively, the SynonymAug class replaces words using the WordNet database or the Paraphrase Database (PPDB).

To replace a word with one chosen from a predefined set of words, ReservedAug can be used.
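
The idea of substituting from a predefined set of words can be sketched in plain Python. This is only an illustration of the concept, not how ReservedAug is implemented; RESERVED_GROUPS and reserved_substitute are hypothetical names, and the sketch assumes exact matches on whitespace-separated tokens.

```python
import random

# Hypothetical groups of interchangeable words; each inner list is one group.
RESERVED_GROUPS = [
    ["can", "may"],
    ["create", "build", "make"],
]


def reserved_substitute(text, groups, seed=None):
    """Replace each word found in a group with a different member of that group."""
    rng = random.Random(seed)
    lookup = {word: group for group in groups for word in group}
    result = []
    for word in text.split():
        group = lookup.get(word)
        candidates = [w for w in group if w != word] if group else []
        # Words outside every group (or in a one-word group) stay unchanged.
        result.append(rng.choice(candidates) if candidates else word)
    return " ".join(result)


print(reserved_substitute("we can create it", RESERVED_GROUPS, seed=0))
```

Because the replacement always comes from a hand-written group, the augmented sentence stays within vocabulary the author has already judged to be interchangeable.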

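Back-translation

Back-translation translates the input text into another language and then translates the result back into the original language. For example, an English sentence is translated into Korean, and the Korean translation is then translated back into English. Because translating back into the original language produces text similar to the original, this has a paraphrasing effect.

The quality of back-translation depends heavily on the performance of the translation model, so it is also used as a way to evaluate translation models.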

import nlpaug.augmenter.word as naw

texts = [
    "Those who can imagine anything, can create the impossible.",
    "We can only see a short distance ahead, but we can see plenty there that needs to be done.",
    "If a machine is expected to be infallible, it cannot also be intelligent.",
]

back_translation = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de",
    to_model_name="facebook/wmt19-de-en"
)

augmented_texts = back_translation.augment(texts)

for text, augmented in zip(texts, augmented_texts):
    print(f"src : {text}")
    print(f"dst : {augmented}")
    print("------------------")

src : Those who can imagine anything, can create the impossible.
dst : Anyone who can imagine anything can achieve the impossible.
------------------
src : We can only see a short distance ahead, but we can see plenty there that needs to be done.
dst : We can only look a little ahead, but we can see a lot there that needs to be done.
------------------
src : If a machine is expected to be infallible, it cannot also be intelligent.
dst : If a machine is expected to be infallible, it cannot be intelligent.
------------------

Image Data

from torchvision import transforms

# Resize every image to 512x512, then convert it to a tensor.
transform = transforms.Compose(
    [
        transforms.Resize(size=(512, 512)),
        transforms.ToTensor()
    ]
)

Image data can be augmented with the transforms module of torchvision. The ToTensor class (transforms.ToTensor) converts a PIL.Image into a Tensor. It rescales pixel values from the range [0, 255] to [0.0, 1.0] (min-max normalization) and rearranges the input into [channels, height, width] order.
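
The two effects of the tensor conversion, rescaling and axis reordering, can be sketched with nested lists. This is a plain-Python illustration under the assumption of an 8-bit [height][width][channel] input, not torchvision's implementation; to_tensor_like is a hypothetical name.

```python
def to_tensor_like(image_hwc):
    """Convert a nested [H][W][C] list of 0-255 ints to [C][H][W] floats in [0, 1]."""
    height = len(image_hwc)
    width = len(image_hwc[0])
    channels = len(image_hwc[0][0])
    return [
        [[image_hwc[y][x][c] / 255.0 for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# A 1x2 RGB image: one black pixel and one white pixel.
image = [[[0, 0, 0], [255, 255, 255]]]
tensor = to_tensor_like(image)
print(len(tensor), len(tensor[0]), len(tensor[0][0]))  # 3 1 2 (channels, height, width)
print(tensor[0])  # [[0.0, 1.0]]
```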

Rotation and Flipping

Training on rotated or flipped images makes the model more robust to transformed inputs and improves its generalization performance.

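from torchvision import transforms

transform = transforms.Compose(
    [
        # Rotate by a random angle in [-30, 30] degrees.
        transforms.RandomRotation(degrees=30, expand=False, center=None),
        # Flip horizontally and vertically, each with probability 0.5.
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomVerticalFlip(p=0.5)
    ]
)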

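The code above rotates the image by a random angle between -30° and +30° and applies a horizontal flip and a vertical flip, each with 50% probability. With expand=True, the output canvas is enlarged so that the whole rotated image fits without its corners being cut off; with expand=False, the output keeps the input size and the corners are cropped. If center is not given, the image is rotated about its center; when a center is given, its coordinates are measured from the top-left corner.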
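
To see how much the canvas grows with expand=True, the bounding box of a rotated rectangle can be computed directly. The following is a geometric sketch with a hypothetical helper name, expanded_size; torchvision's actual output size may differ by a pixel because of its own rounding.

```python
import math


def expanded_size(width, height, degrees):
    """Bounding-box size of a width x height rectangle rotated by `degrees`."""
    theta = math.radians(degrees)
    cos_t, sin_t = abs(math.cos(theta)), abs(math.sin(theta))
    new_width = width * cos_t + height * sin_t
    new_height = width * sin_t + height * cos_t
    # Round before ceil so floating-point noise does not add an extra pixel.
    return math.ceil(round(new_width, 6)), math.ceil(round(new_height, 6))


print(expanded_size(512, 512, 30))  # (700, 700)
print(expanded_size(4, 2, 90))      # (2, 4)
```

A 512x512 image rotated by 30° therefore needs roughly a 700x700 canvas to avoid cropping, which is why expand=True changes the output size.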

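Cropping and Padding

When building models such as object detection (OD) models, the training images may vary in size, or the main object may occupy only a small part of the image. In such cases, the unnecessary regions can be cropped away, or padding can be added to bring the images to a common size.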

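from torchvision import transforms

transform = transforms.Compose(
    [
        # Crop a random 512x512 region, then add a 50-pixel border on every side.
        transforms.RandomCrop(size=(512, 512)),
        transforms.Pad(padding=50, fill=(127, 127, 255), padding_mode="constant")
    ]
)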

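When padding_mode is constant, the border is filled with the color fill=(127, 127, 255). With reflect or symmetric, the given RGB value is ignored and the border is generated from the image's own pixel values. RandomCrop can likewise pad the empty margin that arises when cropping.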
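
The difference between the padding modes is easiest to see on a single row of pixels. The sketch below is plain Python with a hypothetical helper, pad_row; it is meant to mirror the behavior described above, not to reproduce torchvision's code.

```python
def pad_row(row, n, mode="constant", fill=0):
    """Pad a 1-D list with n elements on both sides using the given mode."""
    if n == 0:
        return list(row)
    if mode == "constant":
        left = right = [fill] * n
    elif mode == "reflect":
        # Mirror around the edge pixel without repeating it.
        if n >= len(row):
            raise ValueError("reflect padding must be smaller than the row")
        left = row[1:n + 1][::-1]
        right = row[-n - 1:-1][::-1]
    elif mode == "symmetric":
        # Mirror around the edge, repeating the edge pixel.
        if n > len(row):
            raise ValueError("symmetric padding must not exceed the row length")
        left = row[:n][::-1]
        right = row[-n:][::-1]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return left + list(row) + right


row = [1, 2, 3, 4]
print(pad_row(row, 2, "constant"))   # [0, 0, 1, 2, 3, 4, 0, 0]
print(pad_row(row, 2, "reflect"))    # [3, 2, 1, 2, 3, 4, 3, 2]
print(pad_row(row, 2, "symmetric"))  # [2, 1, 1, 2, 3, 4, 4, 3]
```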

Resizing

To train an image model, the images used as training data must all be the same size.

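from torchvision import transforms

# Resize every image to exactly 512x512 (the aspect ratio is not preserved).
transform = transforms.Compose(
    [
        transforms.Resize(size=(512, 512))
    ]
)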

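If size is given as a single integer, the smaller of the height and width is resized to that value and the other side is scaled to preserve the aspect ratio.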
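
The integer-size rule can be written out as a short calculation. This is a sketch with a hypothetical helper name, resized_shape, and truncating with int() is an assumption about how the longer side is rounded.

```python
def resized_shape(height, width, size):
    """Output (height, width) when the smaller edge is scaled to `size`."""
    if height <= width:
        return size, int(size * width / height)
    return int(size * height / width), size


print(resized_shape(300, 600, 150))  # (150, 300)
print(resized_shape(600, 300, 150))  # (300, 150)
```

Because the aspect ratio is preserved, images of different shapes still end up with different sizes, so an integer size is usually followed by a crop when a fixed input size is required.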

Geometric Transformation

Geometric transformations such as the affine transformation or the perspective transformation are used to deform images.

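from torchvision import transforms

transform = transforms.Compose(
    [
        # Random rotation, translation, scaling, and shear in a single transform.
        transforms.RandomAffine(
            degrees=15, translate=(0.2, 0.2),
            scale=(0.8, 1.2), shear=15
        )
    ]
)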

The affine transformation deforms the image according to the given rotation angle (degrees), translation (translate), scale (scale), and shear (shear).
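
What these four parameters do to a single coordinate can be sketched as follows. The composition order (shear, then rotation and scaling, then translation), the y-up coordinate system, and the name affine_point are assumptions made for illustration; torchvision builds its matrix with its own conventions.

```python
import math


def affine_point(x, y, degrees=0.0, translate=(0.0, 0.0), scale=1.0, shear_x=0.0):
    """Apply shear (along x), rotation, scaling, then translation to a point."""
    # Shear along the x axis: x shifts in proportion to y.
    x_sheared = x + math.tan(math.radians(shear_x)) * y
    # Rotate counter-clockwise about the origin and scale uniformly.
    theta = math.radians(degrees)
    x_rot = scale * (math.cos(theta) * x_sheared - math.sin(theta) * y)
    y_rot = scale * (math.sin(theta) * x_sheared + math.cos(theta) * y)
    # Translate last.
    return x_rot + translate[0], y_rot + translate[1]


# Rotate (1, 0) by 90 degrees, double its length, then shift it right by 1.
x, y = affine_point(1.0, 0.0, degrees=90, scale=2.0, translate=(1.0, 0.0))
print(round(x, 6), round(y, 6))  # 1.0 2.0
```

Straight lines stay straight and parallel lines stay parallel under any such combination, which is what distinguishes an affine transformation from a perspective transformation.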

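Color Transformation

The characteristics of image data depend heavily on the distribution and patterns of pixel values, but the preceding transformations do not change color. Normalizing the data so that it is not biased toward particular colors helps the model generalize.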

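from torchvision import transforms

transform = transforms.Compose(
    [
        # Randomly perturb brightness, contrast, saturation, and hue.
        transforms.ColorJitter(
            brightness=0.3, contrast=0.3,
            saturation=0.3, hue=0.3
        ),
        transforms.ToTensor(),
        # Normalize each channel: (value - mean) / std.
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        ),
        transforms.ToPILImage()
    ]
)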
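
The per-channel normalization in the pipeline above amounts to subtracting a mean and dividing by a standard deviation. The following plain-Python sketch applies it to one RGB pixel already scaled to [0, 1]; normalize_pixel is a hypothetical name, and the constants are the same values used above (the commonly cited ImageNet statistics).

```python
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Normalize one RGB pixel channel by channel: (value - mean) / std."""
    return tuple((value - m) / s for value, m, s in zip(rgb, mean, std))


print(normalize_pixel((0.485, 0.456, 0.406)))  # (0.0, 0.0, 0.0)
print(normalize_pixel((1.0, 1.0, 1.0)))
```

A pixel equal to the channel means maps to zero, and brighter or darker pixels map to positive or negative values measured in standard deviations.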

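Noise

Adding random noise is a good way to keep the model from becoming biased toward specific pixel values. Even when noise is not used for training, it can be applied to the test data to evaluate the robustness of the model.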

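import numpy as np
from imgaug import augmenters as iaa
from PIL import Image
from torchvision import transforms


class IaaTransforms:
    """Wrap an imgaug pipeline so it can be used inside transforms.Compose."""

    def __init__(self):
        self.seq = iaa.Sequential([
            iaa.SaltAndPepper(p=(0.03, 0.07)),
            iaa.Rain(speed=(0.3, 0.7))
        ])

    def __call__(self, images):
        # imgaug works on NumPy arrays, so convert the PIL image first.
        images = np.array(images)
        print(images.shape, images.dtype)  # debug output: array shape and dtype
        augmented = self.seq.augment_image(images)
        return Image.fromarray(augmented)


transform = transforms.Compose([
    IaaTransforms()
])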
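
Salt-and-pepper noise itself is simple enough to sketch without any library. The version below works on a grayscale image stored as nested lists and is only an illustration of the idea, not imgaug's implementation; salt_and_pepper and its parameters are hypothetical names.

```python
import random


def salt_and_pepper(image, p=0.05, seed=None):
    """Return a copy of a grayscale image with each pixel set to 0 or 255 with probability p."""
    rng = random.Random(seed)
    noisy = []
    for row in image:
        new_row = []
        for value in row:
            if rng.random() < p:
                new_row.append(rng.choice((0, 255)))  # pepper (0) or salt (255)
            else:
                new_row.append(value)
        noisy.append(new_row)
    return noisy


image = [[128] * 8 for _ in range(4)]
for row in salt_and_pepper(image, p=0.1, seed=0):
    print(row)
```

Applying the same function to test images with increasing p gives a simple way to observe how quickly a model's accuracy degrades under noise.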

Cutout and Random Erasing

Cutout fills a random region of interest (ROI) with the pixel value 0, whereas random erasing fills the region with random pixel values.

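from torchvision import transforms

transform = transforms.Compose([
    transforms.ToTensor(),                            # RandomErasing operates on tensors
    transforms.RandomErasing(p=1.0, value=0),         # cutout: fill with 0
    transforms.RandomErasing(p=1.0, value='random'),  # random erasing: fill with random values
    transforms.ToPILImage()
])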

Both techniques make the model more robust to images with missing regions or occlusions.
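
The two fill strategies differ only in the value written into the box. The sketch below uses a grayscale image stored as nested lists and takes the box explicitly so that the result is easy to follow, whereas the library transform picks the box at random; erase_box is a hypothetical name.

```python
import random


def erase_box(image, top, left, height, width, value=0):
    """Return a copy of a grayscale image with a box filled by `value`.

    `value` may be a number (cutout) or a zero-argument callable (random erasing).
    """
    erased = [list(row) for row in image]
    for y in range(top, min(top + height, len(erased))):
        for x in range(left, min(left + width, len(erased[y]))):
            erased[y][x] = value() if callable(value) else value
    return erased


image = [[9, 9, 9], [9, 9, 9], [9, 9, 9]]
print(erase_box(image, 0, 1, 2, 2))  # [[9, 0, 0], [9, 0, 0], [9, 9, 9]]

rng = random.Random(0)
print(erase_box(image, 0, 1, 2, 2, value=lambda: rng.randint(0, 255)))
```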

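CutMix

CutMix overwrites a patch of one image with a patch taken from another image. Pasting a new patch over the original patch yields a natural-looking image. The patch is pasted with its size and aspect ratio taken into account.

The label ($y$) is computed from how much each image contributes ($\lambda$), as in the following formula.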

$$ \tilde{y}=\lambda y_a + (1-\lambda)y_b $$
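
As a worked example of the formula, the sketch below mixes two one-hot labels. It assumes that image a is the base image, that the patch comes from image b, and that $\lambda$ is the fraction of the area still belonging to image a; cutmix_label is a hypothetical name.

```python
def cutmix_label(y_a, y_b, patch_height, patch_width, image_height, image_width):
    """Mix two label vectors in proportion to the area each image occupies."""
    # lambda: fraction of the final image that still comes from image a.
    lam = 1.0 - (patch_height * patch_width) / (image_height * image_width)
    return [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]


# A 50x50 patch pasted into a 100x100 image covers 25% of the area.
print(cutmix_label([1, 0], [0, 1], 50, 50, 100, 100))  # [0.75, 0.25]
```

The mixed label always sums to 1 when both inputs are one-hot, so it can be used directly as a soft target.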