Recursive Regexp Raptor (regexp4)

regexp3 (C-lang, Go-lang) and regexp4 (C-lang, Go-lang)

lang: es

raptor-book (draft, Spanish): here

benchmarks ==> here

Characteristics

  • Easy to use.
  • No error checking.
  • Regexp only.
  • The most compact and clear code in a human regexp library.
  • Zero dependencies. Not even the Go standard library is used: pure Go.
  • Match counting.
  • Catches (captures).
  • Catch replacement.
  • Placement of specific catches within a string.
  • Backreferences.
  • Basic support for UTF-8.

Introduction

Recursive Regexp Raptor is a library for searching, capturing and replacing with regular expressions, written in Go as a port of the C version of regexp4. It tries to achieve the following:

  • Having most of the features present in any other regexp library.
  • Elegant Code: simple, clear and endowed with grace.
  • Avoid using any external libraries, including the standard library.
  • Be useful learning material.

Motivation

The original development in C came about because that language has no standard regular expression library. There are several implementations, such as pcre, the regexp.h library of the GNU project, regexp (Plan 9 OS) and a few others, but in all of them the author of this work (who is a little slow) found far-fetched, mystical code spread over several files full of macros, underscores and cryptic variables. Unable to understand any of it, and after a retreat to the island of onanistic meditation, the author set out to make his own library, with casinos and Japanese schoolgirls.

Development and Testing

Development used GNU Emacs (the only true operating system), Go 1.7.5, Konsole and fish, running on Freidora 25.

To get a copy, clone the repository directly

git clone https://github.com/nasciiboy/regexp4.git ~/go/src/github.com/nasciiboy/regexp4

or through go get

go get github.com/nasciiboy/regexp4

To run the tests (inside the repository)

go test

or

go test github.com/nasciiboy/regexp4

Use

To use Recursive Regexp Raptor in your code, just import the library

import "github.com/nasciiboy/regexp4"

To use the library you must create an object of type RE, like this:

var re regexp4.RE

or

re := new( regexp4.RE )

or

re := regexp4.Compile( "regexp" )

The available methods are

// copy regexp, including string and captures
re.Copy() *RE

// compile regexp
re.Compile( re string ) *RE

// search, return number of matches
re.MatchString( txt string ) int

// search, return boolean result
re.FindString ( txt string ) bool

// compilation and search, return number of matches
re.Match( txt, re string ) int

// compilation and search, return boolean result
re.Find ( txt, re string ) bool

// return number of matches
re.Result() int

// return number of catches
re.TotCatch() int

// return a catch by its index
re.GetCatch( index int ) string

// return the start position of the catch or 0 (?)
re.GpsCatch( index int ) int

// returns the length of the catch or 0 (?)
re.LenCatch( index int ) int

// replaces the contents of a capture with rplStr, by its id
// returns the resulting string
re.RplCatch( rplStr string, id int ) string

// Create a string with the catches and text indicated in pText
// returns the resulting string
re.PutCatch( pText string ) string

Syntax

  • Text search in any location:
    re.Match( "Raptor Test", "Raptor" )
        
  • Multiple search options “exp1|exp2”
    re.Match( "Raptor Test", "Dinosaur|T Rex|Raptor|Triceratops" )
        
  • Matches any character ‘.’
    re.Match( "Raptor Test", "R.ptor" )
        
  • Zero or one match ‘?’
    re.Match( "Raptor Test", "Ra?ptor" )
        
  • One or more matches ‘+’
    re.Match( "Raaaptor Test", "Ra+ptor" )
        
  • Zero or more matches ‘*’
    re.Match( "Raaaptor Test", "Ra*ptor" )
        
  • Range of matches “{n1,n2}”
    re.Match( "Raaaptor Test", "Ra{0,100}ptor" )
        
  • Exact number of matches ‘{n1}’
    re.Match( "Raptor Test", "Ra{1}ptor" )
        
  • Minimum number of matches ‘{n1,}’
    re.Match( "Raaaptor Test", "Ra{2,}ptor" )
        
  • Sets.
    • Character Set “[abc]”
      re.Match( "Raptor Test", "R[uoiea]ptor" )
              
    • Range within a set of characters “[a-b]”
      re.Match( "Raptor Test", "R[a-z]ptor" )
              
    • Metacharacter within a set of characters “[:meta]”
      re.Match( "Raptor Test", "R[:w]ptor" )
              
    • Negated character set “[^abc]”
      re.Match( "Raptor Test", "R[^uoie]ptor" )
              
  • Match a character that is a letter “:a”
    re.Match( "RAptor Test", "R:aptor" )
        
  • Match a character that is not a letter “:A”
    re.Match( "R△ptor Test", "R:Aptor" )
        
  • Match a character that is a digit “:d”
    re.Match( "R4ptor Test", "R:dptor" )
        
  • Match a character that is not a digit “:D”
    re.Match( "Raptor Test", "R:Dptor" )
        
  • Match an alphanumeric character “:w”
    re.Match( "Raptor Test", "R:wptor" )
        
  • Match a non-alphanumeric character “:W”
    re.Match( "R△ptor Test", "R:Wptor" )
        
  • Match a character that is a space “:s”
    re.Match( "R ptor Test", "R:sptor" )
        
  • Match a character that is not a space “:S”
    re.Match( "Raptor Test", "R:Sptor" )
        
  • Match a UTF-8 (non-ASCII) character “:&”
    re.Match( "R△ptor Test", "R:&ptor" )
        
  • Escape character with special meaning “:character”

    The characters ‘|’, ‘(‘, ‘)’, ‘<’, ‘>’, ‘[‘, ‘]’, ‘?’, ‘+’, ‘*’, ‘{‘, ‘}’, ‘-‘, ‘#’ and ‘@’ are special characters. Placing one of them as is, without regard for correct syntax within the expression, can cause infinite loops and other errors.

    re.Match( ":#()|<>", ":::#:(:):|:<:>" )
        

    Special characters (except the metacharacters) lose their meaning within a set

    re.Match( "()<>[]|{}*#@?+", "[()<>:[:]|{}*?+#@]" )
        
  • Grouping “(exp)”
    re.Match( "Raptor Test", "(Raptor)" )
        
  • Grouping with capture “<exp>”
    re.Match( "Raptor Test", "<Raptor>" )
        
  • Backreferences “@id”

    A backreference needs a previously captured expression “<exp>”. It is written as the number of that capture preceded by ‘@’.

    re.Match( "ae_ea", "<a><e>_@2@1" )
        
  • Behavior modifiers

    There are two types of modifiers. The first affects the behaviour of the whole expression, the second affects specific sections. In either case the syntax is the same: the ‘#’ sign followed by the modifiers.

    Global modifiers are placed at the beginning of the whole expression and are as follows:

    • Search only the beginning ‘#^exp’
      re.Match( "Raptor Test", "#^Raptor" )
              
    • Search only at the end ‘#$exp’
      re.Match( "Raptor Test", "#$Test" )
              
    • Search the beginning and end “#^$exp”
      re.Match( "Raptor Test", "#^$Raptor Test" )
              
    • Stop with the first match “#?exp”
      re.Match( "Raptor Test", "#?Raptor Test" )
              
    • Search the string character by character “#~”

      By default, when an expression matches a region of the search text, the search continues from the end of that match. This switch overrides that behaviour so that the search always advances character by character.

      re.Match( "aaaaa", "#~a*" )
              

      In this example, without the modifier the result would be one match. With this switch the search resumes at the very next character, so it returns five matches.

    • Ignore case “#*exp”
      re.Match( "Raptor Test", "#*RaPtOr TeSt" )
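
The effect of the ‘~’ switch described above can be illustrated with a small, self-contained sketch. It does not use the library: `runOfA` is a hypothetical stand-in for the expression `a*`, and `countMatches` is an illustrative name. As an assumption of this sketch, zero-length matches are not counted.

```go
package main

import "fmt"

// runOfA is a hypothetical stand-in for the expression "a*": it returns the
// length of the run of 'a' bytes at the start of s.
func runOfA(s string) int {
	n := 0
	for n < len(s) && s[n] == 'a' {
		n++
	}
	return n
}

// countMatches counts the matches of matchLen over txt. With charByChar set,
// the search always resumes one character further on (the "#~" behaviour);
// otherwise it resumes at the end of the previous match (the default).
// Assumption of this sketch: zero-length matches are not counted.
func countMatches(txt string, matchLen func(string) int, charByChar bool) int {
	count := 0
	for pos := 0; pos < len(txt); {
		n := matchLen(txt[pos:])
		if n > 0 {
			count++
		}
		if charByChar || n == 0 {
			pos++ // step a single character
		} else {
			pos += n // resume after the end of the match
		}
	}
	return count
}

func main() {
	fmt.Println(countMatches("aaaaa", runOfA, false)) // 1
	fmt.Println(countMatches("aaaaa", runOfA, true))  // 5
}
```
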
              

All of the above switches are compatible with each other, i.e. you could search

re.Match( "Raptor Test", "#^$*?~RaPtOr TeSt" )

However, the modifiers ‘~’ and ‘?’ lose their meaning in the presence of ‘^’ and/or ‘$’.
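
One way to picture the global modifier prefix is as a set of flags read off the front of the expression. The sketch below is not the library's code: `globalMods`, its field names and `parseGlobalMods` are hypothetical, and it assumes the prefix ends at the first character that is not one of ‘^’, ‘$’, ‘?’, ‘~’ or ‘*’.

```go
package main

import "fmt"

// globalMods holds the global switches; the field names are hypothetical.
type globalMods struct {
	atStart, atEnd, firstOnly, charByChar, ignoreCase bool
}

// parseGlobalMods reads a leading "#..." modifier prefix and returns the flags
// together with the rest of the expression.
// Assumption: the prefix stops at the first non-modifier character.
func parseGlobalMods(expr string) (globalMods, string) {
	var m globalMods
	if len(expr) == 0 || expr[0] != '#' {
		return m, expr
	}
	i := 1
	for ; i < len(expr); i++ {
		switch expr[i] {
		case '^':
			m.atStart = true
		case '$':
			m.atEnd = true
		case '?':
			m.firstOnly = true
		case '~':
			m.charByChar = true
		case '*':
			m.ignoreCase = true
		default:
			return m, expr[i:]
		}
	}
	return m, expr[i:]
}

func main() {
	m, rest := parseGlobalMods("#^$*?~RaPtOr TeSt")
	fmt.Printf("%+v %q\n", m, rest)
}
```
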

An expression such as:

re.Match( "Raptor Test", "#$RaPtOr|#$TeSt" )

is erroneous: the modifier after the ‘|’ would apply to the section between ‘|’ and ‘#’, giving a wrong result.

Local modifiers are placed after the repetition indicator (if there is one) and affect the same region that repetition indicators affect, i.e. characters, sets or groups.

  • Ignore case “exp#*”
    re.Match( "Raptor Test", "(RaPtOr)#* TeS#*t" )
        
  • Do not ignore case “exp#/”
    re.Match( "RaPtOr TeSt", "#*(RaPtOr)#/ TES#/T" )
        

Captures

Catches are indexed according to their order of appearance in the expression, for example:

<   <   >  | <   <   >   >   >
= 1 ==========================
    = 2==    = 2 =========
                 = 3 =

If the expression matches more than once in the search text, the index keeps increasing in order of appearance, that is:

<   <   >  | <   >   >   <   <   >  | <   >   >   <   <   >  | <   >   >
= 1 ==================   = 3 ==================   = 5 ==================
    = 2==    = 2==           = 4==    = 4==           = 6==    = 6==
match one                match two                match three

The GetCatch method copies a catch into a string. Its prototype is:

re.GetCatch( index int ) string
index
index of the grouping (1 to n).

The method returns the capture as a string. An invalid index returns an empty string.

To get the number of catches in a search, use TotCatch:

re.TotCatch() int

which returns a value from 0 to n.

You could use this and the previous method to print all catches with a function like this:

// printCatch prints every catch of the last search (requires import "fmt").
func printCatch( re regexp4.RE ){
  for i := 1; i <= re.TotCatch(); i++ {
    fmt.Printf( "[%d] >%s<\n", i, re.GetCatch( i ) )
  }
}

Place catches in a string

re.PutCatch( pStr string ) string

The pStr argument contains the text used to build the new string, together with markers for the catches to insert. To insert a capture, write the ‘#’ sign followed by the capture index. For example, the pStr argument could be

pStr := "catch 1 >>#1<< catch 2 >>#2<< catch 747 >>#747<<"
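
A minimal sketch of this placeholder expansion, including the `##` escape for a literal ‘#’, might look like the following. It is not the library's code: `putCatch` and its slice argument are hypothetical, and two behaviours are assumptions of the sketch, namely that an out-of-range index inserts nothing and that a ‘#’ followed by neither a digit nor ‘#’ is kept as is.

```go
package main

import (
	"fmt"
	"strings"
)

// putCatch expands "#n" markers in pText with catches[n-1] and turns "##"
// into a single '#'. Hypothetical helper, written only to illustrate the idea.
func putCatch(pText string, catches []string) string {
	var b strings.Builder
	for i := 0; i < len(pText); i++ {
		if pText[i] != '#' {
			b.WriteByte(pText[i])
			continue
		}
		if i+1 < len(pText) && pText[i+1] == '#' {
			b.WriteByte('#') // "##" is an escaped '#'
			i++
			continue
		}
		// Read the decimal index that follows '#'.
		j, idx := i+1, 0
		for j < len(pText) && pText[j] >= '0' && pText[j] <= '9' {
			if idx <= len(catches) { // stop growing once out of range
				idx = idx*10 + int(pText[j]-'0')
			}
			j++
		}
		if j == i+1 {
			b.WriteByte('#') // assumption: a lone '#' is kept as is
			continue
		}
		if idx >= 1 && idx <= len(catches) {
			b.WriteString(catches[idx-1])
		} // assumption: an out-of-range index inserts nothing
		i = j - 1
	}
	return b.String()
}

func main() {
	catches := []string{"Raptor", "Test"}
	fmt.Println(putCatch("catch 1 >>#1<< catch 2 >>#2<< catch 747 >>#747<<", catches))
	// catch 1 >>Raptor<< catch 2 >>Test<< catch 747 >><<
	fmt.Println(putCatch("## Comment", catches))
	// # Comment
}
```
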

To place a literal ‘#’ character within the string, escape ‘#’ with another ‘#’, i.e.:

"## Comment" -> "# Comment"

Replace a catch

Replacement works on a copy of the search text, substituting a given string for a specified catch. The method in charge of this is RplCatch. Its prototype is:

re.RplCatch( rplStr string, id int ) string
rplStr
replacement text for the capture.
id
Capture identifier, following the order of appearance within the regular expression. Passing a wrong index yields an unaltered copy of the search string.

In this case, unlike in the GetCatch method, the id argument does not refer to one specific catch. No matter how many times an expression has captured, the identifier indicates the position of the capture within the expression itself, i.e.:

   <   <   >  | <   <   >   >   >
id = 1 ==========================
id     = 2==    = 2 =========
id                  = 3 =
capture position within the expression

The modification affects the catches like this:

<   <   >  | <   >   >       <   <   >  | <   >   >      <   <   >  | <   >   >
= 1 ==================       = 1 ==================      = 1 ==================
    = 2==    = 2==               = 2==    = 2==              = 2==    = 2==
capture one                  "..." two                   "..." three

Metacharacters search

:d
digit from 0 to 9.
:D
any character other than a digit from 0 to 9.
:a
any character that is a letter (a-z, A-Z)
:A
any character other than a letter
:w
any alphanumeric character.
:W
any non-alphanumeric character.
:s
[ \t-\r]
:S
[^ \t-\r]
:b
[ \t]
:B
[^ \t]
:&
non-ASCII character (>= 128)
:|
Vertical bar
:^
Caret
:$
Dollar sign
:(
Left parenthesis
:)
Right parenthesis
:<
Less than
:>
Greater than
:[
Left bracket
:]
Right bracket
:.
Period
:?
Question mark
:+
Plus sign
:-
Minus sign (hyphen)
:*
Asterisk
:{
Left brace
:}
Right brace
:#
Hash sign (modifier)
::
Colon
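
The table above can be read as a set of per-character tests. The following sketch is not the library's code: `matchMeta` is a hypothetical helper that works on single bytes, and it assumes that “alphanumeric” means ASCII letters and digits only and that any other character after ‘:’ stands for itself.

```go
package main

import "fmt"

func isDigit(c byte) bool  { return c >= '0' && c <= '9' }
func isLetter(c byte) bool { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') }
func isSpace(c byte) bool  { return c == ' ' || (c >= '\t' && c <= '\r') }
func isBlank(c byte) bool  { return c == ' ' || c == '\t' }

// matchMeta reports whether byte c belongs to the class named by meta, the
// character that follows ':'. Uppercase class names negate the lowercase ones.
func matchMeta(meta, c byte) bool {
	switch meta {
	case 'd':
		return isDigit(c)
	case 'D':
		return !isDigit(c)
	case 'a':
		return isLetter(c)
	case 'A':
		return !isLetter(c)
	case 'w':
		return isLetter(c) || isDigit(c) // assumption: no underscore
	case 'W':
		return !(isLetter(c) || isDigit(c))
	case 's':
		return isSpace(c)
	case 'S':
		return !isSpace(c)
	case 'b':
		return isBlank(c)
	case 'B':
		return !isBlank(c)
	case '&':
		return c >= 128 // any byte of a non-ASCII (UTF-8) character
	default:
		return c == meta // escaped literal such as ":(" or "::"
	}
}

func main() {
	fmt.Println(matchMeta('d', "R4ptor"[1])) // true
	fmt.Println(matchMeta('&', "R△ptor"[1])) // true: first byte of '△'
	fmt.Println(matchMeta('a', "R4ptor"[1])) // false
}
```
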

Additionally, characters such as new line, tab, etc. can be written with the usual C escape syntax. Likewise, Go string syntax can be used to place special characters.

Examples of use

The regexp4_test.go file contains a wide variety of tests that serve as usage examples, including the following:

re.Match( "07-07-1777", "<0?[1-9]|[12][0-9]|3[01]><[/:-\\]><0?[1-9]|1[012]>@2<[12][0-9]{3}>" )

Captures a date-formatted string, with day, separator, month and year as separate catches. The separator has to be the same on both occasions it appears.

re.Match( "https://en.wikipedia.org/wiki/Regular_expression", "(https?|ftp):://<[^:s/:<:>]+></[^:s:.:<:>,/]+>*<.>*" )

Captures something like a web link.

re.Match( "<mail>nasciiboy@gmail.com</mail>", "<[_A-Za-z0-9:-]+(:.[_A-Za-z0-9:-]+)*>:@<[A-Za-z0-9]+>:.<[A-Za-z0-9]+><:.[A-Za-z0-9]{2}>*" )

Captures the sections (user, site, domain) of something like an email address.

Hacking

Algorithm

Flow Diagram

     ┌────┐
     │init│
     └────┘
        │◀───────────────────────────────────┐
        ▼                                    │
 ┌──────────────┐                            │
 │loop in string│                            │
 └──────────────┘                            │
        │                                    │
        ▼                                    │
 ┌─────────────┐  no   ┌─────────────┐       │
<│end of string│>────▶<│search regexp│>──────┘
 └─────────────┘       └─────────────┘ no match
        │ yes                 │ match
        ▼                     ▼
┌────────────────┐     ┌─────────────┐
│report: no match│     │report: match│
└────────────────┘     └─────────────┘
        │                     │
        │◀────────────────────┘
        ▼
      ┌───┐
      │end│
      └───┘
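
The outer loop of this diagram can be sketched in a few lines. The code below is an illustration, not the library's source: `searchAt` is a hypothetical stand-in for the “search regexp” box that only compares a literal, and `find` reports the position of the first match.

```go
package main

import (
	"fmt"
	"strings"
)

// searchAt stands in for the "search regexp" box: it reports whether pattern
// matches txt starting exactly at pos. Here it is a plain literal comparison.
func searchAt(txt, pattern string, pos int) bool {
	return strings.HasPrefix(txt[pos:], pattern)
}

// find is the "loop in string": it tries every position until the search
// succeeds (report: match) or the string ends (report: no match).
func find(txt, pattern string) (int, bool) {
	for pos := 0; pos < len(txt); pos++ {
		if searchAt(txt, pattern, pos) {
			return pos, true
		}
	}
	return -1, false
}

func main() {
	fmt.Println(find("Raptor Test", "Test")) // 7 true
	fmt.Println(find("Raptor Test", "Rex"))  // -1 false
}
```
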

search regexp version one

                                                        ┌──────────────────────────────┐
┏━━━━━━━━━━━━━┓                                         ▼                              │
┃search regexp┃                                  ┌───────────┐                         │
┗━━━━━━━━━━━━━┛                                  │get builder│                         │
                                                 └───────────┘                         │
                                                        │                              │
                                                        ▼                              │
                                                ┌───────────────┐  no  ┌────────────┐  │
                                               <│we have builder│>────▶│finish: the │  │
                                                └───────────────┘      │path matches│  │
                                                        │ yes          └────────────┘  │
                              ┌────────┬─────┬──────────┼────────────┬──────────┐      │
                              ▼        ▼     ▼          ▼            ▼          ▼      │
                        ┌───────────┐┌───┐┌─────┐┌─────────────┐┌─────────┐┌────────┐  │
                        │alternation││set││point││metacharacter││character││grouping│  │
                        └───────────┘└───┘└─────┘└─────────────┘└─────────┘└────────┘  │
                              │        │     │          │            │          │      │
                              ▼        └─────┴──────────┼────────────┘          └──────┤
                     ┌────────────────┐                 │                              │
            ┌────────│ save position  │                 ▼                              │
            │        └────────────────┘          ┌─────────────┐  no match             │
            │        ┌────────────────┐         <│match builder│>──────────┐           │
            ▼◀───────│restore position│◀────┐    └─────────────┘           │           │
     ┌──────────────┐└────────────────┘     │           │ match            │           │
     │loop in paths │                       │           ▼                  ▼           │
     └──────────────┘                       │   ┌─────────────────┐ ┌───────────────┐  │
            │                               │   │advance in string│ │finish, the    │  │
            ▼                               │   └─────────────────┘ │path no matches│  │
      ┌────────────┐ yes  ┌─────────────┐   │           │           └───────────────┘  │
     <│we have path│>───▶<│search regexp│>──┘           └──────────────────────────────┘
      └────────────┘      └─────────────┘ no match
            │ no          match │
            ▼                   ▼
┌───────────────────────┐ ┌────────────┐
│finish, without matches│ │finish, the │
└───────────────────────┘ │path matches│
                          └────────────┘
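
A much reduced sketch of the path and builder loops follows. It is an assumption-laden illustration rather than the library's code: builders are limited to literal bytes and ‘.’, paths are simply the pieces between ‘|’, there is no grouping or repetition, and the names `matchPath` and `searchRegexp` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// matchPath consumes the builders of one path from txt[pos:]. It returns the
// position after the match and true, or the starting position and false.
func matchPath(txt string, pos int, path string) (int, bool) {
	cur := pos
	for i := 0; i < len(path); i++ { // get builder
		if cur >= len(txt) {
			return pos, false
		}
		if path[i] != '.' && path[i] != txt[cur] { // match builder
			return pos, false // finish: the path does not match
		}
		cur++ // advance in string
	}
	return cur, true // no builders left: the path matches
}

// searchRegexp saves the position, then tries each path from that same
// position until one of them matches.
func searchRegexp(txt string, pos int, expr string) (int, bool) {
	saved := pos // save position
	for _, path := range strings.Split(expr, "|") { // loop in paths
		if end, ok := matchPath(txt, saved, path); ok {
			return end, true
		}
		// restore position: the next path starts again from saved
	}
	return saved, false // finish, without matches
}

func main() {
	fmt.Println(searchRegexp("Raptor Test", 0, "T Rex|R.ptor")) // 6 true
	fmt.Println(searchRegexp("Raptor Test", 0, "Dino|Rex"))     // 0 false
}
```
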

search regexp version two

               ┌─────────────┐
               │save position│                             ┏━━━━━━━━━━━━━┓
               └─────────────┘                             ┃search regexp┃
        ┌────────────▶│                                    ┗━━━━━━━━━━━━━┛
        │             ▼
        │      ┌──────────────┐
        │      │loop in paths │
        │      └──────────────┘
        │             │                       ┌────────────────────────────────┐
        │             ▼                       ▼                                │
        │       ┌────────────┐   yes    ┌───────────┐                          │
        │      <│we have path│>────────▶│get builder│                          │
        │       └────────────┘          └───────────┘                          │
        │             │ no                    │                                │
        │             ▼                       ▼                                │
        │  ┌───────────────────────┐   ┌───────────────┐ no  ┌─────────────┐   │
        │  │finish: without matches│  <│we have builder│>───▶│finish: the  │   │
        │  └───────────────────────┘   └───────────────┘     │path matches │   │
        │                                     │ yes          └─────────────┘   │
        │                    ┌─────┬──────────┼────────────┬─────────┐         │
        │                    ▼     ▼          ▼            ▼         ▼         │
┌────────────────┐        ┌───┐┌─────┐┌─────────────┐┌─────────┐┌────────┐     │
│restore position│        │set││point││metacharacter││character││grouping│     │
└────────────────┘        └───┘└─────┘└─────────────┘└─────────┘└────────┘     │
        ▲                    │     │          │            │         │         │
        │                    └─────┴──────────┼────────────┘         │         │
        │                                     ▼                      ▼         │
 ┌───────────────┐      no match       ┌─────────────┐        ┌─────────────┐  │
 │finish: the    │◀────────┬──────────<│match builder│>  ┌───<│search regexp│> │
 │path no matches│         │           └─────────────┘   │    └─────────────┘  │
 └───────────────┘         │                  │          │           │         │
                           └────────────────┈┈│┈┈────────┘           │         │
                                              ▼  match               │ match   │
                                     ┌─────────────────┐             └────────▶│
                                     │advance in string│                       │
                                     └─────────────────┘                       │
                                              │                                │
                                              └────────────────────────────────┘
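
The recursive step, where a grouping builder calls “search regexp” again on its own contents, can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: only literal bytes, ‘.’, ‘|’ and non-capturing groups “(exp)” are handled, there is no repetition or backtracking into a finished group, and `splitPaths`, `groupEnd` and `searchRegexp` are hypothetical names.

```go
package main

import "fmt"

// splitPaths cuts expr at every '|' that is not nested inside parentheses.
func splitPaths(expr string) []string {
	var paths []string
	depth, start := 0, 0
	for i := 0; i < len(expr); i++ {
		switch expr[i] {
		case '(':
			depth++
		case ')':
			depth--
		case '|':
			if depth == 0 {
				paths = append(paths, expr[start:i])
				start = i + 1
			}
		}
	}
	return append(paths, expr[start:])
}

// groupEnd returns the index of the ')' closing the '(' at path[open], or -1.
func groupEnd(path string, open int) int {
	depth := 0
	for i := open; i < len(path); i++ {
		switch path[i] {
		case '(':
			depth++
		case ')':
			depth--
			if depth == 0 {
				return i
			}
		}
	}
	return -1
}

// searchRegexp tries each path from pos; a group builder recurses into
// searchRegexp with the group's contents.
func searchRegexp(txt string, pos int, expr string) (int, bool) {
	for _, path := range splitPaths(expr) { // loop in paths
		cur, ok := pos, true // every path restarts from the saved position
		for i := 0; i < len(path) && ok; i++ { // get builder
			switch {
			case path[i] == '(': // grouping: recursive search
				end := groupEnd(path, i)
				if end < 0 {
					ok = false // unbalanced group: treat as no match
					break
				}
				cur, ok = searchRegexp(txt, cur, path[i+1:end])
				i = end
			case cur < len(txt) && (path[i] == '.' || path[i] == txt[cur]):
				cur++ // builder matched: advance in string
			default:
				ok = false // finish: the path does not match
			}
		}
		if ok {
			return cur, true // finish: the path matches
		}
	}
	return pos, false // finish: without matches
}

func main() {
	fmt.Println(searchRegexp("Raptor Test", 0, "(Dino|Rap)(saur|t.r)")) // 6 true
	fmt.Println(searchRegexp("Raptor", 0, "(x|y)z"))                    // 0 false
}
```
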

License

This project is not “open source”; it is free software and, accordingly, uses the GNU GPL Version 3. Any work that includes code used in or derived from this library must comply with the terms of this license.

Contact, contribution and other things

mailto:nasciiboy@gmail.com
