Wednesday, June 15, 2016

Oracle 12c - Pattern Matching Part 1

Oracle 12c Pattern Matching





Big SQL :)

Before getting into Oracle 12c Pattern Matching - the set of questions which came across after looking at Oracle 12c Patern Matching - MATCH_RECOGNIZE are 
  1. With SQL we have many possible solution, what could be that one scenario which cannot be handled or may not be an optimal solution via SQL until Oracle 11g which lead to SQL Pattern Matching
  2. It looks big, Very BIG
  3. So many key words
  4. Complex
           Pattern Matching comes examples like V/W patterns in stock market, suspicious financial transaction and you will find much more online, But let us get into basic question why not in 11g, Demystifying the unknown usually helps so lets begin. Am going to walk you through multiple scenarios comparing regular expression and analytical SQL. 

Assume we have data stored in a comma separated format in a column like below

with tbl_reg as
(
select 'Linda' as customer, 'TV,BIKE,MOBILE,CAR' as Buying_Pattern from dual
union
select 'Lorena' as customer, 'TV,BIKE,MOBILE' as Buying_Pattern from dual
union
select 'Mark' as customer, 'TV,BIKE,BIKE,MOBILE' as Buying_Pattern from dual
)
SELECT * from tbl_reg;

CUSTOM BUYING_PATTERN
------ -------------------
Linda  TV,BIKE,MOBILE,CAR
Lorena TV,BIKE,MOBILE
Mark   TV,BIKE,BIKE,MOBILE

Scenario/Pattern 1- Data Format = CSV


           Now if we have to look out for buying pattern  like 'TV,BIKE,MOBILE,CAR', then would straight away jump to regular expressions,
SQL
with tbl_reg as
(
select 'Linda' as customer, 'TV,BIKE,MOBILE,CAR' as Buying_Pattern from dual
union
select 'Lorena' as customer, 'TV,BIKE,MOBILE' as Buying_Pattern from dual
union
select 'Mark' as customer, 'TV,BIKE,BIKE,MOBILE' as Buying_Pattern from dual
)
SELECT * FROM TBL_REG WHERE REGEXP_LIKE(REPLACE(Buying_Pattern,',',' '),'TV BIKE MOBILE CAR');


CUSTOM BUYING_PATTERN
------ -------------------
Linda  TV,BIKE,MOBILE,CAR


          With proper Data Modeling in place the data is not gonna be stored in CSV format, so let us re-iterate the requirement to identifying the pattern

Scenario/Pattern 1 - Data Format = Rows

                    With Anaytical function - LEAD the same can be achived 
SQL
WITH PRDT_ORDER AS
(SELECT
BP.*
,LEAD(PRODUCT) OVER (PARTITION BY CUSTOMER ORDER BY BUY_DT) NEXT_PRDT
,LEAD(PRODUCT,2) OVER (PARTITION BY CUSTOMER ORDER BY BUY_DT) SECOND_NEXT_PRDT
,LEAD(PRODUCT,3) OVER (PARTITION BY CUSTOMER ORDER BY BUY_DT) THIRD_NEXT_PRDT
FROM BUYING_PATTERN BP
)
SELECT CUSTOMER,BUY_DT,PRODUCT FROM PRDT_ORDER
WHERE
PRODUCT='TV'
AND NEXT_PRDT = 'BIKE'
AND SECOND_NEXT_PRDT = 'MOBILE'
AND THIRD_NEXT_PRDT = 'CAR';

CUSTOMER   BUY_DT      PRODUCT
-------- ------------------------------ -----------
LINDA     15-JUN-16 07.51.48.812559 AM     TV



Scenario 2 - Pattern 1 + Pattern 2 - Data Format = CSV


       If we want to look out for another pattern along with existing one , .ie along with pattern 1 = TV,BIKE,MOBILE,CAR, check for pattern 2 = TV,BIKE,BIKE,MOBILE  it is just a simple change to the existing SQL

with tbl_reg as
(
select 'Linda' as customer, 'TV,BIKE,MOBILE,CAR' as Buying_Pattern from dual
union
select 'Lorena' as customer, 'TV,BIKE,MOBILE' as Buying_Pattern from dual
union
select 'Mark' as customer, 'TV,BIKE,BIKE,MOBILE' as Buying_Pattern from dual
)
SELECT * FROM TBL_REG WHERE REGEXP_LIKE(REPLACE(Buying_Pattern,',',' '),'TV BIKE MOBILE CAR|TV BIKE BIKE MOBILE');

CUSTOM BUYING_PATTERN
------ -------------------
Linda  TV,BIKE,MOBILE,CAR
Mark   TV,BIKE,BIKE,MOBILE


Now the same in rows

Scenario 2 - Pattern 1+Pattern 2 - Data Format = Rows


pattern 1 = TV,BIKE,MOBILE,CAR,
pattern 2 = TV,BIKE,BIKE,MOBILE


WITH PRDT_ORDER AS
(SELECT 
BP.*
,LEAD(PRODUCT) OVER (PARTITION BY CUSTOMER ORDER BY BUY_DT) NEXT_PRDT
,LEAD(PRODUCT,2) OVER (PARTITION BY CUSTOMER ORDER BY BUY_DT) SECOND_NEXT_PRDT
,LEAD(PRODUCT,3) OVER (PARTITION BY CUSTOMER ORDER BY BUY_DT) THIRD_NEXT_PRDT
FROM BUYING_PATTERN BP
)
SELECT CUSTOMER,BUY_DT,PRODUCT FROM PRDT_ORDER 
WHERE 
PRODUCT='TV'
AND NEXT_PRDT = 'BIKE'
AND SECOND_NEXT_PRDT = 'MOBILE'
AND THIRD_NEXT_PRDT = 'CAR'
union
SELECT CUSTOMER,BUY_DT,PRODUCT FROM PRDT_ORDER 
WHERE 
PRODUCT='TV'
AND NEXT_PRDT = 'BIKE'
AND SECOND_NEXT_PRDT = 'BIKE'
AND THIRD_NEXT_PRDT = 'MOBILE';

USTOMER         BUY_DT    PRODUCT
----------- -----------------------------------  ---------------------
LINDA       15-JUN-16 07.51.48.812559 AM           TV
MARK       15-JUN-16 07.51.48.812559 AM           TV


Scenario 3 - Pattern 3 - Data Format = CSV

       Pattern 3 is some like when you think sky is the limit i.e really not sure how many of them could repeat within a pattern. For an example - All 1,2,3 from the below picture are of same pattern XYZA


with tbl_reg as
(
select 'Linda' as customer, 'TV,BIKE,MOBILE,CAR' as Buying_Pattern from dual
union
select 'Lorena' as customer, 'TV,BIKE,MOBILE' as Buying_Pattern from dual
union
select 'Mark' as customer, 'TV,BIKE,BIKE,MOBILE' as Buying_Pattern from dual
union
select 'RAJINI' as customer, 'TV,TV,BIKE,BIKE,BIKE,MOBILE,CAR' as Buying_Pattern from dual
)
SELECT customer, buying_pattern  FROM TBL_REG WHERE REGEXP_LIKE(REPLACE(Buying_Pattern,','),'(TV(.*)BIKE(.*)MOBILE(.*)CAR(.*))');

CUSTOM BUYING_PATTERN
------ -------------------------------
Linda  TV,BIKE,MOBILE,CAR
RAJINI TV,TV,BIKE,BIKE,BIKE,MOBILE,CAR


Scenario 3 - Pattern 3 -  Data Format = ROWS

                  We have two sets of data, one has a fixed pattern while other has got a subset, ie, Looking out for buying pattern, matching the order - "TV(.*) BIKE(.*) MOBILE CAR". Both 1 and 2 in the below picture belong to the same pattern "TV BIKE MOBILE CAR"
























               Untill 11g if this has to be written in a single SQL, it becomes pretty tough to have it in a single optimal SQL. Oracle 12c - Pattern Matching has got the answer, when you want to perform a pattern recognition in a sequence of rows - it is MATCH_RECOGNIZE

Here it is
SELECT *
FROM BUYING_PATTERN
MATCH_RECOGNIZE ( PARTITION BY CUSTOMER ORDER BY BUY_DT
  MEASURES
    FIRST(A.PRODUCT) first_ITEM,
    FIRST(A.BUY_DT) first_ITEM_DT,
    NVL(LAST(C.PRODUCT),LAST(B.PRODUCT)) LAST_ITEM_BEFORE_X,
    LAST(D.PRODUCT) LAST_ITEM,
    LAST(D.BUY_DT) LAST_ITEM_DT
    ,COUNT(*) CNT    
  one ROW PER MATCH   
  PATTERN (A* B* C D)
  DEFINE   
  A AS A.PRODUCT='TV',
  B AS B.PRODUCT='BIKE',
  C AS C.PRODUCT='MOBILE',
  D AS D.PRODUCT='CAR'
);


CUSTOMER    FIRST_ITEM    FIRST_ITEM_DT     LAST_ITEM_BEFORE_X     LAST_ITEM LAST_ITEM_DT                CNT
----------- ------------------ -------------------------  ------------------   -------------- -------------------------------- ----
LINDA     TV      15-JUN-16 07.51.48.812559 AM     MOBILE     CAR  15-JUN-16 07.54.48.812559 AM 4
RAJINI     TV      15-JUN-16 03.22.24.354595 PM     MOBILE     CAR  15-JUN-16 03.25.24.354595 PM 7


Once we know why it is required, where we can apply,  we are half way through in Pattern Matching - MATCH_RECOGNIZE. Rest to follow in upcoming post.