Query Generation

John Mount

2019-09-15

This is a very brief place-holder example (not executed, as SQLite does not have the needed window functions). For more details please see the fuller note: rquery README.

The primary purpose of rquery is SQL query generation. We demonstrate this below.

Note: in examples we use rq_copy_to() to create data. This is only for the purpose of having easy portable examples. With big data the data is usually already in the remote database or Spark system. The task is almost always to connect and work with this pre-existing remote data and the method to do this is db_td(), which builds a reference to a remote table given the table name.

# produce a hande to existing table
d <- db_td(my_db, "d")

scale <- 0.237

dq <- d %.>%
  extend(.,
         probability :=
           exp(assessmentTotal * scale)/
           sum(exp(assessmentTotal * scale)),
         count := count(1),
         partitionby = 'subjectID') %.>%
  extend(.,
         rank := rank(),
         partitionby = 'subjectID',
         orderby = c('probability', 'surveyCategory'))  %.>%
  rename_columns(., 'diagnosis' := 'surveyCategory') %.>%
  select_rows(., rank == count) %.>%
  select_columns(., c('subjectID', 
                      'diagnosis', 
                      'probability')) %.>%
  orderby(., 'subjectID')

class(my_db)
#> [1] "SQLiteConnection"
#> attr(,"package")
#> [1] "RSQLite"

Presentation format (see also op_diagram()):

  mk_td("`d`", c(
    "subjectID",
    "surveyCategory",
    "assessmentTotal",
    "irrelevantCol1",
    "irrelevantCol2")) %.>%
   extend(.,
    probability := exp(assessmentTotal * 0.237) / sum(exp(assessmentTotal * 0.237)),
    count := count(1),
    partitionby = c('subjectID'),
    orderby = c(),
    reverse = c()) %.>%
   extend(.,
    rank := rank(),
    partitionby = c('subjectID'),
    orderby = c('probability', 'surveyCategory'),
    reverse = c()) %.>%
   rename_columns(.,
    c('diagnosis' = 'surveyCategory')) %.>%
   select_rows(.,
     rank == count) %.>%
   select_columns(., c(
     "subjectID", "diagnosis", "probability")) %.>%
   order_rows(.,
    c('subjectID'),
    reverse = c(),
    limit = NULL)

to_sql() SQL (see also materialize()):

  SELECT * FROM (
   SELECT
    `subjectID`,
    `diagnosis`,
    `probability`
   FROM (
    SELECT * FROM (
     SELECT
      `subjectID` AS `subjectID`,
      `surveyCategory` AS `diagnosis`,
      `probability` AS `probability`,
      `count` AS `count`,
      `rank` AS `rank`
     FROM (
      SELECT
       `subjectID`,
       `surveyCategory`,
       `probability`,
       `count`,
       rank ( ) OVER (  PARTITION BY `subjectID` ORDER BY `probability`, `surveyCategory` ) AS `rank`
      FROM (
       SELECT
        `subjectID`,
        `surveyCategory`,
        exp ( `assessmentTotal` * 0.237 ) / sum ( exp ( `assessmentTotal` * 0.237 ) ) OVER (  PARTITION BY `subjectID` ) AS `probability`,
        count ( 1 ) OVER (  PARTITION BY `subjectID` ) AS `count`
       FROM (
        SELECT
         `subjectID`,
         `surveyCategory`,
         `assessmentTotal`
        FROM
         `d` LIMIT 1000
        ) tsql_21099240144076170614_0000000000
       ) tsql_21099240144076170614_0000000001
     ) tsql_21099240144076170614_0000000002
    ) tsql_21099240144076170614_0000000003
    WHERE `rank` = `count`
   ) tsql_21099240144076170614_0000000004
  ) tsql_21099240144076170614_0000000005 ORDER BY `subjectID`