# Association Rule Mining

# Market Basket Analysis

Association rule mining was first applied to Market Basket Analysis (MBA).

  • The goal of MBA is to find associations (affinities) among groups of items occurring in a transactional database

    • has its roots in the analysis of point-of-sale data, as in supermarkets
    • but has found applications in many other areas
  • Association Rule Discovery

    • the most common type of MBA technique
    • Find all rules that associate the presence of one set of items with that of another set of items.
    • Example: 98% of people who purchase tires and auto accessories also get automotive services done
    • We are interested in rules that are
      • non-trivial (and possibly unexpected)
      • actionable
      • easily explainable

# What Is Association Mining?

  • Association rule mining searches for relationships between items in a data set:
    • Finding association, correlation, or causal structures among sets of items or objects in transaction databases, relational databases, etc.

Rather than measuring association with a single value, we formulate a rule that states what the cause is and what the result is.
Every rule has a left-hand side and a right-hand side: the left is the cause (the body) and the right is the result (the head). Each rule is followed by two numbers that describe it, its support and its confidence.

  • Rule form:

    • Body → Head [support, confidence]
    • Body and Head can be represented as sets of items or as predicates
  • Examples:

    • {diaper, milk, Thursday} → {beer} [0.5%, 78%]
    • buys(x, "bread") → buys(x, "milk") [0.6%, 65%]
    • major(x, "CS") ∧ takes(x, "DB") → grade(x, "A") [1%, 75%]
    • age(X, 30-45) ∧ income(X, 50K-75K) → buys(X, SUVcar)
    • age="30-45", income="50K-75K" → car="SUV"

It can be considered an unsupervised learning process, because we do not know in advance what kind of patterns we will find.

# Different Kinds of Association Rules

Boolean vs. Quantitative
  • associations on discrete and categorical data vs. continuous data
Single vs. Multiple Dimensions
  • one predicate = single dimension; multiple predicates = multiple dimensions
  • buys(x, "milk") → buys(x, "butter")
  • age(X, 30-45) ∧ income(X, 50K-75K) → buys(X, SUVcar)
Single-level vs. multiple-level analysis
  • based on the level of abstraction involved
  • buys(x, "bread") → buys(x, "milk")
  • buys(x, "wheat bread") → buys(x, "2% milk")
Simple vs. constraint-based
  • constraints can be added on the rules to be discovered

# Basic Concepts

  • We start with a set $I$ of items and a set $D$ of transactions

    • $I = \{ i_{1}, i_{2}, \dots , i_{m} \}$
    • $D$ is all of the transactions relevant to the mining task
  • A transaction $T$ is a set of items (a subset of $I$): $T \subseteq I$

  • An association rule is an implication on itemsets $X$ and $Y$, denoted by $X \rightarrow Y$, where

    $X \subseteq I, \quad Y \subseteq I, \quad X \cap Y = \varnothing$

  • The rule meets a minimum confidence of $c$, meaning that at least $c$% of the transactions in $D$ which contain $X$ also contain $Y$

    $|X \cup Y| / |X| \geq c$

    Here $|Z|$ denotes the number of transactions in $D$ that contain every item of $Z$.

  • In addition, a minimum support of $s$ is satisfied

    $|X \cup Y| / |D| \geq s$

# Support and Confidence

  • Find all the rules $X \rightarrow Y$ that satisfy minimum confidence and minimum support
Support
  • the probability that a transaction contains $\{X, Y\}$
  • i.e., the ratio of transactions in which $X$ and $Y$ occur together to all transactions
Confidence
  • the conditional probability that a transaction having $X$ also contains $Y$
  • i.e., the ratio of transactions in which $X$ and $Y$ occur together to those in which $X$ occurs
  • In general, the confidence of a rule LHS → RHS can be computed as the support of the whole itemset divided by the support of LHS:

    $\text{Confidence}(\text{LHS} \Rightarrow \text{RHS}) = \text{Support}(\text{LHS} \cup \text{RHS}) / \text{Support}(\text{LHS})$

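# Example

| Transaction ID | Items Bought |
| --- | --- |
| 1001 | A, B, C |
| 1002 | A, C |
| 1003 | A, D |
| 1004 | B, E, F |
| 1005 | A, D, F |

Itemset {A, C} has a support of 2/5 = 40%

Only transactions 1001 and 1002 contain both A and C, out of 5 transactions in total.

Rule {A} → {C} has confidence of 50%

The confidence of {A} → {C} is the probability of C given A.
First find the transactions that contain A: 1001, 1002, 1003 and 1005, four in total.
Of these, two (1001 and 1002) also contain C, so the confidence is 2/4 = 50%.

Rule {C} → {A} has confidence of 100%

Conversely, the confidence of {C} → {A} is the probability of A given C.
The transactions that contain C are 1001 and 1002, two in total.
Both of them contain A, so the confidence is 2/2 = 100%.

Support for {A, C, E}?

No transaction contains A, C and E together, so support = 0.

Support for {A, D, F}?

Only transaction 1005 contains A, D and F together, so support = 1/5 = 20%.

Confidence for {A, D} → {F}?

The confidence of {A, D} → {F} is the probability of F given both A and D.
The transactions that contain both A and D are 1003 and 1005, two in total.
Of these, only 1005 contains F, so the confidence is 1/2 = 50%.

Confidence for {A} → {D, F}?

The confidence of {A} → {D, F} is the probability that a transaction contains both D and F given that it contains A.
The transactions that contain A are 1001, 1002, 1003 and 1005, four in total.
Of these, only 1005 contains both D and F, so the confidence is 1/4 = 25%.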
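The hand calculations above can be checked with a few lines of Python. This is only a minimal sketch: the helper names `support` and `confidence` are chosen here for illustration, and the sketch assumes the left-hand side occurs in at least one transaction (otherwise the division fails).

```python
# The five transactions from the example table.
transactions = {
    1001: {"A", "B", "C"},
    1002: {"A", "C"},
    1003: {"A", "D"},
    1004: {"B", "E", "F"},
    1005: {"A", "D", "F"},
}

def support(itemset, db=transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in db.values() if itemset <= items)
    return hits / len(db)

def confidence(lhs, rhs, db=transactions):
    """Support of LHS ∪ RHS divided by the support of LHS."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(round(support({"A", "C"}), 2))            # 0.4
print(round(confidence({"A"}, {"C"}), 2))       # 0.5
print(round(confidence({"C"}, {"A"}), 2))       # 1.0
print(round(support({"A", "C", "E"}), 2))       # 0.0
print(round(confidence({"A", "D"}, {"F"}), 2))  # 0.5
print(round(confidence({"A"}, {"D", "F"}), 2))  # 0.25
```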

# Improvement (Lift)

  • High-confidence rules are not necessarily useful

    Confidence should not be looked at in isolation while ignoring support; both matter.

    • what if the confidence of {A, B} → {C} is less than $\Pr(C)$?
    • improvement gives the predictive power of a rule compared to just random chance:

      $\text{improvement} = \frac{\Pr(\text{result} \mid \text{condition})}{\Pr(\text{result})} = \frac{\text{confidence}(\text{rule})}{\text{support}(\text{result})}$

      Lift value = confidence of the rule divided by the support of the result. A value above 1 means the condition makes the result more likely than chance; a value below 1 means it makes the result less likely.

Using the five transactions from the example above:

Itemset {A} has a support of 4/5
Rule {C} → {A} has confidence of 2/2

Dividing the confidence by the support of the result gives the improvement:
Improvement = (2/2) / (4/5) = 5/4 = 1.25

Itemset {A} has a support of 4/5
Rule {B} → {A} has confidence of 1/2

Again dividing the confidence by the support of the result:
Improvement = (1/2) / (4/5) = 5/8 = 0.625
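The two improvement values can be reproduced with a short Python sketch. The function names `support` and `lift` are illustrative choices rather than part of any library, and the data is the five-transaction example used earlier.

```python
# Items bought in transactions 1001..1005 of the earlier example.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
    {"A", "D", "F"},
]

def support(itemset):
    """Fraction of transactions containing all items of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(lhs, rhs):
    """Improvement = confidence(lhs → rhs) / support(rhs)."""
    conf = support(lhs | rhs) / support(lhs)
    return conf / support(rhs)

print(round(lift({"C"}, {"A"}), 3))  # 1.25
print(round(lift({"B"}, {"A"}), 3))  # 0.625
```

Here {C} → {A} does better than chance, while {B} → {A} does worse than simply guessing that A is in the basket.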

# Steps in Association Rule Discovery

# Find the frequent itemsets

Frequent itemsets
  • are the sets of items that have at least the minimum support
  • a subset of a frequent itemset must also be a frequent itemset
    • if {A, B} is a frequent itemset, both {A} and {B} are frequent itemsets
    • this also means that if an itemset doesn't satisfy minimum support, none of its supersets will either (this is essential for pruning the search space)

# Apriori Algorithm: Find Frequent Itemsets

  • $C_{k}$: candidate itemsets of size $k$

  • $L_{k}$: frequent itemsets of size $k$

    L_1 = {frequent items};
    for (k = 1; L_k != ∅; k++) do begin    // loop starting from k = 1
        C_{k+1} = candidates generated from L_k;
        for each transaction t in database do
            increment the count of all candidates in C_{k+1} that are contained in t
        L_{k+1} = candidates in C_{k+1} with min_support
    end
    return ∪_k L_k;

Join Step:
  • $C_{k}$ is generated by joining $L_{k-1}$ with itself
Prune Step:
  • Any $(k-1)$-itemset that is not frequent cannot be a subset of a frequent $k$-itemset

# Example of Generating Candidates

  • $L_{3}$ = {abc, abd, acd, ace, bcd}

  • Self-joining: $L_{3} \times L_{3}$

    • abcd from abc and abd
    • acde from acd and ace
  • Pruning:

    • acde is removed because ade is not in $L_{3}$
  • $C_{4}$ = {abcd}
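The join and prune steps can be sketched in Python as follows. This is one possible implementation, not the only one: the function name `generate_candidates` is illustrative, itemsets are assumed to be stored as sorted tuples, and the join uses the common convention of merging two itemsets that agree on everything but their last item.

```python
from itertools import combinations

def generate_candidates(frequent_prev):
    """Build C_k from L_{k-1}; itemsets are sorted tuples of items."""
    prev = sorted(frequent_prev)
    if not prev:
        return []
    prev_set = set(prev)
    k = len(prev[0]) + 1
    candidates = []
    for a, b in combinations(prev, 2):
        # Join step: merge two (k-1)-itemsets that share their first k-2 items.
        if a[:-1] == b[:-1]:
            candidate = a + (b[-1],)
            # Prune step: drop the candidate if any (k-1)-subset is not frequent.
            if all(sub in prev_set for sub in combinations(candidate, k - 1)):
                candidates.append(candidate)
    return candidates

L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]
print(generate_candidates(L3))  # [('a', 'b', 'c', 'd')]
```

The join produces abcd and acde, and the prune step discards acde because its subset ade (and also cde) is missing from $L_{3}$.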

# Apriori Algorithm - An Example

Assume minimum support = 2 (an absolute number of transactions)

Database D contains 5 distinct items {1, 2, 3, 4, 5}; these form $C_{1}$.
Step 1: compute the support of each candidate in $C_{1}$. For example, item 1 occurs 2 times in database D, item 2 occurs 3 times, and so on.
With the assumed minimum support of 2, candidates whose support is below 2 are removed. This removes {4} and gives $L_{1}$ = {1, 2, 3, 5}, the frequent itemsets of size 1.

Step 2: combine the items in $L_{1}$ into new sets of size 2, which gives $C_{2}$.
Compute the support of each candidate in $C_{2}$ and remove those with support below 2 to obtain $L_{2}$, the frequent itemsets of size 2.

Step 3: combine the itemsets in $L_{2}$ to look for itemsets of size 3, which gives $C_{3}$.
In $C_{2}$, {1,2} already failed to be frequent, so {1,2,3} should not appear in $C_{3}$; likewise {1,5} is not frequent, so neither {1,2,5} nor {1,3,5} can be.
Computing the support then gives $L_{3}$.

Next comes $k$ = 4. Here the itemsets in $L_{3}$ would have to be combined into itemsets of size 4, but there are not enough itemsets left to do so, and the algorithm terminates.

If an itemset does not satisfy minimum support, none of its supersets will either.

The final "frequent" itemsets are those remaining in $L_{2}$ and $L_{3}$.

However, {2,3}, {2,5}, and {3,5} are all contained in the larger itemset {2,3,5}.
Thus, the final group of itemsets reported by Apriori is {1,3} and {2,3,5}.
These are the only itemsets from which we will generate association rules.
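The whole walk-through can be traced with a compact Python sketch. Note the main assumption: the table for database D is not reproduced in these notes, so the four transactions below are a stand-in chosen to agree with the counts quoted above (item 1 occurs 2 times, item 2 occurs 3 times, item 4 is infrequent). The names `count`, `apriori` and `maximal` are illustrative.

```python
from itertools import combinations

# ASSUMPTION: stand-in for database D, consistent with the counts in the text.
database = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
MIN_SUPPORT = 2  # absolute number of transactions

def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in database if set(itemset) <= t)

def apriori():
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    items = sorted({i for t in database for i in t})
    level = [(i,) for i in items if count((i,)) >= MIN_SUPPORT]  # L1
    frequent = {}
    while level:
        frequent.update({s: count(s) for s in level})
        prev = set(level)
        k = len(level[0]) + 1
        candidates = []
        for a, b in combinations(sorted(level), 2):
            if a[:-1] == b[:-1]:  # join step
                c = a + (b[-1],)
                # prune step: all (k-1)-subsets must be frequent
                if all(sub in prev for sub in combinations(c, k - 1)):
                    candidates.append(c)
        level = [c for c in candidates if count(c) >= MIN_SUPPORT]
    return frequent

frequent = apriori()
# Keep only itemsets that are not contained in a larger frequent itemset.
maximal = sorted(s for s in frequent if not any(set(s) < set(o) for o in frequent))
print(maximal)  # [(1, 3), (2, 3, 5)]
```

Because candidates are built by joining frequent itemsets, sets such as {1,2,3} are never generated in the first place, which has the same effect as the pruning described above.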

# Use the frequent itemsets to generate association rules

  • Only strong association rules are generated
  • Frequent itemsets satisfy the minimum support threshold
  • Strong rules are those that satisfy the minimum confidence threshold

$\operatorname{confidence}(A \rightarrow B) = \Pr(B \mid A) = \frac{\operatorname{support}(A \cup B)}{\operatorname{support}(A)}$

    for each frequent itemset f do
        for every non-empty proper subset s of f do
            if support(f) / support(s) ≥ min_confidence then
                output rule s → (f - s)
    end

# Example Continued

  • Itemsets: {1,3} and {2,3,5}

  • Recall that the confidence of a rule LHS → RHS is the support of the whole itemset (i.e. $LHS \cup RHS$) divided by the support of LHS.

Candidate rules for {1,3}:

| Rule | Conf. |
| --- | --- |
| {1}→{3} | 2/2 = 1.00 |
| {3}→{1} | 2/3 = 0.67 |

Candidate rules for {2,3,5}:

| Rule | Conf. | Rule | Conf. |
| --- | --- | --- | --- |
| {2,3}→{5} | 2/2 = 1.00 | {2}→{5} | 3/3 = 1.00 |
| {2,5}→{3} | 2/3 = 0.67 | {2}→{3} | 2/3 = 0.67 |
| {3,5}→{2} | 2/2 = 1.00 | {3}→{2} | 2/3 = 0.67 |
| {2}→{3,5} | 2/3 = 0.67 | {3}→{5} | 2/3 = 0.67 |
| {3}→{2,5} | 2/3 = 0.67 | {5}→{2} | 3/3 = 1.00 |
| {5}→{2,3} | 2/3 = 0.67 | {5}→{3} | 2/3 = 0.67 |

Assuming a min. confidence of 75%, the final set of rules reported by Apriori is: {1}→{3}, {2,3}→{5}, {3,5}→{2}, {5}→{2} and {2}→{5}.

If too few rules are found, it is advisable to lower the minimum support first rather than lowering the minimum confidence from the start.
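The candidate-rule tables can be regenerated with a short Python sketch that enumerates every rule whose items lie within a reported itemset and keeps those meeting the confidence threshold. As before, the database below is an assumed stand-in (the original table for D is not reproduced in these notes) chosen to match the support counts used in the confidences above, and `candidate_rules` is an illustrative name.

```python
from itertools import combinations

# ASSUMPTION: stand-in for database D, consistent with the counts in the text.
database = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in database if set(itemset) <= t)

def candidate_rules(itemset):
    """Yield (lhs, rhs, confidence) for all rules over sub-itemsets of size >= 2."""
    items = sorted(itemset)
    for size in range(2, len(items) + 1):
        for whole in combinations(items, size):
            for lhs_size in range(1, size):
                for lhs in combinations(whole, lhs_size):
                    rhs = tuple(i for i in whole if i not in lhs)
                    yield lhs, rhs, count(whole) / count(lhs)

MIN_CONFIDENCE = 0.75
strong = [
    rule
    for f in [(1, 3), (2, 3, 5)]
    for rule in candidate_rules(f)
    if rule[2] >= MIN_CONFIDENCE
]
for lhs, rhs, conf in strong:
    print(set(lhs), "→", set(rhs), f"{conf:.2f}")
```

With these inputs the sketch lists every rule whose confidence in the tables is 1.00, since no candidate falls between 0.67 and 1.00.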

# Extension

# Multiple-Level Rules

  • Items often form a hierarchy
    • Items at a lower level are expected to have lower support
    • Rules regarding itemsets at appropriate levels could be quite useful
    • The transaction database can be encoded based on dimensions and levels

  • Pros: finds finer-grained rules
  • Cons: support may be low

To keep such rules discoverable, the minimum support requirement can be lowered appropriately, for example from 50% to 30%.

# Quantitative Rules

  • Handling quantitative rules may require mapping the continuous variables into Boolean or categorical ones
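As a rough illustration of such a mapping, the sketch below turns numeric fields into categorical items that an ordinary itemset miner can handle. The bin boundaries reuse the age 30-45 and income 50K-75K ranges from the rule examples earlier; the field names, the records, and the `bucket` and `to_items` helpers are hypothetical.

```python
def bucket(value, low, high, name):
    """Map a number to a categorical item such as 'age=30-45' or 'age=other'."""
    label = f"{low}-{high}" if low <= value <= high else "other"
    return f"{name}={label}"

def to_items(record):
    """Convert one record with continuous fields into a set of items."""
    return {
        bucket(record["age"], 30, 45, "age"),
        bucket(record["income_k"], 50, 75, "income_k"),
        f"car={record['car']}",
    }

# Hypothetical customer records (income in thousands).
records = [
    {"age": 38, "income_k": 62, "car": "SUV"},
    {"age": 24, "income_k": 41, "car": "sedan"},
]
for r in records:
    print(sorted(to_items(r)))
```

In practice the choice of bins matters a great deal: bins that are too narrow give low support, while bins that are too wide blur the rule.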

# Web Mining

# What is Web Mining

  • From its very beginning, the potential of extracting valuable knowledge from the Web has been quite evident

  • Web mining is the collection of technologies to fulfill this potential.

Web Mining
  • the application of data mining and machine learning techniques to extract useful knowledge from the content, structure, and usage of Web resources.

# Types of Web Mining

Web mining divides into Web Content Mining, Web Usage Mining, and Web Structure Mining. Typical applications of each:

Web Content Mining
  • document clustering or categorization
  • topic identification / tracking
  • concept discovery
  • focused crawling
  • content-based personalization
  • intelligent search tools
Web Usage Mining
  • user and customer behavior modeling
  • Web site optimization
  • e-customer relationship management
  • Web marketing
  • targeted advertising
  • recommender systems
Web Structure Mining
  • document retrieval and ranking (e.g., Google)
  • discovery of "hubs" and "authorities"
  • discovery of Web communities
  • social network analysis

# Web Logs

# Usage Data Preprocessing

  • Data Cleaning
  • User/Session Identification
  • Page View Identification
  • Path Completion

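Example

| # | IP | Time | URL | Referrer | Agent |
| --- | --- | --- | --- | --- | --- |
| 1 | www.aol.com | 08:30:00 | A | # | Mozilla/5.0; Win NT |
| 2 | www.aol.com | 08:30:01 | B | E | Mozilla/5.0; Win NT |
| 3 | www.aol.com | 08:30:01 | C | B | Mozilla/5.0; Win NT |
| 4 | www.aol.com | 08:30:02 | B | # | Mozilla/5.0; Win 95 |
| 5 | www.aol.com | 08:30:03 | C | B | Mozilla/5.0; Win 95 |
| 6 | www.aol.com | 08:30:04 | F | # | Mozilla/5.0; Win 95 |
| 7 | www.aol.com | 08:30:04 | B | A | Mozilla/5.0; Win NT |
| 8 | www.aol.com | 08:30:05 | G | B | Mozilla/5.0; Win NT |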
# Two major challenges in Preprocessing

Identification of Users
  • Log data mix information about users and transactions
  • Sometimes a user may not log in to the system
Identification of Sessions
  • A user may visit the same site several times
  • A user may leave the computer for a while
  • A user may have different intents in different sessions

In shopping, a transaction contains items; on the Web, a session records the Web pages that were visited.

# Mechanisms for User Identification

| Method | Description | Privacy Concern | Advantages | Disadvantages |
| --- | --- | --- | --- | --- |
| IP Address & Agent | Assume each unique IP address/Agent pair is a unique user. | Low | Always available. No additional technology required. | Not guaranteed to be unique. Defeated by random or rotating IP. |
| Embedded Session ID | Use dynamically generated pages to insert an ID into every link. | Low / Medium | Always available. Independent of IP address. | No concept of a repeat visit. Requires a fully dynamic site. |
| Registration | Users explicitly sign in to the site. | Medium | Can track single individuals, not just browsers. | Not all users may be willing to register. |
| Cookie | Save an identifier on the client machine. | Medium / High | Can track repeat visits. | Can be disabled. Negative public image. |
| Software Agent | Program loaded into the browser that sends back usage data. | High | Accurate usage data for a single Web site. | Likely to be refused. Negative public image. |
| Modified Browser | Browser records usage data. | Very High | Accurate usage data across the entire Web. | Users must explicitly ask for the software. |

# Sessionization Heuristics

# Time-Oriented Heuristics

  • h1:

    • Total session duration may not exceed a threshold $\theta$.
    • Given $t_{0}$, the timestamp for the first request in a constructed session $S$, the request with timestamp $t$ is assigned to $S$ iff $t - t_{0} \le \theta$
  • h2:

    • Total time spent on a page may not exceed a threshold $\delta$.
    • Given $t_{1}$, the timestamp for a request assigned to constructed session $S$, the next request with timestamp $t_{2}$ is assigned to $S$ iff $t_{2} - t_{1} \le \delta$
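Both time-oriented heuristics amount to a single pass over one user's sorted timestamps. The sketch below is a minimal illustration: the function names are chosen here, timestamps are plain numbers in minutes, and the sample values are hypothetical.

```python
def sessionize_h1(timestamps, theta):
    """h1: a request joins the current session iff t - t0 <= theta,
    where t0 is the first timestamp of that session."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][0] <= theta:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

def sessionize_h2(timestamps, delta):
    """h2: a request joins the current session iff t2 - t1 <= delta,
    where t1 is the previous request in that session."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= delta:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

# Hypothetical request times of one user, in minutes.
times = [1, 9, 19, 25, 75, 86, 90, 96]
print(sessionize_h1(times, 30))  # [[1, 9, 19, 25], [75, 86, 90, 96]]
print(sessionize_h2(times, 10))  # [[1, 9, 19, 25], [75], [86, 90, 96]]
```

The same requests split differently under the two heuristics: h1 only looks at the distance from the session start, while h2 breaks the session at every gap longer than $\delta$.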

# Referrer-Based Heuristic

  • href:
    • Given two consecutive requests $p$ and $q$, with $p$ belonging to constructed session $S$.
    • Then $q$ is assigned to $S$ if the referrer for $q$ was previously invoked in $S$

Note: in practice, it is often useful to use a combination of time- and navigation-oriented heuristics in session identification.

Referrer-Based Heuristic: example

| # | IP | Time | URL | Referrer | Agent |
| --- | --- | --- | --- | --- | --- |
| 1 | www.aol.com | 08:30:00 | A | # | Mozilla/5.0; Win NT |
| 2 | www.aol.com | 08:30:01 | B | E | Mozilla/5.0; Win NT |
| 3 | www.aol.com | 08:30:01 | C | B | Mozilla/5.0; Win NT |
| 4 | www.aol.com | 08:30:02 | B | # | Mozilla/5.0; Win 95 |
| 5 | www.aol.com | 08:30:03 | C | B | Mozilla/5.0; Win 95 |
| 6 | www.aol.com | 08:30:04 | F | # | Mozilla/5.0; Win 95 |
| 7 | www.aol.com | 08:30:04 | B | A | Mozilla/5.0; Win NT |
| 8 | www.aol.com | 08:30:05 | G | B | Mozilla/5.0; Win NT |

  • Identified Sessions:
    • $S_{1}$: # → A → B → G, from requests 1, 7, 8
    • $S_{2}$: E → B → C, from requests 2, 3
    • $S_{3}$: # → B → C, from requests 4, 5
    • $S_{4}$: # → F, from request 6
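The four sessions can be reproduced with a small Python sketch of the referrer heuristic. Two points are assumptions made here rather than statements from the notes: several sessions are kept open at once, and when the referrer occurs in more than one of them the request goes to the session that was extended most recently. A request whose referrer is `#`, or whose referrer has not been seen in any session, starts a new session.

```python
# (request number, URL, referrer) from the example log; "#" means no referrer.
log = [
    (1, "A", "#"), (2, "B", "E"), (3, "C", "B"), (4, "B", "#"),
    (5, "C", "B"), (6, "F", "#"), (7, "B", "A"), (8, "G", "B"),
]

def sessionize_by_referrer(requests):
    """Group request numbers into sessions using the referrer field."""
    sessions = []  # each session: {"ids": [...], "pages": [...]}
    for req_id, url, referrer in requests:
        matches = [s for s in sessions if referrer in s["pages"]]
        if referrer != "#" and matches:
            # ASSUMPTION: prefer the most recently extended matching session.
            session = max(matches, key=lambda s: s["ids"][-1])
        else:
            session = {"ids": [], "pages": []}
            sessions.append(session)
        session["ids"].append(req_id)
        session["pages"].append(url)
    return [s["ids"] for s in sessions]

print(sessionize_by_referrer(log))  # [[1, 7, 8], [2, 3], [4, 5], [6]]
```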

  • Path Completion

    • User's actual navigation path: A → B → D → E → D → B → C
    • What the server log shows:

      | URL | Referrer |
      | --- | --- |
      | A | -- |
      | B | A |
      | D | B |
      | E | D |
      | C | B |

      The backward steps E → D → B are missing because pages revisited with the browser's back button are typically served from the local cache and never reach the server.
    • Knowledge of the link structure is needed to complete the navigation path.
    • There may be multiple candidates for completing the path. For example, consider the two paths E → D → B → C and E → D → B → A → C.
    • In this case, the referrer field allows us to partially disambiguate.
      But what about E → D → B → A → B → C?
    • One heuristic: always take the path that requires the fewest number of "back" references.
    • The problem gets much more complicated in frame-based sites.
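A minimal sketch of the "fewest back references" heuristic is shown below. It assumes the user only moves backwards with the back button, so the missing pages are found by walking back through the pages already on the path until the referrer is reached; the function name `complete_path` is illustrative, and `None` stands for a missing referrer.

```python
def complete_path(log):
    """log: list of (url, referrer) in request order; returns the completed path."""
    path = []
    for url, referrer in log:
        if path and referrer is not None and path[-1] != referrer:
            # Walk back over earlier pages until the referrer is found.
            backtrack = []
            i = len(path) - 2
            while i >= 0:
                backtrack.append(path[i])
                if path[i] == referrer:
                    break
                i -= 1
            else:
                backtrack = []  # referrer never visited: nothing to infer
            path.extend(backtrack)
        path.append(url)
    return path

server_log = [("A", None), ("B", "A"), ("D", "B"), ("E", "D"), ("C", "B")]
print(" → ".join(complete_path(server_log)))  # A → B → D → E → D → B → C
```

This sketch ignores the site's link structure, so it cannot choose between alternative forward routes; it only fills in the shortest chain of back steps.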

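# Sessionization Example

| Time | IP | URL | Ref | Agent |
| --- | --- | --- | --- | --- |
| 0:01 | 1.2.3.4 | A | - | IE5;Win2k |
| 0:09 | 1.2.3.4 | B | A | IE5;Win2k |
| 0:10 | 2.3.4.5 | C | - | IE4;Win98 |
| 0:12 | 2.3.4.5 | B | C | IE4;Win98 |
| 0:15 | 2.3.4.5 | E | C | IE4;Win98 |
| 0:19 | 1.2.3.4 | C | A | IE5;Win2k |
| 0:22 | 2.3.4.5 | D | B | IE4;Win98 |
| 0:22 | 1.2.3.4 | A | - | IE4;Win98 |
| 0:25 | 1.2.3.4 | E | C | IE5;Win2k |
| 0:25 | 1.2.3.4 | C | A | IE4;Win98 |
| 0:33 | 1.2.3.4 | B | C | IE4;Win98 |
| 0:58 | 1.2.3.4 | D | B | IE4;Win98 |
| 1:10 | 1.2.3.4 | E | D | IE4;Win98 |
| 1:15 | 1.2.3.4 | A | - | IE5;Win2k |
| 1:16 | 1.2.3.4 | C | A | IE5;Win2k |
| 1:17 | 1.2.3.4 | F | C | IE4;Win98 |
| 1:25 | 1.2.3.4 | F | C | IE5;Win2k |
| 1:30 | 1.2.3.4 | B | A | IE5;Win2k |
| 1:36 | 1.2.3.4 | D | B | IE5;Win2k |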

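First, the users must be identified.

  1. Sort users (based on IP + Agent)

User 1.2.3.4 with IE5;Win2k:

| Time | IP | URL | Ref | Agent |
| --- | --- | --- | --- | --- |
| 0:01 | 1.2.3.4 | A | - | IE5;Win2k |
| 0:09 | 1.2.3.4 | B | A | IE5;Win2k |
| 0:19 | 1.2.3.4 | C | A | IE5;Win2k |
| 0:25 | 1.2.3.4 | E | C | IE5;Win2k |
| 1:15 | 1.2.3.4 | A | - | IE5;Win2k |
| 1:26 | 1.2.3.4 | F | C | IE5;Win2k |
| 1:30 | 1.2.3.4 | B | A | IE5;Win2k |
| 1:36 | 1.2.3.4 | D | B | IE5;Win2k |

User 2.3.4.5 with IE4;Win98:

| Time | IP | URL | Ref | Agent |
| --- | --- | --- | --- | --- |
| 0:10 | 2.3.4.5 | C | - | IE4;Win98 |
| 0:12 | 2.3.4.5 | B | C | IE4;Win98 |
| 0:15 | 2.3.4.5 | E | C | IE4;Win98 |
| 0:22 | 2.3.4.5 | D | B | IE4;Win98 |

User 1.2.3.4 with IE4;Win98:

| Time | IP | URL | Ref | Agent |
| --- | --- | --- | --- | --- |
| 0:22 | 1.2.3.4 | A | - | IE4;Win98 |
| 0:25 | 1.2.3.4 | C | A | IE4;Win98 |
| 0:33 | 1.2.3.4 | B | C | IE4;Win98 |
| 0:58 | 1.2.3.4 | D | B | IE4;Win98 |
| 1:10 | 1.2.3.4 | E | D | IE4;Win98 |
| 1:17 | 1.2.3.4 | F | C | IE4;Win98 |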
  2. Sessionize using heuristics

The requests of user 1.2.3.4 with IE5;Win2k split into two sessions.

Session 1:

| Time | IP | URL | Ref | Agent |
| --- | --- | --- | --- | --- |
| 0:01 | 1.2.3.4 | A | - | IE5;Win2k |
| 0:09 | 1.2.3.4 | B | A | IE5;Win2k |
| 0:19 | 1.2.3.4 | C | A | IE5;Win2k |
| 0:25 | 1.2.3.4 | E | C | IE5;Win2k |

Session 2:

| Time | IP | URL | Ref | Agent |
| --- | --- | --- | --- | --- |
| 1:15 | 1.2.3.4 | A | - | IE5;Win2k |
| 1:26 | 1.2.3.4 | F | C | IE5;Win2k |
| 1:30 | 1.2.3.4 | B | A | IE5;Win2k |
| 1:36 | 1.2.3.4 | D | B | IE5;Win2k |

The h1 heuristic (with a timeout of 30 minutes) results in the two sessions given above.

How about the heuristic href?

How about heuristic h2 with a timeout of 10 minutes?

  2. Sessionize using heuristics (another example)

| Time | IP | URL | Ref | Agent |
| --- | --- | --- | --- | --- |
| 0:22 | 1.2.3.4 | A | - | IE4;Win98 |
| 0:25 | 1.2.3.4 | C | A | IE4;Win98 |
| 0:33 | 1.2.3.4 | B | C | IE4;Win98 |
| 0:58 | 1.2.3.4 | D | B | IE4;Win98 |
| 1:10 | 1.2.3.4 | E | D | IE4;Win98 |
| 1:17 | 1.2.3.4 | F | C | IE4;Win98 |

In this case, the referrer-based heuristic results in a single session, while the h1 heuristic (with timeout = 30 minutes) results in two different sessions.

How about heuristic h2 with timeout = 10 minutes?

  3. Perform Path Completion
    Links taken from the referrer field of the session above: A→C, C→B, B→D, D→E, C→F
    Need to look for the shortest backwards path from E to C based on the site topology.
    Note, however, that the elements of the path need to have occurred in the user trail previously.
    The steps E→D, D→B, B→C need to be inserted between D→E and C→F.

# Web Mining by Association Rules

# Market Analysis vs Web Mining

Market Analysis
  • We explore associations among items in transactional databases
  • Items may show up together in different transactions, such as each receipt
Web Mining
  • We can explore the associations among Web pages or behaviors in Web logs
  • Web pages or behaviors may show up together in different sessions

# Web Usage Mining by Association Rules

Web Association Rule Mining
  • The process is similar to association rule mining, but the rule mining is applied per session
  • Examples
    • 60% of clients who accessed /products/ also accessed /products/software/webminer.htm
    • 30% of clients who accessed /special-offer.html placed an online order in /products/software
Web Sequential Mining
  • In association rule mining, the sequence does not matter.
    But on the Web, the sequence plays a key role.
    For example, {A → B → C} → {D} may be very different from {B → A → C} → {D}
  • The process is similar to association rule mining, but you need to consider sequences when you calculate support and confidence values.

Sessions that contain the same pages in a different order cannot be counted together.
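To make the effect of ordering concrete, the sketch below computes support and confidence over sessions while respecting page order. It assumes a pattern matches a session when its pages appear in the same order, though not necessarily consecutively; other definitions (for example, requiring consecutive pages) are equally possible. The sessions and the function names are hypothetical.

```python
def contains_in_order(session, pattern):
    """True if the pages of `pattern` occur in `session` in the same order."""
    remaining = iter(session)
    # `page in remaining` consumes the iterator up to the first match,
    # so each later page must be found after the previous one.
    return all(page in remaining for page in pattern)

def seq_support(sessions, pattern):
    """Fraction of sessions that contain `pattern` as an ordered subsequence."""
    return sum(contains_in_order(s, pattern) for s in sessions) / len(sessions)

def seq_confidence(sessions, body, head):
    """Support of body followed by head, divided by the support of body."""
    return seq_support(sessions, body + head) / seq_support(sessions, body)

# Hypothetical sessions, each a list of visited pages in order.
sessions = [list("ABCD"), list("BACD"), list("ABC"), list("ABCD")]

print(seq_support(sessions, list("ABC")))                     # 0.75
print(round(seq_confidence(sessions, list("ABC"), ["D"]), 2))  # 0.67
print(seq_confidence(sessions, list("BAC"), ["D"]))            # 1.0
```

An order-blind miner would treat all four sessions as containing {A, B, C}, whereas here the session B → A → C → D is counted only for the second pattern.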

# Web Log Data

If you’d like to work on Web mining…

  • NASA Web Logs, http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html
  • Wikipedia Web Logs, http://opensource.indeedeng.io/imhotep/docs/sample-data/
  • MSNBC.com Web Data, http://archive.ics.uci.edu/ml/datasets/MSNBC.com+Anonymous+Web+Data
  • Microsoft Web Data, http://archive.ics.uci.edu/ml/datasets/Anonymous+Microsoft+Web+Data
  • DePaul CTI Web Logs, http://facweb.cs.depaul.edu/mobasher/classes/ect584/lectures/cti-april2003-clean-log.zip